Reverse Engineering Book Stories with Neo4j and GraphAware NLP


A book tells us a story, but for a computer it is a wall of text. How can we use graphs and NLP to help our machines make more sense of a story?

Our example comes from the A Song of Ice and Fire books, also known as Game of Thrones. We converted the e-books (epub) to text files and used a small Python program to split them into chapters, paragraphs, and sentences.
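The splitting program itself is not shown in this post. As a rough idea of what such a step can look like, here is a minimal sketch. The function name split_book, the rule that a chapter starts with an all-capitals line, and the naive sentence splitter are all assumptions made for illustration, not the code we actually ran.

```python
import re


def split_book(text):
    """Split raw book text into chapters -> paragraphs -> sentences."""
    chapters = []
    current = None
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        # Assumption: a chapter heading is a line made only of capitals.
        if re.fullmatch(r"[A-Z][A-Z ]*", block):
            current = {"number": len(chapters) + 1, "title": block, "paragraphs": []}
            chapters.append(current)
        elif current is not None:
            # Naive sentence split: ., ! or ? followed by whitespace.
            sentences = re.split(r"(?<=[.!?])\s+", block)
            current["paragraphs"].append(sentences)
    return chapters


sample = "JAIME\n\nAn east wind blew. It was soft.\n\nHe rode on.\n\nCATELYN\n\nShe waited."
book = split_book(sample)
```

Each chapter dictionary can then be written to the graph as a chapter node with its paragraphs and sentences attached. A real e-book would need a more robust sentence splitter than this regular expression.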

So a book turned into this model:


GraphAware NLP

GraphAware NLP Framework is a project that integrates NLP processing capabilities available in several software packages, such as Stanford NLP and OpenNLP, existing data sources, such as ConceptNet5 and WordNet, and GraphAware’s knowledge about search, graphs, and recommendation engines. GraphAware NLP is developed as a plugin for Neo4j with an external frontend for interacting with a Spark cluster. Through procedures, background processes, and APIs, it provides a set of tools that together form a Domain Specific Language for Natural Language Processing on top of Cypher. The most interesting features are:

  • Information Extraction (IE) - processing textual information for extracting main components and relationships
  • Extracting sentiments
  • Enriching basic data with ontologies and concepts (ConceptNet 5)
  • Computing similarities between text elements in a corpus using base data and ontology information
  • Enriching knowledge using external sources (Alchemy)
  • Providing basic search capabilities
  • Providing complex search capabilities leveraging enriched knowledge like ontology, sentiments, and similarity
  • Providing recommendations based on a combination of content/ontology-based recommendations, social tags, and collaborative filtering
  • Unsupervised corpus clustering using LDA
  • Semi-supervised corpus clustering using Label Propagation
  • Word2Vec computation and importing

You can read more about it on the NLP section of our blog.

Text processing phase

Custom text processing pipeline with custom stopwords

CALL ga.nlp.addPipeline({textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor', name: 'customStopwords', stopWords: '+,i,me,my,myself,we,our,ours,ourselves,you,your,yours,yourself,yourselves,he,him,his,himself,she,her,hers,herself,it,its,itself,they,them,their,theirs,themselves,what,which,who,whom,this,that,these,those,am,is,are,was,were,be,been,being,have,has,had,having,do,does,did,doing,a,an,the,and,but,if,or,because,as,until,while,of,at,by,for,with,about,against,between,into,through,during,before,after,above,below,to,from,up,down,in,out,on,off,over,under,again,further,then,once,here,there,when,where,why,how,all,any,both,each,few,more,most,other,some,such,no,nor,not,only,own,same,so,than,too,very,s,t,can,will,just,don,should,now', threadNumber: 20})

Text extraction, tokenization, lemmatization and named entity recognition

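MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
// Assumed final clause (and direction): link each paragraph to its annotated text
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)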

The two steps above lead us to the following graph:


The original graph and its NLP-processed representation are connected by the HAS_ANNOTATED_TEXT relationship. Every paragraph is split into sentences, and the words of each sentence are extracted to Tag nodes. Optionally, tags carry an additional Named Entity Recognition label such as NER_Person or NER_Location.

First insights from text

Which tags/concepts were found?

// n.ne is assumed to be the property holding the tag's named-entity types
MATCH (n:Tag) WHERE n:NER_Person OR n:NER_Location OR n:NER_Organization RETURN n.value, n.ne, size((n)<-[:HAS_TAG]-()) AS f ORDER BY f DESC
│"n.value"   │"n.ne"      │"f"  │
│"Jon"       │["PERSON"]  │"510"│
│"Jaime"     │["PERSON"]  │"384"│
│"Tyrion"    │["PERSON"]  │"372"│
│"Arya"      │["PERSON"]  │"358"│
│"Robb"      │["PERSON"]  │"299"│
│"Dany"      │["PERSON"]  │"280"│
│"Sam"       │["PERSON"]  │"240"│
│"Davos"     │["PERSON"]  │"229"│
│"Joffrey"   │["PERSON"]  │"210"│
│"Catelyn"   │["PERSON"]  │"206"│
│"Sansa"     │["LOCATION"]│"193"│
│"Cersei"    │["PERSON"]  │"162"│
│"Tywin"     │["PERSON"]  │"161"│
│"Brienne"   │["PERSON"]  │"138"│
│"Robert"    │["PERSON"]  │"126"│
│"Ygritte"   │["PERSON"]  │"101"│
│"Meera"     │["PERSON"]  │"101"│
│"Beric"     │["PERSON"]  │"98" │
│"Jojen"     │["PERSON"]  │"94" │

It is very impressive that on this artificial text, with no real people, locations or organizations, the NLP approach still yields such correct results. (Michael Hunger - Neo4j, Inc).

Which chapter has the most different tags?

WHERE size( > 0
RETURN c.number, c.title, count(*) as f ORDER BY f DESC LIMIT 5
│"c.number"│"c.title" │"f"   │
│"61"      │"TYRION"  │"4453"│
│"5"       │"TYRION"  │"3693"│
│"24"      │"DAENERYS"│"3678"│
│"34"      │"SAMWELL" │"3659"│
│"57"      │"BRAN"    │"3564"│

Making sense of thrones

We can now use this base model to run more advanced (and fun) queries on the data.

First occurrences of persons in the book

MATCH (t:Tag) WHERE t:NER_Person
WITH t, size((t)<-[:HAS_TAG]-()) AS f
WITH t.value as person,s.sentenceNumber as num, para.position as paraNum, c.number as chapterNum, c.title as chapter, s.text as sentence, apoc.text.format('C%03d-P%03d-S%03d',[c.number,para.position,s.sentenceNumber]) as position
WHERE chapterNum > 1
WITH * ORDER BY position ASC 
RETURN person, collect({chapter: chapter, sentence: sentence, position: position})[0]

While the query might look complex, it is just traversing the graph. Sentence nodes contain their position in the paragraph, which makes it easy to retrieve appearances in the right order.

│"person" │"collect({chapter: chapter, se│
│         │ntence: sentence, position: po│
│         │sition})[0]"                  │
│"Jaime"  │{"chapter":"JAIME","sentence":│
│         │"Not that Jaime had ever seen │
│         │her smiling.","position":"C002│
│         │-P003-S002"}                  │
│"Robb"   │{"chapter":"JAIME","sentence":│
│         │"She stood at Robb’s left hand│
│         │ beside the high seat, and for│
│         │ a moment felt almost as if sh│
│         │e were looking down at her own│
│         │ dead, at Bran and Rickon.","p│
│         │osition":"C002-P003-S000"}    │
│"Tyrion" │{"chapter":"CATELYN","sentence│
│         │":"Tyrion felt their eyes on h│
│         │im as he rode past; chilly eye│
│         │s, angry and unsympathetic.","│
│         │position":"C003-P002-S001"}   │
│         │"The moonstones Joffrey gave h│
│         │er.”","position":"C002-P005-S0│
│         │02"}                          │
│"Jon"    │{"chapter":"JAIME","sentence":│
│         │"Jon had never met anyone so s│
│         │tubborn, except maybe for his │
│         │little sister Arya.","position│
│         │":"C002-P005-S001"}           │
│         │"A silence fell across the tor│
│         │chlit hall, and in the quiet C│
│         │atelyn could hear Grey Wind ho│
│         │wling half a castle away.","po│
│         │sition":"C002-P002-S001"}     │
│"Arya"   │{"chapter":"JAIME","sentence":│
│         │"Jon had never met anyone so s│
│         │tubborn, except maybe for his │
│         │little sister Arya.","position│
│         │":"C002-P005-S001"}           │
│"Davos"  │{"chapter":"JAIME","sentence":│
│         │"But Davos could not complain │
│         │of chill.","position":"C002-P0│
│         │04-S000"}                     │
│"Dany"   │{"chapter":"JAIME","sentence":│
│         │"The harpy of Ghis, Dany thoug│
│         │ht.","position":"C002-P003-S00│
│         │0"}                           │
│"Sam"    │{"chapter":"CATELYN","sentence│
│         │":"Sam was trying to feed him │
│         │onion broth, but he could not │
│         │swallow.","position":"C003-P00│
│         │3-S003"}                      │
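The ordering in the query above relies on the zero-padded position key built with apoc.text.format. The following sketch shows, outside the database, why sorting those keys as plain strings gives the reading order. The helper names position_key and first_occurrences and the sample mentions are hypothetical and only serve the illustration.

```python
def position_key(chapter, paragraph, sentence):
    # Same layout as the 'C%03d-P%03d-S%03d' format string in the query.
    return "C%03d-P%03d-S%03d" % (chapter, paragraph, sentence)


def first_occurrences(mentions):
    """mentions: iterable of (person, chapter, paragraph, sentence) tuples."""
    first = {}
    for person, chapter, paragraph, sentence in mentions:
        key = position_key(chapter, paragraph, sentence)
        # Zero-padded keys compare correctly as plain strings.
        if person not in first or key < first[person]:
            first[person] = key
    return first


mentions = [
    ("Jon", 2, 5, 1),
    ("Jon", 10, 1, 0),
    ("Sam", 3, 3, 3),
    ("Sam", 3, 12, 0),
]
result = first_occurrences(mentions)
```

The string comparison only stays correct while every number fits in three digits. A chapter, paragraph or sentence number of 1000 or more would need a wider padding.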

Interaction graph

The idea here is to find persons who interact with another person in the same sentence and to create a new OCCURS_WITH relationship between those entities.

MATCH (a:AnnotatedText)-[:CONTAINS_SENTENCE]->(s:Sentence)-[:SENTENCE_TAG_OCCURRENCE]->(to:TagOccurrence)-[:TAG_OCCURRENCE_TAG]->(tag)
WHERE tag:NER_Person 
WITH a, to, tag
// a.id is assumed to be the id that was passed to ga.nlp.annotate
ORDER BY a.id, to.startPosition
WITH a, collect(tag) as tags
UNWIND range(0, size(tags) - 2) as i
WITH a, tags[i] as tag1, tags[i+1] as tag2 WHERE tag1 <> tag2
MERGE (tag1)-[r:OCCURS_WITH]-(tag2)
ON CREATE SET r.freq = 1
ON MATCH SET r.freq = r.freq + 1


Tag Occurrence nodes are an intermediate representation between a sentence and a tag found in it. Because Tag nodes are unique, this offers the possibility to have extra information about the occurrence of a tag in a specific sentence.
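To make the pairing logic of the OCCURS_WITH query easier to follow, here is a small stand-alone sketch of the same idea: order the person mentions of each annotated text by start position, pair every mention with the next one, skip pairs of the same person, and count undirected pairs. The function name cooccurrences and the sample data are assumptions for illustration. The real computation runs in Cypher as shown above.

```python
from collections import Counter


def cooccurrences(texts):
    """texts: one list per annotated text of (start_position, person) pairs."""
    freq = Counter()
    for occurrences in texts:
        ordered = [person for _, person in sorted(occurrences)]
        # Pair each mention with the following one, like tags[i] and tags[i+1].
        for left, right in zip(ordered, ordered[1:]):
            if left != right:
                # A frozenset ignores direction, like MERGE (a)-[r]-(b).
                freq[frozenset((left, right))] += 1
    return freq


texts = [
    [(0, "Jon"), (12, "Arya"), (20, "Arya"), (31, "Jon")],
    [(4, "Arya"), (9, "Jon")],
]
freq = cooccurrences(texts)
```

The counter plays the role of the freq property: the first time a pair is seen corresponds to ON CREATE SET r.freq = 1, and every later sighting corresponds to ON MATCH SET r.freq = r.freq + 1.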

Find the queen/lord/king/lady

MATCH (n:Tag) WHERE n.value IN ['lord','queen','king','lady']
MATCH (n)<-[:TAG_OCCURRENCE_TAG]-(to)<-[:SENTENCE_TAG_OCCURRENCE]-(s)-[:SENTENCE_TAG_OCCURRENCE]->(to2)-[:TAG_OCCURRENCE_TAG]->(t2) WHERE to2.startPosition = to.endPosition + 1
AND t2:NER_Person
WITH n.value AS v, toLower(t2.value) AS person, count(*) AS f ORDER BY f DESC
RETURN v, collect(person)[0..20]
│"v"    │"collect(person)[0..20]"      │
│"lady" │["lysa","smallwood","catelyn",│
│       │"brienne","frey","leonette","o│
│       │lenna","bulwer","melisandre","│
│       │ashara dayne","roslin","roslin│
│       │ frey","lynesse","jeyne wester│
│       │ling","janna fossoway","jeyne │
│       │swann","joanna","whent","walda│
│       │ frey","shae"]                │
│"king" │["robert","joffrey","robb","ba│
│       │lon","tommen","aerys","stannis│
│       │","daeron","maegor","joffrey b│
│       │aratheon","robert i baratheon"│
│       │,"andahar","renly","aegon v ta│
│       │rgaryen","tristifer","harren t│
│       │he black","harren","baelor","t│
│       │orrhen"]                      │
│       │rgaery","alysanne","myrcella",│
│       │"rhaella"]                    │
│"lord" │["tywin","beric","walder","bol│
│       │ton","tyrell","hoster","petyr"│
│       │,"alester","rickard","eddard",│
│       │"frey","vargo","karstark","ren│
│       │ly","redwyne","mormont","edmur│
│       │e","stannis","davos","tully"] │
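The key condition in the query above is to2.startPosition = to.endPosition + 1, which keeps only person tags that directly follow a title. A minimal sketch of that test on plain token dictionaries is shown below. The token layout (value, start, end, is_person) is hypothetical, and the end offset is assumed to be exclusive, so that a single space separates the title from the name.

```python
TITLES = {"lord", "queen", "king", "lady"}


def titled_persons(sentence_tokens):
    """sentence_tokens: list of dicts with value, start, end and is_person keys."""
    found = []
    by_start = {tok["start"]: tok for tok in sentence_tokens}
    for tok in sentence_tokens:
        if tok["value"] in TITLES:
            # The person must start one character after the title ends.
            nxt = by_start.get(tok["end"] + 1)
            if nxt is not None and nxt["is_person"]:
                found.append((tok["value"], nxt["value"].lower()))
    return found


tokens = [
    {"value": "lord", "start": 0, "end": 4, "is_person": False},
    {"value": "Tywin", "start": 5, "end": 10, "is_person": True},
    {"value": "and", "start": 11, "end": 14, "is_person": False},
    {"value": "king", "start": 15, "end": 19, "is_person": False},
    {"value": "Robert", "start": 20, "end": 26, "is_person": True},
    {"value": "lady", "start": 31, "end": 35, "is_person": False},
    {"value": "smiled", "start": 36, "end": 42, "is_person": False},
]
pairs = titled_persons(tokens)
```

In this sample, "lady" is followed by a token that is not a person, so it produces no pair, just as the AND t2:NER_Person filter would drop it.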

Which organizations and locations were found?

MATCH (n:Tag) WHERE n:NER_Organization OR n:NER_Location
// n.ne is assumed to be the property holding the tag's named-entity types
WITH n.value AS entity, n.ne AS types, size((n)<-[:HAS_TAG]-()) AS f
RETURN types, collect(entity)[0..25] AS entities
│"types"         │"entities"                    │
│["PERSON"]      │["Tyrion","Arya","Robb","Davos│
│                │","Joffrey","Catelyn","Cersei"│
│                │,"Tywin","Brienne","Ygritte","│
│                │Jojen","Edmure","Grenn","Mormo│
│                │nt","Stannis","Gendry","Riverr│
│                │un","Oberyn","Shae","Thoros","│
│                │Bronn","Margaery","Clegane","P│
│                │etyr","Craster"]              │
│["MISC"]        │["Dornish","Valyrian","Eyrie",│
│                │"Thenn","Dywen","Westerosi","T│
│                │argaryen","Greyjoys","Cerwyn",│
│                │"Baratheon","Golden Tooth","Qa│
│                │rtheen","Queen of Thorns","Iro│
│                │n Islands"]                   │
│["ORGANIZATION"]│["King","Castle Black","Tyrell│
│                │","Aegon","House","Salladhor S│
│                │aan","Blackwater","Mace Tyrell│
│                │","Twins","Kraznys","Irri","Fr│
│                │eys","Jinglebell","Pentos","Wh│
│                │itetree","Vale","Hobb","Red Vi│
│                │per","Myr","Ser Addam","Mummer│
│                │s","Acorn Hall","Kingswood Bro│
│                │therhood","Bloody Mummers","Fo│
│                │rd"]                          │
│["LOCATION"]    │["Sansa","Winterfell","Dorne",│
│                │"Highgarden","Casterly Rock","│
│                │Greenbeard","Oldtown","Dragons│
│                │tone","Pylos","Lannister","Tar│
│                │th","Duskendale","Qarth","Gran│
│                │d Maester Pycelle","Willas","W│
│                │hite Harbor","Longclaw","Eastw│
│                │atch","Green Fork","Nightfort"│
│                │,"Braavos","Baelor","Khaleesi"│
│                │,"Tyroshi","Shadow Tower"]    │

Entity merging

Finding names and merging them together for creating person nodes.

This is done in two steps. The first step is to extract name parts into their own nodes and create a hierarchy of name parts:

match (t:Tag:NER_Person)
merge (n1:Name {name:toLower(t.value)}) set n1:FromTag
with n1
unwind split(n1.name,' ') as part
with n1, part where length(part) > 2
merge (n2:Name {name:part}) set n2:Single
with n1,n2 where n1 <> n2
merge (n2)-[:PART_OF]->(n1)

name extraction
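The name-part hierarchy built by the query above can be pictured as a mapping from each part to the full names it belongs to. The sketch below reproduces that logic in plain Python. The function name name_parts and the sample values are made up for illustration.

```python
def name_parts(tag_values):
    """Map each name part to the set of full names it is PART_OF."""
    part_of = {}
    for value in tag_values:
        full = value.lower()
        for part in full.split(" "):
            # Mirror the query: skip parts of 2 characters or fewer,
            # and skip a part that is identical to the full name.
            if len(part) > 2 and part != full:
                part_of.setdefault(part, set()).add(full)
    return part_of


hierarchy = name_parts(["Ser Ryman", "Ser Ryman Frey", "Ryman"])
```

Parts such as "ser" connect a very large number of full names. This is one reason the queries that follow only start from name parts with fewer than 5 PART_OF relationships.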

The second step is to traverse from a single name entity up to its top PART_OF relationship and traverse its own cluster in order to determine the entities to merge together.

Let’s first check what the results will look like:

MATCH (n:Single) WITH n, size( (n)-[:PART_OF]->() ) as degree WHERE degree < 5 
MATCH path = (n)-[:PART_OF*..3]->(m)
WITH n, collect(m) as ms
WHERE size(ms) > 1
UNWIND ms as m
WITH n, m WHERE size([ (m)<-[:PART_OF*..3]-(o)-[:PART_OF*..3]->(e:FromTag) WHERE e IN ms | e]) = size(ms)
WITH n, collect(m.name) as names
return n.name, names, reduce(x = '', name in names | case when size(name) > size(x) then name else x end)

│"n.name"     │"names"                       │"reduce(x = '', name in names │
│             │                              │| case when size(name) > size(│
│             │                              │x) then name else x end)"     │
│"tallad"     │["rowan ser tallad","ser talla│"rowan ser tallad"            │
│             │d"]                           │                              │
│"grace"      │["his grace","his grace king j│"his grace king joffrey"      │
│             │offrey"]                      │                              │
│"walder"     │["walder frey","theon greyjoy │"theon greyjoy walder frey tyw│
│             │walder frey tywin lannister"] │in lannister"                 │
│"kettleblack"│["oswald kettleblack","osmund │"oswald kettleblack"          │
│             │kettleblack"]                 │                              │
│"dolorous"   │["dolorous edd","dolorous edd │"dolorous edd dywen"          │
│             │dywen"]                       │                              │
│"tallhart"   │["ser helman tallhart","helman│"ser helman tallhart"         │
│             │ tallhart"]                   │                              │
│"bronn"      │["ser bronn","philip foote ser│"philip foote ser bronn jalabh│
│             │ bronn jalabhar xho"]         │ar xho"                       │
│"hempen"     │["bodger hempen dan","hempen d│"bodger hempen dan"           │
│             │an"]                          │                              │
│"rhaegar"    │["prince rhaegar","rhaegar pri│"rhaegar prince of dragonstone│
│             │nce of dragonstone"]          │"                             │
│"ryman"      │["ser ryman","ser ryman frey"]│"ser ryman frey"              │
│"mandon"     │["ser mandon moore","ser mando│"ser mandon moore"            │
│             │n"]                           │                              │
│"lorch"      │["ser amory lorch","amory lorc│"ser amory lorch"             │
│             │h"]                           │                              │
│"red"        │["red alyn","dornish dilly red│"dornish dilly red alyn"      │
│             │ alyn"]                       │                              │
│"donal"      │["clydas donal noye","donal no│"clydas donal noye"           │
│             │ye"]                          │                              │
│"thorne"     │["alliser thorne","ser alliser│"ser alliser thorne"          │
│             │ thorne"]                     │                              │
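The reduce expression in the query above simply keeps the longest alias of each cluster. An equivalent fold in Python looks like this sketch, where the function name longest_name is an assumption for illustration:

```python
from functools import reduce


def longest_name(names):
    # Same fold as the Cypher reduce(): keep x unless name is strictly longer.
    return reduce(lambda x, name: name if len(name) > len(x) else x, names, "")


a = longest_name(["ser ryman", "ser ryman frey"])
b = longest_name(["oswald kettleblack", "osmund kettleblack"])
```

Because the comparison is strict, the first of two equally long names wins, which matches the "kettleblack" row of the table. Picking the longest name is a heuristic. Rows such as "theon greyjoy walder frey tywin lannister" show that a badly recognized tag can end up as the canonical name.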

We can now create the Person entities and relate them to their corresponding Tag nodes:

MATCH (n:Single) WITH n, size( (n)-[:PART_OF]->() ) as degree WHERE degree < 5 
MATCH path = (n)-[:PART_OF*..3]->(m)
WITH n, collect(m) as ms
WHERE size(ms) > 1
UNWIND ms as m
WITH n, m WHERE size([ (m)<-[:PART_OF*..3]-(o)-[:PART_OF*..3]->(e:FromTag) WHERE e IN ms | e]) = size(ms)
WITH n, collect(m.name) as names
with n.name as single, names, reduce(x = '', name in names | case when size(name) > size(x) then name else x end) as longestName
MERGE (p:Person {name: longestName})
WITH p, names
UNWIND names as name
MATCH (n:Name {name: name})
MERGE (p)-[:KNOWN_AS]->(n)


And clean the temporary Name nodes



The last step is particularly interesting: it lets us compute the interaction graph again, this time with more precise Person entities than the raw tags used previously.

MATCH (a:AnnotatedText)-[:CONTAINS_SENTENCE]->(s:Sentence)-[:SENTENCE_TAG_OCCURRENCE]->(to:TagOccurrence)-[:TAG_OCCURRENCE_TAG]->(tag)<-[:KNOWN_AS]-(p:Person)
WITH a, to, p
// a.id is assumed to be the id that was passed to ga.nlp.annotate
ORDER BY a.id, to.startPosition
WITH a, collect(p) as persons
UNWIND range(0, size(persons) - 2) as i
WITH a, persons[i] as person1, persons[i+1] as person2 WHERE person1 <> person2
MERGE (person1)-[r:OCCURS_WITH_PERSON]-(person2)
ON CREATE SET r.freq = 1
ON MATCH SET r.freq = r.freq + 1


We’ve made the database available (without the original text) in read-only mode here and a backup can be downloaded from Dropbox here. The Neo4j version used is 3.1.3.

Graph structures are ideal for storing enriched representations of textual data. Combine this with NLP capabilities and a powerful query language for graphs, and you can discover really interesting insights from books.

GraphAware is a Gold Sponsor of GraphConnect New York. Feel free to stop by our booth to learn more about the GraphAware NLP Framework.

Christophe Willemsen

Technology & Infrastructure | Neo4j certification

Christophe Willemsen brings almost 15 years of experience as a Communication Systems Engineer for the Belgian Navy to the table. Highly skilled in software engineering, he is one of the primary authors of Hume and an active participant in the Neo4j community, dedicating his efforts towards driving the development of cutting-edge technologies in the field.