Reverse engineering book stories with Neo4j and GraphAware NLP

July 24, 2017 · 9 min read

A book tells us a story, but for a computer, it is a wall of text. How can we use graphs and NLP to help our machines make more sense of a story?

Our example comes from the A Song of Ice and Fire books, aka Game of Thrones. We converted the e-books (epub) to text files and used a small Python program to split them into chapters, paragraphs, and sentences.

So a book turned into this model :

GraphAware NLP

GraphAware NLP Framework is a project that integrates NLP processing capabilities available in several software packages like Stanford NLP and OpenNLP, existing data sources, such as ConceptNet5 and WordNet, and GraphAware’s knowledge about search, graphs, and Recommendation Engines. GraphAware NLP is developed as a plugin for Neo4j and an external frontend for interacting with a Spark Cluster. It provides a set of tools, by means of procedures, background processes, and APIs, that all together provide a Domain Specific Language for Natural Language Processing on top of Cypher. The available interesting features are:

Information Extraction (IE) – processing textual information for extracting main components and relationships
Extracting sentiments
Enriching basic data with ontologies and concepts (ConceptNet 5)
Computing similarities between text elements in a corpus using base data and ontology information
Enriching knowledge using external sources (Alchemy)
Providing basic search capabilities
Providing complex search capabilities leveraging enriched knowledge like ontology, sentiments, and similarity
Providing recommendations based on a combination of content/ontology-based recommendations, social tags, and collaborative filtering
Unsupervised corpus clustering using LDA
Semi-supervised corpus clustering using Label Propagation
Word2Vec computation and importing

Text processing phase

Custom text processing pipeline with custom stopwords

CALL ga.nlp.addPipeline({textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor', name: 'customStopwords', stopWords: '+,i,me,my,myself,we,our,ours,ourselves,you,your,yours,yourself,yourselves,he,him,his,himself,she,her,hers,herself,it,its,itself,they,them,their,theirs,themselves,what,which,who,whom,this,that,these,those,am,is,are,was,were,be,been,being,have,has,had,having,do,does,did,doing,a,an,the,and,but,if,or,because,as,until,while,of,at,by,for,with,about,against,between,into,through,during,before,after,above,below,to,from,up,down,in,out,on,off,over,under,again,further,then,once,here,there,when,where,why,how,all,any,both,each,few,more,most,other,some,such,no,nor,not,only,own,same,so,than,too,very,s,t,can,will,just,don,should,now', threadNumber: 20})

Text extraction, tokenisation, lemmification and named entity recognition

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

The two above steps lead us to the following graph :

The original graph and its NLP-processed representation are related to the HAS_ANNOTATED_TEXT relationship. Every paragraph is split into sentences, which themselves are extracted to Tag nodes. Optionally, tags have an additional Named Entity Recognition label like NER_Person, NER_Location, etc..

First insights from text

Which tags/concepts were found?

MATCH (n:Tag) WHERE n:NER_Person OR n:NER_Location OR n:NER_Organization RETURN n.value, n.ne, size((n)<-[:HAS_TAG]-()) AS f ORDER BY f DESC

╒════════════╤════════════╤═════╕
│"n.value"   │"n.ne"      │"f"  │
╞════════════╪════════════╪═════╡
│"Jon"       │["PERSON"]  │"510"│
├────────────┼────────────┼─────┤
│"Jaime"     │["PERSON"]  │"384"│
├────────────┼────────────┼─────┤
│"Tyrion"    │["PERSON"]  │"372"│
├────────────┼────────────┼─────┤
│"Arya"      │["PERSON"]  │"358"│
├────────────┼────────────┼─────┤
│"Robb"      │["PERSON"]  │"299"│
├────────────┼────────────┼─────┤
│"Dany"      │["PERSON"]  │"280"│
├────────────┼────────────┼─────┤
│"Sam"       │["PERSON"]  │"240"│
├────────────┼────────────┼─────┤
│"Davos"     │["PERSON"]  │"229"│
├────────────┼────────────┼─────┤
│"Joffrey"   │["PERSON"]  │"210"│
├────────────┼────────────┼─────┤
│"Catelyn"   │["PERSON"]  │"206"│
├────────────┼────────────┼─────┤
│"Sansa"     │["LOCATION"]│"193"│
├────────────┼────────────┼─────┤
│"Cersei"    │["PERSON"]  │"162"│
├────────────┼────────────┼─────┤
│"Tywin"     │["PERSON"]  │"161"│
├────────────┼────────────┼─────┤
│"Brienne"   │["PERSON"]  │"138"│
├────────────┼────────────┼─────┤
│"Robert"    │["PERSON"]  │"126"│
├────────────┼────────────┼─────┤
│"Winterfell"│["LOCATION"]│"101"│
├────────────┼────────────┼─────┤
│"Ygritte"   │["PERSON"]  │"101"│
├────────────┼────────────┼─────┤
│"Meera"     │["PERSON"]  │"101"│
├────────────┼────────────┼─────┤
│"Beric"     │["PERSON"]  │"98" │
├────────────┼────────────┼─────┤
│"Jojen"     │["PERSON"]  │"94" │
└────────────┴────────────┴─────┘

It is very impressive that on this artificial text, with no real people, locations or organizations, the NLP approach still yields such correct results. (Michael Hunger – Neo4j, Inc).

Which chapter has the most different tags?

MATCH (c:Chapter)-[:HAS_PARAGRAPH]->(p)-[:HAS_ANNOTATED_TEXT]->(at)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(t)
WHERE size(t.ne) > 0
RETURN c.number, c.title, count(*) as f ORDER BY f DESC LIMIT 5

╒══════════╤══════════╤══════╕
│"c.number"│"c.title" │"f"   │
╞══════════╪══════════╪══════╡
│"61"      │"TYRION"  │"4453"│
├──────────┼──────────┼──────┤
│"5"       │"TYRION"  │"3693"│
├──────────┼──────────┼──────┤
│"24"      │"DAENERYS"│"3678"│
├──────────┼──────────┼──────┤
│"34"      │"SAMWELL" │"3659"│
├──────────┼──────────┼──────┤
│"57"      │"BRAN"    │"3564"│
└──────────┴──────────┴──────┘

Making sense of thrones

We can now use this base model to make more advanced (fun) queries on the data

First occurrences of persons in the book

MATCH (t:Tag) WHERE t:NER_Person
WITH t, size((t)<-[:HAS_TAG]-()) AS f
ORDER BY f DESC LIMIT 10
MATCH (t)<-[:HAS_TAG]-(s)<-[:CONTAINS_SENTENCE]-(at)<-[:HAS_ANNOTATED_TEXT]-(para)<-[:HAS_PARAGRAPH]-(c)
WITH t.value as person,s.sentenceNumber as num, para.position as paraNum, c.number as chapterNum, c.title as chapter, s.text as sentence, apoc.text.format('C%03d-P%03d-S%03d',[c.number,para.position,s.sentenceNumber]) as position
WHERE chapterNum > 1
WITH * ORDER BY position ASC 
RETURN person, collect({chapter: chapter, sentence: sentence, position: position})[0]

While the query might look complex, it is just traversing the graph. Sentence nodes contain their position in the paragraph, which makes it easy to retrieve paragraphs in the right order.

╒═════════╤══════════════════════════════╕
│"person" │"collect({chapter: chapter, se│
│         │ntence: sentence, position: po│
│         │sition})[0]"                  │
╞═════════╪══════════════════════════════╡
│"Jaime"  │{"chapter":"JAIME","sentence":│
│         │"Not that Jaime had ever seen │
│         │her smiling.","position":"C002│
│         │-P003-S002"}                  │
├─────────┼──────────────────────────────┤
│"Robb"   │{"chapter":"JAIME","sentence":│
│         │"She stood at Robb’s left hand│
│         │ beside the high seat, and for│
│         │ a moment felt almost as if sh│
│         │e were looking down at her own│
│         │ dead, at Bran and Rickon.","p│
│         │osition":"C002-P003-S000"}    │
├─────────┼──────────────────────────────┤
│"Tyrion" │{"chapter":"CATELYN","sentence│
│         │":"Tyrion felt their eyes on h│
│         │im as he rode past; chilly eye│
│         │s, angry and unsympathetic.","│
│         │position":"C003-P002-S001"}   │
├─────────┼──────────────────────────────┤
│"Joffrey"│{"chapter":"JAIME","sentence":│
│         │"The moonstones Joffrey gave h│
│         │er.”","position":"C002-P005-S0│
│         │02"}                          │
├─────────┼──────────────────────────────┤
│"Jon"    │{"chapter":"JAIME","sentence":│
│         │"Jon had never met anyone so s│
│         │tubborn, except maybe for his │
│         │little sister Arya.","position│
│         │":"C002-P005-S001"}           │
├─────────┼──────────────────────────────┤
│"Catelyn"│{"chapter":"JAIME","sentence":│
│         │"A silence fell across the tor│
│         │chlit hall, and in the quiet C│
│         │atelyn could hear Grey Wind ho│
│         │wling half a castle away.","po│
│         │sition":"C002-P002-S001"}     │
├─────────┼──────────────────────────────┤
│"Arya"   │{"chapter":"JAIME","sentence":│
│         │"Jon had never met anyone so s│
│         │tubborn, except maybe for his │
│         │little sister Arya.","position│
│         │":"C002-P005-S001"}           │
├─────────┼──────────────────────────────┤
│"Davos"  │{"chapter":"JAIME","sentence":│
│         │"But Davos could not complain │
│         │of chill.","position":"C002-P0│
│         │04-S000"}                     │
├─────────┼──────────────────────────────┤
│"Dany"   │{"chapter":"JAIME","sentence":│
│         │"The harpy of Ghis, Dany thoug│
│         │ht.","position":"C002-P003-S00│
│         │0"}                           │
├─────────┼──────────────────────────────┤
│"Sam"    │{"chapter":"CATELYN","sentence│
│         │":"Sam was trying to feed him │
│         │onion broth, but he could not │
│         │swallow.","position":"C003-P00│
│         │3-S003"}                      │
└─────────┴──────────────────────────────┘

Interaction graph

The idea here is that we are looking for who is interacting with another person in the same sentence and creating a new OCCURS_WITH relationship between the entities

MATCH (a:AnnotatedText)-[:CONTAINS_SENTENCE]->(s:Sentence)-[:SENTENCE_TAG_OCCURRENCE]->(to:TagOccurrence)-[:TAG_OCCURRENCE_TAG]->(tag)
WHERE tag:NER_Person 
WITH a, to, tag
ORDER BY s.id, to.startPosition
WITH a, collect(tag) as tags
UNWIND range(0, size(tags) - 2) as i
WITH a, tags[i] as tag1, tags[i+1] as tag2 WHERE tag1 <> tag2
MERGE (tag1)-[r:OCCURS_WITH]-(tag2)
ON CREATE SET r.freq = 1
ON MATCH SET r.freq = r.freq + 1

Tag Occurrence nodes are an intermediate representation between a sentence and a tag found in it. Because Tag nodes are unique, this offers the possibility to have extra information about the occurrence of a tag in a specific sentence.

Find the queen/lord/king/lady

MATCH (n:Tag) WHERE n.value IN ['lord','queen','king','lady']
MATCH (n)<-[:TAG_OCCURRENCE_TAG]-(to)<-[:SENTENCE_TAG_OCCURRENCE]-(s)-[:SENTENCE_TAG_OCCURRENCE]->(to2)-[:TAG_OCCURRENCE_TAG]->(t2) WHERE to2.startPosition = to.endPosition + 1
AND t2:NER_Person
WITH n.value AS v, toLower(t2.value) AS person, count(*) AS f ORDER BY f DESC
RETURN v, collect(person)[0..20]

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

Which organisations and locations were found

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

Entity merging

Finding names and merging them together to create person nodes.

This is done in 2 steps. The first step is to extract name parts into their own nodes and create a hierarchy of name parts :

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

The second step is to traverse from a single name entity up to its top PART_OF relationship and traverse its own cluster in order to determine the entities to merge together.

Let’s first check what the results will look like :

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

We can now create the Person entities and relate them to their corresponding Tag nodes :

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

And clean the temporary Name nodes

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

Conclusion

The last step is a very interesting step, because it offers us the possibility to compute the interaction graph again, but with more precise Person entities than what we have done previously.

MATCH (p:Paragraph) WHERE NOT p:Processed WITH p LIMIT 500 SET p:Processed
WITH p
CALL ga.nlp.annotate({text: p.text, id: p.ref, pipeline: "customStopwords"})
YIELD result
MERGE (p)-[:HAS_ANNOTATED_TEXT]->(result)

We’ve made the database available (without the original text) in read-only mode here, and a backup can be downloaded from Dropbox here. The Neo4j version used is 3.1.3.

Graph Structures are ideal for storing enriched representations of textual data. Combine this with NLP capabilities and a powerful query language for graphs, and you are able to discover really interesting insights from books.

Meet the authors

Christophe Willemsen

Technology & Infrastructure

Christophe is the CTO at GraphAware, with nearly 15 years of experience in graph technologies. A published O’Reilly author, he leads technology innovation, security, and best practices at GraphAware, where he also co-created Hume, the company’s flagship product. Before joining GraphAware, Christophe served as a communications engineer in the Belgian Navy for over a decade, taking part in multiple international missions.