Deep text understanding combining Graph Models, Named Entity Recognition and Word2Vec


One of the key components of Information Extraction (IE) and Knowledge Discovery (KD) is Named Entity Recognition, a machine learning technique that provides generalization capabilities based on lexical and contextual information. Named Entities are specific language elements that belong to predefined categories, such as names of persons, locations, organizations, chemical elements or space missions. They are not easy to find and subsequently classify (for example, organizations and space missions share similar formatting and sometimes even context), but having them is of significant help for various tasks:

  • improving search capabilities
  • relating documents among themselves or with external information (such as connecting people in a financial document with information from a business registry)
  • relating causes (e.g. weather conditions, accidents, regulatory changes) with effects (e.g. flight or tram delays, stock price changes)

GraphAware Hume (formerly known as GraphAware Knowledge Platform, GKP) provides not only full support for Stanford NLP training and classification facilities, but also optimization techniques out-of-the-box. This blog post highlights how combining multiple techniques can provide higher accuracy than pure Stanford NLP NER, especially when trained on a small corpus.

What’s out there?

There are several approaches to Named Entity Recognition (NER). Among the popular ones are maximum entropy Markov models [1], Conditional Random Fields (CRFs) [2] and neural networks, such as sequence-based Long Short-Term Memory Recurrent Neural Networks (LSTMs) [3]. In this blog post we focus on CRFs. While they are not necessarily the most efficient - the computational complexity of the training phase is high - many experiments show that they notably outperform Markov models and provide state-of-the-art accuracy. Indeed, it is a CRF classifier that has been adopted by Stanford CoreNLP [2], which Hume integrates.

A CRF is a discriminative undirected probabilistic graphical model which performs well on sequence tagging tasks. In the case of text, this means assigning class labels, such as Named Entity or part-of-speech tags, to tokens (words) in an input sequence (sentence). Under the hood, it is essentially a sequential version of logistic regression: it defines a number of feature functions, assigns them weights (which can be trained, for example with stochastic gradient descent), sums them and computes the probabilities of the different classes. The feature functions typically depend on the current sentence, the position of the token in the sentence, nearby labels and the token itself. Both contextual and lexical information is thus used to make predictions.
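
To make the notion of feature functions more concrete, here is a minimal sketch using the open-source sklearn-crfsuite library rather than the Stanford CoreNLP implementation that Hume wraps; the feature set and the toy training sentence are illustrative assumptions only.

# Minimal CRF tagging sketch with sklearn-crfsuite (pip install sklearn-crfsuite).
# Feature functions combine lexical information (the word itself, its shape)
# with contextual information (the neighbouring words).
import sklearn_crfsuite

def word2features(sentence, i):
    word = sentence[i]
    return {
        'word.lower': word.lower(),                                   # lexical
        'word.istitle': word.istitle(),                               # word shape
        'word.isdigit': word.isdigit(),
        'prev.word': sentence[i - 1].lower() if i > 0 else '<BOS>',   # left context
        'next.word': sentence[i + 1].lower() if i < len(sentence) - 1 else '<EOS>',  # right context
    }

def sent2features(sentence):
    return [word2features(sentence, i) for i in range(len(sentence))]

# One toy labelled sentence (token -> tag); a real model needs many of these.
sentence = ['A', 'concertina', 'is', 'a', 'free-reed', 'musical', 'instrument']
labels = ['O', 'MUSICAL_INSTRUMENT', 'O', 'O', 'O', 'O', 'O']

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit([sent2features(sentence)], [labels])
print(crf.predict([sent2features(sentence)]))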

Customisation of Named Entities

Let's demonstrate the utility of Named Entity Recognition in a specific use case. In order to do so, we created our own training and testing dataset by scraping Wikipedia. We selected a well-defined set of categories, taking into account the number of documents as well as their orthogonality and similarity. The resulting list is the following:

Classical music, Musical instruments, Composers for violin, Italian musicians, Artificial satellites orbiting Earth, Terrestrial planets, Physics, Physicists, Quantum mechanics, Particle physics, Theory of relativity, Organic chemistry, Nuclear chemistry, Graph algorithms, Machine learning algorithms, Artificial neural networks, Natural language processing, German painters, French painters, Dutch painters, Impressionism, Expressionism, Art Nouveau architects, Islamic architecture, Neoclassical architecture, Foods, Drinks, Countries in Europe, Countries in Africa, Countries in Asia

Altogether, over 120k Wikipedia articles have been collected, with an average of 185 unique lemmatized words per article. Starting from this corpus, we:

  • select a specific class of named entities: Musical Instruments
  • create a training dataset by automatically annotating content in the pages
  • train a custom NER model in Stanford NLP
  • evaluate the classification performance

After the first set of tests, we decided to improve the baseline quality of the classification provided by Stanford with other tools available out-of-the-box in Hume.

Create a proper training dataset

The most frequent obstacle when training your own supervised machine learning model - Named Entity Recognizers included - is the lack of labeled training data. Supervised learning usually requires massive labeled datasets, and producing them the traditional way (manual labeling by humans) is both time-consuming and costly. There are, however, often viable alternatives.

Wikipedia is an example of a community-annotated open knowledge base which can be used as a source of training data for Named Entities, as is discussed for example in [4]. Moreover, thanks to the interlanguage links in Wikipedia articles, it is simple enough to build multilingual labeled datasets, which is a significant bonus point for non-English business use cases.

In our use case, what we need from Wikipedia is the ability to identify articles about musical instruments. We can then use these articles to build a training dataset with labeled mentions of musical instruments. One option is to run well-tuned classification algorithms that identify musical instrument articles, which can then be used to build domain-specific knowledge (a dictionary). Alternatively, we can leverage the info-boxes in the top right corners of Wikipedia articles (see figure): they briefly summarize the most important facts and usually have the same or a very similar structure within a given domain. It is then simple to crawl through relevant pages by following outgoing hyperlinks which lead to other musical instruments.

Wikipedia page of mandolin instrument with the info-box.
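
To give a rough idea of what the crawling step looks like, the sketch below pulls the outgoing links of one article through the public MediaWiki API; the endpoint and parameters are standard, but deciding which of those links actually are musical instruments is where the real crawler does most of its work.

# Sketch: fetch the outgoing links of a Wikipedia article via the MediaWiki API.
# Pagination ('continue' responses for articles with more than 500 links) is omitted.
import requests

API = 'https://en.wikipedia.org/w/api.php'

def outgoing_links(title):
    params = {'action': 'query', 'prop': 'links', 'titles': title,
              'pllimit': 'max', 'format': 'json'}
    data = requests.get(API, params=params).json()
    links = []
    for page in data['query']['pages'].values():
        links.extend(link['title'] for link in page.get('links', []))
    return links

print(outgoing_links('Mandolin')[:10])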

We implemented a Python script for the training dataset creation. It crawls a specific category and converts each relevant article into training data, i.e. it tokenizes the text and places each token along with its entity type, separated by a tab, on a separate line. The result looks like the following:

A   O
concertina  MUSICAL_INSTRUMENT
is  O
a   O
free-reed   O
musical O
instrument  O
,   O
like    O
the O
various O
accordions  MUSICAL_INSTRUMENT
and O
the O
harmonica   MUSICAL_INSTRUMENT
.   O

The entity type (MUSICAL_INSTRUMENT, in our case) is identified based on the current article's outgoing hyperlinks as well as on previously explored articles, because no article is perfect - sometimes it mentions entities without a hyperlink to their dedicated Wikipedia page.
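
As an illustration only (the actual crawler is more involved and handles multi-word names), the labelling step could look roughly like the sketch below; known_instruments stands for the dictionary accumulated from hyperlink anchors and previously explored articles.

# Illustrative sketch of emitting Stanford-style TSV training data.
# known_instruments is assumed to be a set of lower-cased surface forms
# collected from info-box hyperlinks and previously explored articles.
from nltk.tokenize import word_tokenize  # requires nltk and its 'punkt' data

def article_to_tsv_lines(text, known_instruments):
    lines = []
    for token in word_tokenize(text):
        label = 'MUSICAL_INSTRUMENT' if token.lower() in known_instruments else 'O'
        lines.append(f"{token}\t{label}")
    return lines

known_instruments = {'concertina', 'accordions', 'harmonica'}  # toy dictionary
text = ("A concertina is a free-reed musical instrument, "
        "like the various accordions and the harmonica.")
print("\n".join(article_to_tsv_lines(text, known_instruments)))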

During the crawling process, each outgoing link of the MUSICAL_INSTRUMENT type is analysed to determine the instrument name, its alternative names and possible abbreviations. During this phase, we can call upon our old and reliable friend - Stanford CoreNLP - which performs tokenization and dependency parsing from both the command line and Python (take your pick). We use it to analyse the article title and the first sentence of the first paragraph, which typically contains additional information such as alternative names and abbreviations. These are identifiable through Stanford universal dependencies: APPOS (appositional modifier) points to abbreviations and CONJ (conjunction) to alternative names. For example, the Wikipedia article about 'cornett' starts with "The cornett, cornetto, or zink is an early wind instrument that dates from ...": we want to be able to label all three terms - cornett, cornetto, zink - as MUSICAL_INSTRUMENT in our training corpus.
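
The sketch below shows the idea on pre-parsed output; tokens and dependencies are hand-written stand-ins for the (head, relation, dependent) triples that CoreNLP's dependency parser would return for the cornett sentence, so the exact data structures are assumptions.

# Sketch: harvest alternative names of an instrument from the first sentence.
# The dependency triples below are written by hand for illustration; in practice
# they come from CoreNLP's dependency parse (indices are 0-based here).
tokens = ['The', 'cornett', ',', 'cornetto', ',', 'or', 'zink', 'is', 'an',
          'early', 'wind', 'instrument']
dependencies = [
    (1, 'appos', 3),  # 'cornetto' is an appositional modifier of 'cornett'
    (1, 'conj', 6),   # 'zink' is conjoined with 'cornett'
]

def alternative_names(head_index, tokens, dependencies):
    names = {tokens[head_index]}
    for head, rel, dep in dependencies:
        if head == head_index and rel in ('appos', 'conj'):
            names.add(tokens[dep])
    return names

print(alternative_names(1, tokens, dependencies))  # {'cornett', 'cornetto', 'zink'}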

Training procedure

Thanks to the method briefly described in the previous section and implemented in Python, we created our training dataset: 900k labeled tokens in a 6.6 MB TSV (tab-separated values) file, containing 412 unique musical instruments.

We then used the GraphAware Neo4j NLP plugins, part of the Hume infrastructure, to train the Stanford CoreNLP CRF classifier. Before training, we split the dataset into training and test parts using the 80-20 approach, i.e. using 80% of the labeled data for training and 20% for testing. The queries to train custom models in Stanford NLP are:

// First query: define workdir where your train & test data are located
CALL ga.nlp.config.model.workdir("/Users/DrWho/workdir/data/nasa")

// Second query: run actual training
CALL ga.nlp.processor.train({textProcessor: "com.graphaware.nlp.processor.stanford.StanfordTextProcessor", alg: "ner", modelIdentifier: "musical-instruments", inputFile: "ner-musical_instruments.train.tsv"})

After successful training, we can run the evaluation procedure:

CALL ga.nlp.processor.test({textProcessor: "com.graphaware.nlp.processor.stanford.StanfordTextProcessor", alg: "ner", modelIdentifier: "musical-instruments", inputFile: "ner-musical_instruments.test.tsv"})

which returns precision (how many identified entities are relevant?), recall (how many relevant entities are identified?) and the F1 score (the harmonic mean of precision and recall).
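
For reference, with TP, FP and FN denoting true positives, false positives and false negatives, these metrics boil down to a few lines; the snippet is just a reminder, not part of Hume.

# Precision, recall and F1 from raw counts / component scores.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(0.93, 0.81))  # ~0.866, the combination we will meet later on the full corpus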

In our case, the algorithm achieved an F1 score of 95% - a stunning result, given the modest training dataset size and the automated labelling approach!

Looking for musical instruments

The result of the tasks above is a Stanford model for recognizing musical instruments in any text. Such a model can be exported and made available for use in any project. We are working on a set of valuable NER models built from multiple data sources for the most common use cases.

In our musical instruments scenario, the trained model is now serialised to the workdir defined before training. To use it, we need to create a text processing pipeline which uses it as a custom NER, for example:

CALL ga.nlp.processor.addPipeline({name: 'musicalInstrumentsNER', textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor', processingSteps: {tokenize: true, ner: true, sentiment: false, dependency: true, customNER: 'musical-instruments'}, stopWords: '+,have, use, can, should, shall, from, for, may, all, during, more, make, between, do, about, above, after, again, against'})

The musicalInstrumentsNER pipeline can now be used to annotate the full Wikipedia corpus (120k articles):

CALL apoc.periodic.iterate("MATCH (n:Wikipage) WHERE NOT (n)-[:HAS_ANNOTATED_TEXT]-() RETURN n",
"CALL ga.nlp.annotate({text: n.text, id: id(n), pipeline: 'musicalInstrumentsNER'}) YIELD result
MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)",
{batchSize: 1, iterateList: false, parallel: false})

We got 978 identified musical instruments. This means that the NER algorithm, using a custom model trained on 412 musical instruments, was able to infer what the "musical instrument" class represents. It identified 566 new entities not seen during the training phase!

Among the newly identified instruments were dulciaan, geophone flute, waterphone and banjolin. There are, of course, also fakes: star, block, flux or helium are clearly not musical instruments. But, as we will see, they are relatively rare. This shows the value of the CRF classifier even when trained on a small (and incomplete) dataset.

The top 20 reconstructed musical instruments are shown in the table below. n_docs is the number of documents where the tag occurred and n_docs_NE is the number of documents where the tag was identified as a musical instrument. As with every machine learning algorithm, there are cases of false positives, such as star, block, helium:

Entity n_docs n_docs_NE Precision
star 11000 6 0.0
piano 4521 4206 0.93
block 3521 1651 0.47
violin 2628 2381 0.91
cells 2113 2112 1.0
organ 1845 1750 0.95
guitar 1693 1095 0.65
drum 1232 1058 0.86
cello 1225 1087 0.89
flux 1195 1 0.0
flute 1180 947 0.8
helium 1152 2 0.0
horn 960 826 0.86
drums 952 933 0.98
bell 854 699 0.82
pipe 849 172 0.2
organs 780 777 1.0
clarinet 771 693 0.9
violins 707 700 0.99
trumpet 670 603 0.9

We can see that some of the fake entities occur quite often in the corpus, but they were misidentified only very rarely: for example, helium occurs in 1152 documents, but it was tagged as an instrument in only 2 of them. This observation leads us to the design of a cleaning procedure, discussed later.

To evaluate the results more quantitatively, we can calculate precision and recall on the real dataset. This would normally require dedicating significant human time to going through the dataset and labeling identified entities as true or false. Instead, we decided to inspect only occurrences of those musical instruments that were present in the training corpus. Precision is found to be 93% and recall 81% (F1 score 86%). Not bad at all!

Play the right instrument

The 93% precision reached by a pure CRF approach is valuable, but it can be improved. Let’s now discuss the multiple ways in which this can be done and how Hume supports them.

The obvious way to further refine the named entities is to go back and improve the training dataset: for example, by labelling many more articles from related categories, such as "Classical music", "Composers" or "Italian musicians", or by improving the quality of the automated labelling procedure, especially regarding missed musical instruments.

Alternatively, or in combination with the previous approach, we can refine the results using word embeddings.

Word embeddings to the rescue

Word embeddings [5] (word2vec) are vector representations of words designed to capture general word meaning by analysing the context in which words occur. The concept is the same as with the document embeddings discussed in a separate blog post. The model is a shallow neural network whose power lies in betting on simplicity (a single projection layer, no non-linear hidden layers), which allows it to be trained efficiently on large volumes of data, in the end qualitatively surpassing deep neural network architectures.

The power of word2vec is that these dense vector representations indeed behave like mathematical vectors: they can be added, subtracted and averaged while yielding meaningful results. For example:

vec(Berlin) - vec(Germany) + vec(France) ≈ vec(Paris)
vec(Japan) - vec(sushi) + vec(Germany) ≈ vec(bratwurst)
vec(Einstein) - vec(scientist) + vec(Picasso) ≈ vec(painter)

The embeddings of similar words (similar based on the context in which words occur) project them into the same area of vector space.

There are various projects that offer pre-trained word embeddings for download and use. This is particularly useful if our corpus is not large enough to train decent quality vectors ourselves.
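
Outside of Hume, the same kind of experiments can be reproduced with the gensim library; the sketch below assumes you have downloaded the English ConceptNet Numberbatch vectors in word2vec text format (the file name is an assumption, adjust it to your download).

# Sketch: load pre-trained vectors with gensim and play with vector arithmetic.
from gensim.models import KeyedVectors

# Assumed local file name; Numberbatch is distributed in word2vec text format.
vectors = KeyedVectors.load_word2vec_format('numberbatch-en.txt.gz', binary=False)

# Analogy: Berlin - Germany + France should land near Paris.
print(vectors.most_similar(positive=['berlin', 'france'], negative=['germany'], topn=3))

# Direct cosine similarity between two instruments.
print(vectors.similarity('mandolin', 'banjo'))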

Hume provides comprehensive support for word2vec, including:

  • Computing word2vec from the imported corpus
  • Importing pre-trained word2vec models (tested with ConceptNet Numberbatch and Facebook's fastText)
  • Computing similarity between words

For example, we can download ConceptNet Numberbatch vectors and import them:

CALL ga.nlp.ml.word2vec.addModel('/Users/DrWho/workdir/numberbatch', '/Users/DrWho/neo4j/import/word2vec_idx', 'en-numberbatch')

where the first argument is the path to the downloaded vectors, the second argument specifies where to store the word2vec index, and "en-numberbatch" is a custom identifier that will be used to refer to this specific model.

We can now run this query in Hume:

WITH ga.nlp.ml.word2vec.wordVector('mandolin', 'en-numberbatch') AS vec1,
ga.nlp.ml.word2vec.wordVector('banjo', 'en-numberbatch') AS vec2
RETURN ga.nlp.ml.similarity.cosine(vec1, vec2) AS similarity

This retrieves the vectors for the words mandolin and banjo and returns their cosine similarity of 0.81.

Unfortunately, the pre-trained word vectors are not comprehensive: for example, bass guitar has no vector in the Numberbatch database, but both bass and guitar do! We can bypass this shortcoming by running annotation on all our named entities, which breaks them into lemmatised tokens. The annotation can be done like this:

MATCH (t:NER_Musical_instrument)
WITH t, ga.nlp.processor.annotate(t.value, {name: 'tokenizer_enterprise'}).sentences[0].tagOccurrences AS tos
UNWIND keys(tos) AS i
WITH t, tos, toInt(i) AS ii
ORDER BY ii ASC
WITH t, collect(tos[toString(ii)][0].element.lemma) AS ts
SET t.value_tags = ts

Tag nodes now contain a value_tags property, which is an array of lemmatized words. We quickly noticed that this helps even with single-word instruments:

MATCH (n:NER_Musical_instrument)
WHERE size(n.value_tags) = 1 AND NOT exists(n.word2vec_array) AND size(ga.nlp.ml.word2vec.wordVector(n.value_tags[0], 'en-numberbatch')) > 0
RETURN n.value, n.value_tags[0]

This returns 85 musical instruments where lemmatization (removing plurals) helped to get vectors:

n.value (no word2vec available) n.value_tags[0] (word2vec available)
viols viol
cornetts cornett
lyres lyre
heckelphones heckelphone
virginals virginal
 

Similarly for multi-word instruments: the simplest thing to do is to check whether the right-most token has an embedding and, if so, treat the whole entity as represented by that vector. This comes from the observation that the right-most word of a multi-word instrument name is usually the most relevant one, for example: bass guitar, electric guitar, pipe organ, tin whistle, tenor saxophone.

In total, among the 707 instruments that don't have an embedding of their own, 357 can get the embedding of at least one of their lemmatized words. Together with the 271 entities that have embeddings out-of-the-box, that makes 628 vectorised musical instruments. This significantly improves our ability to use word2vec for cleaning the final list of identified named entities.

The cleaning process works as follows (a simplified sketch follows the list):

  • retrieve embeddings where available; where not, run annotation and use the vector of the right-most word or the average of all available word vectors
  • for each entity, calculate the average cosine similarity to all other entities of the same class (in parallel, using APOC periodic iterate for efficiency)
  • inspect the results, choose an appropriate threshold (we used 0.2) and label the entities below it as fakes (or remove them completely)
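
A simplified, Hume-independent sketch of this cleaning logic follows; entity_vector shows the right-most-word / average fallback described above, the 0.2 threshold is the one we used, and the toy vectors are invented purely for illustration.

# Sketch of the cleaning step: flag entities whose average cosine similarity
# to the rest of the class falls below a threshold.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def entity_vector(value_tags, kv):
    # Right-most lemma if it has a vector, otherwise the average of the available ones.
    # kv can be a plain dict or a gensim KeyedVectors object.
    if value_tags[-1] in kv:
        return kv[value_tags[-1]]
    vecs = [kv[t] for t in value_tags if t in kv]
    return np.mean(vecs, axis=0) if vecs else None

def flag_fakes(entity_vectors, threshold=0.2):
    # entity_vectors: dict mapping entity name -> vector.
    fakes = []
    for name, vec in entity_vectors.items():
        others = [v for other, v in entity_vectors.items() if other != name]
        avg_sim = float(np.mean([cosine(vec, v) for v in others]))
        if avg_sim < threshold:
            fakes.append((name, avg_sim))
    return fakes

# Deterministic toy example: 'helium' shares no direction with the instruments.
toy = {
    'piano': np.array([1.0, 0.9, 0.0]),
    'violin': np.array([0.9, 1.0, 0.0]),
    'helium': np.array([0.0, 0.0, 1.0]),
}
print(flag_fakes(toy))  # [('helium', 0.0)]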

Using this cleaning approach, precision jumps from 93% to 99%! Recall drops slightly, from 81% to 77%. This means we have also inadvertently removed some good instrument candidates - a price to pay for a significant precision boost. We can adjust the threshold and eventually combine this technique with others, such as ontologies, which will be described in future blog posts.

Conclusions

In this blog post, we discussed one of the key components on our way towards knowledge discovery and navigation: the identification of named entities. Having a proper "class" that defines a portion of the text (a single word or a small set of words) not only improves information retrieval tasks but also helps to discover more information in the text, such as relationships between elements.

Hume provides customers with full support for performing the following tasks:

  • Train and evaluate a custom NER model
  • Configure a text processing pipeline to use the custom model to infer entities in the text
  • Refine the results by combining them with other techniques like word2vec and ontology

As a side result of this effort, the Hume team has developed mechanisms for creating specific training datasets using multiple resources. We are currently working on the creation of multiple models aligned with key business scenarios, such as

  • healthcare,
  • finance,
  • law,

to cite a few.

Such models will be available to Hume users who would like to recognize more entities in their textual data.

Want to know more? Get in touch with GraphAware to see what Hume can do for you.

Bibliography

[1] A. McCallum et al., "Maximum Entropy Markov Models for Information Extraction and Segmentation", http://www.ai.mit.edu/courses/6.891-nlp/READINGS/maxent.pdf

[2] J. R. Finkel et al., "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling", Proceedings of the 43rd Annual Meeting of the ACL, 2005, pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf

[3] G. Lample et al., “Neural architectures for named entity recognition”, Proceedings of NAACL-HLT 2016

[4] Jian Ni, Radu Florian, “Improving Multilingual Named Entity Recognition with Wikipedia Entity Type Mapping”, arXiv:1707.02459 [cs.CL] (https://arxiv.org/abs/1707.02459)

[5] T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space”, arXiv:1301.3781 [cs.CL]

Dr. Vlasta Kůs

Dr. Alessandro Negro

Research & Development | Neo4j certification

Dr. Alessandro Negro holds a Ph.D. in Computer Science and is a leading authority on graph-based AI and Machine Learning. Dr. Negro is an expert in computer science, graphs, and data science, specialising in natural language processing, recommendation engines, fraud detection, and knowledge graphs. He has written two books on these topics: Graph-Powered Machine Learning (Manning, 2021) and Knowledge Graphs Applied (Manning, estimated publication in 2024) and his expertise is highly sought after within the industry.