Phonetic matching attempts to match words by pronunciation instead of spelling. Words are often misspelled, and exact matching then fails to find them. Algorithms such as Soundex and Metaphone were developed to address this problem, and they are used in voice assistants, search, record linkage, fraud detection, and matching misspelled names (for example, drug names in medical records).
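To make the idea concrete, here is a minimal sketch of classic American Soundex in plain Java. It is only an illustration of how a phonetic code collapses different spellings into one key; the class name SoundexSketch is made up for this example, and the analyzer built later in this post uses Double Metaphone, which is considerably more elaborate.

```java
class SoundexSketch {
    // Soundex digit for each letter A..Z; '0' means "not coded" (vowels, Y, H, W).
    private static final String CODES = "01230120022455012623010202";

    /** Returns the 4-character Soundex code; non A-Z characters are ignored. */
    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char lastDigit = '0';
        for (char raw : word.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') {
                continue;
            }
            char digit = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                // The first letter is kept as-is, but its digit still
                // suppresses an immediately following letter with the same digit.
                out.append(c);
                lastDigit = digit;
            } else if (c == 'H' || c == 'W') {
                // H and W are skipped and do not separate equal digits.
                continue;
            } else {
                if (digit != '0' && digit != lastDigit) {
                    out.append(digit);
                }
                // Vowels reset lastDigit, so a repeated consonant after a vowel is coded again.
                lastDigit = digit;
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Robert", "Rupert", "Financial", "fynansial"}) {
            System.out.println(w + " -> " + soundex(w));
        }
    }
}
```

Under this sketch, "Robert" and "Rupert" share one code, as do "Financial" and "fynansial", which is exactly the property a phonetic index relies on.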
Custom analyzers
In 2019, we blogged about creating a Czech analyzer to address accents in the language.
With Neo4j 4, a few things have changed. This short blog post was inspired by a Stack Overflow question on phonetic search, which led me to discover what had to change to register an analyzer in Neo4j 4.
First, we create our phonetic analyzer by extending org.neo4j.graphdb.schema.AnalyzerProvider. The @Service.Implementation annotation has been replaced by just @ServiceProvider. This particular implementation just uses a DoubleMetaphoneFilter.
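import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.phonetic.DoubleMetaphoneFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.neo4j.annotations.service.ServiceProvider;
import org.neo4j.graphdb.schema.AnalyzerProvider;

@ServiceProvider
public class PhoneticAnalyzer extends AnalyzerProvider {

    // Maximum length of the phonetic codes generated for each token.
    public static final int MAX_CODE_LENGTH = 6;

    public PhoneticAnalyzer() {
        // The name under which the analyzer is registered and referenced in index configuration.
        super("phonetic");
    }

    @Override
    public String description() {
        return "Phonetic analyzer using the DoubleMetaphoneFilter";
    }

    @Override
    public Analyzer createAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String s) {
                Tokenizer tokenizer = new StandardTokenizer();
                // inject = true keeps the original token alongside its phonetic codes.
                TokenStream stream = new DoubleMetaphoneFilter(tokenizer, MAX_CODE_LENGTH, true);
                return new TokenStreamComponents(tokenizer, stream);
            }
        };
    }
}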
Pretty simple. Package it into a jar and put it into Neo4j's plugins directory along with Lucene's phonetic jar, restart the server, and then verify that our new analyzer is registered by inspecting the results of CALL db.index.fulltext.listAvailableAnalyzers(). We should see the phonetic analyzer listed.
Now, as before, create an index using the new analyzer:
CALL db.index.fulltext.createNodeIndex('jobs', ['Job'], ['name'], {analyzer: 'phonetic'})
And query in the same manner:
CALL db.index.fulltext.queryNodes('jobs', 'fynansial')
╒══════════════════════════════════╤══════════════════╕
│"node"                            │"score"           │
╞══════════════════════════════════╪══════════════════╡
│{"name":"Financial Administrator"}│0.2163023203611374│
└──────────────────────────────────┴──────────────────┘
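Why does the misspelled query find the node? Roughly speaking, both the indexed text and the query pass through the same analyzer, so they end up sharing a phonetic term. The sketch below mimics that with a tiny in-memory inverted index in plain Java. Everything in it is hypothetical: ToyPhoneticIndex and toyKey are invented for this illustration, and toyKey is a deliberately crude stand-in, not Double Metaphone and not what Lucene does internally. Like the analyzer above with inject set to true, it stores the original token next to its phonetic key.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ToyPhoneticIndex {
    // term -> documents whose analyzed text contains that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Crude, hypothetical phonetic key: NOT Double Metaphone. */
    static String toyKey(String token) {
        String s = token.toLowerCase(Locale.ROOT)
                .replace('y', 'i')      // treat y as i
                .replace("ci", "si")    // soft c sounds like s
                .replace("ce", "se");
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean vowel = "aeiou".indexOf(c) >= 0;
            if (i > 0 && vowel) {
                continue; // drop vowels after the first letter
            }
            if (key.length() > 0 && key.charAt(key.length() - 1) == c) {
                continue; // collapse repeated letters in the key
            }
            key.append(c);
        }
        return key.toString();
    }

    /** Splits on non-letters and emits each lowercased token plus its toy key. */
    static Set<String> analyze(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            terms.add(token.toLowerCase(Locale.ROOT)); // the "injected" original
            terms.add(toyKey(token));                  // the phonetic key
        }
        return terms;
    }

    void add(String doc) {
        for (String term : analyze(doc)) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(doc);
        }
    }

    /** Returns every document sharing at least one analyzed term with the query. */
    Set<String> query(String text) {
        Set<String> hits = new LinkedHashSet<>();
        for (String term : analyze(text)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyPhoneticIndex index = new ToyPhoneticIndex();
        index.add("Financial Administrator");
        index.add("Software Engineer");
        System.out.println(index.query("fynansial"));
    }
}
```

In this toy version, "Financial" and "fynansial" both reduce to the same key, so the query hits "Financial Administrator" even though the spellings differ. A real analyzer also feeds relevance scoring, which this sketch leaves out entirely.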
This code is available on GitHub.
References:
Tissot, H., Dobson, R. Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese. J Biomed Semant 10, 17 (2019). https://doi.org/10.1186/s13326-019-0216-2