We have already blogged about fulltext search available in Neo4j 3.5. The list of available analyzers covers many languages and fits various use cases. However, once you expose the search to real users, they will start pointing out edge cases and complaining that the search is not Google-like.
Speakers of languages that use accents in their written form quite often leave the accents out. There are various reasons for this; the most common ones are
- historical: different character encodings used to cause problems, and users find it hard to change their habits
- using a different default keyboard layout (e.g. en_US); switching the layout just for a search keyword is annoying
- the accented letters are in the top keyboard row and are slightly harder to reach, reducing WPM/CPM (words per minute, characters per minute)
A common complaint among such users is that the search doesn’t ignore accents. Let’s look at an example with Czech names and the provided Czech analyzer. We will create some sample data and a fulltext index for the name property.
CREATE (:Person {name: 'Petr Černý'})
CREATE (:Person {name: 'Ivana Černá'})

CALL db.index.fulltext.createNodeIndex('person-name-czech', ['Person'], ['name'], {analyzer: 'czech'})
We can see that querying with accents returns the expected results. The Czech analyzer even handles inflection, so we also get other results containing the same word root.
CALL db.index.fulltext.queryNodes('person-name-czech', 'černý')
YIELD node AS person
RETURN person.name

╒═════════════╕
│"person.name"│
╞═════════════╡
│"Petr Černý" │
├─────────────┤
│"Ivana Černá"│
└─────────────┘
But querying without accents returns nothing:
CALL db.index.fulltext.queryNodes('person-name-czech', 'cerny')
YIELD node AS person
RETURN person.name

(no changes, no records)
Custom analyzer
Let’s use the power of open source and see what exactly the Czech analyzer does in the source code. For a detailed explanation, see the previously mentioned blog post and the Javadoc. The analyzer
- uses StandardTokenizer – splits text into tokens on whitespace and punctuation
- LowerCaseFilter – converts letters to lowercase
- StopFilter – filters out standard Czech stopwords; it is possible to provide a custom list, but not through the Neo4j fulltext index
- SetKeywordMarkerFilter – preparation for the next step: for a given set of keywords, stemming will be skipped
- CzechStemFilter – applies Czech-specific stemming, which handles the inflection
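The first three steps above can be sketched in plain Java as a rough illustration. This is not the real Lucene pipeline: StandardTokenizer follows the full Unicode segmentation rules, the actual Czech stopword list is much longer than the hypothetical sample used here, and keyword marking and stemming are omitted entirely.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class AnalyzerSketch {
    // Tiny illustrative sample; the real Lucene Czech stopword list is much longer.
    private static final Set<String> STOPWORDS = Set.of("a", "je", "to", "na");

    // Rough equivalent of StandardTokenizer + LowerCaseFilter + StopFilter;
    // SetKeywordMarkerFilter and CzechStemFilter are not simulated here.
    public static List<String> analyze(String text) {
        return Arrays.stream(text.split("[\\s\\p{Punct}]+")) // crude tokenizer
                .map(t -> t.toLowerCase(Locale.forLanguageTag("cs")))
                .filter(t -> !t.isEmpty() && !STOPWORDS.contains(t))
                .collect(Collectors.toList());
    }
}
```

For the input "To je Petr Černý." this sketch yields the tokens "petr" and "černý": the stopwords are dropped and everything is lowercased, but the accents are untouched, which is exactly the gap discussed next.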
What is missing is a step that removes the accents. Lucene already provides classes for this, such as ASCIIFoldingFilter or ICUFoldingFilter (from the lucene-analyzers-icu package). Because CzechStemFilter expects tokens with accents, we will add the folding filter as the last step. The new custom CzechAnalyzer will look as follows:
TokenStream result = new StandardFilter(source);
result = new LowerCaseFilter(result);
result = new StopFilter(result, stopwords);
if (!this.stemExclusionTable.isEmpty())
    result = new SetKeywordMarkerFilter(result, stemExclusionTable);
result = new CzechStemFilter(result);
result = new ASCIIFoldingFilter(result);
return new TokenStreamComponents(source, result);
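For Latin-script text like Czech, the effect of the added ASCIIFoldingFilter can be approximated with the JDK alone: decompose each character to NFD form (splitting a letter from its diacritic) and strip the combining marks. This is only a sketch; the real filter also handles ligatures and many other non-ASCII cases.

```java
import java.text.Normalizer;

public class AccentFolding {
    // Decompose accented letters (Č -> C + combining caron),
    // then remove the combining marks.
    public static String fold(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }
}
```

With this folding applied after stemming, both 'černý' and 'cerny' end up as the same indexed token, which is what makes the accent-free query below match.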
We also need to tell Neo4j to add this to the list of available analyzers by implementing a custom AnalyzerProvider:
@Service.Implementation(AnalyzerProvider.class)
public class CustomCzech extends AnalyzerProvider {

    public CustomCzech() {
        super("czech-custom");
    }

    @Override
    public Analyzer createAnalyzer() {
        return new CustomCzechAnalyzer();
    }

    @Override
    public String description() {
        return "Czech analyzer with stemming, stop word filtering and accents removal.";
    }
}
Packaged into a jar, this is then deployed to the plugins directory. We can now create an index with our custom analyzer:
CALL db.index.fulltext.createNodeIndex('person-name-czech-custom', ['Person'], ['name'], {analyzer: 'czech-custom'})
The result of the original accented query hasn’t changed, and we see that the query without accents now returns the desired result:
CALL db.index.fulltext.queryNodes('person-name-czech-custom', 'cerny')
YIELD node AS person
RETURN person.name

╒═════════════╕
│"person.name"│
╞═════════════╡
│"Petr Černý" │
├─────────────┤
│"Ivana Černá"│
└─────────────┘
Conclusion
The modification of the CzechAnalyzer was rather simple, but the approach can be applied to a wide range of use cases. You can check out the whole example project on GitHub or drop us a line if you need help with more sophisticated requirements.