So you have followed the Deep Dive into Neo4j’s Full Text Search tutorial, even learned how to create custom analyzers, and finally watched the Full Text Search tips and tricks talk at the Nodes19 online conference?
Still, searching for `boat` does not yield results containing `yacht` or `ship`, and you’re wondering how to make your search engine a bit more relevant for your users?
Don’t go any further: you’ll learn how to do it right now!
## Synonyms
A synonym is a word or phrase that means exactly or nearly the same as another word or phrase.
## Why synonyms?
It’s all about recall! In other words, it’s about helping your users find the content they’re interested in without them having to know specific terms.
A user searching for `coffee` should probably be seeing results containing `latte macchiato`, `espresso` or even `ristretto`.
## Lists of synonyms
You can find third-party word lists for synonyms, such as WordNet or ConceptNet5. However, appropriate word lists depend on the domain, application and use case, and the best fit is generally a self-curated synonyms word list.
## How to use them?
The first thing to do is to create a word list with the following format:
```
coffee,latte macchiato,espresso,ristretto
boat,yacht,sailing vessel,ship
fts,full text search, fulltext search
```
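To make the format concrete, here is a minimal sketch in plain Java that reads such a list, assuming one comma-separated group of equivalent terms per line. It is not the parser Lucene uses; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: this is not Lucene's parser, and the class and
// method names are hypothetical.
final class SynonymListSketch {

    // Maps every term to the full group of terms listed on the same line.
    static Map<String, List<String>> parse(String fileContent) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String line : fileContent.split("\\R")) {
            if (line.isBlank()) {
                continue; // skip empty lines
            }
            List<String> group = new ArrayList<>();
            for (String term : line.split(",")) {
                group.add(term.trim()); // tolerate spaces around commas
            }
            for (String term : group) {
                groups.put(term, group);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String content = String.join("\n",
                "coffee,latte macchiato,espresso,ristretto",
                "boat,yacht,sailing vessel,ship",
                "fts,full text search, fulltext search");
        // Every term on a line is treated as equivalent to the others.
        System.out.println(parse(content).get("yacht"));
    }
}
```

Looking up `yacht` returns the whole `boat` group, which is the behaviour we want from the search engine: any term of a group should lead to documents containing any other term of that group.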
The next step is to create a custom analyzer using the synonym filter. Since we’re using an analyzer, the first question that might come to mind is:
Do I have to reindex all the documents when my synonyms list changes?
The answer is yes. Using a query-time synonym filter is very bad(TM), for the following reasons:
- The QueryParser tokenizes before giving the text to the analyzer, so if a user searches for `sailing vessel`, the analyzer will be given the words `sailing` and `vessel` separately, and will not know they match a synonym
- Multi-word synonyms will also not work in phrase queries
- The IDF of rare synonyms will be boosted
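The first point can be illustrated with a toy example in plain Java. This is only a sketch of the idea, not how Lucene’s QueryParser is implemented: the single-entry synonym table and all names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy illustration only; names and the synonym table are hypothetical.
final class QueryTimeSynonymSketch {

    // Hypothetical table: the multi-word phrase maps to one canonical term.
    static final Map<String, String> SYNONYMS = Map.of("sailing vessel", "boat");

    // Mimics a parser that splits on whitespace before any synonym lookup:
    // each word is looked up on its own, so the phrase is never seen.
    static List<String> splitThenLookup(String query) {
        List<String> terms = new ArrayList<>();
        for (String word : query.trim().split("\\s+")) {
            terms.add(SYNONYMS.getOrDefault(word, word));
        }
        return terms;
    }

    // Mimics a lookup that still sees the whole phrase.
    static List<String> lookupWholePhrase(String text) {
        String phrase = text.trim();
        return List.of(SYNONYMS.getOrDefault(phrase, phrase));
    }

    public static void main(String[] args) {
        System.out.println(splitThenLookup("sailing vessel"));   // words stay unmatched
        System.out.println(lookupWholePhrase("sailing vessel")); // phrase is matched
    }
}
```

Once the query has been cut into `sailing` and `vessel`, neither word alone is an entry in the synonym table, so the match is lost.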
More information can be found in the Solr documentation.
Let’s create our custom analyzer for synonyms then:
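```java
// Package names below assume Neo4j 3.5.x and Lucene 7.x; adjust them for other versions.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.neo4j.graphdb.index.fulltext.AnalyzerProvider;
import org.neo4j.helpers.Service;

@Service.Implementation(AnalyzerProvider.class)
public class SynonymAnalyzer extends AnalyzerProvider {

    // Name used to reference this analyzer when creating the index
    public static final String ANALYZER_NAME = "synonym-custom";

    public SynonymAnalyzer() {
        super(ANALYZER_NAME);
    }

    @Override
    public Analyzer createAnalyzer() {
        try {
            String synFile = "synonyms.txt";
            Analyzer analyzer = CustomAnalyzer.builder()
                    .withTokenizer(StandardTokenizerFactory.class)
                    .addTokenFilter(StandardFilterFactory.class)
                    // Synonyms are applied before lowercasing
                    .addTokenFilter(SynonymFilterFactory.class, "synonyms", synFile)
                    .addTokenFilter(LowerCaseFilterFactory.class)
                    .build();
            return analyzer;
        } catch (Exception e) {
            throw new RuntimeException("Unable to create analyzer", e);
        }
    }

    @Override
    public String description() {
        return "The default, standard analyzer with a synonyms file. This is an example analyzer for educational purposes.";
    }
}
```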
A very important note is that the `LowerCaseFilter` comes after the `SynonymFilter`. The order matters in some use cases, for example with the following list:

```
GB,gigabyte
```
If the lowercase filter is applied before synonyms, then the tokens will not match.
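Here is a small plain-Java sketch of that ordering effect. It only imitates two filter stages with a case-sensitive lookup and does not use Lucene; the class, the methods and the one-entry table are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: imitates two filter stages, does not use Lucene.
final class FilterOrderSketch {

    // Hypothetical case-sensitive table standing in for the entry "GB,gigabyte".
    static final Map<String, String> SYNONYMS = Map.of("GB", "gigabyte");

    // Synonym stage: keeps each token and adds its synonym on an exact match.
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            String synonym = SYNONYMS.get(token); // exact, case-sensitive match
            if (synonym != null) {
                out.add(synonym);
            }
        }
        return out;
    }

    // Lowercase stage.
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("16", "GB");
        System.out.println(lowercase(expand(tokens))); // synonyms first: "gigabyte" is added
        System.out.println(expand(lowercase(tokens))); // lowercase first: "gb" no longer matches "GB"
    }
}
```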
Create a `synonyms.txt` file with your synonyms list in the `conf/` directory of your Neo4j instance:
`conf/synonyms.txt`

```
coffee,latte macchiato,espresso,ristretto
boat,yacht,sailing vessel,ship
fts,full text search, fulltext search
```
Build your analyzer jar, put it in the `plugins` directory of Neo4j and restart the database if needed.
## Create the index
```cypher
CALL db.index.fulltext.createNodeIndex('syndemo', ['Article'], ['text'], {analyzer: 'synonym-custom'})
```
Create an `Article` node with some text:

```cypher
CREATE (n:Article {text: "This is an article about Full Text Search and Neo4j, let's go !"})
```
Query the index:

```cypher
CALL db.index.fulltext.queryNodes('syndemo', 'fts')
```

```
╒══════════════════════════════════════════════════════════════════════╤══════════════════╕
│"node"                                                                │"score"           │
╞══════════════════════════════════════════════════════════════════════╪══════════════════╡
│{"text":"This is an article about Full Text Search and Neo4j, let's go│1.2616268396377563│
│ !"}                                                                  │                  │
└──────────────────────────────────────────────────────────────────────┴──────────────────┘
```
Similarly, a search for `fulltext` will return the result as well. But let’s get fancy, err... fuzzy!
```
CALL db.index.fulltext.queryNodes('syndemo', 'fullt*')

No results, no records
```
## Prefix and synonyms?
There is one limitation: prefix, fuzzy and similar queries do not use the analyzer; they produce term or multi-term queries instead.
But there is a trick you can use: add an `NGramFilter` to your analyzer and use a phrase query, so that `fts` and its synonyms will have their n-grams tokenized, stored in the index and retrieved from it:
```java
Analyzer analyzer = CustomAnalyzer.builder()
        // ...
        .addTokenFilter(NGramFilterFactory.class, "minGramSize", "3", "maxGramSize", "5")
        .build();
return analyzer;
```
The `NGramTokenFilter` will tokenize the inputs into n-grams of the given sizes, here min 3 and max 5. So for the following input:

```
fulltext search
```

the index will contain n-grams such as `ful`, `full`, `fullt`, `ull`, `ullt`, `ullte`, `lte`, `ltex` and `ltext`, among others.
You can also use the `EdgeNGramFilter`, which will produce n-grams only from the beginning of the token. For the same example as above, the n-grams for `fulltext` will be `ful`, `full` and `fullt`.
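If you want to see what these two flavours of n-grams look like for a single token, the following plain-Java sketch generates them by hand. It does not use Lucene and only approximates what the filters emit; the class and method names are hypothetical, and the sizes 3 and 5 are the ones used in the text above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the kind of output an n-gram token filter produces.
// It does not use Lucene; class and method names are hypothetical.
final class NGramSketch {

    // All n-grams of the token with lengths in [min, max], grouped by start position.
    static List<String> nGrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = min; len <= max && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    // Only the n-grams anchored at the beginning of the token.
    static List<String> edgeNGrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int len = min; len <= max && len <= token.length(); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("fulltext", 3, 5));
        System.out.println(edgeNGrams("fulltext", 3, 5));
    }
}
```

The edge variant keeps the index much smaller, at the cost of only matching from the start of a token, which is usually enough for prefix-style searches.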
Re-deploy your plugin, restart the database, drop and recreate the index, and now:
```cypher
CALL db.index.fulltext.queryNodes('syndemo', '"fullt*"')
```

```
╒══════════════════════════════════════════════════════════════════════╤═══════════════════╕
│"node"                                                                │"score"            │
╞══════════════════════════════════════════════════════════════════════╪═══════════════════╡
│{"text":"This is an article about Full Text Search and Neo4j, let's go│0.04872262850403786│
│ !"}                                                                  │                   │
└──────────────────────────────────────────────────────────────────────┴───────────────────┘
```
To finish, let’s try another phrase query:
```cypher
CALL db.index.fulltext.queryNodes('syndemo', '"article fullte*"~2')
```

```
╒══════════════════════════════════════════════════════════════════════╤══════════════════╕
│"node"                                                                │"score"           │
╞══════════════════════════════════════════════════════════════════════╪══════════════════╡
│{"text":"This is an article about Full Text Search and Neo4j, let's go│2.3429081439971924│
│ !"}                                                                  │                  │
└──────────────────────────────────────────────────────────────────────┴──────────────────┘
```
## Conclusion
Synonyms are a valuable asset when building search engines, offering a better recall and thus a better user experience.
GraphAware specializes in relevance engineering, be it for search or recommender systems. Don’t hesitate to get in touch with us if you need help!