NODES2022 – Keyword Disambiguation Using Transformers and Clustering to Build Cleaner Knowledge

Federica Ventruto and Alessia Melania Lonoce are Junior Data Scientists at GraphAware who spoke at NODES2022. Natural language processing is an indispensable toolkit for building knowledge graphs from unstructured data. However, it comes at a price. Keywords and entities in unstructured text are ambiguous: the same concept can be expressed in many different linguistic variations. The resulting knowledge graph is thus polluted with many duplicate, unorganized nodes that all represent the same entity. In this session, we show how semantic similarity based on transformer embeddings, combined with agglomerative clustering, can help disambiguate keywords in the domain of academic disciplines and research fields, and how Neo4j improves the browsing experience of the resulting knowledge graph.
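
To make the idea concrete, here is a minimal sketch of grouping keyword variants by embedding similarity with agglomerative clustering. It is not the speakers' implementation. The keywords, the three-dimensional vectors (toy stand-ins for real transformer embeddings), the average-linkage choice, the `0.2` threshold, and the function names are all assumptions made for illustration.

```python
import math
from itertools import combinations

# Hypothetical keyword variants and toy 3-d vectors standing in for
# transformer embeddings (real embeddings have hundreds of dimensions).
keywords = ["machine learning", "ML", "Machine-Learning",
            "organic chemistry", "org. chemistry"]
embeddings = [
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.05],
    [0.92, 0.08, 0.02],
    [0.10, 0.90, 0.10],
    [0.12, 0.88, 0.05],
]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage(vectors, cluster_a, cluster_b):
    """Mean pairwise cosine distance between members of two clusters."""
    total = sum(cosine_distance(vectors[i], vectors[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_clusters(vectors, threshold):
    """Bottom-up clustering: start with one cluster per vector and keep
    merging the closest pair until that pair is farther than `threshold`."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            dist = average_linkage(vectors, clusters[i], clusters[j])
            if best is None or dist < best[0]:
                best = (dist, i, j)
        dist, i, j = best
        if dist > threshold:
            break
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


clusters = agglomerative_clusters(embeddings, threshold=0.2)
groups = [[keywords[i] for i in cluster] for cluster in clusters]
for group in groups:
    print(group)
```

With these toy vectors, the machine learning variants end up in one group and the chemistry variants in another. In a knowledge graph, each group could then be collapsed into a single canonical node. In practice the vectors would come from a transformer model, and a library implementation of agglomerative clustering would likely replace the hand-written loop, which is kept here only so the sketch has no dependencies.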