A Knowledge Graph-based Perspective on Named Entity Disambiguation in the Healthcare Domain


Knowledge Graphs (KGs) have become the backbone of multiple applications, including search engines, chatbots, and question answering tools, where interactivity plays a crucial role.

For critical domains, such as healthcare, the goal is to develop Intelligent Advisory Systems (IASs) that can support the activities and inform the decisions of multiple stakeholders. The development of IASs requires performing intensive natural language processing tasks, including Named Entity Disambiguation (NED). The main goal of NED is to map a contiguous span of text representing a named entity, such as “Type 2 diabetes”, to a ground truth entity in a given knowledge base, such as “Type 2 Diabetes Mellitus (CUI C0011860)” in the Unified Medical Language System (UMLS). This article aims to show how KGs can support the NED task from different perspectives to create IASs that are valuable in the healthcare domain.

Intelligent Advisory Systems

Intelligent Advisory Systems (IASs) have become fundamental to support the activities of diverse stakeholders in critical areas, such as the healthcare field. These systems must deliver enriched, personalised, and precise information to clinicians, researchers, insurance companies, patients, and governments, who generally have distinct goals and responsibilities. For instance, pharmaceutical companies develop and provide medications prescribed by physicians to treat patients. Employers can offer health insurance coverage with varying deductibles to their employees. Governments may create policies to subsidise services implemented by the social care system for elderly, disabled, and needy patients.

One of the critical attributes of an IAS is interactivity, which is the practical ability to exchange information with humans in a fruitful way. The critical features that enable this exchange include:

  1. the capacity of the system to detect meaningful entities in natural language;
  2. the definition of an informational context to ground the interaction with the user.

For instance, to support the medical practitioner in reaching the correct diagnosis, a system should be able to recognize specific symptoms mentioned by the patients. It should provide additional background knowledge on the diseases that can be related to these symptoms and consider factors such as genetics, environment, and the compliance of the diagnosis with specific protocols.

Knowledge graphs and graph-based ecosystems

To build an IAS, we require specific elements. In particular, we need access to reference knowledge bases focused on the application domain to detect meaningful entities. In the case of healthcare, we also need to create the conditions to add, incorporate, and organise information extracted from multiple sources.

The Knowledge Graph (KG) represents a flexible solution to meet these requirements. On one hand, it can function as reference knowledge, in which nodes represent unambiguous real-world entities and edges define meaningful connections between them. On the other hand, it can set the background context to drive, for instance, the interaction flow, integrating contents from multiple sources in a flexible way. To fully exploit the power of a KG, we also need a graph-powered technological ecosystem that can manipulate all the KG features. The Hume platform developed by GraphAware can ingest structured and unstructured contents from multiple data sources through an advanced orchestration mechanism, and it allows you to visualise the KG, interact with it, and perform graph-based analysis to detect specific patterns.

From entity recognition to disambiguation

As we previously mentioned, reference knowledge bases play a critical role in collecting a structured representation of entities belonging to a specific domain. However, in dealing with natural language, we need to identify an approach to link specific mentions within the text to the entities in the reference knowledge.

The entity-linking phase is composed of two main subtasks. The first subtask is Named Entity Recognition (NER), and the second is Named Entity Disambiguation (NED). The goal of NER is to detect the mentions of specific named entities in the text. To better understand this process, consider the following sentence:

  • “Glitazones may be used for treating this type of diabetes that is mainly related to lifestyle.”

The NER component must identify “Glitazones” and “diabetes” as surface forms, or mentions, that refer to named entities of type “chemical substance” and “disease”, respectively. However, in many cases, recognizing named entities is not enough because expressions such as “diabetes” can refer to different forms of the disease, including Type 1 Diabetes (T1D), Type 2 Diabetes (T2D), and Gestational diabetes. For this reason, using only NER, we cannot distinguish the correct form of “diabetes” among all these possibilities.

The goal of NED is to remove the uncertainty about the meaning of “diabetes” by examining the context of the mention and by connecting the mention to an unambiguous entity within the knowledge base. In general, a NED system includes two main phases. The first phase is the selection of potential candidates. In our example, multiple candidates can be detected and associated with the mention of diabetes, including Type 1 Diabetes (T1D), Type 2 Diabetes (T2D), and Gestational diabetes. The second phase is the ranking of the detected candidates. In this example, based on the context, T2D achieves the best score, and the system recognizes it as the target entity of the NED task. The reason is that keywords such as “glitazones” and “lifestyle” commonly refer to T2D, whereas keywords such as “genetic” and “pregnancy” could be signals for T1D and the gestational form of diabetes, respectively. Figure 1 summarises the full NED process.
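The two phases can be illustrated with a minimal Python sketch. It is not the model discussed in this article: the toy knowledge base, the entity ids, and the keyword lists are assumptions made for illustration, and keyword overlap stands in for a learned ranking model.

```python
# Hypothetical toy knowledge base: entity id -> aliases and context keywords.
# The ids and keyword lists are illustrative and are not taken from UMLS.
KNOWLEDGE_BASE = {
    "T1D": {"aliases": {"diabetes", "type 1 diabetes"},
            "keywords": {"genetic", "autoimmune"}},
    "T2D": {"aliases": {"diabetes", "type 2 diabetes"},
            "keywords": {"glitazones", "lifestyle"}},
    "GDM": {"aliases": {"diabetes", "gestational diabetes"},
            "keywords": {"pregnancy", "gestational"}},
}


def select_candidates(mention):
    """Phase 1: return every entity whose aliases contain the mention."""
    surface = mention.lower()
    return [entity_id for entity_id, entry in KNOWLEDGE_BASE.items()
            if surface in entry["aliases"]]


def rank_candidates(candidates, context):
    """Phase 2: score each candidate by keyword overlap with the context."""
    tokens = {token.strip(".,;").lower() for token in context.split()}
    scored = [(entity_id, len(KNOWLEDGE_BASE[entity_id]["keywords"] & tokens))
              for entity_id in candidates]
    # Highest score first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


sentence = ("Glitazones may be used for treating this type of diabetes "
            "that is mainly related to lifestyle.")
candidates = select_candidates("diabetes")
ranking = rank_candidates(candidates, sentence)
target = ranking[0][0]  # best-scoring candidate becomes the target entity
```

In this sketch, “glitazones” and “lifestyle” both match the keywords of the T2D entry, so it is ranked first, while the other two candidates receive a score of zero.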

Figure 1 - Architecture of a NED System

NED - Limits of Pure Language-based Models

The recent literature in the field describes approaches that exploit pure language models for NED. The most modern methods leverage transformer-based architectures, such as BERT, to learn contextual information from a text to improve the performance of NED models. In particular, these approaches are based on a dual encoder [1][3] that includes a context encoder, used to learn representations of mentions in the text, and an entity encoder that exploits the entity’s structural information, including descriptions, semantic types, and broader-related concepts.

The current limitations of these approaches are related to the quality of the structural information associated with entities in the healthcare domain. Such structural information is coarse-grained and incomplete: Varma et al. [3] observed that over 65% of entities in UMLS are associated with just ten semantic types, which do not represent fine-grained disambiguation signals. In addition, they noticed that over 93% of entities in UMLS have no associated description.

This is one of the cases where “pure textual context may not be sufficient,” as more generally observed by Mulang’ et al. [2]. For all these reasons, the graph-based context provided by the KG is valuable for the NED task in scenarios where the quality of structural information is low.

NED - The Virtuous Circle of KGs

NED systems can benefit greatly from the adoption of KGs. In particular, we can state that the KG representation enables a virtuous circle around NED systems.

On the first side of this circle, KGs can empower NED systems, providing contextual information for entities from the graph structure. Knowledge bases such as UMLS can be processed to construct a graph, in which entities are connected with different types of relationships. Contextual information derived from these connections can be exploited by KG embedding techniques, which allow learning a vector representation (embedding) of the entities and relationships. To better understand how a NED system leverages these embeddings, consider the high-level architecture reported in Figure 2.
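To give a feeling for how KG embeddings capture graph connections, here is a minimal sketch in the style of TransE, a well-known translation-based embedding method. The article does not state which embedding technique is used, so TransE is an assumption. The triples and relation names are hypothetical, and the training step only pulls positive triples together: negative sampling and normalisation, which a real implementation needs, are omitted.

```python
import math
import random

random.seed(0)
DIM = 8  # embedding size, chosen arbitrarily for the sketch


def random_vector():
    return [random.uniform(-0.5, 0.5) for _ in range(DIM)]


# Hypothetical triples; the relation names are illustrative, not UMLS labels.
triples = [
    ("Thiazolidinediones", "may_treat", "T2D"),
    ("Insulin resistance", "associated_with", "T2D"),
    ("Pregnancy", "associated_with", "Gestational diabetes"),
]

entities = {name: random_vector() for h, _, t in triples for name in (h, t)}
relations = {r: random_vector() for _, r, _ in triples}


def distance(head, rel, tail):
    """TransE-style distance ||h + r - t||; smaller means more plausible."""
    h, r, t = entities[head], relations[rel], entities[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def train_step(head, rel, tail, lr=0.05):
    """One gradient step on 0.5 * ||h + r - t||^2 for a positive triple."""
    h, r, t = entities[head], relations[rel], entities[tail]
    residual = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    for i, d in enumerate(residual):
        h[i] -= lr * d  # move head and relation towards the tail
        r[i] -= lr * d
        t[i] += lr * d  # move tail towards head + relation


before = {triple: distance(*triple) for triple in triples}
for _ in range(50):
    for triple in triples:
        train_step(*triple)
after = {triple: distance(*triple) for triple in triples}
```

After training, the entries of `entities` are the node vectors that a ranking model could consume. Entities that share neighbours, such as the two nodes connected to T2D, are shaped by the same constraints, which is the kind of contextual signal the graph structure provides.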

Figure 2 - Model for computing word and KG embeddings for the ranking phase

As mentioned in the previous section, the modern approaches leverage a context encoder, which is used to learn representations of mentions in the text, and an entity encoder to learn the representation of the entities leveraging the metadata, including their semantic types and the description.

The left side of the architecture (1) describes an approach based on a Bidirectional Long Short-Term Memory (Bi-LSTM) network to learn contextual word embeddings. This step aims to assign a vector representation to the mention of “diabetes” in the sentence “Glitazones may be used for treating this type of diabetes that is mainly related to lifestyle”. The right side of the figure (2) shows that the entity representation is based on the embeddings learned from the graph structure. This representation can compensate for coarse-grained or missing structural information because it relies entirely on the connections between the entities within the KG.

The last phase (3) combines the mention and entity representations through a vector concatenation step and uses the concatenated vector to train a binary classification model built using a feed-forward neural network. Positive training samples are created by concatenating the word embedding of the mention and the node embedding of the target entity. In our running example, a positive sample includes the T2D entity. Negative samples are created by concatenating the representation of the mention with the node embeddings of the remaining candidate entities, which in our running example are the T1D and Gestational Diabetes entities.
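The construction of positive and negative samples can be sketched as follows. The vectors are made-up, low-dimensional placeholders for the Bi-LSTM mention embedding and the KG node embeddings, and `build_samples` is a hypothetical helper, not part of any library.

```python
# Hypothetical placeholder vectors; real values would come from the
# trained context encoder and from the KG embedding model.
mention_embedding = [0.2, -0.1, 0.7]
entity_embeddings = {
    "T2D": [0.9, 0.1],
    "T1D": [-0.3, 0.5],
    "Gestational diabetes": [0.0, -0.8],
}


def build_samples(mention_vec, entity_vecs, target):
    """Return (features, label) pairs: label 1 for the target, 0 otherwise."""
    samples = []
    for entity, entity_vec in entity_vecs.items():
        features = mention_vec + entity_vec  # list concatenation
        label = 1 if entity == target else 0
        samples.append((features, label))
    return samples


samples = build_samples(mention_embedding, entity_embeddings, target="T2D")
```

Each feature vector has the length of the mention embedding plus the length of the node embedding, and it is this concatenated vector that the feed-forward classifier would receive as input.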

On the other side of the virtuous circle, the output of the NED systems can be published into a new KG for effectively building our IAS. From this perspective, the target (or the candidate) entities can be used to create new KGs, and, by incorporating external ontologies, such entities become the entry point for exploring and discovering more tailored information. By integrating information from different sources, we create the conditions to build non-trivial connections between natural language contents. For example, a clinical note mentioning specific symptoms can be connected to scientific articles about the most common diseases that cause those symptoms. Finally, when we represent the result of a NED system in a KG, we can visualise all the processed contents in a unique representation.

Implementing the KG Virtuous Circle using Hume

We can easily implement the KG virtuous circle using GraphAware Hume and one of its features called Orchestra, which offers an intuitive user interface to decompose a complex workflow into a linear structure of distinct elements. Each component reads incoming information (structured or unstructured), processes the data, and produces an enriched version of the incoming data. With Orchestra, we can create a workflow for applying NED to natural language text. The NED model leveraging KG embeddings can be wrapped into an Orchestra component and used to disambiguate named entities detected by the NER component. Figure 3 shows an example of how to use an Orchestra workflow for processing unstructured text and publishing the results of such processing into Neo4j in graph form.
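The idea of a linear workflow in which each component enriches the incoming message can be sketched in plain Python. This is only a conceptual illustration and does not use the Hume or Orchestra API: the component functions, the message layout, and the tiny lexicon are all assumptions.

```python
from functools import reduce


def ner_component(message):
    """Hypothetical NER step: tag known surface forms found in the text."""
    lexicon = {"glitazones": "chemical substance", "diabetes": "disease"}
    text = message["text"].lower()
    mentions = [{"surface": surface, "type": entity_type}
                for surface, entity_type in lexicon.items() if surface in text]
    return {**message, "mentions": mentions}


def ned_component(message):
    """Hypothetical NED step: attach a target id to each known mention."""
    targets = {"diabetes": "C0011860"}  # CUI from the article's example
    resolved = [{**mention, "target": targets.get(mention["surface"])}
                for mention in message["mentions"]]
    return {**message, "mentions": resolved}


def run_workflow(message, components):
    """Pass the message through each component in order."""
    return reduce(lambda msg, component: component(msg), components, message)


document = {"text": "Glitazones may be used for treating this type of "
                    "diabetes that is mainly related to lifestyle."}
result = run_workflow(document, [ner_component, ned_component])
```

Each step returns a new dictionary that keeps the original fields and adds its own, which mirrors the "enriched version of the incoming data" described above. Mentions without a known target keep `None`, and a final component could publish the result to a graph database.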

Figure 3 - Hume Orchestra workflow for NED

Highlighted components 1 and 2 represent the NER and the NED stages, respectively. The NED output is a collection of UMLS ids, such as C0011860, that represent the candidate and target entities for the mentions detected by NER. Such UMLS ids by themselves are not helpful for the user who will visualise and explore the KG generated by the workflow. For this reason, we introduce the component labelled with the number 3, which represents a remote Neo4j instance storing a graph-based version of the complete UMLS database. Leveraging such external information, we can attach multiple properties to the C0011860 entity, such as a canonical name, multiple identifiers, and descriptions:

  • canonical name: Diabetes Mellitus, Non-Insulin-Dependent
  • MeSH_id: D003924
  • HPO_id: HP:0005978
  • HPO_desc: A type of diabetes mellitus initially characterised by insulin resistance and hyperinsulinemia and subsequently by glucose intolerance and hyperglycemia.

Adding this metadata has a twofold benefit. On one hand, the user can better understand the result of the NED phase using natural language information. On the other hand, the new identifiers related to different databases, such as MeSH (Medical Subject Headings) and HPO (Human Phenotype Ontology), are helpful for creating a bridge between the UMLS knowledge and the data located in these databases. As a main consequence, the result of the NED becomes an entry point for knowledge discovery and exploration.
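The enrichment step can be pictured as a lookup keyed by the UMLS id. In the sketch below, an in-memory dictionary stands in for the remote Neo4j instance, and the `enrich` helper and property names are hypothetical. The values are the ones listed above.

```python
# In-memory stand-in for the remote UMLS graph (assumption for the sketch).
UMLS_METADATA = {
    "C0011860": {
        "canonical_name": "Diabetes Mellitus, Non-Insulin-Dependent",
        "MeSH_id": "D003924",
        "HPO_id": "HP:0005978",
    }
}


def enrich(entity_ids, metadata):
    """Attach the known properties to each entity id; unknown ids stay bare."""
    return [{"cui": cui, **metadata.get(cui, {})} for cui in entity_ids]


enriched = enrich(["C0011860"], UMLS_METADATA)

# Identifiers pointing to other databases act as bridges for later queries.
cross_references = {
    entity["cui"]: {key: value for key, value in entity.items()
                    if key.endswith("_id")}
    for entity in enriched
}
```

The `cross_references` mapping makes the bridging role explicit: starting from a CUI, we obtain the keys needed to look up the same concept in MeSH and HPO.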

Let’s show the result of the NED process in the Hume Visualisation Canvas.

Figure 4 - Hume Visualisation of NED results

For each mention detected in the document, you have a collection of plausible candidates connected using the “HAS_CANDIDATE” relationship. The size of the arrow representing the relationship reflects the score associated with the candidate. Therefore, the user gets an overview of the model output and can better investigate wrong or biased results. The details of the target entity, identified by the relationship “HAS_TARGET_ENTITY”, are located in the node properties box on the right side of Figure 4. The user can explore the properties of each candidate detected by the model. As you can see from the “Glitazones” node in the canvas, the model fully exploits all the data available in the UMLS database, including all the possible names of a chemical substance. Therefore, the mention “Glitazones” is connected to the canonical entity identified by the main name “Thiazolidinediones”.
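The graph structure described here can be sketched as a list of edges produced from the scored candidates. The relationship names come from the article, while the scores, the tuple layout, and the `build_edges` helper are assumptions made for illustration.

```python
def build_edges(mention, scored_candidates):
    """Create one HAS_CANDIDATE edge per candidate and one target edge."""
    edges = [(mention, "HAS_CANDIDATE", entity, {"score": score})
             for entity, score in scored_candidates.items()]
    # The best-scoring candidate is also linked as the target entity.
    target = max(scored_candidates, key=scored_candidates.get)
    edges.append((mention, "HAS_TARGET_ENTITY", target, {}))
    return edges


# Hypothetical scores, as a ranking model might return them.
scores = {"T2D": 0.91, "T1D": 0.06, "Gestational diabetes": 0.03}
edges = build_edges("diabetes", scores)
```

Keeping the score as a relationship property is what would allow a visualisation layer to map it to the size of the arrow.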

Moreover, you can perform federated queries from the Hume Visualisation Canvas to get information from different sources, including other Neo4j instances, enriching the available data. We can explore the genes associated with the C0011860 target entity, retrieved from the Neo4j UMLS database, and diseases related to this entity from the Neo4j HPO database. We can easily implement this mechanism by creating two local actions in Hume. An example of such an action is shown in Figure 5.

Figure 5 - (Hume) Panel for defining local actions

From this window, you can see that the defined resource for this query is neo4j-remote-umls, which represents a Neo4j server located on another machine. The result of the defined local actions is depicted in Figure 6, which includes the genes and the rare syndromes connected to “Diabetes Mellitus, Non-Insulin-Dependent”. The properties box located on the right side of Figure 6 shows that the node “SLC2A4 gene” does not belong to the local database but has been retrieved from neo4j-remote-umls.

Figure 6 - Hume Visualisation of enriched data

This example shows how the KG virtuous circle can be directly implemented in Hume. Using Orchestra, we can wrap the NED component employing the UMLS KG embeddings for the disambiguation task. Therefore, the UMLS KG is directly employed to support the NED process. On the other hand, through the federated query mechanism, we can build a new KG from the results of the NED process that can be useful for exploring more tailored information. Moreover, we are able to achieve this result by shaping integrated information in a unique view and maintaining the original data within its source databases.

Conclusions

Knowledge Graphs (KGs) play a fundamental role in developing Intelligent Advisory Systems (IASs). KGs are able to support natural language processing techniques, enabling a virtuous circle around tasks such as Named Entity Disambiguation (NED). We can leverage KG embedding techniques to learn a vector representation of the entities. These vectors are used in the disambiguation phase to choose among the multiple candidates available for a specific mention. Moreover, the target entities detected by the NED model can be enriched by incorporating information from domain ontologies. Integrating multiple sources allows for building background information related to the target entities that enables interactivity, driving the interaction flow in IASs.

References

[1] Bhowmik, R., Stratos, K., & de Melo, G. (2021). Fast and effective biomedical entity linking using a dual encoder. arXiv preprint arXiv:2103.05028.

[2] Mulang’, I. O., Singh, K., Prabhu, C., Nadgeri, A., Hoffart, J., & Lehmann, J. (2020, October). Evaluating the impact of knowledge graph context on entity disambiguation models. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (pp. 2157-2160).

[3] Varma, M., Orr, L., Wu, S., Leszczynski, M., Ling, X., & Ré, C. (2021). Cross-domain data integration for named entity disambiguation in biomedical text. arXiv preprint arXiv:2110.08228.

Giuseppe Futia

Data Science | Neo4j certification

Dr. Giuseppe Futia holds a Ph.D. in Computer Engineering, where he explored techniques for building Knowledge Graphs. Over his career, which spans over a decade, he has gained experience in various areas, including research and software development. He has been leveraging his expertise in Graph Representation Learning to support various complex efforts and initiatives.