
A hybrid framework with large language models for rare disease phenotyping

Abstract

Purpose

Rare diseases pose significant challenges in diagnosis and treatment due to their low prevalence and heterogeneous clinical presentations. Unstructured clinical notes contain valuable information for identifying rare diseases, but manual curation is time-consuming and prone to subjectivity. This study aims to develop a hybrid approach combining dictionary-based natural language processing (NLP) tools with large language models (LLMs) to improve rare disease identification from unstructured clinical reports.

Methods

We propose a novel hybrid framework that integrates the Orphanet Rare Disease Ontology (ORDO) and the Unified Medical Language System (UMLS) to create a comprehensive rare disease vocabulary. SemEHR, a dictionary-based NLP tool, is employed to extract rare disease mentions from clinical notes. To refine the results and improve accuracy, we leverage various LLMs, including LLaMA3, Phi3-mini, and domain-specific models like OpenBioLLM and BioMistral. Different prompting strategies, such as zero-shot, few-shot, and knowledge-augmented generation, are explored to optimize the LLMs’ performance.

Results

The proposed hybrid approach demonstrates superior performance compared to traditional NLP systems and standalone LLMs. LLaMA3 and Phi3-mini achieve the highest F1 scores in rare disease identification. Few-shot prompting with 1-3 examples yields the best results, while knowledge-augmented generation shows limited improvement. Notably, the approach uncovers a significant number of potential rare disease cases not documented in structured diagnostic records, highlighting its ability to identify previously unrecognized patients.

Conclusion

The hybrid approach combining dictionary-based NLP tools with LLMs shows great promise for improving rare disease identification from unstructured clinical reports. By leveraging the strengths of both techniques, the method demonstrates superior performance and the potential to uncover hidden rare disease cases. Further research is needed to address limitations related to ontology mapping and overlapping case identification, and to integrate the approach into clinical practice for early diagnosis and improved patient outcomes.


Introduction

Rare diseases, defined as those affecting fewer than 200,000 individuals in the United States and fewer than 1 in 2,000 people in Europe, pose significant challenges to patients, healthcare systems, and society at large [1]. These conditions, often chronic and life-threatening, collectively impact millions of people worldwide, with an estimated 300 million individuals living with a rare disease globally. The low prevalence and diverse clinical manifestations of rare diseases create substantial hurdles in diagnosis, treatment, and research efforts [2].

The journey to diagnosis for rare disease patients is often long and arduous, with many experiencing a “diagnostic odyssey” that can last years or even decades [3]. This delay in diagnosis has profound consequences for patients, including inappropriate treatments, unnecessary medical procedures, and missed opportunities for early intervention [4]. The emotional and psychological toll on patients and their families is immense, as they struggle with uncertainty, isolation, and the challenges of navigating complex healthcare systems ill-equipped to address their unique needs [5, 6].

Underdiagnosis of rare diseases not only impacts individual patients but also places a significant burden on healthcare systems [7]. Misdiagnoses and delayed diagnoses lead to increased healthcare utilization, with patients often consulting multiple specialists and undergoing numerous tests before receiving an accurate diagnosis. This inefficiency strains healthcare resources and contributes to escalating costs. Moreover, the absence of timely and accurate diagnoses hinders the development of targeted therapies and limits patients’ access to appropriate care and support services [3, 8].

The challenges posed by rare diseases extend beyond clinical care to research and drug development. The scarcity of patients with a positive diagnosis for a given rare disease complicates the conduct of clinical trials and the collection of robust epidemiological data [7]. This, in turn, hampers efforts to understand disease mechanisms, identify potential therapeutic targets, and develop effective treatments. The underdiagnosis of rare diseases further exacerbates this issue by limiting the pool of identified patients who could participate in research studies or clinical trials [3].

In light of these challenges, there is an urgent need for innovative approaches to improve rare disease identification and diagnosis [9, 10]. Unstructured clinical text data, such as clinical notes, offers a rich source of information for rare disease identification, but manual curation is laborious and subjective. Automated natural language processing (NLP) tools that can effectively extract symptoms or diagnoses from unstructured clinical text data play a crucial role in improving rare disease patient diagnosis, treatment, and research. These tools have the potential to revolutionize the field by enabling large-scale rare disease identification, facilitating better medical outcomes for these vulnerable patients in the healthcare system.

Traditional approaches to disease identification often rely on manual curation by domain experts, which is time-consuming, labor-intensive, and subject to human biases. To address these challenges, there has been a growing interest in developing computational methods for automated identification. Dictionary-based NLP tools have been widely used to extract structured information from unstructured clinical narratives, such as electronic health records (EHRs) and scientific literature [11, 12]. These tools leverage pre-defined rules and dictionaries to identify and normalize phenotypic concepts. However, dictionary-based systems often struggle with the complexity and variability of natural language, leading to suboptimal performance in capturing the nuances of rare disease phenotypes.

Recent advancements in large language models (LLMs), such as the GPT [13] and LLaMA [14] series, have revolutionized the field of NLP. These models, pretrained on vast amounts of text data, have demonstrated remarkable capabilities in understanding and generating human-like language. By leveraging the knowledge embedded in LLMs, researchers can potentially enhance the performance of dictionary-based NLP tools for rare disease identification. However, LLMs are susceptible to biases present in their training data and can struggle with factual accuracy, particularly in specialized domains like medicine. Furthermore, their “black box” nature makes it difficult to understand their reasoning and identify potential errors [15].

To address these limitations, we propose a novel hybrid approach that combines the strengths of ontology- and dictionary-based NLP tools with the capabilities of LLMs. This approach leverages the interpretability and control of dictionary-based systems to guide the LLM’s analysis, potentially improving its accuracy and focus in identifying rare diseases within unstructured clinical data. Our main contributions are as follows:

  • We propose a novel hybrid framework that integrates ontology and dictionary-based NLP tools with fine-tuned LLMs to enhance the accuracy of rare disease identification from clinical notes. This approach leverages the strengths of both techniques: the high recall of dictionary-based systems guided by a comprehensive vocabulary derived from ORDO/UMLS and the contextual understanding of LLMs.

  • To optimize contextual reasoning within the hybrid framework, we conduct extensive experiments with diverse LLMs, exploring various prompt methods (zero-shot, few-shot, knowledge-augmented generation) and context lengths. These experiments provide valuable insights into the impact of these factors on rare disease identification accuracy.

  • We further apply our methods to large-scale real-world patient notes. Our analysis reveals a substantial number of potential rare disease cases that are not currently documented in structured diagnostic records. This finding highlights the immense potential of our method for uncovering hidden rare disease cases, facilitating early diagnosis, and ultimately improving patient outcomes and treatment development.

Related work

Rare disease identification as text phenotyping with ontologies

Cohort identification, the process of identifying cases of disease from clinical records [16], is a crucial task in healthcare research and clinical practice. This task is typically accomplished through the use of clinical codes, such as the International Classification of Diseases (ICD), or by analyzing unstructured data, such as clinical notes. When free text clinical notes are used as the primary source for cohort identification, the task is referred to as text phenotyping. Text phenotyping involves extracting relevant information about patient phenotypes, including symptoms, signs, and diagnoses, from the unstructured text data.

Ontologies play a pivotal role in rare disease identification, as they provide structured and standardized information on rare diseases and their associated phenotypes. The key ontologies used in this domain include the Human Phenotype Ontology (HPO) [17], Orphanet [18], and the Online Mendelian Inheritance in Man (OMIM) [19]. HPO is a standardized vocabulary of phenotypic abnormalities encountered in human disease, enabling the consistent description of phenotypic information across different databases and applications. Orphanet is a comprehensive resource for information on rare diseases and orphan drugs, providing a nomenclature and classification of rare diseases. OMIM is a compendium of human genes and genetic disorders, offering detailed information on the molecular basis of inherited diseases. These specialized ontologies are often linked to more general ontologies, such as the Unified Medical Language System (UMLS) [20] and the International Classification of Diseases, 10th Revision (ICD-10) [21]. UMLS is a comprehensive collection of biomedical vocabularies and standards, facilitating the integration of information from various sources. ICD-10 is a widely used system for classifying diseases, injuries, and health conditions, providing a standardized coding scheme for clinical and research purposes.

The integration of these ontologies allows for a more comprehensive and accurate representation of rare diseases and their associated phenotypes. By leveraging the structured information provided by these ontologies, researchers and clinicians can improve the efficiency and accuracy of rare disease identification from clinical texts.

Natural language processing for text phenotyping

Dictionary-based NLP tools, such as cTAKES [22], SemEHR [11], and MedCAT [23], have been widely adopted to extract information from clinical narratives, including electronic health records (EHRs) and scientific literature. These tools utilize predefined rules and dictionaries to identify and normalize phenotypic concepts, transforming raw textual data into a structured format for further analysis. By leveraging extensive vocabularies and ontologies, such as UMLS and HPO, these tools can recognize a wide range of medical terms, abbreviations, and synonyms, facilitating accurate concept identification within the text.

However, dictionary-based NLP tools often struggle with the variability and ambiguity inherent in natural language, as the semantic meaning of phrases and sentences can be highly context-dependent. For instance, the phrase “cold” can refer to a viral infection, a sensation of low temperature, or a personality trait, depending on the context. Dictionary-based tools, which primarily rely on predefined rules and dictionaries, may lack the ability to fully capture and disambiguate such semantic nuances.

Moreover, the performance of dictionary-based NLP tools can be limited by the comprehensiveness and quality of the underlying dictionaries and ontologies. While resources like UMLS and HPO provide extensive coverage of medical concepts, they may not always include the most up-to-date terminology or capture the full spectrum of rare disease phenotypes, leading to the omission of important information or misclassification of concepts, particularly in the context of rare diseases where the associated phenotypes may be atypical or poorly characterized.

Another challenge faced by dictionary-based NLP tools is the handling of negation and uncertainty, as clinical narratives often contain negated or hypothetical statements. These expressions can significantly alter the meaning of the associated concepts and require careful consideration during information extraction. Dictionary-based tools may struggle to accurately identify and interpret such negation and uncertainty, potentially leading to errors in the extracted information.

To address these limitations, researchers have explored various approaches to enhance the performance of dictionary-based NLP tools, including the incorporation of machine learning techniques. Machine learning and deep learning models have emerged as powerful tools for automating phenotyping from EHRs and other clinical data sources, with the ability to learn complex patterns and relationships from large volumes of data, enabling them to identify and extract relevant phenotypic information with minimal human intervention.

Traditional machine learning algorithms, such as logistic regression, support vector machines (SVMs), and random forests, have been successfully applied to phenotyping tasks [24, 25]. These models can learn from labeled training data to classify patients based on the presence or absence of specific phenotypes, and have shown promising results in predicting the risk of developing certain conditions based on a combination of clinical and demographic features.

In recent years, deep learning models, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have gained significant attention for their ability to learn hierarchical representations from complex and unstructured data [26]. CNNs are well-suited for processing grid-like data, such as medical images or two-dimensional representations of clinical notes, while RNNs are designed to handle sequential data, such as time series or natural language. These models have been applied to various phenotyping tasks, such as detecting abnormalities in medical images or identifying specific clinical events from EHR data, achieving high accuracy and outperforming traditional machine learning approaches [27].

More recently, pre-trained language models, such as BERT (Bidirectional Encoder Representations from Transformers), have shown remarkable performance in various natural language processing tasks, including phenotyping from unstructured clinical notes [28]. BERT-based models capture contextual information and long-range dependencies in text, allowing a more accurate understanding of the semantic meaning and relationships between clinical concepts, and have achieved state-of-the-art results in tasks such as named entity recognition, relation extraction, and document classification. These models can be adapted for Named Entity Recognition by adding an extra layer fine-tuned with token-level labels; an example is BioBERT-NER [29], fine-tuned on top of BioBERT [30]. A recent example of fine-tuning BERT for rare disease phenotyping is PhenoBERT [31], which matches text fragments to concepts in the HPO ontology and uses a hierarchy-aware CNN-based model to narrow the candidates from the large number of concepts in the ontology.

However, despite the progress made by machine learning and deep learning models in phenotyping, their performance on rare disease phenotyping remains limited by several factors, particularly the scarcity of labeled training data for rare diseases. Due to the low prevalence of rare diseases, obtaining a large enough dataset with reliable labels can be extremely difficult and costly, hindering the ability of models to generalize and capture the complex phenotypic expressions of rare diseases.

Recently, large language models (LLMs), such as the GPT [13] and LLaMA [14] series, have revolutionized the field of natural language processing with their impressive performance across a wide range of tasks, including named entity recognition, relation extraction, and question answering [30, 32]. These models, pre-trained on vast amounts of textual data, have the ability to capture complex linguistic patterns and generate human-like responses, making them a promising tool for various applications in the biomedical domain.

LLMs like GPT or LLaMA can be prompted for named entity recognition and concept normalization, although this usually requires fine-tuning or instruction-tuning to work well. One study [33] shows that the ChatGPT-3.5 model cannot match a BERT-based approach for rare disease phenotype extraction. Another work [34] explored fine-tuning GPT and LLaMA models for phenotype concept recognition against the HPO ontology, using data annotations created with an NER tool and verified by manual experts; the results show performance comparable to BERT-based fine-tuning. The study [35] uses concept names, identifiers, and synonyms in the HPO ontology to fine-tune LLaMA 2, greatly enhancing concept normalization performance compared to ChatGPT-3.5. Both works provide useful models (PhenoGPT [34] and PhenoHPO [35]) for normalizing rare disease concepts to the HPO ontology, but they use LLMs alone at inference. We compare both studies with our dictionary- and LLM-based hybrid approach for rare disease identification.

Two other related works use LLMs for textual data with rare diseases [36, 37]. The work [36] applied LLMs to a specific rare disease, introducing a zero-shot LLM-based method enriched by retrieval-augmented generation and MapReduce for identifying pulmonary hypertension (PH) from clinical notes. By combining the strengths of LLMs with retrieval-based methods and distributed computing techniques like MapReduce, they were able to accurately identify PH cases without the need for labeled training data. The other work [37] investigated prompting strategies for text identification using LLMs. By carefully crafting prompts that target specific rare diseases or phenotypic characteristics, the authors were able to elicit accurate and informative responses from the LLMs. However, their work focused on the four most frequent rare diseases in the MIMIC-IV dataset, rather than the full set of thousands of rare diseases defined in an ontology, highlighting the need for further investigation into the scalability and generalizability of these approaches.

Our work aims to combine the dictionary-based method and LLM for a comprehensive rare disease phenotyping approach. By leveraging the strengths of both methods, we seek to address the limitations of each individual approach and develop a more accurate and robust system for identifying rare diseases from clinical narratives. This hybrid approach has the potential to improve the efficiency and effectiveness of rare disease phenotyping, ultimately leading to better patient care and research outcomes.

Methods

Figure 1 summarizes our overall work.

Fig. 1

Overview of our work. In the given example report, some abbreviations are mistakenly extracted as rare diseases (e.g., PNA: pneumonia, SAR: subacute rehabilitation), and some negated mentions are extracted (e.g., "no signs of xxx"). We propose leveraging LLMs for enhanced contextual filtering, enabling more precise determinations of relevance and validity within the extracted information

Problem statement

Given a set of unstructured clinical notes \(\mathcal {D} = \{d_1, d_2, \ldots , d_N\}\), and a comprehensive rare disease ontology \(\mathcal {O}\) that defines a set of rare diseases \(\mathcal {R} = \{r_1, r_2, \ldots , r_K\}\) along with their associated phenotypic characteristics \(\mathcal {P} = \{p_1, p_2, \ldots , p_L\}\), the objective is to develop a hybrid approach that combines dictionary-based methods and LLMs to accurately identify and extract rare disease mentions and their corresponding phenotypic information from the clinical notes. Let \(f: \mathcal {D} \rightarrow \mathcal {R} \times \mathcal {P}\) be a function that maps each clinical note \(d_i\) to a set of rare disease phenotypes \(\{r_j\}\) or \(\{p_k\}\), where \(r_j \in \mathcal {R}\) and \(p_k \in \mathcal {P}\). The goal is to optimize the function f by leveraging the strengths of both dictionary-based methods and LLMs.
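The mapping \(f\) above can be sketched as a tiny Python stand-in that uses exact substring matching over a toy vocabulary. All names, types, and identifiers here are illustrative only; in the actual framework, the matching step is replaced by SemEHR extraction followed by LLM-based filtering.

```python
from dataclasses import dataclass

# Hypothetical minimal type: a note d_i is mapped to the set of
# rare-disease/phenotype concepts mentioned in it.
@dataclass(frozen=True)
class Mention:
    concept_id: str  # e.g., an ORDO or UMLS identifier (illustrative)
    span: str        # the surface text found in the note

def f(note: str, vocabulary: dict) -> set:
    """Toy stand-in for f: D -> R x P via case-insensitive substring match."""
    lowered = note.lower()
    return {
        Mention(concept_id, term)
        for term, concept_id in vocabulary.items()
        if term.lower() in lowered
    }

vocab = {"Huntington's disease": "ORPHA:399"}  # illustrative entry
print(f("Family history of Huntington's disease.", vocab))
```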

Rare disease terminology from ontology

To construct a comprehensive vocabulary of rare diseases, we leverage the Orphanet Rare Disease Ontology (ORDO) [18]. ORDO provides a structured nomenclature and ontological representation of rare disease concepts, serving as a valuable resource for standardizing and organizing information related to rare diseases. We extract all disease concepts and their associated synonyms from ORDO to form our initial rare disease term list. This process ensures that we capture a wide range of rare disease names and their variations, establishing a solid foundation for our vocabulary.

However, rare diseases often suffer from name variations and inconsistent terminology usage [38], which can hinder effective identification and extraction from clinical text. This challenge arises due to the complex nature of rare diseases, the evolving understanding of their underlying mechanisms, and the lack of consensus in naming conventions. As a result, the same rare disease may be referred to by multiple names or acronyms across different sources, making it difficult to develop a comprehensive and unified vocabulary.

To address this challenge and enhance the coverage of our rare disease vocabulary, we utilize the Unified Medical Language System (UMLS) as an intermediary dictionary [20]. UMLS is a metathesaurus that integrates numerous biomedical vocabularies, including standard disease terminologies such as SNOMED CT, ICD-10, and MeSH. By mapping ORDO disease concepts to their corresponding UMLS concept unique identifiers (CUIs), we leverage the extensive synonymy information available in UMLS to expand our rare disease term coverage.

The mapping process involves several steps. First, we use string matching techniques to identify potential UMLS concepts that align with the ORDO disease concepts. This initial mapping is then refined using semantic type information and hierarchical relationships within UMLS to ensure the accuracy and specificity of the mapped concepts. By incorporating the synonyms and alternate names associated with each mapped UMLS concept, we substantially increase the breadth and depth of our rare disease vocabulary.
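As a rough illustration of the first, string-matching step, the sketch below maps ORDO labels and synonyms to UMLS CUIs via normalized exact matching. The in-memory tables, identifiers, and helper names are hypothetical, and the subsequent refinement using semantic types and hierarchical relationships is omitted.

```python
import re

def normalize(term: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", term.lower())).strip()

def map_ordo_to_umls(ordo_terms: dict, umls_index: dict) -> dict:
    """Map each ORDO concept to a UMLS CUI by normalized exact string match.
    ordo_terms: ORDO id -> list of labels/synonyms;
    umls_index: normalized concept name -> CUI."""
    mapping = {}
    for ordo_id, names in ordo_terms.items():
        for name in names:
            cui = umls_index.get(normalize(name))
            if cui:
                mapping[ordo_id] = cui
                break  # first match wins in this simplified sketch
    return mapping

# Toy tables; identifiers are illustrative, not real ORDO/UMLS entries
ordo = {"ORPHA:0001": ["Example syndrome", "EX syndrome"]}
umls = {normalize("Example Syndrome"): "C0000001"}
print(map_ordo_to_umls(ordo, umls))  # prints {'ORPHA:0001': 'C0000001'}
```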

This comprehensive mapping process results in a final vocabulary consisting of 4,064 rare disease phenotype mappings from ORDO to UMLS. This expanded vocabulary not only includes the primary names of rare diseases but also encompasses a wide range of synonyms, acronyms, and alternate terms, thereby enhancing our ability to identify rare diseases mentioned in clinical text data. To further align our rare disease phenotypes with the diagnostic coding systems used in the dataset for this study, we map the identified phenotypes to their corresponding ICD-9 [39] and ICD-10 [21] codes. This additional mapping step is crucial because patients in the dataset are coded using either or both of these classification systems. By establishing a link between our rare disease vocabulary and the standard diagnostic codes, we facilitate the integration of our findings with existing clinical workflows and enable seamless comparison with structured diagnostic information.

The ICD-9 and ICD-10 mapping process involves leveraging the inherent relationships between UMLS concepts and these coding systems. Many UMLS concepts are already associated with their corresponding ICD codes, allowing for a direct mapping. In cases where a direct mapping is not available, we employ a combination of automated and manual techniques to establish the appropriate connections. This includes utilizing existing mapping resources, such as the UMLS ICD-9/10 mappings, as well as consulting domain experts to validate and refine the mappings for specific rare disease phenotypes.
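The direct CUI-to-ICD lookup with a fallback queue for manual review could look roughly like the following; the tables and codes are illustrative placeholders, not real UMLS mappings.

```python
def map_cui_to_icd(cuis: list, cui_to_icd9: dict, cui_to_icd10: dict):
    """Attach ICD-9/ICD-10 codes to each CUI via direct lookup; CUIs with
    no code in either table are flagged for manual/expert review."""
    mapped, needs_review = {}, []
    for cui in cuis:
        codes = {"icd9": cui_to_icd9.get(cui), "icd10": cui_to_icd10.get(cui)}
        if codes["icd9"] or codes["icd10"]:
            mapped[cui] = codes
        else:
            needs_review.append(cui)
    return mapped, needs_review

# Toy tables; CUIs and codes are illustrative only
mapped, review = map_cui_to_icd(
    ["C0000001", "C0000002"],
    {"C0000001": "277.5"},
    {"C0000001": "E76.1"},
)
print(mapped, review)
```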

Dictionary-based text phenotyping

To initiate the process of extracting relevant clinical entities from unstructured electronic health records (EHRs), we employ SemEHR [11], a state-of-the-art dictionary-based natural language processing (NLP) tool. SemEHR has demonstrated exceptional performance in extracting and normalizing a wide range of clinical concepts, including diseases, medications, and procedures, making it an ideal choice for our rare disease identification pipeline.

One of the key strengths of SemEHR lies in its ability to effectively handle the complexities and variability of clinical language. It employs advanced techniques such as named entity recognition (NER) and entity linking (EL) to accurately identify and extract relevant clinical information from unstructured text. NER focuses on recognizing and classifying named entities, such as diseases, drugs, and anatomical terms, within the clinical narrative.

Once the named entities are identified, SemEHR performs entity linking to map the extracted entities to standardized terminologies, such as the Unified Medical Language System (UMLS). This mapping process involves disambiguating the extracted entities and linking them to their corresponding concept unique identifiers (CUIs) in the UMLS metathesaurus. By leveraging the rich semantic network and hierarchical relationships within UMLS, SemEHR can normalize the extracted entities to a common representation, enabling consistent and standardized analysis across different EHR systems and data sources.

To ensure that the mentions extracted by SemEHR are relevant to rare diseases, we employ the rare disease concept mappings as a filtering mechanism. By cross-referencing the extracted mentions with the rare disease concept mappings derived in the previous step, we can effectively identify and retain only those entities that are pertinent to rare diseases.
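This filtering step reduces to a set-membership check on each mention's linked CUI. A minimal sketch, assuming SemEHR-like mention dictionaries and an illustrative rare-disease CUI set:

```python
def filter_rare_disease_mentions(mentions: list, rare_cuis: set) -> list:
    """Keep only extracted mentions whose linked CUI belongs to the
    ORDO-derived rare disease vocabulary."""
    return [m for m in mentions if m["cui"] in rare_cuis]

# Mentions in a SemEHR-like shape; CUIs shown for illustration only
mentions = [
    {"text": "pneumonia", "cui": "C0032285"},
    {"text": "Huntington disease", "cui": "C0020179"},
]
kept = filter_rare_disease_mentions(mentions, {"C0020179"})
print(kept)
```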

LLM-based text phenotyping

After obtaining the initial results from SemEHR, we observe that the system produces a significant number of false positive extractions. This issue can be attributed to two main factors. First, the dictionary-based approach often struggles with the appropriate extraction of clinical abbreviations. For instance, the abbreviation “PID” can refer to “Primary Immunodeficiency,” which is a rare disorder, but it can also be used to denote “Pelvic Inflammatory Disease,” a more common condition. Without considering the context in which the abbreviation appears, SemEHR may incorrectly identify it as a rare disease mention.

Second, SemEHR occasionally extracts rare disease mentions that are expressed in a hypothetical or negated context. For example, a clinical note might state, “the patient does not have Huntington’s disease,” indicating the absence of the condition. However, SemEHR may still identify “Huntington’s disease” as a positive mention, leading to a false positive extraction. These contextual nuances are challenging for dictionary-based systems to handle, as they primarily rely on pattern matching and lack the ability to understand the surrounding linguistic context.

To address these limitations and improve the accuracy of rare disease phenotype extraction, we propose exploring the use of large language models (LLMs) with contextual reasoning capabilities. LLMs such as LLaMA [14] have demonstrated remarkable performance in natural language understanding tasks, particularly in capturing the semantic meaning and contextual information within text. By leveraging the advanced linguistic knowledge and reasoning abilities of these models, we aim to reduce the number of false positive identifications and enhance the overall quality of rare disease phenotype extraction.

The task of contextual reasoning for rare disease mention classification can be formulated as a binary classification problem. Given a rare disease mention m extracted by SemEHR and its surrounding context information c from the clinical text T, the LLM aims to predict the label y:

$$\begin{aligned} y = f(m, c) \in \{0, 1\} \end{aligned}$$

where y represents the predicted label for the rare disease mention. A label of \(y = 1\) indicates a true positive, meaning that the mention refers to the presence of the rare disease in the given context. Conversely, a label of \(y = 0\) indicates a false positive, suggesting that the mention does not actually indicate the presence of the rare disease based on the contextual information.

The function f represents the LLM’s classification mechanism, which maps the mention m and its context c to a binary label. This function encapsulates the model’s ability to understand and reason about the linguistic context surrounding the mention, enabling it to make informed predictions about the mention’s validity. This contextual reasoning step is crucial for improving the accuracy of rare disease phenotype extraction. By analyzing the surrounding context c, the LLM can filter out false positives caused by ambiguities in clinical language, such as abbreviations, negations, and other complexities.
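One way to realize \(f(m, c)\) is to frame it as a yes/no question to an instruction-tuned LLM and parse the answer back to a binary label. The sketch below omits the actual model call (any chat-style client fits), and the prompt wording is an assumption for illustration, not the paper's exact prompt.

```python
def build_classification_prompt(mention: str, context: str) -> str:
    """Frame y = f(m, c) as a yes/no question for an instruction-tuned LLM."""
    return (
        "You are reviewing a clinical note. Decide whether the mention below "
        "indicates that the patient actually has the rare disease.\n"
        f"Mention: {mention}\n"
        f"Context: {context}\n"
        "Answer strictly 'yes' (true positive) or 'no' (false positive)."
    )

def parse_label(llm_answer: str) -> int:
    """Map the model's free-text answer back to the binary label y."""
    return 1 if llm_answer.strip().lower().startswith("yes") else 0

prompt = build_classification_prompt(
    "Huntington's disease",
    "the patient does not have Huntington's disease",
)
print(parse_label("No"))  # prints 0: a negated context is a false positive
```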

Experiments design

Dataset

For this study, we use real-world free-text EHR data from MIMIC-IV. Our study mainly focuses on the discharge summaries from this database, which comprise 331,794 reports from 145,915 patients [40]. Discharge summaries are a critical component of the electronic health record, as they provide a detailed recap of a patient’s hospital stay, including the reasons for admission, the treatments and procedures performed, the patient’s progress, and the final diagnosis and recommendations for ongoing care. The narrative structure and detailed clinical descriptions in discharge summaries can offer valuable contextual cues to identify rare disease patients. Figure 2 illustrates the highly skewed distribution of report lengths, with an average of 1,669 words, ranging from a minimum of 87 to a maximum of 29,684 words, which poses the challenge of long contextual understanding.

Fig. 2

Data distribution of MIMIC-IV’s discharge summary lengths
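The quoted summary statistics (minimum, maximum, and mean word counts) can be computed with a small helper. This toy sketch assumes reports are plain strings; the numbers reported in the text naturally require access to the full MIMIC-IV corpus.

```python
def length_stats(reports: list) -> dict:
    """Word-count summary of the kind reported for the discharge summaries."""
    counts = sorted(len(r.split()) for r in reports)
    return {
        "min": counts[0],
        "max": counts[-1],
        "mean": sum(counts) / len(counts),
    }

# Toy usage; the reported figures (min 87, mean 1,669, max 29,684)
# come from the full MIMIC-IV discharge summary set, not this example
print(length_stats(["short note", "a much longer discharge summary text"]))
```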

Data annotation

To evaluate the performance of our rare disease phenotype identification approach, we create a gold standard dataset by randomly selecting 200 discharge summaries from the MIMIC-IV database. Two domain experts are tasked with manually annotating each mention of a rare disease in these summaries, classifying them as either a true positive (indicating the patient indeed has the rare disease) or a false positive (the mention does not actually indicate the presence of the rare disease). To ensure consistency and accuracy in the annotation process, the annotators are provided with a detailed guideline that includes specific examples on how to handle various scenarios, such as hypothetical or negated mentions. This guideline helps to standardize the annotation process and minimize subjectivity.

Based on the annotations provided by the two experts, we calculate the inter-annotator agreement using Cohen’s kappa, a widely used statistical measure that assesses the level of agreement between two raters while accounting for chance agreement. The resulting Cohen’s kappa score of 0.77 indicates a substantial level of agreement between the annotators, suggesting that the annotations are reliable and consistent. To further enhance the quality of the gold standard dataset, any disagreements between the two annotators are resolved by a third annotator. This third annotator reviews the mentions where the initial annotators had differing opinions and makes a final decision on the classification of those mentions.
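Cohen's kappa is straightforward to compute from the two annotators' label lists. A pure-Python sketch on toy binary labels (not the study's actual annotations):

```python
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters over the same items:
    (p_o - p_e) / (1 - p_e), with p_o the observed agreement and
    p_e the agreement expected by chance."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (p_o - p_e) / (1 - p_e)

# Toy annotations (1 = true positive mention, 0 = false positive)
print(round(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]), 2))  # prints 0.5
```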

In total, the gold standard dataset comprises 362 rare disease related mentions identified from the 200 discharge summaries. The gold standard dataset serves as a valuable resource for assessing the accuracy of our contextual rare disease phenotype identification model. By comparing the model’s predictions against the expert-annotated labels, we can measure its performance in distinguishing true positive rare disease mentions from false positives in real-world clinical text.

Experiment setup

Baseline

Our methodology begins with the implementation of SemEHR as our baseline approach for extracting ontology-based rare diseases. We leverage SemEHR’s capability to identify and normalize clinical concepts using predefined dictionaries and rules, which provides an initial set of potential rare disease mentions from the clinical text.

To provide a comprehensive evaluation, we also compare our approach with several state-of-the-art models, categorized as follows:

BERT-based models:

As these models are not specifically fine-tuned for rare diseases, we employ a two-step process. First, we utilize them for medical Named Entity Recognition (NER), followed by a dictionary filtering step using the rare disease ontology described in the “Rare disease terminology from ontology” section. The models evaluated are:

  1. PhenoBERT [31]: A BERT-based model fine-tuned for disease phenotyping.

  2. BioBERT-NER [29]: Another BERT-based model fine-tuned for biomedical named entity recognition and disease normalization.
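The two-step process above — generic medical NER followed by dictionary filtering — can be sketched as below. The function names and the exact normalization are illustrative, and the vocabulary would be the ORDO/UMLS-derived rare disease terminology:

```python
def filter_rare_disease_mentions(ner_entities, rare_disease_vocab):
    """Second step of the two-step baseline: keep only NER-extracted
    entities whose normalized surface form appears in the rare disease
    vocabulary (built from ORDO/UMLS). A simple lowercase-match sketch."""
    vocab = {term.lower().strip() for term in rare_disease_vocab}
    return [e for e in ner_entities if e.lower().strip() in vocab]
```

A real pipeline would match on normalized concept identifiers (e.g., UMLS CUIs) rather than raw strings, but the filtering logic is the same.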

Large language model (LLM)-based approaches:

We also evaluate two advanced LLM-based models specifically designed for phenotype and rare disease recognition:

  1. PhenoGPT [34]: A fine-tuned model for phenotype recognition. As recommended, we utilize its LLaMA2-based version for this comparison.

  2. PhenoHPO [35]: A LLaMA2-based model fine-tuned for rare disease concept recognition and normalization.

LLM

Our proposed approach also includes leveraging large language models (LLMs) to perform further contextual reasoning and filter out negative mentions from SemEHR.

For LLM selection, we focus on models in the 8-billion-parameter range due to computational resource limitations, a constraint that mirrors many resource-constrained clinical environments. This choice ensures that our approach remains feasible and practical for real-world clinical settings. We choose three state-of-the-art LLMs: LLaMA3-8B, Mistral-7B, and Phi3-mini. These models have shown strong performance in various natural language processing tasks and provide a diverse set of architectures and training approaches. To compare the performance of general-domain LLMs and medically fine-tuned LLMs, we also select three medical LLMs: OpenBioLLM, BioMistral, and AlpaCare. These medical LLMs have been specifically fine-tuned on biomedical and clinical texts, and their inclusion allows us to assess the impact of domain-specific knowledge on the rare disease identification task. OpenBioLLM-8B builds upon the latest LLaMA3-8B model and incorporates a DPO dataset and fine-tuning recipe along with a custom, diverse medical instruction dataset; this fine-tuning adapts the general-purpose LLM to the biomedical domain, potentially improving its performance on rare disease identification. AlpaCare builds on LLaMA2-7B and is tuned on medical instructions, providing another perspective on the effectiveness of domain-specific fine-tuning. BioMistral utilizes Mistral-7B as its foundation model and is further pre-trained on PubMed Central, a large corpus of biomedical literature, which may enhance its ability to capture rare disease-related information.

To ensure robust and stable responses from the LLMs, we set the temperature parameter to 0 during inference. This setting reduces the variability in the generated outputs and promotes more deterministic behavior, which is desirable for the rare disease identification task.
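The effect of the temperature setting can be seen in a temperature-scaled softmax over next-token logits: as the temperature approaches zero, the probability mass collapses onto the argmax, making decoding effectively deterministic (inference engines typically special-case temperature 0 as greedy argmax). A small illustrative sketch:

```python
import math

def softmax_with_temperature(logits, t):
    """Temperature-scaled softmax over next-token logits. Low t sharpens
    the distribution toward the argmax; high t flattens it."""
    scaled = [l / t for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

At a near-zero temperature, essentially all probability mass sits on the highest-logit token, which is why the model’s answers become stable across repeated runs.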

In terms of prompt engineering, we experiment with both zero-shot and few-shot prompting, as well as knowledge-augmented generation. Zero-shot prompting involves providing the LLMs with a task description without any examples, relying on their inherent knowledge and understanding to generate appropriate responses. Few-shot prompting, on the other hand, includes a small number of exemplary rare disease mentions and their corresponding labels to guide the LLMs’ predictions. Knowledge-augmented generation involves incorporating additional domain-specific information, such as rare disease definitions or phenotypic characteristics, into the prompts to enhance the LLMs’ contextual understanding.
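The three prompting strategies differ only in what is packed into the prompt, which can be sketched with a single builder function. The wording and field names below are illustrative, not the paper’s exact template (the actual templates are in Additional file 1):

```python
def build_prompt(mention, context, examples=None, definition=None):
    """Assemble a filtering prompt for a candidate rare disease mention.
    `examples` (a list of {'context', 'label'} dicts) enables few-shot
    prompting; `definition` enables knowledge-augmented generation;
    with neither, this is a zero-shot prompt."""
    parts = [f"Decide whether the patient truly has the rare disease "
             f"'{mention}' given the clinical context. Answer Yes or No."]
    if definition:                       # knowledge-augmented generation
        parts.append(f"Definition of {mention}: {definition}")
    for ex in examples or []:            # few-shot exemplars
        parts.append(f"Context: {ex['context']}\nAnswer: {ex['label']}")
    parts.append(f"Context: {context}\nAnswer:")
    return "\n\n".join(parts)
```

Zero-shot calls pass only the mention and its context; few-shot adds labeled exemplars; KAG adds a UMLS-style definition.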

To investigate the impact of contextual information on the LLMs’ ability to accurately identify rare diseases, we vary the amount of surrounding text provided to the models. We start with the full discharge summary paragraph as the input context, allowing the LLMs to consider the entire narrative when making predictions. However, processing the full paragraph may be computationally expensive and potentially introduce noise. Therefore, we gradually reduce the context to shorter lengths, such as a few sentences or a fixed window size around the rare disease mention.
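The context reduction can be sketched as a fixed word window around the mention. This is a simple whitespace-based illustration of the windowing idea, not the study’s exact implementation:

```python
def context_window(text, mention, window=50):
    """Return up to `window` words on each side of the first occurrence of
    `mention`, falling back to the full text if the mention is not found
    verbatim. Punctuation adjacent to words is ignored when matching."""
    words = text.split()
    target = [w.lower() for w in mention.split()]
    for i in range(len(words) - len(target) + 1):
        if [w.lower().strip(".,;:") for w in words[i:i + len(target)]] == target:
            lo = max(0, i - window)
            hi = min(len(words), i + len(target) + window)
            return " ".join(words[lo:hi])
    return text
```

Varying `window` (e.g., 100 words up to the full summary) reproduces the context-length sweep described above.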

By systematically evaluating the LLMs’ performance across different context window sizes, we aim to identify the optimal balance between computational efficiency and accuracy. This analysis helps us determine the minimum amount of contextual information required for the LLMs to make accurate predictions, considering the trade-off between model performance and computational resources in resource-constrained clinical settings.

Through this comprehensive evaluation, we assess the effectiveness of different LLMs, prompting strategies, and context window sizes for the rare disease identification task. The use of prompt templates can be found in Additional file 1. By comparing the performance of general domain LLMs and medical fine-tuned LLMs, we gain insights into the impact of domain-specific knowledge on the task. The exploration of zero-shot, few-shot, and knowledge-augmented prompting allows us to identify the most effective approach for eliciting accurate responses from the LLMs. Furthermore, the analysis of context window sizes provides valuable information on the optimal balance between contextual information and computational efficiency.

The findings from this evaluation will inform the development of an optimized rare disease identification pipeline that leverages the strengths of both SemEHR and LLMs. By combining the initial extraction capabilities of SemEHR with the contextual reasoning abilities of LLMs, we aim to achieve high accuracy in identifying rare diseases from clinical texts while considering the practical constraints of resource-limited clinical environments. This hybrid approach has the potential to significantly improve the efficiency and effectiveness of rare disease identification, ultimately benefiting patient care and research in the field of rare diseases.

Table 1 Overall model performance for rare disease identification

Results

Overall performance

The results presented in Tables 1 and 2 demonstrate varying performance across different models for rare disease identification. Each model exhibits unique strengths and limitations, highlighting the complexity of this task.

SemEHR, our baseline model utilizing a dictionary-based approach, achieves an F1 score of 0.4866. Its standout feature is a notably high recall of 0.8458, significantly outperforming other models in this metric. This high recall indicates that SemEHR is particularly adept at comprehensively capturing entities within the text, suggesting a superior ability to identify a wide range of relevant information.

However, SemEHR’s precision (0.3415) is comparatively lower, indicating a substantial number of false positive predictions. This suggests that many of the rare disease phenotypes identified by SemEHR are not actually present in the clinical text, highlighting the limitations of solely relying on dictionary-based methods for accurate rare disease identification.
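The reported scores are internally consistent: F1 is the harmonic mean of precision and recall, and plugging in SemEHR’s precision and recall recovers its F1 of approximately 0.4866. A one-line check:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

The harmonic mean is dominated by the smaller of the two values, which is why SemEHR’s low precision pulls its F1 well below its recall.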

To address the limitations of the dictionary-based approach and further refine the results obtained from SemEHR, we employ various LLMs as an additional filtering step. Table 1 presents a comprehensive comparison of our proposed approach against this baseline.

When using zero-shot prompting, Phi3-mini demonstrates the best performance, achieving an F1 score of 0.6921, closely followed by LLaMA3. These results suggest that LLMs can effectively leverage their pre-trained knowledge to identify rare diseases in clinical text without the need for task-specific fine-tuning. The improved performance of these models compared to the baseline SemEHR emphasizes the potential of LLMs in enhancing the accuracy of rare disease identification.

Interestingly, AlpaCare exhibits the least improvement in precision among the LLMs evaluated, indicating that the model may struggle to effectively discriminate between true and false positive rare disease mentions. This observation underscores the importance of carefully selecting and evaluating LLMs for the specific task of rare disease identification, as their performance can vary significantly.

Further, we compare our hybrid method with other models as additional baselines. As shown in Table 2, BioBERT-NER and PhenoBERT demonstrate similar performance profiles. Their precision scores (0.3145 and 0.3465, respectively) are comparable to SemEHR’s. However, their recall scores (0.2928 and 0.3750) are substantially lower. This suggests that while these BERT-based models maintain a similar level of accuracy in the entities they do identify, they are more prone to missing relevant entities than SemEHR. The lower recall implies a higher rate of false negatives, indicating a more conservative approach to entity recognition.

Table 2 Comparison of models for rare disease identification from texts

The two fine-tuned LLM models, PhenoGPT and PhenoHPO, show better performance than the BERT-based models and SemEHR. PhenoGPT achieves the best F1 score among the five models excluding our method, coupled with strong precision (0.4673), suggesting a well-balanced performance in both accurately identifying entities and capturing a good proportion of them. This indicates the potential of LLMs for enhancing the accuracy of rare disease identification. Despite being fine-tuned with rare disease phenotyping data, PhenoGPT and PhenoHPO demonstrate only marginal performance improvements, highlighting the persistent challenges in this domain. Finally, our hybrid method achieves the best performance, evidenced by a significantly higher F1 score compared to traditional approaches. Moreover, it achieves a more balanced precision-recall trade-off, underscoring its robustness and applicability across diverse scenarios.

The overall results demonstrate the effectiveness of combining dictionary-based methods with LLMs for improved rare disease identification. By leveraging the strengths of both approaches – the comprehensive coverage of dictionary-based methods and the contextual understanding of LLMs – we can achieve a more accurate and reliable identification of rare diseases in clinical text. This hybrid approach holds promise for facilitating the early detection and management of rare diseases, ultimately leading to better patient outcomes.

Few-shot prompting.

Few-shot prompting is a promising approach for enhancing the performance of LLMs in rare disease identification tasks. To investigate the impact of few-shot prompting, we conduct experiments with varying numbers of shots, ranging from 1 to 10. The examples used for few-shot prompting are randomly selected, ensuring no overlap with the test dataset to maintain the integrity of the evaluation. Here is the template used for few-shot prompting:

[Figure a: few-shot prompting template]

Table 1 reports the best performance achieved by each model using few-shot prompting. Notably, all models demonstrate improvements in F1 score compared to their zero-shot counterparts, with increases ranging from 0.01 to 0.12. OpenBioLLM exhibits the most substantial improvement, with an impressive 0.1158 increase in F1 score. This significant improvement suggests that OpenBioLLM effectively leverages the additional information provided by the few-shot examples to refine its predictions and better identify rare diseases in clinical text. On the other hand, Mistral shows the least improvement, with a modest 0.0148 increment in F1 score, indicating that the model may have a limited ability to capitalize on the few-shot examples for this specific task.

Figure 3 provides a more comprehensive analysis of the impact of the number of few-shot examples on the performance of different LLMs. The plot reveals an intriguing trend: increasing the number of few-shot examples initially leads to improved performance for all LLMs, as evidenced by the upward trend in F1 scores when adding 1-3 shots. This observation suggests that providing a small set of representative examples can effectively guide the LLMs to better understand the task and capture the relevant patterns associated with rare disease mentions.

Fig. 3

Few-shot prompting performance (F1) of LLMs

However, the results also highlight an important consideration: more examples do not necessarily guarantee better performance. As the number of few-shot examples increases beyond a certain point, the F1 scores begin to decline, indicating that an excessive number of examples can actually hinder the models’ ability to accurately identify rare diseases in clinical text. This finding emphasizes the importance of striking the right balance in the number of few-shot examples to optimize the effectiveness of few-shot learning for rare disease identification.

Among the LLMs evaluated, LLaMA3 consistently demonstrates the best performance across different numbers of few-shot examples, closely followed by Mistral and Phi3-mini. These models make effective use of the additional information provided by the examples to refine their predictions, showcasing their strong capacity to adapt to the specific task of rare disease identification through few-shot learning.

In conclusion, few-shot prompting proves to be a valuable technique for enhancing the performance of LLMs in rare disease identification. By providing a small set of representative examples, LLMs can better grasp the nuances of the task and improve their ability to accurately identify rare diseases in clinical text. However, it is crucial to find the optimal balance in the number of few-shot examples to maximize the effectiveness of this approach. The results also underscore the varying capabilities of different LLMs in leveraging few-shot examples, with LLaMA3, Mistral, and Phi3-mini demonstrating particularly strong performance in this regard.

Knowledge augmented generation.

To further explore the potential of enhancing the performance of LLMs in rare disease identification, we investigate the impact of incorporating external knowledge into the prompts through Knowledge Augmented Generation (KAG). Specifically, for each rare disease mention encountered in the clinical text, we extract its corresponding definition from the Unified Medical Language System (UMLS) and integrate it into the prompts. The prompt template can be found in the additional files. By providing this additional contextual information, we aim to evaluate whether the LLMs can effectively leverage this new knowledge alongside their pre-existing training data to generate improved responses and accurately identify rare diseases.

Here is the template used for KAG prompts:

[Figure b: knowledge-augmented generation prompt template]

Our findings reveal mixed results regarding the effectiveness of KAG in this task. Notably, the Phi3-mini and BioMistral models demonstrate marginally higher F1 scores when compared to their zero-shot counterparts. This slight improvement suggests that these models are capable of incorporating the external knowledge provided by UMLS definitions to some extent, leading to a modest enhancement in their ability to identify rare diseases accurately. However, it is important to note that the performance of these models in the KAG setting still falls short of their few-shot prompting results, indicating that the incorporation of external knowledge alone may not be as effective as providing task-specific examples for guiding the models’ predictions.

Interestingly, the other LLMs evaluated in this study fail to benefit from the incorporation of knowledge into the prompts, as evidenced by the lack of improvement in their F1 scores compared to the zero-shot setting. This observation suggests that these models may already possess sufficient intrinsic knowledge about rare diseases, rendering the additional information provided by UMLS definitions redundant or less impactful. It is plausible that the extensive pre-training of these LLMs on large-scale biomedical and clinical text corpora has equipped them with a comprehensive understanding of rare diseases, making the external knowledge less crucial for their performance in this specific task. Another possible reason is that UMLS definitions are often broad and may not capture the specific nuances of how rare diseases present in clinical text. In addition, the formal language of UMLS definitions may not align well with the more varied and informal language used in clinical notes.

Moreover, the superior performance of few-shot prompting compared to KAG hints at the importance of task-specific examples in facilitating the models’ comprehension and adaptation to the rare disease identification task. By providing a small set of representative examples, few-shot prompting allows the LLMs to grasp the nuances and patterns associated with rare disease mentions more effectively than the incorporation of general definitional knowledge. This finding highlights the significance of carefully curating task-specific examples to guide the models’ learning process and optimize their performance in specialized domains like rare disease identification.

In conclusion, while Knowledge Augmented Generation shows promise in enhancing the performance of LLMs by incorporating external knowledge, its effectiveness in the task of rare disease identification appears to be limited. The marginal improvements observed in the Phi3-mini and BioMistral models suggest that these LLMs can benefit from the integration of UMLS definitions to a certain extent. However, the overall results indicate that the models may already possess sufficient intrinsic knowledge about rare diseases, and that few-shot prompting, with its task-specific examples, proves to be a more effective approach for improving their performance in this domain. These findings underscore the importance of carefully considering the specific characteristics of the task and the pre-existing knowledge of the LLMs when designing strategies to enhance their performance in specialized applications like rare disease identification.

Medical LLMs vs general LLMs.

One of the most intriguing findings of our study is the comparative performance of medical fine-tuned LLMs and general domain LLMs in the task of rare disease identification. Contrary to expectations, the medical fine-tuned LLMs, such as OpenBioLLM, BioMistral, and AlpaCare, demonstrate inferior performance compared to their general domain counterparts. This observation highlights the need for further research and development efforts to optimize the fine-tuning process and effectively incorporate medical knowledge into these specialized LLMs.

The suboptimal performance of medical fine-tuned LLMs can be attributed to several key challenges. One major issue is the lack of robustness and limited understanding of the given instructions and clinical reports exhibited by these models. Despite undergoing fine-tuning on medical corpora, these LLMs struggle to fully grasp the nuances and complexities of the clinical language and context, leading to subpar performance in identifying rare diseases accurately. This finding suggests that the current fine-tuning strategies employed for medical LLMs may not be sufficient to capture the intricate relationships and domain-specific knowledge required for this task.

Moreover, the superior performance of general domain LLMs in rare disease identification underscores the importance of carefully evaluating and adapting LLMs for specific medical tasks. While medical fine-tuning aims to equip LLMs with domain-specific knowledge, the results of our study indicate that generic fine-tuning alone may not guarantee optimal performance in specialized tasks like rare disease identification. This observation calls for a more nuanced approach to fine-tuning, taking into account the unique characteristics and requirements of each medical task.

To address these challenges and improve the performance of medical LLMs, further research is necessary to develop more sophisticated fine-tuning strategies that can effectively leverage medical knowledge. This may involve exploring novel techniques for incorporating domain-specific information, such as ontology-based fine-tuning or knowledge graph integration, to enhance the LLMs’ understanding of medical concepts and their relationships. Additionally, the development of task-specific fine-tuning approaches tailored to rare disease identification could help bridge the performance gap between medical and general domain LLMs.

Furthermore, the evaluation and adaptation of LLMs in the medical domain should go beyond generic fine-tuning and focus on comprehensive testing across a range of clinical tasks and datasets. By thoroughly assessing the performance of LLMs in various medical scenarios, researchers can identify the strengths and weaknesses of different models and fine-tuning strategies, enabling the development of more robust and reliable LLMs for clinical applications.

In conclusion, the inferior performance of medical fine-tuned LLMs compared to general domain LLMs in rare disease identification highlights the need for further advancements in the fine-tuning process and the incorporation of medical knowledge. The lack of robustness and limited understanding exhibited by these specialized LLMs underscores the importance of developing more sophisticated fine-tuning strategies and conducting comprehensive evaluations to ensure optimal performance in clinical tasks. By addressing these challenges through targeted research efforts, we can unlock the full potential of LLMs in the medical domain and enhance their accuracy and reliability in critical applications like rare disease identification.

The impact of context length.

Figure 4 illustrates the impact of context length on the performance of the LLMs in the rare disease identification task. The context length is systematically varied from 100 to 4,000 words, as some of the LLMs do not support input lengths larger than 4,096. A key observation from the figure is that as the context length increases, the performance of the models generally decreases, as indicated by the downward trend in the plot. This is primarily due to the noise and irrelevant information introduced by longer contexts, which can confuse the models. This suggests that smaller context lengths may be more effective for LLMs to make accurate inferences about the presence of rare diseases in clinical text: the models perform better when focusing on more localized context, such as the most relevant sections of the clinical report, rather than considering the entire document.

However, there are specific cases where longer context lengths prove advantageous. These typically involve complex clinical scenarios in which the rare disease mention depends on information spread across a larger portion of the clinical note. For instance, when the rare disease diagnosis is contingent on a combination of symptoms, family history, and test results described throughout the discharge summary, the longer context allows the model to capture these relationships more effectively. This finding highlights the importance of selecting an appropriate context window size that strikes a balance between providing sufficient contextual information and avoiding irrelevant or potentially confounding details.

Fig. 4

Impact of context length on the performance (F1) of LLMs

Furthermore, the results indicate that the current state-of-the-art LLMs still have difficulties in reasoning long sequences. Despite their impressive capabilities in various natural language processing tasks, the contextual reasoning ability of these models is not yet robust enough to effectively handle the full length of clinical reports. This observation underscores the need for further research and development of LLMs to improve their ability to reason over longer contexts and capture relevant information from comprehensive clinical narratives.

Analysis on the full dataset.

To assess the real-world impact of our approach, we apply the best-performing method to the full dataset, consisting of over 331,000 discharge summaries from 145,915 patients. This comprehensive analysis reveals the presence of 1,143 distinct rare diseases within the cohort, affecting a total of 24,593 individuals. The scale of this analysis highlights the potential of our approach to identify a significant number of rare disease cases from a large, diverse patient population.

One of the most striking findings from this analysis is the discovery of 495 rare diseases that were not previously captured in the structured diagnostic data (i.e., ICD codes). These diseases were identified solely through the analysis of patients’ discharge summaries using our NLP-based approach. This finding underscores the limitations of relying exclusively on structured diagnostic codes for rare disease identification and emphasizes the untapped potential of unstructured clinical narratives. By leveraging advanced NLP techniques, we can uncover previously unrecognized patients with rare diseases, potentially leading to improved diagnosis, treatment, and research opportunities for these individuals.

Furthermore, our analysis reveals that 337 rare diseases have a higher number of patients identified from free-text EHRs compared to structured diagnostic codes. This discrepancy suggests the presence of potentially undiagnosed or misdiagnosed individuals with rare diseases within the cohort. The higher prevalence of these diseases in the unstructured clinical narratives highlights the importance of comprehensive phenotyping approaches that go beyond traditional coding methods. By leveraging the rich information contained within free-text EHRs, we can identify patients who may have been overlooked or misclassified, enabling targeted interventions and specialized care for these individuals.
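The two comparisons above — diseases found only via free text, and diseases with more patients in free text than in structured codes — amount to simple set operations over per-disease patient cohorts. A sketch, with illustrative data structures rather than the study’s actual pipeline output:

```python
def compare_sources(nlp_cases, icd_cases):
    """Compare per-disease patient sets identified from free text (NLP)
    vs. structured ICD codes. Both arguments map disease name -> set of
    patient IDs. Returns (diseases found only via NLP, diseases with more
    NLP-identified patients than ICD-coded patients)."""
    only_nlp = sorted(d for d in nlp_cases if d not in icd_cases)
    more_in_text = sorted(d for d in nlp_cases
                          if d in icd_cases
                          and len(nlp_cases[d]) > len(icd_cases[d]))
    return only_nlp, more_in_text
```

In the study, the first list contains 495 diseases and the second 337, against a structured baseline built from the cohort’s ICD codes.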

To illustrate the significance of these findings, Fig. 5 presents a selection of rare diseases that were identified through our NLP-based analysis of free-text EHRs but were absent from the structured diagnostic codes. These examples serve as compelling evidence for the utility of applying advanced text mining techniques to unstructured clinical narratives. By uncovering these previously undetected rare diseases, we can gain valuable insights into their epidemiology, natural history, and potential treatment strategies. Moreover, the identification of these patients can facilitate their referral to specialized care centers and support their inclusion in relevant clinical trials and research initiatives.

Fig. 5

Case identification of rare diseases identified by NLP-based (free-text) and ICD-based (structured) data

The successful application of our approach to the full dataset demonstrates the power of combining large-scale clinical datasets with sophisticated NLP methods for rare disease identification. Our results are consistent with previous findings by Dong et al. [12] and Ford et al. [16], who have highlighted the value of incorporating unstructured clinical texts alongside structured coded data for disease identification. By leveraging the wealth of information contained within unstructured clinical narratives, we can enhance our understanding of rare diseases, identify patients who may have gone undiagnosed, and ultimately improve outcomes for this often-overlooked patient population.

The identification of a substantial number of previously unrecognized rare disease cases through our analysis underscores the potential impact of our approach on rare disease research and patient care. By uncovering these hidden cases, we can expand our knowledge of rare disease epidemiology, natural history, and potential therapeutic targets. Furthermore, the identification of these patients can facilitate their timely diagnosis, referral to specialized care, and inclusion in relevant research studies and clinical trials. This, in turn, can lead to improved outcomes, quality of life, and the development of targeted interventions for individuals with rare diseases.

In conclusion, our comprehensive analysis of the full dataset demonstrates the real-world applicability and impact of our NLP-based approach for rare disease identification. The discovery of a significant number of previously unrecognized rare disease cases highlights the limitations of relying solely on structured diagnostic codes and emphasizes the untapped potential of unstructured clinical narratives. This represents a significant advancement in clinical utility compared to traditional methods, offering improved case detection and more comprehensive phenotyping. By analyzing unstructured clinical narratives, we capture a more nuanced and complete picture of a patient’s condition, including subtle symptoms and clinical observations often overlooked in structured data. This can support clinical decision-making by prompting clinicians to consider less common diagnoses. However, it is crucial to emphasize that our method is intended to complement, not replace, traditional diagnostic approaches, and the rare disease mentions identified should be viewed as signals for further clinical investigation rather than definitive diagnoses. By leveraging the power of large-scale clinical datasets and advanced NLP techniques, we can enhance our understanding of rare diseases, identify patients who may have gone undiagnosed, and ultimately improve outcomes for this often-overlooked patient population. These findings underscore the importance of integrating unstructured clinical texts with structured coded data for comprehensive rare disease identification and research.

Conclusion

In conclusion, this study introduces a novel hybrid approach that synergistically combines dictionary-based natural language processing (NLP) tools with large language models (LLMs) to enhance the identification of rare diseases from unstructured clinical reports. By capitalizing on the complementary strengths of these two techniques, the proposed method exhibits superior performance compared to traditional NLP systems and standalone LLMs. Our comprehensive experiments investigate various strategies, including zero-shot and few-shot prompting, knowledge-augmented generation (KAG), and the impact of context length on the model’s performance. The results demonstrate that LLaMA3 and Phi3-mini consistently achieve the highest F1 scores in the task of rare disease identification.

Moreover, our analysis reveals the potential of the proposed approach to uncover previously unidentified rare disease cases that are not yet documented in structured medical records. This finding underscores the significant impact of our method in facilitating the early detection and diagnosis of rare diseases, which can ultimately lead to improved patient outcomes and the development of targeted treatment strategies.

However, it is important to acknowledge the limitations of our approach and the areas that require further advancement to ensure its effectiveness and seamless integration into medical practice. One key factor influencing the performance of our method is the quality of ontology matching among the Unified Medical Language System (UMLS), the International Classification of Diseases, 10th Revision (ICD-10), and the Orphanet Rare Disease Ontology (ORDO). Inaccuracies in the ontology mappings, such as “Congenital pulmonary airway malformation” from UMLS being mapped to “Q330 Congenital cystic lung” in ICD-10, can reduce the precision of rare disease identification. To address these challenges, we conducted manual reviews of complex mappings, sought validation from domain experts, and utilized intermediate ontologies to improve alignment. Despite these efforts, we acknowledge that some imperfection in ontology mapping persists. As medical terminologists continuously refine and update these mappings with the aid of ontology mapping tools, we anticipate that the method’s performance will improve over time.
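The manual-review step can be framed as a simple audit over a table of cross-ontology links. The sketch below is illustrative only: the mapping table, labels, and relation annotations are assumptions echoing the example in the text, not verified entries from UMLS, ICD-10, or ORDO.

```python
# Hypothetical mapping-audit step: every link that is not an exact match
# (e.g. the broader Q330 mapping discussed above) is queued for manual
# review by a domain expert. Relation labels are illustrative.

MAPPINGS = [
    # (UMLS label, ICD-10 code, ICD-10 label, relation)
    ("Congenital pulmonary airway malformation", "Q330", "Congenital cystic lung", "broader"),
    ("Fabry disease", "E75.2", "Other sphingolipidosis", "narrower-source"),
    ("Cystic fibrosis", "E84", "Cystic fibrosis", "exact"),
]

def flag_for_review(mappings):
    """Return the (source label, target code) pairs that need expert review."""
    return [(src, code) for src, code, _, rel in mappings if rel != "exact"]
```

Intermediate ontologies can then be consulted only for the flagged pairs, keeping the expert workload proportional to the number of inexact links rather than the full vocabulary.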

Furthermore, our analysis primarily concentrates on comparing the case identification of rare diseases derived from NLP-based (free-text) and ICD-based (structured) data. It is crucial to recognize that there may be overlapping cases between these two approaches, and future research should focus on investigating the extent and implications of such overlap. By conducting a thorough analysis of the convergence and divergence between NLP-based and ICD-based rare disease identification, we can gain valuable insights into the complementary nature of these approaches and develop strategies to harmonize and integrate their findings.
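The convergence-and-divergence analysis proposed above reduces to set algebra over patient identifiers. The cohorts below are invented for illustration; real analyses would use de-identified patient IDs from the record linkage step.

```python
# Illustrative overlap analysis between NLP-derived and ICD-derived cohorts.
# Patient IDs are made up for the example.

nlp_cases = {"p01", "p02", "p03", "p07"}   # patients flagged from free text
icd_cases = {"p02", "p03", "p05"}          # patients with a matching ICD-10 code

both = nlp_cases & icd_cases        # convergent cases, found by both routes
nlp_only = nlp_cases - icd_cases    # candidates absent from structured codes
icd_only = icd_cases - nlp_cases    # coded cases the NLP pipeline missed
```

Reporting all three sets, rather than only the NLP-derived total, makes the complementarity of the two approaches explicit and avoids double-counting patients identified by both routes.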

Despite these limitations, our study establishes a solid foundation for the application of hybrid NLP and LLM approaches in the field of rare disease identification. The promising results obtained from our experiments highlight the immense potential of leveraging advanced language models and dictionary-based tools to unlock the wealth of information contained within unstructured clinical reports. By continually refining and expanding upon our methodology, we can work towards overcoming the challenges posed by ontology mapping inconsistencies and overlapping case identification, ultimately paving the way for more accurate and comprehensive rare disease detection.

In future research endeavors, it is essential to focus on enhancing the robustness and adaptability of our approach to accommodate the evolving nature of medical ontologies and terminologies. Collaborations between computational linguists, medical experts, and ontology developers will be crucial in addressing the limitations associated with ontology mapping and ensuring the seamless integration of our method into clinical practice. Additionally, exploring advanced techniques for data integration and harmonization, such as entity resolution and record linkage, can help to mitigate the impact of overlapping cases and provide a more holistic view of rare disease epidemiology.

Furthermore, the integration of our hybrid approach with other complementary methodologies, such as machine learning algorithms and knowledge graphs, can further enhance its performance and extend its applicability to a wider range of clinical scenarios. By leveraging the strengths of multiple techniques and data sources, we can develop a more comprehensive and robust framework for rare disease identification, ultimately improving patient care and accelerating research efforts in this critical domain.

Availability of data and materials

Data is available on Physionet with credentialed access: https://www.physionet.org/content/mimic-iv-note/2.2/.

Notes

  1. https://huggingface.co/meta-llama/Meta-Llama-3-8B.

  2. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2.

  3. https://huggingface.co/microsoft/Phi-3-mini-128k-instruct.

  4. https://huggingface.co/aaditya/Llama3-OpenBioLLM-8B.

  5. https://huggingface.co/BioMistral/BioMistral-7B.

  6. https://huggingface.co/xz97/AlpaCare-llama2-7b.


Acknowledgements

 The authors would like to thank all the reviewers and editors for their valuable feedback that substantially improved this paper.

Funding

This work was supported by the National Institute for Health and Care Research (NIHR202639), the NIHR/HDR UK Winter Pressure Award (WP0006), and the Medical Research Council (MR/S004149/2, MR/X030075/1). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. This work was also supported by the British Council (UCL-NMU-SEU International Collaboration On Artificial Intelligence In Medicine: Tackling Challenges Of Low Generalisability And Health Inequality; Facilitating Better Urology Care With Effective And Fair Use Of Artificial Intelligence - A Partnership Between UCL And Shanghai Jiao Tong University School Of Medicine). H.Wu’s role in this research was partially funded by the Legal & General Group (research grant to establish the independent Advanced Care Research Centre at the University of Edinburgh). The funders had no role in the conduct of the study, the interpretation of results, or the decision to submit for publication. The views expressed are those of the authors and not necessarily those of Legal & General.

Author information

Authors and Affiliations

Authors

Contributions

J.W., H.D., and H.Wu conceived the project. J.W. developed the methodology and experiments. Z.L., H.Wang and R.L. performed data annotation. J.W. and H.D. drafted the manuscript. All authors contributed to the final manuscript preparation.

Corresponding authors

Correspondence to Jinge Wu or Honghan Wu.

Ethics declarations

Ethics approval and consent to participate

This work used the de-identified clinical notes in MIMIC-IV. We completed the Collaborative Institutional Training Initiative (CITI) Program’s “Data or Specimens Only Research” course (https://physionet.org/content/shareclefehealth2013/view-required-trainings/1.0/#1) and signed the data use agreement to get access to the data. We strictly followed the guidelines and only used locally hosted LLMs with the data.

Consent for publication

All authors consent to publication.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wu, J., Dong, H., Li, Z. et al. A hybrid framework with large language models for rare disease phenotyping. BMC Med Inform Decis Mak 24, 289 (2024). https://doi.org/10.1186/s12911-024-02698-7
