
The aluminum standard: using generative Artificial Intelligence tools to synthesize and annotate non-structured patient data

Abstract

Background

Medical narratives are fundamental to the correct identification of a patient’s health condition, not only because they describe the patient’s situation, but also because they contain relevant information about the patient’s context and the evolution of their health state. Narratives are usually vague and cannot be categorized easily. On the other hand, once the patient’s situation has been correctly identified from a narrative, it becomes possible to map it onto precise, machine-readable classification schemas and ontologies. To this end, language models can be trained to read and extract elements from these narratives. However, the main problem is the lack of data for model identification and model training in languages other than English. First, gold standard annotations are usually not available due to the high level of data protection for patient data. Second, gold standard annotations (if available) are difficult to access. Alternative available data, like MIMIC (Sci Data 3:1, 2016), are written in English and for specific patient conditions like intensive care. Thus, when model training is required for other types of patients, like oncology (and not intensive care), this could lead to bias. To facilitate clinical narrative model training, a method for creating high-quality synthetic narratives is needed.

Method

We devised workflows based on generative AI methods to synthesize narratives in the German language, avoiding the disclosure of patients’ health data. Since we required highly realistic narratives, we generated prompts, written with high-quality medical terminology, asking for clinical narratives containing both a main disease and a co-disease. The frequency distribution of both the main and co-disease was extracted from the hospital’s structured data, such that the synthetic narratives reflect the disease distribution in the patient cohort. To validate the quality of the synthetic narratives, we annotated them to train a Named Entity Recognition (NER) algorithm. According to our assumptions, the validation of this system implies that the synthesized data used for its training are of acceptable quality.

Result

We report precision, recall and F1 score for the NER model, considering metrics that take into account both exact and partial entity matches. The trained models are cautious, with a precision of up to 0.8 for the Entity Type match metric and an F1 score of 0.3.

Conclusion

Despite its inherent limitations, this technology has the potential to enable data interoperability by using encoded diseases across languages and regions without compromising data safety. Additionally, it facilitates the synthesis of unstructured patient data, so that model identification and training can be accelerated. We believe that this method may be able to generate discharge letters for any combination of main and co-diseases, which would significantly reduce the time healthcare professionals spend writing these letters.


Background

Generative AI tools have been widely acclaimed because they can produce human-level text starting from prompts, such as engaging dialogue, language translations, articles, poetry and essays, and much more, in a wide range of styles [1]. These tools can have enormous potential not only for creative writing, but also for applications in disciplines like medicine, where medical narratives are used as a valuable information source for documentation and decision making.

Several open-source models are currently available, and the field is highly dynamic, with new models deployed frequently. The most popular is ChatGPT, but there are also very competitive alternatives [Footnote 1].

Among these models is BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), a collaborative effort between BigScience and other academic institutions. Like most of these models, it uses a decoder-only transformer with minor modifications. Almost all of the training data has been released, along with information about its sources, curation, and processing, and the model is one of the largest open-source multilingual models available [2]. A second relevant language model is OPT (Open Pre-trained Transformer), released by Meta, which uses a decoder-only transformer architecture with some changes to the attention mechanism and has been trained using publicly available data [3].

Other large language models published in 2023 are, for instance, LLaMA (by Meta) in February, StableLM (by StabilityAI) and Pythia (by Eleuther AI) in April, MPT (by MosaicML) in May, X-GEN (by Salesforce) and Falcon (by TIIUAE) in June, Llama 2 (by Meta) in July, StableLM v2 (by StabilityAI) in August, Qwen (by Alibaba) and Mistral (by Mistral.AI) in September, Yi (by 01-ai) in November, and DeciLM (by Deci), Phi-2, and SOLAR (by Upstage) in December [Footnote 2].

Generative AI can impact domains where data is scarce and real data collection is laborious, provided that the synthesized data is generated using real data as a seed. In this study, we aim to leverage generative AI (Artificial Intelligence) tools like ChatGPT [4] to generate a dataset that closely resembles authentic clinical narratives, and to demonstrate that data generation is a viable alternative to accessing data that, under other circumstances, is unavailable. Moreover, this approach aims to define methods to expand the volume and diversity of unstructured datasets while keeping sensitive information private and confidential.

AI-powered text generation in the medical domain has already been applied to the task of creating supporting medical documentation [5, 6], a highly difficult task in daily medical practice that requires considerable skill from the involved health personnel in writing precise medical descriptions with appropriate technical depth. But beyond these proofs of concept, research about other use cases is still ongoing. Our use case focuses on the generation of AI narratives, which is still a relatively novel research field [7].

By doing this, we aim to solve two problems:

  i) access to medical narratives, and

  ii) lack of available medical narratives in languages other than English.

First, there are several datasets containing annotated narratives in biomedicine, like article abstracts, for named entity recognition, in English but also in other languages like Spanish or German [8]. However, clinical data containing doctors’ letters and vignettes is usually not available due to the high standards of data protection for patient medical records. Even access to publicly available repositories, like MIMIC III and MIMIC IV [9], requires the completion of a data protection course as well as a review process in which an academic reference is required [Footnote 3]. This is because the data was obtained under very specific conditions in an intensive care setting and requires careful handling. And while structured data, like bio-physiological measurements, can easily be synthesized for further model exploration and training (see for instance SMOTE as a data synthetization method [10]), unstructured data remains difficult to synthesize. On the other hand, such unstructured data is extremely relevant, since the narrative context provides essential information that influences the correct identification of the patient’s health status. Unstructured data is essential for mapping the patient’s condition, like a disease, to machine-readable descriptions of this condition’s properties and their potential interrelations (for instance the Disease Ontology [Footnote 4]) or to classification systems, like the International Classification of Diseases (ICD, for diagnoses) or the Procedure Coding System (PCS in English, OPS in German, for medical procedures).

Second, an additional problem is access to annotated biomedical data for languages other than English. State-of-the-art biomedical named entity recognizers and linkers are able to detect and link relevant concepts in medical narratives written in English with high accuracy [11]. However, there are hardly any such models for languages like German, due to the lack of corpora annotated with biomedical entities written in German. A possible starting point are large language models pre-trained on German biomedical data like BioGottBERT [12]. However, these still require fine-tuning on annotated resources when applied to named entity recognition and linking problems.

Considering these problems, generative AI represents a promising way to generate synthetic data for research purposes. For instance, Ali et al. have shown that ChatGPT produces clinically correct, fluent English text of average complexity, in the form of patient clinic letters describing cases of skin cancer [6].

The way generative AI text tools have been deployed resembles a large-scale, global Turing test, making it difficult to differentiate between human-generated and automated content. On the other hand, this rollout has exposed several ethical problems: the unauthorized use of text and image material, representing a violation of intellectual property rules; the generation of incoherent results (so-called hallucinations); and inherent social problems like the exploitation of both labor (workers in underpaid positions labeling data) and customers (who provide annotated private data for model training for free and without their explicit knowledge) [13].

In the scientific field, these methods are problematic when they are used to automatically generate misleading content and disinformation [14], when they are used to replace the work of human authors, or when a user’s prompt to a generative AI produces text with content that the user does not understand but may be tempted to incorporate into their writing [1].

In the current application we aim for an ethical use of such tools as a method to synthesize data and thereby protect patients’ privacy. The method addresses serious ethical concerns regarding the exposure of patient data to models, and it aims to improve data interoperability and data exchange across institutions. But any model trained on synthetic data must be explicitly declared as such and can only be used for research purposes and preliminary model exploration.

In this paper we propose a method based on ChatGPT to generate medical narratives in German describing the status of imaginary patients based on a distribution of diseases and comorbidities in a real patient cohort. Thus, no other patient information recorded in the health record is required as input data. The generated medical narratives are then automatically annotated, using information about the diseases and comorbidities employed to generate the narrative, in order to train tools for Named Entity Recognition (NER). The main advantage of using ChatGPT in this scenario is that several tests have demonstrated the quality of the medical texts it generates [6]. These capabilities have huge potential for generating synthetic data for research purposes without compromising patient privacy through the use of real patient data. Naturally, from an epistemic point of view this is problematic, given that a model is learning from a model; this epistemic problem can be addressed by taking some precautions. We argue that the use of generative AI tools in medicine is justified if, first, the data generation is performed using real anonymized data as reference, and second, the models trained on synthetic data are validated against real anonymized data.

The article is structured as follows: in Sect. 2 we provide a few comments regarding the ethics of generative AI; in Sect. 3 we describe the implemented method; in Sect. 4 we describe the different data sources required in this project; and in Sect. 5 we present the obtained results. In Sect. 6 we discuss the obtained results and their limitations. In Sect. 7 we summarize this investigation and discuss the next steps.

Method

The goal of this investigation is two-fold: (i) to integrate a generative AI tool into a pipeline for synthesizing biomedical narratives in German and (ii) to fine-tune existing transformer-based language models for recognizing biomedical entities in medical narratives written in German. This pipeline is shown in Fig. 1.

Fig. 1 Pipeline for the integration of synthetic data for the fine-tuning of language models

As a proof of concept, we are training language models for the German language:

  • First, it is a challenging problem: German has a relatively complex grammar (including four different forms of word inflection), as well as complex sentence and word composition. German is spoken as an official language in three countries, with around 100 million speakers in Europe, and represents an ideal case for research on the capabilities and limitations of language models.

  • Second, there is hardly any annotated clinical data available in German, yet precisely such data is needed to train language models.

As a use case, we chose oncology, because of the complexity of treatment patterns and distribution of diseases and comorbidities.

To test this pipeline, we defined two different use cases by sampling synthetic clinical notes and narratives (physician’s letters and case vignettes) using:

  a. disease descriptions obtained from the standardized ICD (International Statistical Classification of Diseases and Related Health Problems) Thesaurus (German language);

  b. oncologic data obtained from a hospital in a metropolitan area, containing the distribution of diseases and co-diagnoses (German language).

Since the oncological data are only ICDs (classified diseases), there is no additional patient information in the input data, setting a high data protection standard.

Generative AI method

Currently there are several available models for automatic text generation, like Jenni, ChatGPT or Anyword [Footnote 5]. In this investigation we implemented an AI pipeline in MS Azure using ChatGPT as the generative AI tool. OpenAI’s ChatGPT chatbot uses a Large Language Model (LLM) named GPT-3 to generate human-like text. By learning how input sequences are related to output sequences, the generative pre-trained transformer (GPT) language model can process large amounts of text data and produce coherent text outputs [6, 15]. In this investigation we used the canonical API for ChatGPT [Footnote 6], and no regular expressions were implemented at this stage [Footnote 7].
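As an illustration of this integration, a single generation request might look like the following minimal sketch. It assumes the legacy openai (<1.0) Python interface and an illustrative model name; the paper specifies only that the canonical ChatGPT API was used within MS Azure, so these details are our assumptions.

```python
# Minimal sketch of one narrative-generation call; assumes the legacy
# openai (<1.0) Python interface. Model name and key handling are
# illustrative, not the authors' actual configuration.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # placeholder credential


def generate_narrative(prompt: str, temperature: float = 1.0) -> str:
    """Send one prompt to ChatGPT and return the generated narrative."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # illustrative GPT-3.5 model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # T=1 is the API default used in this study
    )
    return response.choices[0].message["content"]
```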

Generation of synthetic medical narratives

We aim to create prompts for ChatGPT that can be used to generate synthetic medical narratives similar to real clinical narratives. Thus, medical experience and knowledge are needed to formulate these prompts.

This paper proposes the systematic investigation of two kinds of prompts: one including one disease (PoC1) and a second prompt containing a disease and a co-disease (PoC2). These prompts are generated using the encoded disease distribution in the hospital (with ICD-10-GM disease codes). In this data the patient number has been replaced by a random number, and no other patient data has been considered. Therefore, this input data is fully anonymized.

PoC1

A first dataset is generated by extracting terms from the ICD-10-GM Thesaurus and using them to create a prompt for ChatGPT. In this case, a sample prompt looks like this:

Write two sentences in German in the style of an EHR medical record written by a doctor, without mentioning the patient's personal data. The sentences should mention only [Disease] extracted from the ICD-10-GM. (P1)

The prompt uses a simple approach that takes only one disease into account and does not ask for any personal information (including gender and age). The German Thesaurus has been extracted from a public database provided by the German Federal Institute for Drugs and Medical Devices (Bundesinstitut für Arzneimittel und Medizinprodukte [Footnote 8]). The prompt was written in English.

PoC2

For the generation of the second dataset, we rely on a set of real diagnostic codes from a cohort of patients at the hospital. Synthetic patient populations are thereby given a realistic structure.

Figure 2 illustrates the co-occurrence of diseases for patients with gastrointestinal (GI) and hematological (HEM) cancers (the real network of comorbidities would be much larger; in this visualization we selected an adjacency matrix with only 100 elements). From this co-occurrence the main diseases (cancer diseases) are linked to associated comorbidities. In the implemented prompts we selected highly correlated diseases with large co-occurrences.

Fig. 2 A graph representing the co-occurrence of diseases derived from a database of patients with GI (gastrointestinal) cancer. In order to provide a legible figure, this graph, which contains only 100 correlations, is an extract from the whole disease correlation graph. Observe that in this extract some diseases, like I10.00 (hypertension), are often correlated with other diseases (central node). Since we aim to synthesize a database for cancer patients, we restrict the search to cancer diseases (main diseases, blue nodes), for instance C83.3 (diffuse large B-cell lymphoma) or C34.1 (malignant neoplasm of the upper lobe)

The distribution of co-occurrences is used as a quantitative input for the prompts. The formulation of a prompt is based on the definition of a main disease combined with a plausible comorbidity, where the joint occurrence of the two diseases was observed in real patient data.

We formulated prompts in German as a case vignette [Footnote 9] [16] or as a medical letter using the following template (in German (G), with the corresponding English translation (E) as reference):

G: Erstellen Sie eine Fallvignette über einen RA(f) Patienten mit K[i](f) als Hauptdiagnose und K[j](f) als Nebendiagnose. (P2)

E: Create a case vignette about a RA(f) patient with K[i](f) as the main diagnosis and K[j](f) as the secondary diagnosis.

In these expressions RA is the random age and K is the corresponding description of a disease (or diagnosis). In the template above, RA(f) is a value between 25 and 80, randomly selected for each prompt f according to the age distribution of the patient population, where the probability of an age between 46 and 50 years is 0.3 and the probability of an age between 55 and 86 years is 0.7 (see Appendix 2). We could also include the patient’s gender, considering that about 50% of the population is female and 50% male, in the following way:

G: Erstellen Sie eine Fallvignette über einen RA(f) und RG(f) Patienten mit K[i](f) als Hauptdiagnose und K[j](f) als Nebendiagnose. (P2+G)

E: Create a case vignette about a RA(f) and RG(f) patient with K[i](f) as the main diagnosis and K[j](f) as the secondary diagnosis.

Here RG(f) is the gender. This can, however, be problematic considering disease co-distribution, since some diseases are contextually related to gender (for example, the probability of breast cancer is higher in women than in men). This means that we can produce wrong correlations between gender and disease by introducing a randomly generated gender (in a similar way to age).

To keep the prompt generation as simple as possible, we considered that the probability that a synthesized patient has an age between 46 and 50 years is 0.3, while the probability of an age between 55 and 86 years is 0.7. Besides the ICD and age distributions in the patient population, no other patient data is used in this data synthetization. As a result, other contextual information, like the socioeconomic status of the patients, cannot be used to synthesize these narratives.

The diseases K[i](f) and K[j](f) are chosen among the main diseases (indexed by i) and comorbidities (indexed by j) obtained from the catalog of disease co-occurrences extracted from the hospital data (Fig. 2). We decided to switch from English prompts to formulating the prompt in German for P2, to make sure that the naming of the diseases is correctly understood by ChatGPT (more in the appendix).

An example of P2 looks like:

Erstellen Sie eine Fallvignette über einen 45 Patienten mit Diffuses großzelliges B-Zell-Lymphom als Hauptdiagnose und Hypertonie als Nebendiagnose.
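In code, the sampling of such prompts might look like the following minimal sketch. The co-occurrence catalogue, disease names and counts are illustrative placeholders, not actual hospital data; only the template shape and the two age bands follow the description above.

```python
import random

# Hypothetical co-occurrence catalogue (main disease, co-disease) -> joint
# frequency; names and counts are illustrative, not actual hospital data.
co_occurrence = {
    ("Diffuses großzelliges B-Zell-Lymphom", "Hypertonie"): 35,
    ("Bösartige Neubildung: Unterlappen (-Bronchus)", "Hypertonie"): 21,
}

TEMPLATE = ("Erstellen Sie eine Fallvignette über einen {age} Patienten mit "
            "{main} als Hauptdiagnose und {co} als Nebendiagnose.")


def sample_age() -> int:
    """Sample RA(f): p=0.3 for 46-50 years, p=0.7 for 55-86 years,
    following the cohort age distribution reported above."""
    if random.random() < 0.3:
        return random.randint(46, 50)
    return random.randint(55, 86)


def sample_prompt() -> str:
    """Draw a disease pair proportionally to its co-occurrence frequency
    and fill the P2 template."""
    pairs = list(co_occurrence)
    weights = [co_occurrence[p] for p in pairs]
    main, co = random.choices(pairs, weights=weights, k=1)[0]
    return TEMPLATE.format(age=sample_age(), main=main, co=co)


print(sample_prompt())
```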

While such prompts can be coupled with additional styles (for instance, formulating a text in an electronic health record style (EHR-style)), we consider this simple style sufficient to generate realistic samples. This represents a set of key concepts that can be stored and used in the pipeline. In general, this part mainly concerns the problem of defining guidelines for using ChatGPT to generate synthetic medical letters [Footnote 10].

Lastly, note that since the names of the diseases are extracted from the Thesaurus, the prompts carry no reference to the expressions, definitions, or style used in the hospital’s own disease descriptions. As a result, it is not possible to reverse engineer or disclose any real narrative from the hospital, for instance using authorship analysis [17].

Entity annotation

The generated data is also used to create automatic annotations for training machine learning models. Two entities are annotated: main disease (PoC1 & PoC2) and co-disease (PoC2). We have therefore also implemented a method for the automatic generation of annotations using the following algorithm:

  1. Select K[i] (PoC1 & PoC2) and K[j] (PoC2).

  2. Fill in the template using the selected diseases K[i] and K[j] as input parameters and write the prompt following the rules defined in P1 or P2.

  3. Feed the resulting prompt to ChatGPT and record the generated medical narrative.

  4. Annotate the disease (for P1) mentioned in the generated text by exact matching against the ICD catalogue disease names [Footnote 11].

  5. Annotate the diseases (for P2) mentioned in the generated text by exact matching against the ICD catalogue disease names.

The annotation has been performed by taking the diagnoses K[i] included in the prompts and exact-matching them in the synthesized texts [Footnote 12]. For the NER model, we use the Inside-Outside-Beginning (IOB) tagging scheme [18] [Footnote 13] to convert the annotated text sequence to the corresponding token-level annotations: word tokens that begin a named entity are tagged with ‘B-’, tokens ‘inside’ (= continuations of) named entities are tagged with ‘I-’, and non-entity tokens are tagged with ‘O’ [Footnote 14].
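The conversion from annotated character spans to token-level IOB tags can be sketched as follows. Whitespace tokenization is a simplification (the fine-tuned models use their own subword tokenizers), and the helper is illustrative.

```python
def to_iob(text: str, spans: list[tuple[int, int]]) -> list[tuple[str, str]]:
    """Convert character-level entity spans (start, end) of the single
    entity type 'disease' into token-level IOB tags."""
    tokens_with_tags, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        pos = start + len(token)
        tag = "O"
        for s, e in spans:
            if start == s:
                tag = "B-DISEASE"   # token begins an annotated entity
            elif s < start < e:
                tag = "I-DISEASE"   # token continues the entity
        tokens_with_tags.append((token, tag))
    return tokens_with_tags


text = "Patient mit diffuses großzelliges B-Zell-Lymphom diagnostiziert"
span = (text.index("diffuses"), text.index("Lymphom") + len("Lymphom"))
print(to_iob(text, [span]))
```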

We observed that some labels get lost when synthesizing text in languages other than English. For instance, in German, word inflections can change the textual form of the generated entities, impairing the exact matching against the labels in the ICD catalogue. Therefore, we create a regular expression from the ICD disease name in the catalogue by removing the last two characters of each word in the name and allowing up to three arbitrary letters instead. For example, the inflections in the disease mention “Hauptsymptom des diffusen großzelligen B-Zell-Lymphoms” will be matched by the regular expression for the ICD disease name “diffuses großzelliges B-Zell-Lymphom”. Furthermore, the longest possible span is annotated; thus, the disease mention “großzell B-Zell-Lymphom” will be annotated with the regular expression for the disease name “diffuses großzelliges B-Zell-Lymphom” (see Table 1, and the sketch below).
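A minimal sketch of this regular-expression construction, under the two-character rule just described, is given below. The character class used for the inflected ending and the case handling are our assumptions beyond the stated rule (cf. the concrete pattern in Footnote 11).

```python
import re


def inflection_pattern(icd_name: str) -> re.Pattern:
    """Build a regular expression from an ICD disease name that tolerates
    German inflections: drop the last two characters of every word and
    allow up to three arbitrary letters instead (cf. Footnote 11)."""
    parts = []
    for word in icd_name.split():
        stem = word[:-2] if len(word) > 2 else word
        parts.append(re.escape(stem) + "[a-zäöüß]{0,3}")
    return re.compile(r"\s".join(parts), re.IGNORECASE)


pattern = inflection_pattern("diffuses großzelliges B-Zell-Lymphom")
text = "Hauptsymptom des diffusen großzelligen B-Zell-Lymphoms"
match = pattern.search(text)
print(match.group(0))  # -> "diffusen großzelligen B-Zell-Lymphoms"
```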

Table 1 Example of the text generated using generative AI for a patient with a main disease K[1] (a lymphoma) and a co-disease K[2] (ataxia). The original German text is presented on the left side of the table; the corresponding English translation is presented on the right

The implemented workflow is shown in Fig. 3. More details about the generated data will be presented in Sect. 4.3.

Fig. 3 Architecture of the generative AI tool

NER model for automatic disease recognition

The synthetic data is then used to train a machine learning model to automatically recognize diagnoses within a text. To this end we have fine-tuned three different language models:

  • BioBERT, a pre-trained biomedical language representation model for biomedical text mining [Footnote 15] [11],

  • BERT-Base German [Footnote 16],

  • SciBERT [Footnote 17] [26].

Contextual language models such as BERT [19] have achieved state-of-the-art results for many natural language processing (NLP) tasks. This was made possible by the underlying Transformer architecture [15], which uses a self-attention mechanism to assign different importance scores to different parts of the input sequence. Since BERT was released, a plethora of BERT-inspired models has been introduced to achieve optimal performance on different NLP benchmarks. It has been noticed that BERT models trained on a particular domain corpus, such as the biomedical domain, outperform models trained on a general-purpose corpus [11]. For our experiments, we use two biomedical domain-specific models, namely SciBERT and BioBERT [20]. Additionally, one general-purpose model, German BERT, is also fine-tuned on our datasets.

All three models are based on the BERT architecture. The main difference is the corpus used for pretraining: while SciBERT is trained on English-language biomedical research papers retrieved from Semantic Scholar, BERT-Base German [Footnote 18] and BioBERT are both trained on German-language data, i.e., BioBERT on German biomedical corpora [Footnote 19] [12, 20].

It is relevant to point out that Lentzen et al. have demonstrated that there is only a marginal difference between GottBERT and BioBERT [20], implying that a model trained on English data can also perform well on German data.

Model validation

For the evaluation of the NER models, each synthetic dataset is split into a train and a validation set. Details about the data splits are given in Tables 1 and 2 in Sect. 4.3. The resulting data is used to train and validate the models using the Transformers library [21]. Each model is trained for 5 epochs; we observed that the models converge in the initial epochs, and the best-performing model on the validation dataset is chosen at the end of the training process. Additionally, we test the best-performing model on custom annotated snippets from real doctor’s letters, i.e., brief text passages extracted from doctor’s letters.
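The following is a minimal sketch of this fine-tuning step with the Hugging Face Transformers library. Hyperparameters other than the 5 training epochs are illustrative assumptions, and the dataset preparation (tokenization and IOB label alignment) is omitted.

```python
# Minimal sketch of the fine-tuning step with the Hugging Face Transformers
# library. Hyperparameters other than the 5 training epochs are illustrative,
# and the dataset preparation (tokenization + IOB label alignment) is omitted.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["O", "B-DISEASE", "I-DISEASE"]      # single entity type: disease
model_name = "dbmdz/bert-base-german-cased"   # one of the three evaluated models

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels))

args = TrainingArguments(
    output_dir="ner-synthetic",
    num_train_epochs=5,               # as in the study
    per_device_train_batch_size=16,   # illustrative
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the best checkpoint on validation
)

# train_ds / val_ds: datasets.Dataset objects with input_ids, attention_mask
# and per-token label ids; their construction is omitted in this sketch.
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=val_ds, tokenizer=tokenizer)
trainer.train()
```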

Table 2 Distribution by style of the extracted narratives

We report precision, recall and F1-score based on exact matches of the predicted entity mentions. In addition to exact matches, we report results on two metrics (Partial, Type) which also include partial matches [25]. “Partial” focuses on the surface text overlap, whereas “Type” focuses on the entity tag overlap.
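As an illustration, these matching schemes can be computed with the nervaluate package, which implements the MUC/SemEval-style evaluation cited above. The package choice and the toy tag sequences are our assumptions; the paper does not name its evaluation tooling.

```python
# Sketch of computing Exact / Partial / Entity Type scores with nervaluate.
from nervaluate import Evaluator

true = [["O", "B-DISEASE", "I-DISEASE", "O"],
        ["B-DISEASE", "O", "O"]]
pred = [["O", "B-DISEASE", "O", "O"],   # partial span overlap
        ["B-DISEASE", "O", "O"]]        # exact match

evaluator = Evaluator(true, pred, tags=["DISEASE"], loader="list")
results, *_ = evaluator.evaluate()

for scheme in ("exact", "partial", "ent_type"):
    scores = results[scheme]
    print(scheme, scores["precision"], scores["recall"], scores["f1"])
```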

The fine-tuned models were deployed using Plotly Dash [Footnote 20]. The model deployment was performed on MS Azure.

Data

Patient data for narrative synthetization

Here are the steps we take to extract features from patient data for data synthetization:

  1. Data anonymization: Information extracted from medical documents must be kept confidential. To preserve confidentiality, we implemented robust data anonymization techniques. Using regular expressions and careful manual checks, we generated anonymized examples, validating each one to eliminate any potential leakage of patient information. This rigorous approach safeguards sensitive data and upholds the ethical standards of our research.

  2. ICD code extraction: To avoid including patient information and doctor’s letters, we extracted only ICD codes from the medical documents. This ensures that sensitive details are redacted while diagnostic codes are preserved. As a result, we can analyze label relationships without compromising patient privacy.

  3. Exploratory data analysis: Exploratory data analysis is crucial for understanding the ICD code distribution across medical documents. Through this analysis, we identify potential data imbalances that can impact the performance of machine learning models. The ICD code distribution should be reflected in our synthetic dataset.

  4. Co-occurrence frequency calculation: Calculating the co-occurrence frequency of ICD codes within the same document aids in establishing relationships between labels. Analyzing how often pairs of ICD codes appear together helps us understand the underlying relationships between diagnoses (see the sketch below).
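The co-occurrence computation in step 4 can be sketched as follows. The per-document code lists are invented placeholders, not actual hospital data; only the counting scheme is illustrated.

```python
from collections import Counter
from itertools import combinations

# Illustrative input: one list of ICD-10-GM codes per (anonymized) document.
# Codes are invented for this sketch, not actual hospital data.
documents = [
    ["C83.3", "I10.00", "E11.9"],
    ["C34.1", "I10.00"],
    ["C83.3", "I10.00"],
]

# Count how often each unordered pair of codes occurs in the same document.
co_occurrence = Counter()
for codes in documents:
    for pair in combinations(sorted(set(codes)), 2):
        co_occurrence[pair] += 1

print(co_occurrence.most_common(3))
# e.g. [(('C83.3', 'I10.00'), 2), ...]
```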

Patient data for model validation

We extracted 46 snippets of a maximum of 500 tokens from doctor’s letters that had been fully anonymized, i.e., the patient’s name, names of family members, names of health institutions and hospitals, addresses and phone numbers were completely removed. Patient movement dates, like admission and discharge dates, were altered accordingly.

Because we evaluate only snippets, and not full reports, we do not store any meta-information about the real patients. These snippets are short excerpts from the full narratives, mostly extracted from the physical examination or the oncological course (see Table 2).

It is important to note that the extracted narratives are rather technical and objective, containing information about physical examination findings and the course of therapy; these narratives do not contain subjective content (such as symptoms).

We annotated only the diseases in the 46 snippets, using doccano [Footnote 21], an open-source program for annotating data for machine learning.

Synthetic data

For PoC1, we constrained the generated text to ICD diagnoses in oncology. The aim is then to use the fine-tuned model for tasks like automatic disease recognition and ICD mapping. The response to P1 above is one sample from our dataset. In total, 6255 synthetic narratives were created.

For the generation of the first data sample (PoC1, based on P1), ChatGPT rephrased the ICD wording instead of mentioning the ICD code directly, which makes automatic labeling difficult. ChatGPT did not preserve the exact order of the words even for ICD labels of fewer than four words. For instance, the following is the result when the German ICD name is “Rickettsienpocken durch Rickettsia akari” (this disease is not very common; we selected this condition because it is challenging from an idiomatic perspective):

G: Der Patient präsentiert Symptome von Rickettsienpocken aufgrund einer Infektion mit Rickettsia akari, einschließlich Fieber, Hautausschlag und stark juckenden Flecken. Die Diagnose wurde durch körperliche Untersuchung und Bluttests gestellt. Hauptsymptome: Fieber, Hautausschlag, juckende Flecken.

E: “The patient presents with symptoms of rickettsialpox due to infection with Rickettsia akari, including fever, rash, and severely itchy patches. The diagnosis was made by physical examination and blood tests. Main symptoms: fever, rash, itchy spots.”

In this example we observe that the corresponding ICD (A79.1 Rickettsienpocken durch Rickettsia akari) is not directly mentioned in the text and is therefore difficult to annotate. In the prompt formulation we tried instructions like “write the disease description as it is”; however, the results did not differ significantly from those obtained using P1 and P2.

Thus, we label the text passages that correspond to specific ICDs by ignoring the last four characters of every word in the ICD name, using regular expressions. Observe that the ICD-10-GM catalogue (provided by the “BARMER Institut für Gesundheitssystemforschung” [Footnote 22]) contains 16,658 diseases, 10,369 of which have names of more than four words. We restricted the number of samples in this proof of concept to 6255 diseases for PoC1; these are disease descriptions of fewer than four words. 3236 entity descriptions were directly mentioned inside the synthetic text and were therefore annotated first. The annotated samples increase to 4379 when inflections are also handled, as discussed in Sect. 3.3 (see Table 3). Observe that 8.65% of the synthetic data is used for a first model validation.

Table 3 Parameters of synthetic data, use case 2, PoC1

For the generation of the second data sample (PoC2, P2) we used the data reported in Figs. 2 and 4 in order to provide a realistic statistical distribution for the implemented prompts. For the entity annotation, we use direct matching against the ICD Thesaurus. Similarly to PoC1, we also try to resolve inflections in the text. Our direct matching and inflection handling yielded 2358 annotated samples (disease descriptions of fewer than three words). The data statistics of PoC2 are shown in Table 4.

Fig. 4 Distribution of comorbidities for ICD code C34.3, malignant neoplasm of the lower lobe, bronchus

Table 4 Parameters of synthetic data, use case 2, PoC2a 

Table 2 shows that the majority of patient narratives are drawn from the physical examination and the course of therapy; they therefore do not contain subjective information about the patient. This is why we also implemented a slightly modified version of the prompt asking for the generation of a physical examination report (instead of a vignette).

Since the style and content of the physical examination do not significantly differ from the oncological course, course assessment or therapy, we opted to use “examination report” as the single style:

G: Erstellen Sie einen körperlichen Untersuchungsbefund über einen RA(f) Patienten mit K[i](f) als Hauptdiagnose und K[j](f) als Nebendiagnose. (P2-1)

E: Create a physical examination report about a RA(f) patient with K[i](f) as the main diagnosis and K[j](f) as the secondary diagnosis.

Observe that the prompt is similar to P2, but we have replaced the word “Fallvignette” (case vignette) with “körperlicher Untersuchungsbefund” (physical examination report). With P2 and P2-1 we can then test the effect of the style on the quality of the generated narratives.

The ChatGPT API accepts a temperature parameter \(T\) ranging from 0 (deterministic) to 2 (random). For \(T=0\) the generated narratives were fully deterministic and showed no remarkable variability, and were therefore not useful for model training. On the other hand, for \(T=2\) the generated narratives were extremely random and thus meaningless. We therefore performed all our experiments with \(T=1\), which is the default parameter.

To assess the quality of the synthetic narratives, we evaluate the readability of the text [22]. This evaluation can be performed based on its content (the complexity of its vocabulary and syntax). To this end we implemented the second Wiener readability formula (Wiener Sachtextformel), which for German is defined as [Footnote 23] [23, 24]:

$$WSTF_{2} = 0.2007 \cdot MS + 0.1682 \cdot SL + 0.1373 \cdot IW - 2.779 \quad (1)$$

where \(MS\) is the percentage of words with three or more syllables, \(SL\) is the average sentence length, and \(IW\) is the percentage of words with more than six letters. We obtained the following Wiener \(WSTF_{2}\) scores (see Table 5).

Table 5 Wiener scores evaluating the readability of the synthetic narratives with respect to the real narratives

The range of this score is from 4 to 15; as the score increases, the narrative becomes more difficult to read. The obtained scores are consistent with the technical character of the narratives, which are technical descriptions of the patient’s condition. Additionally, we observe that the P2 narratives are more readable than the original narratives, while the P1 narratives are more complex. A sketch of this readability computation follows.
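The sketch below implements Eq. (1). The vowel-group syllable counter is a common heuristic for German and is our simplification, not necessarily the authors' implementation.

```python
import re


def wstf2(text: str) -> float:
    """Second Wiener Sachtextformel (Eq. 1). Syllables are approximated by
    counting vowel groups, a common heuristic for German; this is our
    simplification, not necessarily the authors' implementation."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)

    def syllables(word: str) -> int:
        return max(1, len(re.findall(r"[aeiouäöüy]+", word.lower())))

    ms = 100 * sum(syllables(w) >= 3 for w in words) / len(words)  # MS
    sl = len(words) / len(sentences)                               # SL
    iw = 100 * sum(len(w) > 6 for w in words) / len(words)         # IW
    return 0.2007 * ms + 0.1682 * sl + 0.1373 * iw - 2.779


sample = ("Der Patient präsentiert Symptome von Rickettsienpocken "
          "aufgrund einer Infektion mit Rickettsia akari.")
print(round(wstf2(sample), 2))
```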

Results

For the first dataset, generated using P1, we performed the request considering that the narrative can be generated in the style of an electronic health record (EHR).

For the validation on real clinical narratives, we used snippets extracted from real patient data. Since exact matching is too simplistic and ignores partial matches, we report our validation results using “Exact”, “Partial” and “Type” matching, which are defined as follows (Table 6) [25]:

Table 6 Different kinds of validation methods implemented in this investigation

In this framework, entity type refers to the category in which an entity falls. The focus of this study is solely on the type “disease” [Footnote 24].

For PoC1 the validation results are shown in Table 7.

Table 7 Validation values of PoC1 (P1) on synthetic data. In this validation we tested the different models using the different validation techniques

Since we are validating a fine-tuned model, and not a general-purpose model, this is a measure of how often the model correctly recognizes this single mention. The results obtained by the models do not differ significantly. However, the difference in scores between the exact and partial metrics (Partial/Entity Type) indicates that a considerable number of entity mentions are annotated only partially by the models. Moreover, language models that are not specifically designed for German, like dmis-lab/biobert-v1.1, perform similarly to German BERT. Similarly, we validated the models using the data generated by PoC2 (P2) and report the results in Table 8. It can be observed that there is no significant difference between the models’ performance.

Table 8 Validation values, use case 2, PoC2 on synthetic data

At the same time, we observe that BERT-Base German can detect more entities than the other models, and its recall score is therefore higher (see Fig. 5 for the comparison of model performance for Entity Type validation only). It is interesting to note that PoC1 outperforms PoC2, which may indicate that identifying two entities is more difficult than identifying one. Additionally, we observe that BERT-Base German and SciBERT are highly performant models in both cases.

Fig. 5 Comparison of the Entity Type validation on synthetic data (F1 is the F1 score, R is recall and P is precision) for PoC1 (A) and PoC2 (B)

However, for this validation we have the following remarks:

  • Validation can only be performed with texts of at most 512 tokens (sub-words), which is the input limit for BERT.

  • Since recognized entities are matched against any ICD description (not only the one(s) that was/were annotated), there remains a high probability of obtaining false positives with the trained model.

  • Because it has not been trained with negated diagnoses, the current model is limited in its ability to ignore such negated diagnoses.

Since the validation sets for PoC1 (P1) and PoC2 (P2) are different, the results of the two models cannot really be compared.

Thus, after validating the different models using exclusively synthetic data, we validated the fine-tuned models again on real data. Results for the P1-fine-tuned models are reported in Table 9.

Table 9 Validation on real narratives with fine-tuned models using narratives generated with P1

Table 10 shows the results obtained for the P2-fine-tuned models. The results show that the model pretrained on English data, namely SciBERT, outperformed the other models trained on German-language data (in terms of F1 and recall score for both exact and partial matches), while BioGottBERT performed best in precision. Further, it can be observed that we obtained a higher precision score especially when the data were validated using Entity Type validation, with a precision of 0.85 for SCAI-BIO/bio-gottbert. Despite this relatively high precision, the models trained on synthetic narratives have difficulty recognizing biomedical entities in real data (low recall values across all models). In this sense, we have trained cautious models.

Table 10 Validation on real narratives with fine-tuned models using narratives generated with P2

In general, we observe that P2-fine-tuned models outperform P1-fine-tuned models in tests with real data. There are two main changes between P1 and P2: while in P1 we explicitly ask to reproduce a narrative in a health record style, in P2 we explicitly constrain the narrative style and add metadata like the patient’s age. This indicates, first, that recognizing two entities works better than identifying a single entity, and second, that specifying the narrative’s style in relation to the target data is fundamental to obtaining synthetic narratives that are reliable for training.

Looking for the most suitable models across the P1 and P2 datasets, BioGottBERT achieves the highest precision, with 0.55 for P1 and 0.85 for P2. With BioGottBERT we obtained the largest F1 score for P1, while with SciBERT we obtained the largest F1 score for P2 (0.152, see Fig. 6). Although the metric scores are relatively low, the current results indicate the utility of our proposed method.

Fig. 6 Comparison of the Entity Type validation on real data (F1 is the F1 score, R is recall and P is precision) for P1 (A) and P2 (B)

We repeated the previous validation with data generated using the P2-1 prompt, i.e., considering a style much closer to that of the reference narratives (validation results reported in Table 11).

Table 11 Validation on real narratives with fine-tuned models using narratives generated through P2-1

As discussed in Sect. 4.3, gender can be problematic, since it is closely related to certain diseases. Despite this, we performed a test to evaluate the effect of gender on data synthetization using the P2-1 prompt (P2-1+G, see Table 12).

Table 12 Validation on real narratives with fine-tuned models using narratives generated through P2-1 and considering gender (P2-1 + G)

We observe that the overall validation results, especially the F1 values, improved with the P2-1 prompt. In light of this, it is essential that the prompt asks the right questions in order to generate the right narrative style; some knowledge of the target narrative’s style is therefore necessary when formulating the prompt. It should be noted, however, that a general review of the results confirms that it can be challenging to consider gender without a detailed understanding of the patient’s etiology. When prompts include the patient’s gender, this may reduce the quality of the synthesized narratives.

Despite this, recall as well as precision improved for all the models except SCAI-BIO/BioGottBERT-base, whose precision deteriorated from 0.85 to 0.78 with respect to the P2 prompt. It is also interesting to note that bert-base-german-cased improved, which may indicate the effect of the reporting style on the model’s ability to recognize items, especially for models trained on a specific language. Finally, it is interesting to observe that allenai/scibert_scivocab_uncased, a model that was not initially trained on the German language, in general outperforms the other models, perhaps because it better captures the technical and scientific character of the narratives (as reported in Table 5; see Fig. 7) [26].

Fig. 7 Final Entity Type validation results for narratives generated using P2-1. In test A, prompts consider only age (P2-1). In test B, prompts consider both age and the patient’s gender (P2-1+G)

Discussion

In this proof of concept, we implemented a method using a generative AI tool (ChatGPT) with real population data for disease distribution to synthesize medical narratives. These narratives have been annotated and used for model fine tuning. Our main strategy is to demonstrate the quality of text synthetization by comparing the performance of language models trained on this synthetic data.

As a use case we have generated narratives describing cancer patients. We selected this disease group due to the inherent complexity of its description, which covers not only the disease itself but also its anatomical localization and the estimation of its severity.

Although we generated several narratives with an acceptable structure, and the validation with real narratives delivered an acceptable precision (0.85 for models trained with P2 prompts), we have observed that this method still has several serious limitations:

  a. Simplistic data: data synthetization is not a perfect solution, since it tends to generate rather regular texts that do not contain all the nuances of real narratives, even though we adjusted the text generation temperature.

  b. Missing details: real medical narratives are composed using more precise descriptions, for instance considering biomarkers and symptom levels. Furthermore, some of the synthesized narratives have a very naïve style, which does not reflect the level of professional competence required in the medical field.

  c. Missing linguistic diversity: the heterogeneity of writing styles and layout formats in real medical narratives also cannot be fully reproduced with AI generation.

These limitations imply that model training based solely on synthetic narratives should only be performed for initial model fine-tuning. Otherwise, the application of this method, which intrinsically cannot generate all the context found in real narratives, risks leading to model reinforcement.

Also, the nuances of the language influence the soundness and quality of the synthesized narratives. For instance, in German the disease can appear at any position in the paragraph, which should make the model more robust and generalizable.

Synthetic data is a single snapshot that can evolve as additional data, for instance the disease distribution of the patient population, is added to the synthetic patient population. The disease distribution also reflects interregional differences. Thus, by introducing specific disease and co-disease distributions in the P2 and P2-1 prompts (see e.g. Figs. 2 and 4), we can generate narratives that reflect specific regional differences in the distribution of diseases, capturing not only specific contextual differences but also different ways a disease is treated.

The prompt requires narratives to be defined in accordance with reference narratives. A rather general prompt (for example, generating a vignette) resulted in lower validation results than prompts oriented towards the style of the reference narratives (in this study, therapy courses). Thus, we think that the proposed methodology is well suited to reproducing technical narratives, allowing the inclusion of stored data (like ICDs) in the prompt to generate new narratives.

This implies that an important aspect is the specific character of the narrative’s style: while in this study we analyzed and validated mainly technical narratives, we anticipate that validation results will deteriorate as we attempt to synthesize narratives containing more subjective information, such as symptoms. This is because of generative AI’s tendency to produce hallucinations, i.e., data that seems plausible but is contextually wrong.

As mentioned in Appendix 2, we suggest a possible solution where knowledge databases, such as UMLS [Footnote 25] or WikiData [27], can provide associated symptoms for the selected diseases. Nevertheless, this presents a challenging problem, since a given disease may present several plausible symptoms, and the selection of these symptoms depends on the information provided by the patient or the correlation with other conditions. Thus, the correct inclusion of symptoms in synthetic narratives, under the high data protection criteria outlined in this article, requires additional research.

Additionally, when models are validated against real narratives, recall is generally lower than precision. In other words, we trained cautious models, i.e., models that hesitate to identify a particular entity mention; but once an entity had been identified, precision was high. In P2 and P2-1, the co-disease correlation can lead the generated text to include diseases other than those specified in the prompt. Even though these additional diseases are not necessarily hallucinations (they are generated from the data that were used to train these language models), they can blur the disease detection capabilities of the language model.

Finally, the way data annotation is performed naturally impacts model validation. We have observed, for instance, that a method to annotate synthetically generated entity mentions of more than three tokens is required. A further disadvantage of using ICD descriptions for disease annotation is that there are no headers for the disease descriptions; this implies that entities that are not real diseases, like procedures, get annotated as diseases, reducing the accuracy of the trained model.

In such a case, it is reasonable to consider more sophisticated methods for recognizing regular expressions of various lengths in the text, able to match plausible strings against the disease catalogue, as well as using, for instance, the MeSH classification (instead of the ICD classification) to more accurately identify the entities in the text.

Conclusions

The synthetization of medical narratives using generative AI tools is a way to mask and provide context to data sampled from patient cohorts, protecting real patient data, augmenting available data, and guaranteeing data interoperability, not only between institutions but also across regions (with different languages). By adding a light narrative context to statistical data collected from a patient population, the goal is to reproduce a synthetic patient population.

As part of our use case, we used ChatGPT to develop generative AI tools for identifying individual diagnoses. These tools generate medical narratives based on the disease distribution among patient cohorts, synthesize data, and allow models to be explored and trained without revealing patient data. We would coin this method, also used to annotate data for supervised machine learning, the “aluminum standard”, by analogy with the “silver standard” (automatically annotated data whose annotations have not been proofread by a human) and the “gold standard” (human-annotated data).

We aim to provide a method for generating synthetic, annotated unstructured data by utilizing statistical data from the patient population, without disclosing any personal information about individual patients. Moreover, the way in which the diseases are described in the prompt is based on standard definitions; as a result, the hospital’s specific style of writing about the disease is also protected. By doing so, the synthesized narratives are safe and cannot be reverse engineered to identify individual patients.

In this way, the problem of training models for languages other than English may be alleviated. This also provides an opportunity for data to be interoperable between different regions: since co-disease distributions can be assessed in any health institution, this data can assist in creating safe databases of high-quality synthetic patients for different regions, representing an alternative to initiatives such as the European Health Dataspace [28] for accessing patient information. Finally, the trained language model should be able to assist in writing discharge letters for any combination of main and co-disease, reducing the time healthcare professionals spend writing these letters.

Data availability

Data is provided within the manuscript or supplementary information files.

Notes

  1. https://huggingface.co/blog/2023-in-llms.

  2. By the end of 2024, other models had been published, like GPT-4.0, Claude 3.5, Grok-1, Mistral 7B and PaLM 2. For a comprehensive list see for instance https://artificialanalysis.ai/leaderboards/models.

  3. https://www.bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-021-01395-z; https://mimic.mit.edu/docs/gettingstarted/.

  4. https://www.disease-ontology.org/.

  5. https://techletters.medium.com/16-ai-text-generation-tools-e9f4989dc0c6.

  6. https://openai.com/blog/chatgpt.

  7. For instance, using regex. See e.g. https://medium.com/@lee_vaughan/let-chatgpt-write-your-regex-99d1751cb88.

  8. Source: https://www.bfarm.de.

  9. A case vignette is a narrative outlining the salient features of a patient’s case. It differs in style from a conventional letter in which two physicians communicate about a patient’s condition.

  10. https://www.thelancet.com/journals/landig/article/PIIS2589-7500(23)00048-1/fulltext.

  11. The regular expression can also be added. For instance, the regex for ICD disease name “Sonstige Salmonelleninfektionen” is ‘(?:Sonsti[a-z]{0,3}\\s? )(?:Salmonelleninfektion[a-z]{0,3}\\s? )’ and it matched the mention in the text “sonstige Salmonelleninfektion”.

  12. Annotation can also be performed using ChatGPT. However, there is no gold standard that could help validate data annotated using this model.

  13. https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging).

  14. https://towardsdatascience.com/easy-fine-tuning-of-transformers-for-named-entity-recognition-d72f2b5340e3.

  15. https://github.com/google-research/bert.

  16. Repository at Hugging Face: https://huggingface.co/dbmdz/bert-base-german-cased.

  17. https://huggingface.co/allenai/scibert_scivocab_uncased.

  18. German BERT was trained using 12GB of raw text data based on German Wikipedia (6GB), the OpenLegalData dump (2.4GB) and news articles (3.6GB).

  19. BRONCO150, CLEF eHealth 2019 Task 1, GGPONC, and JSynCC.

  20. https://plotly.com/dash/.

  21. https://doccano.github.io/doccano/.

  22. https://www.bifg.de/daten-und-analysen/klassifikationen-icd-ops-drg-ebm-morbirsa.

  23. https://de.wikipedia.org/wiki/Lesbarkeitsindex#cite_note-7.

  24. Entity Type validation is a validation technique that is closer to the way a human would validate these results, since cognitive capabilities allow humans to look for plausible entities describing a diagnosis as well as possible, without constraining themselves to strict token matching.

  25. https://www.nlm.nih.gov/research/umls/index.html.

Abbreviations

ICD: International Classification of Diseases

NER: Named Entity Recognition

NEL: Named Entity Linking

AI: Artificial Intelligence

PoC: Proof of Concept

References

  1. The AI writing on the wall. Nat Mach Intell. 2023;5(1):1. https://doi.org/10.1038/s42256-023-00613-9.

  2. BigScience Workshop, et al. BLOOM: a 176B-parameter open-access multilingual language model. arXiv. 2023. https://doi.org/10.48550/arXiv.2211.05100.

  3. Zhang S, et al. OPT: open pre-trained transformer language models. arXiv. 2022. https://doi.org/10.48550/arXiv.2205.01068.

  4. Stiennon N, et al. Learning to summarize with human feedback. In: Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2020:3008–21. Available: https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html. Accessed 15 Jun 2023.

  5. Li Y, Li Z, Zhang K, Dan R, Zhang Y. ChatDoctor: a medical chat model fine-tuned on LLaMA model using medical domain knowledge. arXiv. 2023. https://doi.org/10.48550/arXiv.2303.14070.

  6. Ali SR, Dobbs TD, Hutchings HA, Whitaker IS. Using ChatGPT to write patient clinic letters. Lancet Digit Health. 2023;5(4):e179–81. https://doi.org/10.1016/S2589-7500(23)00048-1.

  7. Lu Y, Shen M, Wang H, Wei W. Machine learning for synthetic data generation: a review. arXiv. 2023. https://doi.org/10.48550/arXiv.2302.04062.

  8. Campillos-Llanos L, Valverde-Mateos A, Capllonch-Carrión A, Moreno-Sandoval A. A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine. BMC Med Inform Decis Mak. 2021;21(1):69. https://doi.org/10.1186/s12911-021-01395-z.

  9. Johnson AEW, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3(1):1. https://doi.org/10.1038/sdata.2016.35.

  10. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14(1):106. https://doi.org/10.1186/1471-2105-14-106.

  11. Lee J, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40. https://doi.org/10.1093/bioinformatics/btz682.

  12. Scheible R, Thomczyk F, Tippmann P, Jaravine V, Boeker M. GottBERT: a pure German language model. arXiv. 2020. https://doi.org/10.48550/arXiv.2012.02110.

  13. Crawford K. Atlas of AI: power, politics, and the planetary costs of artificial intelligence. New Haven/London: Yale University Press; 2022.

  14. Goldstein JA, Sastry G, Musser M, DiResta R, Gentzel M, Sedova K. Generative language models and automated influence operations: emerging threats and potential mitigations. arXiv. 2023. https://doi.org/10.48550/arXiv.2301.04246.

  15. Vaswani A, et al. Attention is all you need. arXiv. 2017. https://doi.org/10.48550/arXiv.1706.03762.

  16. Kathiresan J, Patro BK. Case vignette: a promising complement to clinical case presentations in teaching. Educ Health Abingdon Engl. 2013;26(1):21–4. https://doi.org/10.4103/1357-6283.112796.

  17. Nini A. A theory of linguistic individuality for authorship analysis. Elem Forensic Linguist. 2023. https://doi.org/10.1017/9781108974851.

  18. Ramshaw LA, Marcus MP. Text chunking using transformation-based learning. arXiv. 1995. https://doi.org/10.48550/arXiv.cmp-lg/9505040.

  19. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. 2019. https://doi.org/10.48550/arXiv.1810.04805.

  20. Lentzen M, et al. Critical assessment of transformer-based AI models for German clinical notes. JAMIA Open. 2022;5(4):ooac087. https://doi.org/10.1093/jamiaopen/ooac087.

  21. Wolf T, et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics; 2020:38–45. https://doi.org/10.18653/v1/2020.emnlp-demos.6.

  22. Immel KA. Verständlichkeit messen? In: Immel KA, editor. Regionalnachrichten im Hörfunk: Verständlich Schreiben für Radiohörer. Wiesbaden: Springer Fachmedien; 2014:17–9. https://doi.org/10.1007/978-3-658-04893-8_5.

  23. Wichter S, Busch A. Wissenstransfer – Erfolgskontrolle und Rückmeldungen aus der Praxis. 1st ed. Frankfurt am Main: Peter Lang GmbH, Internationaler Verlag der Wissenschaften; 2006.

  24. Mustafa FE. DEExtract: a customizable context-based German vocabulary learning tool. In: 2023 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT); 2023:65–9. https://doi.org/10.1109/3ICT60104.2023.10391702.

  25. Chinchor N, Sundheim B. MUC-5 evaluation metrics. In: Fifth Message Understanding Conference (MUC-5): Proceedings of a Conference Held in Baltimore, Maryland, August 25–27, 1993. Available: https://aclanthology.org/M93-1007. Accessed 20 Jul 2023.

  26. Beltagy I, Lo K, Cohan A. SciBERT: a pretrained language model for scientific text. arXiv. 2019. https://doi.org/10.48550/arXiv.1903.10676.

  27. Wang Y, Dima C, Staab S. WikiMed-DE: constructing a silver-standard dataset for German biomedical entity linking using Wikipedia and Wikidata. Presented at: The 4th Wikidata Workshop; 2023. Available: https://openreview.net/forum?id=5dQ7YDSYya. Accessed 23 Feb 2024.

  28. Schmitt T, Cosgrove S, Pajić V, Papadopoulos K, Gille F. What does it take to create a European Health Data Space? International commitments and national realities. Z Für Evidenz Fortbild Qual Im Gesundheitswesen. 2023;179:1–7. https://doi.org/10.1016/j.zefq.2023.03.011.

Acknowledgements

The authors would like to express their gratitude to Corina Dima for fruitful discussions and valuable feedback during the preparation of this article. JGDO would also like to thank Simone Neumaier, Antje Jensch, and Susanne Walz for their valuable input and constant technical support in preparing synthetic narratives from real patient data.

Funding

This project was supported by the Ministry for Economics, Labor and Tourism of Baden-Württemberg, Germany, via grant agreement number BW1_1456 (AI4MedCode). The funding bodies played no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.

Author information

Contributions

JGDO and FM provided the original fundamental idea and developed it together with FW. JGDO and FM implemented the first prototype and refined the idea with all the coauthors. YW extracted the co-occurrence matrix and prepared the first data from the hospital. MK provided medical advice during the development of the concept and the prompts. FM implemented the pipeline. KK provided the medical prompts. JGDO wrote the first draft of the manuscript. All the authors contributed to the final version of this manuscript. The conception of this investigation and the writing process were performed entirely by the authors; no generative AI tool was employed in the creative process.

Corresponding author

Correspondence to Juan G. Diaz Ochoa.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the Ethics Committee at the Baden-Württemberg State Medical Association (Ethik-Kommission bei der Landesärztekammer Baden-Württemberg) under approval number F-2023-125. The study was performed in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects and with the research regulations of the country. The informed consent requirement was waived under the state regulations of the Landeskrankenhausgesetz Baden-Württemberg (LKHG), owing to the retrospective, aggregated, and anonymized nature of this study and its database.

Consent for publication

Not applicable.

Competing interests

Felix Weil, Faizan E. Mustafa, and Juan G. Diaz Ochoa are employed by QuiBiQ GmbH. All other authors have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Faizan E. Mustafa's current affiliation is NEC Laboratories.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Diaz Ochoa, J.G., Mustafa, F.E., Weil, F. et al. The aluminum standard: using generative Artificial Intelligence tools to synthesize and annotate non-structured patient data. BMC Med Inform Decis Mak 24, 409 (2024). https://doi.org/10.1186/s12911-024-02825-4


Keywords