Transformer models in biomedicine
BMC Medical Informatics and Decision Making volume 24, Article number: 214 (2024)
Abstract
Deep neural networks (DNNs) have fundamentally revolutionized the artificial intelligence (AI) field. The transformer model is a type of DNN that was originally developed for natural language processing tasks and has since gained increasing attention for processing various kinds of sequential data, including biological sequences and structured electronic health records. Along with this development, transformer-based models such as BioBERT, MedBERT, and MassGenie have been trained and deployed by researchers to answer various scientific questions originating in the biomedical domain. In this paper, we review the development and application of transformer models for analyzing various biomedical datasets, including biomedical textual data, protein sequences, medical structured-longitudinal data, and biomedical images as well as graphs. We also look at explainable AI strategies that help to comprehend the predictions of transformer-based models. Finally, we discuss the limitations and challenges of current models and point out emerging novel research directions.
Introduction
The transformer [1] is a well-known deep neural network (DNN) model that has revolutionized the artificial intelligence (AI) field. The transformer architecture builds the backbone of large language models (LLMs), enabling them to harness vast amounts of data to gain a more profound understanding of the underlying information. The architecture was initially developed for comprehending natural language, which it achieves by analyzing every input sentence and capturing the context of each word through attending to the other words. Generic LLMs have brought significant advancements to various natural language processing (NLP) tasks, ranging from machine translation and text generation to question answering. The most common examples of generic LLMs include the Generative Pre-trained Transformer (GPT) [2], Bidirectional Encoder Representations from Transformers (BERT) [3], Large Language Model Meta AI (LLaMA) [4, 5], and the BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) [6].
The success of transformer-based models can be attributed to the self-attention mechanism, the integrated encoder-decoder architecture, and a scalable, modular structure. These characteristics allow them to learn effective representations of the underlying data, encode long-range dependencies, and process huge amounts of data efficiently. The basic building block of the transformer is the self-attention mechanism [1, 7]. This mechanism allows the model to learn complex sequence representations by incorporating, or attending to, information throughout the other parts of the same sequence. Equally important is the encoder-decoder structure of the transformer, where both components comprise multiple layers and variants of the self-attention mechanism (Fig. 1). This type of architecture facilitates sequence-to-sequence learning; therefore, transformers were originally used to solve the machine translation problem (e.g., translation from English to German). Encoder-only architectures (for instance, utilized in BERT) can be used for classification and understanding tasks [3], whereas decoder-only architectures (such as GPT, LLaMA, and BLOOM) are used for generative tasks [8, 9]. Furthermore, the modular and scalable architecture of the transformer allows stacking encoder and decoder blocks on top of each other, which substantially increases the capacity of the model. By processing huge amounts of data with larger models, the performance of transformers has been significantly increased on various tasks [8].
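To make the mechanism concrete, the following minimal sketch implements scaled dot-product self-attention for a single head in PyTorch. The dimensions, random weights, and function name are illustrative assumptions, not code from any of the cited models.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal scaled dot-product self-attention over a token sequence.

    x: (seq_len, d_model) input embeddings; w_q/w_k/w_v: (d_model, d_k)
    projection matrices. Returns one attended representation per token.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # project inputs to queries/keys/values
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise similarities, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)      # each token attends to every other token
    return weights @ v                       # weighted sum of value vectors

# Toy example: 5 tokens with 16-dimensional embeddings, d_k = 8.
torch.manual_seed(0)
x = torch.randn(5, 16)
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 8])
```

In a full transformer, several such heads run in parallel (multi-head attention), and their outputs are concatenated and projected back to the model dimension.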
Fig. 1 The transformer architecture with its self-attention mechanism. Original transformer images by https://github.com/dvgodoy/dl-visuals / CC BY 4.0
Transformer-based models such as BERT or GPT apply a two-step process to understand the data provided to them and to handle various downstream tasks. In a pre-training phase, they leverage abundant unlabeled data to learn, in a self-supervised manner, a general representation of the underlying objects through an embedding model. Unlabeled data, characterized by the absence of labels or tags, is widely available. For instance, the web contains a vast amount of textual content in the form of web pages, blogs, and forums that is not categorized or labeled. In contrast, labeled datasets contain data that have been annotated with specific labels or categories, such as the label “gene” in the case of biomedical texts. Due to the manual annotation process, obtaining labeled data is often more challenging and time-consuming compared to unlabeled data. In the fine-tuning phase, the pre-trained general representation model is used to train a supervised, use case-specific task model using the limited labeled data. Over time, transformer models have been applied successfully beyond language to process other modalities, bringing significant advancements to speech processing, computer vision (CV), and many other areas.
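As an illustration of the fine-tuning phase, the hedged sketch below uses the Hugging Face transformers library to attach a new classification head to a pre-trained biomedical encoder and run a single training step. The checkpoint name, label count, example sentence, and hyperparameters are assumptions chosen for demonstration only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained domain-specific encoder; the checkpoint name is an
# assumption and should be replaced with the model actually used.
name = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=2  # new, randomly initialized task-specific head
)

# A single labeled example standing in for the (small) fine-tuning dataset.
inputs = tokenizer("Metformin reduces hepatic glucose production.",
                   return_tensors="pt")
labels = torch.tensor([1])

# One gradient step; in practice this loop runs over the whole labeled set.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
optimizer.zero_grad()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```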
Transformers are now in the spotlight of many areas of biomedicine-related AI research. They have proven instrumental in addressing diverse biomedical questions, facilitating the analysis of data modalities ranging from biomedical literature to complex imaging and genetic information. The field is progressing at a pace that is difficult to grasp and therefore requires a thorough survey; to our knowledge, such a thorough review is missing so far, and our paper tries to fill this gap. In the following, we highlight and discuss transformer-based models in five application fields (Fig. 2): 1) biomedical natural language processing (including biomedical literature, clinical notes, and social media text), 2) biological sequences (including protein sequences), 3) structured-longitudinal electronic health records (EHR), 4) biomedical images, and 5) biomedical graphs. We also introduce some studies that have pursued learning on multiple modalities jointly. Finally, we discuss methods to make transformer-based predictions explainable, and we conclude by providing a prospect for future research.
Fig. 2 Application fields of transformers in biomedicine. Transformer image by https://github.com/dvgodoy/dl-visuals / CC BY 4.0
Table 1 provides a glossary of AI concepts discussed in this work. The mathematical details of transformers are not elaborated on here; instead, we refer readers to [10, 11].
Biomedical natural language processing
Domain-specific transformers
Transformer-based models have made major strides in the biomedical NLP field, largely through adapting general language models to the biomedical domain by pre-training on huge publicly available biomedical corpora, including documents from databases such as PubMed, PubMed Central (PMC), and the Medical Information Mart for Intensive Care-III (MIMIC-III) [12, 13]. Studies introducing domain-specific language models often follow a familiar pattern: choosing a specific transformer-based architecture, initializing it with random weights or the weights of a general language model, pre-training the initialized model on domain-specific corpora with one or more training objectives, and finally evaluating models of different sizes on various biomedical downstream tasks.
For instance, BioBERT, which is initialized with the weights of the general English BERT model [3], is a domain-specific model further pre-trained on PubMed abstracts and PMC full-text documents [14]. BioBERT was fine-tuned for various downstream biomedical NLP tasks and achieved new state-of-the-art performance for named entity recognition (NER), question answering (QA), and relation extraction (RE). Further studies have introduced various biomedical pre-trained models using different transformer-based architectures such as ELECTRA [15], RoBERTa [16], and GPT (GPT-1, GPT-2, GPT-3) [2, 9, 17]. Furthermore, BERT variants have been pre-trained on different types of biomedical corpora; see Table 2 for an overview.
Finally, different efforts have been made to develop language-specific transformer variants for biomedical texts in different regions of the world (Table 2). Examples include Bio-GottBERT [18], CamemBERT-bio [19], and KM-BERT [20], dedicated to German, French, and Korean, respectively. Notably, the main limitation of these efforts is often the limited availability of language-specific data.
Applications to document and topic classification
Document and topic classification are typical NLP downstream tasks to which pre-trained transformer models have been applied in biomedicine. During the Coronavirus disease 2019 (COVID-19) pandemic, a new search engine, LitCovid [34], was introduced by the United States National Library of Medicine (NLM), which provides an overview of the latest COVID-19 literature and allows users to filter it by categories such as case reports, mechanism, prevention, or diagnosis. The literature was initially classified manually by the creators; at a later stage, however, experiments with transformer-based models like BioBERT and PubMedBERT showed that categories could be assigned to new literature automatically with high performance (F1-score of approx. 94%) [35]. CO-Search is another COVID-19 search engine, which used a Siamese-BERT-based document retrieval engine with a strong evaluation performance [36]. Nentidis et al. [37] report results of a semantic indexing challenge in which the best participating system utilized BERT and BERTMeSH [38] models.
Applications to Named Entity Recognition (NER) and linking (NEL)
After identifying relevant documents for a certain topic, one is often interested in finding hidden but valuable biomedical concepts inside them. NER and named entity linking (NEL) tasks are specifically designed to extract these relevant concepts and link them to biological databases. Such concepts appear in various areas of biomedicine, ranging from molecular biology (genes, proteins, microRNAs, biological functions, and cellular components) to the clinical domain (medications/drugs, adverse drug reactions, diagnoses, and diseases). For instance, the sentence “Apolipoprotein E: Structural Insights and Links to Alzheimer Disease Pathogenesis” (PMID:33176118) contains a mention of the protein Apolipoprotein E, the disease Alzheimer disease, and the biological process Pathogenesis, which can be linked to the UniProt term APOE_HUMAN (ID: P02649), the disease ontology term Alzheimer’s disease (DOID:10652), and the National Cancer Institute Thesaurus (NCIT) term Pathogenesis (NCIT:C18264), respectively. For NER, the majority of studies have treated this task as a sequence labeling task (Table 1), using BERT-based models to predict a label for each token in a sequence. NER has also been formulated as a machine reading comprehension task (Table 1), which allows easy integration of prior knowledge into models [39].
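The sketch below illustrates the sequence labeling formulation: a token classification head on top of a pre-trained encoder predicts one BIO label per sub-word token. The checkpoint name, the three-label scheme, and the example sentence are illustrative assumptions; a real system would first fine-tune the head on annotated data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label scheme for a single entity type.
labels = ["O", "B-Protein", "I-Protein"]

name = "dmis-lab/biobert-v1.1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(
    name, num_labels=len(labels)
)

inputs = tokenizer("Apolipoprotein E binds amyloid beta.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)

# One predicted BIO label per sub-word token (untrained head: random output).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, idx in zip(tokens, logits.argmax(-1)[0]):
    print(f"{token:15s} {labels[int(idx)]}")
```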
Most authors fine-tune domain-specific transformer models, such as BioBERT, to detect one specific entity type, for example drugs or genes [14]. However, multi-task learning strategies have also been proposed to detect chemical and disease mentions with a single model [40]. Some work has also been performed to capture complex cases of entities (such as discontinuous or overlapping entities): Khandelwal et al. [41] combined BERT and GloVe embeddings with a new label-tagging schema to train an NER model in a distant supervision setting, showing a significant performance boost in the detection of disorder entities in clinical free-text notes. Zaratiana et al. [42] studied the integration of a BERT-based model with graph neural networks to create a span representation that can reduce the number of overlapping spans of disease mentions. They reported an F1 performance of 87.43%; however, the best F1-score reported on the same dataset is 90.48%. An overview of different studies employing transformers for NER and NEL is shown in Table 3.
Applications to relation extraction
Relation extraction, often performed after NER, is one of the main tasks in information extraction; it creates semantic links between two or more entities appearing in the text. These links can, among others, be loose (associates, interacts, correlates, etc.), quite specific (increase/decrease, binds, has participants, etc.), or even causal (directly increases, directly decreases, determined by), as defined by the relation ontology [49]. For instance, the sentence “STK38 is associated with PPARgamma” (PMID:34670478) contains a simple association relation between two proteins, whereas “Mitotic exit kinase Dbf2 directly phosphorylates chitin synthase Chs2” (PMID:27086703) describes a causal relation. The relations extracted from unstructured text are mostly used to construct biomedical knowledge graphs and to expand existing ones with new knowledge [50, 51].
Transformer-based models have achieved remarkable success in extracting relations from textual content. Most studies have fine-tuned BERT-like models on subject-predicate-object relations of one dataset in a supervised manner. For instance, Zhu et al. [52] utilized BioBERT to extract drug-drug interactions from text with an overall F1 performance of 80.9%, beating previous deep learning approaches. Other approaches involve multi-task learning, where multiple datasets are used for fine-tuning with the intuition that the model learns a general representation of relations of different types. To extract associations between drug-drug, chemical-protein, and medical diagnosis-treatment concepts, Moscato et al. [53] proposed a transformer-based architecture with multiple classification heads, each designed to learn features for a specific type of relation. With their multi-task model, they improved performance by approx. 1.5% for chemical-protein and medical diagnosis-treatment associations; however, the model showed a performance decline of 0.6% for drug-drug interactions compared to the single-task model, showing that the effectiveness of multi-task learning can vary across datasets. Solutions have also been proposed to simultaneously link entities and extract relations, either by integrating multiple models in a pipeline [54, 55] or by training a joint model that extracts entities and relations at once [56,57,58]. Some have also experimented with datasets created either using distant supervision [59] or even without any supervision [60]. An overview of different studies employing transformers for relation extraction is shown in Table 4.
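A common way to cast relation extraction as sequence classification is to wrap the two candidate entities in marker tokens and classify the marked sentence, as in the hedged sketch below. The marker tokens, relation labels, and checkpoint name are assumptions for illustration and do not reproduce any specific cited system.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

RELATIONS = ["no_relation", "association", "causal"]  # hypothetical label set

def mark_entities(sentence, head, tail):
    """Wrap the two entity mentions in marker tokens before classification."""
    return (sentence.replace(head, f"[E1] {head} [/E1]")
                    .replace(tail, f"[E2] {tail} [/E2]"))

name = "dmis-lab/biobert-v1.1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]})
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=len(RELATIONS))
model.resize_token_embeddings(len(tokenizer))  # account for new marker tokens

text = mark_entities("STK38 is associated with PPARgamma", "STK38", "PPARgamma")
inputs = tokenizer(text, return_tensors="pt")
pred = model(**inputs).logits.argmax(-1).item()  # untrained head: random output
print(RELATIONS[pred])
```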
For an even broader view of transformer-based models used in biomedical text mining - especially regarding tasks this work has not focused on - we refer to several published surveys [30, 61,62,63].
In summary, transformer-based models are well established in the biomedical NLP field. One main challenge, however, is the lack of clinical datasets due to privacy reasons, which hinders the development and evaluation of models specific to clinical settings. Another challenge is the limited diversity of datasets used in studies evaluating pre-trained models, as they often focus on single entity types like disease and chemical mentions. More efforts to utilize and generate NLP datasets that cover a wider range of biomedical entities and relations are required. Furthermore, processing and analyzing longer biomedical texts still poses a challenge, which requires sophisticated models. Newer models, including LLaMA, BLOOM, and GPT-4, implement techniques to cope with these challenges by enabling in-context learning and allowing longer texts to be processed. However, since these models are not specifically designed for the biomedical domain, thorough evaluation efforts are necessary to identify their advantages and limitations.
Biological sequences
Biological sequences, such as deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or protein sequences, are relatively similar to natural languages. In the same way that characters in a natural language construct meaningful words, phrases, or sentences to convey meaning, the building blocks of biological sequences arranged in different combinations form structures or support specific biological functions. It is no surprise that the recent success of transformer-based models in NLP tasks also motivated the development of dedicated models to represent and analyze biological sequences. This trend is further supported by the availability of large databases (UniProt [72], ENSEMBL [73], GenBank [74]) containing vast amounts of biological sequences that can be used to pre-train transformer-based models on amino acid as well as DNA sequences (Table 5).
Trained protein embeddings have, among other things, been used to evaluate whether the prediction of per-residue secondary structure and subcellular localization reaches a similar accuracy as methods that use evolutionary information [75], whether the recovery of proteins along the species or gene axes is possible, whether the biochemical properties (such as hydrophobic or aromatic nature) of amino acids can be recovered [76], or whether the retrieved embeddings generalize over different protein sequence lengths [77]. For instance, Elnaggar et al. [75] used the pre-trained ProtTrans models to predict secondary structure labels (such as alpha-helix, beta-strand, or coil) for each amino acid, reaching state-of-the-art performance on multiple datasets. Although the majority of studies developing pre-trained protein sequence models employ them for downstream classification tasks, such models can also generate de novo protein sequences with the same fundamental characteristics as natural ones [78,79,80,81,82]. ProtGPT2, an autoregressive protein language model pre-trained on 50 million sequences, can predict subsequent amino acids given a certain context (such as a number of amino acids as input) [82]. The generated protein sequences have shown properties of globular proteins and preserved functional hotspots [82]. However, major limitations still exist, as there is no way of anticipating the functional traits underlying new protein sequences without costly high-throughput experimental approaches.
Recent studies have more systematically explored amino acid sequence representations learned by pre-trained transformers [83,84,85]. For example, the analysis by Detlefsen et al. [83] shows that pre-trained transformer models have difficulties separating details of a single protein family. In consequence, the authors propose fine-tuning (evo-tuning) on the respective protein family to increase their capacity to show clear phylogenetic separation. They also show that enforcing specific biological properties on representations is not a straightforward task and that it is currently steered by model architecture, specific preprocessing (such as using multiple sequence alignment) of underlying data, objective functions for the pre-training, and placing prior distributions on parts of the model to better mimic certain biological traits.
Fine-tuned transformer models for amino acid sequences have been used for various downstream tasks such as protein function classification, protein fitness prediction, and detection of protein interactions with chemical substances (Table 5). Furthermore, AlphaFold [86, 87] has achieved considerable improvements in protein 3D structure prediction using protein sequences as input. AlphaFold2 [87] builds upon two core deep learning modules, the Evoformer and the Structure module, and significantly improved performance on the Critical Assessment of Protein Structure Prediction (CASP) 14 dataset, setting a new state of the art [86, 87]. The transformer-based Evoformer module takes a representation of a multiple sequence alignment (MSA) and a pairwise representation of the protein sequence as input. The MSA, which is precomputed by searching sequence databases for sequences that resemble the input protein, informs the model about evolutionary conservation and variation. The pairwise representation, in turn, captures the interactions between pairs of amino acid residues, which is crucial for understanding the spatial geometry of the protein. The Structure module uses these representations to construct an atomistic model of the protein’s structure. It employs an additional attention mechanism and an optimization procedure to ensure that the predicted 3D structure is physically plausible and adheres to known biophysical constraints. AlphaFold2 combines both modules to refine the representations and the 3D structure prediction in an iterative process to produce the final structure. Like AlphaFold, the transformer-based models RoseTTAFold [88] and ESMFold [89] were independently developed to predict accurate 3D protein structures by learning patterns in protein sequences.
Some recent studies have proposed to learn and capture global representations of DNA sequences [90, 91]. Ji et al. [91] pre-trained DNABERT, which is based on the BERT model with a masked language modeling (MLM) objective and uses tokenized k-mer sequences (with k = 6 performing best) as input instead of regarding each nucleotide as a single token. Due to this tokenization, the vocabulary size of DNABERT was set to 4^k + 5 (all permutations of the 4 nucleotides plus 5 special tokens, e.g., for separation and padding). The pre-trained DNABERT model was analyzed on various fine-tuning tasks, showing in particular that it can effectively identify proximal and core promoter regions, transcription factor binding sites, and functional genetic variants. Furthermore, DNABERT supports interpretability: the learned attention weights, which characterize the contextual relationships within a sequence, can be used to visualize important regions and motifs.
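The k-mer tokenization itself is simple to reproduce; the short sketch below shows the overlapping k-mer split described above. The special-token names in the comment are assumptions based on standard BERT vocabularies.

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers, as done for DNABERT input."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# With k = 6 the token vocabulary holds all 4^6 = 4096 k-mers plus 5 special
# tokens (e.g., [CLS], [SEP], [PAD], [MASK], [UNK]), i.e., 4^k + 5 entries.
print(kmer_tokenize("ATGCGTAC", k=6))
# ['ATGCGT', 'TGCGTA', 'GCGTAC']
```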
Another relevant biological problem, namely how non-coding DNA regions influence gene expression in cells, has been analyzed by Avsec et al. [92], who propose a transformer-based architecture called Enformer that enables the integration of long-range interactions in the genome, producing significant improvements in predicting tissue- and cell-type-specific gene expression. Similar to Kelley et al. [93], to read long sequences of around 197,000 base pairs, Enformer uses a number of convolutional layers to reduce the spatial dimensionality of the input. After the convolutional layers, instead of the dilated convolutions used by Kelley et al. [93], Enformer implements transformer layers that use attention mechanisms to represent long-range interactions. Enformer has shown significant performance gains in gene expression prediction; however, it has not yet reached the accuracy of experimental approaches. Furthermore, Enformer has also shown improvements in variant effect prediction performed on expression quantitative trait loci (eQTL) data [92].
In summary, studies on pre-trained transformer-based models for biological sequences have highlighted their capability to produce state-of-the-art results on 3D structure, function, and interaction prediction. These sequence models, however, have limitations similar to those of NLP models. They require huge amounts of training data, which can represent a bottleneck for certain sequence types (such as small RNAs). Additionally, these models often struggle to capture long-range interactions due to fixed-length context windows, which can be crucial in biological sequences. In the case of protein structure prediction, AlphaFold and others are highly accurate in predicting single protein chains; however, they lack the ability to generate precise multi-chain protein complexes. Newer studies such as AlphaFold-Multimer [94] and ESMPair [95] have extended the previous models to also predict accurate protein complex structures. While transformer-based models show the ability to generalize on biological sequence data, further research is required to identify additional methods (e.g., new layers or architectures) to overcome the aforementioned limitations.
Structured-longitudinal electronic health records
Electronic health records (EHRs) are now collected routinely and in vast quantities by many healthcare systems. Typically, they contain unstructured information like clinical notes, but also structured data, including time-stamped diagnosis and medication codes as well as time-stamped codes for medical procedures. The latter provide excellent opportunities for the efficient development of machine learning models for better personalized healthcare. However, such data is difficult to utilize due to its high dimensionality, heterogeneity, temporal dependency, sparsity, and irregularity [105]. More specifically, structured EHRs can be regarded as an instance of multivariate, discrete, irregular time series data.
Several studies have recently proposed transformer-based models for the analysis of structured EHR data. The intuition behind these approaches is that sequences of diagnosis, procedure, and medication codes can be interpreted as a kind of language, in which the codes recorded at one particular visit are viewed as tokens. Accordingly, transformers have been pre-trained on large amounts of patient data to generate numeric representations of a patient’s medical history, which are then used for downstream tasks like medication recommendation or mortality prediction. For example, Shang et al. [106] developed the graph-augmented transformer model G-BERT. It uses the hierarchical information from the International Statistical Classification of Diseases and Related Health Problems (ICD) and Anatomical Therapeutic Chemical (ATC) ontologies to train a graph neural network, which, in a first step, encodes diagnosis and medication codes in a lower-dimensional space. In a second step, the corresponding concept embeddings are used as a modified position encoding in a BERT-like transformer architecture. The authors pre-trained their model on 20,000 patients from the MIMIC-III dataset, then applied it to a medication recommendation task and found it slightly superior to baseline techniques (1.06% gain in AUPR over the second-best approach).
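The "codes as tokens" intuition can be made concrete with a small sketch that flattens a visit history into a BERT-style token sequence. The codes, separator convention, and helper function are hypothetical simplifications rather than the exact input format of G-BERT or any other cited model.

```python
# A toy patient history: one list of ICD/ATC codes per visit (hypothetical codes).
visits = [
    ["ICD:E11.9", "ATC:A10BA02"],  # visit 1: type 2 diabetes, metformin
    ["ICD:I10", "ATC:C09AA05"],    # visit 2: hypertension, ramipril
]

def ehr_to_tokens(visits):
    """Flatten a visit history into a single token sequence with separators,
    mirroring how structured EHRs are fed to BERT-like models."""
    tokens = ["[CLS]"]
    for visit in visits:
        tokens.extend(visit)
        tokens.append("[SEP]")  # visit boundary, analogous to sentence breaks
    return tokens

print(ehr_to_tokens(visits))
# ['[CLS]', 'ICD:E11.9', 'ATC:A10BA02', '[SEP]', 'ICD:I10', 'ATC:C09AA05', '[SEP]']
```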
Later, Li et al. [107] developed BERT for EHR (BEHRT), which uses an altered embedding layer to process a sequence of diagnosis codes. Unlike G-BERT, the model provides a patient representation for the entire medical history rather than for each visit. When applied to a diagnosis code prediction task, BEHRT surpassed baseline methods (1.2–1.5% higher area under the receiver operating characteristic curve (AUROC) and 8.0–10.8% higher area under the precision-recall curve (AUPR)). Since BEHRT – like many other transformer-based models – is restricted to a maximum sequence length of 512 codes, Li et al. [108] devised a hierarchical BEHRT (Hi-BEHRT) variant in a subsequent study. This method applies BEHRT separately to parts of the medical history using a sliding window before aggregating the information by forwarding the individual representations to a final transformer. In addition to the hierarchical modification, the authors included information on medications, procedures, and laboratory tests. In disease prediction tasks, Hi-BEHRT outperformed BEHRT by 1–5% in AUROC and 3–6% in AUPR. Another variant is the Med-BERT model [109]. Compared to G-BERT and BEHRT, it employs a more extensive vocabulary of diagnosis codes. Furthermore, it introduces a new training objective: the prediction of prolonged length of stay (LOS). During pre-training, the model predicts for each EHR sequence whether the patient had hospital visits of seven or more days (LOS > 7 days). After pre-training on data from 28 million patients, the model was applied to a disease prediction task. On three datasets originating from two clinical databases, the AUROC was increased by 1.21–6.14% compared to the baseline approaches. Very recent work further extended Med-BERT by adding demographic information, medications, and quantitative lab measurements [110].
Other studies addressed potential shortcomings of the approaches mentioned above. For instance, Pang et al. [111] proposed CEHR-BERT, which, unlike Med-BERT and BEHRT, employs a different method to embed the time-series data before passing it to the transformer layers. It uses embeddings initialized with the time2vec model [112] to encode the relative time between visits and the patient’s age. The age, time, and concept embeddings are concatenated and passed through a fully connected layer to generate the temporal concept embeddings for the BERT architecture. In addition, CEHR-BERT incorporates a new pre-training task called visit type prediction (VTP) alongside MLM, which requires the model to determine whether a visit was inpatient, outpatient, emergency, or masked. Compared to baseline approaches, including retrained versions of BEHRT and Med-BERT, CEHR-BERT increased AUPR and AUROC by 0.6–4.2% and 0.4–2.51%, respectively. The aspect of appropriate time encoding was also covered in several other studies [113,114,115,116,117].
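To illustrate the time-encoding idea, the following minimal PyTorch sketch implements a time2vec-style layer with one linear and several periodic components, as described in [112]. The layer dimensions, initialization, and usage are illustrative assumptions, not CEHR-BERT's exact implementation.

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Minimal time2vec layer [112]: one linear component plus dim-1 periodic
    (sine) components, usable to embed inter-visit time or patient age."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim))
        self.b = nn.Parameter(torch.randn(dim))

    def forward(self, t):            # t: (batch, 1) scalar times
        proj = t * self.w + self.b   # broadcast to (batch, dim)
        # The first component stays linear; the rest become periodic features.
        return torch.cat([proj[:, :1], torch.sin(proj[:, 1:])], dim=-1)

# Embed the number of days since the previous visit for three patients.
t2v = Time2Vec(dim=8)
days = torch.tensor([[3.0], [45.0], [365.0]])
print(t2v(days).shape)  # torch.Size([3, 8])
```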
In contrast, Agarwal et al. [118] based their Transmed approach on the notion of a hierarchical transformer for EHRs. On the one hand, a static context encoder was employed to handle information such as a patient’s age, sex, race, and prior conditions such as diabetes or smoking. On the other hand, temporal context encoders were used to process the information of individual visits. The aggregated representations from the static and temporal encoders are then used to predict a patient’s risk of hospital stay or mechanical ventilation following a COVID-19 diagnosis. Across all four tasks, Transmed outperformed a newly pre-trained version of BEHRT (11–20% higher AUROC) and was mostly on par with or better than a baseline gated recurrent unit (GRU) model.
There have also been attempts to combine structured and unstructured EHR data into joint patient representations. For instance, the Bidirectional Representation Learning model with Transformer architecture on Multimodal EHR (BRLTM) [119] utilizes diagnosis, drug, and procedure codes as well as information derived from unstructured clinical notes via latent Dirichlet allocation (LDA). When the authors compared BRLTM to other models, including a retrained version of BEHRT, they found it superior at predicting diseases over multiple time frames. Liu et al. [120] followed a different approach with their Med-PLM model. Instead of deriving features from clinical notes, they use ClinicalBERT for processing clinical notes and a G-BERT-like model for processing structured EHR data before combining both using a cross-attention module. The authors found that the final model outperformed unimodal counterparts (e.g., ClinicalBERT or G-BERT) in all tasks, highlighting the potential of merging both data modalities. Similarly, Darabi et al. [113] used both data modalities for their TAPER model and reached comparable results.
Another recent development in the context of EHRs is the generation of synthetic EHRs with transformers. Cheng et al. [121] recently proposed CEHR-GPT, a model that builds upon their previous work on CEHR-BERT to generate synthetic EHRs using GPT. Unlike CEHR-BERT, CEHR-GPT includes additional information on demographics, patient history, and temporal dependencies. Each visit is represented by a visit type token (VTT), and time is encoded using artificial time tokens (ATT) and long-term (LT) tokens. In their experiments, the authors compared three different patient representations for GPT and found CEHR-GPT to be the most suitable variant for generating realistic synthetic EHR data while preserving patient privacy and temporal dependencies. However, they reported that the prevalence of concepts in the generated data was skewed compared to the original data and that the representation of time intervals is currently limited, suggesting that further improvements could be made in the training of the model and the representation of EHR data.
A broad overview of different studies employing transformers for structured-longitudinal EHR analysis is shown in Table 6.
In summary, transformer-based models are promising for working with structured EHR data. However, applying these models to EHR data also presents several challenges. Firstly, EHR data is highly heterogeneous and diverse, making it relatively unclear how to best represent it compared to text, sequence, and image data; many studies focus on finding a suitable data representation. In addition, comparing these models and their results is difficult. Since most pre-trained models and datasets are publicly unavailable due to privacy concerns, a direct comparison of the models is often impossible. Although studies often use other models as baselines and perform pre-training on available data to compare model architectures, a direct comparison of originally pre-trained models, as is common in the NLP field, is not feasible. Furthermore, Kumar et al. [122] point out that simple linear models are not only data- and compute-efficient but can achieve performance comparable to transformer-based models. For instance, they propose an attention-free architecture called SANSformer that outperformed the BEHRT and BRLTM models. Despite these challenges, transformer-based models remain a promising tool for analyzing EHR data. Further research is important to understand their full potential as well as their limitations, and how they can improve patient outcomes and provide better decision support.
Biomedical images
Due to the self-attention mechanism employed in transformer-based models, they have shown a superior ability to model long-range dependencies in data, albeit mostly where the data is of sequential nature. Recently, transformers have also been adapted successfully to a wide variety of image analysis cases. For image analysis, the image is first split into a sequence of patches (regions), which are then flattened into fixed-length vectors - quite similar to tokens. The flattened image patches are linearly projected and combined with positional embeddings that provide spatial information on each patch. The resulting sequence of transformed patches can then be fed to a transformer. This approach is referred to as a vision transformer (ViT) in the literature [123,124,125,126]. Dosovitskiy et al. [124] formulated image classification as a sequence prediction task, which they addressed via a ViT. They examined two approaches for aggregating spatial information from images: the use of a CLS token and global pooling [124]. The CLS token in ViTs aggregates global information through self-attention, dynamically adjusting to capture complex image relationships. Global pooling, including global average and max pooling, simplifies feature aggregation by applying straightforward mathematical operations across all image patches. While the CLS token’s aggregation is learnable and adapts to the task at hand, global pooling offers a more generalized and computationally efficient summary [124].
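The patch-embedding step described above can be sketched in a few lines of PyTorch. The image size, patch size, and embedding dimension below follow common ViT defaults but are otherwise illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches, project each to d_model,
    then prepend a CLS token and add positional embeddings."""
    def __init__(self, img_size=224, patch=16, channels=3, d_model=768):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided convolution is the standard trick to cut and project patches.
        self.proj = nn.Conv2d(channels, d_model, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))

    def forward(self, x):                            # x: (batch, C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)  # (batch, n_patches, d_model)
        cls = self.cls.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos  # ready for the encoder

emb = PatchEmbedding()
print(emb(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 768])
```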
ViTs have been applied to medical images derived from imaging techniques such as X-ray, computed tomography (CT), MRI, ultrasonography, optical coherence tomography (OCT), and high-content cell imaging screens. For instance, ViTs were used to analyze lung X-rays to detect COVID-19 [127,128,129], breast sonography images to classify breast cancer [130, 131], or femur X-rays to check for fractures [132]. Chen et al. [133] proposed a ViT to detect gastric cancer from histopathological imaging data. Furthermore, CT images were used by Wu et al. [134] to build a medical application for the classification of emphysema into three different subtypes, whereas Wang et al. [135] screened a rare medical OCT imaging dataset for lesions associated with genitourinary syndrome of menopause. MRI datasets have also been classified using ViTs, for brain tumors [136] or for intraductal papillary mucinous neoplasms of the pancreas [137]. Upon closer examination of the work of Tanzi et al. [132], one can observe potential benefits of ViT architectures compared to conventional approaches. Based on their results, it seems worth exploring the embedding space representations generated by ViTs, which can boost performance for medical classification tasks. Examining the attention layers, which are commonly part of ViT architectures, makes these models inherently explainable - an attribute highly regarded by clinicians for model evaluation. Lastly, their retrospective analysis of the integration into clinical practice allows the conclusion that a ViT-based computer-aided diagnosis (CAD) system can contribute to improving clinical workflows and decision making for young residents and experienced doctors alike.
Another relevant task in the biomedical computer vision field is to segment object instances such as lesions in functional magnetic resonance images, tumors in histopathological images, brain tissues in magnetic resonance images, retinal vessels in fundus imagery, or single cells in microscopy imagery [138, 139]. Transformer-based models are heavily used for segmentation, as they often improve accuracy compared to traditional convolutional neural network (CNN)-based methods. Although most studies use hybrid transformer architectures, some have also built pure transformer-based models. For instance, Gao et al. [140] proposed the hybrid transformer-based architecture UTNet, integrating complexity-reduced self-attention into a CNN for segmentation. In comparison, Huang et al. [141] introduced the pure transformer-based method MISSFormer, optimized especially for medical image segmentation tasks. Most studies have focused on the medical field, but some have also applied transformer-based methods to segment cells in images originating from in-vitro experiments. Prangemeier et al. [139] proposed a cell detection transformer for direct end-to-end instance segmentation, reaching similar accuracy to CNN-based methods while demonstrating the simplicity and improved runtime of the proposed model.
In the drug discovery field, it is nowadays common to perform automated high-content screening of cells treated with specific chemical substances. These screening experiments might identify substances that have desirable effects on the phenotypes of cells. High-content images of cells are also used for image-based profiling, where profiles are derived by extracting relevant features from the screened images [142]. Such phenotypic profiles can be used in downstream applications such as identifying a disease-associated phenotype, identifying lead compounds, assessing bioactivity and toxicity, and detecting a compound’s mechanism of action [142] - applications to which transformer-based models have recently been applied [143, 144]. For instance, Cross-Zamirski et al. [143] proposed a ViT-based model that uses weak labels to learn phenotypic representations from a publicly available dataset containing high-content images of cells, evaluating the model on two mechanism-of-action classification tasks. Furthermore, the authors showed that the representations are biologically meaningful by analyzing the attention maps. Table 7 provides a broader overview of recent applications of ViTs.
Even though ViTs have proven to be powerful architectures for a variety of problems in biomedical imaging, they cannot be recommended unconditionally over more established computer vision models such as CNNs [145, 146]. It is important to understand how both architectures “perceive” images in order to understand their particularities, advantages, and disadvantages. The architecture of convolutional networks is inspired by the visual cortex of the brain [147]. They use receptive fields to learn kernels, enabling them to recognize features crucial to their task. A subsequent pooling operation relatively increases the receptive fields of the kernels. This process is repeated iteratively so that the kernels can interpret more distant areas of the image [148]. This, by design, creates inductive biases like translation equivariance and locality [124], important properties for image classification.
In contrast, as described earlier, a ViT treats an image as a sequence of patches, and through self-attention every patch attends to every other patch, so the model needs to learn all spatial relations from the training data [124]. Essentially, this causes ViTs to struggle to generalize effectively with limited data [124]. However, their performance scales well with growing datasets, outperforming CNNs as the number of training samples increases. Unfortunately, especially in the biomedical domain, large publicly available datasets are scarce.
Nonetheless, recent work by He et al. [149] has shown that pre-training techniques, such as training a masked autoencoder (MAE) on patch embeddings, can reduce the number of required training samples and training time while boosting performance on natural images. Zhou et al. [150] later showed that this can be applied in the medical domain as well. Varma et al. [151] tackled the issue of ViTs relying on predefined image sizes, which necessitates pre-processing steps that can degrade image information. Through their flexible positional embedding and alternate batching strategies, they reduce image manipulation while maintaining fine-grained image features.
Driven by their popularity and constant development through ongoing research, one can assume that ViT architectures will further increase in value and impact for biomedical imaging tasks in the near future.
Biomedical graphs
Besides textual content, biological sequences, imaging data, and structured EHR data, graphs are frequently used in biomedicine to describe relations between concepts. Graphs can cover various aspects of the life sciences and hence connect different types of nodes and edges with each other. Graph representation learning with machine learning methods enables the use of graphs for various biomedically relevant downstream tasks such as protein-protein interaction prediction, prediction of adverse drug reactions, cell-type-association prediction, disease-subgraph classification, drug-interaction prediction, and patient-treatment prediction [152]. These tasks can be modeled as graph or sub-graph classification, node classification, or link prediction, which are often performed by encoding the information included in graphs, such as the graph structure, local graph neighborhoods, and the distinguishing features of nodes and edges [152].
Graph Transformer [153], Graph Transformer Networks [154], GTransformer [155], Structured Transformer [156], GraphFormers [157], and Relphormer [158] are some adaptations of transformer-based models suitable for graph representation learning. Transformers for graphs are conceptually similar to relational graph attention networks (RGATs) [159]. They regard each node of a graph as an entity in a pseudo-sequence. However, unlike in transformers for sequences, the attention is restricted to neighboring nodes, thereby taking the graph topology into account.
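The neighborhood-restricted attention can be sketched by masking the attention scores with the graph's adjacency matrix before the softmax, as below. The toy graph, dimensions, and function name are illustrative assumptions rather than the implementation of any specific cited model.

```python
import torch
import torch.nn.functional as F

def graph_attention(x, adj, w_q, w_k, w_v):
    """Self-attention restricted to graph neighbors: scores between
    non-adjacent nodes are masked out before the softmax."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)
    scores = scores.masked_fill(adj == 0, float("-inf"))  # respect topology
    return F.softmax(scores, dim=-1) @ v

# Toy chain graph with 4 nodes; the adjacency matrix includes self-loops.
adj = torch.tensor([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]])
x = torch.randn(4, 16)                                 # one feature vector per node
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(graph_attention(x, adj, w_q, w_k, w_v).shape)    # torch.Size([4, 8])
```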
Graph-based transformers have, for instance, been applied in the drug discovery field, where the focus lies on the identification of targets [160, 161], the prediction of drug response [162], the prediction of ATC codes [163], or of adverse reactions to a certain drug [164]. Additional work has been performed to predict the properties of molecules, such as toxicity, carcinogenicity, or blood-brain barrier penetration [165,166,167]. Recently, textual and image analysis tasks (such as relation extraction or deformable image registration) have also been pursued successfully using graph transformers [168, 169]. A further example is the prediction of interactions between transcription factors and DNA, which can be formulated as a link prediction task in a bipartite graph [170]. Other authors have delved into engineering new proteins via generative graph representations of 3D protein structures [156, 171]. Also, the prediction of protein-protein interactions has been performed using graph neural networks operating on protein 3D structure graphs and learned sequence embeddings from ProtBERT [172].
Although the use of graph transformers to support biomedical tasks is still focused on a few niche areas, researchers are already prospecting new fields in which this technology could be helpful, such as the analysis of single-cell multi-omics in immuno-oncology to characterize cellular heterogeneity [173]. Nonetheless, further experiments are required to assess whether transfer learning strategies combined with graph transformers will ultimately prevail over general graph neural networks like RGATs.
Transformers for multimodal data
The majority of existing research studies have addressed biomedical tasks using one single data modality; however, modeling the complex processes of biology and medicine inherently requires integrating and learning from multiple modalities, such as genetic, proteomic, pharmacogenomic, imaging, and textual data [174]. Recently, transformer-based models have been adapted to process multiple data modalities simultaneously. Koorathota et al. [175] introduced a multimodal neurophysiological transformer for recognizing emotions from multiple modalities (such as time series and extracted features) obtained through electroencephalography, galvanic skin response, and photoplethysmogram techniques. Inspired by the multisensory integration mechanism of the brain, Shi et al. [176] proposed an adapted transformer-based model that integrates visual and auditory modalities to improve emotion and bird species recognition from video-audio clips.
Furthermore, vision-and-language models are a recent development; they take textual content and images as input and jointly learn to capture the relationships between both modalities. These models have also been adapted to the clinical domain, for instance, for chest X-ray disease diagnosis [177] or to automatically generate reports for abnormal COVID-19 chest CT scans [178]. Similarly, paired images and textual reports of chest and musculoskeletal X-rays were used with contrastive learning to build new pre-trained models that improved medical image classification and retrieval on various datasets [179]. Others have explored integrating molecular structures, given as simplified molecular-input line-entry system (SMILES) strings, into biomedical text to build a transformer-based multimodal system that can predict molecular properties, classify chemical reactions, and improve NER as well as relation extraction [180]. Finally, Lentzen et al. [110] proposed a multimodal transformer architecture to combine structured EHRs with quantitative clinical measures. Their idea was to concatenate the latent representation learned by the transformer encoder with a feature vector representing the quantitative data; the concatenated representations are then passed through the classification head during the fine-tuning phase.
The development of transformer-based models capable of learning from multimodal data is a non-trivial challenge. Current models are highly specific to the particular modalities (e.g., text, images, or structured EHRs) and tasks at hand. There is a pressing need for further investigation into how transformer-based architectures can evolve into universal architectures that are agnostic to the various biomedical modalities and underlying tasks.
Making transformers explainable
In biomedicine specifically, it is essential to study which features a model used to make its predictions in order to identify potential flaws and build trust in the results. Since the first appearance of transformer-based models, several studies have proposed approaches for post-hoc model explanation based on techniques developed in the booming field of Explainable AI (XAI) [181].
Most approaches focus on the implicitly learned attention weights of transformer-based models. For instance, Vig [182] developed the BertViz tool for displaying attention weights for analysis and debugging purposes. Later, Ji et al. [91] used similar visualizations for their pre-trained DNABERT model and, studying the attention landscape, found, for instance, that the model prioritizes intronic sequence sections when predicting splice sites. Similarly, Avsec et al. [92] investigated their model for predicting gene expression and chromatin states using the average attention weights; their analysis revealed that the model attends to parts of the sequence located up to 100 kb from the gene site. A slightly different strategy was followed by Koorathota et al. [175], who proposed a multimodal neurophysiological transformer for predicting valence and arousal as a response to music. They created a metric called the sum of absolute activation differences to interpret the interactions between the different modalities. Unlike the majority of attention-weight analyses, this metric is affected neither by individual samples nor by the selection of attention layers or heads. The study revealed, for instance, that electroencephalography and photoplethysmogram signals significantly affect the model’s prediction.
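Extracting attention weights for such analyses is straightforward with common libraries. The hedged sketch below retrieves per-layer attention tensors from a Hugging Face encoder; the general-domain checkpoint, example sentence, and head-averaging choice are assumptions for illustration, and tools like BertViz [182] build richer visualizations on these same tensors.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "bert-base-uncased"  # assumed checkpoint; swap in a biomedical model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("APOE is linked to Alzheimer disease.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, each (batch, heads, seq, seq).
last = out.attentions[-1][0].mean(0)  # average the heads of the last layer
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, row in zip(tokens, last):
    print(f"{tok:12s} attends most to {tokens[row.argmax().item()]}")
```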
Other studies investigate the application of general-purpose XAI methodologies. For instance, Kokalj et al. [183] introduced TransSHAP, an adaptation of Shapley Additive Explanations (SHAP) [184] that can be utilized to evaluate and understand the functioning of text classifiers. Lastly, Madan et al. [103] applied the integrated gradients method [185] instead of focusing on the attention weights of the model. They utilized this method to explain predictions of virus-host protein-protein interactions, discovering sections of sequences that contribute to the model’s predictions. Advances in the XAI field in general have opened up new opportunities to interpret models and gain new insights into predictions, although significant limitations still exist due to the lack of validation datasets; careful investigation of the reliability of these XAI strategies is therefore highly necessary [186]. Furthermore, a general caveat is the possible misinterpretation of XAI approaches as providing a causal understanding of the prediction problem.
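For readers unfamiliar with integrated gradients [185], the following minimal sketch approximates the method for a generic differentiable PyTorch model by averaging gradients along a straight path from a baseline to the input. The linear toy model, zero baseline, and step count are illustrative assumptions, not the setup used in [103].

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Approximate integrated gradients [185]: average the gradients along a
    straight path from a baseline input to x, then scale by (x - baseline)."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    grads = []
    for alpha in torch.linspace(0, 1, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        model(point).sum().backward()  # gradient of the score w.r.t. this point
        grads.append(point.grad)
    return (x - baseline) * torch.stack(grads).mean(0)

# Toy example: a linear scorer over an 8-dimensional embedded input.
model = torch.nn.Linear(8, 1)
x = torch.randn(1, 8)
attributions = integrated_gradients(model, x)
print(attributions.shape)  # torch.Size([1, 8]) - one score per input feature
```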
Discussion
Strengths of transformers
Transformer-based models have pushed the boundaries for processing and analyzing various data modalities such as text, EHRs, biological sequences, images, and graphs across a wide variety of biomedical tasks, as demonstrated by the examples in the previous sections. Since transformers originated in the NLP field, biomedical NLP gained momentum with these models earlier than other disciplines, resulting in a greater number of transformer-related research studies in that field. At the moment, transformers have mainly been applied to discrete data, but first adaptations to continuous time series data have also been proposed [187].
The success of the transformer can be mainly explained by two factors:
a) the attention mechanism, which allows for capturing long-range dependencies in the input, and
b) the self-supervised learning paradigm, which supports pre-training on huge amounts of unlabeled data and subsequent fine-tuning / transfer learning for a domain-specific task.
Specifically, the second aspect allows for effective utilization of background information, which explains the often-observed superior prediction performance compared to more conventional machine learning approaches.
Challenges when using transformers
The pre-training of transformers via the self-supervised learning paradigm depends on huge training datasets. Accordingly, training transformers is computationally intensive. It should be noted that transformers have millions to billions of parameters (one of the largest models, PaLM, published by Chowdhery et al. [188], has 540 billion parameters), and the underlying attention mechanism has quadratic time complexity with respect to the input sequence length. To overcome these challenges, new solutions have been proposed, such as optimizing the transformer model [189,190,191,192,193] or applying knowledge distillation techniques [194]. For instance, Kitaev et al. [191] proposed the Reformer model, which improves the efficiency of the transformer by reducing the complexity of the dot-product attention mechanism and by optimizing the storage of activations in the model.
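The practical consequence of the quadratic attention complexity is easy to see. The snippet below estimates the memory needed just to store one head's attention matrix in float32 for growing sequence lengths; the float32 size and single-head view are simplifying assumptions.

```python
# Rough memory footprint of the (seq_len x seq_len) attention matrix per head,
# in float32 (4 bytes per entry) - illustrating the quadratic growth.
for seq_len in (512, 2048, 8192, 32768):
    bytes_needed = seq_len ** 2 * 4
    print(f"seq_len={seq_len:6d}: {bytes_needed / 2**20:8.1f} MiB")
# seq_len=   512:      1.0 MiB
# seq_len=  2048:     16.0 MiB
# seq_len=  8192:    256.0 MiB
# seq_len= 32768:   4096.0 MiB
```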
Future direction 1: knowledge integration
Another line of research focuses on the utilization of background knowledge during the training procedure. For example, in the NLP field, K-BERT is an extension of BERT in which the input token stream is expanded with background information extracted from a knowledge graph [195]. ERNIE uses two encoders, a T-encoder for the original tokens and a K-encoder for entities in a knowledge graph, and fuses both representations [196, 197]. While the authors of these papers report enhanced prediction performance on NLP tasks outside the biomedical domain, it remains an open question how such methods might be impacted by incompleteness and errors in the knowledge graph, which could be a major concern in the biomedical field. Furthermore, not all knowledge can be effectively represented as a graph. Depending on the respective application, other knowledge representations, such as logical rules and mathematical equations, could be worthwhile to consider in future research as well.
Future direction 2: multimodal data integration
Integrating multimodal data is key for many systems medicine and precision medicine tasks. Heterogeneous information across different data modalities, such as genetics, epigenetics, proteomics, metabolomics, imaging, text, and clinical observations, must be aligned and fused to perform multimodal learning with transformer-based models. Although first publications are now focusing on multimodal transformers (see the section above), this line of research is still at its beginning. For example, one general challenge in multimodal data integration is the varying dimensionality and numerical ranges of the input modalities [198]. Recent studies have begun to explore general-purpose architectures that can handle different modalities of varying dimensionalities [199,200,201], but we expect more work to come along those lines.
Future direction 3: generative modeling
More recently, generative transformer models have shown impressive advancements in the NLP field. One of the most prominent examples, although not particularly devoted to biomedicine, is ChatGPT [202]. ChatGPT has shown remarkable performance in generating near-human-level textual content and leading dialogues with humans. Generative transformer models such as ChatGPT or its freely available variants (e.g., GPT4All [203]) could in the future support many tasks in routine medical care, such as generating synthetic clinical notes [32], writing discharge letters, or coding and billing diagnoses and medications. Furthermore, these models could also support biomedical research; researchers have already started experimenting with generative transformers to generate synthetic protein sequences [82, 204]. However, a huge challenge of applying such models in biomedicine is verifying the trustworthiness of the generated content. For instance, engineered protein sequences need to be tested experimentally, and automatically generated discharge letters have to be validated manually.
Future direction 4: better explainable models
Being able to explain and understand predictions through XAI techniques builds trust and confidence in biomedical AI models, which is even more relevant for decision-making processes in the clinical domain. Several general-purpose XAI techniques have recently been adapted for transformers [205,206,207]. Some have shown that trust in models can also be increased by producing counterfactual explanations, which show under which hypothetical changes to the input a different output would be generated - a method often used by humans to understand unfamiliar processes [208, 209]. However, the XAI field as such is still in its infancy. For example, there is no generally accepted definition of “explainability”, and there is a lack of gold standards against which new methods could be compared. Accordingly, existing attempts to make transformers explainable have to be seen relative to the advances of the XAI field as a whole. While first approaches in the XAI field mainly focused on images, the development of general-purpose model explanation techniques, such as SHAP, is still relatively recent. We can thus expect that, with the continuing advances of the XAI field, better explanation techniques for transformers will become available.
Conclusion
Transformers, originally created in the NLP field, are still a relatively new deep learning approach. Recent years have witnessed a dramatic increase in the use of transformers for various data types of relevance in biomedicine, including structured EHRs, graphs, images, and biological sequences. The main strengths of transformers are the built-in attention mechanism and the possibility of self-supervised pre-training, which, however, requires huge datasets. Accordingly, transformers have so far found little use in domains where such datasets are not available, e.g., signals from wearable devices or data from clinical studies and registries. Also, despite research on modeling time-series data with transformers [187], dedicated studies in biomedicine for this type of data are yet to emerge. Currently emerging directions of research include better strategies for knowledge integration, multimodal data fusion, and the adaptation of novel XAI techniques. We expect that efforts to integrate data across entire healthcare systems, such as those in the United Kingdom (UK) by Health Data Research UK (https://www.hdruk.ac.uk/), UK Biobank (https://www.ukbiobank.ac.uk/) and Genomics England (https://www.genomicsengland.co.uk/), will enable an even more widespread use of transformers in the future.
Availability of data and materials
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Change history
14 August 2024
The Level 1 headings in the website version of the article have been corrected.
Abbreviations
- AI: Artificial Intelligence
- BERT: Bidirectional Encoder Representations from Transformers
- CAD: Computer-Aided Diagnosis
- COVID-19: Coronavirus Disease 2019
- CV: Computer Vision
- CT: Computed Tomography
- EHR: Electronic Health Records
- GPT: Generative Pre-trained Transformer
- MRI: Magnetic Resonance Imaging
- NER: Named Entity Recognition
- NEL: Named Entity Linking
- NLP: Natural Language Processing
- NLM: United States National Library of Medicine
- PMC: PubMed Central
- ViT: Vision Transformer
- ICD: International Statistical Classification of Diseases and Related Health Problems
- ATC: Anatomical Therapeutic Chemical
- VTP: Visit Type Prediction
- GRU: Gated Recurrent Unit
- LDA: Latent Dirichlet Allocation
- SHAP: Shapley Additive Explanations
- XAI: Explainable Artificial Intelligence
References
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems. Red Hook, NY, USA: Curran Associates Inc; 2017. p. 6000–10.
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–901.
Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 2019. p. 4171–86.
Touvron H, Lavril T, Izacard G, et al. LLaMA: open and efficient foundation language models. 2023.
Touvron H, Martin L, Stone K, et al. Llama 2: open foundation and fine-tuned chat models. 2023.
Workshop B, Scao TL, Fan A, et al. BLOOM: a 176B-parameter open-access multilingual language model. 2023. https://doi.org/10.48550/arXiv.2211.05100.
Bahdanau D, Cho KH, Bengio Y. Neural machine translation by jointly learning to align and translate. San Diego: 3rd International Conference on Learning Representations, ICLR 2015; 2015.
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A. Language models are few-shot learners. arXiv. 2020;2005:14165.
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI Blog. 2019;1:9.
Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, Massachusetts: The MIT Press; 2016.
Lin T, Wang Y, Liu X, Qiu X. A survey of transformers. 2021. https://doi.org/10.48550/arXiv.2106.04554.
Johnson A, Pollard T, Mark R. MIMIC-III clinical database. 2015. https://doi.org/10.13026/C2XW26.
Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, Moody B, Szolovits P, Anthony Celi L, Mark RG. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36:1234–40.
Clark K, Luong M-T, Le QV, Manning CD. Electra: pre-training text encoders as discriminators rather than generators. arXiv. 2020;2003:10555.
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. Roberta: a robustly optimized bert pretraining approach. arXiv. 2019;1907:11692.
OpenAI, Achiam J, Adler S, et al. GPT-4 technical report. 2024. https://doi.org/10.48550/arXiv.2303.08774.
Lentzen M, Madan S, Lage-Rupprecht V, et al. Critical assessment of transformer-based AI models for German clinical notes. JAMIA Open. 2022;5:ooac087.
Copara Zea JL, Knafou JDM, Naderi N, Moro C, Ruch P, Teodoro D. Contextualized French language models for biomedical named entity recognition. Actes de la 6e conférence conjointe Journées d’Études sur la parole (JEP, 33e édition), Traitement Automatique Des Langues Naturelles (TALN, 27e édition), Rencontre Des Étudiants chercheurs en Informatique pour le Traitement Automatique Des Langues (RÉCITAL, 22e édition). Nancy, France: ATALA et AFCP: Atelier DÉfi Fouille de Textes; 2020. p. 36–48.
Kim Y, Kim J-H, Lee JM, Jang MJ, Yum YJ, Kim S, Shin U, Kim Y-M, Joo HJ, Song S. A pre-trained BERT for Korean medical natural language processing. Sci Rep. 2022;12:13847.
Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, Naumann T, Gao J, Poon H. Domain-specific language model pretraining for biomedical natural language processing. 2020.
Shin HC, Zhang Y, Bakhturina E, Puri R, Patwary M, Shoeybi M, Mani R. BioMegatron: larger biomedical domain language model. In: Proceedings of the 2020 conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics; 2020. p. 4700–6. https://doi.org/10.18653/v1/2020.emnlp-main.379.
Kanakarajan Kraj, Kundumani B, Sankarasubbu M. BioELECTRA: pretrained biomedical text encoder using discriminators. In: proceedings of the 20th workshop on biomedical language processing. Online: Association for Computational Linguistics; 2021. p. 143–54.
Naseem U, Dunn AG, Khushi M, Kim J. Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT. BMC Bioinformatics. 2022;23:144.
Gururangan S, Marasović A, Swayamdipta S, Lo K, Beltagy I, Downey D, Smith NA. Don’t stop pretraining: adapt language models to domains and tasks. 2020. https://doi.org/10.48550/arXiv.2004.10964.
Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, Liu T-Y. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23:bbac409.
Alsentzer E, Murphy J, Boag W, Weng WH, Jindi D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd clinical natural language processing workshop. Minneapolis, Minnesota, USA: Association for Computational Linguistics; 2019. p. 72–8.
Huang K, Altosaar J, Ranganath R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv. 2019;1904:05342 [cs].
Huang K, Singh A, Chen S, Moseley E, Deng C-Y, George N, Lindvall C. Clinical XLNet: Modeling sequential clinical notes and predicting prolonged mechanical ventilation. In: Proceedings of the 3rd clinical natural language processing workshop. 2020. p. 94–100.
Yang X, Bian J, Hogan WR, Wu Y. Clinical concept extraction using transformers. J Am Med Inform Assoc. 2020;27:1935–42.
Li Y, Wehbe RM, Ahmad FS, Wang H, Luo Y. Clinical-Longformer and Clinical-BigBird: transformers for long clinical sequences. 2022. https://doi.org/10.48550/arXiv.2201.11838.
Yang X, Chen A, PourNejatian N, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5:194.
Basaldella M, Liu F, Shareghi E, Collier N. COMETA: a corpus for medical entity linking in the social media. arXiv. 2020;2010:03295 [cs].
Chen Q, Allot A, Lu Z. LitCovid: an open database of COVID-19 literature. Nucleic Acids Res. 2021;49:D1534–40.
Chen Q, Allot A, Leaman R, et al. Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid track for COVID-19 literature topic annotations. Database. 2022;2022:baac069.
Esteva A, Kale A, Paulus R, Hashimoto K, Yin W, Radev D, Socher R. COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization. NPJ Digit Med. 2021;4:1–9.
Nentidis A, Krithara A, Bougiatiotis K, Paliouras G. Overview of BioASQ 8a and 8b: results of the Eighth Edition of the BioASQ tasks a and b. In: Cappellato L, Eickhoff C, Ferro N, Névéol A, eds. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. Thessaloniki, Greece: CEUR; 2020. Available from: https://ceur-ws.org/Vol-2696/#paper_164.
You R, Liu Y, Mamitsuka H, Zhu S. BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text. Bioinformatics. 2021;37:684–92.
Sun C, Yang Z, Wang L, Zhang Y, Lin H, Wang J. Biomedical named entity recognition using BERT in the machine reading comprehension framework. J Biomed Inform. 2021;118:103799.
Peng Y, Chen Q, Lu Z. An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining. In: Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing. Online: Association for Computational Linguistics; 2020. p. 205–14. Available from: https://aclanthology.org/2020.bionlp-1.22.
Khandelwal A, Kar A, Chikka VR, Karlapalem K. Biomedical NER using novel schema and distant supervision. In: Proceedings of the 21st workshop on biomedical language processing. Dublin, Ireland: Association for Computational Linguistics; 2022. p. 155–60.
Zaratiana U, Tomeh N, Holat P, Charnois T. GNNer: reducing overlapping in span-based NER using graph neural networks. In: Proceedings of the 60th annual meeting of the Association for Computational Linguistics: student research workshop. Dublin, Ireland: Association for Computational Linguistics; 2022. p. 97–103.
Fries JA, Steinberg E, Khattar S, Fleming SL, Posada J, Callahan A, Shah NH. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nat Commun. 2021;12:2017.
Madan S, Julius Zimmer F, Balabin H, Schaaf S, Fröhlich H, Fluck J, Neuner I, Mathiak K, Hofmann-Apitius M, Sarkheil P. Deep learning-based detection of psychiatric attributes from German mental health records. Int J Med Inform. 2022;104724.
Huang C-W, Tsai S-C, Chen Y-N. PLM-ICD: automatic ICD coding with pretrained language models. In: Proceedings of the 4th clinical natural language processing workshop. 2022. p. 10–20.
Johnson AE, Bulgarelli L, Pollard TJ. Deidentification of free-text medical records using pre-trained bidirectional transformers. In: Proceedings of the ACM conference on health, inference, and learning. 2020. p. 214–21.
Vakili T, Lamproudis A, Henriksson A, Dalianis H. Downstream task performance of BERT models pre-trained using automatically de-identified clinical data. In: Proceedings of the thirteenth language resources and evaluation conference. 2022. p. 4245–52.
Sung M, Jeong M, Choi Y, Kim D, Lee J, Kang J. BERN2: an advanced neural biomedical named entity recognition and normalization tool. Bioinformatics. 2022;38:4837–9.
Mungall C, Matentzoglu N, Balhoff J, et al. Oborel/obo-relations: release 2022-10-26. 2022. https://doi.org/10.5281/zenodo.7254604.
Karki R, Madan S, Gadiya Y, Domingo-Fernández D, Kodamullil AT, Hofmann-Apitius M. Data-driven modeling of knowledge assemblies in understanding comorbidity between type 2 diabetes mellitus and Alzheimer’s disease. J Alzheimers Dis. 2020;78:1–9.
Kodamullil AT, Iyappan A, Karki R, Madan S, Younesi E, Hofmann-Apitius M. Of mice and men: comparative analysis of neuro-inflammatory mechanisms in human and mouse using cause-and-effect models. J Alzheimers Dis. 2017;59:1045–55.
Zhu Y, Li L, Lu H, Zhou A, Qin X. Extracting drug-drug interactions from texts with BioBERT and multiple entity-aware attentions. J Biomed Inform. 2020;106:103451.
Li D, Xiong Y, Hu B, Tang B, Peng W, Chen Q. Drug knowledge discovery via multi-task learning and pre-trained models. BMC Med Inf Decis Mak. 2021;21:251.
Hu D, Zhang H, Li S, Wang Y, Wu N, Lu X. Automatic extraction of lung cancer staging information from computed tomography reports: deep learning approach. JMIR Med Inf. 2021;9:e27955.
Zhang X, Zhang Y, Zhang Q, Ren Y, Qiu T, Ma J, Sun Q. Extracting comprehensive clinical information for breast cancer using deep learning methods. Int J Med Inf. 2019;132:103985.
Bansal T, Verga P, Choudhary N, McCallum A. Simultaneously linking entities and extracting relations from biomedical text without mention-level supervision. arXiv. 2019;1912:01070 [cs].
Chen M, Lan G, Du F, Lobanov V. Joint learning with pre-trained transformer on named entity recognition and relation extraction tasks for clinical analytics. In: Proceedings of the 3rd clinical natural language processing workshop. Online: Association for Computational Linguistics; 2020. p. 234–42.
Verga P, Strubell E, McCallum A. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In: Proceedings of the 2018 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics; 2018. p. 872–84.
Iinuma N, Miwa M, Sasaki Y. Improving supervised drug-protein relation extraction with distantly supervised models. In: Proceedings of the 21st workshop on biomedical language processing. Dublin, Ireland: Association for Computational Linguistics; 2022. p. 161–70.
Papanikolaou Y, Roberts I, Pierleoni A. Deep bidirectional transformers for relation extraction without supervision. In: Proceedings of the 2nd workshop on deep learning approaches for low-resource NLP (DeepLo 2019). Hong Kong, China: Association for Computational Linguistics; 2019. p. 67–75.
Hall K, Chang V, Jayne C. A review on natural language processing models for COVID-19 research. Healthc Analytics. 2022;2:100078.
Kalyan KS, Rajasekharan A, Sangeetha S. AMMU: a survey of transformer-based biomedical pretrained language models. J Biomed Inform. 2022;126:103982.
Wang B, Xie Q, Pei J, Tiwari P, Li Z, Fu J. Pre-trained language models in biomedical domain: a systematic survey. 2021. https://doi.org/10.48550/arXiv.2110.05006.
Syafiandini AF, Song G, Ahn Y, Kim H, Song M. An automatic hypothesis generation for plausible linkage between xanthium and diabetes. Sci Rep. 2022;12:17547.
Hong G, Kim Y, Choi Y, Song M. BioPREP: deep learning-based predicate classification with SemMedDB. J Biomed Inform. 2021;122:103888.
García del Valle EP, Lagunes García G, Prieto Santamaría L, Zanin M, Menasalvas Ruiz E, Rodríguez-González A. Leveraging network analysis to evaluate biomedical named entity recognition tools. Sci Rep. 2021;11:13537.
Aldahdooh J, Vähä-Koskela M, Tang J, Tanoli Z. Using BERT to identify drug-target interactions from whole PubMed. BMC Bioinformatics. 2022;23:245.
Zhou H, Li X, Yao W, Liu Z, Ning S, Lang C, Du L. Improving neural protein-protein interaction extraction with knowledge selection. Comput Biol Chem. 2019;83:107146.
Wang J, Ren Y, Zhang Z, Xu H, Zhang Y. From tokenization to self-supervision: building a high-performance information extraction system for chemical reactions in patents. Front Res Metr Anal. 2021;6:691105.
Jain H, Raj N, Mishra S. A Sui Generis QA Approach using RoBERTa for adverse drug event identification. BMC Bioinformatics. 2021;22:330.
Cho H, Kim B, Choi W, Lee D, Lee H. Plant phenotype relationship corpus for biomedical relationships between plants and phenotypes. Sci Data. 2022;9:235.
The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49:D480–9.
Cunningham F, Allen JE, Allen J, et al. Ensembl 2022. Nucleic Acids Res. 2022;50:D988–95.
Sayers EW, Cavanaugh M, Clark K, Ostell J, Pruitt KD, Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2020;48:D84–6.
Elnaggar A, Heinzinger M, Dallago C, et al. ProtTrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. IEEE Trans Pattern Anal Mach Intell. 2021. https://doi.org/10.1109/tpami.2021.3095381.
Rives A, Meier J, Sercu T, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS. 2021. https://doi.org/10.1073/pnas.2016239118.
Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022;38:2102–10.
Madani A, McCann B, Naik N, Keskar NS, Anand N, Eguchi RR, Huang P-S, Socher R. ProGen: language modeling for protein generation. 2020. https://doi.org/10.48550/arXiv.2004.03497.
Madani A, Krause B, Greene ER, et al. Deep neural language modeling enables functional protein generation across families. bioRxiv. 2021;2021.07.18.452833.
Hesslow D, Zanichelli N, Notin P, Poli I, Marks D. RITA: a study on scaling up generative protein sequence models. arXiv. 2022;2205:05789.
Nijkamp E, Ruffolo J, Weinstein EN, Naik N, Madani A. ProGen2: exploring the boundaries of protein language models. 2022. https://doi.org/10.48550/arXiv.2206.13517.
Ferruz N, Schmidt S, Höcker B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun. 2022;13:4348.
Detlefsen NS, Hauberg S, Boomsma W. Learning meaningful representations of protein sequences. Nat Commun. 2022;13:1914.
Rao R, Bhattacharya N, Thomas N, Duan Y, Chen P, Canny J, Abbeel P, Song Y. Evaluating protein transfer learning with TAPE. Adv Neural Inf Process Syst. 2019;32.
Unsal S, Atas H, Albayrak M, Turhan K, Acar AC, Doğan T. Learning functional properties of proteins with language models. Nat Mach Intell. 2022;4:227–45.
Senior AW, Evans R, Jumper J, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577:706–10.
Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9.
Baek M, DiMaio F, Anishchenko I, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–6.
Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–30.
Clauwaert J, Waegeman W. Novel transformer networks for Improved sequence labeling in genomics. IEEE/ACM Trans Comput Biol Bioinf. 2020;19:97–106.
Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics. 2021;37:2112–20.
Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, Assael Y, Jumper J, Kohli P, Kelley DR. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021;18:1196–203.
Kelley DR, Reshef YA, Bileschi M, Belanger D, McLean CY, Snoek J. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 2018;28:739–50.
Evans R, O’Neill M, Pritzel A, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2022;2021.10.04.463034.
Chen B, Xie Z, Qiu J, Ye Z, Xu J, Tang J. Improved the protein complex prediction with protein language models. bioRxiv. 2022;2022.09.15.508065.
Rao RM, Liu J, Verkuil R, Meier J, Canny J, Abbeel P, Sercu T, Rives A. MSA Transformer. In: Proceedings of the 38th International Conference on Machine Learning. Online: PMLR; 2021. p. 8844–56. Available from: https://proceedings.mlr.press/v139/rao21a.html.
Teufel F, Almagro Armenteros JJ, Johansen AR, Gíslason MH, Pihl SI, Tsirigos KD, Winther O, Brunak S, von Heijne G, Nielsen H. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol. 2022;40:1023–5.
Notin P, Dias M, Frazer J, Hurtado JM, Gomez AN, Marks D. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In: Proceedings of the 39th International Conference on Machine Learning. Online: PMLR; 2022. p. 16990–17017.
Hsu C, Nisonoff H, Fannjiang C, Listgarten J. Learning protein fitness models from evolutionary and assay-labeled data. Nat Biotechnol. 2022;40:1114–22.
Bernhofer M, Rost B. TMbed: transmembrane proteins predicted through language model embeddings. BMC Bioinformatics. 2022;23:326.
Castro E, Godavarthi A, Rubinfien J, Givechian K, Bhaskar D, Krishnaswamy S. Transformer-based protein generation with regularized latent space optimization. Nat Mach Intell. 2022;1–12.
Kang H, Goo S, Lee H, Chae J, Yun H, Jung S. Fine-tuning of BERT Model to accurately predict drug–target interactions. Pharmaceutics. 2022;14:1710.
Madan S, Demina V, Stapf M, Ernst O, Fröhlich H. Accurate prediction of virus-host protein-protein interactions via a siamese neural network using deep protein sequence embeddings. Patterns. 2022;3:100551.
Zitnik M, Sosič R, Maheshwari S, Leskovec J. BioSNAP Datasets: Stanford Biomedical Network Dataset Collection. 2018. http://snap.stanford.edu/biodata.
Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform. 2018;19:1236–46.
Shang J, Ma T, Xiao C, Sun J. Pre-training of graph augmented transformers for medication recommendation. In: 28th International Joint Conference on Artificial Intelligence, IJCAI 2019. Macao: International Joint Conferences on Artificial Intelligence (IJCAI); 2019. p. 5953–9.
Li Y, Rao S, Solares JRA, Hassaine A, Ramakrishnan R, Canoy D, Zhu Y, Rahimi K, Salimi-Khorshidi G. BEHRT: Transformer for electronic health records. Sci Rep. 2020;10:7155.
Li Y, Mamouei M, Salimi-Khorshidi G, Rao S, Hassaine A, Canoy D, Lukasiewicz T, Rahimi K. Hi-BEHRT: Hierarchical Transformer-Based Model for Accurate Prediction of Clinical Events Using Multimodal Longitudinal Electronic Health Records. IEEE J Biomed Health Inform. 2023;27:1106–17.
Rasmy L, Xiang Y, Xie Z, Tao C, Zhi D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit Med. 2021;4:1–13.
Lentzen M, Linden T, Veeranki S, Madan S, Kramer D, Leodolter W, Fröhlich H. A transformer-based model trained on large scale claims data for prediction of severe COVID-19 disease progression. IEEE J Biomed Health Inform. 2023;27:4548–58.
Pang C, Jiang X, Kalluri KS, Spotnitz M, Chen R, Perotte A, Natarajan K. CEHR-BERT: incorporating temporal information from structured EHR data to improve prediction tasks. Mach Learn Health. 2021;239–60.
Kazemi SM, Goel R, Eghbali S, Ramanan J, Sahota J, Thakur S, Wu S, Smyth C, Poupart P, Brubaker M. Time2Vec: learning a vector representation of time. 2019. https://doi.org/10.48550/ARXIV.1907.05321.
Darabi S, Kachuee M, Fazeli S, Sarrafzadeh M. TAPER: time-aware patient EHR representation. IEEE J Biomed Health Inform. 2020;24:3268–75.
Finch A, Crowell A, Chang Y-C, Parameshwarappa P, Martinez J, Horberg M. A comparison of attentional neural network architectures for modeling with electronic medical records. JAMIA Open. 2021;4:ooab064.
Luo J, Ye M, Xiao C, Ma F. HiTANet: hierarchical time-aware attention networks for risk prediction on electronic health records. 2020. https://doi.org/10.1145/3394486.3403107.
Peng X, Long G, Shen T, Wang S, Jiang J. Sequential diagnosis prediction with transformer and ontological representation. 2021. https://doi.org/10.48550/ARXIV.2109.03069.
Ren H, Wang J, Zhao WX, Wu N. RAPT: pre-training of time-aware transformer for learning robust healthcare representation. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. New York, NY, USA: Association for Computing Machinery; 2021. p. 3503–11.
Agarwal K, Choudhury S, Tipirneni S, et al. Preparing for the next pandemic via transfer learning from existing diseases with hierarchical multi-modal BERT: a study on COVID-19 outcome prediction. Sci Rep. 2022;12:10748.
Meng Y, Speier W, Ong MK, Arnold CW. Bidirectional representation learning from transformers using multimodal electronic health record data to predict depression. IEEE J Biomed Health Inform. 2021;25:3121–9.
Liu S, Wang X, Hou Y, Li G, Wang H, Xu H, Xiang Y, Tang B. Multimodal data matters: language model pre-training over structured and unstructured electronic health records. IEEE J Biomed Health Inform. 2022;1–12.
Pang C, Jiang X, Pavinkurve NP, Kalluri KS, Minto EL, Patterson J, Zhang L, Hripcsak G, Elhadad N, Natarajan K. CEHR-GPT: generating electronic health records with chronological patient timelines. 2024. https://doi.org/10.48550/arXiv.2402.04400.
Kumar Y, Ilin A, Salo H, Kulathinal S, Leinonen MK, Marttinen P. Self-supervised forecasting in electronic health records with attention-free models. IEEE Trans Artif Intell. 2024;1–17.
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: Vedaldi A, Bischof H, Brox T, Frahm JM, eds. Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12346. Cham: Springer; 2020. https://doi.org/10.1007/978-3-030-58452-8_1.
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: ICLR 2021 The Ninth International Conference on Learning Representations. Online: International Conference on Learning Representations (ICLR). 2021.
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. 2021. p. 10012–22.
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning. Online: PMLR; 2021. p. 10347–57.
Krishnan KS, Krishnan KS. Vision transformer based COVID-19 detection using chest X-rays. In: 2021 6th International Conference on Signal Processing, Computing and Control (ISPCC). 2021. p. 644–8.
Park S, Kim G, Oh Y, Seo JB, Lee SM, Kim JH, Moon S, Lim J-K, Ye JC. Multi-task vision transformer using low-level chest X-ray feature corpus for COVID-19 diagnosis and severity quantification. Med Image Anal. 2022;75:102299.
Shome D, Kar T, Mohanty SN, Tiwari P, Muhammad K, AlTameem A, Zhang Y, Saudagar AKJ. COVID-Transformer: interpretable COVID-19 detection using vision transformer for healthcare. Int J Environ Res Public Health. 2021;18:11086.
Gheflati B, Rivaz H. Vision transformers for classification of breast ultrasound images. In: 2022 44th annual international conference of the IEEE Engineering in Medicine & Biology Society (EMBC). 2022. p. 480–3.
Wang W, Jiang R, Cui N, Li Q, Yuan F, Xiao Z. Semi-supervised vision transformer with adaptive token sampling for breast cancer classification. Front Pharmacol. 2022;13:929755.
Tanzi L, Audisio A, Cirrincione G, Aprato A, Vezzetti E. Vision transformer for femur fracture classification. Injury. 2022;53:2625–34.
Chen H, Li C, Wang G, et al. GasHis-Transformer: a multi-scale visual transformer approach for gastric histopathological image detection. Pattern Recogn. 2022;130:108827.
Wu Y, Qi S, Sun Y, Xia S, Yao Y, Qian W. A vision transformer for emphysema classification using CT images. Phys Med Biol. 2021;66:245016.
Wang H, Ji Y, Song K, Sun M, Lv P, Zhang T. ViT-P: classification of genitourinary syndrome of menopause from OCT images based on vision transformer models. IEEE Trans Instrum Meas. 2021;70:1–14.
Tummala S, Kadry S, Bukhari SAC, Rauf HT. Classification of brain tumor from magnetic resonance imaging using vision transformers ensembling. Curr Oncol. 2022;29:7498–511.
Salanitri FP, Bellitto G, Palazzo S, et al. Neural transformers for Intraductal Papillary Mucosal Neoplasms (IPMN) classification in MRI images. In: 2022 44th annual international conference of the IEEE Engineering in Medicine & Biology Society (EMBC). 2022. p. 475–9.
He K, Gan C, Li Z, Rekik I, Yin Z, Ji W, Gao Y, Wang Q, Zhang J, Shen D. Transformers in medical image analysis: a review. 2022. https://doi.org/10.48550/arXiv.2202.12165.
Prangemeier T, Reich C, Koeppl H. Attention-based transformers for instance segmentation of cells in microstructures. In: 2020 IEEE international conference on Bioinformatics and Biomedicine (BIBM). 2020. p. 700–7.
Gao Y, Zhou M, Metaxas DN. UTNet: a hybrid transformer architecture for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Online: Springer; 2021. p. 61–71.
Huang X, Deng Z, Li D, Yuan X. MISSFormer: an effective medical image segmentation transformer. 2021. https://doi.org/10.48550/arXiv.2109.07162.
Chandrasekaran SN, Ceulemans H, Boyd JD, Carpenter AE. Image-based profiling for drug discovery: due for a machine-learning upgrade? Nat Rev Drug Discov. 2021;20:145–59.
Cross-Zamirski JO, Williams G, Mouchet E, Schönlieb C-B, Turkki R, Wang Y. Self-supervised learning of phenotypic representations from cell images with weak labels. 2022. https://doi.org/10.48550/arXiv.2209.07819.
Wieser M, Siegismund D, Heyse S, Steigele S. Vision transformers show improved robustness in high-content image analysis. In: 2022 9th Swiss conference on Data Science (SDS). 2022. p. 72–71.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 770–8.
Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60:84–90.
Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol. 1962;160:106.
Li Z, Liu F, Yang W, Peng S, Zhou J. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans Neural Networks Learn Syst. 2022;33:6999–7019.
He K, Chen X, Xie S, Li Y, Dollár P, Girshick R. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022. p. 16000–9.
Zhou L, Liu H, Bae J, He J, Samaras D, Prasanna P. Self pre-training with masked autoencoders for medical image classification and segmentation. In: 2023 IEEE 20th international symposium on biomedical imaging (ISBI). IEEE. 2023. p. 1–6.
Varma A, Shit S, Prabhakar C, Scholz D, Li HB, Menze B, Rueckert D, Wiestler B. VariViT: A vision transformer for variable image sizes. In: Medical imaging with deep learning. Paris, France. 2024.
Li MM, Huang K, Zitnik M. Graph representation learning in biomedicine and healthcare. Nat Biomed Eng. 2022;1–17.
Dwivedi VP, Bresson X. A generalization of transformer networks to graphs. 2021. https://doi.org/10.48550/arXiv.2012.09699.
Yun S, Jeong M, Yoo S, Lee S, Yi SS, Kim R, Kang J, Kim HJ. Graph Transformer networks: learning meta-path graphs to improve GNNs. Neural Netw. 2022;153:104–19.
Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W, Huang J. Self-supervised graph transformer on large-scale molecular data. Adv Neural Inf Process Syst. 2020;33:12559–71.
Ingraham J, Garg VK, Barzilay R, Jaakkola T. Generative models for graph-based protein design. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc; 2019. p 15820–31.
Yang J, Liu Z, Xiao S, Li C, Lian D, Agrawal S, Singh A, Sun G, Xie X. GraphFormers: GNN-nested transformers for representation learning on textual graph. arXiv. 2021;2105.02605. https://doi.org/10.48550/arXiv.2105.02605.
Bi Z, Cheng S, Chen J, Liang X, Xiong F, Zhang N. Relphormer: Relational Graph Transformer for Knowledge Graph representations. Neurocomputing. 2024;566:127044.
Busbridge D, Sherburn D, Cavallo P, Hammerla NY. Relational graph attention networks. arXiv. 2019;1904:05811 [cs, stat].
Wang H, Guo F, Du M, Wang G, Cao C. A novel method for drug-target interaction prediction based on graph transformers model. BMC Bioinformatics. 2022;23:459.
Zhang P, Wei Z, Che C, Jin B. DeepMGT-DTI: Transformer network incorporating multilayer graph information for drug–target interaction prediction. Comput Biol Med. 2022;142:105214.
Chu T, Nguyen TT, Hai BD, Nguyen QH, Nguyen T. Graph transformer for drug response prediction. IEEE/ACM Trans Comput Biol Bioinform. 2022. https://doi.org/10.1109/TCBB.2022.3206888.
Yan C, Suo Z, Wang J, Zhang G, Luo H. DACPGTN: drug ATC code prediction method based on graph transformer network for drug discovery. Front Pharmacol. 2022;13:907676.
El-allaly E, Sarrouti M, En-Nahnahi N, Ouatik El Alaoui S. An attentive joint model with transformer-based weighted graph convolutional network for extracting adverse drug event relation. J Biomed Inform. 2022;125:103968.
Chen D, Gao K, Nguyen DD, Chen X, Jiang Y, Wei G-W, Pan F. Algebraic graph-assisted bidirectional transformers for molecular property prediction. Nat Commun. 2021;12:3521.
Fradkin P, Young A, Atanackovic L, Frey B, Lee LJ, Wang B. A graph neural network approach for molecule carcinogenicity prediction. Bioinformatics. 2022;38:i84–91.
Zhang T, Guo X, Chen H, Fan S, Li Q, Chen S, Guo X, Zheng H. TG-GNN: transformer based geometric enhancement graph neural network for molecular property prediction. 2022. https://doi.org/10.21203/rs.3.rs-1795724/v1.
Lai P-T, Lu Z. BERT-GT: cross-sentence n-ary relation extraction with BERT and graph transformer. Bioinformatics. 2021;btaa1087.
Yang T, Bai X, Cui X, Gong Y, Li L. GraformerDIR: graph convolution transformer for deformable image registration. Comput Biol Med. 2022;147:105799.
Yuan Q, Chen S, Rao J, Zheng S, Zhao H, Yang Y. AlphaFold2-aware protein–DNA binding site prediction using graph transformer. Brief Bioinform. 2022;23:bbab564.
Dong S, Wang S. Assembled graph neural network using graph transformer with edges for protein model quality assessment. J Mol Graph Model. 2022;110:108053.
Jha K, Saha S, Singh H. Prediction of protein–protein interaction using graph neural networks. Sci Rep. 2022;12:8360.
Ma A, Xin G, Ma Q. The use of single-cell multi-omics in immuno-oncology. Nat Commun. 2022;13:2728.
Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal biomedical AI. Nat Med. 2022;28:1773–84.
Koorathota S, Khan Z, Lapborisuth P, Sajda P. Multimodal neurophysiological transformer for emotion recognition. In: 2022 44th annual international conference of the IEEE Engineering in Medicine & Biology Society (EMBC). 2022. p. 3563–7.
Shi Q, Fan J, Wang Z, Zhang Z. Multimodal channel-wise attention transformer inspired by multisensory integration mechanisms of the brain. Pattern Recogn. 2022;130:108837.
Monajatipoor M, Rouhsedaghat M, Li LH, Chien A, Kuo CCJ, Scalzo F, Chang KW. BERTHop: an effective vision-and-language model for chest X-ray disease diagnosis. 2021. https://doi.org/10.48550/arXiv.2108.04938.
Liu G, Liao Y, Wang F, Zhang B, Zhang L, Liang X, Wan X, Li S, Li Z, Zhang S. Medical-VLBERT: medical visual language BERT for COVID-19 CT report generation with alternate learning. IEEE Trans Neural Networks Learn Syst. 2021;32:3786–97.
Zhang Y, Jiang H, Miura Y, Manning CD, Langlotz CP. Contrastive learning of medical visual representations from paired images and text. In: Proceedings of machine learning for health care 2022. 2022.
Zeng Z, Yao Y, Liu Z, Sun M. A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nat Commun. 2022;13:862.
Speith T. A review of taxonomies of explainable artificial intelligence (XAI) methods. In: 2022 ACM conference on fairness, accountability, and transparency. New York, NY, USA: Association for Computing Machinery; 2022. p. 2239–50.
Vig J. BertViz: a tool for visualizing multihead self-attention in the BERT model. ICLR Workshop: Debugging Machine Learning Models. New Orleans: ICLR; 2019.
Kokalj E, Škrlj B, Lavrač N, Pollak S, Robnik-Šikonja M. BERT meets shapley: extending SHAP explanations to transformer-based classifiers. In: Proceedings of the EACL hackashop on news media content analysis and automated report generation. 2021. p. 16–21.
Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc.; 2017. p. 4768–77
Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. arXiv. 2017;1703:01365 [cs].
Saporta A, Gui X, Agrawal A, et al. Benchmarking saliency methods for chest X-ray interpretation. medRxiv. 2022;2021.02.28.21252634.
Lim B, Arik SO, Loeff N, Pfister T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. 2020. https://doi.org/10.48550/arXiv.1912.09363.
Chowdhery A, Narang S, Devlin J, et al. PaLM: scaling language modeling with pathways. arXiv. 2022;2204:02311 [cs].
Beltagy I, Peters ME, Cohan A. Longformer: the long-document transformer. arXiv. 2020;2004:05150.
Choromanski KM, Likhosherstov V, Dohan D, Song X, Gane A, Sarlos T, Hawkins P, Davis JQ, Mohiuddin A, Kaiser L. Rethinking attention with performers. International Conference on Learning Representations. Online: ICLR. 2021.
Kitaev N, Kaiser Ł, Levskaya A. Reformer: the efficient transformer. ArXiv. 2020;2001:04451 [cs, stat].
Tay Y, Dehghani M, Bahri D, Metzler D. Efficient transformers: a survey. ACM Comput Surv. 2022;55:1–109.
Zaheer M, Guruganesh G, Dubey KA, Ainslie J, Alberti C, Ontanon S, Pham P, Ravula A, Wang Q, Yang L. Big bird: transformers for longer sequences. Adv Neural Inf Process Syst. 2020;33:17283–97.
Gou J, Yu B, Maybank SJ, Tao D. Knowledge distillation: a survey. Int J Comput Vis. 2021;129:1789–819.
Liu W, Zhou P, Zhao Z, Wang Z, Ju Q, Deng H, Wang P. K-BERT: enabling language representation with knowledge graph. ArXiv. 2019;1909:07606 [cs].
Sun Y, Wang S, Li YK, Feng S, Tian H, Wu H, Wang H. ERNIE 2.0: a continual pre-training framework for language understanding. In: AAAI. 2020. p. 8968–75.
Zhang Z, Han X, Liu Z, Jiang X, Sun M, Liu Q. ERNIE: enhanced language representation with informative entities. arXiv. 2019;1905:07129.
Ahmad A, Fröhlich H. Integrating heterogeneous omics data via statistical inference and learning techniques. Genomics and Computational Biology. 2016. https://doi.org/10.18547/gcb.2016.vol2.iss1.e32.
Baevski A, Hsu W-N, Xu Q, Babu A, Gu J, Auli M. data2vec: a general framework for self-supervised learning in speech, vision and language. 2022. https://doi.org/10.48550/arXiv.2202.03555.
Jaegle A, Borgeaud S, Alayrac J-B, et al. Perceiver IO: a general architecture for structured inputs & outputs. 2022.
Jaegle A, Gimeno F, Brock A, Vinyals O, Zisserman A, Carreira J. Perceiver: general perception with iterative attention. In: International conference on machine learning. Online: PMLR. 2021. p. 4651–64.
OpenAI. ChatGPT (Mar 14 version) Large language model. 2023. https://chat.openai.com/chat.
Anand Y, Nussbaum Z, Duderstadt B, Schmidt B, Treat A. GPT4All: an ecosystem of open-source assistants that run on local hardware. 2023.
Verkuil R, Kabeli O, Du Y, Wicky BIM, Milles LF, Dauparas J, Baker D, Ovchinnikov S, Sercu T, Rives A. Language models generalize beyond natural proteins. bioRxiv. 2022;2022.12.21.521521.
Ali A, Schnake T, Eberle O, Montavon G, Müller K-R, Wolf L. XAI for transformers: better explanations through conservative propagation. 2022. https://doi.org/10.48550/arXiv.2202.07304.
Deb M, Deiseroth B, Weinbach S, Schramowski P, Kersting K. AtMan: understanding transformer predictions through memory efficient attention manipulation. 2023. https://doi.org/10.48550/arXiv.2301.08110.
Gavito AT, Klabjan D, Utke J. Multi-layer attention-based explainability via transformers for tabular data. 2023. https://doi.org/10.48550/arXiv.2302.14278.
Del Ser J, Barredo-Arrieta A, Díaz-Rodríguez N, Herrera F, Saranti A, Holzinger A. On generating trustworthy counterfactual explanations. Inf Sci. 2024;655:119898.
Metsch JM, Saranti A, Angerschmid A, Pfeifer B, Klemt V, Holzinger A, Hauschild A-C. CLARUS: an interactive explainable AI platform for manual counterfactuals in graph neural networks. J Biomed Inform. 2024;150:104600.
Acknowledgements
Not applicable.
Funding
Open Access funding enabled and organized by Projekt DEAL. Research reported in this publication was supported by Integration of Heterogeneous Data and Evidence towards Regulatory and HTA Acceptance (IDERHA), an Innovative Health Initiative (IHI) Joint Undertaking (JU) under grant agreement No 101112135. The JU receives support from the European Union’s Horizon Europe research and innovation programme, and life science industries represented by COCIR, EFPIA / Vaccines Europe, EuropaBio and MedTech Europe. Views and opinions expressed in this paper are those of the author(s) only and do not necessarily reflect those of the aforementioned parties. Neither of the aforementioned parties can be held responsible for them.
Author information
Authors and Affiliations
Contributions
Sumit Madan: Conceptualization, Methodology, Investigation, Visualization, Writing - Original Draft, and Writing - Review & Editing; Manuel Lentzen: Investigation, Writing - Original Draft, and Writing - Review & Editing; Johannes Brandt: Writing - Review & Editing; Daniel Rueckert: Writing - Review & Editing; Martin Hofmann-Apitius: Conceptualization, Supervision, and Writing - Review & Editing; Holger Fröhlich: Conceptualization, Methodology, Supervision, Writing - Original Draft, and Writing - Review & Editing.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Madan, S., Lentzen, M., Brandt, J. et al. Transformer models in biomedicine. BMC Med Inform Decis Mak 24, 214 (2024). https://doi.org/10.1186/s12911-024-02600-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12911-024-02600-5