Table 3 Comparison of PLMs on the internal validation set. Each model was fine-tuned three times with different random seeds for 5 epochs. Results are reported as the mean and standard deviation of each metric across those training runs. The training corpora focus indicates the type of corpus each PLM was pre-trained on. Bold text indicates the best result for that metric. All F1 scores are macro averaged

From: Uncertainty-aware automatic TNM staging classification for [18F] Fluorodeoxyglucose PET-CT reports for lung cancer utilising transformer-based language models and multi-task learning

| PLM | Training Corpora Focus | Parameters | ACC_TNMu ↑ | HL_TNMu ↓ | F1_T ↑ | F1_N ↑ | F1_M ↑ | F1_u ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT (Base) [22] | General | 110 m | 0.63 ± 0.03 | 0.13 ± 0.01 | 0.78 ± 0.04 | 0.91 ± 0.02 | 0.80 ± 0.02 | 0.46 ± 0.00 |
| BERT (Large) [22] | General | 340 m | 0.67 ± 0.02 | 0.12 ± 0.01 | 0.83 ± 0.03 | 0.91 ± 0.01 | 0.81 ± 0.01 | 0.53 ± 0.08 |
| RoBERTa (Base) [59] | General | 125 m | 0.69 ± 0.03 | 0.10 ± 0.01 | 0.83 ± 0.02 | 0.93 ± 0.01 | 0.84 ± 0.01 | 0.56 ± 0.08 |
| RoBERTa (Large) [59] | General | 355 m | 0.77 ± 0.01 | 0.08 ± 0.01 | 0.91 ± 0.01 | 0.93 ± 0.01 | 0.87 ± 0.02 | 0.74 ± 0.01 |
| BioBERT (Base) [60] | Biomedical | 110 m | 0.70 ± 0.01 | 0.10 ± 0.00 | 0.85 ± 0.01 | 0.92 ± 0.01 | 0.85 ± 0.01 | 0.50 ± 0.02 |
| BioBERT (Large) [60] | Biomedical | 340 m | 0.75 ± 0.02 | 0.09 ± 0.01 | 0.91 ± 0.02 | 0.93 ± 0.01 | 0.86 ± 0.02 | 0.70 ± 0.02 |
| BioClinicalBERT [63] | Clinical | 110 m | 0.66 ± 0.01 | 0.12 ± 0.00 | 0.81 ± 0.03 | 0.90 ± 0.01 | 0.78 ± 0.03 | 0.46 ± 0.00 |
| BioMegatron [61] | Biomedical | 345 m | 0.76 ± 0.02 | 0.08 ± 0.01 | 0.90 ± 0.01 | 0.95 ± 0.01 | 0.85 ± 0.01 | 0.73 ± 0.00 |
| RadBERT [64] | Clinical | 110 m | 0.62 ± 0.01 | 0.13 ± 0.01 | 0.78 ± 0.02 | 0.88 ± 0.03 | 0.79 ± 0.03 | 0.46 ± 0.00 |
| RadBERT-RoBERTa-4 m | Clinical | 125 m | 0.71 ± 0.01 | 0.10 ± 0.01 | 0.88 ± 0.02 | 0.93 ± 0.01 | 0.84 ± 0.01 | 0.59 ± 0.02 |
| GatorTron (Base) [62] | Clinical | 345 m | **0.84 ± 0.01** | **0.06 ± 0.01** | **0.96 ± 0.01** | **0.96 ± 0.00** | **0.90 ± 0.01** | **0.81 ± 0.01** |
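For readers following the evaluation protocol in the caption, the snippet below is a minimal sketch (not the authors' code) of how the tabulated values could be produced: a macro-averaged F1 per task head, collected over three seeded fine-tuning runs and summarised as mean ± standard deviation. The helper `evaluate_seed` and its synthetic labels are illustrative assumptions standing in for the actual fine-tuning pipeline.

```python
# Sketch of the aggregation used in Table 3: macro F1 per run, then
# mean ± std across three random seeds. Illustrative only.
import numpy as np
from sklearn.metrics import f1_score

SEEDS = [0, 1, 2]  # three fine-tuning runs with different random seeds


def evaluate_seed(seed):
    """Hypothetical placeholder: in the real pipeline this would fine-tune
    the PLM for 5 epochs with `seed` and return (y_true, y_pred) for one
    task head (e.g. T stage) on the internal validation set."""
    rng = np.random.default_rng(seed)
    y_true = rng.integers(0, 5, size=200)  # synthetic stage labels
    y_pred = rng.integers(0, 5, size=200)
    return y_true, y_pred


macro_f1s = []
for seed in SEEDS:
    y_true, y_pred = evaluate_seed(seed)
    # "macro" weights every class equally, as stated in the caption
    macro_f1s.append(f1_score(y_true, y_pred, average="macro"))

print(f"F1 = {np.mean(macro_f1s):.2f} \u00b1 {np.std(macro_f1s):.2f}")
```

The same loop would be repeated per metric (accuracy and Hamming loss for the full TNMu label set, F1 per task head) and per PLM to fill one row of the table.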