Table 5 The external test set is used to compare the best-performing multi-task model, an ensemble of the four best-performing single-task classifiers (both determined by average performance across all metrics on the internal and external test datasets), and the two expert annotators. Bold values indicate which AI model pipeline performed better. All F1 scores are macro averaged

From: Uncertainty-aware automatic TNM staging classification for [18F] Fluorodeoxyglucose PET-CT reports for lung cancer utilising transformer-based language models and multi-task learning

|             | ACC_TNMu ↑ | ACC_TNM ↑ | HL_TNMu ↓ | F1_TNMu ↑ | F1_T ↑   | F1_N ↑ | F1_M ↑   | F1_u ↑   |
|-------------|------------|-----------|-----------|-----------|----------|--------|----------|----------|
| Multi-task  | **0.79**   | 0.84      | **0.07**  | **0.89**  | **0.91** | 0.95   | 0.90     | **0.78** |
| Single task | 0.74       | 0.84      | 0.08      | 0.87      | 0.89     | 0.95   | **0.92** | 0.70     |
| Annotator 1 | 0.90       | 0.93      | 0.04      | 0.94      | 0.95     | 0.99   | 0.96     | 0.84     |
| Annotator 2 | 0.89       | 0.93      | 0.04      | 0.93      | 0.94     | 0.99   | 0.95     | 0.83     |
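The caption notes that all F1 scores are macro averaged. As a quick illustration (not the authors' code), macro averaging computes F1 independently for each class and then takes the unweighted mean, so rare stage labels count as much as common ones. The labels below are hypothetical and are not drawn from the paper's test set:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged without
    weighting by class frequency."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical T-stage predictions for illustration only
y_true = ["T1", "T2", "T2", "T4"]
y_pred = ["T1", "T2", "T3", "T4"]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.667
```

Because every class contributes equally to the mean, a single poorly predicted rare class (here the spurious "T3") pulls the macro score down more than it would a frequency-weighted (micro) average.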