Uncertainty-aware automatic TNM staging classification for [18F] Fluorodeoxyglucose PET-CT reports for lung cancer utilising transformer-based language models and multi-task learning

Table 4 A comparison of machine learning pipelines: two multi-task approaches using a shared GatorTron PLM encoder (one including and one excluding uncertainty labels in training), an ensemble of fine-tuned binary GatorTron classifiers, and a traditional machine learning model using TF-IDF encodings with an individual logistic regression classifier for each binary task. Each approach was trained three times with different random seeds, and the mean and standard deviation of each metric are reported. For the single-task ensembles, the ‘TNMu’ and ‘TNM’ metrics for each run are calculated using the models trained with the same random seed. Bold values represent the best performing pipeline for that metric on each test dataset. Arrows indicate whether higher (↑) or lower (↓) values are better. All F1 scores are macro averaged.

Dataset	Pipeline	ACC_TNMu ↑	ACC_TNM ↑	HL_TNMu ↓	F1_TNMu ↑	F1_T ↑	F1_N ↑	F1_M ↑	F1_u ↑
*Internal Test*	Multi-task (TNMu)	0.84 ± 0.01	0.86 ± 0.00	0.05 ± 0.00	0.92 ± 0.00	0.93 ± 0.00	0.94 ± 0.00	0.92 ± 0.01	0.87 ± 0.00
	Multi-task (TNM only)	N/a	0.85 ± 0.01	N/a	N/a	0.94 ± 0.01	0.95 ± 0.01	0.89 ± 0.01	N/a
	Single task	0.80 ± 0.02	0.86 ± 0.00	0.06 ± 0.00	0.91 ± 0.01	0.95 ± 0.00	0.96 ± 0.00	0.89 ± 0.02	0.85 ± 0.02
	TF-IDF + Logistic Regression	0.50 ± 0.00	0.60 ± 0.00	0.16 ± 0.00	0.66 ± 0.00	0.69 ± 0.00	0.81 ± 0.00	0.69 ± 0.00	0.45 ± 0.00
*External Test*	Multi-task (TNMu)	0.78 ± 0.01	0.83 ± 0.01	0.07 ± 0.00	0.88 ± 0.01	0.89 ± 0.02	0.95 ± 0.01	0.89 ± 0.01	0.77 ± 0.00
	Multi-task (TNM only)	N/a	0.83 ± 0.02	N/a	N/a	0.88 ± 0.01	0.95 ± 0.00	0.91 ± 0.02	N/a
	Single task	0.73 ± 0.00	0.82 ± 0.00	0.08 ± 0.00	0.85 ± 0.01	0.88 ± 0.00	0.95 ± 0.00	0.90 ± 0.01	0.68 ± 0.02
	TF-IDF + Logistic Regression	0.52 ± 0.00	0.61 ± 0.00	0.16 ± 0.00	0.64 ± 0.00	0.49 ± 0.00	0.85 ± 0.00	0.76 ± 0.00	0.46 ± 0.00
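
The metric columns in Table 4 can be illustrated with a minimal sketch. It rests on several assumptions that the caption does not confirm: that ACC_TNMu and ACC_TNM are exact-match accuracies over the four labels (T, N, M, u) and the three labels (T, N, M) respectively, that HL_TNMu is the Hamming loss over all four labels, that each task is binary, and that the reported spread is the sample standard deviation over the three seeds. All function names and the toy labels below are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev

# Assumed column order of the four binary tasks: T, N, M, u (uncertainty).
TNMU = (0, 1, 2, 3)
TNM = (0, 1, 2)


def exact_match_accuracy(y_true, y_pred, cols):
    """Fraction of reports where every label in `cols` is predicted correctly."""
    hits = sum(all(t[c] == p[c] for c in cols) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def hamming_loss(y_true, y_pred, cols):
    """Fraction of individual label decisions that are wrong."""
    wrong = sum(t[c] != p[c] for t, p in zip(y_true, y_pred) for c in cols)
    return wrong / (len(y_true) * len(cols))


def macro_f1_binary(y_true, y_pred, col):
    """Macro F1 for one binary task: unweighted mean of the F1 of class 0 and class 1."""
    scores = []
    for cls in (0, 1):
        pairs = [(t[col], p[col]) for t, p in zip(y_true, y_pred)]
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never present and never predicted scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


# Toy labels for four hypothetical reports (not data from the study).
y_true = [[1, 0, 0, 0], [1, 1, 0, 1], [0, 0, 1, 0], [1, 1, 1, 0]]
y_pred = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [1, 0, 1, 0]]

acc_tnmu = exact_match_accuracy(y_true, y_pred, TNMU)  # 0.5
acc_tnm = exact_match_accuracy(y_true, y_pred, TNM)    # 0.75
hl_tnmu = hamming_loss(y_true, y_pred, TNMU)           # 0.125
f1_per_task = [macro_f1_binary(y_true, y_pred, c) for c in TNMU]
f1_tnmu = mean(f1_per_task)  # assumed: average of the four per-task macro F1 scores

# Aggregating one metric over three seeds (toy values) as "mean ± std".
per_seed = [0.5, 0.75, 1.0]
summary = f"{mean(per_seed):.2f} ± {stdev(per_seed):.2f}"
print(acc_tnmu, acc_tnm, hl_tnmu, summary)
```

Under these assumptions, ACC_TNMu can never exceed ACC_TNM for the same predictions, because a report must also have its uncertainty label correct to count as an exact match. This is consistent with the ordering of the two columns in every row of the table.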

ISSN: 1472-6947