Validation of large language models for detecting pathologic complete response in breast cancer using population-based pathology reports

Cheligeer, Ken; Wu, Guosong; Laws, Alison; Quan, May Lynn; Li, Andrea; Brisson, Anne-Marie; Xie, Jason; Xu, Yuan

doi:10.1186/s12911-024-02677-y

BMC Medical Informatics and Decision Making

Table 1 LLM performance statistics (%) with 95% confidence intervals

LLMs	Sensitivity	PPV	Specificity	NPV	Accuracy	AUC ROC	F1 score
Encoder-only models
BERT base model (uncased) [16]	100.0 (100.0–100.0)	75.0 (56.0–90.5)	86.0 (75.5–94.7)	100.0 (100.0–100.0)	90.1 (81.7–95.8)	93.0 (87.8–97.1)	85.7 (73.2–95.0)
BERT base model (cased) [16]	95.2 (84.2–100.0)	76.9 (59.1–92.0)	88.0 (78.9–96.2)	97.8 (92.5–100.0)	90.1 (83.1–95.8)	91.6 (84.1–97.2)	85.1 (70.6–94.7)
DistilBERT base model (uncased) [32]	95.2 (83.3–100.0)	74.1 (57.1–90.6)	86.0 (75.5–95.7)	97.7 (92.3–100.0)	88.7 (81.7–95.8)	90.6 (83.3–96.2)	83.3 (69.8–93.3)
BioClinicalBERT [33]	90.5 (76.5–100.0)	79.2 (60.9–95.2)	90.0 (80.0–98.0)	95.7 (89.3–100.0)	90.1 (83.1–97.2)	90.2 (81.3–97.1)	84.4 (71.1–94.5)
Tiny BERT [34]	95.2 (84.6–100.0)	76.9 (58.3–92.0)	88.0 (77.5–96.2)	97.8 (92.7–100.0)	90.1 (83.1–97.2)	91.6 (85.1–97.2)	85.1 (72.4–94.7)
BERT multilingual base model (cased) [16]	95.2 (84.2–100.0)	87.0 (70.6–100.0)	94.0 (86.0–100.0)	97.9 (93.0–100.0)	94.4 (88.7–98.6)	94.6 (87.9–99.1)	90.9 (81.1–98.0)
GatorTronS [35]	100.0 (100.0–100.0)	75.0 (56.5–90.9)	86.0 (75.0–94.4)	100.0 (100.0–100.0)	90.1 (83.1–95.8)	93.0 (88.0–97.1)	85.7 (73.2–94.7)
Encoder-decoder models
BART (base-sized model) [19]	100.0 (100.0–100.0)	84.0 (66.7–96.2)	92.0 (83.6–98.1)	100.0 (100.0–100.0)	94.4 (88.7–98.6)	96.0 (91.5–99.1)	91.3 (81.1–98.0)
BART (large-sized model) [19]	95.2 (84.6–100.0)	80.0 (62.5–95.5)	90.0 (81.8–98.0)	97.8 (92.6–100.0)	91.5 (84.5–97.2)	92.6 (85.7–98.0)	87.0 (74.3–96.3)
BART-large-mnli [19]	90.5 (76.2–100.0)	76.0 (57.1–91.3)	88.0 (77.5–96.2)	95.7 (88.9–100.0)	88.7 (81.7–95.8)	89.2 (80.9–96.3)	82.6 (69.2–92.7)
FLAN-T5 small [36]	76.2 (56.2–94.4)	64.0 (44.4–82.6)	82.0 (70.5–92.0)	89.1 (79.1–97.6)	80.3 (70.4–88.7)	79.1 (68.0–89.2)	69.6 (52.6–82.6)
T5-Large [20]	90.5 (76.2–100.0)	70.4 (51.4–86.2)	84.0 (72.9–93.8)	95.5 (88.1–100.0)	85.9 (77.5–93.0)	87.2 (77.5–94.7)	79.2 (64.9–90.6)
T5-Small [20]	90.5 (75.0–100.0)	70.4 (50.0–86.4)	84.0 (73.3–93.8)	95.5 (88.1–100.0)	85.9 (77.5–93.0)	87.2 (78.4–95.0)	79.2 (64.0–90.9)
Decoder-only models
GPT-2 Large [21]	100.0 (100.0–100.0)	84.0 (66.7–96.2)	92.0 (83.6–98.1)	100.0 (100.0–100.0)	94.4 (88.7–98.6)	96.0 (91.5–99.1)	91.3 (81.1–98.0)
GPT-2 [21]	85.7 (68.8–100.0)	78.3 (58.3–94.7)	90.0 (80.8–97.9)	93.8 (86.0–100.0)	88.7 (81.7–95.8)	87.9 (79.4–96.0)	81.8 (68.4–92.7)
Baseline models
Decision Tree based method [6]	90.5 (69.6–98.9)	76.0 (59.6–87.2)	87.8 (75.2–95.4)	93.8 (86.0–100.0)	88.6 (78.7–94.9)	87.9 (79.4–96.0)	81.8 (68.4–92.7)
Fine-tuned models (Pipeline B)
GPT-2 fine-tuned	95.3 (84.0–100.0)	90.9 (76.5–100.0)	96.0 (90.0–100.0)	98.0 (93.3–100.0)	*95.8 (90.1–100.0)*	95.6 (89.4–100.0)	*93.0 (83.7–100.0)*

LLMs: Large Language Models; NPV: Negative Predictive Value; PPV: Positive Predictive Value; AUC ROC: area under the receiver operating characteristic curve
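
As a reading aid, the point estimates in the table can be related to a 2×2 confusion matrix. The sketch below is illustrative only: the table does not report raw counts, so the counts used here (21 pCR-positive and 50 pCR-negative reports, with 7 false positives) are an assumption chosen because they reproduce the "BERT base model (uncased)" row; `ConfusionCounts` and `metrics_percent` are hypothetical names, not code from the study. AUC ROC is left out because in general it depends on the model's scores rather than on the four counts alone.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    """Cell counts of a binary confusion matrix (pCR = positive class)."""
    tp: int  # pCR reports predicted as pCR
    fp: int  # non-pCR reports predicted as pCR
    tn: int  # non-pCR reports predicted as non-pCR
    fn: int  # pCR reports predicted as non-pCR


def metrics_percent(c: ConfusionCounts) -> Dict[str, float]:
    """Return the table's count-based metrics as percentages (1 decimal).

    Assumes every denominator is non-zero, i.e. both classes are present
    and the model predicts each class at least once.
    """
    total = c.tp + c.fp + c.tn + c.fn
    raw = {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / total,
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
    }
    return {name: round(100 * value, 1) for name, value in raw.items()}


# ASSUMED counts (not stated in the table), consistent with the
# "BERT base model (uncased)" row: 21 positives, 50 negatives.
assumed = ConfusionCounts(tp=21, fp=7, tn=43, fn=0)
print(metrics_percent(assumed))
# {'sensitivity': 100.0, 'specificity': 86.0, 'ppv': 75.0,
#  'npv': 100.0, 'accuracy': 90.1, 'f1': 85.7}
```

With no false negatives, sensitivity and NPV are both 100%, while the 7 assumed false positives account for the lower PPV (21/28) and specificity (43/50) in that row.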

ISSN: 1472-6947