Table 1 LLM performance statistics with 95% confidence intervals

From: Validation of large language models for detecting pathologic complete response in breast cancer using population-based pathology reports

| LLMs | Sensitivity | PPV | Specificity | NPV | Accuracy | AUC ROC | F1 score |
|---|---|---|---|---|---|---|---|
| Encoder-only models | | | | | | | |
| BERT base model (uncased) [16] | 100.0 (100.0–100.0) | 75.0 (56.0–90.5) | 86.0 (75.5–94.7) | 100.0 (100.0–100.0) | 90.1 (81.7–95.8) | 93.0 (87.8–97.1) | 85.7 (73.2–95.0) |
| BERT base model (cased) [16] | 95.2 (84.2–100.0) | 76.9 (59.1–92.0) | 88.0 (78.9–96.2) | 97.8 (92.5–100.0) | 90.1 (83.1–95.8) | 91.6 (84.1–97.2) | 85.1 (70.6–94.7) |
| DistilBERT base model (uncased) [32] | 95.2 (83.3–100.0) | 74.1 (57.1–90.6) | 86.0 (75.5–95.7) | 97.7 (92.3–100.0) | 88.7 (81.7–95.8) | 90.6 (83.3–96.2) | 83.3 (69.8–93.3) |
| BioClinicalBERT [33] | 90.5 (76.5–100.0) | 79.2 (60.9–95.2) | 90.0 (80.0–98.0) | 95.7 (89.3–100.0) | 90.1 (83.1–97.2) | 90.2 (81.3–97.1) | 84.4 (71.1–94.5) |
| Tiny BERT [34] | 95.2 (84.6–100.0) | 76.9 (58.3–92.0) | 88.0 (77.5–96.2) | 97.8 (92.7–100.0) | 90.1 (83.1–97.2) | 91.6 (85.1–97.2) | 85.1 (72.4–94.7) |
| BERT multilingual base model (cased) [16] | 95.2 (84.2–100.0) | 87.0 (70.6–100.0) | 94.0 (86.0–100.0) | 97.9 (93.0–100.0) | 94.4 (88.7–98.6) | 94.6 (87.9–99.1) | 90.9 (81.1–98.0) |
| GatorTronS [35] | 100.0 (100.0–100.0) | 75.0 (56.5–90.9) | 86.0 (75.0–94.4) | 100.0 (100.0–100.0) | 90.1 (83.1–95.8) | 93.0 (88.0–97.1) | 85.7 (73.2–94.7) |
| Encoder-decoder models | | | | | | | |
| BART (base-sized model) [19] | 100.0 (100.0–100.0) | 84.0 (66.7–96.2) | 92.0 (83.6–98.1) | 100.0 (100.0–100.0) | 94.4 (88.7–98.6) | 96.0 (91.5–99.1) | 91.3 (81.1–98.0) |
| BART (large-sized model) [19] | 95.2 (84.6–100.0) | 80.0 (62.5–95.5) | 90.0 (81.8–98.0) | 97.8 (92.6–100.0) | 91.5 (84.5–97.2) | 92.6 (85.7–98.0) | 87.0 (74.3–96.3) |
| BART-large-mnli [19] | 90.5 (76.2–100.0) | 76.0 (57.1–91.3) | 88.0 (77.5–96.2) | 95.7 (88.9–100.0) | 88.7 (81.7–95.8) | 89.2 (80.9–96.3) | 82.6 (69.2–92.7) |
| FLAN-T5 small [36] | 76.2 (56.2–94.4) | 64.0 (44.4–82.6) | 82.0 (70.5–92.0) | 89.1 (79.1–97.6) | 80.3 (70.4–88.7) | 79.1 (68.0–89.2) | 69.6 (52.6–82.6) |
| T5-Large [20] | 90.5 (76.2–100.0) | 70.4 (51.4–86.2) | 84.0 (72.9–93.8) | 95.5 (88.1–100.0) | 85.9 (77.5–93.0) | 87.2 (77.5–94.7) | 79.2 (64.9–90.6) |
| T5-Small [20] | 90.5 (75.0–100.0) | 70.4 (50.0–86.4) | 84.0 (73.3–93.8) | 95.5 (88.1–100.0) | 85.9 (77.5–93.0) | 87.2 (78.4–95.0) | 79.2 (64.0–90.9) |
| Decoder-only models | | | | | | | |
| GPT-2 Large [21] | 100.0 (100.0–100.0) | 84.0 (66.7–96.2) | 92.0 (83.6–98.1) | 100.0 (100.0–100.0) | 94.4 (88.7–98.6) | 96.0 (91.5–99.1) | 91.3 (81.1–98.0) |
| GPT-2 [21] | 85.7 (68.8–100.0) | 78.3 (58.3–94.7) | 90.0 (80.8–97.9) | 93.8 (86.0–100.0) | 88.7 (81.7–95.8) | 87.9 (79.4–96.0) | 81.8 (68.4–92.7) |
| Baseline models | | | | | | | |
| Decision Tree based method [6] | 90.5 (69.6–98.9) | 76.0 (59.6–87.2) | 87.8 (75.2–95.4) | 93.8 (86.0–100.0) | 88.6 (78.7–94.9) | 87.9 (79.4–96.0) | 81.8 (68.4–92.7) |
| Fine-tuned models (Pipeline B) | | | | | | | |
| GPT-2 fine-tuned | 95.3 (84.0–100.0) | 90.9 (76.5–100.0) | 96.0 (90.0–100.0) | 98.0 (93.3–100.0) | 95.8 (90.1–100.0) | 95.6 (89.4–100.0) | 93.0 (83.7–100.0) |

LLMs: Large Language Models; NPV: Negative Predictive Value; PPV: Positive Predictive Value
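
For readers who want to check how the point estimates in Table 1 relate to each other, the following is a minimal Python sketch of the standard confusion-matrix definitions of sensitivity, PPV, specificity, NPV, accuracy, and F1 score. The function names `_ratio` and `binary_metrics` are hypothetical, and the table does not report the underlying counts: the counts in the example (21 true positives, 7 false positives, 43 true negatives, 0 false negatives) are an assumption, chosen only because they reproduce the point estimates in the BERT base model (uncased) row. The sketch does not compute the 95% confidence intervals.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage, or NaN if undefined."""
    return 100.0 * numerator / denominator if denominator else float("nan")


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics (in percent) from raw counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),            # true positive rate
        "ppv": _ratio(tp, tp + fp),                    # precision
        "specificity": _ratio(tn, tn + fp),            # true negative rate
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),        # harmonic mean of PPV and sensitivity
    }


# ASSUMED counts (not stated in the table): picked because they are
# consistent with the BERT base model (uncased) point estimates.
example = binary_metrics(tp=21, fp=7, tn=43, fn=0)
for name, value in example.items():
    print(f"{name}: {value:.1f}")
```

AUC ROC is left out because it generally requires predicted scores rather than counts. For the row used here, the mean of sensitivity and specificity, (100.0 + 86.0) / 2 = 93.0, matches the reported AUC ROC, which is what an ROC curve built from hard 0/1 predictions alone would give; whether the authors computed it that way is not stated in the table.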