LLMs | Sensitivity | PPV | Specificity | NPV | Accuracy | AUC ROC | F1 score |
---|---|---|---|---|---|---|---|
Encoder-only models | |||||||
 BERT base model (uncased) [16] | 100.0 (100.0—100.0) | 75.0 (56.0—90.5) | 86.0 (75.5—94.7) | 100.0 (100.0—100.0) | 90.1 (81.7—95.8) | 93.0 (87.8—97.1) | 85.7 (73.2—95.0) |
 BERT base model (cased) [16] | 95.2 (84.2—100.0) | 76.9 (59.1—92.0) | 88.0 (78.9—96.2) | 97.8 (92.5—100.0) | 90.1 (83.1—95.8) | 91.6 (84.1—97.2) | 85.1 (70.6—94.7) |
 DistilBERT base model (uncased) [32] | 95.2 (83.3—100.0) | 74.1 (57.1—90.6) | 86.0 (75.5—95.7) | 97.7 (92.3—100.0) | 88.7 (81.7—95.8) | 90.6 (83.3—96.2) | 83.3 (69.8—93.3) |
 BioClinicalBERT [33] | 90.5 (76.5—100.0) | 79.2 (60.9—95.2) | 90.0 (80.0—98.0) | 95.7 (89.3—100.0) | 90.1 (83.1—97.2) | 90.2 (81.3—97.1) | 84.4 (71.1—94.5) |
 Tiny BERT [34] | 95.2 (84.6—100.0) | 76.9 (58.3—92.0) | 88.0 (77.5—96.2) | 97.8 (92.7—100.0) | 90.1 (83.1—97.2) | 91.6 (85.1—97.2) | 85.1 (72.4—94.7) |
 BERT multilingual base model (cased) [16] | 95.2 (84.2—100.0) | 87.0 (70.6—100.0) | 94.0 (86.0—100.0) | 97.9 (93.0—100.0) | 94.4 (88.7—98.6) | 94.6 (87.9—99.1) | 90.9 (81.1—98.0) |
 GatorTronS [35] | 100.0 (100.0—100.0) | 75.0 (56.5—90.9) | 86.0 (75.0—94.4) | 100.0 (100.0—100.0) | 90.1 (83.1—95.8) | 93.0 (88.0—97.1) | 85.7 (73.2—94.7) |
Encoder-decoder models | |||||||
 BART (base-sized model) [19] | 100.0 (100.0—100.0) | 84.0 (66.7—96.2) | 92.0 (83.6—98.1) | 100.0 (100.0—100.0) | 94.4 (88.7—98.6) | 96.0 (91.5—99.1) | 91.3 (81.1—98.0) |
 BART (large-sized model) [19] | 95.2 (84.6—100.0) | 80.0 (62.5—95.5) | 90.0 (81.8—98.0) | 97.8 (92.6—100.0) | 91.5 (84.5—97.2) | 92.6 (85.7—98.0) | 87.0 (74.3—96.3) |
 BART-large-mnli [19] | 90.5 (76.2—100.0) | 76.0 (57.1—91.3) | 88.0 (77.5—96.2) | 95.7 (88.9—100.0) | 88.7 (81.7—95.8) | 89.2 (80.9—96.3) | 82.6 (69.2—92.7) |
 FLAN-T5 small [36] | 76.2 (56.2—94.4) | 64.0 (44.4—82.6) | 82.0 (70.5—92.0) | 89.1 (79.1—97.6) | 80.3 (70.4—88.7) | 79.1 (68.0—89.2) | 69.6 (52.6—82.6) |
 T5-Large [20] | 90.5 (76.2—100.0) | 70.4 (51.4—86.2) | 84.0 (72.9—93.8) | 95.5 (88.1—100.0) | 85.9 (77.5—93.0) | 87.2 (77.5—94.7) | 79.2 (64.9—90.6) |
 T5-Small [20] | 90.5 (75.0—100.0) | 70.4 (50.0—86.4) | 84.0 (73.3—93.8) | 95.5 (88.1—100.0) | 85.9 (77.5—93.0) | 87.2 (78.4—95.0) | 79.2 (64.0—90.9) |
Decoder-only models | |||||||
 GPT-2 Large [21] | 100.0 (100.0—100.0) | 84.0 (66.7—96.2) | 92.0 (83.6—98.1) | 100.0 (100.0—100.0) | 94.4 (88.7—98.6) | 96.0 (91.5—99.1) | 91.3 (81.1—98.0) |
 GPT-2 [21] | 85.7 (68.8—100.0) | 78.3 (58.3—94.7) | 90.0 (80.8—97.9) | 93.8 (86.0—100.0) | 88.7 (81.7—95.8) | 87.9 (79.4—96.0) | 81.8 (68.4—92.7) |
Baseline models | |||||||
 Decision Tree-based method [6] | 90.5 (69.6–98.9) | 76.0 (59.6–87.2) | 87.8 (75.2–95.4) | 93.8 (86.0–100.0) | 88.6 (78.7–94.9) | 87.9 (79.4–96.0) | 81.8 (68.4–92.7) |
Fine-tuned models (Pipeline B) | |||||||
 GPT-2 fine-tuned | 95.3 (84.0–100.0) | 90.9 (76.5–100.0) | 96.0 (90.0–100.0) | 98.0 (93.3–100.0) | 95.8 (90.1–100.0) | 95.6 (89.4–100.0) | 93.0 (83.7–100.0) |
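
For readers who want to check how the columns relate to one another, the point estimates in a row can be reproduced from a 2×2 confusion matrix. The sketch below is illustrative only: the function name `binary_metrics` is hypothetical, and the counts in the usage line (21 positives, 50 negatives) are an assumption back-calculated from the BERT base (uncased) row, not figures stated in the table. The AUC line assumes hard 0/1 predictions, in which case the ROC curve has a single operating point and its area is the mean of sensitivity and specificity. The rows above are numerically consistent with that, but the table does not say how AUC was computed. The interval estimates are not reproduced here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return the table's point metrics, in percent, from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one actual positive,
    one actual negative, one predicted positive and one predicted negative).
    """
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    ppv = tp / (tp + fp)                  # positive predictive value (precision)
    npv = tn / (tn + fn)                  # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of PPV and sensitivity
    # Assumption: hard labels only, so AUC reduces to balanced accuracy.
    auc_hard_labels = (sensitivity + specificity) / 2

    values = {
        "Sensitivity": sensitivity,
        "PPV": ppv,
        "Specificity": specificity,
        "NPV": npv,
        "Accuracy": accuracy,
        "AUC ROC": auc_hard_labels,
        "F1 score": f1,
    }
    return {name: round(100 * value, 1) for name, value in values.items()}


# Hypothetical counts inferred from the BERT base (uncased) row.
print(binary_metrics(tp=21, fp=7, tn=43, fn=0))
```

With these assumed counts the function returns 100.0, 75.0, 86.0, 100.0, 90.1, 93.0 and 85.7, matching the point estimates reported for BERT base (uncased).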