Table 1 Results (macro \(F_1\) score\(^a\)) for each clinical NLP task and each modeling paradigm

From: NLP modeling recommendations for restricted data availability in clinical settings

| Model | Paradigm | Prioritization | Specialty | NER |
|---|---|---|---|---|
| xlm-roberta | Fine-tune & predict | 88.85 % | 51.71 % | 11.09 % |
| xlm-roberta | Cont. pre-train., fine-tune & pred. | 89.03 % (+0.18) | 52.36 % (+0.65) | 13.85 % (+2.76) |
| roberta-bne | Fine-tune & predict | 88.58 % | 52.50 % | 22.59 % |
| roberta-bne | Cont. pre-train., fine-tune & pred. | 88.80 % (+0.22) | 51.65 % (−0.85) | 23.29 % (+0.70) |
| roberta-biomedical-clinical | Fine-tune & predict | 88.80 % | 53.79 % | 34.46 % |
| roberta-biomedical-clinical | Cont. pre-train., fine-tune & pred. | 88.85 % (+0.05) | 53.85 % (+0.06) | 37.25 % (+2.79) |
| Llama 2 | Prompt & predict (Zero-shot) | 6.49 % | 31.41 % | 5.31 % |
| Llama 2 | Prompt & predict (Few-shot) | 56.70 % (+50.21) | 31.91 % (+0.50) | 15.44 % (+10.13) |
| Llama 3 | Prompt & predict (Zero-shot) | 36.87 % | 38.49 % | 17.59 % |
| Llama 3 | Prompt & predict (Few-shot) | 47.64 % (+10.77) | 48.50 % (+10.01) | 23.14 % (+5.55) |

\(^a\) Macro \(F_1\) score is the unweighted average of the \(F_1\) scores calculated for each class, treating all classes equally regardless of their frequency
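
The macro \(F_1\) definition in the footnote can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function name `macro_f1` and the toy labels `"urgent"` and `"routine"` are hypothetical, chosen only to show how a rare class carries the same weight as a frequent one.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    # Count true positives, false positives and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    # Every class seen in either list contributes one F1 score.
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the denominator is 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)

    # Plain average: class frequency plays no role.
    return sum(scores) / len(scores)


# Hypothetical toy labels (not data from the paper).
y_true = ["urgent", "urgent", "routine", "routine", "routine", "routine"]
y_pred = ["urgent", "routine", "routine", "routine", "routine", "routine"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

In this toy case the per-class scores are 2/3 for `urgent` and 8/9 for `routine`, so the macro average is 7/9 even though `routine` covers twice as many examples. A frequency-weighted average would instead lean toward the `routine` score.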