Table 1 Distribution of presenting complaints by data clusters

From: Leveraging large language models to mimic domain expert labeling in unstructured text-based electronic healthcare records in non-english languages

| Presenting Complaints | Structured text data | Unstructured text data (“Others ()”) | Typographical errors |
| --- | --- | --- | --- |
| Total RTI Patient | 98,904 | 3828 | 909 |
| Fever | 76,408 | 1131 | 246 |
| Cough | 53,866 | 890 | 143 |
| Fatigue | 19,926 | 33 | 19 |
| Sore throat | 10,897 | 302 | 47 |
| Ear pain | 9568 | 162 | 21 |
| Respiratory Distress | 4630 | 31 | 16 |
| Non-cardiac chest pain | 4077 | 78 | 8 |
| URTI | 3513 | 70 | - |
| Crackles | 718 | 249 | 2 |
| Wheezing | 246 | 5 | 34 |
| COVID | 9 | 289 | 423 |
| Other RTI complaints (a) | 80 | 1449 | 117 |
| Total words in the text (b) | 717,153 | 64,529 | 12,117 |
| Total categorized label (c) | 183,938 | 4689 | 1076 |

This table summarizes the distribution of presenting complaints from patients admitted to the PED. The complaints are categorized into structured text data, unstructured text data (“Others ()”), and typographical errors. The data include RTI-related complaints such as fever, cough, and sore throat, as well as non-respiratory issues such as ear pain and non-cardiac chest pain. The table also presents the total word count and the categorized labels extracted through standard filtering methods. The structured text data contain the highest number of RTI complaints, while the unstructured category reflects poorly labeled cases, many of which were identified as RTIs after further analysis.

a: Other RTI labels, given in English: Flu, Cold, Nasal congestion, Wheezing, Rhonchi, Asthma, Croup, Bronchiolitis, Pneumonia, Febrile convulsion, Lymphadenitis, Tonsillitis, Influenza, Laryngitis, Sputum.

b: Total words in the text: the total number of words in the text data, segmented by data cluster.

c: Total categorized label: the number of categorical variables that can be extracted from the content of the text data through standard filtering methods.
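As a consistency check on the values above, the short Python sketch below sums the per-complaint counts in each data cluster and compares the result with the "Total categorized label" row. The source does not state that this row is a column sum; that relationship is only an observation from the numbers. The variable names are illustrative, and treating the "-" entry for URTI under typographical errors as 0 is an assumption.

```python
# Per-complaint counts transcribed from Table 1, in column order:
# (structured, unstructured "Others ()", typographical errors).
# Assumption: the "-" entry for URTI typographical errors is counted as 0.
counts = {
    "Fever": (76408, 1131, 246),
    "Cough": (53866, 890, 143),
    "Fatigue": (19926, 33, 19),
    "Sore throat": (10897, 302, 47),
    "Ear pain": (9568, 162, 21),
    "Respiratory Distress": (4630, 31, 16),
    "Non-cardiac chest pain": (4077, 78, 8),
    "URTI": (3513, 70, 0),
    "Crackles": (718, 249, 2),
    "Wheezing": (246, 5, 34),
    "COVID": (9, 289, 423),
    "Other RTI complaints": (80, 1449, 117),
}

# "Total categorized label" row as reported in the table.
reported_totals = (183938, 4689, 1076)

# Sum each column across all complaint rows.
column_sums = tuple(sum(column) for column in zip(*counts.values()))

print(column_sums)
print(column_sums == reported_totals)
```

In every cluster the label total exceeds the "Total RTI Patient" count. A plausible reading is that a single patient record can carry more than one presenting complaint, although the table itself does not say so.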