Table 2 Corpus statistics

From: A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish

Metric                    Total
Documents                 1,787
Tokens                    221,854
Vocabulary                20,779
Lexical diversity         9.4%
Tokens per document       124 ± 93
Entities per document     8.6 ± 5.7
Annotated tokens          27,036
PII entities              5,460
Medical entities          10,019
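
The table does not say how lexical diversity is computed. A minimal sketch, assuming it is the type-token ratio (vocabulary size divided by total tokens), shows how two of the derived rows can be checked against the raw counts. The variable names are illustrative and are not taken from the paper.

```python
# Raw counts copied from the table.
documents = 1_787
tokens = 221_854
vocabulary = 20_779

# Assumption: lexical diversity is the type-token ratio,
# i.e. distinct word forms divided by total tokens.
lexical_diversity = vocabulary / tokens

# Mean document length in tokens.
mean_tokens_per_doc = tokens / documents

print(f"Lexical diversity: {lexical_diversity:.1%}")
print(f"Mean tokens per document: {mean_tokens_per_doc:.0f}")
```

Under that assumption the script prints 9.4% and 124, which agree with the reported lexical diversity and mean tokens per document.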