Fig. 4

Zipf's eponymous law [76], illustrated by a selection of word counts from the 863 937 clinical notes in the dataset used for our experiments, which yielded 863 937 unique unigram tokens and 1 803 428 common phrases in the knowledge base. Frequent words account for a large percentage of the text, but a large portion of words appear at a low frequency.
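
The rank-frequency pattern described in this caption can be checked on any tokenized corpus. The following is a minimal sketch, not the procedure used to produce the figure: the function names (`rank_frequency`, `coverage_of_top`, `hapax_fraction`) and the toy sentence are assumptions introduced for illustration, and whitespace splitting stands in for whatever tokenizer was applied to the clinical notes.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


def coverage_of_top(tokens, k):
    """Fraction of all token occurrences covered by the k most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(freq for _, freq in counts.most_common(k)) / total


def hapax_fraction(tokens):
    """Fraction of distinct words that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return sum(1 for freq in counts.values() if freq == 1) / len(counts)


# Hypothetical toy input; a real analysis would read the clinical notes instead.
toy_tokens = "the patient reports the pain and the patient denies fever".split()

for rank, word, freq in rank_frequency(toy_tokens):
    print(rank, word, freq)

print("coverage of top 2 words:", coverage_of_top(toy_tokens, 2))
print("fraction of words seen once:", hapax_fraction(toy_tokens))
```

Under Zipf's law the count of a word falls roughly in inverse proportion to its rank, so plotting count against rank on log-log axes gives an approximately straight line. The two summary functions correspond to the two halves of the caption's closing sentence: a few top-ranked words cover much of the text, while many distinct words occur only rarely.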