A systematic review of large language model (LLM) evaluations in clinical medicine

Table 1 Key questions for data extraction

Q1	Based on the article provided, which medical field does it pertain to?
Q2	Is the article written in a language other than English? (Yes = 1, No = 0)
Q3	Is an LLM or GPT mentioned in the article used for educational purposes in the medical/clinical field? (Yes = 1, No = 0)
Q4	Is an LLM or GPT mentioned in the article used for examination and evaluation purposes in the medical/clinical field? (Yes = 1, No = 0)
Q5	Is the evaluation of the LLM or GPT conducted by humans or compared with humans? (Yes = 1, No = 0)
Q6	What are the names and versions of the LLM(s) or GPT model(s) evaluated in the article?
Q7	What is the targeted group of interest for the LLM or GPT mentioned in the article (e.g., doctors, nurses, students, patients)?
Q8	How are the responses of the LLM evaluated?
Q9	What is the gold standard against which the LLM’s responses are compared?
Q10	What tools, scales, or sets of questions are used in the evaluation, and how many questions are there?
Q11	What parameters are assessed to measure the LLM’s responses?
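
As an illustration of how answers to the questions in Table 1 might be coded into a single extraction record, the following is a minimal Python sketch. It assumes that Q2 to Q5 are stored as binary values (yes = 1, no = 0) and that the remaining questions are kept as free text. The names `parse_binary` and `code_record` and the example answers are hypothetical and are not taken from the review itself.

```python
# Hypothetical coding helper for the Table 1 questions; all names are illustrative.
BINARY_QUESTIONS = ("Q2", "Q3", "Q4", "Q5")  # coded yes = 1, no = 0
FREE_TEXT_QUESTIONS = ("Q1", "Q6", "Q7", "Q8", "Q9", "Q10", "Q11")


def parse_binary(answer):
    """Map a yes/no style answer to 1 or 0 and reject anything else."""
    normalized = str(answer).strip().lower()
    if normalized in ("yes", "1"):
        return 1
    if normalized in ("no", "0"):
        return 0
    raise ValueError(f"expected yes/no or 1/0, got {answer!r}")


def code_record(raw):
    """Return one coded extraction record; raises KeyError if a question is missing."""
    coded = {q: parse_binary(raw[q]) for q in BINARY_QUESTIONS}
    coded.update({q: str(raw[q]).strip() for q in FREE_TEXT_QUESTIONS})
    return coded


# Placeholder answers for illustration only; not drawn from any reviewed article.
example = {
    "Q1": "placeholder field",
    "Q2": "No",
    "Q3": "yes",
    "Q4": "0",
    "Q5": "Yes",
    "Q6": "placeholder model",
    "Q7": "students",
    "Q8": "placeholder method",
    "Q9": "placeholder standard",
    "Q10": "placeholder tool",
    "Q11": "placeholder parameters",
}
coded = code_record(example)
```

Rejecting answers outside yes/no or 1/0, rather than defaulting them to 0, keeps ambiguous extractions visible for manual review.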

ISSN: 1472-6947