
Table 1 Standard performance metrics. This list, taken from [11], describes performance metrics typically used for ML-based classification tasks. Only metrics that contain no risk-based considerations according to the specification in our paper are included. It is assumed that the numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are given. See [11] for more details on the definition and use of these metrics

From: Risk-based evaluation of machine learning-based classification methods used for medical devices

General / overarching definitions

Number of actual positive cases:

\(P = TP + FN\)

Number of actual negative cases:

\(N = TN + FP\)

Number of predicted positive cases:

\(PP = TP + FP\)

Number of predicted negative cases:

\(PN = TN + FN\)

Total Population:

\(Pop = P + N\)

Prevalence:

\(Prev = \frac{P}{P+N} = \frac{P}{Pop}\)
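As an illustration (not part of [11] or our paper), the derived quantities can be computed directly from the four counts; the Python sketch below uses made-up counts:

```python
# Derived counts from a binary confusion matrix.
# TP, FP, TN, FN are assumed given; the values are made up for illustration.
TP, FP, TN, FN = 80, 10, 95, 15

P = TP + FN     # number of actual positive cases
N = TN + FP     # number of actual negative cases
PP = TP + FP    # number of predicted positive cases
PN = TN + FN    # number of predicted negative cases
Pop = P + N     # total population
Prev = P / Pop  # prevalence

print(P, N, PP, PN, Pop, round(Prev, 3))  # 95 105 90 110 200 0.475
```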

Metrics documented in the literature review within this study

Sensitivity / Recall / True Positive Rate:

\(TPR = \frac{TP}{P}\)

Specificity / True Negative Rate:

\(TNR = \frac{TN}{N}\)

Accuracy:

\(Acc = \frac{TP + TN}{TP + FP + TN + FN}\)

and its complement, the Error Rate:

\(Err = 1 - Acc\)

Balanced Accuracy, i.e. accuracy after balancing the positive and negative test samples / class members:

\(BA = \frac{TPR + TNR}{2}\)
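A minimal Python sketch of the fixed-threshold rate metrics above, again with illustrative counts that are not from the paper:

```python
# Rate-based metrics from the four confusion-matrix counts (illustrative values).
TP, FP, TN, FN = 80, 10, 95, 15

TPR = TP / (TP + FN)                   # sensitivity / recall
TNR = TN / (TN + FP)                   # specificity
Acc = (TP + TN) / (TP + FP + TN + FN)  # accuracy
Err = 1 - Acc                          # error rate (complement of accuracy)
BA = (TPR + TNR) / 2                   # balanced accuracy

print(f"TPR={TPR:.3f} TNR={TNR:.3f} Acc={Acc:.3f} Err={Err:.3f} BA={BA:.3f}")
```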

Precision / Positive Predictive Value:

\(PPV = \frac{TP}{PP}\)

Negative Predictive Value:

\(NPV = \frac{TN}{PN}\)

\(F_1\)-Score:

\(F_1 = 2 \cdot \frac{PPV \cdot TPR}{PPV + TPR}\)

Other \(F_\beta\)-Scores:

\(F_\beta = \left(1 + \beta^2\right) \cdot \frac{PPV \cdot TPR}{\beta^2 \cdot PPV + TPR}\)

Matthews Correlation Coefficient:

\(MCC = \sqrt{TPR \cdot TNR \cdot PPV \cdot NPV} - \sqrt{\left(1 - TPR\right) \cdot \left(1 - TNR\right) \cdot \left(1 - PPV\right) \cdot \left(1 - NPV\right)}\)

Geometric Mean:

\(GM = \sqrt{TPR \cdot TNR}\)
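The prediction-based metrics admit the same treatment; the sketch below (again with illustrative counts) also uses the rate-based MCC form given in the table:

```python
import math

# Predictive values, F-scores, MCC, and geometric mean (illustrative counts).
TP, FP, TN, FN = 80, 10, 95, 15

TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
PPV = TP / (TP + FP)  # precision / positive predictive value
NPV = TN / (TN + FN)  # negative predictive value

F1 = 2 * PPV * TPR / (PPV + TPR)  # harmonic mean of precision and recall

def f_beta(beta):
    """F_beta score: beta > 1 weights recall higher, beta < 1 weights precision."""
    return (1 + beta**2) * PPV * TPR / (beta**2 * PPV + TPR)

# MCC in the rate-based form from the table
MCC = math.sqrt(TPR * TNR * PPV * NPV) - math.sqrt(
    (1 - TPR) * (1 - TNR) * (1 - PPV) * (1 - NPV))
GM = math.sqrt(TPR * TNR)  # geometric mean of sensitivity and specificity

print(f"PPV={PPV:.3f} NPV={NPV:.3f} F1={F1:.3f} "
      f"F2={f_beta(2):.3f} MCC={MCC:.3f} GM={GM:.3f}")
```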

Measures that evaluate not a single model (fixed threshold) but multiple threshold variations

Receiver Operating Characteristic (ROC) Curve,

i.e. plot of \(FPR = \frac{FP}{N} = 1 - TNR\) (on the \(x\) axis)

vs. \(TPR\) (on the \(y\) axis).

Precision-Recall Curve (PRC),

i.e. plot of recall / \(TPR\) (on the \(x\) axis)

vs. precision / \(PPV\) (on the \(y\) axis).

Area under the ROC Curve:

\(AUROC = \int_0^1 ROC\left(x\right)\,dx\)

as the integral over the function \(ROC\left(x\right)\)

described by the ROC Curve

Area under the PRC Curve:

\(AUPRC = \int_0^1 PRC\left(x\right)\,dx\)

as the integral over the function \(PRC\left(x\right)\)

described by the PRC Curve
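As a sketch of how these threshold-sweep measures can be computed (the labels and scores below are made up, not from the paper):

```python
import numpy as np

# ROC curve via a threshold sweep, AUROC via trapezoidal integration.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5, 0.9, 0.3])

n_pos = int((y_true == 1).sum())
n_neg = int((y_true == 0).sum())

fpr, tpr = [0.0], [0.0]                      # the curve starts at (0, 0)
for t in np.sort(np.unique(y_score))[::-1]:  # strictest threshold first
    pred = y_score >= t
    tpr.append((pred & (y_true == 1)).sum() / n_pos)  # TPR at threshold t
    fpr.append((pred & (y_true == 0)).sum() / n_neg)  # FPR at threshold t

auroc = np.trapz(tpr, fpr)  # integral of TPR over the FPR axis
print(f"AUROC = {auroc:.3f}")
```

The Precision-Recall Curve and AUPRC follow the same pattern with \(PPV\) plotted over recall; in practice, scikit-learn's roc_curve, roc_auc_score, and precision_recall_curve provide these computations directly.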

Measures for comparison of two predictions

(Cohen’s) Kappa:

\(\kappa = \frac{p_0 - p_c}{1 - p_c}\)

where \(p_0\) is the observed agreement between the two predictions

and \(p_c\) is the agreement expected from a random prediction

(Cohen’s) Weighted Kappa:

(Cohen’s) Kappa \(\kappa\)

with additional weights included,

e.g. according to risks or costs
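For completeness, a minimal sketch of the (unweighted) Kappa for two binary predictions (the arrays are illustrative, not from the paper):

```python
import numpy as np

# Cohen's kappa for two binary predictions (illustrative arrays).
a = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # first prediction
b = np.array([1, 0, 0, 1, 0, 1, 0, 1, 1, 1])  # second prediction

p0 = (a == b).mean()  # observed agreement p_0
# chance agreement p_c: both predict 1 by chance plus both predict 0 by chance
pc = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
kappa = (p0 - pc) / (1 - pc)
print(f"kappa = {kappa:.3f}")
```

The weighted variant replaces the 0/1 (dis)agreement with per-category-pair weights, e.g. risk- or cost-based ones; scikit-learn's cohen_kappa_score exposes linear and quadratic weighting via its weights parameter.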