Exploring the tradeoff between data privacy and utility with a clinical data analysis use case

Table 6 The features of the de-identified datasets

Dataset number	Re-identification risk (before)	Re-identification risk (after)	Re-identification risk reduction rate	ARX utility score	EMD	# of records retained for logistic regression	# of predictors retained for logistic regression	Dataset retention ratio
1	0.993	0.064	0.936	0.722	62.346	547	11	0.401
2	0.993	0.076	0.924	0.807	62.559	396	11	0.290
3	0.993	0.064	0.936	0.722	62.346	547	11	0.401
4	0.993	0.076	0.924	0.807	62.559	396	11	0.290
5	0.908	0.044	0.952	0.485	61.746	954	12	0.762
6	0.908	0.059	0.935	0.599	62.017	765	12	0.611
7	0.908	0.000	1.000	1.000	61.118	1119	7	0.522
8	0.908	0.000	1.000	1.000	61.118	1119	7	0.522
9	0.963	0.059	0.939	0.500	61.623	910	12	0.727
10	0.963	0.085	0.911	0.600	61.945	756	12	0.604
11	0.963	0.002	0.998	0.890	62.542	1155	9	0.692
12	0.963	0.002	0.998	0.846	62.737	1155	9	0.692
13	0.135	0.014	0.897	0.449	61.414	1113	13	0.964
14	0.135	0.002	0.986	0.654	61.521	1052	12	0.841
15	0.135	0.014	0.897	0.449	61.414	1113	13	0.964
16	0.135	0.014	0.897	0.449	61.414	1113	13	0.964
17	0.965	0.064	0.934	0.749	63.512	547	11	0.401
18	0.991	0.076	0.924	0.749	62.558	396	11	0.290
19	0.943	0.064	0.932	0.639	63.498	547	11	0.401

Note. The original dataset contains 1155 records and 13 predictors for logistic regression.
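
The two derived columns in Table 6 can be approximately reproduced from the other columns. The sketch below is an inference from the tabulated values, not a definition stated in this excerpt. It assumes that the risk reduction rate is (before - after) / before, and that the dataset retention ratio is the retained records times the retained predictors, divided by the original 1155 x 13 cells. The function and constant names are hypothetical. For a few rows (for example datasets 13 and 14) the recomputed reduction rate differs from the table in the third decimal place, which may reflect rounding of the displayed risk values.

```python
# Assumed formulas, inferred from the values in Table 6 (not stated in the source).
ORIGINAL_RECORDS = 1155     # from the table note
ORIGINAL_PREDICTORS = 13    # from the table note


def risk_reduction_rate(risk_before: float, risk_after: float) -> float:
    """Relative drop in re-identification risk after de-identification."""
    return (risk_before - risk_after) / risk_before


def retention_ratio(records: int, predictors: int) -> float:
    """Share of the original records-by-predictors cells that is retained."""
    return (records * predictors) / (ORIGINAL_RECORDS * ORIGINAL_PREDICTORS)


# A few rows copied from Table 6:
# dataset -> (risk before, risk after, records retained, predictors retained)
rows = {
    1: (0.993, 0.064, 547, 11),
    5: (0.908, 0.044, 954, 12),
    7: (0.908, 0.000, 1119, 7),
}

for dataset, (before, after, records, predictors) in rows.items():
    print(
        dataset,
        round(risk_reduction_rate(before, after), 3),
        round(retention_ratio(records, predictors), 3),
    )
```

Under these assumptions, the recomputed values for datasets 1, 5, and 7 agree with the tabulated reduction rates and retention ratios to three decimal places.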

ISSN: 1472-6947