Table 6 The features of the de-identified datasets

From: Exploring the tradeoff between data privacy and utility with a clinical data analysis use case

| Dataset number | Re-identification risk (before) | Re-identification risk (after) | Re-identification risk reduction rate | ARX utility score | EMD | # of records retained for logistic regression | # of predictors retained for logistic regression | Dataset retention ratio |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.993 | 0.064 | 0.936 | 0.722 | 62.346 | 547 | 11 | 0.401 |
| 2 | 0.993 | 0.076 | 0.924 | 0.807 | 62.559 | 396 | 11 | 0.290 |
| 3 | 0.993 | 0.064 | 0.936 | 0.722 | 62.346 | 547 | 11 | 0.401 |
| 4 | 0.993 | 0.076 | 0.924 | 0.807 | 62.559 | 396 | 11 | 0.290 |
| 5 | 0.908 | 0.044 | 0.952 | 0.485 | 61.746 | 954 | 12 | 0.762 |
| 6 | 0.908 | 0.059 | 0.935 | 0.599 | 62.017 | 765 | 12 | 0.611 |
| 7 | 0.908 | 0.000 | 1.000 | 1.000 | 61.118 | 1119 | 7 | 0.522 |
| 8 | 0.908 | 0.000 | 1.000 | 1.000 | 61.118 | 1119 | 7 | 0.522 |
| 9 | 0.963 | 0.059 | 0.939 | 0.500 | 61.623 | 910 | 12 | 0.727 |
| 10 | 0.963 | 0.085 | 0.911 | 0.600 | 61.945 | 756 | 12 | 0.604 |
| 11 | 0.963 | 0.002 | 0.998 | 0.890 | 62.542 | 1155 | 9 | 0.692 |
| 12 | 0.963 | 0.002 | 0.998 | 0.846 | 62.737 | 1155 | 9 | 0.692 |
| 13 | 0.135 | 0.014 | 0.897 | 0.449 | 61.414 | 1113 | 13 | 0.964 |
| 14 | 0.135 | 0.002 | 0.986 | 0.654 | 61.521 | 1052 | 12 | 0.841 |
| 15 | 0.135 | 0.014 | 0.897 | 0.449 | 61.414 | 1113 | 13 | 0.964 |
| 16 | 0.135 | 0.014 | 0.897 | 0.449 | 61.414 | 1113 | 13 | 0.964 |
| 17 | 0.965 | 0.064 | 0.934 | 0.749 | 63.512 | 547 | 11 | 0.401 |
| 18 | 0.991 | 0.076 | 0.924 | 0.749 | 62.558 | 396 | 11 | 0.290 |
| 19 | 0.943 | 0.064 | 0.932 | 0.639 | 63.498 | 547 | 11 | 0.401 |

Note. The original dataset contains 1155 records and 13 predictors for logistic regression.
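
The table does not state how the two derived columns are computed. As a rough sketch, the reported values appear consistent with the following assumed formulas: the dataset retention ratio as retained cells (records × predictors) divided by the original 1155 × 13 cells, and the risk reduction rate as the relative drop from the before-risk to the after-risk. Both formulas and the function names below are assumptions inferred from the numbers, not definitions taken from the article; the reduction rate matches the table only to within about 0.001, presumably because the published risks are already rounded.

```python
# Totals taken from the table note.
ORIGINAL_RECORDS = 1155
ORIGINAL_PREDICTORS = 13


def retention_ratio(records: int, predictors: int) -> float:
    """Assumed formula: retained cells divided by original cells."""
    return (records * predictors) / (ORIGINAL_RECORDS * ORIGINAL_PREDICTORS)


def risk_reduction_rate(before: float, after: float) -> float:
    """Assumed formula: relative drop in re-identification risk."""
    return (before - after) / before


# (dataset number, risk before, risk after, records, predictors) copied from the table.
rows = [
    (1, 0.993, 0.064, 547, 11),
    (5, 0.908, 0.044, 954, 12),
    (13, 0.135, 0.014, 1113, 13),
]

for number, before, after, records, predictors in rows:
    print(
        f"dataset {number}: "
        f"reduction rate ~ {risk_reduction_rate(before, after):.3f}, "
        f"retention ratio ~ {retention_ratio(records, predictors):.3f}"
    )
```

Comparing the printed values against the corresponding table rows is a quick way to check whether the assumed formulas hold for the other datasets as well.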