Fig. 1

Left: Schematic overview of our setup. Multiple parties create synthetic data replicas of their local data under privacy guarantees and make them publicly available. Any single party can then use the published synthetic data when performing a data analysis task (case A, blue) to improve results over using only their local data (case B, orange). The original data never crosses the (orange) privacy barriers. Right: Predictive log-likelihoods of the learned model (blue) are significantly improved over using only locally available data (orange) for most parties (centers), and uncertainty is also reduced. The dashed black line shows the log-likelihood for an impractical ideal setting in which the analysis could be performed on the combined data of all parties. Log-likelihood is evaluated on a held-out test set drawn from the whole population and normalised by dividing by the size of the test set. The box plots show the distributions of log-likelihood for parameters sampled from the distributions implied by the maximum-likelihood solution and the error estimates obtained from the analysis task, over 10 repeats of the experiment. Boxes extend from the \(25\%\) to the \(75\%\) quantile of the obtained log-likelihood samples, with the median marked inside the box. Whiskers extend to the furthest sample point within 1.5 times the inter-quartile range. The higher mean log-likelihoods of the combined setting over local-only are statistically highly significant (\(p < 0.001\), \(n_{\text{local only}} = 1\,000\), \(n_{\text{combined}} = 100\,000\)) for all centers except Nottingham, Croydon and Leeds. The local-data log-likelihood of the outlier center Barts is cut off for improved readability (median: \(-3.65\)). The full figure can be found in Fig. S3.
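
For concreteness, the normalisation described above can be written as a per-test-point average; the symbols \(D_{\text{test}}\) (the held-out test set) and \(\theta_s\) (a single sampled parameter vector) are introduced here purely for illustration and do not appear in the original caption:

\[
\bar{\ell}(\theta_s) \;=\; \frac{1}{|D_{\text{test}}|} \sum_{x \in D_{\text{test}}} \log p\left(x \mid \theta_s\right),
\]

so each sample contributing to a box plot is the mean predictive log-likelihood over the test set under one draw \(\theta_s\) from the distribution implied by the maximum-likelihood solution and its error estimates.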