So, about this "truly anonymous" synthetic data… 😬

I've been telling anyone who'd listen that ad hoc approaches to anonymization should be assumed broken until proven otherwise, that empirical privacy metrics are unreliable risk indicators, and that the synthetic data generation space is full of vendors making outlandish claims that aren't backed by anything solid 🤔

This new paper by Georgi Ganev and Emiliano De Cristofaro shows that, if anything, the situation is even worse than I thought. They develop an attack against synthetic data that successfully reconstructs a large portion of outlier data points — exactly the records synthetic data generation is supposed to protect. In an ironic twist, the empirical metrics themselves are a key part of the vulnerability: knowing that a given synthetic dataset passes an empirical "privacy test" reveals some indirect information about the original dataset, which the attacker can then exploit to break the privacy of real data points 🫢

I recommend that anyone interested in synthetic data generation take a look at the paper — even if you skip the technical details, don't miss the excellent explanations of the fundamental limitations of empirical privacy metrics (Sections 4 and 6.1) and the lessons and takeaways that should be drawn from this work (Section 7) 💡

Link to the paper ➡️ https://lnkd.in/em5cTz8N
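To make the leakage concrete, here's a toy sketch of the general idea (my own illustration, not the paper's actual attack; the dataset, test, and thresholds are all made up): a pass/fail "privacy metric" computed against the hidden original data acts as an oracle, and an attacker who can observe its answers on crafted candidate datasets can triangulate a sensitive outlier.

```python
# Toy illustration (NOT the paper's attack): a "distance to closest
# record" style pass/fail privacy test leaks the location of an outlier.

SECRET_OUTLIER = 73.0   # hypothetical sensitive value in the original data
THRESHOLD = 1.0         # test fails if any synthetic point is this close

def privacy_test(synthetic_points):
    """Pass/fail empirical metric computed against the hidden original data."""
    return all(abs(p - SECRET_OUTLIER) >= THRESHOLD for p in synthetic_points)

def reconstruct(lo=0.0, hi=100.0, probes=200):
    """Probe the test with candidate synthetic points; each *failure*
    signals that an original record lies nearby. Average the failures."""
    grid = (lo + i * (hi - lo) / probes for i in range(probes + 1))
    hits = [x for x in grid if not privacy_test([x])]
    return sum(hits) / len(hits) if hits else None

estimate = reconstruct()  # recovers roughly 73.0 from pass/fail answers alone
```

Even this crude version recovers the outlier to within the probe spacing; the point is that every pass/fail answer is a query against the real data, whether or not anyone calls it one.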
Truly synthetic data cannot be traced on a 1-to-1 basis back to an individual. It must be generated in such a way that the only possible use case is high-level analytics and/or modeling.
Thank you! This is why robust re-identification risk analysis is so important in synthetic data as well, not just in anonymization. Nothing is perfect, but there are so many things we can improve upon with the right frameworks. Replica Analytics is a synthetic data company in healthcare that does re-identification risk analysis of the output, with Khaled El Emam being a research pioneer in this space.
Please note that the authors were not able to demonstrate their "attack" in practice. Why? Because their key assumption is neither "minimal" nor "realistic"; it's somewhere between "bizarre" and "outrageous". They assume that a system would allow hundreds of tests to be run against the original data to check for privacy! This is as realistic as assuming that hospitals allow taking hundreds of X-rays of a patient to check for health 🤦♂️ Yes, taking hundreds of X-rays harms the patient, just as running hundreds of tests leaks information. But that is self-evident. Thus no one in their right mind has ever allowed, or would ever allow, tests to be run more than once. E.g. MOSTLY AI doesn't even store the original data, so it's simply impossible! And I'm sure that Gretel, Tonic, Hazy, Aindo, Replica, Statice, YData, Synthesized,.. all don't allow that either. But I leave it to them to comment / confirm here as well. fyi Yves-Alexandre de Montjoye Malte Beyer-Katzenberger Thomas Reutterer Katharina Koerner Harry Keen Fabiana Clemente Ali Golshan Alexander Watson Kalyan Veeramachaneni Nicolai Baldin, PhD Alexandra Ebert Tobias Hann
I would wish for some kind of estimate of how well anonymization is achieved. (This paper is a yes/no game.) That would help patients who are willing to give away their health data weigh the benefit for their therapy (having better data) against the risk of their privacy being breached. Unless you serve both sides together as a bundle, patients affected by severe health issues and privacy, you do a bad job; that is my opinion. Again, I miss the term "privacy filter". From my point of view, there is a much greater risk of privacy breaking down if you copy from a non-private database (which applies privacy filters) to an anonymized database (which intentionally does not filter), because the host of the non-private database can then simply look into your anonymized database and pick out what he already knows.
Damien Desfontaines we are in *passionate* agreement with you at Equideum Health. I'd very much like to discuss our approach, which goes in a very different direction from what I'll call the 'synthetic data movement,' particularly given the EU AI Act's passage. May we please arrange a virtual meeting, minimally to include our CTO, William Gleim; our Chief Ethics and Compliance Officer, Wendy Charles, PhD; and our Chief Scientific Officer, Sean Manion PhD? We do believe there are cases where synthetic data — depending on its origin — has some utility. But this is enormously exaggerated at present. We do not agree with the thesis that advancements in the field of synthetic data are the correct path forward for our industry. We'd like to receive your feedback on our alternative mechanics for addressing the challenges the synthetic data movement is seeking to address.
Interesting theoretical attack, but I agree with others that this doesn't look like a practical threat. I've been telling anyone who'd listen that theoretical attacks on anonymization should be assumed ineffective until proven otherwise...
In its current state, DP synthetic data loses all privacy guarantees when released into the wild. If you plan on releasing your data into the wild, you should be looking at other technologies to solve your problems. -- In a vacuum, DP synthetic data is magical and amazing, and every major tech company should be taking advantage of it.
Couldn't agree more, Michael Platzer. Happy to share a synthetic dataset and challenge the team to identify the real individuals from there. In case you want to learn more about this tech, check this out: https://ydata.ai/products/synthetic_data
I've been flagging the risks of synthetic data as a means to dodge personal data protection regulation for quite a while now (the first time at CPDP 2017 or 2018, I think; people looked at me weirdly and angrily back then, like I had just taken away their brand-new toy)… thanks for sharing this work!
Anonymize your data
This is super valuable research. Thank you for sharing! The paper's conclusions around compliance and empirical privacy metrics writ large are way overstated relative to the empirical findings of the paper -- which, again, are significant. Let's look at the first two weaknesses listed for empirical privacy metrics:

1. "No theoretical guarantees" - so the main weakness is that empirical metrics are... empirical. This is a great example of starting with the conclusion -- in this case, "differential privacy is the only legitimate approach to privacy" -- and then working backward to supporting arguments. The paper proves that the empirical tests commonly used are not robust to this attack; that's woefully insufficient proof that theoretical guarantees are required.

2. "Privacy as a binary" - anonymization, as a legal concept, is a binary. This "weakness" represents the typical PET-centric mistake: mistaking technical evaluations for legal ones. The threshold question for anonymization is a legal question. Again, the paper provides a good argument that synthetic data doesn't satisfy the binary boundary condition. It's drastically overstated to deny that such a boundary condition exists at all.