Damien Desfontaines’ Post

Damien Desfontaines

Staff Scientist, differential privacy

So, about this "truly anonymous" synthetic data… 😬 I've been telling anyone who'd listen that ad hoc approaches to anonymization should be assumed broken until proven otherwise, that empirical privacy metrics are unreliable risk indicators, and that the synthetic data generation space is full of vendors making outlandish claims that aren't backed by anything solid 🤔

This new paper by Georgi Ganev and Emiliano De Cristofaro shows that, if anything, the situation is even worse than I thought. They develop an attack against synthetic data that successfully reconstructs a large portion of outlier data points — exactly what synthetic data generation is supposed to prevent. In an ironic twist, the empirical metrics themselves are a key part of the vulnerability: knowing that a given synthetic dataset passes an empirical "privacy test" gives some indirect information about the original dataset, which the attacker can then exploit to break the privacy of real data points 🫢

I recommend that anyone interested in synthetic data generation take a look at the paper — even if you skip the technical details, don't miss the excellent explanations of the fundamental limitations of empirical privacy metrics (Sections 4 and 6.1) and the lessons and takeaways that should be drawn from this work (Section 7) 💡

Link to the paper ➡️ https://lnkd.in/em5cTz8N

  • A screenshot of Figure 2 of the linked paper, showing an overview of the reconstruction attack.
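To build intuition for the leak described above, here is a minimal toy sketch — my own illustration, not the paper's attack. The generator, the DCR-style test, and every threshold in it are made-up stand-ins. The point it shows: an attacker who knows a release passed an empirical "privacy test" gains Bayesian evidence about the original data, because the pass rate of the test depends on what the original data contained.

```python
# Toy sketch: a passed "privacy test" is itself evidence about the
# original data. Everything here (generator, DCR-style test,
# thresholds) is a hypothetical stand-in, not the paper's construction.
import random

random.seed(0)

def synthesize(original, n):
    # Stand-in "generator": resample originals with Gaussian noise.
    return [x + random.gauss(0, 0.5) for x in random.choices(original, k=n)]

def passes_dcr_test(synthetic, original, threshold=0.1):
    # Distance-to-closest-record style check: fail the release if any
    # synthetic point lands too close to a real record.
    return all(min(abs(s - o) for o in original) > threshold for s in synthetic)

# Two hypotheses the attacker wants to distinguish: the original data
# either contains an outlier (10.0) or it does not.
world_with_outlier = [1.0, 1.2, 0.9, 10.0]
world_without_outlier = [1.0, 1.2, 0.9, 1.1]

def pass_rate(original, trials=5000):
    hits = sum(passes_dcr_test(synthesize(original, 4), original)
               for _ in range(trials))
    return hits / trials

# The two worlds pass the test at different rates, so "this dataset
# passed" shifts the attacker's posterior about the original data.
print("pass rate with outlier:   ", pass_rate(world_with_outlier))
print("pass rate without outlier:", pass_rate(world_without_outlier))
```

The paper's actual attack exploits this conditioning far more aggressively, but even this toy shows why "the release passed the metric" is not a privacy-neutral fact.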
Ben Winokur

Anonymize your data


This is super valuable research. Thank you for sharing! The paper's conclusions about compliance and empirical privacy metrics writ large are overstated relative to its empirical findings -- which, again, are significant. Let's look at the first two weaknesses listed for empirical privacy metrics:

1. "No theoretical guarantees" - so the main weakness is that empirical metrics are... empirical. This is a great example of starting with the conclusion -- in this case, "differential privacy is the only legitimate approach to privacy" -- and working backward to supporting arguments. The paper proves that the commonly used empirical tests are not robust to this attack; that is woefully insufficient proof that theoretical guarantees are required.

2. "Privacy as a binary" - anonymization, as a legal concept, is a binary. This "weakness" represents the typical PET-centric mistake: mistaking technical evaluations for legal ones. The threshold question for anonymization is a legal question. Again, the paper makes a good argument that synthetic data doesn't satisfy the binary boundary condition; it is drastically overstated to deny that such a boundary condition exists at all.

Douglas Ganim, MBA, FIP, CIPP, CIPM

Sr. Manager, Privacy Operations at Epsilon


Truly synthetic data cannot be traced on a 1-to-1 basis back to an individual. It must be generated in such a way that the only possible use case is high-level analytics and/or modeling.

Patricia Thaine

Co-Founder and CEO @ Private AI | Building the Privacy Layer for Data | World Economic Forum Technology Pioneer 2023


Thank you! This is why robust re-identification risk analysis is so important for synthetic data as well, not just for anonymization. Nothing is perfect, but there are so many things we can improve upon with the right frameworks. Replica Analytics is a synthetic data company in healthcare that does re-identification risk analysis of the output, with Khaled El Emam being a research pioneer in this space.

Michael Platzer

Co-Founder & CTO @ MOSTLY AI


Please note that the authors were not able to demonstrate their "attack" in practice. Why? Because their key assumption is neither "minimal" nor "realistic"; it's somewhere between "bizarre" and "outrageous". They assume that a system would allow running hundreds of tests against the original data to check for privacy! This is as realistic as assuming that hospitals allow taking hundreds of X-rays of a patient to check for health 🤦♂️ Yes, taking hundreds of X-rays harms the patient, just as running hundreds of tests leaks information. But that is self-evident. Thus no one in their right mind has ever allowed, or would ever allow, running such tests more than once. E.g., MOSTLY AI doesn't even store the original data, so it's simply impossible! And I'm sure that Gretel, Tonic, Hazy, Aindo, Replica, Statice, YData, Synthesized,.. all don't allow that either. But I leave it to them to comment / confirm here as well. fyi Yves-Alexandre de Montjoye Malte Beyer-Katzenberger Thomas Reutterer Katharina Koerner Harry Keen Fabiana Clemente Ali Golshan Alexander Watson Kalyan Veeramachaneni Nicolai Baldin, PhD Alexandra Ebert Tobias Hann
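As a neutral illustration of the assumption being disputed here, a toy simulation shows why the number of allowed test runs matters so much. This is my own sketch, not the paper's method and not any vendor's API; noisy_test and its flip probability are hypothetical. If each test run leaks even a weak, noisy signal about a target record, majority-voting over many runs recovers it almost surely.

```python
# Toy sketch of why repeated "privacy test" runs compound leakage:
# model each run as a noisy binary channel about a secret bit
# (hypothetical; no real system or vendor API is being modeled).
import random

random.seed(1)

def noisy_test(secret_bit, flip_prob=0.45):
    # One test run: a weak signal, wrong 45% of the time.
    return secret_bit if random.random() > flip_prob else 1 - secret_bit

secret = 1  # e.g., "the target record is in the original data"
for runs in (1, 10, 100, 1000):
    votes = sum(noisy_test(secret) for _ in range(runs))
    guess = int(votes > runs / 2)
    print(f"{runs:>4} runs -> majority guess: {guess}")
```

Whether real deployments actually permit that many runs is exactly the point under debate in this thread.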

I wish there were some kind of estimate of how well anonymization is achieved (this paper treats it as a yes/no game). That would help when telling patients who are willing to give away their health data, so they can weigh the benefit for their therapy (having better data) against the risk of their privacy being breached. Unless you serve both sides together as a bundle -- patients affected by severe health issues, and privacy -- you do a bad job, in my opinion. Again, I miss the term "privacy filter". From my point of view, there is a much greater risk of privacy breaking down if you copy from a non-private database (applying privacy filters) to an anonymized database (intentionally not filtering), because the host of the non-private database can then simply look into your anonymized database and pick out what he knows.

Heather Leigh Flannery

CEO, AI MINDSystems Foundation; Healthcare & Life Sciences Chair, Government Blockchain Association; Washington, DC Chapter Chair, AI 2030; Applied Futurist; Impact Innovator in Web3, AI, PETs, Standards, & New PPPs


Damien Desfontaines we are in *passionate* agreement with you at Equideum Health. I’d very much like to discuss our approach, which goes in a very different direction than what I’ll call the ‘synthetic data movement,’ particularly given the EU AI Act’s passage. May we please arrange a virtual meeting, minimally to include our CTO, William Gleim; our Chief Ethics and Compliance Officer, Wendy Charles, PhD; and our Chief Scientific Officer, Sean Manion, PhD? We do believe there are cases where synthetic data, depending on its origin, has some utility. But this is enormously exaggerated at present. We do not agree with the thesis that advancements in the field of synthetic data are the correct path forward for our industry. We’d like to receive your feedback on our alternative mechanics for addressing the challenges the synthetic data movement is seeking to address.

Paul Francis

Director Emeritus at MPI-SWS, Founding Member Open Diffix Project


Interesting theoretical attack, but I agree with others that this doesn't look like any practical threat. I've been telling anyone who'd listen that theoretical attacks on anonymization should be assumed ineffective until proven otherwise...

In its current state, DP synthetic data loses all privacy guarantees when released into the wild. If you plan on releasing your data into the wild, you should be looking at other technologies to solve your problems. -- In a vacuum, DP synthetic data is magical and amazing, and every major tech company should be taking advantage of it.

Gonçalo (G) Martins Ribeiro

CEO @YData | Data quality for AI, Synthetic Data, Responsible AI, Data-centric AI


Couldn't agree more, Michael Platzer. Happy to share a synthetic dataset and challenge the team to identify the real individuals from there. In case you want to learn more about this tech, check this out: https://ydata.ai/products/synthetic_data

Tjerk Timan

Digitisation in Industry / Trustworthy and fair AI / Data Spaces / Technology Impact Assessment / ELSA / EU Policy evaluation / EU project management / Ph.D. in Science- and Technology Studies.


I’ve been flagging the risks of synthetic data as a means to dodge personal data protection regulation for quite a while now (the first time at CPDP 2017 or 2018, I think; people looked at me weirdly and angrily back then, like I had just taken away their brand-new toy)… Thanks for sharing this work!
