So, about this "truly anonymous" synthetic data… 😬

I've been telling anyone who'd listen that ad hoc approaches to anonymization should be assumed broken until proven otherwise, that empirical privacy metrics are unreliable risk indicators, and that the synthetic data generation space is full of vendors making outlandish claims that aren't backed by anything solid 🤔

This new paper by Georgi Ganev and Emiliano De Cristofaro shows that, if anything, the situation is even worse than I thought. They develop an attack against synthetic data that successfully reconstructs a large portion of outlier data points — exactly the records synthetic data generation is supposed to protect. In an ironic twist, the empirical metrics themselves are a key part of the vulnerability: knowing that a given synthetic dataset passes an empirical "privacy test" reveals some indirect information about the original dataset, which the attacker can then exploit to break the privacy of real data points 🫢

I recommend that anyone interested in synthetic data generation take a look at the paper — even if you skip the technical details, don't miss the excellent explanations of the fundamental limitations of empirical privacy metrics (Sections 4 and 6.1) and the lessons and takeaways that should be drawn from this work (Section 7) 💡

Link to the paper ➡️ https://lnkd.in/em5cTz8N
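To make the leakage concrete, here's a toy sketch of the general idea (my own illustration, not the paper's actual attack; the dataset, test, and thresholds are all made up): a pass/fail "privacy metric" computed against the hidden original data acts as an oracle, and an attacker who can observe its answers on crafted candidate datasets can triangulate a sensitive outlier.

```python
# Toy illustration (NOT the paper's attack): a "distance to closest
# record" style pass/fail privacy test leaks the location of an outlier.

SECRET_OUTLIER = 73.0   # hypothetical sensitive value in the original data
THRESHOLD = 1.0         # test fails if any synthetic point is this close

def privacy_test(synthetic_points):
    """Pass/fail empirical metric computed against the hidden original data."""
    return all(abs(p - SECRET_OUTLIER) >= THRESHOLD for p in synthetic_points)

def reconstruct(lo=0.0, hi=100.0, probes=200):
    """Probe the test with candidate synthetic points; each *failure*
    signals that an original record lies nearby. Average the failures."""
    grid = (lo + i * (hi - lo) / probes for i in range(probes + 1))
    hits = [x for x in grid if not privacy_test([x])]
    return sum(hits) / len(hits) if hits else None

estimate = reconstruct()  # recovers roughly 73.0 from pass/fail answers alone
```

Even this crude version recovers the outlier to within the probe spacing; the point is that every pass/fail answer is a query against the real data, whether or not anyone calls it one.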
Truly synthetic data cannot be traced on a 1-to-1 basis back to an individual. It must be generated in such a way that the only possible use case is high-level analytics and/or modeling.
Thank you! This is why robust re-identification risk analysis is so important in synthetic data as well, not just in anonymization. Nothing is perfect, but there are so many things we can improve upon with the right frameworks. Replica Analytics is a synthetic data company in healthcare that does re-identification risk analysis of the output, with Khaled El Emam being a research pioneer in this space.
Please note that the authors were not able to demonstrate their "attack" in practice. Why? Because their key assumption is neither "minimal" nor "realistic"; it's somewhere between "bizarre" and "outrageous". They assume that a system would allow hundreds of tests to be run against the original data to check for privacy! This is as realistic as assuming that hospitals allow taking hundreds of X-rays of a patient to check for health 🤦♂️ Yes, taking hundreds of X-rays harms the patient, just as running hundreds of tests leaks information. But that is self-evident. Thus no one in their right mind has ever allowed, or would ever allow, tests to be run more than once. E.g. MOSTLY AI doesn't even store the original data, so it's simply impossible! And I'm sure that Gretel, Tonic, Hazy, Aindo, Replica, Statice, YData, Synthesized,.. all don't allow that either. But I leave it to them to comment / confirm here as well. fyi Yves-Alexandre de Montjoye Malte Beyer-Katzenberger Thomas Reutterer Katharina Koerner Harry Keen Fabiana Clemente Ali Golshan Alexander Watson Kalyan Veeramachaneni Nicolai Baldin, PhD Alexandra Ebert Tobias Hann
I would wish for some kind of estimate of how well anonymization is achieved. (This paper is a yes/no game.) That would help patients who are willing to give away their health data weigh the benefit for their therapy (having better data) against the risk of their privacy being breached. Unless you serve both sides together as a bundle, patients affected by severe health issues and privacy, you do a bad job; that is my opinion. Again, I miss the term "privacy filter". From my point of view, there is a much greater risk of privacy breaking down if you copy from a non-private database (which applies privacy filters) to an anonymized database (which intentionally does not filter), because the host of the non-private database can then simply look into your anonymized database and pick out what he already knows.
Damien Desfontaines we are in *passionate* agreement with you at Equideum Health. I'd very much like to discuss our approach, which goes in a very different direction from what I'll call the 'synthetic data movement,' particularly given the EU AI Act's passage. May we please arrange a virtual meeting, minimally to include our CTO, William Gleim; our Chief Ethics and Compliance Officer, Wendy Charles, PhD; and our Chief Scientific Officer, Sean Manion PhD? We do believe there are cases where synthetic data — depending on its origin — has some utility. But this is enormously exaggerated at present. We do not agree with the thesis that advancements in the field of synthetic data are the correct path forward for our industry. We'd like to receive your feedback on our alternative mechanics for addressing the challenges the synthetic data movement is seeking to address.
Interesting theoretical attack, but I agree with others that this doesn't look like a practical threat. I've been telling anyone who'd listen that theoretical attacks on anonymization should be assumed ineffective until proven otherwise...
In its current state, DP synthetic data loses all privacy guarantees when released into the wild. If you plan on releasing your data into the wild, you should be looking at other technologies to solve your problems. -- In a vacuum, DP synthetic data is magical and amazing, and every major tech company should be taking advantage of it.
Couldn't agree more, Michael Platzer. Happy to share a synthetic dataset and challenge the team to identify the real individuals from there. In case you want to learn more about this tech, check this out: https://ydata.ai/products/synthetic_data
I've been flagging the risks of synthetic data as a means to dodge personal data protection regulation for quite a while now (the first time at CPDP 2017 or 2018, I think; people looked at me weirdly and angrily back then, like I had just taken away their brand-new toy)… thanks for sharing this work!
Anonymize your data
This is super valuable research. Thank you for sharing! The paper's conclusions around compliance and empirical privacy metrics writ large are way overstated relative to the empirical findings of the paper -- which, again, are significant. Let's look at the first two weaknesses listed for empirical privacy metrics:

1. "No theoretical guarantees" - so the main weakness is that empirical metrics are... empirical. This is a great example of starting with the conclusion -- in this case, "differential privacy is the only legitimate approach to privacy" -- and then working backward to supporting arguments. The paper proves that the empirical tests commonly used are not robust to this attack; that's woefully insufficient proof that theoretical guarantees are required.

2. "Privacy as a binary" - anonymization, as a legal concept, is a binary. This "weakness" represents the typical PET-centric mistake: mistaking technical evaluations for legal ones. The threshold question for anonymization is a legal question. Again, the paper provides a good argument that synthetic data doesn't satisfy the binary boundary condition. It's drastically overstated to deny that such a boundary condition exists at all.