subscribe to arXiv mailings

Towards Understanding the Interplay of Generative Artificial Intelligence and the Internet

Authors: Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, Rik Sarkar

Abstract: The rapid adoption of generative Artificial Intelligence (AI) tools that can generate realistic images or text, such as DALL-E, MidJourney, or ChatGPT, have put the societal impacts of these technologies at the center of public debate. These tools are possible due to the massive amount of data (text and images) that is publicly available through the Internet. At the same time, these generative AI… ▽ More The rapid adoption of generative Artificial Intelligence (AI) tools that can generate realistic images or text, such as DALL-E, MidJourney, or ChatGPT, have put the societal impacts of these technologies at the center of public debate. These tools are possible due to the massive amount of data (text and images) that is publicly available through the Internet. At the same time, these generative AI tools become content creators that are already contributing to the data that is available to train future models. Therefore, future versions of generative AI tools will be trained with a mix of human-created and AI-generated content, causing a potential feedback loop between generative AI and public data repositories. This interaction raises many questions: how will future versions of generative AI tools behave when trained on a mixture of real and AI generated data? Will they evolve and improve with the new data sets or on the contrary will they degrade? Will evolution introduce biases or reduce diversity in subsequent generations of generative AI tools? What are the societal implications of the possible degradation of these models? Can we mitigate the effects of this feedback loop? In this document, we explore the effect of this interaction and report some initial results using simple diffusion models trained with various image datasets. Our results show that the quality and diversity of the generated images can degrade over time suggesting that incorporating AI-created data can have undesired effects on future versions of generative models. △ Less

Submitted 8 June, 2023; originally announced June 2023.

arXiv:2303.01255 [pdf, other]

Combining Generative Artificial Intelligence (AI) and the Internet: Heading towards Evolution or Degradation?

Authors: Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, Rik Sarkar

Abstract: In the span of a few months, generative Artificial Intelligence (AI) tools that can generate realistic images or text have taken the Internet by storm, making them one of the technologies with fastest adoption ever. Some of these generative AI tools such as DALL-E, MidJourney, or ChatGPT have gained wide public notoriety. Interestingly, these tools are possible because of the massive amount of dat… ▽ More In the span of a few months, generative Artificial Intelligence (AI) tools that can generate realistic images or text have taken the Internet by storm, making them one of the technologies with fastest adoption ever. Some of these generative AI tools such as DALL-E, MidJourney, or ChatGPT have gained wide public notoriety. Interestingly, these tools are possible because of the massive amount of data (text and images) available on the Internet. The tools are trained on massive data sets that are scraped from Internet sites. And now, these generative AI tools are creating massive amounts of new data that are being fed into the Internet. Therefore, future versions of generative AI tools will be trained with Internet data that is a mix of original and AI-generated data. As time goes on, a mixture of original data and data generated by different versions of AI tools will populate the Internet. This raises a few intriguing questions: how will future versions of generative AI tools behave when trained on a mixture of real and AI generated data? Will they evolve with the new data sets or degenerate? Will evolution introduce biases in subsequent generations of generative AI tools? In this document, we explore these questions and report some very initial simulation results using a simple image-generation AI tool. These results suggest that the quality of the generated images degrades as more AI-generated data is used for training thus suggesting that generative AI may degenerate. Although these results are preliminary and cannot be generalised without further study, they serve to illustrate the potential issues of the interaction between generative AI and the Internet. △ Less

Submitted 17 February, 2023; originally announced March 2023.

Comments: First version

arXiv:2209.03620 [pdf, other]

Black-Box Audits for Group Distribution Shifts

Authors: Marc Juarez, Samuel Yeom, Matt Fredrikson

Abstract: When a model informs decisions about people, distribution shifts can create undue disparities. However, it is hard for external entities to check for distribution shift, as the model and its training set are often proprietary. In this paper, we introduce and study a black-box auditing method to detect cases of distribution shift that lead to a performance disparity of the model across demographic… ▽ More When a model informs decisions about people, distribution shifts can create undue disparities. However, it is hard for external entities to check for distribution shift, as the model and its training set are often proprietary. In this paper, we introduce and study a black-box auditing method to detect cases of distribution shift that lead to a performance disparity of the model across demographic groups. By extending techniques used in membership and property inference attacks -- which are designed to expose private information from learned models -- we demonstrate that an external auditor can gain the information needed to identify these distribution shifts solely by querying the model. Our experimental results on real-world datasets show that this approach is effective, achieving 80--100% AUC-ROC in detecting shifts involving the underrepresentation of a demographic group in the training set. Researchers and investigative journalists can use our tools to perform non-collaborative audits of proprietary models and expose cases of underrepresentation in the training datasets. △ Less

Submitted 8 September, 2022; originally announced September 2022.

arXiv:2206.12183 [pdf, other]

"You Can't Fix What You Can't Measure": Privately Measuring Demographic Performance Disparities in Federated Learning

Authors: Marc Juarez, Aleksandra Korolova

Abstract: As in traditional machine learning models, models trained with federated learning may exhibit disparate performance across demographic groups. Model holders must identify these disparities to mitigate undue harm to the groups. However, measuring a model's performance in a group requires access to information about group membership which, for privacy reasons, often has limited availability. We prop… ▽ More As in traditional machine learning models, models trained with federated learning may exhibit disparate performance across demographic groups. Model holders must identify these disparities to mitigate undue harm to the groups. However, measuring a model's performance in a group requires access to information about group membership which, for privacy reasons, often has limited availability. We propose novel locally differentially private mechanisms to measure differences in performance across groups while protecting the privacy of group membership. To analyze the effectiveness of the mechanisms, we bound their error in estimating a disparity when optimized for a given privacy budget. Our results show that the error rapidly decreases for realistic numbers of participating clients, demonstrating that, contrary to what prior work suggested, protecting privacy is not necessarily in conflict with identifying performance disparities of federated models. △ Less

Submitted 11 January, 2023; v1 submitted 24 June, 2022; originally announced June 2022.

Comments: 14 pages, 6 figures

arXiv:2202.09727 [pdf, other]

Online Platforms and the Fair Exposure Problem Under Homophily

Authors: Jakob Schoeffer, Alexander Ritchie, Keziah Naggita, Faidra Monachou, Jessie Finocchiaro, Marc Juarez

Abstract: In the wake of increasing political extremism, online platforms have been criticized for contributing to polarization. One line of criticism has focused on echo chambers and the recommended content served to users by these platforms. In this work, we introduce the fair exposure problem: given limited intervention power of the platform, the goal is to enforce balance in the spread of content (e.g.,… ▽ More In the wake of increasing political extremism, online platforms have been criticized for contributing to polarization. One line of criticism has focused on echo chambers and the recommended content served to users by these platforms. In this work, we introduce the fair exposure problem: given limited intervention power of the platform, the goal is to enforce balance in the spread of content (e.g., news articles) among two groups of users through constraints similar to those imposed by the Fairness Doctrine in the United States in the past. Groups are characterized by different affiliations (e.g., political views) and have different preferences for content. We develop a stylized framework that models intra- and intergroup content propagation under homophily, and we formulate the platform's decision as an optimization problem that aims at maximizing user engagement, potentially under fairness constraints. Our main notion of fairness requires that each group see a mixture of their preferred and non-preferred content, encouraging information diversity. Promoting such information diversity is often viewed as desirable and a potential means for breaking out of harmful echo chambers. We study the solutions to both the fairness-agnostic and fairness-aware problems. We prove that a fairness-agnostic approach inevitably leads to group-homogeneous targeting by the platform. This is only partially mitigated by imposing fairness constraints: we show that there exist optimal fairness-aware solutions which target one group with different types of content and the other group with only one type that is not necessarily the group's most preferred. Finally, using simulations with real-world data, we study the system dynamics and quantify the price of fairness. △ Less

Submitted 10 March, 2023; v1 submitted 19 February, 2022; originally announced February 2022.

Comments: 37th AAAI Conference on Artificial Intelligence (AAAI-23)

arXiv:1910.12293 [pdf, other]

Generation of digital patients for the simulation of tuberculosis with UISS-TB

Authors: Marzio Pennisi, Miguel A. Juarez, Giulia Russo, Marco Viceconti, Francesco Pappalardo

Abstract: EC funded STriTuVaD project aims to test, through a phase IIb clinical trial, two of the most advanced therapeutic vaccines against tuberculosis. In parallel, we have extended the Universal Immune System Simulator to include all relevant determinants of such clinical trial, to establish its predictive accuracy against the individual patients recruited in the trial, to use it to generate digital pa… ▽ More EC funded STriTuVaD project aims to test, through a phase IIb clinical trial, two of the most advanced therapeutic vaccines against tuberculosis. In parallel, we have extended the Universal Immune System Simulator to include all relevant determinants of such clinical trial, to establish its predictive accuracy against the individual patients recruited in the trial, to use it to generate digital patients and predict their response to the HRT being tested, and to combine them to the observations made on physical patients using a new in silico-augmented clinical trial approach that uses a Bayesian adaptive design. This approach, where found effective could drastically reduce the cost of innovation in this critical sector of public healthcare. One of the most challenging task is to develop a methodology to reproduce biological diversity of the subjects that have to be simulated, i.e., provide an appropriate strategy for the generation of libraries of digital patients. This has been achieved through the the creation of the initial immune system repertoire in a stochastic way, and though the identification of a "vector of features" that combines both biological and pathophysiological parameters that personalize the digital patient to reproduce the physiology and the pathophysiology of the subject. △ Less

Submitted 27 October, 2019; originally announced October 2019.

Comments: 5 pages

arXiv:1909.04660 [pdf]

POSITION PAPER: Credibility of In Silico Trial Technologies: A Theoretical Framing

Authors: Marco Viceconti, Miguel A. Juárez, Cristina Curreli, Marzio Pennisi, Giulia Russo, Francesco Pappalardo

Abstract: Different research communities have developed various approaches to assess the credibility of predictive models. Each approach usually works well for a specific type of model, and under some epistemic conditions that are normally satisfied within that specific research domain. Some regulatory agencies recently started to consider evidences of safety and efficacy on new medical products obtained us… ▽ More Different research communities have developed various approaches to assess the credibility of predictive models. Each approach usually works well for a specific type of model, and under some epistemic conditions that are normally satisfied within that specific research domain. Some regulatory agencies recently started to consider evidences of safety and efficacy on new medical products obtained using computer modelling and simulation (which is referred to as In Silico Trials); this has raised the attention in the computational medicine research community on the regulatory science aspects of this emerging discipline. But this poses a foundational problem: in the domain of biomedical research the use of computer modelling is relatively recent, without a widely accepted epistemic framing for problem of model credibility. Also, because of the inherent complexity of living organisms, biomedical modellers tend to use a variety of modelling methods, sometimes mixing them in the solution of a single problem. In such context merely adopting credibility approaches developed within other research community might not be appropriate. In this position paper we propose a theoretical framing for the problem of assessing the credibility of a predictive models for In Silico Trials, which accounts for the epistemic specificity of this research field and is general enough to be used for different type of models. △ Less

Submitted 24 October, 2019; v1 submitted 10 September, 2019; originally announced September 2019.

Comments: 11 pages

arXiv:1906.09682 [pdf, other]

Encrypted DNS --> Privacy? A Traffic Analysis Perspective

Authors: Sandra Siby, Marc Juarez, Claudia Diaz, Narseo Vallina-Rodriguez, Carmela Troncoso

Abstract: Virtually every connection to an Internet service is preceded by a DNS lookup which is performed without any traffic-level protection, thus enabling manipulation, redirection, surveillance, and censorship. To address these issues, large organizations such as Google and Cloudflare are deploying recently standardized protocols that encrypt DNS traffic between end users and recursive resolvers such a… ▽ More Virtually every connection to an Internet service is preceded by a DNS lookup which is performed without any traffic-level protection, thus enabling manipulation, redirection, surveillance, and censorship. To address these issues, large organizations such as Google and Cloudflare are deploying recently standardized protocols that encrypt DNS traffic between end users and recursive resolvers such as DNS-over-TLS (DoT) and DNS-over-HTTPS (DoH). In this paper, we examine whether encrypting DNS traffic can protect users from traffic analysis-based monitoring and censoring. We propose a novel feature set to perform the attacks, as those used to attack HTTPS or Tor traffic are not suitable for DNS' characteristics. We show that traffic analysis enables the identification of domains with high accuracy in closed and open world settings, using 124 times less data than attacks on HTTPS flows. We find that factors such as location, resolver, platform, or client do mitigate the attacks performance but they are far from completely stopping them. Our results indicate that DNS-based censorship is still possible on encrypted DNS traffic. In fact, we demonstrate that the standardized padding schemes are not effective. Yet, Tor -- which does not effectively mitigate traffic analysis attacks on web traffic -- is a good defense against DoH traffic analysis. △ Less

Submitted 6 October, 2019; v1 submitted 23 June, 2019; originally announced June 2019.

arXiv:1801.02265 [pdf, other]

Deep Fingerprinting: Undermining Website Fingerprinting Defenses with Deep Learning

Authors: Payap Sirinam, Mohsen Imani, Marc Juarez, Matthew Wright

Abstract: Website fingerprinting enables a local eavesdropper to determine which websites a user is visiting over an encrypted connection. State-of-the-art website fingerprinting attacks have been shown to be effective even against Tor. Recently, lightweight website fingerprinting defenses for Tor have been proposed that substantially degrade existing attacks: WTF-PAD and Walkie-Talkie. In this work, we pre… ▽ More Website fingerprinting enables a local eavesdropper to determine which websites a user is visiting over an encrypted connection. State-of-the-art website fingerprinting attacks have been shown to be effective even against Tor. Recently, lightweight website fingerprinting defenses for Tor have been proposed that substantially degrade existing attacks: WTF-PAD and Walkie-Talkie. In this work, we present Deep Fingerprinting (DF), a new website fingerprinting attack against Tor that leverages a type of deep learning called Convolutional Neural Networks (CNN) with a sophisticated architecture design, and we evaluate this attack against WTF-PAD and Walkie-Talkie. The DF attack attains over 98% accuracy on Tor traffic without defenses, better than all prior attacks, and it is also the only attack that is effective against WTF-PAD with over 90% accuracy. Walkie-Talkie remains effective, holding the attack to just 49.7% accuracy. In the more realistic open-world setting, our attack remains effective, with 0.99 precision and 0.94 recall on undefended traffic. Against traffic defended with WTF-PAD in this setting, the attack still can get 0.96 precision and 0.68 recall. These findings highlight the need for effective defenses that protect against this new attack and that could be deployed in Tor. △ Less

Submitted 19 August, 2018; v1 submitted 7 January, 2018; originally announced January 2018.

Comments: To appear in the 2018 ACM SIGSAC Conference on Computer and Communications Security (ACM CCS 2018)

arXiv:1708.08475 [pdf, other]

doi 10.1145/3133956.3134005

How Unique is Your .onion? An Analysis of the Fingerprintability of Tor Onion Services

Authors: Rebekah Overdorf, Marc Juarez, Gunes Acar, Rachel Greenstadt, Claudia Diaz

Abstract: Recent studies have shown that Tor onion (hidden) service websites are particularly vulnerable to website fingerprinting attacks due to their limited number and sensitive nature. In this work we present a multi-level feature analysis of onion site fingerprintability, considering three state-of-the-art website fingerprinting methods and 482 Tor onion services, making this the largest analysis of th… ▽ More Recent studies have shown that Tor onion (hidden) service websites are particularly vulnerable to website fingerprinting attacks due to their limited number and sensitive nature. In this work we present a multi-level feature analysis of onion site fingerprintability, considering three state-of-the-art website fingerprinting methods and 482 Tor onion services, making this the largest analysis of this kind completed on onion services to date. Prior studies typically report average performance results for a given website fingerprinting method or countermeasure. We investigate which sites are more or less vulnerable to fingerprinting and which features make them so. We find that there is a high variability in the rate at which sites are classified (and misclassified) by these attacks, implying that average performance figures may not be informative of the risks that website fingerprinting attacks pose to particular sites. We analyze the features exploited by the different website fingerprinting methods and discuss what makes onion service sites more or less easily identifiable, both in terms of their traffic traces as well as their webpage design. We study misclassifications to understand how onion service sites can be redesigned to be less vulnerable to website fingerprinting attacks. Our results also inform the design of website fingerprinting countermeasures and their evaluation considering disparate impact across sites. △ Less

Submitted 20 September, 2017; v1 submitted 28 August, 2017; originally announced August 2017.

Comments: Accepted by ACM CCS 2017

arXiv:1708.06376 [pdf, other]

doi 10.14722/ndss.2018.23105

Automated Website Fingerprinting through Deep Learning

Authors: Vera Rimmer, Davy Preuveneers, Marc Juarez, Tom Van Goethem, Wouter Joosen

Abstract: Several studies have shown that the network traffic that is generated by a visit to a website over Tor reveals information specific to the website through the timing and sizes of network packets. By capturing traffic traces between users and their Tor entry guard, a network eavesdropper can leverage this meta-data to reveal which website Tor users are visiting. The success of such attacks heavily… ▽ More Several studies have shown that the network traffic that is generated by a visit to a website over Tor reveals information specific to the website through the timing and sizes of network packets. By capturing traffic traces between users and their Tor entry guard, a network eavesdropper can leverage this meta-data to reveal which website Tor users are visiting. The success of such attacks heavily depends on the particular set of traffic features that are used to construct the fingerprint. Typically, these features are manually engineered and, as such, any change introduced to the Tor network can render these carefully constructed features ineffective. In this paper, we show that an adversary can automate the feature engineering process, and thus automatically deanonymize Tor traffic by applying our novel method based on deep learning. We collect a dataset comprised of more than three million network traces, which is the largest dataset of web traffic ever used for website fingerprinting, and find that the performance achieved by our deep learning approaches is comparable to known methods which include various research efforts spanning over multiple years. The obtained success rate exceeds 96% for a closed world of 100 websites and 94% for our biggest closed world of 900 classes. In our open world evaluation, the most performant deep learning model is 2% more accurate than the state-of-the-art attack. Furthermore, we show that the implicit features automatically learned by our approach are far more resilient to dynamic changes of web content over time. We conclude that the ability to automatically construct the most relevant traffic features and perform accurate traffic recognition makes our deep learning based approach an efficient, flexible and robust technique for website fingerprinting. △ Less

Submitted 5 December, 2017; v1 submitted 21 August, 2017; originally announced August 2017.

Comments: To appear in the 25th Symposium on Network and Distributed System Security (NDSS 2018)

arXiv:1512.00524 [pdf, other]

Toward an Efficient Website Fingerprinting Defense

Authors: Marc Juarez, Mohsen Imani, Mike Perry, Claudia Diaz, Matthew Wright

Abstract: Website Fingerprinting attacks enable a passive eavesdropper to recover the user's otherwise anonymized web browsing activity by matching the observed traffic with prerecorded web traffic templates. The defenses that have been proposed to counter these attacks are impractical for deployment in real-world systems due to their high cost in terms of added delay and bandwidth overhead. Further, these… ▽ More Website Fingerprinting attacks enable a passive eavesdropper to recover the user's otherwise anonymized web browsing activity by matching the observed traffic with prerecorded web traffic templates. The defenses that have been proposed to counter these attacks are impractical for deployment in real-world systems due to their high cost in terms of added delay and bandwidth overhead. Further, these defenses have been designed to counter attacks that, despite their high success rates, have been criticized for assuming unrealistic attack conditions in the evaluation setting. In this paper, we propose a novel, lightweight defense based on Adaptive Padding that provides a sufficient level of security against website fingerprinting, particularly in realistic evaluation conditions. In a closed-world setting, this defense reduces the accuracy of the state-of-the-art attack from 91% to 20%, while introducing zero latency overhead and less than 60% bandwidth overhead. In an open-world, the attack precision is just 1% and drops further as the number of sites grows. △ Less

Submitted 19 July, 2016; v1 submitted 1 December, 2015; originally announced December 2015.

Comments: To appear In the proceedings of the European Symposium on Research in Computer Security (ESORICS), pp. 20, Springer, 2016

Showing 1–12 of 12 results for author: Juarez, M