P4: Towards private, personalized, and Peer-to-Peer learning

Authors: Mohammad Mahdi Maheri, Sandra Siby, Sina Abdollahi, Anastasia Borovykh, Hamed Haddadi

Abstract: Personalized learning is a proposed approach to address the problem of data heterogeneity in collaborative machine learning. In a decentralized setting, the two main challenges of personalization are client clustering and data privacy. In this paper, we address these challenges by developing P4 (Personalized Private Peer-to-Peer), a method that ensures that each client receives a personalized model while maintaining a differential privacy guarantee for each client's local dataset during and after the training. Our approach includes the design of a lightweight algorithm to identify similar clients and group them in a private, peer-to-peer (P2P) manner. Once clients are grouped, we develop differentially private knowledge distillation for them to co-train with minimal impact on accuracy. We evaluate our proposed method on three benchmark datasets (FEMNIST or Federated EMNIST, CIFAR-10 and CIFAR-100) and two different neural network architectures (Linear and CNN-based networks) across a range of privacy parameters. The results demonstrate the potential of P4, as it outperforms the state of the art in differentially private P2P learning by up to 40 percent in terms of accuracy. We also show the practicality of P4 by implementing it on resource-constrained devices and validating that it has minimal overhead, e.g., about 7 seconds to run collaborative training between two clients.

Submitted 31 May, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.06545 [pdf, other]

Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval

Authors: Mengjia Niu, Hao Li, Jie Shi, Hamed Haddadi, Fan Mo

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various domains, although their susceptibility to hallucination poses significant challenges for their deployment in critical areas such as healthcare. To address this issue, retrieving relevant facts from knowledge graphs (KGs) is considered a promising method. Existing KG-augmented approaches tend to be resource-intensive, requiring multiple rounds of retrieval and verification for each factoid, which impedes their application in real-world scenarios. In this study, we propose Self-Refinement-Enhanced Knowledge Graph Retrieval (Re-KGR) to augment the factuality of LLMs' responses with less retrieval effort in the medical field. Our approach leverages the attribution of next-token predictive probability distributions across different tokens and various model layers to primarily identify tokens with a high potential for hallucination, reducing verification rounds by refining knowledge triples associated with these tokens. Moreover, we rectify inaccurate content using retrieved knowledge in the post-processing stage, which improves the truthfulness of generated responses. Experimental results on a medical dataset demonstrate that our approach can enhance the factual capability of LLMs across various foundational models, as evidenced by the highest scores on truthfulness.

Submitted 10 May, 2024; originally announced May 2024.

ACM Class: I.2.7; H.3.3

arXiv:2405.00596 [pdf, other]

Unbundle-Rewrite-Rebundle: Runtime Detection and Rewriting of Privacy-Harming Code in JavaScript Bundles

Authors: Mir Masood Ali, Peter Snyder, Chris Kanich, Hamed Haddadi

Abstract: This work presents Unbundle-Rewrite-Rebundle (URR), a system for detecting privacy-harming portions of bundled JavaScript code, and rewriting that code at runtime to remove the privacy-harming behavior without breaking the surrounding code or overall application. URR is a novel solution to the problem of JavaScript bundles, where websites pre-compile multiple code units into a single file, making it impossible for content filters and ad-blockers to differentiate between desired and unwanted resources. Where traditional content filtering tools rely on URLs, URR analyzes the code at the AST level, and replaces harmful AST sub-trees with privacy-and-functionality-maintaining alternatives. We present an open-sourced implementation of URR as a Firefox extension, and evaluate it against JavaScript bundles generated by the most popular bundling system (Webpack) deployed on the Tranco 10k. We measure performance in terms of precision (1.00), recall (0.95), and speed (0.43s per script) when detecting and rewriting three representative privacy-harming libraries often included in JavaScript bundles, and find URR to be an effective approach to a large-and-growing blind spot unaddressed by current privacy tools.

Submitted 7 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

arXiv:2404.00190 [pdf, other]

GuaranTEE: Towards Attestable and Private ML with CCA

Authors: Sandra Siby, Sina Abdollahi, Mohammad Maheri, Marios Kogias, Hamed Haddadi

Abstract: Machine-learning (ML) models are increasingly being deployed on edge devices to provide a variety of services. However, their deployment is accompanied by challenges in model privacy and auditability. Model providers want to (i) ensure that their proprietary models are not exposed to third parties; and (ii) be able to get attestations that their genuine models are operating on edge devices in accordance with the service agreement with the user. Existing measures to address these challenges have been hindered by issues such as high overheads and limited capability (processing/secure memory) on edge devices. In this work, we propose GuaranTEE, a framework to provide attestable private machine learning on the edge. GuaranTEE uses Confidential Computing Architecture (CCA), Arm's latest architectural extension that allows for the creation and deployment of dynamic Trusted Execution Environments (TEEs) within which models can be executed. We evaluate CCA's feasibility to deploy ML models by developing, evaluating, and openly releasing a prototype. We also suggest improvements to CCA to facilitate its use in protecting the entire ML deployment pipeline on edge devices.

Submitted 29 March, 2024; originally announced April 2024.

Comments: Accepted at the 4th Workshop on Machine Learning and Systems (EuroMLSys '24)

arXiv:2403.15905 [pdf, other]

Towards Low-Energy Adaptive Personalization for Resource-Constrained Devices

Authors: Yushan Huang, Josh Millar, Yuxuan Long, Yuchen Zhao, Hamed Haddadi

Abstract: The personalization of machine learning (ML) models to address data drift is a significant challenge in the context of Internet of Things (IoT) applications. Presently, most approaches focus on fine-tuning either the full base model or its last few layers to adapt to new data, while often neglecting energy costs. However, various types of data drift exist, and fine-tuning the full base model or the last few layers may not result in optimal performance in certain scenarios. We propose Target Block Fine-Tuning (TBFT), a low-energy adaptive personalization framework designed for resource-constrained devices. We categorize data drift and personalization into three types: input-level, feature-level, and output-level. For each type, we fine-tune different blocks of the model to achieve optimal performance with reduced energy costs. Specifically, input-, feature-, and output-level correspond to fine-tuning the front, middle, and rear blocks of the model. We evaluate TBFT on a ResNet model, three datasets, three different training sizes, and a Raspberry Pi. Compared with Block Avg, where each block is fine-tuned individually and their performance improvements are averaged, TBFT exhibits an improvement in model accuracy by an average of 15.30% whilst saving 41.57% energy consumption on average compared with full fine-tuning.

Submitted 29 March, 2024; v1 submitted 23 March, 2024; originally announced March 2024.

Comments: Accepted to the 4th Workshop on Machine Learning and Systems (EuroMLSys '24)

arXiv:2403.12844 [pdf, other]

MELTing point: Mobile Evaluation of Language Transformers

Authors: Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi

Abstract: Transformers have revolutionized the machine learning landscape, gradually making their way into everyday tasks and equipping our computers with "sparks of intelligence". However, their runtime requirements have prevented them from being broadly deployed on mobile. As personal devices become increasingly powerful and prompt privacy becomes an ever more pressing issue, we explore the current state of mobile execution of Large Language Models (LLMs). To achieve this, we have created our own automation infrastructure, MELT, which supports the headless execution and benchmarking of LLMs on device, supporting different models, devices and frameworks, including Android, iOS and Nvidia Jetson devices. We evaluate popular instruction fine-tuned LLMs and leverage different frameworks to measure their end-to-end and granular performance, tracing their memory and energy requirements along the way. Our analysis is the first systematic study of on-device LLM execution, quantifying performance, energy efficiency and accuracy across various state-of-the-art models and showcases the state of on-device intelligence in the era of hyperscale models. Results highlight the performance heterogeneity across targets and corroborate that LLM inference is largely memory-bound. Quantization drastically reduces memory requirements and renders execution viable, but at a non-negligible accuracy cost. Drawing from its energy footprint and thermal behavior, the continuous execution of LLMs remains elusive, as both factors negatively affect user experience.
Last, our experience shows that the ecosystem is still in its infancy, and algorithmic as well as hardware breakthroughs can significantly shift the execution cost. We expect NPU acceleration, and framework-hardware co-design to be the biggest bet towards efficient standalone execution, with the alternative of offloading tailored towards edge deployments.

Submitted 25 July, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: Accepted at the 30th Annual International Conference On Mobile Computing And Networking (MobiCom 2024)

arXiv:2403.08040 [pdf, other]

MicroT: Low-Energy and Adaptive Models for MCUs

Authors: Yushan Huang, Ranya Aloufi, Xavier Cadet, Yuchen Zhao, Payam Barnaghi, Hamed Haddadi

Abstract: We propose MicroT, a low-energy, multi-task adaptive model framework for resource-constrained MCUs. We divide the original model into a feature extractor and a classifier. The feature extractor is obtained through self-supervised knowledge distillation and further optimized into part and full models through model splitting and joint training. These models are then deployed on MCUs, with classifiers added and trained on local tasks, ultimately performing stage-decision for joint inference. In this process, the part model initially processes the sample, and if the confidence score falls below the set threshold, the full model will resume and continue the inference. We evaluate MicroT on two models, three datasets, and two MCU boards. Our experimental evaluation shows that MicroT effectively improves model performance and reduces energy consumption when dealing with multiple local tasks. Compared to the unoptimized feature extractor, MicroT can improve accuracy by up to 9.87%. On MCUs, compared to the standard full model inference, MicroT can save up to about 29.13% in energy consumption. MicroT also allows users to adaptively adjust the stage-decision ratio as needed, better balancing model performance and energy consumption. Under the standard stage-decision ratio configuration, MicroT can increase accuracy by 5.91% and save about 14.47% of energy consumption.

Submitted 9 July, 2024; v1 submitted 12 March, 2024; originally announced March 2024.
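
The stage-decision step described in the MicroT abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`stage_decision`, `part_model`, `full_model_rest`), the use of the maximum class probability as the confidence score, and the default threshold of 0.8 are all assumptions made for illustration.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   part_model(sample)        -> (intermediate_features, class_probabilities)
#   full_model_rest(features) -> class_probabilities
PartModel = Callable[[Any], Tuple[Any, Sequence[float]]]
FullModelRest = Callable[[Any], Sequence[float]]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def stage_decision(
    sample: Any,
    part_model: PartModel,
    full_model_rest: FullModelRest,
    threshold: float = 0.8,  # assumed value; the abstract gives no number
) -> Tuple[int, str]:
    """Run the part model first; fall back to the full model only when the
    part model's confidence is below the threshold.

    Returns (predicted_class, stage), where stage is "part" or "full".
    """
    features, probs = part_model(sample)
    confidence = max(probs)  # assumed confidence score: top class probability
    if confidence >= threshold:
        # Confident enough: stop early and skip the remaining layers.
        return argmax(probs), "part"
    # Not confident: resume from the intermediate features with the full model.
    full_probs = full_model_rest(features)
    return argmax(full_probs), "full"
```

Raising the threshold sends more samples to the full model, which is one way to read the abstract's remark about adjusting the stage-decision ratio to trade accuracy against energy.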

arXiv:2402.02877 [pdf]

Feedback to the European Data Protection Board's Guidelines 2/2023 on Technical Scope of Art. 5(3) of ePrivacy Directive

Authors: Cristiana Santos, Nataliia Bielova, Vincent Roca, Mathieu Cunche, Gilles Mertens, Karel Kubicek, Hamed Haddadi

Abstract: We very much welcome the EDPB's Guidelines. Please find hereunder our feedback to the Guidelines 2/2023 on Technical Scope of Art. 5(3) of ePrivacy Directive. Our comments are presented after a quotation from the proposed text by the EDPB in a box.

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2401.14332 [pdf, other]

SunBlock: Cloudless Protection for IoT Systems

Authors: Vadim Safronov, Anna Maria Mandalari, Daniel J. Dubois, David Choffnes, Hamed Haddadi

Abstract: With an increasing number of Internet of Things (IoT) devices present in homes, there is a rise in the number of potential information leakage channels and their associated security threats and privacy risks. Despite a long history of attacks on IoT devices in unprotected home networks, the problem of accurate, rapid detection and prevention of such attacks remains open. Many existing IoT protection solutions are cloud-based, sometimes ineffective, and might share consumer data with unknown third parties. This paper investigates the potential for effective IoT threat detection locally, on a home router, using AI tools combined with classic rule-based traffic-filtering algorithms. Our results show that with a slight rise in router hardware resource usage caused by machine learning and traffic filtering logic, a typical home router instrumented with our solution is able to effectively detect risks and protect a typical home IoT network, equaling or outperforming existing popular solutions, without any effects on benign IoT functionality, and without relying on cloud services and third parties.

Submitted 25 January, 2024; originally announced January 2024.

Comments: This paper is accepted at Passive and Active Measurement (PAM) conference 2024

arXiv:2401.01353 [pdf, other]

The Boomerang protocol: A Decentralised Privacy-Preserving Verifiable Incentive Protocol

Authors: Ralph Ankele, Hamed Haddadi

Abstract: In the era of data-driven economies, incentive systems and loyalty programs have become ubiquitous in various sectors, including advertising, retail, travel, and financial services. While these systems offer advantages for both users and companies, they necessitate the transfer and analysis of substantial amounts of sensitive data. Privacy concerns have become increasingly pertinent, necessitating the development of privacy-preserving incentive protocols. Despite the rising demand for secure and decentralized systems, the existing landscape lacks a comprehensive solution. We propose the Boomerang protocol, a novel decentralized privacy-preserving incentive protocol that leverages cryptographic black box accumulators to securely store user interactions within the incentive system. Moreover, the protocol employs zero-knowledge proofs based on BulletProofs to transparently compute rewards for users, ensuring verifiability while preserving their privacy. To further enhance public verifiability and transparency, we utilize a smart contract on a Layer 1 blockchain to verify these zero-knowledge proofs. The careful combination of black box accumulators with selected elliptic curves in the zero-knowledge proofs makes the Boomerang protocol highly efficient. Our proof-of-concept implementation shows that we can handle up to 23.6 million users per day, on a single-threaded backend server with financial costs of approximately 2 USD. Using the Solana blockchain we can handle 15.5 million users per day with approximate costs of 0.00011 USD per user.
The Boomerang protocol represents a significant advancement in privacy-preserving incentive protocols, laying the groundwork for a more secure and privacy-centric future.

Submitted 9 January, 2024; v1 submitted 6 December, 2023; originally announced January 2024.

Comments: fix formatting issue in abstract

arXiv:2311.03417 [pdf]

Federated Learning for Clinical Structured Data: A Benchmark Comparison of Engineering and Statistical Approaches

Authors: Siqi Li, Di Miao, Qiming Wu, Chuan Hong, Danny D'Agostino, Xin Li, Yilin Ning, Yuqing Shang, Huazhu Fu, Marcus Eng Hock Ong, Hamed Haddadi, Nan Liu

Abstract: Federated learning (FL) has shown promising potential in safeguarding data privacy in healthcare collaborations. While the term "FL" was originally coined by the engineering community, the statistical field has also explored similar privacy-preserving algorithms. Statistical FL algorithms, however, remain considerably less recognized than their engineering counterparts. Our goal was to bridge the gap by presenting the first comprehensive comparison of FL frameworks from both engineering and statistical domains. We evaluated five FL frameworks using both simulated and real-world data. The results indicate that statistical FL algorithms yield less biased point estimates for model coefficients and offer convenient confidence interval estimations. In contrast, engineering-based methods tend to generate more accurate predictions, sometimes surpassing central pooled and statistical FL models. This study underscores the relative strengths and weaknesses of both types of methods, emphasizing the need for increased awareness and their integration in future FL applications.

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2310.08147 [pdf, other]

Optimization of Federated Learning's Client Selection for Non-IID Data Based on Grey Relational Analysis

Authors: Shuaijun Chen, Omid Tavallaie, Michael Henri Hambali, Seid Miad Zandavi, Hamed Haddadi, Nicholas Lane, Song Guo, Albert Y. Zomaya

Abstract: Federated learning (FL) is a novel distributed learning framework designed for applications with privacy-sensitive data. Without sharing data, FL trains local models on individual devices and constructs the global model on the server by performing model aggregation. However, to reduce the communication cost, the participants in each training round are randomly selected, which significantly decreases the training efficiency under data and device heterogeneity. To address this issue, in this paper, we introduce a novel approach that considers the data distribution and computational resources of devices to select the clients for each training round. Our proposed method performs client selection based on the Grey Relational Analysis (GRA) theory by considering available computational resources for each client, the training loss, and weight divergence. To examine the usability of our proposed method, we implement our contribution on Amazon Web Services (AWS) by using the TensorFlow library of Python. We evaluate our algorithm's performance in different setups by varying the learning rate, network size, the number of selected clients, and the client selection round. The evaluation results show that our proposed algorithm enhances the performance significantly in terms of test accuracy and the average client's waiting time compared to state-of-the-art methods, federated averaging and Pow-d.

Submitted 23 January, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

arXiv:2309.05845 [pdf, other]

Effective Abnormal Activity Detection on Multivariate Time Series Healthcare Data

Authors: Mengjia Niu, Yuchen Zhao, Hamed Haddadi

Abstract: Multivariate time series (MTS) data collected from multiple sensors provide the potential for accurate abnormal activity detection in smart healthcare scenarios. However, anomalies exhibit diverse patterns and become unnoticeable in MTS data. Consequently, achieving accurate anomaly detection is challenging since we have to capture both temporal dependencies of time series and inter-relationships among variables. To address this problem, we propose a Residual-based Anomaly Detection approach, Rs-AD, for effective representation learning and abnormal activity detection. We evaluate our scheme on a real-world gait dataset and the experimental results demonstrate an F1 score of 0.839.

Submitted 11 September, 2023; originally announced September 2023.

Comments: Poster accepted by the 29th Annual International Conference On Mobile Computing And Networking (ACM MobiCom 2023)

ACM Class: J.3; I.2.6

arXiv:2308.15309 [pdf, other]

Understanding the Privacy Risks of Popular Search Engine Advertising Systems

Authors: Salim Chouaki, Oana Goga, Hamed Haddadi, Peter Snyder

Abstract: We present the first extensive measurement of the privacy properties of the advertising systems used by privacy-focused search engines. We propose an automated methodology to study the impact of clicking on search ads on three popular private search engines which have advertising-based business models: StartPage, Qwant, and DuckDuckGo, and we compare them to two dominant data-harvesting ones: Google and Bing. We investigate the possibility of third parties tracking users when clicking on ads by analyzing first-party storage, redirection domain paths, and requests sent before, when, and after the clicks. Our results show that privacy-focused search engines fail to protect users' privacy when clicking ads. Users' requests are sent through redirectors on 4% of ad clicks on Bing, 86% of ad clicks on Qwant, and 100% of ad clicks on Google, DuckDuckGo, and StartPage. Even worse, advertising systems collude with advertisers across all search engines by passing unique IDs to advertisers in most ad clicks. These IDs allow redirectors to aggregate users' activity on ads' destination websites in addition to the activity they record when users are redirected through them. Overall, we observe that both privacy-focused and traditional search engines engage in privacy-harming behaviors allowing cross-site tracking, even in privacy-enhanced browsers.

Submitted 23 September, 2023; v1 submitted 29 August, 2023; originally announced August 2023.

arXiv:2306.13039 [pdf, other]

GT-TSCH: Game-Theoretic Distributed TSCH Scheduler for Low-Power IoT Networks

Authors: Omid Tavallaie, Seid Miad Zandavi, Hamed Haddadi, Albert Y. Zomaya

Abstract: Time-Slotted Channel Hopping (TSCH) is a synchronous medium access mode of the IEEE 802.15.4e standard designed for providing low-latency and highly-reliable end-to-end communication. TSCH constructs a communication schedule by combining frequency channel hopping with Time Division Multiple Access (TDMA). In recent years, IETF designed several standards to define general mechanisms for the implementation of TSCH. However, the problem of updating the TSCH schedule according to the changes of the wireless link quality and node's traffic load is left unresolved. In this paper, we use non-cooperative game theory to propose GT-TSCH, a distributed TSCH scheduler designed for low-power IoT applications. By considering selfish behavior of nodes in packet forwarding, GT-TSCH updates the TSCH schedule in a distributed approach with low control overhead by monitoring the queue length, the place of the node in the Directed Acyclic Graph (DAG) topology, the quality of the wireless link, and the data packet generation rate. We prove the existence and uniqueness of Nash equilibrium in our game model and we find the optimal number of TSCH Tx timeslots to update the TSCH slotframe. To examine the performance of our contribution, we implement GT-TSCH on Zolertia Firefly IoT motes and the Contiki-NG Operating System (OS). The evaluation results reveal that GT-TSCH improves performance in terms of throughput and end-to-end delay compared to the state-of-the-art method.

Submitted 22 June, 2023; originally announced June 2023.

Comments: 43rd IEEE International Conference on Distributed Computing Systems

arXiv:2306.04337 [pdf, other]

A study on the impact of Self-Supervised Learning on automatic dysarthric speech assessment

Authors: Xavier F. Cadet, Ranya Aloufi, Sara Ahmadi-Abhari, Hamed Haddadi

Abstract: Automating dysarthria assessments offers the opportunity to develop practical, low-cost tools that address the current limitations of manual and subjective assessments. Nonetheless, the small size of most dysarthria datasets makes it challenging to develop automated assessment. Recent research showed that speech representations from models pre-trained on large unlabelled data can enhance Automatic Speech Recognition (ASR) performance for dysarthric speech. We are the first to evaluate the representations from pre-trained state-of-the-art Self-Supervised models across three downstream tasks on dysarthric speech: disease classification, word recognition and intelligibility classification, and under three noise scenarios on the UA-Speech dataset. We show that HuBERT is the most versatile feature extractor across dysarthria classification, word recognition, and intelligibility classification, achieving respectively +24.7%, +61%, and +7.2% accuracy compared to classical acoustic features.

Submitted 22 March, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: Accepted as a workshop paper at ICASSP SASB 2024

arXiv:2306.01398 [pdf, other]

Evaluating The Robustness of Self-Supervised Representations to Background/Foreground Removal

Authors: Xavier F. Cadet, Ranya Aloufi, Alain Miranville, Sara Ahmadi-Abhari, Hamed Haddadi

Abstract: Despite impressive empirical advances of SSL in solving various tasks, the problem of understanding and characterizing SSL representations learned from input data remains relatively under-explored. We provide a comparative analysis of how the representations produced by SSL models differ when masking parts of the input. Specifically, we considered state-of-the-art SSL pretrained models, such as DINOv2, MAE, and SwAV, and analyzed changes at the representation levels across 4 Image Classification datasets. First, we generate variations of the datasets by applying foreground and background segmentation. Then, we conduct statistical analysis using Canonical Correlation Analysis (CCA) and Centered Kernel Alignment (CKA) to evaluate the robustness of the representations learned in SSL models. Empirically, we show that not all models lead to representations that separate foreground, background, and complete images. Furthermore, we test different masking strategies by occluding the center regions of the images to address cases where foreground and background are difficult to separate, for example, the DTD dataset, which focuses on texture rather than specific objects.

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2305.18954 [pdf, other]

Towards Machine Learning and Inference for Resource-constrained MCUs

Authors: Yushan Huang, Hamed Haddadi

Abstract: Machine learning (ML) is moving towards edge devices. However, ML models with high computational demands and energy consumption pose challenges for ML inference in resource-constrained environments, such as the deep sea. To address these challenges, we propose a battery-free ML inference and model personalization pipeline for microcontroller units (MCUs). As an example, we performed fish image recognition in the ocean. We evaluated and compared the accuracy, runtime, power, and energy consumption of the model before and after optimization. The results demonstrate that our pipeline can achieve 97.78% accuracy with 483.82 KB Flash, 70.32 KB RAM, 118 ms runtime, 4.83 mW power, and 0.57 mJ energy consumption on MCUs, reducing by 64.17%, 12.31%, 52.42%, 63.74%, and 82.67%, compared to the baseline. The results indicate the feasibility of battery-free ML inference on MCUs.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Poster accepted by the 21st ACM International Conference on Mobile Systems, Applications, and Services (ACM MobiSys 2023)

arXiv:2305.05257 [pdf, other]

Survey of Federated Learning Models for Spatial-Temporal Mobility Applications

Authors: Yacine Belal, Sonia Ben Mokhtar, Hamed Haddadi, Jaron Wang, Afra Mashhadi

Abstract: Federated learning involves training statistical models over edge devices such as mobile phones such that the training data is kept local. Federated Learning (FL) can serve as an ideal candidate for training spatial temporal models that rely on heterogeneous and potentially massive numbers of participants while preserving the privacy of highly sensitive location data. However, there are unique challenges involved with transitioning existing spatial temporal models to decentralized learning. In this survey paper, we review the existing literature that has proposed FL-based models for predicting human mobility, traffic prediction, community detection, location-based recommendation systems, and other spatial-temporal tasks. We describe the metrics and datasets these works have been using and create a baseline of these approaches in comparison to the centralized settings. Finally, we discuss the challenges of applying spatial-temporal models in a decentralized setting and by highlighting the gaps in the literature we provide a road map and opportunities for the research community.

Submitted 8 February, 2024; v1 submitted 9 May, 2023; originally announced May 2023.

ACM Class: A.1; D.4.6; H.4.3; H.5.6; I.2.6; I.5.3; I.5.8

arXiv:2304.07310 [pdf]

doi 10.1093/jamia/ocad170

Federated and distributed learning applications for electronic health records and structured medical data: A scoping review

Authors: Siqi Li, Pinyan Liu, Gustavo G. Nascimento, Xinru Wang, Fabio Renato Manzolli Leite, Bibhas Chakraborty, Chuan Hong, Yilin Ning, Feng Xie, Zhen Ling Teo, Daniel Shu Wei Ting, Hamed Haddadi, Marcus Eng Hock Ong, Marco Aurélio Peres, Nan Liu

Abstract: Federated learning (FL) has gained popularity in clinical research in recent years to facilitate privacy-preserving collaboration. Structured data, one of the most prevalent forms of clinical data, has experienced significant growth in volume concurrently, notably with the widespread adoption of electronic health records in clinical practice. This review examines FL applications on structured medical data, identifies contemporary limitations and discusses potential innovations. We searched five databases, SCOPUS, MEDLINE, Web of Science, Embase, and CINAHL, to identify articles that applied FL to structured medical data and reported results following the PRISMA guidelines. Each selected publication was evaluated from three primary perspectives, including data quality, modeling strategies, and FL frameworks. Out of the 1160 papers screened, 34 met the inclusion criteria, with each article consisting of one or more studies that used FL to handle structured clinical/medical data. Of these, 24 utilized data acquired from electronic health records, with clinical predictions and association studies being the most common clinical research tasks that FL was applied to. Only one article exclusively explored the vertical FL setting, while the remaining 33 explored the horizontal FL setting, with only 14 discussing comparisons between single-site (local) and FL (global) analysis. The existing FL applications on structured medical data lack sufficient evaluations of clinically meaningful benefits, particularly when compared to single-site analyses. Therefore, it is crucial for future FL applications to prioritize clinical motivations and develop designs and methodologies that can effectively support and aid clinical practice and research.

Submitted 14 April, 2023; originally announced April 2023.

arXiv:2304.06469 [pdf, other]

Analysing Fairness of Privacy-Utility Mobility Models

Authors: Yuting Zhan, Hamed Haddadi, Afra Mashhadi

Abstract: Preserving the individuals' privacy in sharing spatial-temporal datasets is critical to prevent re-identification attacks based on unique trajectories. Existing privacy techniques tend to propose ideal privacy-utility tradeoffs but largely ignore the fairness implications of mobility models and whether such techniques perform equally for different groups of users. The quantification between fairness and privacy-aware models is still unclear and there barely exist any defined sets of metrics for measuring fairness in the spatial-temporal context. In this work, we define a set of fairness metrics designed explicitly for human mobility, based on structural similarity and entropy of the trajectories. Under these definitions, we examine the fairness of two state-of-the-art privacy-preserving models that rely on GAN and representation learning to reduce the re-identification rate of users for data sharing. Our results show that while both models guarantee group fairness in terms of demographic parity, they violate individual fairness criteria, indicating that users with highly similar trajectories receive disparate privacy gain. We conclude that the tension between the re-identification task and individual fairness needs to be considered for future spatial-temporal data analysis and modelling to achieve a privacy-preserving fairness-aware setting.

Submitted 10 April, 2023; originally announced April 2023.

arXiv:2304.03045 [pdf, other]

Protected or Porous: A Comparative Analysis of Threat Detection Capability of IoT Safeguards

Authors: Anna Maria Mandalari, Hamed Haddadi, Daniel J. Dubois, David Choffnes

Abstract: Consumer Internet of Things (IoT) devices are increasingly common, from smart speakers to security cameras, in homes. Along with their benefits come potential privacy and security threats. To limit these threats a number of commercial services have become available (IoT safeguards). The safeguards claim to provide protection against IoT privacy risks and security threats. However, the effectiveness and the associated privacy risks of these safeguards remain a key open question. In this paper, we investigate the threat detection capabilities of IoT safeguards for the first time. We develop and release an approach for automated safeguards experimentation to reveal their response to common security threats and privacy risks. We perform thousands of automated experiments using popular commercial IoT safeguards when deployed in a large IoT testbed. Our results indicate not only that these devices may be ineffective in preventing risks, but also their cloud interactions and data collection operations may introduce privacy risks for the households that adopt them.

Submitted 6 April, 2023; originally announced April 2023.

arXiv:2302.11654 [pdf, other]

Information Theory Inspired Pattern Analysis for Time-series Data

Authors: Yushan Huang, Yuchen Zhao, Alexander Capstick, Francesca Palermo, Hamed Haddadi, Payam Barnaghi

Abstract: Current methods for pattern analysis in time series mainly rely on statistical features or probabilistic learning and inference methods to identify patterns and trends in the data. Such methods do not generalize well when applied to multivariate, multi-source, state-varying, and noisy time-series data. To address these issues, we propose a highly generalizable method that uses information theory-based features to identify and learn from patterns in multivariate time-series data. To demonstrate the proposed approach, we analyze pattern changes in human activity data. For applications with stochastic state transitions, features are developed based on Shannon's entropy of Markov chains, entropy rates of Markov chains, entropy production of Markov chains, and von Neumann entropy of Markov chains. For applications where state modeling is not applicable, we utilize five entropy variants, including approximate entropy, increment entropy, dispersion entropy, phase entropy, and slope entropy. The results show the proposed information theory-based features improve the recall rate, F1 score, and accuracy on average by up to 23.01% compared with the baseline models and a simpler model structure, with an average reduction of 18.75 times in the number of model parameters.

Submitted 28 April, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

arXiv:2212.14736 [pdf, other]

PRISM: Privacy Preserving Healthcare Internet of Things Security Management

Authors: Savvas Hadjixenophontos, Anna Maria Mandalari, Yuchen Zhao, Hamed Haddadi

Abstract: Consumer healthcare Internet of Things (IoT) devices are gaining popularity in our homes and hospitals. These devices provide continuous monitoring at a low cost and can be used to augment high-precision medical equipment. However, major challenges remain in applying pre-trained global models for anomaly detection on smart health monitoring, for a diverse set of individuals that they provide care for. In this paper, we propose PRISM, an edge-based system for experimenting with in-home smart healthcare devices. We develop a rigorous methodology that relies on automated IoT experimentation. We use a rich real-world dataset from in-home patient monitoring from 44 households of People Living With Dementia (PLWD) over two years. Our results indicate that anomalies can be identified with accuracy up to 99% and mean training times as low as 0.88 seconds. While all models achieve high accuracy when trained on the same patient, their accuracy degrades when evaluated on different patients.

Submitted 2 June, 2023; v1 submitted 27 December, 2022; originally announced December 2022.

arXiv:2210.04791 [pdf, other]

doi 10.1145/3563766.3564111

Tango or Square Dance? How Tightly Should we Integrate Network Functionality in Browsers?

Authors: Alex Davidson, Matthias Frei, Marten Gartner, Hamed Haddadi, Jordi Subirà Nieto, Adrian Perrig, Philipp Winter, François Wirz

Abstract: The question at which layer network functionality is presented or abstracted remains a research challenge. Traditionally, network functionality was either placed into the core network, middleboxes, or into the operating system -- but recent developments have expanded the design space to directly introduce functionality into the application (and in particular into the browser) as a way to expose it to the user. Given the context of emerging path-aware networking technology, an interesting question arises: which layer should handle the new features? We argue that the browser is becoming a powerful platform for network innovation, where even user-driven properties can be implemented in an OS-agnostic fashion. We demonstrate the feasibility of geo-fenced browsing using a prototype browser extension, realized by the SCION path-aware networking architecture, without introducing any significant performance overheads.

Submitted 10 October, 2022; originally announced October 2022.

Comments: 1 table, 6 figures

arXiv:2210.01736 [pdf, other]

Using Entropy Measures for Monitoring the Evolution of Activity Patterns

Authors: Yushan Huang, Yuchen Zhao, Hamed Haddadi, Payam Barnaghi

Abstract: In this work, we apply information theory inspired methods to quantify changes in daily activity patterns. We use in-home movement monitoring data and show how they can help indicate the occurrence of healthcare-related events. Three different types of entropy measures namely Shannon's entropy, entropy rates for Markov chains, and entropy production rate have been utilised. The measures are evaluated on a large-scale in-home monitoring dataset that has been collected within our dementia care clinical study. The study uses Internet of Things (IoT) enabled solutions for continuous monitoring of in-home activity, sleep, and physiology to develop care and early intervention solutions to support people living with dementia (PLWD) in their own homes. Our main goal is to show the applicability of the entropy measures to time-series activity data analysis and to use the extracted measures as new engineered features that can be fed into inference and analysis models. The results of our experiments show that in most cases the combination of these measures can indicate the occurrence of healthcare-related events. We also find that different participants with the same events may have different measures based on one entropy measure. So using a combination of these measures in an inference model will be more effective than any of the single measures.

Submitted 5 October, 2022; v1 submitted 4 October, 2022; originally announced October 2022.

arXiv:2208.10134 [pdf, other]

doi 10.1145/3670007

Machine Learning with Confidential Computing: A Systematization of Knowledge

Authors: Fan Mo, Zahra Tarkhani, Hamed Haddadi

Abstract: Privacy and security challenges in Machine Learning (ML) have become increasingly severe, along with ML's pervasive development and the recent demonstration of large attack surfaces. As a mature system-oriented approach, Confidential Computing has been utilized in both academia and industry to mitigate privacy and security issues in various ML scenarios. In this paper, the conjunction between ML and Confidential Computing is investigated. We systematize the prior work on Confidential Computing-assisted ML techniques that provide i) confidentiality guarantees and ii) integrity assurances, and discuss their advanced features and drawbacks. Key challenges are further identified, and we provide dedicated analyses of the limitations in existing Trusted Execution Environment (TEE) systems for ML use cases. Finally, prospective works are discussed, including grounded privacy definitions for closed-loop protection, partitioned executions of efficient ML, dedicated TEE-assisted designs for ML, TEE-aware ML, and ML full pipeline guarantees. By providing these potential solutions in our systematization of knowledge, we aim to build the bridge to help achieve a much stronger TEE-enabled ML for privacy guarantees without introducing computation and system costs.

Submitted 3 June, 2024; v1 submitted 22 August, 2022; originally announced August 2022.

Comments: Survey paper, 37 pages, accepted to ACM Computing Surveys

arXiv:2208.05009 [pdf, other]

Privacy-Aware Adversarial Network in Human Mobility Prediction

Authors: Yuting Zhan, Hamed Haddadi, Afra Mashhadi

Abstract: As mobile devices and location-based services are increasingly developed in different smart city scenarios and applications, many unexpected privacy leakages have arisen due to geolocated data collection and sharing. User re-identification and other sensitive inferences are major privacy threats when geolocated data are shared with cloud-assisted applications. Significantly, four spatio-temporal points are enough to uniquely identify 95% of the individuals, which exacerbates personal information leakages. To tackle malicious purposes such as user re-identification, we propose an LSTM-based adversarial mechanism with representation learning to attain a privacy-preserving feature representation of the original geolocated data (i.e., mobility data) for a sharing purpose. These representations aim to maximally reduce the chance of user re-identification and full data reconstruction with a minimal utility budget (i.e., loss). We train the mechanism by quantifying privacy-utility trade-off of mobility datasets in terms of trajectory reconstruction risk, user re-identification risk, and mobility predictability. We report an exploratory analysis that enables the user to assess this trade-off with a specific loss function and its weight parameters. The extensive comparison results on four representative mobility datasets demonstrate the superiority of our proposed architecture in mobility privacy protection and the efficiency of the proposed privacy-preserving features extractor. We show that the privacy of mobility traces attains decent protection at the cost of marginal mobility utility. Our results also show that by exploring the Pareto optimal setting, we can simultaneously increase both privacy (45%) and utility (32%).

Submitted 9 August, 2022; originally announced August 2022.

Comments: 15 pages, PoPETs'23, July 10--14, 2023, Lausanne, Switzerland. arXiv admin note: substantial text overlap with arXiv:2201.07519

arXiv:2207.04500 [pdf, other]

FIB: A Method for Evaluation of Feature Impact Balance in Multi-Dimensional Data

Authors: Xavier F. Cadet, Sara Ahmadi-Abhari, Hamed Haddadi

Abstract: Errors might not have the same consequences depending on the task at hand. Nevertheless, there is limited research investigating the impact of imbalance in the contribution of different features in an error vector. Therefore, we propose the Feature Impact Balance (FIB) score. It measures whether there is a balanced impact of features in the discrepancies between two vectors. We designed the FIB score to lie in [0, 1]. Scores close to 0 indicate that a small number of features contribute to most of the error, and scores close to 1 indicate that most features contribute to the error equally. We experimentally study the FIB on different datasets, using AutoEncoders and Variational AutoEncoders. We show how the feature impact balance varies during training and showcase its usability to support model selection for single output and multi-output tasks.

Submitted 10 July, 2022; originally announced July 2022.

arXiv:2206.04123 [pdf, ps, other]

Nitriding: A tool kit for building scalable, networked, secure enclaves

Authors: Philipp Winter, Ralph Giles, Moritz Schafhuber, Hamed Haddadi

Abstract: Enclave deployments often fail to simultaneously be secure (e.g., resistant to side channel attacks), powerful (i.e., as fast as an off-the-shelf server), and flexible (i.e., unconstrained by development hurdles). In this paper, we present nitriding, an open tool kit that enables the development of enclave applications that satisfy all three properties. We build nitriding on top of the recently-proposed AWS Nitro Enclaves whose architecture prevents side channel attacks by design, making nitriding more secure than comparable frameworks. We abstract away the constrained development model of Nitro Enclaves, making it possible to run unmodified applications inside an enclave that have seamless and secure Internet connectivity, all while making our code user-verifiable. To demonstrate nitriding's flexibility, we design three enclave applications, each a research contribution in its own right: (i) we run a Tor bridge inside an enclave, making it resistant to protocol-level deanonymization attacks; (ii) we build a service for securely revealing infrastructure configuration, empowering users to verify privacy promises like the discarding of IP addresses at the edge; and (iii) we move a Chromium browser into an enclave, thereby isolating its attack surface from the user's system. We find that nitriding enables rapid prototyping and alleviates the deployment of production-quality systems, paving the way toward usable and secure enclaves.

Submitted 29 July, 2023; v1 submitted 8 June, 2022; originally announced June 2022.

arXiv:2205.14026 [pdf, other]

On-Device Voice Authentication with Paralinguistic Privacy

Authors: Ranya Aloufi, Hamed Haddadi, David Boyle

Abstract: Using our voices to access, and interact with, online services raises concerns about the trade-offs between convenience, privacy, and security. The conflict between maintaining privacy and ensuring input authenticity has often been hindered by the need to share raw data, which contains all the paralinguistic information required to infer a variety of sensitive characteristics. Users of voice assistants put their trust in service providers; however, this trust is potentially misplaced considering the emergence of first-party 'honest-but-curious' or 'semi-honest' threats. A further security risk is presented by imposters gaining access to systems by pretending to be the user leveraging replay or 'deepfake' attacks. Our objective is to design and develop a new voice input-based system that offers the following specifications: local authentication to reduce the need for sharing raw voice data, local privacy preservation based on user preferences, allowing more flexibility in integrating such a system given target applications' privacy constraints, and achieving good performance in these targeted applications. The key idea is to locally derive token-based credentials based on unique-identifying attributes obtained from the user's voice and offer selective sensitive information filtering before transmitting raw data. Our system consists of (i) 'VoiceID', boosted with a liveness detection technology to thwart replay attacks; (ii) a flexible privacy filter that allows users to select the level of privacy protection they prefer for their data. The system yields 98.68% accuracy in verifying legitimate users with cross-validation and runs in tens of milliseconds on a CPU and single-core ARM processor without specialized hardware. Our system demonstrates the feasibility of filtering raw voice input closer to users, in accordance with their privacy preferences, while maintaining their authenticity.

Submitted 24 February, 2023; v1 submitted 27 May, 2022; originally announced May 2022.

Comments: 15 pages

arXiv:2203.14088 [pdf]

Distributed data analytics

Authors: Richard Mortier, Hamed Haddadi, Sandra Servia, Liang Wang

Abstract: Machine Learning (ML) techniques have begun to dominate data analytics applications and services. Recommendation systems are a key component of online service providers. The financial industry has adopted ML to harness large volumes of data in areas such as fraud detection, risk-management, and compliance. Deep Learning is the technology behind voice-based personal assistants, etc. Deployment of ML technologies onto cloud computing infrastructures has benefited numerous aspects of our daily life. The advertising and associated online industries in particular have fuelled a rapid rise in the deployment of personal data collection and analytics tools. Traditionally, behavioural analytics relies on collecting vast amounts of data in centralised cloud infrastructure before using it to train machine learning models that allow user behaviour and preferences to be inferred. A contrasting approach, distributed data analytics, where code and models for training and inference are distributed to the places where data is collected, has been boosted by two recent, ongoing developments: increased processing power and memory capacity available in user devices at the edge of the network, such as smartphones and home assistants; and increased sensitivity to the highly intrusive nature of many of these devices and services and the attendant demands for improved privacy. Indeed, the potential for increased privacy is not the only benefit of distributing data analytics to the edges of the network: reducing the movement of large volumes of data can also improve energy efficiency, helping to ameliorate the ever increasing carbon footprint of our digital infrastructure, enabling much lower latency for service interactions than is possible when services are cloud-hosted. These approaches often introduce challenges in privacy, utility, and efficiency trade-offs, while having to ensure fruitful user engagement.

Submitted 26 March, 2022; originally announced March 2022.

Comments: Accepted as Chapter 8 of "Privacy by Design for the Internet of Things: Building accountability and security"

arXiv:2203.03528 [pdf, other]

Blocked or Broken? Automatically Detecting When Privacy Interventions Break Websites

Authors: Michael Smith, Peter Snyder, Moritz Haller, Benjamin Livshits, Deian Stefan, Hamed Haddadi

Abstract: A core problem in the development and maintenance of crowd-sourced filter lists is that their maintainers cannot confidently predict whether (and where) a new filter list rule will break websites. This is a result of the enormity of the Web, which prevents filter list authors from broadly understanding the impact of a new blocking rule before they ship it to millions of users. The inability of filter list authors to evaluate the Web compatibility impact of a new rule before shipping it severely reduces the benefits of filter-list-based content blocking: filter lists are both overly-conservative (i.e. rules are tailored narrowly to reduce the risk of breaking things) and error-prone (i.e. blocking tools still break large numbers of sites). To scale to the size and scope of the Web, filter list authors need an automated system to detect when a new filter rule breaks websites, before that breakage has a chance to make it to end users. In this work, we design and implement the first automated system for predicting when a filter list rule breaks a website. We build a classifier, trained on a dataset generated by a combination of compatibility data from the EasyList project and novel browser instrumentation, and find it is accurate to practical levels (AUC 0.88). Our open source system requires no human interaction when assessing the compatibility risk of a proposed privacy intervention. We also present the 40 page behaviors that most predict breakage in observed websites.

Submitted 2 May, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

arXiv:2202.08174 [pdf, other]

doi 10.1145/3508396.3512877

Towards Battery-Free Machine Learning and Inference in Underwater Environments

Authors: Yuchen Zhao, Sayed Saad Afzal, Waleed Akbar, Osvy Rodriguez, Fan Mo, David Boyle, Fadel Adib, Hamed Haddadi

Abstract: This paper is motivated by a simple question: Can we design and build battery-free devices capable of machine learning and inference in underwater environments? An affirmative answer to this question would have significant implications for a new generation of underwater sensing and monitoring applications for environmental monitoring, scientific exploration, and climate/weather prediction. To answer this question, we explore the feasibility of bridging advances from the past decade in two fields: battery-free networking and low-power machine learning. Our exploration demonstrates that it is indeed possible to enable battery-free inference in underwater environments. We designed a device that can harvest energy from underwater sound, power up an ultra-low-power microcontroller and on-board sensor, perform local inference on sensed measurements using a lightweight Deep Neural Network, and communicate the inference result via backscatter to a receiver. We tested our prototype in an emulated marine bioacoustics application, demonstrating the potential to recognize underwater animal sounds without batteries. Through this exploration, we highlight the challenges and opportunities for making underwater battery-free inference and machine learning ubiquitous.

Submitted 16 February, 2022; originally announced February 2022.

Comments: 6 pages, HotMobile '22, March 9-10, 2022, Tempe, AZ, USA

arXiv:2201.12614 [pdf, other]

BatteryLab: A Collaborative Platform for Power Monitoring

Authors: Matteo Varvello, Kleomenis Katevas, Mihai Plesa, Hamed Haddadi, Fabian Bustamante, Ben Livshits

Abstract: Advances in cloud computing have simplified the way that both software development and testing are performed. This is not true for battery testing for which state of the art test-beds simply consist of one phone attached to a power meter. These test-beds have limited resources, access, and are overall hard to maintain; for these reasons, they often sit idle with no experiment to run. In this paper, we propose to share existing battery testbeds and transform them into vantage points of BatteryLab, a power monitoring platform offering heterogeneous devices and testing conditions. We have achieved this vision with a combination of hardware and software which allow to augment existing battery test-beds with remote capabilities. BatteryLab currently counts three vantage points, one in Europe and two in the US, hosting three Android devices and one iPhone 7. We benchmark BatteryLab with respect to the accuracy of its battery readings, system performance, and platform heterogeneity. Next, we demonstrate how measurements can be run atop of BatteryLab by developing the "Web Power Monitor" (WPM), a tool which can measure website power consumption at scale. We released WPM and used it to report on the energy consumption of Alexa's top 1,000 websites across 3 locations and 4 devices (both Android and iOS).

Submitted 29 January, 2022; originally announced January 2022.

Comments: 25 pages, 11 figures, Passive and Active Measurement Conference 2022 (PAM '22). arXiv admin note: text overlap with arXiv:1910.08951

arXiv:2201.07519 [pdf, other]

Privacy-Aware Human Mobility Prediction via Adversarial Networks

Authors: Yuting Zhan, Alex Kyllo, Afra Mashhadi, Hamed Haddadi

Abstract: As various mobile devices and location-based services are increasingly developed in different smart city scenarios and applications, many unexpected privacy leakages have arisen due to geolocated data collection and sharing. While these geolocated data could provide a rich understanding of human mobility patterns and address various societal research questions, privacy concerns for users' sensitive information have limited their utilization. In this paper, we design and implement a novel LSTM-based adversarial mechanism with representation learning to attain a privacy-preserving feature representation of the original geolocated data (mobility data) for a sharing purpose. We quantify the utility-privacy trade-off of mobility datasets in terms of trajectory reconstruction risk, user re-identification risk, and mobility predictability. Our proposed architecture reports a Pareto Frontier analysis that enables the user to assess this trade-off as a function of Lagrangian loss weight parameters. The extensive comparison results on four representative mobility datasets demonstrate the superiority of our proposed architecture and the efficiency of the proposed privacy-preserving features extractor. Our results show that by exploring Pareto optimal setting, we can simultaneously increase both privacy (45%) and utility (32%).

Submitted 19 January, 2022; originally announced January 2022.

arXiv:2112.06498 [pdf, other]

Proof of Steak

Authors: Jon Crowcroft, Hamed Haddadi, Arthur Gervais, Tristan Henderson

Abstract: We introduce Proof-of-Steak (PoS) as a fundamental net-zero block generation technique, often accompanied by Non-Frangipane Tokens. Genesis cut is gradually heated and minted (using the appropriate sauce), enabling the miners to redirect the extracted gold and the dissipated heat into the furnace, hence enabling the first fully-circular economy ever built using blockchain technology, utilising tamper-evident steak haché. In this paper we present the basic ingredients for building Proof-of-Steak, assessing its global impact, and opportunities to save the world and beyond!

Submitted 13 December, 2021; originally announced December 2021.

Comments: This is a silly article

arXiv:2112.06324 [pdf, other]

Pool-Party: Exploiting Browser Resource Pools as Side-Channels for Web Tracking

Authors: Peter Snyder, Soroush Karami, Arthur Edelstein, Benjamin Livshits, Hamed Haddadi

Abstract: We identify class of covert channels in browsers that are not mitigated by current defenses, which we call "pool-party" attacks. Pool-party attacks allow sites to create covert channels by manipulating limited-but-unpartitioned resource pools. These class of attacks have been known, but in this work we show that they are both more prevalent, more practical for exploitation, and allow exploitation in more ways, than previously identified. These covert channels have sufficient bandwidth to pass cookies and identifiers across site boundaries under practical and real-world conditions. We identify pool-party attacks in all popular browsers, and show they are practical cross-site tracking techniques (i.e., attacks take 0.6s in Chrome and Edge, and 7s in Firefox and Tor Browser). In this paper we make the following contributions: first, we describe pool-party covert channel attacks that exploit limits in application-layer resource pools in browsers. Second, we demonstrate that pool-party attacks are practical, and can be used to track users in all popular browsers; we also share open source implementations of the attack and evaluate them through a representative web crawl. Third, we show that in Gecko based-browsers (including the Tor Browser) pool-party attacks can also be used for cross-profile tracking (e.g., linking user behavior across normal and private browsing sessions). Finally, we discuss possible mitigation strategies and defenses

Submitted 21 March, 2023; v1 submitted 12 December, 2021; originally announced December 2021.

arXiv:2110.13941 [pdf, other]

doi 10.1145/3488659.3493777

Rapid IoT Device Identification at the Edge

Authors: Oliver Thompson, Anna Maria Mandalari, Hamed Haddadi

Abstract: Consumer Internet of Things (IoT) devices are increasingly common in everyday homes, from smart speakers to security cameras. Along with their benefits come potential privacy and security threats. To limit these threats we must implement solutions to filter IoT traffic at the edge. To this end the identification of the IoT device is the first natural step. In this paper we demonstrate a novel method of rapid IoT device identification that uses neural networks trained on device DNS traffic that can be captured from a DNS server on the local network. The method identifies devices by fitting a model to the first seconds of DNS second-level-domain traffic following their first connection. Since security and privacy threat detection often operate at a device specific level, rapid identification allows these strategies to be implemented immediately. Through a total of 51,000 rigorous automated experiments, we classify 30 consumer IoT devices from 27 different manufacturers with 82% and 93% accuracy for product type and device manufacturers respectively.

Submitted 26 October, 2021; originally announced October 2021.

Journal ref: 2nd Workshop on Distributed Machine Learning, co-located with CoNEXT 2021

arXiv:2109.10074 [pdf, other]

doi 10.1145/3548606.3560631

STAR: Secret Sharing for Private Threshold Aggregation Reporting

Authors: Alex Davidson, Peter Snyder, E. B. Quirk, Joseph Genereux, Benjamin Livshits, Hamed Haddadi

Abstract: Threshold aggregation reporting systems promise a practical, privacy-preserving solution for developers to learn how their applications are used "\emph{in-the-wild}". Unfortunately, proposed systems to date prove impractical for wide scale adoption, suffering from a combination of requiring: \emph{i)} prohibitive trust assumptions; \emph{ii)} high computation costs; or \emph{iii)} massive user bases. As a result, adoption of truly-private approaches has been limited to only a small number of enormous (and enormously costly) projects. In this work, we improve the state of private data collection by proposing $\mathsf{STAR}$, a highly efficient, easily deployable system for providing cryptographically-enforced $κ$-anonymity protections on user data collection. The $\mathsf{STAR}$ protocol is easy to implement and cheap to run, all while providing privacy properties similar to, or exceeding the current state-of-the-art. Measurements of our open-source implementation of $\mathsf{STAR}$ find that it is $1773\times$ quicker, requires $62.4\times$ less communication, and is $24\times$ cheaper to run than the existing state-of-the-art.

Submitted 7 September, 2022; v1 submitted 21 September, 2021; originally announced September 2021.

Journal ref: Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (CCS '22), November 7--11, 2022, Los Angeles, CA, USA

arXiv:2109.04833 [pdf, other]

Multimodal Federated Learning on IoT Data

Authors: Yuchen Zhao, Payam Barnaghi, Hamed Haddadi

Abstract: Federated learning is proposed as an alternative to centralized machine learning since its client-server structure provides better privacy protection and scalability in real-world applications. In many applications, such as smart homes with Internet-of-Things (IoT) devices, local data on clients are generated from different modalities such as sensory, visual, and audio data. Existing federated learning systems only work on local data from a single modality, which limits the scalability of the systems. In this paper, we propose a multimodal and semi-supervised federated learning framework that trains autoencoders to extract shared or correlated representations from different local data modalities on clients. In addition, we propose a multimodal FedAvg algorithm to aggregate local autoencoders trained on different data modalities. We use the learned global autoencoder for a downstream classification task with the help of auxiliary labelled data on the server. We empirically evaluate our framework on different modalities including sensory data, depth camera videos, and RGB camera videos. Our experimental results demonstrate that introducing data from multiple modalities into federated learning can improve its classification performance.
In addition, we can use labelled data from only one modality for supervised learning on the server and apply the learned model to testing data from other modalities to achieve decent F1 scores (e.g., with the best performance being higher than 60%), especially when combining contributions from both unimodal clients and multimodal clients.

Submitted 18 February, 2022; v1 submitted 10 September, 2021; originally announced September 2021.

Comments: 12 pages, IoTDI '22, May 3-6, 2022, Milan, Italy

arXiv:2107.10045 [pdf, other]

A Tandem Framework Balancing Privacy and Security for Voice User Interfaces

Authors: Ranya Aloufi, Hamed Haddadi, David Boyle

Abstract: Speech synthesis, voice cloning, and voice conversion techniques present severe privacy and security threats to users of voice user interfaces (VUIs). These techniques transform one or more elements of a speech signal, e.g., identity and emotion, while preserving linguistic information. Adversaries may use advanced transformation tools to trigger a spoofing attack using fraudulent biometrics for a legitimate speaker. Conversely, such techniques have been used to generate privacy-transformed speech by suppressing personally identifiable attributes in the voice signals, achieving anonymization. Prior works have studied the security and privacy vectors in parallel, and thus it raises alarm that if a benign user can achieve privacy by a transformation, it also means that a malicious user can break security by bypassing the anti-spoofing mechanism. In this paper, we take a step towards balancing two seemingly conflicting requirements: security and privacy. It remains unclear what the vulnerabilities in one domain imply for the other, and what dynamic interactions exist between them. A better understanding of these aspects is crucial for assessing and mitigating vulnerabilities inherent with VUIs and building effective defenses.
In this paper, (i) we investigate the applicability of the current voice anonymization methods by deploying a tandem framework that jointly combines anti-spoofing and authentication models, and evaluate the performance of these methods; (ii) examining analytical and empirical evidence, we reveal a duality between the two mechanisms as they offer different ways to achieve the same objective, and we show that leveraging one vector significantly amplifies the effectiveness of the other; (iii) we demonstrate that to effectively defend from potential attacks against VUIs, it is necessary to investigate the attacks from multiple complementary perspectives (security and privacy).

Submitted 21 July, 2021; originally announced July 2021.

Comments: 14 pages, 6 figures. arXiv admin note: text overlap with arXiv:2008.03648, arXiv:2010.13995, arXiv:1911.01601 by other authors

arXiv:2107.07818 [pdf, other]

Revisiting IoT Device Identification

Authors: Roman Kolcun, Diana Andreea Popescu, Vadim Safronov, Poonam Yadav, Anna Maria Mandalari, Richard Mortier, Hamed Haddadi

Abstract: Internet-of-Things (IoT) devices are known to be the source of many security problems, and as such, they would greatly benefit from automated management. This requires robustly identifying devices so that appropriate network security policies can be applied. We address this challenge by exploring how to accurately identify IoT devices based on their network behavior, while leveraging approaches previously proposed by other researchers. We compare the accuracy of four different previously proposed machine learning models (tree-based and neural network-based) for identifying IoT devices. We use packet trace data collected over a period of six months from a large IoT test-bed. We show that, while all models achieve high accuracy when evaluated on the same dataset as they were trained on, their accuracy degrades over time, when evaluated on data collected outside the training set. We show that on average the models' accuracy degrades after a couple of weeks by up to 40 percentage points (on average between 12 and 21 percentage points). We argue that, in order to keep the models' accuracy at a high level, these need to be continuously updated.

Submitted 16 July, 2021; originally announced July 2021.

Comments: To appear in TMA 2021 conference. 9 pages, 6 figures. arXiv admin note: text overlap with arXiv:2011.08605

arXiv:2105.13929 [pdf, other]

Quantifying and Localizing Usable Information Leakage from Neural Network Gradients

Authors: Fan Mo, Anastasia Borovykh, Mohammad Malekzadeh, Soteris Demetriou, Deniz Gündüz, Hamed Haddadi

Abstract: In collaborative learning, clients keep their data private and communicate only the computed gradients of the deep neural network being trained on their local data. Several recent attacks show that one can still extract private information from the shared network's gradients compromising clients' privacy. In this paper, to quantify the private information leakage from gradients we adopt usable information theory. We focus on two types of private information: original information in data reconstruction attacks and latent information in attribute inference attacks. Furthermore, a sensitivity analysis over the gradients is performed to explore the underlying cause of information leakage and validate the results of the proposed framework. Finally, we conduct numerical evaluations on six benchmark datasets and four well-known deep models. We measure the impact of training hyperparameters, e.g., batches and epochs, as well as potential defense mechanisms, e.g., dropout and differential privacy. Our proposed framework enables clients to localize and quantify the private information leakage in a layer-wise manner, and enables a better understanding of the sources of information leakage in collaborative learning, which can be used by future studies to benchmark new attacks and defense mechanisms.

Submitted 25 July, 2022; v1 submitted 28 May, 2021; originally announced May 2021.

Comments: 13 pages

arXiv:2105.05162 [pdf, other]

Blocking without Breaking: Identification and Mitigation of Non-Essential IoT Traffic

Authors: Anna Maria Mandalari, Daniel J. Dubois, Roman Kolcun, Muhammad Talha Paracha, Hamed Haddadi, David Choffnes

Abstract: Despite the prevalence of Internet of Things (IoT) devices, there is little information about the purpose and risks of the Internet traffic these devices generate, and consumers have limited options for controlling those risks. A key open question is whether one can mitigate these risks by automatically blocking some of the Internet connections from IoT devices, without rendering the devices inoperable. In this paper, we address this question by developing a rigorous methodology that relies on automated IoT-device experimentation to reveal which network connections (and the information they expose) are essential, and which are not. We further develop strategies to automatically classify network traffic destinations as either required (i.e., their traffic is essential for devices to work properly) or not, hence allowing firewall rules to block traffic sent to non-required destinations without breaking the functionality of the device. We find that indeed 16 among the 31 devices we tested have at least one blockable non-required destination, with the maximum number of blockable destinations for a device being 11. We further analyze the destination of network traffic and find that all third parties observed in our experiments are blockable, while first and support parties are neither uniformly required or non-required. Finally, we demonstrate the limitations of existing blocklists on IoT traffic, propose a set of guidelines for automatically limiting non-essential IoT traffic, and we develop a prototype system that implements these guidelines.

Submitted 11 May, 2021; originally announced May 2021.

Journal ref: Privacy Enhancing Technologies Symposium (PETS) 2021

arXiv:2105.03941 [pdf, other]

Stronger Privacy for Federated Collaborative Filtering with Implicit Feedback

Authors: Lorenzo Minto, Moritz Haller, Hamed Haddadi, Benjamin Livshits

Abstract: Recommender systems are commonly trained on centrally collected user interaction data like views or clicks. This practice however raises serious privacy concerns regarding the recommender's collection and handling of potentially sensitive data. Several privacy-aware recommender systems have been proposed in recent literature, but comparatively little attention has been given to systems at the intersection of implicit feedback and privacy. To address this shortcoming, we propose a practical federated recommender system for implicit data under user-level local differential privacy (LDP). The privacy-utility trade-off is controlled by parameters $ε$ and $k$, regulating the per-update privacy budget and the number of $ε$-LDP gradient updates sent by each user respectively. To further protect the user's privacy, we introduce a proxy network to reduce the fingerprinting surface by anonymizing and shuffling the reports before forwarding them to the recommender. We empirically demonstrate the effectiveness of our framework on the MovieLens dataset, achieving up to Hit Ratio with K=10 (HR@10) 0.68 on 50k users with 5k items. Even on the full dataset, we show that it is possible to achieve reasonable utility with HR@10>0.5 without compromising user privacy.

Submitted 28 July, 2021; v1 submitted 9 May, 2021; originally announced May 2021.

Comments: Accepted for publication at RecSys 2021

arXiv:2104.14380 [pdf, other]

PPFL: Privacy-preserving Federated Learning with Trusted Execution Environments

Authors: Fan Mo, Hamed Haddadi, Kleomenis Katevas, Eduard Marin, Diego Perino, Nicolas Kourtellis

Abstract: We propose and implement a Privacy-preserving Federated Learning ($PPFL$) framework for mobile systems to limit privacy leakages in federated learning. Leveraging the widespread presence of Trusted Execution Environments (TEEs) in high-end and mobile devices, we utilize TEEs on clients for local training, and on servers for secure aggregation, so that model/gradient updates are hidden from adversaries. Challenged by the limited memory size of current TEEs, we leverage greedy layer-wise training to train each model's layer inside the trusted area until its convergence. The performance evaluation of our implementation shows that $PPFL$ can significantly improve privacy while incurring small system overheads at the client-side. In particular, $PPFL$ can successfully defend the trained model against data reconstruction, property inference, and membership inference attacks. Furthermore, it can achieve comparable model utility with fewer communication rounds (0.54$\times$) and a similar amount of network traffic (1.002$\times$) compared to the standard federated learning of a complete model. This is achieved while only introducing up to ~15% CPU time, ~18% memory usage, and ~21% energy consumption overhead in $PPFL$'s client-side.

Submitted 28 June, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

Comments: 15 pages, 8 figures, accepted to MobiSys 2021

arXiv:2104.00766 [pdf, other]

Configurable Privacy-Preserving Automatic Speech Recognition

Authors: Ranya Aloufi, Hamed Haddadi, David Boyle

Abstract: Voice assistive technologies have given rise to far-reaching privacy and security concerns. In this paper we investigate whether modular automatic speech recognition (ASR) can improve privacy in voice assistive systems by combining independently trained separation, recognition, and discretization modules to design configurable privacy-preserving ASR systems. We evaluate privacy concerns and the effects of applying various state-of-the-art techniques at each stage of the system, and report results using task-specific metrics (i.e. WER, ABX, and accuracy). We show that overlapping speech inputs to ASR systems present further privacy concerns, and how these may be mitigated using speech separation and optimization techniques. Our discretization module is shown to minimize paralinguistics privacy leakage from ASR acoustic models to levels commensurate with random guessing. We show that voice privacy can be configurable, and argue this presents new opportunities for privacy-preserving applications incorporating ASR.

Submitted 1 April, 2021; originally announced April 2021.

Comments: 5 pages, 1 figure

arXiv:2101.00235 [pdf, other]

MoSen: Activity Modelling in Multiple-Occupancy Smart Homes

Authors: Yuting Zhan, Hamed Haddadi

Abstract: Smart home solutions increasingly rely on a variety of sensors for behavioral analytics and activity recognition to provide context-aware applications and personalized care. Optimizing the sensor network is one of the most important approaches to ensure classification accuracy and the system's efficiency. However, the trade-off between the cost and performance is often a challenge in real deployments, particularly for multiple-occupancy smart homes or care homes. In this paper, using real indoor activity and mobility traces, floor plans, and synthetic multi-occupancy behavior models, we evaluate several multi-occupancy household scenarios with 2-5 residents. We explore and quantify the trade-offs between the cost of sensor deployments and expected labeling accuracy in different scenarios. Our evaluation across different scenarios show that the performance of the desired context-aware task is affected by different localization resolutions, the number of residents, the number of sensors, and varying sensor deployments. To aid in accelerating the adoption of practical sensor-based activity recognition technology, we design MoSen, a framework to simulate the interaction dynamics between sensor-based environments and multiple residents. By evaluating the factors that affect the performance of the desired sensor network, we provide a sensor selection strategy and design metrics for sensor layout in real environments.
Using our selection strategy in a 5-person scenario case study, we demonstrate that MoSen can significantly improve overall system performance without increasing the deployment costs.

Submitted 1 January, 2021; originally announced January 2021.

arXiv:2011.08605 [pdf, other]

The Case for Retraining of ML Models for IoT Device Identification at the Edge

Authors: Roman Kolcun, Diana Andreea Popescu, Vadim Safronov, Poonam Yadav, Anna Maria Mandalari, Yiming Xie, Richard Mortier, Hamed Haddadi

Abstract: Internet-of-Things (IoT) devices are known to be the source of many security problems, and as such they would greatly benefit from automated management. This requires robustly identifying devices so that appropriate network security policies can be applied. We address this challenge by exploring how to accurately identify IoT devices based on their network behavior, using resources available at the edge of the network. In this paper, we compare the accuracy of five different machine learning models (tree-based and neural network-based) for identifying IoT devices by using packet trace data from a large IoT test-bed, showing that all models need to be updated over time to avoid significant degradation in accuracy. In order to effectively update the models, we find that it is necessary to use data gathered from the deployment environment, e.g., the household. We therefore evaluate our approach using hardware resources and data sources representative of those that would be available at the edge of the network, such as in an IoT deployment. We show that updating neural network-based models at the edge is feasible, as they require low computational and memory resources and their structure is amenable to being updated. Our results show that it is possible to achieve device identification and categorization with over 80% and 90% accuracy respectively at the edge.

Submitted 17 November, 2020; originally announced November 2020.

Comments: 13 pages, 8 figures, 4 tables

Showing 1–50 of 107 results for author: Haddadi, H