subscribe to arXiv mailings

GuaranTEE: Towards Attestable and Private ML with CCA

Authors: Sandra Siby, Sina Abdollahi, Mohammad Maheri, Marios Kogias, Hamed Haddadi

Abstract: Machine-learning (ML) models are increasingly being deployed on edge devices to provide a variety of services. However, their deployment is accompanied by challenges in model privacy and auditability. Model providers want to ensure that (i) their proprietary models are not exposed to third parties; and (ii) be able to get attestations that their genuine models are operating on edge devices in acco… ▽ More Machine-learning (ML) models are increasingly being deployed on edge devices to provide a variety of services. However, their deployment is accompanied by challenges in model privacy and auditability. Model providers want to ensure that (i) their proprietary models are not exposed to third parties; and (ii) be able to get attestations that their genuine models are operating on edge devices in accordance with the service agreement with the user. Existing measures to address these challenges have been hindered by issues such as high overheads and limited capability (processing/secure memory) on edge devices. In this work, we propose GuaranTEE, a framework to provide attestable private machine learning on the edge. GuaranTEE uses Confidential Computing Architecture (CCA), Arm's latest architectural extension that allows for the creation and deployment of dynamic Trusted Execution Environments (TEEs) within which models can be executed. We evaluate CCA's feasibility to deploy ML models by developing, evaluating, and openly releasing a prototype. We also suggest improvements to CCA to facilitate its use in protecting the entire ML deployment pipeline on edge devices. △ Less

Submitted 29 March, 2024; originally announced April 2024.

Comments: Accepted at the 4th Workshop on Machine Learning and Systems (EuroMLSys '24)

arXiv:2401.14278 [pdf, other]

CHIRON: Accelerating Node Synchronization without Security Trade-offs in Distributed Ledgers

Authors: Ray Neiheiser, Arman Babaei, Giannis Alexopoulos, Marios Kogias, Eleftherios Kokoris Kogias

Abstract: Blockchain performance has historically faced challenges posed by the throughput limitations of consensus algorithms. Recent breakthroughs in research have successfully alleviated these constraints by introducing a modular architecture that decouples consensus from execution. The move toward independent optimization of the consensus layer has shifted attention to the execution layer. While concu… ▽ More Blockchain performance has historically faced challenges posed by the throughput limitations of consensus algorithms. Recent breakthroughs in research have successfully alleviated these constraints by introducing a modular architecture that decouples consensus from execution. The move toward independent optimization of the consensus layer has shifted attention to the execution layer. While concurrent transaction execution is a promising solution for increasing throughput, practical challenges persist. Its effectiveness varies based on the workloads, and the associated increased hardware requirements raise concerns about undesirable centralization. This increased requirement results in full nodes and stragglers synchronizing from signed checkpoints, decreasing the trustless nature of blockchain systems. In response to these challenges, this paper introduces Chiron, a system designed to extract execution hints for the acceleration of straggling and full nodes. Notably, Chiron achieves this without compromising the security of the system or introducing overhead on the critical path of consensus. Evaluation results demonstrate a notable speedup of up to 30%, effectively addressing the gap between theoretical research and practical deployment. The quantification of this speedup is achieved through realistic blockchain benchmarks derived from a comprehensive analysis of Ethereum and Solana workloads, constituting an independent contribution. △ Less

Submitted 31 January, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

arXiv:2312.15403 [pdf, other]

SIRD: A Sender-Informed, Receiver-Driven Datacenter Transport Protocol

Authors: Konstantinos Prasopoulos, Edouard Bugnion, Marios Kogias

Abstract: Datacenter congestion management protocols must navigate the throughput-latency buffering trade-off in the presence of growing constraints due to switching hardware trends, oversubscribed topologies, and varying network configurability and features. In this context, receiver-driven protocols, which schedule packet transmissions instead of reacting to congestion, have shown great promise and work e… ▽ More Datacenter congestion management protocols must navigate the throughput-latency buffering trade-off in the presence of growing constraints due to switching hardware trends, oversubscribed topologies, and varying network configurability and features. In this context, receiver-driven protocols, which schedule packet transmissions instead of reacting to congestion, have shown great promise and work exceptionally well when the bottleneck lies at the ToR-to-receiver link. However, independent receiver schedules may collide if a shared link is the bottleneck instead. We present SIRD, a receiver-driven congestion control protocol designed around the simple insight that single-owner links should be scheduled while shared links should be managed through traditional congestion control algorithms. The approach achieves the best of both worlds by allowing precise control of the most common bottleneck and robust bandwidth sharing for shared bottlenecks. SIRD is implemented by end hosts and does not depend on Ethernet priorities or extensive network configuration. We compare SIRD to state-of-the-art receiver-driven protocols (Homa, dcPIM, and ExpressPass) and production-grade reactive protocols (Swift and DCTCP) and show that SIRD is the only one that can consistently maximize link utilization, minimize queuing, and obtain near-optimal latency across a wide set of workloads and traffic patterns. SIRD causes 12x less peak buffering than Homa and achieves competitive latency and utilization without requiring Ethernet priorities. Unlike dcPIM, SIRD operates without latency-inducing message exchange rounds and outperforms it in utilization, buffering, and tail latency by 9%, 43%, and 46% respectively. Finally, SIRD achieves 10x lower tail latency and 26% higher utilization than ExpressPass. △ Less

Submitted 11 July, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

arXiv:2309.14821 [pdf, other]

Expedited Data Transfers for Serverless Clouds

Authors: Dmitrii Ustiugov, Shyam Jesalpura, Mert Bora Alper, Michal Baczun, Rustem Feyzkhanov, Edouard Bugnion, Boris Grot, Marios Kogias

Abstract: Serverless computing has emerged as a popular cloud deployment paradigm. In serverless, the developers implement their application as a set of chained functions that form a workflow in which functions invoke each other. The cloud providers are responsible for automatically scaling the number of instances for each function on demand and forwarding the requests in a workflow to the appropriate funct… ▽ More Serverless computing has emerged as a popular cloud deployment paradigm. In serverless, the developers implement their application as a set of chained functions that form a workflow in which functions invoke each other. The cloud providers are responsible for automatically scaling the number of instances for each function on demand and forwarding the requests in a workflow to the appropriate function instance. Problematically, today's serverless clouds lack efficient support for cross-function data transfers in a workflow, preventing the efficient execution of data-intensive serverless applications. In production clouds, functions transmit intermediate, i.e., ephemeral, data to other functions either as part of invocation HTTP requests (i.e., inline) or via third-party services, such as AWS S3 storage or AWS ElastiCache in-memory cache. The former approach is restricted to small transfer sizes, while the latter supports arbitrary transfers but suffers from performance and cost overheads. This work introduces Expedited Data Transfers (XDT), an API-preserving high-performance data communication method for serverless that enables direct function-to-function transfers. With XDT, a trusted component of the sender function buffers the payload in its memory and sends a secure reference to the receiver, which is picked by the load balancer and autoscaler based on the current load. Using the reference, the receiver instance pulls the transmitted data directly from the sender's memory. XDT is natively compatible with existing autoscaling infrastructure, preserves function invocation semantics, is secure, and avoids the cost and performance overheads of using an intermediate service for data transfers. We prototype our system in vHive/Knative deployed on a cluster of AWS EC2 nodes, showing that XDT improves latency, bandwidth, and cost over AWS S3 and ElasticCache. △ Less

Submitted 26 September, 2023; originally announced September 2023.

Comments: latest version

MSC Class: 68 ACM Class: D.4.4

arXiv:2101.09355 [pdf, other]

doi 10.1145/3445814.3446714

Benchmarking, Analysis, and Optimization of Serverless Function Snapshots

Authors: Dmitrii Ustiugov, Plamen Petrov, Marios Kogias, Edouard Bugnion, Boris Grot

Abstract: Serverless computing has seen rapid adoption due to its high scalability and flexible, pay-as-you-go billing model. In serverless, developers structure their services as a collection of functions, sporadically invoked by various events like clicks. High inter-arrival time variability of function invocations motivates the providers to start new function instances upon each invocation, leading to si… ▽ More Serverless computing has seen rapid adoption due to its high scalability and flexible, pay-as-you-go billing model. In serverless, developers structure their services as a collection of functions, sporadically invoked by various events like clicks. High inter-arrival time variability of function invocations motivates the providers to start new function instances upon each invocation, leading to significant cold-start delays that degrade user experience. To reduce cold-start latency, the industry has turned to snapshotting, whereby an image of a fully-booted function is stored on disk, enabling a faster invocation compared to booting a function from scratch. This work introduces vHive, an open-source framework for serverless experimentation with the goal of enabling researchers to study and innovate across the entire serverless stack. Using vHive, we characterize a state-of-the-art snapshot-based serverless infrastructure, based on industry-leading Containerd orchestration framework and Firecracker hypervisor technologies. We find that the execution time of a function started from a snapshot is 95% higher, on average, than when the same function is memory-resident. We show that the high latency is attributable to frequent page faults as the function's state is brought from disk into guest memory one page at a time. Our analysis further reveals that functions access the same stable working set of pages across different invocations of the same function. By leveraging this insight, we build REAP, a light-weight software mechanism for serverless hosts that records functions' stable working set of guest memory pages and proactively prefetches it from disk into memory. Compared to baseline snapshotting, REAP slashes the cold-start delays by 3.7x, on average. △ Less

Submitted 5 February, 2021; v1 submitted 15 January, 2021; originally announced January 2021.

Comments: To appear in ASPLOS 2021

arXiv:1404.2946 [pdf]

Algorithms for Packet Routing in Switching Networks with Reconfiguration Overhead

Authors: Timotheos Aslanidis, Marios-Evangelos Kogias

Abstract: Given a set of messages to be transmitted in packages from a set of sending stations to a set of receiving stations, we are required to schedule the packages so as to achieve the minimum possible time from the moment the 1st transmission initiates to the concluding of the last. Preempting packets in order to reroute message remains, as part of some other packet to be transmitted at a later time wo… ▽ More Given a set of messages to be transmitted in packages from a set of sending stations to a set of receiving stations, we are required to schedule the packages so as to achieve the minimum possible time from the moment the 1st transmission initiates to the concluding of the last. Preempting packets in order to reroute message remains, as part of some other packet to be transmitted at a later time would be a great means to achieve our goal, if not for the fact that each preemption will come with a reconfiguration cost that will delay our entire effort. The problem has been extensively studied in the past and various algorithms have been proposed to handle many variations of the problem. In this paper we propose an improved algorithm that we call the Split-Graph Algorithm (SGA). To establish its efficiency we compare it, to two of the algorithms developed in the past. These two are the best presented in bibliography so far, one in terms of approximation ratio and one in terms of experimental results. △ Less

Submitted 10 April, 2014; originally announced April 2014.

Comments: 7 pages, 2 figures, CCNET 2014 Conference

Showing 1–6 of 6 results for author: Kogias, M