7th Computing Systems Research Day - 9 January 2024

Schedule

  • 11:45-12:00 | Welcome

  • Abstract

    Cloud systems are experiencing significant shifts both in their hardware, with an increased adoption of heterogeneity, and their software, with the prevalence of microservices and serverless frameworks. These trends require fundamentally rethinking how the cloud system stack should be designed. In this talk, I will briefly describe the challenges these hardware and software trends introduce, and discuss under what conditions hardware acceleration can be beneficial to these new application classes, as well as how applying machine learning (ML) to systems problems can improve the cloud’s performance, efficiency, and ease of use. I will first present Sage, a performance debugging system that leverages ML to identify and resolve the root causes of performance issues in cloud microservices. I will then discuss Ursa, an analytically-driven cluster manager for microservices that addresses some of the shortcomings of applying ML to large-scale systems problems.
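
    For flavor, here is a minimal, hypothetical Python sketch of the general idea behind ML-assisted root-cause analysis: rank services by how strongly their latency correlates with end-to-end SLO violations. The service names and synthetic data are invented for illustration and do not reflect Sage's or Ursa's actual models.

      import numpy as np

      rng = np.random.default_rng(0)
      services = ["frontend", "cart", "payment", "catalog"]

      # Synthetic per-request latencies (ms); the "payment" service is the noisiest.
      lat = rng.normal(loc=[5.0, 8.0, 20.0, 6.0], scale=[1.0, 2.0, 6.0, 1.0], size=(1000, 4))
      end_to_end = lat.sum(axis=1)
      slo_violated = (end_to_end > np.percentile(end_to_end, 95)).astype(float)

      # Rank services by how strongly their latency correlates with SLO violations.
      scores = {s: abs(np.corrcoef(lat[:, i], slo_violated)[0, 1])
                for i, s in enumerate(services)}
      for svc, score in sorted(scores.items(), key=lambda kv: -kv[1]):
          print(f"{svc:10s} suspicion score = {score:.2f}")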

    Bio

    Christina Delimitrou is an Associate Professor at MIT, where she works on computer architecture and computer systems. She focuses on improving the performance, predictability, and resource efficiency of large-scale cloud infrastructures by revisiting the way they are designed and managed. Christina is the recipient of the 2020 TCCA Young Computer Architect Award, an Intel Rising Star Award, a Microsoft Research Faculty Fellowship, an NSF CAREER Award, a Sloan Research Fellowship, two Google Faculty Research Awards, and a Facebook Faculty Research Award. Her work has also received 5 IEEE Micro Top Picks awards and several best paper awards. Before joining MIT, Christina was an Assistant Professor at Cornell University. She received her PhD and MS from Stanford University, and a diploma in Electrical and Computer Engineering from the National Technical University of Athens. More information can be found at: http://people.csail.mit.edu/delimitrou/

  • 13:00-13:45 | Lunch Break

  • Abstract

    Datacenters have witnessed a staggering evolution in networking technologies, driven by insatiable application demands for larger datasets and inter-server data transfers. Modern NICs can already handle 100s of Gbps of traffic, a bandwidth capability equivalent to several memory channels. Direct Cache Access mechanisms like DDIO that contain network traffic inside the CPU’s caches are therefore essential to effectively handle growing network traffic rates. However, at high rates, a large fraction of network traffic “leaks” from the CPU’s caches to memory, a problem often referred to as “leaky DMA”, significantly capping the network bandwidth a server can effectively utilize. This talk will present an analysis of network data leaks in the era of high-speed networking and our insights around the interactions between network buffers and the cache and memory hierarchy. We will present Sweeper, our proposed hardware extension and API that allows applications to efficiently manage the coherence state of network buffers in the cache-memory hierarchy, drastically reducing memory bandwidth consumption and boosting a server’s peak sustainable network bandwidth by up to 2.6×.
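
    As a rough illustration of the idea (not Sweeper's real interface, which is a hardware extension with its own API), the toy Python model below shows how promptly marking received buffers as consumed avoids cache evictions that would otherwise write network data back to DRAM. All names and parameters here are invented.

      class NetworkBufferPool:
          """Toy model: received buffers occupy a small 'cache' until consumed."""

          def __init__(self, cache_capacity=4):
              self.cache_capacity = cache_capacity
              self.in_cache = set()          # buffer ids resident in the LLC
              self.writebacks = 0            # evictions that spill data to DRAM

          def receive(self, buf_id):
              if len(self.in_cache) >= self.cache_capacity:
                  self.in_cache.pop()        # evict an old buffer: it "leaks" to DRAM
                  self.writebacks += 1
              self.in_cache.add(buf_id)

          def mark_consumed(self, buf_id):
              # "Sweep" the buffer: drop it from the cache without a DRAM writeback.
              self.in_cache.discard(buf_id)

      leaky = NetworkBufferPool()
      for pkt in range(10):
          leaky.receive(pkt)                 # buffers pile up and spill to memory
      print("writebacks without sweeping:", leaky.writebacks)   # 6

      swept = NetworkBufferPool()
      for pkt in range(10):
          swept.receive(pkt)
          swept.mark_consumed(pkt)           # application signals the payload is done
      print("writebacks with sweeping:   ", swept.writebacks)   # 0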

    Bio

    Marina is a 5th-year PhD student in the School of Computer Science at Georgia Tech, advised by Assistant Professor Alexandros Daglis. Her research focuses on designing new interfaces between hardware, network stacks, and applications to unlock the performance potential of emerging datacenter technologies. Her work on hardware-software co-design for emerging network and memory technologies has been published at MICRO 2021 and MICRO 2022.

  • Abstract

    Distributed transaction processing is a fundamental building block for large-scale data management in the cloud. Given the threats of security violations in untrusted cloud environments, our work addresses the question: how can we design a distributed transactional KV store that achieves high-performance serializable transactions while providing strong security properties? We introduce TREATY, a secure distributed transactional KV storage system that supports serializable ACID transactions while guaranteeing strong security properties: confidentiality, integrity, and freshness. TREATY leverages trusted execution environments (TEEs) to bootstrap its security properties, but it extends the trust provided by the limited enclave (volatile) memory region within a single node to build a secure (stateful) distributed transactional KV store over the untrusted storage, network, and machines. To achieve this, TREATY embodies a secure two-phase commit protocol co-designed with a high-performance network library for TEEs. Further, TREATY ensures secure and crash-consistent persistency of committed transactions using a stabilization protocol. Our evaluation on a real hardware testbed based on the YCSB and TPC-C benchmarks shows that TREATY incurs reasonable overheads, while achieving strong security properties.
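
    For context, the Python sketch below shows a plain two-phase commit exchange with MAC-protected prepare messages. It is an illustration only: it omits TREATY's actual enclave attestation, encryption, and stabilization machinery, and the key and message formats are invented.

      import hashlib
      import hmac

      KEY = b"enclave-shared-key"    # stand-in for keys established via TEE attestation

      def mac(msg: bytes) -> bytes:
          return hmac.new(KEY, msg, hashlib.sha256).digest()

      class Participant:
          def __init__(self, name):
              self.name, self.staged = name, None

          def prepare(self, txn, tag):
              if not hmac.compare_digest(tag, mac(txn)):   # reject tampered requests
                  return "VOTE_ABORT"
              self.staged = txn                            # stage the write set
              return "VOTE_COMMIT"

          def commit(self):
              print(f"{self.name}: committed {self.staged!r}")

      def coordinator(txn, participants):
          tag = mac(txn)
          votes = [p.prepare(txn, tag) for p in participants]      # phase 1: prepare
          decision = "COMMIT" if all(v == "VOTE_COMMIT" for v in votes) else "ABORT"
          if decision == "COMMIT":
              for p in participants:
                  p.commit()                                       # phase 2: commit
          return decision

      print(coordinator(b"PUT key=value", [Participant("node1"), Participant("node2")]))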

    Bio

    Dimitra Giantsidi is a final-year PhD student at the University of Edinburgh (UoE), a member of the Institute for Computing Systems Architecture (ICSA) and the Chair of Distributed and Operating Systems, advised by Prof. Pramod Bhatotia. Her research lies in the field of dependability in distributed systems, with a focus on fault tolerance and security. Exploring applications of modern hardware, such as Trusted Execution Environments and direct I/O for networking and storage, her work aims to increase the security and performance of widely adopted distributed systems. Before joining ICSA, Dimitra graduated from the School of Electrical and Computer Engineering, NTUA, Greece.

  • Abstract

    GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN) applications. However, DNN applications often underutilize GPUs, even when using large batch sizes and eliminating input data processing or communication stalls. DNN workloads consist of data-dependent operators, with different compute and memory requirements. While an operator may saturate GPU compute units or memory bandwidth, it often leaves other GPU resources idle. Despite the prevalence of GPU sharing techniques, current approaches are not sufficiently fine-grained or interference-aware to maximize GPU utilization while minimizing interference at the granularity of 10s of 𝜇s. We present Orion, a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU. Orion schedules work on the GPU at the granularity of individual operators and minimizes interference by taking into account each operator’s compute and memory requirements. We integrate Orion in PyTorch and demonstrate its benefits in various DNN workload collocation use cases.
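
    The toy Python sketch below illustrates the scheduling intuition only: interleave operators from two clients so that a compute-bound operator runs alongside a memory-bound one. The operator names and classifications are invented; Orion itself intercepts real CUDA kernel launches rather than Python-level lists.

      from collections import deque

      # (operator, dominant resource) pairs for a high-priority and a best-effort client
      hi_priority = deque([("conv1", "compute"), ("bn1", "memory"), ("conv2", "compute")])
      best_effort = deque([("embed", "memory"), ("gemm", "compute"), ("softmax", "memory")])

      schedule = []
      while hi_priority:
          op, kind = hi_priority.popleft()
          schedule.append(("client-hi", op))
          # Launch a best-effort operator only if it stresses the *other* resource.
          for i, (be_op, be_kind) in enumerate(best_effort):
              if be_kind != kind:
                  schedule.append(("client-be", be_op))
                  del best_effort[i]
                  break

      for client, op in schedule:
          print(f"{client:10s} -> {op}")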

    Bio

    Foteini Strati is a 3rd-year PhD student at the Systems Group of ETH Zurich, working on systems for Machine Learning. She is interested in increasing resource utilization and fault tolerance of Machine Learning workloads. She obtained an MSc degree in Computer Science from ETH Zurich and a Diploma in Electrical and Computer Engineering from NTUA.

  • Abstract

    Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial for achieving optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded to GPUs, necessitating the use of optimized libraries to ensure good performance. We demonstrate that current multi-GPU BLAS libraries target very specific problems and data characteristics, resulting in serious performance degradation for any slightly deviating workload, and do not take energy efficiency into account at all. To address these issues, we propose a model-based approach: using performance estimation to provide problem-specific autotuning at runtime. We integrate this autotuning into the PARALiA framework, coupled with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA on an HPC testbed with 8 NVIDIA V100 GPUs, improving the average performance of GEMM by 1.7× and energy efficiency by 2.5× over the state-of-the-art on a large and diverse dataset, and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.
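
    The Python sketch below conveys the model-based autotuning idea in miniature: estimate execution time from a simple compute-versus-transfer model and pick the GPU count that minimizes it. The peak-throughput and bandwidth constants are invented placeholders, not PARALiA's actual performance model.

      def estimate_seconds(n, gpus, flops_per_gpu=7e12, pcie_bw=12e9):
          flops = 2 * n ** 3                     # GEMM floating-point operations
          bytes_moved = 3 * n * n * 8 * gpus     # naive FP64 tile transfers per GPU
          return max(flops / (gpus * flops_per_gpu), bytes_moved / pcie_bw)

      def autotune_gpu_count(n, max_gpus=8):
          # Pick the GPU count whose estimated runtime is lowest.
          return min(range(1, max_gpus + 1), key=lambda g: estimate_seconds(n, g))

      for n in (2_000, 8_000, 32_000):
          best = autotune_gpu_count(n)
          print(f"N={n:>6}: use {best} GPU(s), est. {estimate_seconds(n, best) * 1e3:.1f} ms")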

    Bio

    I am a final-year PhD candidate at cslab, NTUA, having graduated from its Electrical and Computer Engineering (ECE) department with an integrated bachelor's and master's degree, specializing in computer engineering. My PhD explores the optimization of linear algebra routines on multi-GPU clusters with model-based autotuning, and my research interests include accelerators, parallel processing, HPC, and performance engineering.

  • 15:15-15:45 | Coffee Break

  • Abstract

    Fully Homomorphic Encryption (FHE) enables computing directly on encrypted data, letting clients securely offload computation to untrusted servers. While enticing, FHE suffers from two key challenges. First, it incurs very high overheads: it is about 10,000x slower than native, unencrypted computation on a CPU. Second, FHE is extremely hard to program: translating even simple applications like neural networks takes months of tedious work by FHE experts. In this talk, I will describe a hardware and software stack that tackles these challenges and enables the widespread adoption of FHE. First, I will give a systems-level introduction to FHE, describing its programming interface, key characteristics, and performance tradeoffs while abstracting away its complex, cryptography-heavy implementation details. Then, I will introduce a programmable hardware architecture that accelerates FHE programs by 5,000x vs. a CPU with similar area and power, erasing most of the overheads of FHE. Finally, I will introduce a new compiler that abstracts away the details of FHE. This compiler exposes a simple, numpy-like tensor programming interface, and produces FHE programs that match or outperform painstakingly optimized manual versions. Together, these techniques make FHE fast and easy to use across many domains, including deep learning, tensor algebra, and other learning and analytic tasks.
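
    To make the programming-interface claim concrete, the snippet below shows the kind of numpy-style tensor program such a compiler could take as input. It is plain NumPy, written for illustration, and does not use the compiler's actual API (which is not named in the abstract).

      import numpy as np

      def mlp(x, w1, b1, w2, b2):
          # FHE-friendly models favor polynomial activations (e.g., squaring)
          # over non-polynomial ones such as ReLU.
          h = (x @ w1 + b1) ** 2
          return h @ w2 + b2

      rng = np.random.default_rng(0)
      x = rng.normal(size=(1, 16))
      w1, b1 = rng.normal(size=(16, 32)), rng.normal(size=32)
      w2, b2 = rng.normal(size=(32, 10)), rng.normal(size=10)
      print(mlp(x, w1, b1, w2, b2).shape)        # -> (1, 10)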

    Bio

    Daniel Sanchez is a Professor of Electrical Engineering and Computer Science at MIT. His research interests include scalable memory hierarchies, architectural support for parallelization, and accelerators for sparse computations and secure computing. He earned a Ph.D. in Electrical Engineering from Stanford University in 2012 and received the NSF CAREER award in 2015.

  • 16:45-17:00 | Closing Remarks

Venue

Ceremonial Hall of the Central Administration Building (Zografou Campus)