On September 7th, 8th and 9th, we held the IEEE Services 2021 – IEEE International Symposium on Cloud HPC (CloudHPC) (computer.org) as part of IEEE CLOUD 2021 (computer.org). The program consisted of a plenary panel and 8 sessions with three speakers in each. Each session was recorded, and in this note we describe four of these sessions. The videos for two sessions are still unavailable, but we will update this document when we can access them. The other organizers of this event were Rong N Chang, Geoffrey Fox, James Sexton, Christoph Hagleitner, Bruce D’Amora, Ian Foster and Andrew Lumsdaine. In the paragraphs below we provide links to the videos and brief thumbnail abstracts to introduce the talks.
The first session was a plenary panel that was part of the IEEE Cloud 2021 conference.
EXPLORING THE GROWING SYNERGY BETWEEN CLOUD AND HIGH-PERFORMANCE COMPUTING
Katherine Yelick, UC Berkeley and Lawrence Berkeley National Laboratory
Ian Foster, Argonne National Laboratory, University of Chicago
Geoffrey Fox, University of Virginia
Kate Keahey, Argonne National Laboratory, University of Chicago
This group of panelists has been involved in every aspect of HPC and HPC in the cloud. Kathy Yelick has tons of experience, ranging from managing an HPC center to designing programming languages for parallel computing and, now, being a partner in running CloudBank, a major NSF program to help with cloud access. Ian Foster was a pioneer who conceived of the Grid, a network of supercomputer services which, in many ways, foreshadowed the Cloud concept, and he continues to lead research in HPC and the cloud. Geoffrey Fox has been in the HPC business from the early days of MPI, and he is now doing groundbreaking work using cloud native middleware with MPI to design new deep learning paradigms for science. Kate Keahey is a leader in high performance distributed computing. She designed one of the first open-source Cloud platforms and now runs a major cloud devoted to academic research for NSF.
The link to the panel discussion is SERVICES Plenary Panel 1: Cloud HPC – Exploring Growing Synergy btw Cloud & HPC – YouTube.
The overriding theme of the discussion was an exploration of how cloud HPC is, and will be, different from traditional supercomputer-based HPC, and where exploiting the strengths of both can advance research and science.
Ian started the discussion by making the case that there were some fundamental differences between the cloud and traditional HPC systems. The most significant of these was that the cloud made it possible to deliver an ecosystem of on-demand services. Among these services are tools to interactively explore large data collections and services that enable users to easily build machine learning solutions for difficult data analysis problems. HPC, on the other hand, is about providing the most powerful computing platform for very large scientific applications. Kathy observed that what the cloud lacks in high-end power it makes up for in resilience. This resilience is critical for teaching, such as when student assignments require just-in-time access to resources. Kate made the additional point that there are classes of HPC problems where the interactivity that clouds provide can be very important for exploring new areas where existing HPC application codes are not yet available. Geoffrey made the point that access to HPC systems capable of doing some of the most advanced neural net training experiments is out of reach of most academics. That leaves clouds, but a national funding model for research is clearly needed. Kathy noted that the new NSF CloudBank provides a way for universities to access commercial cloud platforms with NSF and other funding. Ian argued that the future challenge is to figure out how to combine the capabilities of both supercomputer centers and clouds for the next generation of scientific endeavors. The panel also addressed a number of important issues including economics, advances in technology and scalability. It was an extremely lively and interesting discussion. Check it out.
Session 2. HPC in Biology & Medicine in the Cloud (Chair: Dennis Gannon)
There are three talks in this session.
- Computational Biology at the Exascale
Katherine Yelick, UC Berkeley and Lawrence Berkeley National Laboratory
- HySec-Flow: Privacy-Preserving Genomic Computing with SGX-based Big-Data Analytics Framework
Judy Fox, Professor, University of Virginia
- An automated self-service multi-cloud HPC platform applied to the simulation of cardiac valve disease with machine learning
Wolfgang Gentzsch, UberCloud, Founder & President
Unfortunately, the recording of this session on YouTube is flawed (my fault). Consequently, I have collected the videos together elsewhere. A summary of the session and links to the individual talks are here: Talks from the first IEEE Symposium on Cloud & HPC | The eScience Cloud (esciencegroup.com)
Session 3. Using HPC to Enable AI at Scale (Chair: Dennis Gannon)
Session 3 included three talks that addressed how the private sector was approaching Cloud HPC to build new tools to solve critical global problems. The session recording is here: CloudHPC Symposium – Session 3 – YouTube. The talks were
- Grand Challenges for Humanity: Cloud Scale Impact and Opportunities
Debra Goldfarb, Amazon, Director HPC Products & Strategy
- Enabling AI at scale on Azure
Prabhat Ram, Microsoft, Azure HPC
- Benchmarking for AI for Science in the Cloud: Challenges and Opportunities
Jeyan Thiyagalingam, STFC, UK, Head of SciML Group
Debra Goldfarb passionately addressed the importance of bringing HPC and the cloud to bear on many important global challenges. Chief among these are the health crises brought on by viral pandemics. She went on to describe work done in collaboration with the research community on providing the computation and data resources needed. Of particular importance is the fact that the cloud can deliver scale that goes beyond single data centers. One of the exciting outcomes was the work AWS did with the Covid-19 HPC consortium to provide the resources to Moderna to deliver their mRNA vaccine in an incredibly short time. Debra also discussed the work on building the national strategic computing reserve that will play an important role in solving future problems.
Prabhat Ram, who recently moved from the DOE’s LBNL facility to Microsoft to build a team to address AI and HPC on Azure, provided an in-depth overview of the current state of Azure HPC Cloud capabilities. He pointed out that for scientific computing problems of modest scale, it is easy to get good speed-up on batch jobs with up to 128 cores. Focusing on AI workloads, Prabhat described their impressive performance on the HPL-AI benchmark. Microsoft now has a deep collaboration with OpenAI and is building on the GPT-3 results from a few years ago. The results now demonstrate that the cloud can be classified as leadership-class computing.
Jeyan Thiyagalingam addressed the problem of designing benchmarks for AI and machine learning for science applications. This is important for many reasons. We have hundreds of different ML models, and the applications and platforms differ widely. Having a reasonable set of benchmarks will enable us to understand the applications better and provide a guide to tell us when our solutions are improving. Developing these benchmarks is not easy. One of the big problems is the size of realistic data sets for realistic applications. If a data set requires 100 terabytes, how do we make it available? The answer, it would seem, is the cloud. However, Thiyagalingam observes that there are still many challenges to making that work in a practical way. He makes a very strong case that the science community needs help from the cloud providers and other institutions to make this possible.
Session 4. Applications of Cloud Native Technology to HPC in the Cloud (Chair: Christoph Hagleitner)
Session 4 focused more on the Cloud software tools that are important for science. There were three talks and the recorded video is here: CloudHPC Symposium – Session 4 – YouTube.
- Serverless Supercomputing: High Performance Function as a Service
Kyle Chard, Professor, University of Chicago
- Minding the Gap: Navigating the transition from traditional HPC to cloud native development
Bruce D’Amora, IBM Research
- Composable Systems: An Early Application Experience
Ilkay Altintas, SDSC, Chief Data Science Officer
Kyle’s talk was about how the function-as-a-service (FaaS) model can be applied to large scale science. In the talk he points out that the cloud computing industry has provided a number of important ideas, some of which have been adopted by HPC (virtualization, containers, etc.). He discussed FaaS and introduced the FuncX system that they have built. (We have described FuncX elsewhere in this blog.) He also described a recent application of FuncX to drug discovery pipelines for COVID-19 therapeutics.
Kubernetes was developed by Google and released to the open-source community, where it has become a cloud standard. Bruce D’Amora considered the important issue of using Kubernetes in large-scale scientific environments. More specifically, he described how they are using it to serve as a control plane for bare-metal systems with high performance networks.
Ilkay Altintas discussed a very ambitious project involving dynamic composability of computational services, AI and distributed real-time data integration. This work goes well beyond static service integration into workflow. The applications she discussed involved challenges such as the monitoring, measurement and prediction of the spread of wildfires and other rapidly changing input from widely distributed sensor networks. Large teams of participants need to be interacting with the system, so the workflow (or teamflow) requires dynamic scaling, on-demand interactive access and performance analysis. The AI involved goes beyond the optimization of simulations to the optimization of a rapidly changing configuration of services and data streams. The complexity of this task and its solution is extraordinary.
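To make the function-as-a-service idea concrete, here is a minimal sketch in Python. It is not the FuncX API: the `TinyFaaS` class, its `register` and `submit` methods, and the `score_molecule` placeholder are all hypothetical names invented for illustration, and a local thread pool stands in for a remote endpoint. The point is only the pattern: register a function once, get back an opaque id, then invoke it many times asynchronously.

```python
from concurrent.futures import ThreadPoolExecutor
import uuid


class TinyFaaS:
    """Toy in-process stand-in for a function-as-a-service endpoint (hypothetical)."""

    def __init__(self, max_workers=4):
        self._functions = {}
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def register(self, func):
        """Store a function and return an opaque id that callers use to invoke it."""
        func_id = str(uuid.uuid4())
        self._functions[func_id] = func
        return func_id

    def submit(self, func_id, *args, **kwargs):
        """Run a registered function asynchronously and return a Future."""
        func = self._functions[func_id]
        return self._pool.submit(func, *args, **kwargs)

    def shutdown(self):
        """Wait for outstanding invocations and release the workers."""
        self._pool.shutdown(wait=True)


def score_molecule(smiles):
    # Hypothetical placeholder for an expensive scoring step:
    # here the "score" is simply the length of the input string.
    return len(smiles)


service = TinyFaaS()
fid = service.register(score_molecule)

# Fan out one invocation per input, then gather the results in order.
futures = [service.submit(fid, s) for s in ["CCO", "C1=CC=CC=C1", "O"]]
scores = [f.result() for f in futures]
service.shutdown()
```

In a real deployment the registered function would be shipped to remote compute resources rather than a local thread pool, but the caller's view (register, submit, collect futures) would look much the same.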
Session 5. Distributed Computing Issues for HPC in the Cloud (Chair: Geoffrey Fox)
Session 5 contains 3 papers and the videos are available at the following link: CloudHPC Symposium – Session 5 – YouTube
The papers are:
- Challenges of Distributed Computing for Pandemic Spread Prediction based on Large Scale Human Interaction Data
Haiying Shen, Professor, University of Virginia
- GreenDataFlow: Minimizing the Energy Footprint of Cloud/HPC Data Movement
Tevfik Kosar, Professor, University of Buffalo & NSF
- IMPECCABLE: A Dream Pipeline for High-Throughput Virtual Screening, or a Pipe Dream?
Shantenu Jha, Professor, Rutgers University
Haiying Shen describes the challenges of epidemic simulation using distributed computing. In the case of epidemic simulations, the population is represented as a very large graph with edges representing interactions of human communities. If we use a static partition of this graph to distribute the simulation over a distributed network of servers, we encounter the problem that the graph is dynamic and so the partition must reflect this fact. Optimizing the performance requires careful repartitioning and some replication.
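The partitioning problem can be illustrated with a small sketch. This is not the method from the talk: the functions `edge_cut` and `refine`, the six-vertex graph and the load limit are all assumptions made up for illustration, and the sketch covers only repartitioning, not replication. It shows how a static partition that was good for one interaction graph produces more cross-server edges once the interactions change, and how one greedy pass can repair it.

```python
def edge_cut(edges, part):
    """Count edges whose two endpoints are assigned to different servers."""
    return sum(1 for u, v in edges if part[u] != part[v])


def refine(edges, part, num_parts, max_load):
    """One greedy pass: move each vertex to the server that holds most of its
    neighbours, provided that server is below max_load. Returns a new mapping."""
    part = dict(part)
    load = [0] * num_parts
    for p in part.values():
        load[p] += 1
    neighbours = {v: [] for v in part}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    for v in sorted(part):
        counts = [0] * num_parts
        for n in neighbours[v]:
            counts[part[n]] += 1
        best = max(range(num_parts), key=lambda p: counts[p])
        if best != part[v] and counts[best] > counts[part[v]] and load[best] < max_load:
            load[part[v]] -= 1
            load[best] += 1
            part[v] = best
    return part


# Static partition chosen for an earlier interaction graph (illustrative data).
static = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# Later, community 2 interacts mostly with communities 3, 4 and 5.
edges_later = [(0, 1), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]

cut_before = edge_cut(edges_later, static)
new_part = refine(edges_later, static, num_parts=2, max_load=4)
cut_after = edge_cut(edges_later, new_part)
print(cut_before, cut_after)
```

Each cut edge stands for communication between servers during the simulation, so fewer cut edges generally means less network traffic. The `max_load` bound keeps the greedy pass from piling every vertex onto one server.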
Tevfik Kosar’s team has been looking at the problem of reducing the cost of moving data over the Internet. On a global basis this costs $40 billion/year. Their approach is to use application-level tuning of cross-layer parameters to save significant amounts of power. It involves clever clustering optimization and a model built to capture real-time conditions as well as possible. They applied this to a cloud-based experiment and achieved 80% higher throughput and up to 48% lower energy consumption.
Shantenu Jha described a massive effort by the DOE and other HPC laboratories to build a drug discovery pipeline, called IMPECCABLE, that combines traditional HPC simulations with new AI methods that can greatly reduce the search spaces involved in molecular dynamics modeling. IMPECCABLE is clearly vastly more sophisticated and complete than any of the current cloud-based high-throughput virtual screening pipelines. Professor Jha makes the point that more work is needed on system software to support heterogeneous workflows for large scale drug discovery.
There are two sessions for which we are still waiting for the IEEE to upload the videos. We will update this document when we have access to recordings.
Session 1. Cloud & Heterogeneous Architectures & Opportunities for HPC (Chair: Ian Foster)
- Advancing Hybrid Cloud HPC Workflows Across State of the Art Heterogeneous Infrastructures
Steve Hebert, Nimbix Founder and CEO
- The impact of the rise in cloud-based HPC
Brent Gorda, ARM Director HPC Business
- HPC in a box: accelerating research with Google Cloud
Alexander Titus, Google Cloud
Session 6. Cloud HPC Barriers & Opportunities (Chair: Bruce D’Amora)
- The Future of OpenShift
Carlos Eduardo Arango Gutierrez, Red Hat, HPC OpenShift Manager
- Scientific Computing On Low-cost Transient Cloud Servers
Prateek Sharma, Indiana University
- HW-accelerated HPC in the cloud: Barriers and Opportunities
Christoph Hagleitner, IBM Research