A Look at Cloud-based Automated Machine Learning Services

AutoML

AI is the hottest topic in the tech industry. While it is unclear if this is a passing infatuation or a fundamental shift in the IT industry, it certainly has captured the attention of many. Much of the writing in the popular press about AI involves wild predictions or dire warnings. However, for enterprises and researchers the immediate impact of this evolving technology concerns the more prosaic subset of AI known as Machine Learning. The reason for this is easy to see. Machine learning holds the promise of optimizing business and research practices. The applications of ML range from improving the interaction between an enterprise and its clients/customers (How can we better understand our clients’ needs?), to speeding up advanced R&D discovery (How do we improve the efficiency of the search for solutions?).

Unfortunately, it is not that easy to deploy the latest ML methods without access to experts who understand the technology and how best to apply it. The successful application of machine learning methods is notoriously difficult. If one has a data collection or sensor array that may be useful for training an AI model, the challenge is how to clean and condition that data so that it can be used effectively. The goal is to build and deploy a model that can be used to predict behavior or spot anomalies. This may involve testing a dozen candidate architectures over a large space of tuning hyperparameters. The best method may be a hybrid model derived from standard approaches. One such hybrid is ensemble learning, in which many models, such as neural networks or decision trees, are trained in parallel to solve the same problem and their predictions are combined linearly when classifying new instances. Another approach (called stacking) is to use the results of the sub-models as input to a second-level model which selects the combination dynamically. It is also possible to use AI methods to simplify labor-intensive tasks such as collecting the best features from the input data tables (called feature engineering) for model building. In fact, the process of building the entire data pipeline and workflow to train a good model is itself a task well suited to AI optimization. The result is automated machine learning. The cloud vendors now provide expert autoML services that can lead the user to the construction of solid and reliable machine learning solutions.
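To make the stacking idea concrete, here is a minimal sketch using scikit-learn (our own illustration, not taken from any of the autoML services discussed below):

# A minimal stacking sketch: two base models are trained on the same problem,
# and a second-level Ridge model learns how to combine their predictions.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("mlp", MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)),
    ],
    final_estimator=Ridge(),  # the second-level model that combines the base predictions
)
stack.fit(X, y)
print(stack.predict(X[:5]))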

Work on autoML has been going on for a while. In 2013, Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown introduced Auto-WEKA, and many others followed. In 2019, the AutoML research groups led by Frank Hutter at the University of Freiburg and Marius Lindauer at the Leibniz University of Hannover published Automated Machine Learning: Methods, Systems, Challenges (which can be accessed on the AutoML website).

For an amateur looking to use an autoML system, the first step is to identify the problem that must be solved. These systems support a surprising number of capabilities. For example, one may be interested in image-related problems like image identification or object detection. Another area is text analysis. It may also be regression or prediction from streaming data. One of the biggest challenges involves building models that can handle tabular data which may contain not only columns of numbers but also images and text. All of these are possible with the available autoML systems.

While all autoML systems differ in the details of use, the basic idea is that they automate a pipeline like the one illustrated in Figure 1 below.

Figure 1.   Basic AutoML pipeline workflow to generate an optimal model based on the data available.

Automating the model test and evaluation is a process that involves exploring the search space of model combinations and parameters. Doing this search is a non-trivial process that involves intelligent pruning of candidate combinations if they seem likely to be poor performers. As we shall show below, the autoML system may test dozens of candidates before ranking them and picking the best.

Amazon AWS, Microsoft Azure, Google Cloud and IBM Cloud all have automated machine learning services that they provide to their customers. In the following paragraphs we will look at two of these: AWS AutoGluon, which is both open source and part of the SageMaker service, and the Microsoft Azure AutoML service, which is part of the Azure Machine Learning Studio. We will also provide a very brief look at Google’s Vertex AI cloud service. We will not provide an in-depth analysis of these services, but give a brief overview and example from each.

AWS and AutoGluon

AutoGluon was developed by a team at Amazon Web Services, which has also released it as open source. Consequently, it can be used as part of their SageMaker service or completely separately. An interesting tutorial on AutoGluon is here. While the range of problems AutoGluon can be applied to is extremely broad, we will illustrate it with only a tiny classical problem: regression based on a tabular input.

The table we use is from the Kaggle bike share challenge. The input is a pandas dataframe with records of bike shares per day for about two years. For each day, there are indicators to say whether the day is a holiday and whether it is a workday. There is weather information consisting of temperature, humidity and windspeed. The last column is the “count” of the number of rentals for that day. The first few rows are shown below in Figure 2. Our experiment differs from the Kaggle competition in that we will use a small sample (27%) of the data to train a regression model and then use the remainder for testing, so that we can easily illustrate the fit to the true data.
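For concreteness, the setup looks roughly like the following sketch (the file name, random seed and variable names are our assumptions; the notebook linked below has the exact code):

import pandas as pd

# Load the Kaggle bike-share data; drop "registered" and "casual", which
# together sum to the "count" target and would otherwise leak the answer.
data = pd.read_csv("train.csv", parse_dates=["datetime"])
data = data.drop(columns=["registered", "casual"])

# Train on a 27% random sample and hold out the rest for testing.
train_data = data.sample(frac=0.27, random_state=42)
test_data = data.drop(train_data.index)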

Figure 2.  Sample of the bike rental data used in this and the following example.

While AutoGluon, we believe, can be deployed on Windows, we will use Linux because it deploys easily there. We used Google Colab and also Ubuntu 18.04 deployed on Windows 11. In both cases the installation from a Jupyter notebook was very easy and went as follows. First we need to install the packages.
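The install is the standard AutoGluon pip sequence (the exact versions pinned in the notebook may differ):

python -m pip install -U pip
python -m pip install -U setuptools wheel
python -m pip install autogluon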

The full notebook for this experiment is here and we encourage the reader to follow along as we will only sketch the details below. 

As can be seen from the data in Figure 2, the “count” number jumps wildly from day to day. Plotting the count vs. time shows this clearly.

A more informative way to look at this is a “weekly” average shown below.

The training data that is available is a random selection of about 70% of the complete dataset, so this is not a perfect weekly average, but each point is an average over seven consecutive days of the data.

Our goal is to compute the regression model based on a small training sample and then use the model to predict the “count” values for the test data. We can then compare those with the actual test data “count” values. Invoking AutoGluon is now remarkably easy.
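A sketch of the call, assuming the training sample is in the pandas dataframe train_data from above:

from autogluon.tabular import TabularPredictor

# Fit a regression predictor for the "count" column with a 20 minute budget.
# AutoGluon infers the problem type (regression) from the label column.
predictor = TabularPredictor(label="count").fit(train_data, time_limit=20 * 60)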

We have given this a time limit of 20 minutes.   The predictor is finished well before that time.   We can now ask to see how well the different models did (Figure 3) and also ask for the best.
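Both are one call each (the leaderboard is the table shown in Figure 3):

predictor.leaderboard(silent=True)   # per-model scores, as in Figure 3
best_model = predictor.get_model_best()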

Running the predictor on our test data is also easy. We first drop the “count” column from the test data and invoke the predict method on the predictor.
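In code:

y_pred = predictor.predict(test_data.drop(columns=["count"]))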

Figure 3. The leaderboard shows the performance of the various methods tested.

One trivial graphical way to illustrate the fit of the prediction to the actual data is a simple scatter plot.
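A sketch with matplotlib (variable names as in the snippets above):

import matplotlib.pyplot as plt

plt.scatter(test_data["count"], y_pred, s=2)
plt.xlabel("actual count")
plt.ylabel("predicted count")
plt.show()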

As should be clear to the reader, this is far from perfect. Another simple visualization is to plot the two “count” values along the time axis. As we did above, the picture is clearer if we plot a smoothed average. In this case each point is an average of the following 100 points. The result, which shows the true data in blue over the prediction in orange, does indicate that the model captures the qualitative trends.
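A sketch of that smoothed comparison (we use pandas’ rolling mean here, which averages a trailing rather than a following window; the picture is essentially the same):

plt.plot(test_data["count"].rolling(100).mean().values, color="blue", label="actual")
plt.plot(y_pred.rolling(100).mean().values, color="orange", label="predicted")
plt.legend()
plt.show()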

The mean squared error is 148.   Note: we also tried training with a larger fraction of the data and the result was similar. 

Azure Automated Machine Learning

The Azure AutoML system is also designed to support classification, regression, forecasting and computer vision. There are two basic modes in which Azure AutoML works: use the ML studio on Azure for the entire experience, or use the Python SDK, running in Jupyter on your laptop with remote execution in the Azure ML studio. (In simple cases you can run everything on your laptop, but having the studio manage a cluster for you in the background is a big win.) We will use the Azure studio for this example. We will run a Jupyter notebook locally and connect to the studio remotely. To do so we must first install the Python libraries. Starting with Anaconda on Windows 10 or 11, it can be challenging to find library versions that all work together. The following combination will work with our example.

conda create -n azureml python=3.6.13
conda activate azureml
pip install azureml-train-automl-client
pip install numpy==1.18
pip install azureml-train-automl-runtime==1.35.1
pip install xgboost==0.90
pip install jupyter
pip install pandas
pip install matplotlib
pip install azureml.widgets
jupyter notebook

Next clone the Azure/MachineLearningNotebooks repository from GitHub and grab the notebook configuration.ipynb. If you don’t have an Azure subscription, you can create a new free one. Running the configuration successfully in your Jupyter notebook will set up your connection to the Azure ML studio.

The example we will use is a standard regression demo from the AzureML collection. In order to better illustrate the results, we use the same bike-share demand data from the Kaggle competition as used above, where we sample both our training and test sets from the official training data. The training data we use is 27% of the total and the remainder is used for testing. As we did with the AutoGluon example, we delete two columns: “registered” and “casual”.

You can see the entire notebook and results here:

azure-automl/bike-regression-drop-.3.ipynb at main · dbgannon/azure-automl (github.com)

If you want to understand the details, the notebook is the place to look. In the following we only provide a sketch of the process and results.

We are going to rely on autoML to do the entire search for the best model, but we do need to give it some basic configuration parameters as shown below.
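A sketch of the configuration (the dataset, compute target and timeout values here are placeholders; the notebook has the exact settings):

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="regression",
    primary_metric="normalized_root_mean_squared_error",
    experiment_timeout_hours=1.0,      # far more time than the search needs
    training_data=train_dataset,       # an azureml TabularDataset built from our 27% sample
    label_column_name="count",
    compute_target=compute_target,     # the remote cluster managed by ML studio
    n_cross_validations=5,
)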

We have given it a much longer execution time than is needed.   One line is then used to send the job to Azure ML studio.
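That line is the submit call (the experiment name below is our placeholder); the azureml widget can then track progress inside the notebook:

from azureml.core import Experiment, Workspace
from azureml.widgets import RunDetails

ws = Workspace.from_config()   # reads the connection set up by configuration.ipynb
experiment = Experiment(ws, "bike-regression")
run = experiment.submit(automl_config, show_output=True)
RunDetails(run).show()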

After waiting for the experiment to run, we see the results of the search.

As can be seen, the search progressed through various methods and combinations, with a stack ensemble finally providing the best results.

We can now use the trained model to make our predictions as follows.  We begin by extracting the fitted model.  We can then drop the “count” column from the test file and feed it to the model.   The result can be plotted as a scatter plot.
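A sketch of those steps (variable names as above):

# Extract the best run and its fitted model, then predict on the test data.
best_run, fitted_model = run.get_output()
y_pred = fitted_model.predict(test_data.drop(columns=["count"]))

plt.scatter(test_data["count"], y_pred, s=2)
plt.show()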

As before, we can now use a simple visualization based on a sliding window average of 100 points to “smooth” the data and show the results of the true values against the prediction.

As can be seen, the fit is pretty good. Of course, this is not a rigorous statistical analysis, but it does show the model captures the trends of the data fairly well.

In this case the mean squared error was 49.

Google Vertex AI

Google introduced their autoML service, called Vertex AI, in 2021. Like AutoGluon and Azure AutoML there is a Python binding; for example, the function aiplatform.TabularDataset.create() can be used to set up a training job in a manner similar to AutoMLConfig() in the Azure API. Rather than use that, we decided to use the full Vertex AI cloud service on the same dataset and regression problem we described above.
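For completeness, here is a sketch of what that programmatic route looks like (project, bucket and display names are placeholders, and the budget parameter is illustrative):

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create a tabular dataset from a CSV in Cloud Storage ...
dataset = aiplatform.TabularDataset.create(
    display_name="bike-share",
    gcs_source=["gs://my-bucket/bike_train.csv"],
)

# ... and hand it to an AutoML tabular training job.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="bike-regression",
    optimization_prediction_type="regression",
)
model = job.run(dataset=dataset, target_column="count", budget_milli_node_hours=1000)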

The first step was to upload our dataset, here called “untitled_1637280702451”. The Vertex AI system steps us through the process in a very deliberate and simple manner, beginning by asking us to tell it we want to do regression (the other choice for this data set was classification).

The next step is to identify the target column and the columns that are included in the training. We used the default data split of 80% for training, 10% for validation and 10% for testing.

After that there is a button to launch the training. We gave it a budget of one hour; it took two hours and produced a model.

Once complete, we can deploy the model in a container and attach an endpoint. The root mean squared error of 127 is in line with the AutoGluon result and larger than the Azure AutoML value. One problem with the graphical interactive view is that I did not see how the error was computed, so it is not certain that the Vertex AI RMSE is directly comparable to the values we computed for the others.

Conclusions

Among the three autoML methods used here, the easiest to deploy was Vertex AI because we only used the graphical interface on Google Cloud. AutoGluon was trivial to deploy on Google Colab and on a local Ubuntu installation. Azure AutoML was installable on Windows 11, but it took some effort to find the right combination of libraries and Python versions. While we did not study the performance of the Vertex AI model, the performance of the Azure AutoML model was quite good.

As is likely obvious to the reader, we did not push these systems to produce the best results. Our goal was to see what was easy to do. Consequently, this brief evaluation of the three autoML offerings did not do justice to any of them. All three have capabilities that go well beyond simple regression. All three systems can handle streaming data, image classification and recognition, as well as text analysis and prediction. If time permits, we will follow up this article with more interesting examples.

IEEE Symposium on Cloud HPC

On September 7th, 8th and 9th, we held the IEEE Services 2021 – IEEE International Symposium on Cloud HPC (CloudHPC) (computer.org) as part of IEEE CLOUD 2021 (computer.org). The program consisted of a plenary panel and 8 sessions with three speakers in each. Each session was recorded, and in this note we will describe four of these sessions. The videos for two sessions are still unavailable, but we will update this document when we can access them. The other organizers of this event were Rong N Chang, Geoffrey Fox, James Sexton, Christoph Hagleitner, Bruce D’Amora, Ian Foster and Andrew Lumsdaine. In the paragraphs below we provide links to the videos and brief thumbnail abstracts to introduce the talks.

The first session was a plenary panel that was part of the IEEE Cloud 2021 conference.

EXPLORING THE GROWING SYNERGY BETWEEN CLOUD AND HIGH-PERFORMANCE COMPUTING

Panelists:
Katherine Yelick, UC Berkeley and Lawrence Berkeley National Laboratory
Ian Foster, Argonne National Laboratory, University of Chicago
Geoffrey Fox, University of Virginia
Kate Keahey, Argonne National Laboratory, University of Chicago

This group of panelists has been involved in every aspect of HPC and HPC in the cloud. Kathy Yelick has a wealth of experience ranging from managing an HPC center to designing programming languages for parallel computing and, now, to being a partner in running CloudBank, a major NSF program to help with cloud access. Ian Foster was a pioneer who conceived of the Grid, a network of supercomputer services which, in many ways, foreshadowed the cloud concept, and he continues to lead research in HPC and the cloud. Geoffrey Fox has been in the HPC business from the early days of MPI, and he is now doing groundbreaking work using cloud-native middleware with MPI to design new deep learning paradigms for science. Kate Keahey is a leader in high performance distributed computing; she designed one of the first open-source cloud platforms and now runs a major cloud devoted to academic research for NSF.

The link to the panel discussion is SERVICES Plenary Panel 1: Cloud HPC – Exploring Growing Synergy btw Cloud & HPC – YouTube.

The overriding theme of the discussion was an exploration of how cloud HPC is, and will be, different from traditional supercomputer-based HPC, and where the opportunities for exploiting the strengths of both can advance research and science.

Ian started the discussion by making the case that there are some fundamental differences between the cloud and traditional HPC systems. Most significant of these is that the cloud makes it possible to deliver an ecosystem of on-demand services. Among these services are tools to interactively explore large data collections and services that enable users to easily build machine learning solutions for difficult data analysis problems. HPC, on the other hand, is about providing the most powerful computing platform for very large scientific applications. Kathy observed that what the cloud lacks in high-end power, it makes up for in resilience. This resilience is critical for teaching, such as when student assignments require just-in-time access to resources. Kate made the additional point that there are classes of HPC problems where the interactivity clouds provide can be very important for exploring new areas where existing HPC application codes are not yet available. Geoffrey made the point that access to HPC systems capable of doing some of the most advanced neural net training experiments is out of reach of most academics. That leaves clouds, but a national funding model for research is clearly needed. Kathy noted that the new NSF CloudBank provides a way for universities to access commercial cloud platforms with NSF and other funding. Ian argued that the future challenge is to figure out how to combine the capabilities of both supercomputer centers and clouds for the next generation of scientific endeavors. The panel also addressed a number of important issues including economics, advances in technology and scalability. It was an extremely lively and interesting discussion. Check it out.

Session 2. HPC in Biology & Medicine in the Cloud (Chair: Dennis Gannon)

There are three talks in this session.

  • Computational Biology at the Exascale
    Katherine Yelick, UC Berkeley and Lawrence Berkeley National Laboratory
  • HySec-Flow: Privacy-Preserving Genomic Computing with SGX-based Big-Data Analytics Framework
    Judy Fox, Professor, University of Virginia
  • An automated self-service multi-cloud HPC platform applied to the simulation of cardiac valve disease with machine learning
    Wolfgang Gentzsch, UberCloud, Founder & President

Unfortunately, the recording of this session on YouTube is flawed (my fault). Consequently, I have collected the videos together elsewhere. A summary of the session and links to the individual talks are here: Talks from the first IEEE Symposium on Cloud & HPC | The eScience Cloud (esciencegroup.com)

Session 3. Using HPC to Enable AI at Scale (Chair: Dennis Gannon)

Session 3 included three talks that addressed how the private sector was approaching Cloud HPC to build new tools to solve critical global problems.  The session recording is here: CloudHPC Symposium – Session 3 – YouTube.  The talks were

  • Grand Challenges for Humanity: Cloud Scale Impact and Opportunities
    Debra Goldfarb, Amazon, Director HPC Products & Strategy
  • Enabling AI at scale on Azure
    Prabhat Ram, Microsoft, Azure HPC
  • Benchmarking for AI for Science in the Cloud: Challenges and Opportunities
    Jeyan Thiyagalingam, STFC, UK, Head of SciML Group

Debra Goldfarb passionately addressed the importance of bringing HPC and the cloud to bear on many important global challenges. Chief among these are the health crises brought on by viral pandemics. She went on to describe work done in collaboration with the research community on providing the computation and data resources needed. Of particular importance is the fact that the cloud can deliver scale that goes beyond single data centers. One of the exciting outcomes was the work AWS did with the COVID-19 HPC Consortium to provide the resources that allowed Moderna to deliver their mRNA vaccine in an incredibly short time. Debra also discussed the work on building the National Strategic Computing Reserve, which will play an important role in solving future problems.

Prabhat Ram, who recently moved from the DOE’s LBNL facility to Microsoft to build a team to address AI and HPC on Azure, provided an in-depth overview of the current state of Azure HPC cloud capabilities. He pointed out that for scientific computing problems of modest scale it is easy to get good speed-up on batch jobs with up to 128 cores. Focusing on AI workloads, Prabhat described their impressive performance on the HPL-AI benchmark. Microsoft now has a deep collaboration with OpenAI and is building on the results from a few years ago with GPT-3. The results now demonstrate that the cloud can be classified as leadership-class computing.

Jeyan Thiyagalingam addressed the problem of designing benchmarks for AI and machine learning for science applications. This is important for many reasons. We have hundreds of different ML models, and the applications and platforms differ widely. Having a reasonable set of benchmarks will enable us to understand the applications better and provide a guide to tell us when our solutions are improving. Developing these benchmarks is not easy. One of the big problems is the size of realistic data sets for realistic applications. If a data set requires 100 terabytes, how do we make it available? The answer, it would seem, is the cloud. However, Thiyagalingam observes that there are still many challenges to making that work in a practical way. He makes a very strong case that the science community needs help from the cloud providers and other institutions to make this possible.

Session 4. Applications of Cloud Native Technology to HPC in the Cloud (Chair: Christoph Hagleitner)

Session 4 focused more on the Cloud software tools that are important for science.   There were three talks and the recorded video is here: CloudHPC Symposium – Session 4 – YouTube.

  • Serverless Supercomputing: High Performance Function as a Service
    Kyle Chard, Professor, University of Chicago
  • Minding the Gap: Navigating the transition from traditional HPC to cloud native development
    Bruce D’Amora, IBM Research
  • Composable Systems: An Early Application Experience
    Ilkay Altintas, SDSC, Chief Data Science Officer

Kyle’s talk was about how the function-as-a-service (FaaS) model can be applied to large scale science. In the talk he points out that the cloud computing industry has provided a number of important ideas, some of which have been adopted by HPC (virtualization, containers, etc.). He discussed FaaS and introduced the FuncX system that they have built. (We have described FuncX elsewhere in this blog.) He also described a recent application of FuncX to drug discovery pipelines for COVID-19 therapeutics.

Kubernetes was developed by Google and released to the open-source community, where it has become a cloud standard. Bruce D’Amora considers the important issue of using Kubernetes in large-scale scientific environments. More specifically, he described how they are using it as a control plane for bare-metal systems with high performance networks.

Ilkay Altintas discussed a very ambitious project involving dynamic composability of computational services, AI and distributed real-time data integration. This work goes well beyond static service integration into workflow. The applications she discussed involved challenges such as the monitoring, measurement and prediction of the spread of wildfires and other rapidly changing phenomena, using input from widely distributed sensor networks. Large teams of participants need to be interacting with the system, so the workflow (or teamflow) requires dynamic scaling, on-demand interactive access and performance analysis. The AI involved goes beyond the optimization of simulations to the optimization of a rapidly changing configuration of services and data streams. The complexity of this task and its solution is extraordinary.

Session 5. Distributed Computing Issues for HPC in the Cloud (Chair: Geoffrey Fox)

Session 5 contains three papers and the videos are available at the following link: CloudHPC Symposium – Session 5 – YouTube

The papers are:

  • Challenges of Distributed Computing for Pandemic Spread Prediction based on Large Scale Human Interaction Data
    Haiying Shen, Professor, University of Virginia
  • GreenDataFlow: Minimizing the Energy Footprint of Cloud/HPC Data Movement
    Tevfik Kosar, Professor, University at Buffalo & NSF
  • IMPECCABLE: A Dream Pipeline for High-Throughput Virtual Screening, or a Pipe Dream?
    Shantenu Jha, Professor, Rutgers University

Haiying Shen describes the challenges of epidemic simulation using distributed computing.  In the case of epidemic simulations, the population is represented as a very large graph with edges representing interactions of human communities.   If we use a static partition of this graph to distribute the simulation over a distributed network  of servers, we encounter the problem that the graph is dynamic and so the partition must reflect this fact.  Optimizing the performance requires careful repartitioning and some replication.

Tevfik Kosar’s team has been looking at the problem of reducing the cost of moving data over the Internet. On a global basis this costs $40 billion per year. Their approach is to use application-level tuning of cross-layer parameters to save significant amounts of power, using clever clustering optimization to build a model that best captures real-time conditions. They applied this to a cloud-based experiment and achieved 80% higher throughput and up to 48% lower energy consumption.

Shantenu Jha described a massive effort by the DOE and other HPC laboratories to build a drug discovery pipeline, called IMPECCABLE, that combines traditional HPC simulations with new AI methods that can greatly reduce the search spaces involved in molecular dynamics modeling. IMPECCABLE is clearly vastly more sophisticated and complete than any of the current cloud-based high-throughput virtual screening efforts. Professor Jha makes the point that more work is needed on system software to support heterogeneous workflows for large scale drug discovery.

Final Sessions

There are two sessions for which we are still waiting for the IEEE to upload the videos.   We will update this document when we have access to recordings.

Session 1. Cloud & Heterogeneous Architectures & Opportunities for HPC (Chair: Ian Foster)

Advancing Hybrid Cloud HPC Workflows Across State of the Art Heterogeneous Infrastructures
Steve Hebert, Nimbix Founder and CEO

The impact of the rise in cloud-based HPC
Brent Gorda, ARM Director HPC Business

HPC in a box: accelerating research with Google Cloud
Alexander Titus, Google Cloud

Session 6.  Cloud HPC Barriers & Opportunities (Chair: Bruce D’Amora)

The Future of OpenShift
Carlos Eduardo Arango Gutierrez, Red Hat, HPC OpenShift Manager

Scientific Computing On Low-cost Transient Cloud Servers
Prateek Sharma, Indiana University

HW-accelerated HPC in the cloud: Barriers and Opportunities
Christoph Hagleitner, IBM Research