Background
The University Corporation for Atmospheric Research (UCAR) was created in 1960 to be the US national hub for research and education in the atmospheric and earth sciences. UCAR manages the National Center for Atmospheric Research (NCAR), which is devoted to fundamental research on the atmosphere and its prediction. UCAR is also home to Unidata, a community program that facilitates weather data access and interactive analysis for the university community. Unidata’s director, Mohan Ramamurthy, succinctly described Unidata’s mission as reducing “data friction”, lowering the barriers to accessing and using data, and shrinking the “time to science”. Unidata delivers 30 different streams of real-time weather data, from sources including radar, satellites, and model output, to over 1,000 computers worldwide. Unidata’s outbound traffic is about 31 terabytes per day; they move more data over Internet2 than any other advanced application.
While Unidata’s program is incredibly successful, they see a critical turning point ahead. Data volumes are growing so fast that the model of distributing raw data may no longer be sustainable. Ramamurthy writes, “We need to move from ‘bringing the data to the scientist’ to ‘bringing the science to the data’.” For these and other reasons, Unidata has decided to transition its data services to the cloud.
Last week (May 31-June 2, 2017) UCAR and Unidata hosted a workshop on the role cloud computing is playing in atmospheric science research and the role it could play in the future. Science is composed of many sub-disciplines. Some disciplines, such as the life sciences, have rapidly embraced cloud computing as an important enabling technology. Other disciplines, especially those with a long tradition of large-scale computational simulation on supercomputers, have been slower to see a role for the cloud. Atmospheric science (AS) is in this latter category. What this workshop made very clear is that the AS community has now recognized that the cloud has the potential to revolutionize both research and education, but there are still challenges that must be addressed.
Using the Cloud for Computation: Challenges and Opportunities
One of the great breakthroughs in Numerical Weather Prediction (NWP) has been the development of the “Weather Research and Forecasting” (WRF) model, the state-of-the-art mesoscale numerical weather prediction software system. WRF is basically a specialized computational fluid dynamics program and data assimilation system that can generate simulations based on real observational data or idealized conditions. It can generate predictions on computational meshes whose resolution ranges from tens of meters to many kilometers. WRF has 30,000 registered users in 150 countries.
WRF runs as a classic MPI-based parallel program. For really big simulations, such as those used in production at major forecast centers, it can run for hours on 10,000 cores of a supercomputer. In fact, the largest core count for a single WRF job (that we know of) is about 140,000 on the Blue Waters supercomputer at the University of Illinois. For small experiments and student training, it can run on 16 cores in a few hours and still provide useful results for researchers. A big challenge for using WRF is that it is an incredibly complex application that requires an experienced professional to deploy and configure it properly. Training students to use WRF is a major challenge if the students are required to deploy it themselves, and they face additional hurdles if they don’t have access to sufficient computational resources to run the program.
A Cloud Solution for Modest WRF Runs
As part of the Big Weather Web NSF project, J. Hacker, J. Exby and K. Fossell from NCAR recognized the problem confronting educators and came up with a great solution. They demonstrated that by putting WRF and all its required components and configuration files in a Docker container that can easily be run on an AWS EC2 instance, they could revolutionize the use of WRF in classroom and lab experimentation. Deploying WRF is no longer an obstacle for students. Their container allows the user to change input data sets and boundary conditions, change the physics assumptions, and generate bit-level reproducible simulation results. In an experiment with the University of North Dakota, students used the container on AWS to create an ensemble output of a tornadic supercell over North Dakota. The total AWS cost of the 11-day student team project was $40.
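To give a sense of how little ceremony is involved, here is a minimal sketch, in Python with the Docker SDK, of what launching such a containerized run on a cloud VM might look like. The image name, mount paths, and environment variable are placeholders for illustration, not the actual coordinates of the NCAR container.

```python
# Hypothetical launch of a containerized WRF case study on a cloud VM.
# The image name, paths, and CASE_NAME variable are placeholders.
import docker

client = docker.from_env()

logs = client.containers.run(
    "example/wrf-case-study:latest",               # placeholder image name
    environment={"CASE_NAME": "supercell_demo"},   # hypothetical case selector
    volumes={
        "/home/student/wrf/input":  {"bind": "/wrf/input",  "mode": "ro"},
        "/home/student/wrf/output": {"bind": "/wrf/output", "mode": "rw"},
    },
    remove=True,        # clean up the container when the run finishes
)
print(logs.decode())    # WRF's console output
```

Because everything WRF needs is baked into the image, the same few lines work unchanged on a laptop, a departmental server, or a large cloud instance.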
Running containerized versions of WRF on a 32-core cloud VM addresses a very large set of educational and research use cases. One concern that came up frequently at the workshop was the fear of spending too much money: because the cloud charges by the hour, it is feared that users may be discouraged from experimenting. Users who are accustomed to working on a dedicated cluster were the most concerned about this problem. However, Roland Stull, the director of the Geophysical Disaster CFD Center at the University of British Columbia, was convinced by his graduate students that the cloud is a great platform for experimentation. They used the Google cloud to build a virtual HPC cluster, and after experimenting with different configurations they found that 32 8-core VMs was the best configuration for WRF, producing results in about the same time as their local 448-core HPC cluster. Their entire experience grew from one graduate student, David Siuta, doing experiments in the cloud with a tiny free Google cloud account.
The Biggest Cloud Challenge: Data
While doing large computational simulations with tools like WRF is important, that was not the central topic of the workshop. The most important opportunity for using cloud resources in AS is that the cloud provides a great way to share data and post-processing analysis and to simplify the data distribution challenges that face Unidata. The cloud vendors, notably Amazon and Google, have done a great job of making important data collections available in the cloud. Kevin Jorissen from Amazon described some of the data they make available to the community. Their data collections are open to the public and include data from many scientific disciplines. In the AS case, this includes 270 TB of individual NEXRAD radar volume scan files and real-time chunks. These are available as objects on Amazon S3 and accessible via standard REST APIs. In addition, the Amazon Simple Notification Service (SNS) can be used to subscribe to notifications when new data arrives. For geoscience, AWS has over 400,000 Landsat satellite image scenes in their archive. They also have the NOAA Global Forecast System model data and High-Resolution Rapid Refresh (HRRR) model data available in rolling one-week archives.
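To make the access path concrete, here is a minimal Python sketch using boto3 with anonymous credentials. The bucket name, key layout, and radar site reflect my understanding of the public NEXRAD Level II archive and should be treated as assumptions; the AWS open data registry has the authoritative details.

```python
# A minimal sketch of listing public NEXRAD Level II objects on S3.
# Bucket name and key layout are assumptions about the open data archive.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) access, since the archive is public.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Keys are assumed to be organized by date and radar site: YYYY/MM/DD/SITE/...
resp = s3.list_objects_v2(
    Bucket="noaa-nexrad-level2",
    Prefix="2017/05/31/KBIS/",   # one day of scans from one radar site
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```

The same pattern, a public bucket plus an SNS topic announcing new objects, is what makes it possible to build real-time radar processing pipelines that run entirely inside the AWS region hosting the data.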
While these AWS-hosted datasets are free to access, the challenge for Unidata and others that have moved model execution and data to the cloud is that the cost of data downloads out of the cloud is a large component of their overall cloud bill. In fact, it is not possible for Unidata to put all of its data on AWS; the cost of data egress would overwhelm its budget. Matthew Alvarado shared his experiences with using AWS: their data downloads cost about $90/TB (as a rough sense of scale, at that rate Unidata’s roughly 31 terabytes/day of outbound traffic would amount to on the order of a million dollars a year in egress charges alone). That cost has encouraged them to follow the same path that Unidata is considering, and the one Jim Gray advocated back in 2005: rather than moving the data to the compute, move the compute to the data. As Alvarado states, “Do post-processing and analysis on the cloud. – Use Python and R to avoid license charges – Use THREDDS to make data accessible via netCDF viewers like Panoply or via the web.”
Unidata has gone a long way in this direction. To move more data processing to the cloud, they have deployed an open source version of the Advanced Weather Interactive Processing System (AWIPS) to the Azure cloud; AWIPS is now in use at 44 universities. AWIPS consists of a data ingest, processing, and storage server called the Environmental Data EXchange (EDEX), together with user clients including the Common AWIPS Visualization Environment (CAVE) and a Python library that can be used to pull subsets of the data to the user’s desktop.
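Here is a minimal sketch of what that client-side access might look like, assuming the Python library mentioned above is Unidata’s python-awips package; the EDEX host name and the model identifier are assumptions for illustration.

```python
# A hypothetical query against a cloud-hosted EDEX server using python-awips.
# The host name and model name ("RAP13") are assumptions, not guaranteed values.
from awips.dataaccess import DataAccessLayer

# Point the client at a remote EDEX server instead of a local installation.
DataAccessLayer.changeEDEXHost("edex-cloud.unidata.ucar.edu")

# Build a request for gridded model output and ask what the server can provide.
request = DataAccessLayer.newDataRequest("grid")
print(DataAccessLayer.getAvailableLocationNames(request))

# Pull a single parameter at the most recent time rather than the whole dataset.
request.setLocationNames("RAP13")
request.setParameters("T")
times = DataAccessLayer.getAvailableTimes(request)
grids = DataAccessLayer.getGridData(request, [times[-1]])
print(grids[0].getParameter(), grids[0].getRawData().shape)
```

Only the requested slice crosses the network; the bulk of the data stays on the EDEX server in the cloud.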
Unidata has also “dockerized” many of their tools, including the Integrated Data Viewer (IDV), the THREDDS Data Server, and the Local Data Manager (LDM), so that they can be easily deployed by users locally or in the cloud. CloudIDV allows IDV to run in the cloud, so the only download bandwidth the user consumes is for the graphical images that IDV generates. Unidata has also released Siphon, a Python library that can query NetCDF and other data hosted on a THREDDS Data Server. Using this library, it is possible to build Python analytic services that explore the data while it is still in the cloud. These are great first steps in moving the compute to the data … as long as the data is in a data center in the same region as the VM running the CloudIDV container.
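The Siphon pattern looks roughly like the following sketch; the catalog URL, dataset choice, bounding box, and variable name are assumptions for illustration rather than a tested recipe.

```python
# A hypothetical server-side subset of GFS output via Siphon and a THREDDS
# NetCDF Subset Service; URL, variable name, and bounding box are assumptions.
from datetime import datetime
from siphon.catalog import TDSCatalog
from siphon.ncss import NCSS

# Open a THREDDS catalog and pick the first dataset it lists.
cat = TDSCatalog("https://thredds.ucar.edu/thredds/catalog/grib/"
                 "NCEP/GFS/Global_0p5deg/catalog.xml")
ds = list(cat.datasets.values())[0]

# Ask the server for a small spatial and temporal subset so that only the
# slice of interest ever leaves the data center.
ncss = NCSS(ds.access_urls["NetcdfSubset"])
query = ncss.query()
query.lonlat_box(north=50, south=44, east=-96, west=-104)
query.time(datetime.utcnow())
query.variables("Temperature_surface")
query.accept("netcdf4")

data = ncss.get_data(query)
print(list(data.variables))
```

Pointing a few lines like these at a THREDDS server running in the same cloud region as the analysis VM is essentially the “move the compute to the data” model the workshop advocated.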
Final Thoughts
Weather data is among the most complex and dynamic “Big Data” topics in science. It is extremely heterogeneous: to predict the weather one must understand the flow of information streaming in from sources that range from NEXRAD radar and satellite images to ground sensors and human observations. This data feeds simulation models that generate ensembles of output data that can be mined to make predictions. It is important that the data and tools are public so that the widest possible collection of researchers can experiment with them.
We are now in the middle of a revolution in our ability to derive knowledge from big data. New statistical tools and advanced machine learning techniques are being developed every day. These have been applied to discover patterns in dull but profitable endeavors like consumer shopping preferences, but they are also central to the race to deliver self-driving cars, and they are becoming increasingly important in scientific domains like genomics and physics. Weather and climate data are also well suited to exploration with the aid of machine learning. A great example is a paper by Liu and colleagues at LBL and elsewhere that demonstrates how deep learning via convolutional neural networks can be applied to the detection of extreme weather in climate datasets. I am certain that, given access to the best data and strong collaborations between the ML and AS communities, even more progress will be made.
Making data available in public cloud archives makes it possible to build a computational environment of shared tools and results. In the machine learning research world, there is a strong movement toward open sharing of data, research challenges, and results. Sites like OpenML.org provide thousands of case studies and data collections for researchers. A similar movement could create a cloud-based OpenAS.org that builds on what has already started with the Big Weather Web.
Finally, Unidata must be commended for putting together such a great workshop.