A Brief Survey of Cloud AI Services

The commercial clouds are in a race to see who can provide the most interesting and useful AI services on their cloud platform.   This work began in research laboratories at universities and companies over the past 25 years, but the big breakthroughs came when deep learning models trained on massive data collections began to reach levels of human accuracy.  For some time now, the public cloud companies have provided custom virtual machines that make it easy for technically sophisticated customers to use state-of-the-art ML and neural network tools like TensorFlow, CNTK and others.  (We described these here.)  But the real competition is now to provide services for building smart applications that can be used by developers lacking advanced training in machine learning and AI. We now have speech recognition, language translation and image recognition capabilities that can be easily integrated into web and mobile applications.

In the following paragraphs we will look at the AI services provided by IBM, Google, Microsoft and Amazon.  These are certainly not the only providers.  Salesforce has the myEinstein platform, and small companies like Algorithmia and the not-so-small Genpact also provide services and consulting in this area.

What becomes abundantly clear when you study the details of the offerings is that they all cover the same basics.  This includes tools for building bots, natural language translation, speech-to-text and text-to-speech, and unstructured document analysis.   But what one also discovers is that each provider has some services that stand out as being a bit more innovative than those offered by the others.  We conclude with an overview of the trends we see and thoughts about the future of cloud AI services.

This is the first of a series that we will do on this topic.   Future articles will explore some of these capabilities in more technical depth.  For example, at the end of this article, we look at an example of doing text analysis with Amazon Comprehend.

IBM Cloud Watson Services

The IBM Watson services are organized into five groups.

  • Watson Conversation provides a portal interface to build bots.  The interface prompts you to identify intents, entities and dialog flow.  Intents are the questions you expect your users to ask.  Entities are the components, such as city names and times, that your bot will understand.   Dialog flow is the tree of intents and responses you anticipate in the dialog.   The result is a bot you can deploy and later improve.
  • The discovery service is a tool that allows you to quickly ingest and explore data collections.  A query language can be used to subset results and identify important features and anomalies.  Discovery News is a service that crawls news and blogs looking for patterns in sentiment, new concepts and relationships.   It allows you to see trends and key events.
  • The visual recognition service has been used to analyze aerial images to better understand drought and water use.   It can do image content analysis including detecting faces and making age and gender estimates.   If you have your own collection of labeled images the system can be easily trained to incorporate these into its model.
  • Speech. Watson has speech-to-text and text-to-speech services.   These services work reasonably well, but the quality of the output speech does not seem as good as Amazon Polly’s.
  • The Watson natural language classifier is designed to classify the intent of text passages, such as deciding that a question about the weather is looking for current temperatures.   As with the other services, it can be updated with additional training data.
  • The Watson empathy services allow prediction of personality characteristics and emotions from text.
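The intent/entity/dialog pattern that Watson Conversation (and the other bot builders surveyed below) uses can be illustrated with a toy matcher.  This is only a sketch of the idea, not Watson code; the intents, the city list and the canned responses are invented for illustration:

```python
import re

# Toy bot definition: each intent has trigger keywords and a templated
# response; entities are regular-expression patterns.
INTENTS = {
    "weather":  {"keywords": {"weather", "temperature", "rain"},
                 "response": "The forecast for {city} is sunny."},
    "greeting": {"keywords": {"hello", "hi"},
                 "response": "Hello! Ask me about the weather."},
}
ENTITIES = {"city": re.compile(r"\b(Seattle|Chicago|Boston)\b")}

def reply(utterance):
    words = set(re.findall(r"\w+", utterance.lower()))
    # Entity extraction: scan the raw text for known entity patterns.
    slots = {}
    for name, pattern in ENTITIES.items():
        match = pattern.search(utterance)
        slots[name] = match.group(0) if match else "your city"
    # Intent detection: first intent whose keywords appear in the text.
    for intent in INTENTS.values():
        if words & intent["keywords"]:
            return intent["response"].format(**slots)
    return "Sorry, I did not understand that."
```

A real bot builder replaces the keyword sets with a trained classifier and the regexes with learned entity recognizers, but the dialog bookkeeping is essentially this.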

Google Cloud AI services

The Google cloud has an extensive set of AI services available.

  • AutoML is Google’s tool for training their vision models on your data. If your image data is labeled, it will help create better labels.   If it is not labeled, they will help label it.    It uses transfer learning, which is a method to retrain a neural network to recognize new inputs.  By leaving many of the early layers of the previously trained network unchanged, basic features such as edges and shapes can be reused, and only the last few layers need to be relearned.  (This method is widely used by the other image services described here.)  Google also has a powerful Vision API that is capable of recognizing thousands of categories of images.
  • Cloud Machine Learning Engine is a cloud service that helps you manage a large cluster for very large ML tasks. It also allows you to use your trained algorithm with terabytes of data and thousands of concurrent users.
  • Dialogflow is Google’s tool for building bots and interfaces that support natural and rich interactions.
  • Video Intelligence. Suppose you have a large collection of videos and you want to be able to search for occurrences of specific words.  The Google cloud video intelligence API makes this possible.
  • Cloud Speech. Google has a long history with speech-to-text recognition that is widely used in their Android products and Google search.   The Google Cloud Speech API recognizes over 100 languages and variants.   It has context-aware recognition that filters out lots of background noise.   (They also have a very nice speech recognition kit that works with a Raspberry Pi.   I have used it.   It is a fun little project.)
  • Natural Language. Google’s text analysis is very good at parsing documents.   It is very good at entity recognition (tagging many phrases and words with Wikipedia articles).   It can also give you lists of relevant categories.   For syntax analysis it uses a version of their Parsey McParseface parser, which I used in my demo of building an application for Algorithmia described in this post.
  • Cloud Translation. Google had one of the earliest cloud translation services and it has become better over time.   It supports more than 100 languages.
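The transfer-learning idea mentioned above for AutoML — keep the early layers, retrain only the last few — comes down to simple bookkeeping over which layers receive gradient updates.  A framework-free sketch (the "network" and "gradients" here are toy stand-ins, not a real model):

```python
def make_network(layer_sizes, n_frozen):
    """Build toy layers; the first n_frozen keep their pretrained
    weights (here all 0.5) and are never updated."""
    return [{"weights": [0.5] * size, "trainable": i >= n_frozen}
            for i, size in enumerate(layer_sizes)]

def sgd_step(network, grads, lr=0.5):
    """One gradient step that skips the frozen early layers."""
    for layer, layer_grads in zip(network, grads):
        if layer["trainable"]:
            layer["weights"] = [w - lr * g
                                for w, g in zip(layer["weights"], layer_grads)]
    return network
```

After a step, the frozen layers (which already detect edges and shapes) are untouched while the final layers move toward the new labels — the retraining pattern the image services above exploit.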

Microsoft Azure

Azure’s machine learning services are divided into two categories: Artificial Intelligence and Cognitive Services.  There are currently three AI services:

  • ML services, which are based on their Machine Learning Workbench. The workbench is designed to guide you through the process of creating a data package from your data and then building a Python/PySpark script to evaluate it.   You can invoke the script locally, on an Azure VM, or in a container.   The workbench has the capability to fill in the gaps of data cleaning and algorithm selection in generating a final solution.
  • Batch AI services consist of a set of tools to help you marshal GPU and CPU cluster resources for parallel machine learning training using CNTK, TensorFlow and Chainer.
  • Azure AI services include Bot Builder, an SDK for creating bots, and a suite of bot templates.

The cognitive services are divided into four main categories.

  • Vision. This includes a vision API for content analysis that works with a Jupyter notebook, allowing you to upload images and get back a description in terms of recognized entities.  It also provides a caption.  A content moderator service allows you to flag images that may have unwanted content.  The custom vision service allows you to quickly train a vision app to recognize images from classes you provide.   The classes can be small (30 images in each), but it does not recognize your images when they are embedded in more complex scenes.   However, it does allow you to export the trained model as TensorFlow to run in offline applications.  The Face and Emotion APIs allow you to detect faces in images and gauge the mood of each.   The video indexer is impressive.  It can provide audio transcription, face tracking and identification, speaker indexing, visual text recognition, sentiment analysis and language translation.
  • Speech. The speech-to-text and text-to-speech services are there, but there is also a Custom Speech Service that allows you to add knowledge about specific jargon to the language model.  A Speaker Recognition API allows your apps to automatically verify and authenticate users using their voice and speech.  The Translator service is based on the work that was done for the Skype real-time speech translation system.   It can recognize languages and translate the spoken sentences into the target language.
  • Language. The Language Understanding Service allows your application to understand spoken commands like “Turn off the light” and other home automation tasks.  The Linguistic Analysis API provides sentence separation, part-of-speech tagging and constituency parsing.   The Text Analysis service provides sentiment analysis and key phrase extraction.  A Web Language Model is based on the Web N-Gram Corpus for analysis of web documents.
  • Knowledge. The Custom Decision Service uses reinforcement learning algorithms to extract features from a set of candidates when ranking articles and images for automatic inclusion in a web site.  The Entity Linking Intelligence Service API provides a tool to understand when a word is used as an actual entity rather than a part of speech or a general noun or verb.  This is done by looking at the context of the use of the word.  The Academic Knowledge API provides access to the Microsoft Academic Graph, which is data mined from the Bing index.   The QnA Maker is a REST API that trains a machine learning system to help bots respond in a more natural way to user requests.

AWS AI services

Amazon Web Services has seven major AI APIs:

  • Image and video Rekognition. The image recognition service offers as full a set of computer vision features as is available anywhere.  Object, scene and activity detection is continuously learning.  It can recognize objects and scenes.  Text in images, like street names or product names, can be read.  If you have a private library of photos, it can identify the people in them. When analyzing video it can identify certain activities happening in the frame.   Facial analysis recognizes age ranges and emotions.   When analyzing video it can track individual people as they go in and out of a frame.  Live or recorded video sent to a Kinesis Video Stream can be routed to Rekognition Video, and identified objects can be sent to Lambda functions that react in near real time.   Alternatively, video can be periodically loaded into S3 buckets, which trigger Lambda functions that invoke Rekognition for analysis.
  • Amazon Lex is a tool for building bots with voice and text input and response. It is the same technology that powers Echo’s Alexa.    The Lex console allows you to build a bot with ease.  Conversation flow is an important part of the bot interaction.   Lex supports simple mechanisms to allow you to tailor the flow to your application.
  • Comprehend. There are two main components to Amazon Comprehend.   The first is a set of tools to extract named entities (“Person”, “Organization”, “Locations”, etc.) and key phrases from a document. The more impressive capability of Amazon Comprehend is the topic modeling subsystem.   This is of interest if you have a large collection of documents and you would like to see them classified into a set of N bins, where N is a number you pick.   Comprehend will take your collection of documents and apply a Latent Dirichlet Allocation-based learning model to separate them into N bins, with each bin defined by a set of key words it has discovered.   (At the end of this article we will demonstrate Amazon Comprehend.)
  • Translate. This service provides real-time and batch language translation.   The service is protected by SSL encryption.
  • If you have an MP3 or WAV recording and you want to add subtitles, the Transcribe service will render all of the voice audio as text and also insert timestamps for each word.   You can then use Translate to convert the transcript to another language.   They say they are adding specific voice identification soon.
  • Polly is the Amazon text-to-speech API.   It is far from the robotic-sounding speech generation we saw in the past.   It has 47 different voices spread over many languages.   (I have used it and it is both impressive and fun.)
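The S3-triggered Rekognition path described above can be sketched as a small Lambda handler.  This is a minimal sketch, not AWS sample code: it only parses the S3 event to find the uploaded object, and the Rekognition call it would then make is left as a comment, since actually running it requires the boto3 client and IAM permissions.

```python
import urllib.parse

def handler(event, context=None):
    """AWS Lambda entry point for S3 "object created" events.

    Extracts the bucket and key of each uploaded video so a
    Rekognition job could be started on it.  The Rekognition call
    itself is stubbed out to keep the sketch self-contained.
    """
    results = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 event notifications are URL-encoded
        # (e.g. spaces arrive as '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.append((bucket, key))
        # A real handler would then call something like:
        # rekognition.start_label_detection(
        #     Video={"S3Object": {"Bucket": bucket, "Name": key}})
    return results
```

The detected labels would come back asynchronously, which is why the streaming Kinesis path is the better fit when near-real-time reaction is needed.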


If you need to build a bot that understands English, French and Mandarin, replies with spoken and correctly accented Italian, can help you identify your friends and celebrities in your Instagram photos, and can also mine your Twitter feed, you are in luck.  The tools are there.  But if you are expecting emergent artificial intelligence, you are out of luck.  Alexa, Cortana and Siri are each good at fast facts but otherwise dumb as a post.

It is also now clear that this technology is a boon to those with more nefarious goals.  If you are a government security agency with access to lots of cameras in public places, keeping track of your citizens is now a snap.   We see that social media is now swarming with bots that not only sell soap but also promote and propagate lies and propaganda.    Serious questions are being raised about the potential threat these technologies pose to modern democracies.   The social media companies are aware of the challenge of eliminating the bots that skew our national discussions, and we hope they are up to the cleanup task.

There is also much to be excited about.   The technology behind these AI services is also helping us use vision and sensing that can truly help mankind.   We can “see” the planet at an enlarged scale.  We can spot droughts, crop disease and the effects global warming is having on the planet in greater detail because of the speed and accuracy of image analysis.    We can monitor thousands of sensors in our environment that help us improve our quality of air and water and we can better predict potential problems before they occur.  The same sensor and vision technology help us scan x-ray and other medical images.

All of these AI advances are going to give us safer roads with driverless cars and robots that magnify the power of the individual worker in almost every domain.   I look forward to the time when Alexa or Cortana can become a real research partner, helping me scan and review the scientific literature and pointing me to discoveries that I would most certainly miss today.


In the following paragraphs we look at one of the cloud services in depth.   In future articles we will examine other capabilities and applications.

Text Analysis with Amazon Comprehend

As with everything in AWS, these services can be accessed via the command line interface or the APIs.   However, the console provides a very simple way to use them.   We will test Amazon Comprehend using the named entity and key phrase interface.  The service is accessed via their API explorer.

We selected a paragraph about the discovery of DNA from Wikipedia and pasted it into the entity/key phrase extractor.    The results are shown in figures 1, 2 and 3.


Figure 1.  Inserting a paragraph into the API explorer.


Figure 2.   The list of Entities


Figure 3. The key phrases

As can be seen the system does a very good job with both the entity and key phrase tasks.   In particular it does a great job of categorizing the named entities.

Topic Modeling

The more impressive capability of Amazon Comprehend is the topic modeling subsystem.   This is of interest if you have a large collection of documents and you would like to see them classified into a set of N bins, where N is a number you pick.   To try this out, I used the ArXiv science abstract collection I have used in previous document classifier experiments.   Each document is the text of an abstract of a scientific research paper.   To use Comprehend you put the documents in an AWS S3 bucket.   I have 7108 documents and they are in the bucket.  (If you are interested, the individual files can be accessed by this url*arxiv where * is an integer between 0 and 7108.)

Invoking the topic modeler from the console is trivial.  You simply fill in a form.  The form for my experiment is shown below in Figure 4.


Figure 4.   Invoking the Topic modeler.

In this test case the topic modeler ran in about five minutes and produced a pair of CSV files.   One file contained a set of tuples for each document.  Each tuple is a triple consisting of the document name, the name of a topic bin, and a score of fit for that bin.   For example, here are the first 11 tuples.  The abstract documents are drawn from five fields of science: physics, biology, computer science, math and finance. We have added a fourth column that gives the science category for the listed document.

Document  Topic  Score     Actual topic
0         0      0.242696  compsci
0         5      0.757304  compsci
1         1      1         math
2         0      0.546125  physics
2         4      0.438275  physics
2         5      0.015599  physics
3         1      1         math
4         8      1         physics
5         0      0.139652  physics
5         3      0.245669  physics
5         5      0.614679  physics

As can be seen, document 0 is computer science and scores in topic 0 and highly in topic 5.    Documents 1 and 3 are math and squarely land in topic 1.  Documents 2, 4 and 5 are physics and are distributed over topics 0,3,4,5 and 8.   The algorithm used in the topic modeler is described in Amazon’s documentation as follows.
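Going from the tuple file to a summary like Figure 5 is a simple aggregation: assign each document to its highest-scoring topic bin, then count, per science field, how the documents land in each bin.  A minimal sketch, using the sample tuples above as data:

```python
from collections import defaultdict

# (document id, topic bin, score, actual field), as in the tuple file above.
tuples = [
    (0, 0, 0.242696, "compsci"), (0, 5, 0.757304, "compsci"),
    (1, 1, 1.0, "math"),
    (2, 0, 0.546125, "physics"), (2, 4, 0.438275, "physics"),
    (2, 5, 0.015599, "physics"),
    (3, 1, 1.0, "math"),
    (4, 8, 1.0, "physics"),
    (5, 0, 0.139652, "physics"), (5, 3, 0.245669, "physics"),
    (5, 5, 0.614679, "physics"),
]

def best_topics(rows):
    """Map each document to its highest-scoring (topic, score, field)."""
    best = {}
    for doc, topic, score, field in rows:
        if doc not in best or score > best[doc][1]:
            best[doc] = (topic, score, field)
    return best

def crosstab(rows):
    """Count the winning topic bins for each science field."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc, (topic, score, field) in best_topics(rows).items():
        counts[field][topic] += 1
    return counts
```

On the sample, `crosstab(tuples)` puts documents 1 and 3 in topic 1 under math and document 0 in topic 5 under compsci; run over the full 7108-document output and converted to percentages, this is the computation behind Figure 5.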

“Amazon Comprehend uses a Latent Dirichlet Allocation-based learning model to determine the topics in a set of documents. It examines each document to determine the context and meaning of a word. The set of words that frequently belong to the same context across the entire document set make up a topic.”

If we look at the corpus as a whole we can see how well the topic modeler did in relation to the known topics.    The result is in Figure 5 below which gives the percent of papers in each science area that had scores in  each modeler topic.


Figure 5.  Topics selected by the model for each science discipline

As can be seen, the modeler's topic 000 did not differentiate very well between physics, bio and compsci.   To look closer at this we can examine the other CSV file generated by the modeler.   This file lists the key words the modeler used to define each topic.   In the case of topic 000 the words were:


As can be seen these are words that one would expect to see in many articles from those three areas.  If we look beyond topic 000, we see physics is strong in topic 3 which is defined by the words


This topic is clearly physics.  Looking at computer science, we see the papers score strongly in topics 005 and 007.   Those words are


We included machine learning in the computer science topics so this result is also reasonable.   For math the strong topics were 001 and 006 and the corresponding words were

'distribution', 'method', 'function', 'sample', 'estimator', 'estimate', 'process', 'parameter', 'random', 'rate', 'space', 'prove', 'mathbb', 'group', 'graph', 'algebra', 'theorem', 'finite', 'operator', 'set'

which constitutes a nice collection of words we would expect to see in math papers.  For finance topic 009 stands out with the following words.

'market', 'price', 'risk', 'optimal', 'problem', 'function', 'measure', 'financial', 'strategy', 'option'.

The only area where the topic modeler failed to be very clear was in the area of biology where topics 004 and 005 were the best.   Those words were not very indicative of biology papers:

'model', 'parameter', 'data', 'propose', 'distribution', 'inference', 'simulation', 'bayesian', 'fit', 'variable' , 'method', 'analysis', 'learn', 'base', 'network', 'approach', 'value', 'regression', 'gene'.

As an unsupervised document classifier, the Amazon Comprehend modeler is impressive.   Classifying these science abstracts is not easy, because science is very multidisciplinary and many documents cross the boundary between fields.   We have looked at this problem in a previous post, Algorithmia Part 2. Document Embedding with Gensim and Deploying a Trained Model in the Algorithmia Cloud, and in our book Cloud Computing for Science and Engineering, where we describe many of the challenges.   One shortcoming of the Amazon modeler is that it does not provide an easy way to score a new document against the models built from the corpus.  This should be easy to do. In the analysis above we looked at how broad scientific domains are mapped over the detected category bins.  We also need to look at how good the individual categories are at grouping similar abstracts.  This is equivalent to looking at the columns of the table in Figure 5 above.   If we take topic 006, which is heavily associated with math, we can print the titles and the ArXiv sub-categories they came from.   A sample is shown below.

‘Differential Calculus on Cayley Graphs [cs.DM]’,
‘Coherent rings, fp-injective modules, and dualizing complexes [math.CT]’,
‘Self-dual metrics with maximally superintegrable geodesic flows [gr-qc]’,
‘New atomic decompositons for Bergman spaces on the unit ball [math.CV]’,
‘Presenting Finite Posets [cs.LO]’,
‘The Whyburn property and the cardinality of topological spaces [math.GN]’,
‘Absolutely Self Pure Modules [math.RA]’,
‘Polynomials and harmonic functions on discrete groups [math.GR]’,
‘Free Resolutions of Some Schubert Singularities in the Lagrangian  Grassmannian [math.AG]’,
‘Connectedness properties of the set where the iterates of an entire  function are unbounded [math.DS]’,
‘A Purely Algebraic Proof of the Fundamental Theorem of Algebra [math.HO]’,
‘A cell filtration of the restriction of a cell module [math.RT]’,
‘Higher dimensional Thompson groups have subgroups with infinitely many   relative ends [math.GR]’,
‘PI spaces with analytic dimension 1 and arbitrary topological dimension [math.MG]’,
‘Eigenvalues of Gram Matrices of a class of Diagram Algebras [math.RA]’

With the exception of the first, third and fifth documents they are all math, and even those three look like math.   On the other hand, looking at a sample from category 000, we see a true hodgepodge of topics.

‘A stochastic model of B cell affinity maturation and a network model of   immune memory [q-bio.MN]’,
‘Precise determination of micromotion for trapped-ion optical clocks [physics.atom-ph]’,
‘Quantum delocalization directs antenna absorption to photosynthetic   reaction centers []’,
‘Fluorescence energy transfer enhancement in aluminum nanoapertures [physics.optics]’,
‘Direct Cortical Control of Primate Whole-Body Navigation in a Mobile   Robotic Wheelchair [q-bio.NC]’,
‘Condition for the burning of hadronic stars into quark stars [nucl-th]’,
‘Joint Interference Alignment and Bi-Directional Scheduling for MIMO   Two-Way Multi-Link Networks [cs.IT]’,
‘MCViNE — An object oriented Monte Carlo neutron ray tracing simulation   package [physics.comp-ph]’,
‘Coherent addressing of individual neutral atoms in a 3D optical lattice [quant-ph]’,
‘Theoretical analysis of degradation mechanisms in the formation of   morphogen gradients []’,
‘Likely detection of water-rich asteroid debris in a metal-polluted white   dwarf [astro-ph.SR]’,
‘A Study of the Management of Electronic Medical Records in Fijian   Hospitals [cs.CY]’,
‘Self-assembling interactive modules: A research programme [cs.FL]’,
‘Proceedings Tenth International Workshop on Logical Frameworks and Meta   Languages: Theory and Practice [cs.LO]’,

Setting aside topic bin 000, we certainly see strong coherence among the documents.

Reflections on the Unidata Cloud Computing Workshop


The University Corporation for Atmospheric Research (UCAR) was created in 1960 to be the US national hub for research and education in the atmospheric and earth sciences.  UCAR manages the National Center for Atmospheric Research (NCAR) which is devoted to fundamental research around predictions about our atmosphere.  UCAR is also the home to Unidata, a community program to facilitate weather data access and interactive analysis for the university community.  Unidata’s director Mohan Ramamurthy succinctly described Unidata’s mission as reducing the “data friction”, lowering the barriers for accessing and using data, and shrinking the “time to science”.  Unidata supports 30 different streams of real-time weather data from sources including radar, satellite and model output to over 1000 computers worldwide.  Unidata’s outbound traffic is about 31 terabytes/day.  They move more data via Internet 2 than any other advanced application.

While Unidata’s program is incredibly successful they see a critical turning point ahead.   The data volumes are growing so fast that the model of distributing raw data may no longer be sustainable.    Ramamurthy writes “We need to move from ‘bringing the data to the scientist’ to ‘bringing the science to the data’.” For these and other reasons, Unidata has made a decision to transition data services to the cloud.

Last week (May 31-June 2, 2017) UCAR and Unidata hosted a workshop exploring the role that cloud computing is playing in atmospheric science research and the potential that it could play in the future.   Science is composed of many sub-disciplines.   Some, such as the life sciences, have rapidly embraced cloud computing as an important enabling technology.  Other disciplines, especially those with a long tradition of large scale computational simulation on supercomputers, have been slower to see a role for the cloud.  Atmospheric science (AS) is in this latter category. What this workshop made very clear is that the AS community has now recognized that the cloud has the potential to revolutionize both research and education.   But there are still challenges that must be addressed.

Using the Cloud for Computation: Challenges and Opportunities

One of the great breakthroughs in Numerical Weather Prediction (NWP) has been the development of the “Weather Research and Forecasting” (WRF) model.  This is the state-of-the-art mesoscale numerical weather prediction software system.   WRF is basically a specialized computational fluid dynamics program and data assimilation system that can generate simulations based on real observational data or idealized conditions.   It can generate predictions conforming to computational meshes that can range in scale from 10s of meters to many kilometers.   WRF has 30,000 registered users in 150 countries.

WRF runs as a classic MPI-based parallel program.  For really big simulations, such as those used in production in major forecast centers, it can run for hours on 10,000 cores on a supercomputer.  In fact, the largest core count for a single WRF job (that we know of) is about 140,000 on the Blue Waters supercomputer at the University of Illinois.  For small experiments and student training it can run on 16 cores in a few hours and still provide useful results for researchers.   A big challenge for using WRF is that it is an incredibly complex application that requires an experienced professional to properly deploy it and configure it for use.   Training students to use WRF is a major challenge if the students are required to deploy it themselves.  The students face additional hurdles if they don’t have access to sufficient computational resources to run the program.

A Cloud Solution for Modest WRF Runs

As part of the Big Weather Web NSF project, J. Hacker, J. Exby and K. Fossell from NCAR recognized the problem confronting educators and came up with a great solution. They demonstrated that by putting WRF and all its required components and configuration files in a Docker container that could be easily run on an AWS EC2 instance, they could revolutionize the use of WRF in classroom and lab experimentation. Deploying WRF is no longer an obstacle for students. Their container allows the user to change input data sets and boundary conditions, change the physics assumptions, and generate bit-level reproducible simulation results.  In an experiment with the University of North Dakota, students used the container on AWS to create an ensemble output of a tornadic supercell over North Dakota.   The total AWS cost of an 11-day student team project was $40.
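A back-of-the-envelope model makes the $40 figure plausible: on-demand cloud pricing means you pay only for the hours the instances actually run.  The hourly rate below is a hypothetical placeholder (EC2 prices vary by instance type and region), not a quoted AWS price:

```python
def project_cost(hourly_rate, hours_per_day, days, n_instances=1):
    """Total on-demand cost for a project that only pays for the
    hours its instances are actually running."""
    return hourly_rate * hours_per_day * days * n_instances

# Hypothetical: one VM at $0.90/hr, run about 4 hours/day for 11 days,
# comes to roughly $40.
cost = project_cost(0.90, 4, 11)
```

The same arithmetic is what makes a dedicated local cluster hard to justify for classroom use: the cluster's cost accrues whether or not students are running jobs.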

Running containerized versions of WRF on a 32-core cloud VM addresses a very large set of educational and research use cases.   One concern that frequently came up at the workshop was the fear of spending too much money.  Because the cloud charges by the hour, it is feared that users may be discouraged from experimenting.   Users that are accustomed to doing work on a dedicated cluster were the most concerned with this problem.   However, Roland Stull, the director of the Geophysical Disaster CFD Center at the University of British Columbia was convinced by his graduate students that the cloud was a great platform for experimentation.   They used the Google cloud to build a virtual HPC cluster.  After experimenting with different configurations  they found that 32 8-core VMs was the best configuration for WRF and  produced results in about the same time as their local 448 core HPC cluster.  Their entire experience grew from one graduate student, David Siuta, doing experiments in the cloud using a tiny Google free cloud account.

The Biggest Cloud Challenge: Data

While doing large computational simulations with tools like WRF is important, that was not the central topic of the workshop.  The most important opportunity for using cloud resources in AS is that the cloud provides a great way to share data and post-processing analysis and to simplify the data distribution challenges that face Unidata.   The cloud vendors, notably Amazon and Google, have done a great job of making important data collections available in the cloud.  Kevin Jorissen from Amazon described some of the data they make available to the community.   Their data collections are open to the public and include data from many scientific disciplines. In the AS case, this includes 270 TB of individual NEXRAD radar volume scan files and real-time chunks.  These are available as objects on Amazon S3 and accessible via standard REST APIs.   In addition, Amazon Simple Notification Service (SNS) can be used to subscribe to notifications when new data arrives.  For geoscience, AWS has over 400,000 Landsat satellite image scenes in their archive.   They also have the NOAA Global Forecast System model data and High-Resolution Rapid Refresh (HRRR) model data available in rolling one-week archives.

While these AWS resources are free to access, the challenge for Unidata and others that have moved model execution and data to the cloud is that the cost of data downloads is a large component of their overall cloud bill.  In fact, it is not possible for Unidata to put all of their data on AWS; the cost of data egress would overwhelm their budget.   Matthew Alvarado shared his experiences with using AWS.  Their data downloads cost $90/TB, and that has encouraged them to follow the path that Unidata is considering, one that Jim Gray advocated in 2005:  rather than moving the data to the compute, move the compute to the data.   As Alvarado states, “Do post-processing and analysis on the cloud. – Use Python and R to avoid license charges – Use THREDDS to make data accessible via netCDF viewers like Panoply or via the web.”

Unidata has gone a long way in this direction.  To move more data processing to the cloud, they have deployed an open source version of the Advanced Weather Interactive Processing System (AWIPS) to the Azure cloud.  AWIPS is now in use by 44 universities.   It consists of a data ingest, processing, and storage server called the Environmental Data EXchange (EDEX), and user client programs including the Common AWIPS Visualization Environment (CAVE) and a Python library that can be used to pull subsets of the data to the user’s desktop.

Unidata has also “dockerized” many of their tools, including the Integrated Data Viewer (IDV), the THREDDS Data Server, and the Local Data Manager (LDM), so that they can be easily deployed by users locally or in the cloud.   The CloudIDV allows IDV to run in the cloud, so that the only download bandwidth the user sees is the graphical images that IDV generates.  Unidata has also released Siphon, a Python library that can query NetCDF and other data hosted on the THREDDS Data Server.   Using this library, it is possible to build Python analytic services that explore the data while it is still in the cloud. These are great first steps in moving the compute to the data … as long as the data is in a data center in the same region as the VM running the CloudIDV container.

Final Thoughts

Weather data is among the most complex and dynamic “Big Data” topics in science.   It is extremely heterogeneous.  To do predictions of the weather one must understand the flow of information streaming in from sensors that range from NeXRAD radar and satellite images to ground sensors and human observations.   This data feeds simulation models that generate ensembles of output data that can be mined to make predictions.    It is important that the data and tools are public so that the widest possible collection of researchers can experiment with it.

We are now in the middle of a revolution in our ability to derive knowledge from big data.   New statistical tools and advanced machine learning techniques are being developed every day.   These have been applied to discover patterns in dull but profitable endeavors like consumer shopping preferences, but they are also central to the race to deliver self-driving cars.  And they are becoming increasingly important in scientific domains like genomics and physics.    Weather and climate data are also well suited to exploration with the aid of machine learning.   A great example is a paper by Liu and colleagues at LBL and elsewhere that demonstrates how deep learning via convolutional neural networks can be applied to the detection of extreme weather in climate datasets.    I am certain that, given access to the best data and strong collaborations between the ML and AS communities, even more progress will be made.

Making data available in public cloud archives makes it possible to build a computational environment of shared tools and results.   In the machine learning research world, there is a strong movement toward open sharing of data, research challenges and results, with sites that provide thousands of case studies and data collections for researchers.   A similar cloud-based movement in the atmospheric sciences could build on what has already started with the Big Weather Web.

Finally, Unidata must be commended for putting together such a great workshop.



Our goal for this site is to provide you with the online content for the book.  We are also very interested in getting your feedback concerning errors or suggestions for future content.