Cloud Services for Transfer Learning on Deep Neural Networks

The breakthroughs in deep learning over the last decade have revolutionized computer image recognition.   The state-of-the-art deep neural networks have 10s of millions of parameters and they require training sets of similar size.   The training can take days on a large GPU cluster.   The most advanced deep learning models can recognize over 1000 different objects in images with surprising accuracy.   But suppose you have a computer vision task that requires that you classify a few dozen different objects.   For example, suppose you need to identify ten different subspecies of wolf, or different styles of ancient Korean pottery or paintings by Van Gogh?    These tasks are far too specific for any of the top-of-the-line pretrained models.  You could try to train an entire deep network from scratch but you if you only have a small number of images of each of your specialized classes this approach will not work.

Fortunately, there is a very clever technique that allows you to “retrain” one of the existing large vision models for your specific task.  It is called Transfer Learning and it has been around in various form from the mid 1990s.   Sebastien Ruder has an excellent blog that describes many aspect of transfer learning and it is well worth a read.

In this article we look at the progress that has been made turning transfer learning into easy-to-use cloud services.    Specifically, we will look at four different cloud services for building custom recognition systems.   Two of them are systems that have well developed on-line portal interfaces and require virtually no machine learning expertise.   They are the IBM Watson Visual Recognition Tool and Microsoft Azure Cognitive Services Custom Vision service.   The other two are tools that require a bit of programming skill and knowledge about deep networks.  These are the Google “Tensorflow for Poets” Transfer learning package and the Amazon Sagemaker toolkit.    To illustrate these four tools, we will apply each systems to the task of classifying images of galaxies.   The result is not deep from an astronomy perspective (because I am not even an amateur astronomer!), but it illustrates the power of the tools.   We will classify the galaxies into four types: barred spiral, elliptical, irregular and spiral as illustrated in Figure 1.     We will do the training with very small training sets:  19 images of each class that were gathered from Bing searches.

The classification task is not as completely trivial as one might assume.   Barred spiral galaxies are a subspecies of spiral galaxy that are distinguished by a “bar” of stars at the origin of the spirals.  Consequently, these two classes are easy to misidentify. Irregular galaxies can be very irregular.  (I like to think of them as galaxies that have not “got it together” enough to take on one of the other forms.)   And elliptical can often look like spiral or irregular galaxies.


Figure 1.   Two samples each of the four galaxy types.  The images were taken from Bing searches.

We have made these image files available at AWS S3 in two forms: a zip files barred, elliptical, irregular, spiral and test and in REC format as galaxies-train.rec and galaxies-test.rec.

Transfer Learning for DNNs

Before we launch into the examples, it is worth taking a dive into how transfer learning work with a pre-built deep learning vision model.   A good example, and one we will use, is the Inception-V3 model shown in Figure 2.


Figure 2.  Inception-V3 deep network schematic.   Image from the Google Research blog “Train your own image classifier with Inception in TensorFlow“.

In  Figure 2, each colored blob is a subnetwork with many parameters.  The remarkable thing about deep networks is how much of lower layers of convolution, pooling, concatenation seem to capture abstract qualities of images such as shapes and lines and regions.  Suppose the network has L layers  At the risk of greatly oversimplifying one can say that it is only at the last few layers that specific image classification takes place.   A simple way to do transfer learning is to replace the last two layers with two new ones and retrain the trained parameters of layers 0 to L-2 “constant” (or nearly so).

The paper  “Rethinking the Inception Architecture for Computer Vision” by  Szegedy et al. describes InceptionV3 in some detail.   The last two layers are a fully connected layer with 2024 inputs and 1000 softmax outputs.     To retrain it for 4 outputs  we replace last layers as illustrated in Figure 3 with two new layers. We now have only one matrix W of dimension 2024 by 4 of parameters we need to learntransfer-net

Figure 3.   Modified network for transfer learning.

If the training algorithm converges, it will be literally thousands of time faster than training the original.  A nice paper by Yosinski et al takes an in-depth look at feature transferability in deep networks.   There are other ways to do transfer learning on deep nets than just holding the L-2 layers fixed.   Instead one can allow some fine tuning of the top most layers with the new data.    There is much more that can be said on this subject, but our goal here is to evaluate some of the tools available.

The IBM Watson Visual Recognition tool.

This transfer learning service is incredibly easy to use.   There is an excellent drop-and-drag interface and a nodeJS API as well as a Python API.   To test it we clicked on the create classifier button and dragged the zip files for our four classes of galaxies onto the interface as shown below.


Figure 4. Visual recognition tool Interface with dragged zip files for the galaxy classes.

Within a few minutes we had a view of the classifier that we could test.   The figure below illustrates the results from dragging three examples from the training set to the classifier interface.  As you can see the interface returns the relative strength of membership in each of the classes.

To invoke the service, you need three things:

  1. your IBM bluemix api_key which you were given when you logged into the service the first time to build the model.
  2. Once your model has been built you need the classifier ID which is visible on the tool interface.
  3. you must install the watson_developer_cloud module with pip.


Figure 5:  Three vertical panels show the result of dragging one of the training images onto the classifier interface.

import watson_developer_cloud
from watson_developer_cloud import VisualRecognitionV3
visual_recognition = VisualRecognitionV3( 
      api_key = '1fc969d38 your key here 7f7d3d27334', 
      version = '2016-05-20')
classifier_id = 'galaxies_1872954591'
image_url =
param = {"url": image_url, "classifier_ids":[classifier_id]}

The key elements of the code are shown above.  (There may be other versions of the Python API.  This one was discovered by digging through the source code.   There is little other documentation.)  We built a Jupyter notebook that uses the api to compute the confusion matrix for our test set.  The Watson classifier will sometimes refuse to classify an image into one of our categories, so we had to create a “none” tag to identify these cases.  The results are very good, with the exception of the confusion of spiral and barred spiral galaxies.


Figure 6: Results from the Watson classifier for our 40 image test set.

Computing the confusion matrix for the training set gives a perfect score as shown below.


Figure 7: Confusion matrix given the training set as input.

The Jupyter notebook in HTML and IPYNB formats are available in S3.  One additional comment is needed.   Because this service is a black box, we have no idea what transfer learning service is use.

Microsoft Azure Cognitive Services Custom Vision.

The Microsoft Azure Custom Vison Service is another very well designed and easy to use system.   It is also a black box, so we have no idea how it works.  The assumption is that intended users don’t need to know and the designers are free to change the algorithm if they fine better ones.

Once you log in you create a new project as shown in Figure 8 below.   Then you can upload your training data using another panel in the interface.


Figure 8.   The  left panel defines the galaxy name and type.  The right panel is for uploading the training set.

Once the training set is in place you can see your project with a view of some of your images as shown in Figure 9.   There is a button to click to start the training.   In this case it takes less than a minute to see the results (Figure 10).


Figure 9.   The view of a sample of your training set.    The green button starts the training.


Figure 10.   The results from 2 iterations of the training.

If you are not pleased with the result of the training, you can try adding or removing images from the training set and train it again.

During the training with this data we made 3 iterations. the first was with the initial data. The system recognized that one of the elliptical galaxies was a duplicate, so the second iteration included an additional elliptical galaxy. The system will not allow a new iteration until you have modified the data, so the third iteration replaced a random spiral galaxy with another.  The results here are not great, but not bad for the small size of the training set.    As shown in Figure 11, the confusion matrix is better than the IBM case for distinguishing barred elliptical from elliptical but not as good at recognizing the irregular galaxies.


Figure 11.  Confusion matrix for Azure Custom Vision test.

Using the training data to compute the confusion we get an almost perfect score, but one barred spiral galaxy is recognized as spiral.


Figure 12.  Confusion Matrix for Azure Custom vison with training data

We looked at the case that confused the classifier and it can be seen to one that is on the border between barred spiral and spiral.   The image is contained in the full Jupyter notebook (html versionipynb version).

To use the notebook you need to have your prediction and training keys and the project id for the trained model.   You will also need to update your version of the Azure Python SDK.   The code below shows how to invoke the predictor.  The notebook gives the full details.

from import prediction_endpoint
from import training_api
training_key = 'aaab25your training key here 8a8b0' 
prediction_key = "09199your prediction key here b9ae" 
trainer = training_api.TrainingApi(training_key) 
project_id = 'fcbccf40-1bce-4bc4-b4ea-025d63f1014d' 
project = trainer.get_project(project_id)
iteration = trainer.get_iterations([2]
image = “”
predictor = prediction_endpoint.PredictionEndpoint(prediction_key)
results = predictor.predict_image_url(,, url=image)
for prediction in results.predictions: 
  print("\t" + prediction.tag + ": {0:.2f}%".format(prediction.probability * 100)

The printed results give the name of each class and the probability that it fits the provided image.

Tensorflow transfer learning with Inception_v3

Google has built a nice package called Tensorflow For Poets that we will use for the next test.  This is part of their Google Developer Codelabs.

You need to clone the github repo with the command

git clone
cd tensorflow-for-poets-2

Next go to the subdirectory tf_files and create a new directory there called “galaxies” and put four subdirectories there: barredspiral, spiral, elliptical, irregular with each containing the corresponding training images. Next do

pip install --upgrade tensorflow

The Tensorflow code to do transfer learning and retrain a model is in the subdirectory scripts in a file   It follows the transfer learning method we described earlier by replacing the top to layers of the model with a new, smaller fully connected layer and a softmax layer.   We can’t go into the details here without a deep dive into Tenorflow code which is beyond the scope of this article.   Suffice it to say that it works very nicely.

The command to “retrain” the inception model is

python -m scripts.retrain \   
     --bottleneck_dir=tf_files/bottlenecks \   
     --how_many_training_steps=500 \   
     --model_dir=tf_files/models/ \   
     --summaries_dir=tf_files/training_summaries/"inception_v3" \   
     --output_graph=tf_files/retrained_graph.pb \   
     --output_labels=tf_files/retrained_labels.txt \   
     --architecture="inception_v3" \   

If all goes well you will finally get the results that look like this

INFO:tensorflow:Final test accuracy = 88.9% (N=9)

Invoking the re-trained model is simple and you don’t need to know much Tensorflow to do it.  You essentially load the image as a tensor and load the model graph and invoke it with the input tensor.  The complete Python code for this in in the Jupyter notebook (in html and ipynb formats).

As with the other examples we have computed the confusion matrix for the test set and training set as shown below.


Figure 13.  Tensorflow test results.


Figure 14.  Tensorflow results on the training set

As can be seen the retrained model as the usual difficulty distinguishing between spiral and barred spiral and irregular sometimes looks like elliptical and sometimes spiral.   Otherwise the results are not too bad.

Amazon SageMaker

SageMaker is a very different system from the tools described above.  This article will not attempt to cover SageMaker thoroughly and we will devote a more complete article to it soon.

Briefly, it consists of a complete system for training and hosting ML models.  There is a web portal but the primary user interface is Jupyter notebooks.   Figure 15 illustrate the view of the portal after we created several experiments.  It nicely illustrates the phases of SageMaker execution.

  • You first create a Jupyter instance and a notebook. When you create a Jupyter notebook instance from the portal you are actually deploying a virtual machine on AWS.
  • You use the notebook to create ML training jobs. The training jobs take place on a dynamically allocated container cluster.
  • When training is complete you create a model which is stored and managed by SageMaker.
  • When you have a model you can create an endpoint that can be used to invoke the model from your application.


Figure 15.  SageMaker portal interface.

To train a new model you provide the name of an AWS S3 bucket where your data is stored and a bucket where the output is going to be placed.

When the Jupyter VM spins up you see it in your browser.   The first thing you discover is a large collection of demo notebooks covering a host of topics.   You are not restricted to these.  There is also a library of tools to use Apache Spark from SageMaker.  You can also upload your own notebooks with TensorFlow or MXNet models for training.   Our you can create a docker image with your own algorithms.

In the example are interested here we discovered a SageMaker example notebook, Image-classification-transfer-learning.ipynb and made a copy we called sagemaker-galaxy-predict that you can access (in html or in ipynb  format).   As with the IBM and Microsoft examples, the actual transfer learning algorithm used is a black box, but there are some hints and parameters you can adjust.

When you train a deep neural network, you are find values for the millions of parameters in the network.  (As we have described above there are many fewer parameters in transfer learning.)  But there are an additional set of parameters, called hyperparameters, that describe the network architecture and the learning process.   In the case of the transfer learning notebook you must specify the following hyperparameters:  the number of layers in the network, the training minibatch size, the training rate and the number of training epochs.   There are defaults for these based on the example that SageMaker provides, but they did poorly for the galaxy experiment.    This left us with a four-dimensional hyperparameter space to explore.   After spending about two hours trying different combinations we came up with the table below.

# The algorithm supports multiple network depth (number of layers). They are 18, 34, 50, 101, 152 and 200
num_layers = 101
# we need to specify the input image shape for the training data
image_shape = "3,224,224"
# we also need to specify the number of training samples in the training set
num_training_samples = 19*4
# specify the number of output classes
num_classes = 5
# batch size for training
mini_batch_size =  21
# number of epochs
epochs = 5
# learning rate
learning_rate = 0.0018
# Since we are using transfer learning, we set use_pretrained_model to 1 so that weights can be 
# initialized with pre-trained weights
use_pretrained_model = 1

We are absolutely certain that these are far from optimal.   Once again we computed a confusion matrix for the test set and the training set and they are shown in Figure 16 and 17 below.


Figure 16.   Confusion matrix for SageMaker test data.


Figure 17.  Confusion matrix for SageMaker on training data.

As can be seen, these are not as good as our other three examples.    The failure is largely due to poor choices for the hyperparameters.  It should be noted that the Amazon team is just now starting a hyperparameter optimization project.   We will return to this example after that capability is available.


In this report we examined four computer vision transfer learning service.   We did this study using a very tiny example to see how well each service performed.   We used the simple confusion matrix to give us a qualitative picture of performance.  Indeed, these matrices showed us that distinguishing the barred spiral galaxies from the non-barred spiral ones was often challenging and that irregular galaxies are easy to misclassify.   If we want a quantitative evaluation we can compute the accuracy of each method using the test data.  The results are Azure = 0.75, Watson = 0.72, Tensoflow = 0.67 and SageMaker = 0.6.   However, given the very small size of the data sets, we argue that it is surprising that we could get reasonable results with such little effort.

Building the best galaxy classifier was not our goal here.  Real astronomers can do a much better job building systems that can answer much more interesting questions the classification task posed here. The goal of this project has been to show what you can do with cloud transfer learning tools.   The IBM and Azure tools were extremely easy to use and, within a few minutes you had a model constructed. It was not hard to access and use these models from a Python client.  The Tensorflow example from Google allowed us to do the transfer learning on a laptop.  SageMaker was fun to use (if you like Jupyter), but tuning the hyperparameters is a challenge.   A follow-up article will look at additional SageMaker capabilities.

Finally,  if any reader can improve on any of these results for this small dataset, please let me know!

A Brief Survey of Cloud AI Services

The commercial clouds are in a race to see who can provide the most interesting and useful AI services on their cloud platform.   This work began in the research laboratories in universities and companies over the past 25 years, but the big breakthroughs came when deep learning models trained on massive data collections began to reach levels of human accuracy.  For some time now, the public cloud companies have provided custom virtual machines that make it ease for technically sophisticated customers to use state of the art ML and neural network tools like TensorFlow, CNTK and others.  (We described these in here.)  But the real competition is now to provide services for building smart applications that can be used by developers lacking advanced training in machine learning and AI. We now have speech recognition, language translation, image recognition capabilities that can be easily integrated into web and mobile applications.

In the following paragraphs we will look at the AI services provided by  IBM, Google, Microsoft and Amazon.  These are certainly not the only providers.  Salesforce has the myEinstein platform and small companies like Algorithmia and not-so-small Genpact  also provide services and consulting in this area.

What becomes abundantly clear when you study the details of the offerings is that they all cover the same basics.  This includes tools for building bots, natural language translations, speech-to-text and text-to-speech and unstructured document analysis.   But what one also discovers is that each provider has some services that standout as being a bit more innovative that that offered by the others.  We conclude with an overview of the trends we see and thoughts about the future of cloud AI services.

This is the first of a series that we will do on this topic.   Future articles will explore some of these capabilities in more technical depth.  For example, at the end of this article, we look at an example of doing text analysis with Amazon Comprehend.

IBM Cloud Watson Services

The IBM Watson services are organized into five groups.

  • Watson conversation provides a portal interface to build Bots.  The interface promps you to identify intents, entities and dialog flow.  Intents are the questions you expect your users to ask.  Entities are the components such as city names, times and other objects your bot will understand.   Dialog flow is the tree of intents and responses you anticipate in the dialog.   The result is a bot you can deploy and later improve.
  • The discovery service is a tool that allow you to quickly ingest and explore data collection.  A query language can be used to subset results and identify important features and anomalies.  Discovery news is a service to crawl news and blogs looking for patterns in sentiment, new concepts and relationships.   It allows you to see trends and key events.
  • The visual recognition service has been used to analyze aerial images to better understand drought and water use.   It can do image content analysis including detecting faces and making age and gender estimates.   If you have your own collection of labeled images the system can be easily trained to incorporate these into its model.
  • Speech. Watson has speed-to-text and text-to-speech services.   These services work reasonably well but the quality of the output speech does not seem as good  as  Amazon Poly.
  • The Watson natural language classifier is designed to classify intent of text passages such as deciding that a question about the weather is looking for current temperatures.   As with the other services it is update it with additional training data.
  • The Watson empathy services allow prediction of personality characteristics and emotions through text.

Google Cloud AI services

The Google cloud has an extensive set of AI services available.

  • AutoML is Google’s tool for training their vision models on your data. If you image data is labeled it will help create better labels.   If it is not labeled they will help label it.    It uses transfer learning which is a method to retrain a neural network to recognize new inputs.  By leaving many of the early layers in the previously trained network unchanged basic features such as edges and shapes can be used again and only the last few layers need to be relearned.  (This method is widely used by the other image services described here.)  Google also has a powerful vision api that is capable of recognizing thousands of categories of images.
  • Cloud Machine Learning Engine is a cloud service that help you manage a large cluster for very large ML tasks. It also allows you to use your trained algorithm with terabytes of data and thousands of concurrent users.
  • DilogFlow is Google’s tool for building bots and interfaces that support natural and rich interactions.
  • Video Intelligence. Suppose you have a large collection of videos and you want to be able to search for occurrences of specific words.  The Google cloud video intelligence API makes this possible.
  • Cloud Speech. Google has a long history with speech-to-text recognition that is widely used in their android product and Google search.   The Google cloud Speech API recognizes over 100 languages and variants.   It has context aware recognition that filters out lots of background noise.   (The also have a very nice speech recognition kit that works with a raspberry pi.   I have used it.   It is fun little project.)
  • Natural Language. Google’s text analysis is very good at parsing documents.   It is very good at entity recognition (tagging many phrases and words with Wikipedia articles).   It can also give you lists of relevant categories.   For syntax analysis is used a version of their parsyMcParseface parser that I used in my demo of building an application for Algorithmia described in this post.
  • Cloud Translation. Google had one of the earliest cloud translation services and it has become better over time.   It supports more than 100 languages.

Microsoft Azure

Azure’s machine learning services are divided into two categories: Artificial Intelligence and cognitive services.  There are currently three AI services:

  • ML services which is based on their machine learning workbench. The workbench is designed to guide you through the process of creating a data package from your data and then build a python/pyspark script to evaluate it.   You can invoke the script locally, on an azure vm or in a container.   The workbench has the capability to fill in the gaps of data cleaning and algorithm selection in generating a final solution.
  • Batch AI services consist of a set of tools to help you marshal GPU and CPU cluster resources for parallel machine learning training using CNTK, TensorFlow and Chainer.
  • Azure AI services include Bot Builder, an SDK for creating bots and a suite of bot template.

The cognitive services are divided into four main categories.

  • This includes a vision API for content analysis which works with a Jupyter notebook that allows you to upload images and return a description in terms of recognized entities.  It also provides a caption.  A content moderator service allows you to flag images that may have unwanted content.  The custom vision service allows you to quickly train a vision app to recognize images from classes you provide.   The classes can be small (30 images in each) but it does not recognize your images when they are embedded in more complex scenes.   However it does allow you to export the trained model as TensorFlow to run in offline applications.  Face and Emotion APIs allow you to detect faces in images and detect the mood of each.   The video indexer is impressive.  It  can provide audio transcription, face tracking and identification, speaker indexing, visual text recognition, sentiment analysis and language translation.
  • The speech to text and text to speech services are there but there is also a Custom Speech Service that allows you to add knowledge about specific jargon to the language model.  A Speaker Recognition API allows your apps to automatically verify and authenticate users using their voice and speech.  The Translator service is based on the work that was done for the skype realtime speech translation system.   It can recognize languages and translate the spoken sentences into the target language.
  • The Language Understanding Service allows your application to understand spoken commands like “Turn off the light” or home automation tasks.  The Linguistic Analysis API provides sentence separation, part-of-speech tagging and constituency parsing.   The Text Analysis Service provide sentiment analysis and key phrase extraction.  A Web Language Model is based on the Web N-Gram Corpus for analysis of Web documents.
  • The Custom Decision Service uses reinforcement learning algorithms to extract features from a set of candidates when ranking articles and images for automatic inclusion in a web site.  The Entity Linking Intelligence Service API provides a tool to understand when an word is uses as an actual entity rather than a part of speech or a general noun or verb.  This is done by looking at the context of the use of the word.  The Academic Knowledge API provides access to the Microsoft academic graph which is data mind from the Bing index.   The QnA Maker is a REST API that trains a machine learning system to help bots respond in a more natural way to user requests.

AWS AI services

Amazon’s web services cloud AI services has seven major APIs

  • Image and video Rekognition. The image recognition service allows the full set of computer vision features that are available anywhere.  Object, scene and activity detection is continuously learning.  It can recognize objects and scenes.  Text in images like street names or product names can be read.  If you have a private library of photos it can identify a people. When it is analyzing video it can identify certain activities happening in the frame.   Facial analysis recognizes age ranges and emotions.   When analyzing video it can track individual people as they go in and out of a frame.  Sending live or recorded video to a Kinesis Video Stream  can be routed to rekognition video and identified object can be sent to lambda functions that can react in near real time.   Alternatively, video can be periodically loaded into S3 buckets which trigger lambda functions that will invoke rekognition for analysis.
  • Amazon Lex is a tool for build bots with voice and text input and response. It is the same technology that powers Echo’s Alexa.    The Lex console allows you to build  a bot with ease.  Conversation flow is an import part of the Bot interaction.   Lex supports simple mechanisms to allow you to tailor the flow to your application.
  • Comprehend. There are two main components to Amazon Comprehend.   The first is a set of tools to extract named entities (“Person”, “Organization”, “Locations”, etc.) and key phrases from a document. The more impressive capability of Amazon Comprehend is the topic modeling subsystem.   This is of interest if you have a large collection of documents and you would like to see then classified into a set of N bins where N is a number you pick.   Comprehend will take you collection of documents and apply a Latent Dirichlet Allocation-based learning model to separate them into N bins with each bin defined by a set of key words it has discovered.   (At the end of this article we will demonstrate Amazon Comprehend.
  • Translate. This service provides real-time and batch language translation.   The service is protected by SSL encryption.
  • If you have a mp3 or wave video and you want to add subtitles, the transcribe service will render all of the voice audio to text and also insert timestamps for each word.   You can then use Translate to convert the audio to another language.   They say they are adding specific voice identification soon.
  • Poly is the Amazon text to speech API.   It is far from the robotic sounding speech generation we saw in the past.   It has 47 different voices spread over many languages.   (I have used it and it is both impressive and fun.)


If you need to build a bot that understands English, French and Mandarin and replies with spoken and correctly accented Italian that can help you identify your friends and celebrities in your Instagram photos and also mine your twitter feed, you are in luck.  The tools are there.  But if you are expecting emergent artificial intelligence, you are out of luck.  Alexa, Cortana and Seri are each good at fast facts but otherwise dumb as a post.

It is also now clear that this technology is also a boon to those with more nefarious goals.  If you are a government security agency with access to lots of cameras in public places, keeping track of your citizens is now a snap.   We see that social media is now swarming with bots that sell not only soap but also promote and propagate lies and propaganda.    Serious questions are being raised about the potential threat to modern democracies that these technologies enable.   The social media companies are aware of the challenge of eliminating the bots that skew our national discussions and we hope they are up to the cleanup task.

There is also much to be excited about.   The technology behind these AI services is also helping us use vision and sensing that can truly help mankind.   We can “see” the planet at an enlarged scale.  We can spot droughts, crop disease and the effects global warming is having on the planet in greater detail because of the speed and accuracy of image analysis.    We can monitor thousands of sensors in our environment that help us improve our quality of air and water and we can better predict potential problems before they occur.  The same sensor and vision technology help us scan x-ray and other medical images.

All of these AI advances are going to give us safer roads with driverless cars and robots magnify the power of the individual worker in almost every domain.   I look forward to the time Alexa and or Cortana can become a real research partner helping me scan and review scientific literature and point me to discoveries that I most certainly miss today.


In the following paragraphs we look at one of the cloud services in depth.   In future articles we will examine other capabilities and applications.

Text Analysis with Amazon Comprehend

As with everything in AWS, their services can be accessed by the command line interface or the APIs.   However, the console provides a very simple way to use them.   We will test Amazon’s Comprehend using the named entity and key phrases interface.  The service is accessed via their API explorer.

We selected a paragraph about the discovery of DNA from Wikipedia and pasted it into the entity/key phrase extractor.    The results are shown in figures 1, 2 and 3.


Figure 1.  Inserting a paragraph into the API explorer.


Figure 2.   The list of Entities


Figure 3. The key phrases

As can be seen the system does a very good job with both the entity and key phrase tasks.   In particular it does a great job of categorizing the named entities.

Topic Modeling

The more impressive capability of Amazon Comprehend is the topic modeling subsystem.   This is of interest if you have a large collection of documents and you would like to see then classified into a set of N bins where N is a number you pick.   To try this out, I  used the Arxiv science abstract collection I have used in previous document classifier experiments.   Each document is the text of an abstract of a scientific research paper.   To use comprehend you put the documents in an AWS S3 bucket.   I have 7108 documents and they are in  the bucket  (If you are interested, the individual files can be accessed by this url*arxiv  where * is a an integer between 0 and 7108.)

Invoking the topic modeler from the console is trivial.  You simply fill in a form.  The form for my experiment is shown below in Figure 4.


Figure 4.   Invoking the Topic modeler.

In this test case the topic modeler ran in about five minutes and produce a pair of CSV files.   One file contained a set of tuples for each document.  Each tuple is a triple consisting of the document name, the name of a topic bin and a score for fit for that bin.   For example, here is the first 11 tuples.  The abstract documents are drawn from five fields of science: physics, biology, computer science, math and finance. We have added a fourth column that provides the science category for the listed document.

Document no. Topic Score Actual topic
0 0 0.242696 compsci
0 5 0.757304 compsci
1 1 1 math
2 0 0.546125 Physics
2 4 0.438275 Physics
2 5 0.015599 Physics
3 1 1 math
4 8 1 Physics
5 0 0.139652 Physics
5 3 0.245669 Physics
5 5 0.614679 Physics

As can be seen, document 0 is computer science and scores in topic 0 and highly in topic 5.    Documents 1 and 3 are math and squarely land in topic 1.  Documents 2, 4 and 5 are physics and are distributed over topics 0,3,4,5 and 8.   The algorithm used in the topic modeler is described in Amazon’s documentation as follows.

“Amazon Comprehend uses a Latent Dirichlet Allocation-based learning model to determine the topics in a set of documents. It examines each document to determine the context and meaning of a word. The set of words that frequently belong to the same context across the entire document set make up a topic.”

If we look at the corpus as a whole we can see how well the topic modeler did in relation to the known topics.    The result is in Figure 5 below which gives the percent of papers in each science area that had scores in  each modeler topic.


Figure 5.  Topics selected by the model for each science discipline

As can be seen the modeler topic 000 did not differentiate very well between physics, bio and compsci.   To look closer at this we can look at the other csv file generated by the modeler.   This file lists the key words the modeler used to define each topic.   In the case of topic 000 the words were:


As can be seen these are words that one would expect to see in many articles from those three areas.  If we look beyond topic 000, we see physics is strong in topic 3 which is defined by the words


This topic is clearly physics.  Looking at computer science, we see the papers score strongly is topics 005 and 007.   These words are


We included machine learning in the computer science topics so this result is also reasonable.   For math the strong topics were 001 and 006 and the corresponding words were

'distribution','method','function','sample','estimator','estimate','process','parameter','random','rate','space','prove','mathbb','group','graph', 'algebra’,'theorem','finite','operator','set'

which constitutes a nice collection of words we would expect to see in math papers.  For finance topic 009 stands out with the following words.

'market', 'price', 'risk', 'optimal', 'problem', 'function', 'measure', 'financial', 'strategy', 'option'.

The only area where the topic modeler failed to be very clear was in the area of biology where topics 004 and 005 were the best.   Those words were not very indicative of biology papers:

'model', 'parameter', 'data', 'propose', 'distribution', 'inference', 'simulation', 'bayesian', 'fit', 'variable' , 'method', 'analysis', 'learn', 'base', 'network', 'approach', 'value', 'regression', 'gene'.

As an unsupervised document classifier, the Amazon Comprehend modeler is impressive.   Classifying these science abstracts is not easy because science is very multidisciplinary and many documents cross the boundary between fields.   We have looked at this problem in a previous post Algorithmia Part 2. Document Embedding with Gensim and Deploying a Trained Model in the Algorithmia Cloud and in our book Cloud Computing for Science and Engineering where we describe many of the challenges.   One short coming of the Amazon modeler is that it does not provide  an easy way to model a new document against the models built from the corpus.  This should be easy to do. In the analysis above we looked at how broad scientific domains are mapped over the detected category  bins.  One thing we also need to look at is how well the individual categories are at grouping similar abstracts.  This is equivalent to looking at the columns of the table in Figure 5 above.   If we take a look at topic 006 that is heavily associated with math we can print the titles and the ArXiv sub-categories they came from.   A sample is shown below.

‘Differential Calculus on Cayley Graphs [cs.DM]’,
‘Coherent rings, fp-injective modules, and dualizing complexes [math.CT]’,
‘Self-dual metrics with maximally superintegrable geodesic flows [gr-qc]’,
‘New atomic decompositons for Bergman spaces on the unit ball [math.CV]’,
‘Presenting Finite Posets [cs.LO]’,
‘The Whyburn property and the cardinality of topological spaces [math.GN]’,
‘Absolutely Self Pure Modules [math.RA]’,
‘Polynomials and harmonic functions on discrete groups [math.GR]’,
‘Free Resolutions of Some Schubert Singularities in the Lagrangian  Grassmannian [math.AG]’,
‘Connectedness properties of the set where the iterates of an entire  unction are unbounded [math.DS]’,
‘A Purely Algebraic Proof of the Fundamental Theorem of Algebra [math.HO]’,
‘A cell filtration of the restriction of a cell module [math.RT]’,
‘Higher dimensional Thompson groups have subgroups with infinitely many   relative ends [math.GR]’,
‘PI spaces with analytic dimension 1 and arbitrary topological dimension [math.MG]’,
‘Eigenvalues of Gram Matrices of a class of Diagram Algebras [math.RA]’

With the exception of the first, third and fifth documents they are all math and even those two documents look like math.   On the other hand looking at a sample from category 000 we see a true hodgepodge of topics.

‘A stochastic model of B cell affinity maturation and a network model of   immune memory [q-bio.MN]’,
‘Precise determination of micromotion for trapped-ion optical clocks [physics.atom-ph]’,
‘Quantum delocalization directs antenna absorption to photosynthetic   reaction centers []’,
‘Fluorescence energy transfer enhancement in aluminum nanoapertures [physics.optics]’,
‘Direct Cortical Control of Primate Whole-Body Navigation in a Mobile   Robotic Wheelchair [q-bio.NC]’,
‘Condition for the burning of hadronic stars into quark stars [nucl-th]’,
‘Joint Interference Alignment and Bi-Directional Scheduling for MIMO   Two-Way Multi-Link Networks [cs.IT]’,
‘MCViNE — An object oriented Monte Carlo neutron ray tracing simulation   package [physics.comp-ph]’,
‘Coherent addressing of individual neutral atoms in a 3D optical lattice [quant-ph]’,
‘Theoretical analysis of degradation mechanisms in the formation of   morphogen gradients []’,
‘Likely detection of water-rich asteroid debris in a metal-polluted white   dwarf [astro-ph.SR]’,
‘A Study of the Management of Electronic Medical Records in Fijian   Hospitals [cs.CY]’,
‘Self-assembling interactive modules: A research programme [cs.FL]’,
‘Proceedings Tenth International Workshop on Logical Frameworks and Meta   Languages: Theory and Practice [cs.LO]’,

Setting aside this topic bin 000, we certainly see strong coherence of the documents.

Reflections on the Unidata Cloud Computing Workshop


The University Corporation for Atmospheric Research (UCAR) was created in 1960 to be the US national hub for research and education in the atmospheric and earth sciences.  UCAR manages the National Center for Atmospheric Research (NCAR) which is devoted to fundamental research around predictions about our atmosphere.  UCAR is also the home to Unidata, a community program to facilitate weather data access and interactive analysis for the university community.  Unidata’s director Mohan Ramamurthy succinctly described Unidata’s mission as reducing the “data friction”, lowering the barriers for accessing and using data, and shrinking the “time to science”.  Unidata supports 30 different streams of real-time weather data from sources including radar, satellite and model output to over 1000 computers worldwide.  Unidata’s outbound traffic is about 31 terabytes/day.  They move more data via Internet 2 than any other advanced application.

While Unidata’s program is incredibly successful they see a critical turning point ahead.   The data volumes are growing so fast that the model of distributing raw data may no longer be sustainable.    Ramamurthy writes “We need to move from ‘bringing the data to the scientist’ to ‘bringing the science to the data’.” For these and other reasons, Unidata has made a decision to transition data services to the cloud.

Last week (May 31-June 2, 2017) UCAR and Unidata hosted a workshop on the role cloud computing can play in atmospheric science research.  The workshop explored the role that cloud computing is playing in atmospheric science research and the potential that it could play in the future.   Science is composed of many sub-disciplines.   Some disciplines, such as life science, have rapidly embraced cloud computing as an important enabling technology.  Other disciplines, especially those with a long tradition of large scale computational simulation on supercomputers, have been slower to see a role for the cloud.  Atmospheric science (AS) is in this latter category. What this workshop has made very clear is that AS community has now recognized that the cloud has the potential to revolutionize both research and education.   But there are still challenges that must be addressed.

Using the Cloud for Computation: Challenges and Opportunities

One of the great breakthroughs in Numerical Weather Prediction (NWP) has been the development of the “Weather Research and Forecasting” (WRF) model.  This is the state-of-the-art mesoscale numerical weather prediction software system.   WRF is basically a specialized computational fluid dynamics program and data assimilation system that can generate simulations based on real observational data or idealized conditions.   It can generate predictions conforming to computational meshes that can range in scale from 10s of meters to many kilometers.   WRF has 30,000 registered users in 150 countries.

WRF runs as a classic MPI-based parallel program.  For really big simulations, such as those used in production in major forecast centers, it can run for hours on 10,000 cores on a supercomputer.  In fact, the largest core count for a single WRF job (that we know of) is about 140,000 on the Blue Waters supercomputer at the University of Illinois.  For small experiments and student training it can run on 16 cores in a few hours and still provide useful results for researchers.   A big challenge for using WRF is that it is an incredibly complex application that requires an experienced professional to properly deploy it and configure it for use.   Training students to use WRF is a major challenge if the students are required to deploy it themselves.  The students face additional hurdles if they don’t have access to sufficient computational resources to run the program.

A Cloud Solution for Modest WRF Runs

As part of the Big Weather Web NSF project, J. Hacker, J. Exby and K. Fossell from NCAR recognized the problem confronting educators and came up with a great solution. They demonstrated that by putting WRF and all its required components and configuration files in a Docker container that could be easily run on a AWS EC2 instance, they could revolutionize the use of WRF in classroom and lab experimentation. Deploying WRF is no longer an obstacle for students. Their container allows the user to change input data sets and boundary conditions, change the physics assumptions, and generate bit-level reproducible simulation results.  In an experiment with the University of North Dakota, students used the container on AWS to create an ensemble output of a tornadic supercell over North Dakota.   The total AWS cost of an 11-day student team project was $40.

Running containerized versions of WRF on a 32-core cloud VM addresses a very large set of educational and research use cases.   One concern that frequently came up at the workshop was the fear of spending too much money.  Because the cloud charges by the hour, it is feared that users may be discouraged from experimenting.   Users that are accustomed to doing work on a dedicated cluster were the most concerned with this problem.   However, Roland Stull, the director of the Geophysical Disaster CFD Center at the University of British Columbia was convinced by his graduate students that the cloud was a great platform for experimentation.   They used the Google cloud to build a virtual HPC cluster.  After experimenting with different configurations  they found that 32 8-core VMs was the best configuration for WRF and  produced results in about the same time as their local 448 core HPC cluster.  Their entire experience grew from one graduate student, David Siuta, doing experiments in the cloud using a tiny Google free cloud account.

The Biggest Cloud Challenge: Data

While doing large computational simulations with tools like WRF is important, that was not the central topic of the workshop.  The most important opportunity for using cloud resources in AS is that it provides a great way to share data and post-processing analysis and simplify the data distribution challenges that face Unidata.   The cloud vendors, notably Amazon and Google have done a great job in making important data collections available in the cloud.  Kevin Jorissen from Amazon described some of the data they make available to the community.   Their data collections are open to the public and include data from many scientific disciplines. In the AS case,  this includes 270TB of  individual NEXRAD radar volume scan files and real-time chunks.  These are available as objects on Amazon S3 and accessible via standard REST APIs.   In addition, Amazon Simple Notification Service (SNS) can be used to subscribe to notifications when new data arrives.  For geoscience AWS has over 400,000 LANSAT satellite image scenes in their archive.   They also have the NOAA Global Forecast System Model data and High-Resolution Rapid Refresh (HRRR) model data available in rolling one week archives.

While these AWS resources are free to download, the challenge for Unidata and others that have moved model execution and data to the cloud is that the cost of data downloads is a large component of their overall cloud bill.  In fact, it is not possible for Unidata to put all of their data on AWS.   The cost of data egress would overwhelm their budget.   Matthew Alvarado shared his experiences with using AWS.  Their data downloads cost $90/TB and that has encouraged them to follow the same path that Unidata is considering and it is one that Jim Gray advocated in 2005:  rather than moving the data to the compute, move the compute to the data.   As Alvarado states “Do post-processing and analysis on the cloud. – Use Python and R to avoid license charges – Use THREDDS to make data accessible via netCDF viewers like Panoply or via the web.”

Unidata has gone a long way in this direction.  To move more data processing to the cloud, they have deployed an open source version of the Advanced Weather Interactive Processing System (AWIPS)  to the Azure cloud.  AWIPS which is now in use by 44 universities.   AWIPS consists of a data ingest, processing, and storage server called the Environmental Data EXchange (EDEX), and user client programs including the Common AWIPS Visualization Environment (CAVE) and a Python library that can be used to pull subsets of the data to the user’s desktop.

Unidata has also “dockerized” many of their tools including the Integrated Data Viewer (IDV), the THREDDS Data Server, and the Local Data Manager (LDM) so that they can be easily deployed by users locally or in the cloud.   The CloudIDV allows IDV to run in the cloud so that the only download bandwidth the user sees is the graphical images that IDV generates for the user.  Unidata has also released Siphon, which is a Python library that can query NetCDF and other data hosted on the THREDS Data Server.   Using this library, it is possible to build python analytic services that can explore the data while it is still in the cloud. These are great first steps in moving the compute to the data … as long as the data is in a data center in the same region as the VM running the CloudIDV container.

Final Thoughts

Weather data is among the most complex and dynamic “Big Data” topics in science.   It is extremely heterogeneous.  To do predictions of the weather one must understand the flow of information streaming in from sensors that range from NeXRAD radar and satellite images to ground sensors and human observations.   This data feeds simulation models that generate ensembles of output data that can be mined to make predictions.    It is important that the data and tools are public so that the widest possible collection of researchers can experiment with it.

We are now in the middle of a revolution in our ability to derive knowledge from big data.   New statistical tools and advanced machine learning techniques are being developed every day.   These have been applied to discover patterns in dull but profitable endeavors like consumer shopping preferences, but they are also central to the race to deliver self-driving cars.  And they are becoming increasingly important in scientific domains like genomics and physics.    Weather and climate data are also well suited to exploration with the aid of machine learning.   A great example is a paper by Liu and colleagues at LBL and others that demonstrate how deep learning via convolutional neural networks can be applied to the detection of extreme weather in climate datasets.    I am certain that, given access to the best data and strong collaborations between the ML and AS communities, even more progress will be made.

Making data available in public cloud archives makes it possible to build a computational environmental of shared tools and results.   In the machine learning research world, there is a strong movement toward open sharing of data and research challenges and results.  Sites like provide thousands of case studies and data collections for researchers.   A similar movement could create a cloud-based that build on what has already started with the Big Weather Web.

Finally, Unidata must be commended for putting together such a great workshop.



Our goal for this site is to provide you with the online content for the book.  We are also very interested in getting your feedback concerning errors or suggestions for future content.