Blog

Building a Tiny Knowledge Graph with BERT and Graph Convolutions.

Abstract

Knowledge graphs (KGs) have become an important tool for representing knowledge  and accelerating search tasks.   Formally, a knowledge graph is a graph database formed from entity triples of the form (subject, relation, object) where the subject and object are entity nodes in the graph and the relation defines the edges.  When combined with natural language understanding technology capable of generating these triples from user queries, a knowledge graph can be a fast supplement to the traditional web search methods employed by the search engines.    In this tutorial, we will show how to use Google’s Named Entity Recognition to build a tiny knowledge graph based on articles about scientific topics.  To search the KG  we will use BERT to build vectors from English queries and graph convolutions to optimize the search. The result is no match for the industrial strength KGs from the tech giants, but we hope it helps illustrate some core concepts.

Background

We have explored the topic of KGs in previous articles on this blog.  In “A ‘Chatbot’ for Scientific Research: Part 2 – AI, Knowledge Graphs and BERT.”, we postulated how a KG could be the basis for a smart digital assistant for science. If you search for Knowledge Graph on the web or in Wikipedia you will lean that the KG is the one introduced by Google in 2012 and it is simply known as “Knowledge Graph”.  In  fact, it is very large (over 70 billion nodes) and is consulted in a large fraction of searches.   Having the KG available means that a search can quickly surface many related items by looking at nearby nodes linked to the target of the search.   This is illustrated in Figure 1.  This  is the result of a Google search for “differential equation” which is displayed an information panel to the right of the search results.  

This image has an empty alt attribute; its file name is Picture1.jpg


Figure 1.  Google information panel that appears on the right side of the page.  In this case the search was for “differential equation”.  (This image has been shortened a bit.)

The Google KG is extremely general, so it is not as good for all science queries, especially those that clash with popular culture.   For example, if you search for the term that describes the surface of a black hole,  an “event horizon” you get an image from the bad 1997 movie by that name.   Of course, there is also a search engine link to the Wikipedia page describing the real scientific thing.   Among the really giant KGs is the Facebook entity graph which is nicely described in “Under the Hood: The Entities Graph” by Eric Sun and Venky Iyer in 2013.  The entity graph has 100+ billion connections.  An excellent KG for science topics is the Microsoft Academic graph.  See “Microsoft Academic Graph: When experts are not enough” Wang et.al. 2019 which we will describe in more detain in the last section of this note.   In its earliest form the Google’s KG was based on another KG known as Freebase.  In 2014 Google began the process of shutting down Freebase and moving content to a KG associated with Wikipedia called Wikidata.  We will use Wikidata extensively below.

Wikidata was launched in 2012 with a grant from Allen Institute, Google and the Gordon and Betty Moore Foundation and it now has information that is used in 58.4% of all English Wikipedia articles.   Items in Wikidata each have an identifier (the letter Q and a number) and each item has a brief description and a list of alias names.  (For example,  the item for Earth (Q2) has alternative names: Blue Planet, Terra Mater, Terra, Planet Earth, Tellus, Sol III, Gaia, The world, Globe, The Blue Gem, and more.)  each item has a list of affiliated “statements” which are the “object-relation-object” triples that are the heart of the KG.   Relations are predicates and are identified with a P and a number.  For example, Earth is an “instance of” (P31) “inner planet” (Q3504248).  There are currently 68 million items in Wikidata and, like Wikipedia it can be edited by anyone.

In what follows we will show how to build a tiny knowledge graph for two narrow scientific topics and then using some simple deep learning techniques, we will illustrate how we can query to KG and get approximate answers. 

As mentioned above ideal way to construct a KG from text is to use NLP methods to extract triples from text.  In a very nice tutorial by Chris Thornton, this approach is used to extract information from financial news articles.   If the relations in the  object-relation-object are rich enough one may be able to more accurately answer questions about the data.   For example, if the triples are as shown in the graph in Figure 2  below

This image has an empty alt attribute; its file name is Picture2.png

Figure 2.  Graph created from the triples “Mary attended Princeton” and “Princeton is located in New Jersey”

one may have a good chance at answering the question “Where has Mary lived?”.There has been a great deal of research on the challenge of building relations for KG. (for example see Sun, et.al)     Unfortunately, scientific technical documents are not as easy to parse into useful triples.  Consider this sentence.

The precise patterns prevalent during the Hangenberg Crisis are complicated by several factors, including difficulties in stratigraphic correlation within and between marine and terrestrial settings and the overall paucity of plant remains.

Using the AllenAI’s NLP package we can pull out several triples including this one:

[ARG1: the precise patterns prevalent during the Hangenberg Crisis]

[V: complicated]

[ARG0: by several factors , including difficulties in stratigraphic correlation within and between marine and terrestrial settings and the overall paucity of plant remains]

But the nodes represented by ARG0 and ARG1 are not very concise and unlikely to fit into a graph with connections to other nodes and the verb ‘complicated’ is hard to use when hunting for facts. More sophisticated techniques are required to extract usable triples from documents than we can describe here. Our approach is to use a simpler type of relation in our triples.   

Building our Tiny Knowledge Graph

The approach we will take below is to consider scientific documents to be composed as blocks of sentences, such as paragraphs and we look at the named entities mentioned in each block. If a block has named entities, we create a node for that block called an Article node which is connected to a node for each named entity.  In some cases we can learn that a named entity is an instance of a entity class.   In other words, our triples are of the form

(Article  has named-entity)  or (named-entity instance-of  entity-class)

If entities, such as “Hangenberg Crisis”, occur in other blocks from the same paper or other papers we have an indirect connection between the articles. 

To illustrate what the graph looks like consider the following tiny text fragment.  We will use it to generate a graph with two article nodes.

The Theory of General Relativity demonstrates that Black Holes are hidden by an Event Horizon. Soon after publishing the special theory of relativity in 1905, Einstein started thinking about how to incorporate gravity into his new relativistic framework.

In 1907, beginning with a simple thought experiment involving an observer in free fall, he embarked on what would be an eight-year search for a relativistic theory of gravity. After numerous detours and false starts, his work culminated in the presentation to the Prussian Academy of Science in November 1915 of what are now known as the Einstein field equations, which form the core of Einstein’s famous Theory of General Relativity.

Each sentence is sent to the Google Named-Entity Recognition service with the following function.

This image has an empty alt attribute; its file name is Picture3.png

The NER service responds with two types of entities.   Some are simply nouns or noun phrases and some are entities that Google NER recognizes as having Wikipedia entries.  In those cases, we also use the Wikipedia API to pull out the Wikidata Identifier that is the key to Wikidata.  In our example we see three entity node types in the graph.   The blue entity nodes are the ones that have Wikipedia entries.  The green entities are nodes that are noun phrases.  We discard returned entities consisting of a single noun, like “space”, because there are too many of them, but multiword phrases are more likely to suggest technical content that may appear in other documents.  These nodes are in green.  In the case of the Wikidata (blue) nodes, we can use the Wikidata identifier to find out if the entity is an instance of a class in Wikidata.  If so, these are retrieved and represented as gray entities.   The first pair of sentences are used to create an article node labeled ‘Art0’.   The second pair of sentences are used to form node ‘Art1’. 

This image has an empty alt attribute; its file name is Picture4-1024x630.png

Figure 3.  The rendering of our two-article graph using NetworkX built-in graphing capabilities.

In this case we see that Einstein is an instance of human and an event horizon is an instance of hypersurface (and not a movie).   The elegant way to look for information in Wikidata is to use the SPARQL query service.   The code to pull the instanceOf property (wdt:P31)  of an entity given its wikiID is

This image has an empty alt attribute; its file name is fig2-1024x345.jpg

Unfortunately, we were unable to use this very often because of time-outs with the sparql server.   We had to resort to scraping the wikidata pages.  The data in the graph associated with each entity is reasonably large.   We have a simple utility function showEntity that will display it.   For example,

This image has an empty alt attribute; its file name is fig3-1024x438.jpg

The full Jupyter notebook to construct this simple graph in figure 3 is called build-simple-graph.ipynb in the repository https://github.com/dbgannon/knowledge-graph.

We will build our tiny KG from 14 short documents which provide samples in the topics climate change, extinction, human caused extinction, relativity theory, black holes, quantum gravity and cosmology.  In other words, our tiny KG will have a tiny bit of expertise in two specialized: topics relativistic physics and climate driven extinction.   

English Language Queries, Node Embeddings and Graph Convolutions.

Our tiny KG graph was built with articles about climate change, so it should be able to consider queries like ‘The major cause of climate change is increased carbon dioxide levels.’ And respond with the appropriate related items.   The standard way to do this is to take our library of text articles stored in the KG and  build a list of sentences (or paragraphs) and then use a document embedding algorithm to map each one to a vector in RN for some large N so that semantically similar sentences are mapped to nearby vectors.    There are several standard algorithms to do sentence embedding.   One is Doc2Vec and many more have been derived from Transformers like BERT. As we shall see, it is important that we have one vector for each article node in our KG.  In the example illustrated in Figure 3, we used two sentences for each article. Doc2Vec is designed to encode articles of any size, so 2 is fine.   A newer and more accurate method is based on the BERT Transformer, but it is designed for single sentences but also works with multiple sentences.   To create the BERT sentence embedding mapping we need to first load the pretrained model.

This image has an empty alt attribute; its file name is fig4-1024x89.jpg

The next step is to use the model to encode all of the sentences in our list.  Once that is done, we create a matrix mar where mar[i] contains the sentence embedding vector for the ith sentence normalized to unit length. 

This image has an empty alt attribute; its file name is Picture-new2.png

Now if we have an arbitrary sentence Text and we want to see which sentences are closest to it we simply encode the Text, normalize it and compute the dot product with all the sentences.   To print the k  best fits starting with the best we invoke the following function.

This image has an empty alt attribute; its file name is Picture11.png

For example,  if we ask a simple question:

Is there a sixth mass extinction?

We can invoke find_best to look for the parts of the KG that is the best fit.

This image has an empty alt attribute; its file name is fig5-1024x606.jpg

Which is a reasonably good answer drawn from  our small selection of documents.   However there are many instances where it is not as good.   An important thing to note is that we have not used any properties of the graph structure in this computation. 

Using the Structure of the Graph:  A Convolutional Approach.

 In our previous tutorial “Deep Learning on Graphs” we looked at the graph convolutional network as a way to improve graph node embedding for classification.   We can do the same thing here to use the structure of the graph to augment the Bert embedding.   First, for each article node x  in the graph we collect all its immediate neighbors where we define immediate neighbor to mean those other article nodes linked to an entity node shared with x.   We then computed a weighted sum (using a parameter lambda in [0,1]) of the Bert embedding vectors for each neighbor with the Bert embedding of for x.   We normalize this new vector, and we have a new embedding matrix mar2.  

This image has an empty alt attribute; its file name is Picture13.png

Figure 4.  Illustrating the convolution operation

Intuitively the new embedding captures more of the local properties of the graph. We create a new function find_best2 which we can use for question answering.   There is one problem.   In the original find_best function we convert the query text to a vector using the BERT model encoder.   We have no encoder for the convolved model. Using the original BERT encoder didn’t work… we tried.)  We decided to use properties of the Graph.  We asked the Google NER service to give us all the named entities in our question.  We then compare those entities to entities in the Graph.  We search for the  article node with the largest number of named entities that are also in our query. we can use the convolved mar2 vector for that node as a reasonable encoding for our query.   The problem comes when there is no clear winner in this search.   In that case, we use the article that find_best says is the best fit and use that article’s mar2 vector as our encoding.   Pseudo code  for our new find_best2 based on the convolution matrix mar2 is

This image has an empty alt attribute; its file name is Picture14-1024x479.png

To illustrate how the convolution changes the output consider the following cases.

This image has an empty alt attribute; its file name is fig6-1024x783.jpg

The first two responses are excellent.    Now apply find_best2 we get the following.

This image has an empty alt attribute; its file name is fig7-1024x729.jpg

In this case there were 18 article nodes which had named entities that matched the entity in the text: “carbon dioxide”.   In this case find_best2 just uses the first returned value from find_best.  After that it selected the next 3 best based on mar2.  The second is the same, but the last two are different are arguably a better fit than the result from find_best.

To see a case where find_best2 does not invoke find_best to start, consider the following.

This image has an empty alt attribute; its file name is fig8-1024x555.jpg
This image has an empty alt attribute; its file name is fig10-1024x785.jpg

Both fail on addressing the issue of “Who” (several people are mentioned that answer this question are in the documents used to construct our KG.).  However the convolutional version find_best2 addressed “solve” “field equations” better than find_best.

The process of generating  the  BERT embedding vectors mar[ ] and the results of the convolution transformation mar2[ ] below is illustrated in Figure 5 as a sequence of two operators.  

BertEncode: list(sentences1 .. N ) ->  RNx768    
GraphConv : RNx768 -> RNx768

This image has an empty alt attribute; its file name is Picture19.png

Figure 5.  Going from a list of N sentences to embedding vectors followed by graph convolution.  Additional convolution layers may be applied.

There is no reason to stop with one layer of graph convolutions.    To measure how this impacts the performance we set up a simple experiment.  This was done by computing a score for each invocation of the find_best function.   As can be seen from the results, some of the replies are correct and others are way off.   One way to  quantify  this is to see how far the responses are from the query.  Our graph was built from 14 documents which provide samples in the topics climate change, extinction, human caused extinction, relativity theory, black holes, quantum gravity and cosmology.   Allocating each of the 14 documents to an appropriate one of 6 topics, we can create have a very simple scoring function to compare the documents in the find_best response.  We say score(doc1, docN) = 1 if both doc1 and docN belong to documents associated with the same topic The score is 0 otherwise.  To arrive at a score for a single find_best invocation, we assume that the first response is likely the most accurate and we compute the score in relation to the remaining responses

(score(doc1,doc1) + score(doc1, doc2) + sore(doc1, doc3) + score(doc1, doc4))/4.

The choice of measuring four responses is arbitrary.   It is usually the case that responses 1 and 2 are good and 3 and 4 may be of lower quality. 

The six topics are  climate, extinction, human-related extinction, black holes, quantum gravity and cosmology. 

Notice that we are not measuring the semantic quality of the responses.  We only have a measure of the consistency of the responses.  .  We return to the topic of measuring the quality of response in the final section of the paper. The algorithm to get a  final score for find_best is to simply compute the score for each node in the graph.

This image has an empty alt attribute; its file name is Picture20.png

We do the same for find_best2 for different layers of convolution.  As can be seen from the chart below one convolution does improve the performance.   The performance is peaked out at 3 convolutional layers.  

This image has an empty alt attribute; its file name is Picture21-1024x656.png

Recall that the convolution operator was defined by a parameter lambda in the range 0 to 1.   We also ran a grid search to find the best value of lambda for 1, 2 and 3 convolutional layers.  We found a value of 0.75 gave reasonable results.   (note that when lambda = 1 the layers degenerate to no layers.)

This image has an empty alt attribute; its file name is Picture22.png

Plotting the Subgraph of Nodes Involved in a Query

It is an amusing exercise to look at the subgraph of our KG that is involved in the answer to a query.   It is relatively easy to generate the nodes that connect to the article nodes that are returned by our query function.  We built a simple function to do this.  Here is a sample invocation.

This image has an empty alt attribute; its file name is Picture23-1024x67.png

The resulting subgraph is shown below.

This image has an empty alt attribute; its file name is Picture24-1024x496.png

Figure 6.   Subgraph generated by the statement “’best-known cause of a mass extinction is an Asteroid  impact that killed off the Dinosaurs.”

The  subgraph  shown in figure 6 was created as follows.  We first generate an initial subgraph generated by using the article nodes returned from find_best2 or find_best and the entity nodes they are connected to.   We then compute a ‘closure’ of this subgraph by selecting all the graph nodes that are connected to nodes that are in the initial subgraph. 

Looking at the subgraph we can see interesting features.  An amusing exercise is to follow a path from one of the nodes in the graph to the other and to see what the connections tell us.   The results can form an interesting “story”.   We wrote a simple path following algorithm.

This image has an empty alt attribute; its file name is Picture25.png

Invoking this with the path from node Art166 to Art188

This image has an empty alt attribute; its file name is Picture26.png

gives the following.

This image has an empty alt attribute; its file name is fig11-1024x768.jpg

Trying another path

This image has an empty alt attribute; its file name is Picture28.png

Yields the following

This image has an empty alt attribute; its file name is fig12-1024x895.jpg

There is another, perhaps more interesting reason to look at the subgraph.  This time we use the initial subgraph prior to the ‘closure’ operation.   If we look at the query

This image has an empty alt attribute; its file name is Picture30-1024x54.png

The results include the output from find_best2 which are:

This image has an empty alt attribute; its file name is fig13-1024x680.jpg

You will notice only 2 responses seem to talk about dark energy.   Art186 is related, but Art52 is off topic.  Now, looking at the graph we see why.

This image has an empty alt attribute; its file name is Picture32-1024x499.png

Figure 7.  Subgraph for “What is dark energy”

There are three connected components, and one is clearly the “dark energy” component.   This suggests that a further refinement of the query processing could be to  filter out connected components that are “off topic”. 

We conducted a simple experiment.   In calls to find_best2(4, text)  we searched the ten best and eliminated the responses that were not in the same connected component as the first response.   Unfortunately this, sometimes resulted in  fewer than k responses, but the average score was now 83%. 

Conclusion

In this little tutorial we have illustrated how to build a simple knowledge graph based on a few scientific documents.    The graph was built by using Google’s Named Entity Recognition service to create simple entity nodes.   Article nodes were created from the sentence in the document.   We then used BERT sentence embeddings to enable a basic English language query capability to pull out relevant nodes in the graph.   Next we showed how we can optimize the BERT embedding by apply a graph convolutional transformation.  

The knowledge graph described here is obviously a toy.   The important questions are how well the ideas here scale and how accurate can this query answering system be when the graph is massive.  Years of research and countless hours of engineering have gone into the construction of the state-of-the-art KGs.   In the case of science, this is Google Scholar and Microsoft Academic.   Both of these KGs are built around whole documents as the basic node.   In the case of the Microsoft Academic graph,  there are about 250 million documents and only about 36% come from traditional Journals.  The rest come from a variety of sources discovered by Bing.   In addition to article nodes, other entities  in the graph are authors, affiliations, concepts (fields of study),  journals, conferences and venues.  (see “A Review of Microsoft Academic Services for Science of Science Studies”, Wang, et.  al.   for an excellent overview.)  They use a variety of techniques beyond simple NER including reinforcement learning to improve the quality of the graph entities and the connections.   This involves metrics like eigencentrality and statistical saliency to measure quality of the tuples and nodes.   They have a system MAKES that transforms user queries into queries for the KG.    

If you want to scale from 17 documents in your database to 1700000 you will need a better infrastructure than Python and NetworkX.  In terms of the basic graph build infrastructure there are some good choices.

  • Parallel Boost Graph Library.  Designed or graph processing on supercomputers using MPI communication. 
  • Pregel: A System for Large-Scale Graph Processing” developed by Google for their cloud.
  • GraphX: A Resilient Distributed Graph System on Spark.
  • AWS Neptune is fully managed, which means that database management tasks like hardware provisioning, software patching, setup, configuration, and backups are taken care of for you.  Building KG’s with Neptune is described here.  Gremlin and Sparql are the APIs.
  • Janus Graph is an opensource graph database that is supported on Google’s Bigtable.
  • Neo4J is a commercial product that is well supported, and it has a SQL like API.
  • Azure Cosmos is the Azure globally extreme-scale database that supports graph distributed across planet scale resources.

If we turn to the query processing challenge,  the approach we took in our toy KG, where document nodes were created from as few as a single sentence, it is obvious why the Microsoft KG and Google Academic focus on entire documents as a basic unit.  For the Microsoft graph a variety of techniques are used to extract “factoid” from documents that are part of concepts and taxonomy hierarchy in the Graph.  This is what the MAKES system searches to answer queries.    

It is absolutely unclear if  the convolutional operator applied to BERT sentence embedding described here would have any value when applied to the task of representing knowledge at the scale of MAG or Google Scholar.   It may be of value in more specialized domain specific KGs.   In the future we will experiment with increasing the scale of graph.

All of the code for the examples in this article is in the repository https://github.com/dbgannon/knowledge-graph.

Deep Learning on Graphs (a Tutorial)

Abstract

This tutorial gives an overview of some of the basic work that has been done over the last five years on the application of deep learning techniques to data represented as graphs.  Convolutional neural networks and transformers have been instrumental in the progress on computer vision and natural language understanding.    Here we look at the generalizations of these methods to solving problems where the data is represented as a graph.   We illustrate this with examples including predicting research topics by using the Microsoft co-author graph or the more heterogeneous ACM author-paper-venue citation graph.   This later case is of interest because it allows us to discuss how these techniques can be applied to the massive heterogeneous Knowledge networks being developed and used by the search engines and smart, interactive digital assistants.   Finally, we look at how knowledge is represented by families of graphs.   The example we use here is from the Tox21 dataset of chemical compounds  and their interaction with important biological pathways and targets. 

Introduction

Graphs have long been used to describe the relationships between discrete items of information.  A basic Graph consists of a set of nodes and a set edges that connect pairs of nodes together.   We can annotate the nodes and edges with the information that gives the graph meaning.   For example the nodes may be people and the edges may be the relationships  (married, child, parent) that exist between the people.   Or the nodes may be cities and the edges may be the roads that connect the cities.   Graphs like knowledge networks tend to be heterogeneous:  they have more than one type of node and more than one type of edge.   We are also interested in families of related graphs.  For example,  chemical compounds can be described by graphs where the nodes are atoms and the edges represent the chemical bonds between them. 

The most famous deep learning successes involve computer vision tasks such as recognizing objects in two dimensional images or natural language tasks such as understanding linear strings of text.   But these can also be described in graph terms.  Images are grids of pixels.  The nodes are individual pixels and edges are neighbor relations.   A string of text is a one-dimensional graph with nodes that represent words and edges are the predecessor/successor temporal relationships between them.   One of the most important tools in image understanding are the convolutional network layers that exploit the local neighborhood of each pixel to create new views of the data.  In the case of text analysis, Transformers manipulate templates related to the local context of each word in the string.  As we shall see the same concepts of locality are an essential to many of the graph deep learning algorithms that have been developed.

The Basics

Tools to manage graphs have been around for a long time.   Systems like Amazon’s Neptune and Azure CosmosDB scale to very large graphs.  The W3C standard Resource Description Framework (RDF) model and its standard query language, SPARQL are commonly used and well supported.  Apache Tinkerpop is based on the Gremlin graph traversal language.  The excellent Boost Graph Library is based on  C++ generics.

These pages will rely heavily on the tools developed to manage graphs in Python.   More specifically,  NetworkX, originally developed by Aric Hagberg, Dan Schult and Pieter Swart, and the Deep Graph Library led by Minjie Wang  and Da Zheng from AWS and Amazon AI Lab, Shanghai and a large team of collaborators (see this paper.)  These two packages are complimentary and interoperate nicely.   NetworkX  is an excellent took for build graphs, traversing them with standard algorithms and displaying them.  DGL is designed to integrate Torch deep learning methods with data stored in graph form.   Most of our examples will be derived from the excellent DGL tutorials.

To begin let’s build a simple graph with 5 nodes and a list of edges stored in a file ‘edge_list_short.txt’.  (the complete notebook is stored in the archive as  basics-of-graphs.ipynb.  We will label the nodes with their node number. 

In the figure below we have the content of the edge list file on the left and the output of the nx.draw() on the right.

Figure 1. A simple NetworkX graph on the right derived from the short edge list on the left.

As you can see this is a directed graph.  There is an edge from 4 to 0 but not one from 0 to 4.   NetworkX has a very extensive library of standard algorithm.  For example, shortest path computation and graph cycle detection.  

Graph nodes and edges can have properties that are identified with a property name.   For example we can add an edge property called “weight”  by iterating through the edges and, for each assign a value to the property.   In a similar manner we can assign a “value” to each node.

DGL graphs

It is easy to go back and forth between a NetworkX graph and a DGL graph.   Starting with our NetworkX graph G above we can create a DGL graph dG as follows.

Notices that the properties “weight” and “value” are translated over to the DGL graph and Torch tensors.  We can add node properties directly or using some DGL functions like out_degrees and in_degrees.   Here we assign a value of 1 to a property called “one” to each node and the in degree of each node to a property called “deg”.

DGL Message Passing.

One of  the most important low-level feature of DGL is the process by which messages are passed from one node to another.  This is  how we manage operations like convolutions and transformers on graphs.

Messages are associated with edges.   Each message takes a value from the source node  of the edge and delivers it to the “mailbox” of the destination of the edge.   Reduce functions empties the mailboxes and does a computation and modifies a property of the node.   For example, the following message function just sends the 2 times the value “one” from source of each edge to the destination.  After the messages are sent each node mailbox (labeled ‘x’) now has a message containing the value 2  from the nodes that have out going edge to that node.  The reduce functions add up all of those 2’s (which would total twice the in-degree of that node.  It then adds the in-degree again, so the value is now 3 times the in-degree of the node. 

To execute these functions, we simply register the functions with our graph and invoke them with broadcast send and receive messages (with an implicit barrier between the sends and receives.)

A shorthand for the send, receive pair is just dG.update_all().   The notebook for the examples above is in basics-of-graphs.ipynb in the github archive

The MS co-author Graph

Microsoft provided a graph based on their Microsoft Academic Graph from the KDD Cup 2016 challenge 3.  It is a graph of 18333 authors. Edges represent coauthors: two nodes have an edge if the authors co-authored  a paper.  Node features represent paper keywords for each author’s papers, and class labels indicate most active fields of study for each.  There are 6805 keywords.  The class labels are ‘bioinformatics’, ‘machine learning’, ‘computer vision’, ‘NLP’, ‘graphics’, ‘networks’, ‘security’, ‘databases’, ‘datamining’, ‘game theory’, ‘HCI’, ‘information theory’, ‘medical informatics’, ‘robotics’, ‘theoretical_cs’.   There are two ways to get at the data.   One way is from the DGL data library.

The other way is to pull the raw data collection.

There are two parts of the cofull file that are important to us.   One is the array class_names and the other is the array of attributes which are the text phrases associated with each paper.

As mentioned above, the author node label is an assignment based on most likely topical research area based on publications.   In the following section we will use a graph convolutional network to predict the label. 

The co-author graph coau has a feature vector “feat”  that describes the features for each node.  This alone should be enough to predict the label.

For each node this is a vector of length 6805 with a non-zero entry for each feature attribute associated with that author. 

Using the cofull[‘attr_names’] list we can show what these attributes are as text.   It is an interesting exercise to see what attributes are common to neighboring nodes.   In the following we convert our coau graph to a NetworkX graph and look at the neighbors of node 0.   Then we extract the features that several of the neighbors share with node 0. We can use the fact that the feature vector is a tensor so multiplying them together we get the intersection of the common words.

The way to read this result is that author 0  has co-authored papers with authors 5111, 12716 and 12963.  Collectively they share the following phrases in their research profile.

We can create  a new  feature vector for each author node by extracting the list of key words on the ‘feat’ vector and constructing a Gensim tagged document.

 Then, using Gensim’s Doc2Vec we can create a model of vector embedding of length 100 for each of these tagged documents.  These are  saved in our graph in a node data feature “e-vector”.  

The document model does a reasonable job of separating the nodes by class.  To illustrate this we can use TSNE to project the 100 dimensional space into 2 dimensions.   This is shown below for a subset of the data consisting of the nodes associated with topics ‘bioinformatics’, ‘machine learning’, ‘robotics’, ‘theoretical cs’, ‘networks’, ‘security’ and ‘databases’.  The nodes associated with each topic are plotted with the same color.  As you can see the topics are fairly well defined and clustered together.   We will use this 100 dimension feature vector in the convolutional network we construct below.

The full details of the construction of the construction of the graph, the Doc2Vec node embedding and this visualization are in the notebook fun-with-cojauthors.ipynb

Figure 2.  Subset of MS co-author graph projected with TSNE.

The Graph Convolutional Network

In computer vision applications, convolutional neural networks are based on applying filters to small local regions of each pixel.  In the simplest case the template just involves the nearest neighbors to the central pixel and the template operation  is a dot product with the pixel values of these neighbors. 

Figure 3.   Convolution operations in an image and on a graph.

As shown in the figure, the graph version is the same, and we can use the DGL message system to do the computation.     The standard DGL graph convolutional layer is shown below.

We now create a network with three GCN layers with the first layer of size 100 by 50 because 100 is the size of our new embedded feature vector we constructed with Doc2vec above.  The second layer is 50 by 32 and the third is 32 by 15 because 15 is the number of classes.

We employ a relu activation between the GNC layers.   To train the network we divide the nodes into a training set and a test set by building a Boolean vector that is True when the node is in the test set and False when not.   To evaluate the network  on the test and train set we involve the following

The evaluate function first applies the network to the graph with the entire feature set.  This results in a tensor of size 18333 by 15.  This is a vector of length 15 for each graph node.  The index of the maximum value of this vector is the predicted index of the node.

To train the network we use a conventional Adam optimizer.   When we extract the logits array from the network we compute the log of the soft max along the length 15 axis and compute the loss using the negative log likelihood loss function.   The rest of the training is completely ordinary pytorch training.

The training converged with an accuracy of 92.4%.    Evaluating it on the test_mask gave a value of 86%, which is not great, but it does reflect the fact that the original graph labels are also a “subjective” valuation of the authors primary research interest. 

Another way to look at the results is to plot a confusion matrix as shown below 

Figure 4.   Confusion Matrix for prediction of topics by means of Graph Convolution Network

It is clear from looking at this that the label datamining is easily confused with databases.  And the label machine-learning is often confused with computer vision and datamining.    In the notebook co-author-gnc.ipynb we show how to create this matrix. 

Heterogeneous Graphs

Knowledge Graphs are the basis of much of modern search engines and smart assistants.  Knowledge Graphs are examples of directed heterogeneous graphs where nodes are not all of one type and edges represent relations.   For example, the statement “Joan likes basketball” might connect the person node “Joan” to the sports node “basketball” and the edge from Joan to basketball would be “likes”. 

A really nice and often studied example is the ACM publication dataset.   This is a collection of 12499 papers and 17431 authors and 37055 links between them.   This is still small enough to load and run on your laptop.   (We will discuss a much larger example later.)

The data set consists of a set of tables and sparse matrices.   The tables are ‘A’ for authors,  ‘C’ for conferences, ‘P’ for papers, ‘T’ for subjects (the ACM classification), ‘V’ for  proceedings and ‘F’ is for institutions.  The sparse matrices are like ‘PvsA’ which means the authors of a given paper,  ‘PvsP’ signifies a paper citing another, ‘PvsL’ is paper is of this topic.   Downloading and creating a DGL heterograph for a subset of this data is simple.

We have created a graph with three node types: authors, papers and subjects as illustrated in Figure 5 below.

Figure 5. Heterogeneous Graph with nodes for Subject, Paper, Author and links of various types.

The edge types are the link keywords in the triple that is used to identify the edges.   If we want to find the name of an author node we have to do a search in the data table.  That is easy enough.  The notebook for this example has such a trivial function:The edge types are the link keywords in the triple that is used to identify the edges.   If we want to find the name of an author node we have to do a search in the data table.  That is easy enough.  The notebook for this example has such a trivial function:

We can use the link labels to pull out a subgraph.  For example, the subgraph “authors writing paper” can be extracted by

We can then ask which papers by Dongarra are included in the data set by looking for the out edges from node 5100.  Having the id of the end of the out edge we can get the abstract of the paper (truncated here).

Having this information, we can use the written-by subgraph to get the names of the coauthors.

Or the subject of the paper.

Where C.4 is the cryptic ACM code for performance of systems.

Node classification with the heterogeneous ACM graph.

The classification task will be to match conference papers with the name of the conference it appeared in.   That is, given a  paper that appeared in a conference we train the network to identify the conference.   We have this information in the data[‘PvsC’] component of the dataset, but that was not included in our graph.   There are two solutions we have considered.   One is based on tranformers and is described in the excellent paper “Heterogeneous Graph Transformer”  by Ziniu Hu, et. al.    A notebook version of this paper applied to the ACM graph is called hetero-attention.ipynb in the Github archive.   It is based on the code  in their Github archive.  

In the following paragraphs we will describe the version based on “Modeling Relational Data with Graph Convolutional Networks”  from Schlichtkrull et. al.  which is included in the DGL tutorials.   The idea here is a version of the graph convolutional network.  Recall that for a graph convolution each node receives a message from its neighbors and we then form a reduction sum of those messages. However for a heterogeneous graph things are a bit different.   First we are interested in only one node type (paper) but each paper has several incoming edge types:  author-writing-paper,  paper-citing-paper, and subject-has-paper.  Hence to use all the information available we must consider incoming messages from other papers as well as authors and subjects.  For each edge type r we have a fully connected trainable layer W.  Then for each node i and each edge type r we compute

Where hj(l) is the message from the jth r-neighbor and (l) refers to the  lth training iteration.   Now we sum that over each edge type and apply an activation function to get

The code for this layer is below.

Notice that the forward method requires a feature dictionary for each node type because there are different number of nodes of each type.   We construct the full network using two layers of HeteroGCN layer as follows.

We can create our model with the following.

Where we have 4 output logits corresponding to the 4 conferences we are interested in (SOSP, SODA, SIGCOM, VLDB).  We next select the rows of “PvsC” that correspond to these 4 conferences and build a train and test set from that.  (The details are in our notebook GCN_hetero.ipynb .)  The core of the training is quite simple and conventional.

Notice that the convolutional operations are applied across the entire graph,  but we only compute the loss on those node associated with our 4 selected conferences. The algorithm converges very quickly and and the accuracy on the test set is 87%.   The confusion matrix shown below shows that sigcom is harder to differentiate from the others, but the size of our data set is extremely small.

As mentioned above, we have run the same computation with the heterogeneous graph transformer algorithm.   In this case the computation took much longer, but the results were slightly better as shown in the confusion matrix below.

The graphs we have presented as examples here are all small enough to run on a laptop without a GPU.  However, the techniques presented here are well suited to very large graphs.   For example Amazon’s AWS team we have a large scale graph based on Wikimedia with the Kensho Derived Dataset which is based on over 141 billion wikidata stements and 51 billion wikidata items.  The AWS team focuses on graph embeddings at scale (see this blog) for this dataset. The have a notebook for how to access and use this data.  The also use it for the Drug Repurposing Knowledge Graph (DRKG) to show which drugs can be repurposed to fight COVID-19.

Graph Classification:  The Tox21 Challenge

In the case of life science, Amazon has also contributed a package dgllife DGL-LifeSci.  To illustrate it, we will load data associated with the Tox21 challenge from the US National Institutes of Health.   This data challenge studies the potential of the chemicals and compounds being tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects.  The dataset consists of 7831 chemical compounds that have been tested against a  a compound toxicity screening classification on Nuclear Receptor Signaling (NR) and Stress Response (SR)] involving 12 different assay targets: Androgen Receptor (AR, AR-LBD), Aryl Hydrocarbon Receptor (AhR), Estrogen Receptor (ER, ER-LBD), Aromatase Inhibitors (aromatase), Peroxisome Proliferator-activated receptor gamma (ppar-gamma), Antioxidant Response Element (ARE), luciferase-tagged ATAD5 (ATAD5), Heat Shock Response (HSE), Mitochondrial Membrane Potential (MMP), and Agonists Of The P53 Signaling Pathway (P53).  The table below show the result for each molecule the results of the target tests.  The data is not complete, so in some cases the result are NaNs and must be excluded.   As you can see below for molecule 0 there was a positive response involving receptor NR-AhR and Sr-ARE, and the Aromatase and NR_ER data was bad. 

Table 1.

We can load the full data set, which includes the data above and a full description of each compound.  In addition, we can load a pretrained graph convolutional network that attempts to predict the toxicology screen process described in the table above.  

To extract the data associated with compound 0, we simply query  the dataset.  This returns a chemical description string (known as a smiles string), a DGL graph g, a label corresponding to the row in the table above and a mask which tells us which of the target NaN values to ignore.

In this case the graph indicates that the chemical has 16 atoms and 34 bonds.   Associated with each atom is a vector of 74  values.  The meaning of these values is given below.

Using the smiles string and a small helper function in our notebook dgllife.ipynb called display_mol, we can see what our compound looks like.

We can also draw the graph by converting it to a network graph that is graph isomorphic to the smiles graph.

As we can see from table 1 that positions 3 and 4 are no good, so using the mask we can apply the model to classify our compound.

If we compare the result to the label which constitutes the correct values we see that the strongest signal is in position 5 which, after dropping two masked positions, corresponds to label position 7 in the label, which is correct. We should also get a strong signal in position 3, but we see the model gives us an even stronger signal in position 4.

Let’s look at a different compound: #7827 in the data set.  This one has 23 atoms and 53 bonds.  

In this case, when we run the model get the following.

If we use the score the prediction  that positive numbers indicate positive reactions (1)   and negative number the opposite(0), then he predicted values are positive on the first two positions, negative on the next two and positive on the following two.   This are all correct except for the last positive.   The remaining three unmasked values are negative and that is correct.  Out of the 9 unmasked values 8 are correct, so we can give this one a “score” of 89%.   If we apply this criteria to the entire dataset we get a “score” of 83%.   We did not train this model, so we don’t know what fraction of the data was used in the training, so this number does not mean much.

Conclusion

The tutorial above just scratches the surface of the work that has been done on deep learning on graphs.  Among the topics that are currently under investigation include managing the updates of dynamic graphs. Of course the most interesting topic these days is the construction and use of large knowledge graphs.  The big challenge there is to use a knowledge graph to do question answering.  An interesting example there is Octavian’s clevr-graph which can answer questions about the knowledge graph built from the London Underground.     Extracting knowledge from KGs involves understanding how to extract key phrases from English queries and how to map those phrases to KG nodes and edges.   This is a non-trivial task and there is a lot of work going on. 

All of the notebooks for the examples described here are in Github https://github.com/dbgannon/graphs

Some Addition References

The following are research papers and blogs that I found interesting in writing this tutorial.

Predicting Tomorrows Temperature with RNNs and Gaussian Processes

Abstract

This note provides a gentle introduction to streaming data regression and prediction using Recurrent Neural Networks and Gaussian processes.  We look at two examples.  In the first we look at daily high temperature for two years at three different, but nearby NOAA weather stations. In the second example we look at daily confirmed infection of the Coronavirus in several US states.  This study shows that if you have a predictable natural pattern that changes in predicable ways from season to season one can make reasonable predictions, but in the case of complex phenomenon such as the Covid19 pandemic, simple models  do not always work well.

 In a previous post we looked at anomaly detection in streaming data using sophisticated cloud based tools, but here we use simple “classic” ML tools that you can run on a laptop.  We expect the seasoned data scientist will not get much out of this article, but if the reader is not familiar with recurrent networks or Gaussian processes this may be a reasonable introduction.  At the very least, if you are interested in studying the Coronavirus infection data, we show you how to search it using Google’s BigQuery service.

Building a Very Simple LSTM Recurrent Neural Network

Recurrent Neural Networks were invented to capture the dynamic temporal behaviors in streams of data.  They provided some early success in areas related natural language understanding, but in that area they have been superseded by Transformers, which we discussed in an earlier article.   We will construct an extremely simple recurrent network and ask it to predict the temperature tomorrow given the temperature for the last few days.

We will train it on the average daily temperature over 2 years measured at three nearby weather stations in the Skagit valley in Washington state.   The data comes from the Global Summary of the Day (GSOD) weather from the National Oceanographic and Atmospheric Administration (NOAA) for 9,000 weather stations.  The three stations are Bellingham Intl, Padilla Bay Reserve and Skagit Regional.  The data is available in Google’s BigQuery service named “bigquery-public-data.noaa_gsod.stations”. (We will show how to search BigQuery in the next section.)

Averaging the daily temperature for 6 years (2 years each for 3 stations) we get data like Figure 1 below.

f1

                           Figure 1.  Average daily high temperature for 2 years at 3 stations.

RNNs work by having a “memory” state tensor that encodes the sequence of inputs that have been seen so far.  The input to the RNN is a word or signal, along with the state of the system based on words or signals seen so far; the output is a predicted value and a new state of the system, as shown in figure 2 below.

f2

Figure 2.  Basic Recurrent Neural network with input stream x and output stream h

Many variations of the basic RNN exist. One challenge for RNNs is ensuring that the state tensors retain enough long-term memory of the sequence so that patterns are remembered. Several approaches have been used for this purpose.  One popular method is the Long-Short Term Memory (LSTM) version that is defined by the following equations, where the input sequence is x, the output is h and the state vector is the pair [c, h].

f3

Where sigma is the sigmoid function and W is the leaned tensor.   We are not going to use these equations explicitly because PyTorch has a built-in version that we will use.  All we need to do is provide the dimension of the input (which is a sequence of scalar values, so that is 1) and the dimension of the state vector c_h is a tuple of tensors of size (arbitrarily chosen here to be) 100. We also have linear layer that will map the final value of h down to the one-dimensional output.  Our simple LSTM neural network is

f4

It is important to understand the role of the input x as a sequence.   We want to train this network so that if we set x to be a sequence of consecutive daily high temperatures [t1, t2, … tn] then the forward method will return [tn+1].  To accomplish this training task we build a training set by scanning through the weather record andcreating tuples of sequences of length tw and the next-day temperature as a training label. This is accomplished with this function

f5

From this we create a list of such tuples called train_inout_seq and the training loop takes the form

f6

The complete details are in the notebook lstm-for-stream-final  in the Github repository.  This was trained on the average year and one the six yearly record.   The results are shown below.  The first two are the years that were training cases.  The original data is printed in blue and the predicted data is printed in orange.   As you can see virtually no blue shows.  The network has memorized the training data almost perfectly with an average daily error that is less than 0.23 degrees Fahrenheit.

f7

f8

Figure 2.  The results for the average of all the data (Figure 1 above) and for one of the individual stations. The LSTM network has memorized the training data.

In Figure 3 below we show the results when applied to two of the weather records for two of the other stations.   In these cases, the results are not very impressive.  In both cases the error average error was over 3.5 degrees and it was greater than 10 degrees for more than a dozen days.  However, the predictions for one day ahead did track the general trends.   It looks like it was able to predict todays temperature better than tomorrows.

f9

f10

Figure 3.   Predicting tomorrows temperature on two other station records.

Doing Regression and Prediction with Gaussian Processes

Before we define Gaussian Processes let us point to Christopher Bishop’s amazing book “Pattern Recognition and Machine Learning” (Springer 2006) for a complete treatment of the subject.  We will only provide a superficial introduction here.   For on-line resources there is the excellent blog on the subject by Peter Roelants.   We will use many of the code bits from that blog in what follows.  Another fun on-line source for learning about Gaussian Processes is the blog A Visual Exploration of Gaussian Processes by Görtler, Kehlbeck and Deussen.

In the simplest terms, a Gaussian Process is a statistical distribution of functions with some special properties.   In our case, the functions will represent the time evolution of stochastic processes. For example, the temperature at some location as a function of time,  or the number of daily infections of a virus in a community, or the random walk of a particle suspended in a fluid.

The distribution that defines a Gaussian Process are characterized by a mean function u(x) and a covariance function k (x, x).   A function f(x) drawn from this distribution, which is written

f11

has the property that if when we pick a finite set of time points  X = { x1 … xn } which we view as n random variable, the values y=f(X) are normally distributed in a multivariate Gaussian distribution with mean u(X) and a covariance matrix by a kernel function k(X, X).  Written another way,

f12

Given the covariance kernel function k() and mean function u() we can use this multivariant distribution to visualize what functions drawn from the Gaussian distribution look  like.   Let us pick 300 points on the interval [0,2] and a specific kernel (which we will describe later) and a mean function with constant value 1.   The following  numpy function will allow us to draw 10 sample functions

f13

As shown in Figure 4 below they appear to be like random walks but they also appear to be not only continuous be also smooth curves.  That is because nearby points on the x axis correspond to highly correlated random variables due to the choice of k().  If we had set Σ to be the identity matrix the variables at neighboring points would be independent random variables and the path would look like noise.  (We will use that fact below.)

f14

Figure 4. 10 different functions drawn from the Gaussian process.

Now for the interesting part.  What if we have some prior knowledge of values of y for a sample of x points?   We can then ask what is then the  distribution of the  functions?

View the n points on the time axis as n random variables.   Partition them into two sets X1 and X2 where we are going to suppose we have values Y1 for the X1 variables.  We can then ask for  the posterior distribution p(Y2 | Y1 , X1 , X2 ).  Reordering the variables so that X1 and X2  are contiguous the equation takes the form

f15

where

f16

One can prove that our condition probability distribution p(Y2 | Y1 , X1 , X2) is also a multivariate normal distribution described by the formulas

f17

The proof of this is non-trivial.  See this post for details.   The good news here is we can calculate this if we know the prior kernel function k() and mean m().  Picking these function is a bit of an art.  The usual way to do this is to pick k() so that m(x) = 0 so that u2 and u1 in the above are 0.   Picking the kernel is often done by forming it as a linear combination of well-known standard kernel function and then formulating a hyper-parameter optimization problem to select the best combination.

To illustrate this, we can return to the weather station data.   We have two years of data from three nearby stations.    We note two properties of the data we must exploit: it is noisy and approximately periodic with a period of 365 days.   We will not bother with the optimization and rather take a straightforward linear combination of three standard kernels.

f18

The first of these is the exponential quadratic and it is a very good, default kernel.   The second is the white noise kernel where the parameter sigma gives us the standard distribution of the noise we see in the data and the third is the periodic kernel which if we map our 365 days onto to the unit interval we can set p = 1.   Our kernel of choice (chosen without optimization, but because it seems to work o.k.) is

f19

Where for the first two terms we have set sigma to one and we pick the sigma for the noise term to best fit the data at hand.   The figure below illustrates the result of using the average of the six instrument years as the raw (prior) data.   Then we select 46 points in the first 230 days (spaced 5 days apart) as our X1 days.

In the figure the red dots are the points and the red line is u2|1 conditional mean function.  Three additional lines (blue, green and yellow) are sample function drawn from the posterior.  The pink zone is two sigma of standard deviation in the prediction.   We also calculated the error in terms of average difference between the mean prediction and the raw data.   For this example, that average error was 3.17 degrees Fahrenheit.  The mean function does a reasonable job of predicting the last 130 days of the year.

f20

Figure 5.   Graph of the raw data,  the mean conditional u2|1 (red line), and three additional functions (blue, yellow and green) drawn from the posterior.

The full details of the computation are in the jupyter notebook “Gaussian-processes-temps-periodic”,  but the critical function is the one that computes p(Y2 | Y1 , X1 , X2) and it is shown below (it is taken from Roelants blog)

f21

In this case we invoked it with kernel_function  as keq + kp.   Sigma_noise was 0.3.   The clever part of this code was the use of  a standard linear algebra solver to solve for Z in this equation

f22

But because Sigma11 is symmetric and the transpose of Sigma12 is Sigma21 we have

f23

Once you have that the rest of the computation is accomplished with the matrix multiply (@) operator.

In the notebook Gaussian-process-temps-periodic in the Github repository you can see the Gaussian processes for the six year samples.

The Coronavirus Data

Another interesting source of data comes from the daily confirmed cases of coronavirus infections in various states.   We shall see that the troubling recent growth rate is so large that it is very hard for our Gaussian process models to make predictions based on recent past samples.  However, we thought it may be of value to illustrate how to obtain this data and work with it.

The Covid-19 data is in the Google cloud uploaded from the New York times. To access this you must have a google cloud account which is free for simple first-time use.   We will run google’s bigquery to extract the data and we will run it through a client in a Jupyter  notebook.    You will need to install the bigquery libraries.   A good set of instructions are here.    To use Jupyter go here . You will need to add the json package containing you service account key to your environment variables and described here.  Finally install the local libraries with this command on your machine.

pip install –upgrade google-cloud-bigquery[pandas]

First load the bigquery library and create the client in a Jupyter notebook with the following

f24

There are a number of covid19 data sets available on BigQuery.   The one we will use is the New York Times collection.   The following query will request the data for the state of Washington, load it into a Pandas dataframe and print it.

f25

f26

 In our notebook bigquery-Covid we have the code that will extract the number of cases per day so that we can fit the Gaussian process to that.    That data is stored in the array ar_wash.   We attempted to make predictions with a sample every 9 days until the last 10 day.  Because of the large range of the data we scale it down by a factor of 1000.   The result is shown below.   The function make_gaussian is the same one we used for the weather station data except that the kernel is only the exponential quadratic.

f27

As can be seen the mean function (red line) capture features of the last 10 days reasonably well.   Looking at New York we see similar results, but the fit for the last few days is not as good.

f28

Where we fail most spectacularly is for those states that experienced a wave of new cases on the first week of July.  Here is Florida.

f29

Changing the prediction window to the last 3 days does a bit better.  But 3 days is not much of a prediction window.

f30

However, it is clear that a much more complex process is going on in Florida than is captured by this simple Gaussian process model.   The approach presented here is not a proper infectious disease model such as those from Johns Hopkins and IHME and other universities.  Those models are far more sophisticated and take into account many factors including social and human behavior and living conditions as well as intervention strategies.

Conclusion

As was pointed out in the introduction, this is a very superficial look at the problem of predicting the behavior of streaming data.  We looked at two approaches; one that focuses on accurate prediction of the next event using neural networks and one that attempts to capture long range statistical behavior using Gaussian process models.  The neural net model was able to learn the temperature patterns in the training data very well, but for test data it was much less accurate with average error of about 3- or 4-degrees Fahrenheit per day.   (This is about as good as my local weather person).   On the other hand, the Gaussian process made very good long range (over 100 days) predictions with only a small number of sample points.  This was possible because the Gaussian process model works well for patterns that have reasonably predictable cycles such as the weather.   However, the Gaussian process failed to capture changes in the more complex scenario of a viral infection where the dynamics changes because of social and human behavior or by evolutionary means.

If the reader is interested in extracting the data from Google’s BigQuery,  we have included the detail here and in the notebooks in the repository https://github.com/dbgannon/predicting_streams.

Notes on Deep Learning and Differential Equations.

Over the last two years  some very interesting research has emerged that illustrates a fascinating connection between Deep Neural Nets and differential equations.    There are two aspects of these discoveries that will be described here.  They are

  1. Many differential equations (linear, elliptical, non-linear and even stochastic PDEs) can be solved with the aid of deep neural networks.
  2. Many classic deep neural networks can be seen as approximations to differential equations and modern differential equation solvers can great simplify those neural networks.

The solution of PDE by neural networks described here is largely the excellent work of Karniadakis at Brown University and his collaborators on “Physics Informed Neural Networks” (PINNs).   This work has led to some impressive theory and also advances in applications such as uncertainty quantification of the models of subsurface flow at the Hanford nuclear site, one of the most contaminated sites in the western hemisphere.

While the work on PINNs shows us how to use neural networks to solve differential equations, the second discovery cited above tells us how modern differential equation solvers can simplify the architecture of many neural networks.  A team led by Chen , Rubanova , Bettencourt, Duvenaud at the University of Toronto, Vector Institute examined the design of some neural networks and noticed how their architecture resembled discretizations of certain differential equations.  This led them to define a hybrid of neural net and differential equation they call a “Neural Ordinary Differential Equation”. Neural ODEs  have several striking properties including excellent accuracy and greatly reduced memory requirements.

These works suggest that there exists a duality between differential equations and many deep neural networks that is worth trying to understand.  This paper will not be a deep mathematical analysis, but rather, try to provide some intuition and examples (with PyTorch code).  We first describe the solution of PDEs with neural networks and then look neural ODEs.

PINNs

We now turn to the work on using neural networks to solve partial differential equations.  The examples we will study are from four papers.

  1. Raissi, Perdikaris, and Karniadakis, “Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations”, Nov 2017, is the paper that introduces PINNS and demonstrates the concept by showing how to solve several “classical” PDEs.
  2. Yang, Zhang, and Karniadakis, “Physics-Informed Generative Adversarial Networks for Stochastic Differential Equations”, Nov 2018, addresses the problem of stochastic differential equations but uses a generative adversarial neural network.
  3. Yang, et. all “Highly-scalable, physics-informed GANs for learning solutions of stochastic PDEs”, oct 2019, is the large study that applies the GAN PINNs techniques to the large scale problem of  as uncertainty quantification of the models of subsurface flow at the Hanford nuclear site.
  4. Shin, Darbon and Karniadakis, “On the Convergence and Generalization of Physics Informed Neural Networks”, 2020, is the theoretical proof that the PINNs approach really works.

We will start with a very simple partial differ that is discussed in the Rassi paper.  Burgers’ Equation is an excellent example to study if you want to understand how shock waves can come about from relatively benign initial conditions.  The equation is

f1

Were the domain of x is the interval [-1, 1] and the time variable t go from 0 to 1.   The initial, t=0, condition is u(0,x) = -sin(pi*x). and the boundary conditions are u(t,-1) = u(t,1)=0.  The parameter v is 0.01/pi.   (If we set v = 0 then this is the inviscid equation which does describe shock waves, but with v >0 the equation is called the viscus Burgers’ equations.  While not technically describing discontinuity shock as t goes forward in time, it is awfully close.    The figure below shows the evolution of the value of u(t,x) from the initial sine wave at t=0 to a near discontinuity at t=39.

f2

Figure 1.  Samples of u(t, x) for x in [-1,1] at 40 points and t in 4 points.

The basic idea of a physics informed neural net is,  in this case, a network defining a map

f3

This network, when trained, should satisfy both the boundary condition and the differential equation.   To satisfy the differential equation we must be able to differentiate the network as a function of x and t.  But we know how to do that! We compute the derivatives of a network symbolically when we do the back propagation for training.   Neural nets are non-linear functions with first derivatives.   In our case we also need the second derivatives which means that we cannot use ReLU as an activation function because it is piecewise linear and has second derivatives that are all zeros.   However Tanh is smooth with fine second derivatives so we will build our net function as follows.

f4

Our goal is to train an instance of this network to minimize the functions

f5

The function f in Torch is

f6

Where we are using the grad function from torch.autograd  to compute derivatives.  The function flat(x) just returns the list [x[i] for i in range(x.shape[0)].

The unique thing here is that we are training the network to satisfy the differential equation and boundary conditions without data samples from the solution u.  When converged the network should give us the solution u.  This works because Burger’s equation, in this form, has been proven to have a unique solution,  Of course, this raises the question: when are a differential equation and boundary conditions sufficient to guarantee the existence and uniqueness of a solution?  Simple linear differential operators do, but not all differential equations.  In the case of PINNs, one can speculate that if our laws of nature are correct, then nature proves the solution exists. (Thanks to Wolfgang Gentzsch and Joseph Pareti for making me realize this point needed to be made.)

To train the network we iterate over 200000  epochs  using two optimizers (one for the boundary and one for the function differential equation function f).   For each epoch we randomly draw a batch of boundary samples and a batch of samples drawn from the interior of the [0,1]x[-1,1] rectangle.   Each sample is a tuple consisting of a t-value, an x-value and a u boundary value or zero.  (Recall we are forcing f(t, x) to zero and bndry to zero or u(0,x).)  The main part of the code is below.  The full details are in the github repository.

f7

A python library giving us an approximation of an exact solution is available on line that we can use to compare to our result. Figure 2 below shows the heatmap of the solution.

f8

Figure 2.  The horizontal axis is time (t in [0,1]) sampled at 40 points and the vertical axis is x in the interval [-1,1] sampled at 40 points.  Dark colors are u values near 1 and white are u values near -1.

The mid horizontal line shows the evolution of the shock as it becomes a sharp transition as t goes from 0 to 1 on the right.

A better view can be seen from the evolution of the solution from the initial condition.  This is shown in Figures 3 and 4.  It is also interesting to note that there is a subtle difference between the neural net solution and the approximation to the exact solution.

f9

Figure 3.  The x axis x in the interval [-1,1] at 40 points and the y axis is u(t,x).  Each line represents u(t, x) for a specific value of t.   The blue line is the initial condition t=0.  Subsequent lines show the evolution to the singularity.

f10

Figure 4.   A 3D view showing the evolution of the sine wave on the left to the sharp shock on the right edge.

Nonlinear Boundary Value Problems

The classical elliptical PDE takes the form

f11

where u is defined on a region in x-y space and known on a bounding edge.   A good physical example is a steel plate that is clamped and temperature controlled on perimeter and f represents heat applied to the surface.  u will represent the temperature when it reaches a steady state.

For purposes of illustration we can consider the one-dimensional case, but to make it interesting we will add some non-linear features.  In addition to the function f(x),  let us add another known function k(x) and consider

f20

To show that the technique works, we will consider special cases where we know the exact solution.  We will let k(x) = x and u(x) = sin2(x).    Taking the derivative and applying a few trig identities we get f(x) as

f13

Where we will look for the solution u on the interval [0, pi] subject to the boundary condition u(0) = u(1) = 0.  The second case will be similar, but the solution is much more dynamic: u(x) = sin2(2x)/8 with the same boundary conditions.   In this case the operator is

f14

The network is like the one above but with only one input parameter

f15

The differential operator that is the left side of the equation above is computed with the function Du(x) shown below.

f16

Training the network is straightforward.  As with the Burgers example, we use a mean squared error lost function and gradient descent optimizer. We have an array vx of x-axis points from which batches of samples are drawn.   For each point x in each batch we also create a batch consisting of f(x).  There is only one batch for the boundary.   In this case, the boundary is only two points, so it is represented by one batch.

f17

Running this for 3000 epochs with batch of size 10 will converge the network.   The result for the 1st case (u = sin2(x) ) is extremely close to the exact solution as shown in figure 5.   Both the true solution (a blue line) and the computed solution (an orange line) are plotted together.   They are hard to distinguish.   Below the solution we plot the differential equation f(x) which is what the networked was trained to learn and it is not surprising the fit is good there too.

The second case (u = sin2(2x)/8 ) is more dynamic.    The differential equation f(x) fit is excellent but the solution is shifted up (because the boundary condition was off on one end.

f18

Figure 5.   Solution to the differential equation d/dx(x du/dx) = f(x)

Stochastic Differential Equations and Generative Adversarial Nets

If we consider the differential equation from the previous section

f20

but now make the stipulation that the functions k(x) and f(x) are Gaussian processes rather than fixed functions, we now have a stochastic differential equation.   Papers 2 and 3 mentioned above show us how to create a generative adversarial network to solve for the Gaussian process that defines u.  they do not treat this as an initial value, boundary value problem as described above, but rather the authors assume we have a series of snapshots  ( k(xi), u(xi), f(xi) ) for i in 1, … n for some reasonably large n.  With these we will set up a GAN to find a representation of u.

Recall how a GAN works.  We have two networks, a generator, and a discriminator.  The generator is trained to transform a normal distribution into a distribution that fits your data.  The discriminator is trained to recognize your true data and distinguish it from the fake data from the generator.  This is illustrated in Figure 6 below.

f21

Figure 6.  Basic GAN configuration.

It is relatively easy to build a GAN that can reproduce u from samples of x, k(x) and u(x).  Unfortunately, making a GAN that looks like u(x) does not mean it satisfies the differential equation.  Taking the simple case where k(x)=x, a GAN base based on the design above generated the result in Figure 7 below.  On the left is a plot of the distribution of the converging solution for random normally distributed samples X over the true solution in blue (see right half of figure 5).  However, when we apply the differential operator to the generated solution we see it (multi-colored dots) is nothing like our the true solution operator (blue).

f22

Figure 7.  GAN from figure 6 distribution of normally distributed sample X and, on right differential operator applied to gen(X)

To solve this problem the GAN discriminator must evaluate both u(x) AND the result of the differential operator f(x) = Du(x, k(x), u) as show in Figure 8.

f23

Figure 8.  Full GAN to map normal distribution to solution that also satisfies the differential equation.

The construction of the GAN generator is straight forward.  It takes inputs as batches of samples from a normal distribution and generates a batch of points in R2 that are eventually trained to represent samples from (x, u(x)).   As for our previous examples we use Tanh () for the activation function.

f24

The discriminator takes triples of the form (xi, u(xi), f(xi)) from the samples.  (Note we are using k(x)=x, so there is no need for that argument.   We can use ReLU for the activation because we do not need second derivatives of the discriminator.

f25

The torch version of the differential operator is

f26

The discriminator wants inputs of the form

f27

Which are provided by the samples (xi, u(xi), f(xi)) and from the generator in the form

f28

The training algorithm alternates between optimizing  log(D(G(z))) for the generator and optimizing log(D(x)) + log(1 – D(G(z))) + a gradient penalty for the discriminator.   The gradient penalty is introduced by Yang, Zhang, and Karniadakis in paper 2.  The full code for this example is in the Github repository.  Figure 9 illustrates the convergence of the solution in for snapshots (A, B, C, D) with step D taken at 200000 epochs.  It is interesting to observe that image of the 1-D normal sample space in R2 takes to form of a path that gradually conforms to the desired distribution.  However at step D there is still unfocused resolution for x > 2.6.

f29

Figure 9.   Four snapshots in time (A, B, C, D) of the convergence of the GAN.

We should also note that this is not a true stochastic PDE because we have taken our samples for the training from the exact solution and are not Gaussian sources, but the concepts and training are correct.

A much more heroic and scientifically interesting example is in paper 3.   The authors of this paper address the problem of modeling the subsurface flow at the Hanford Site in Washington State where the reactors to generate plutonium for the US atomic arsenal.   The 500 square mile site had nine nuclear reactors and five large plutonium processing complexes and produced 3 million US gallons of high-level radioactive waste stored within 177 storage tanks, an additional 25 million cubic feet of solid radioactive waste.  And over the years the tanks have been leaking.   Hanford is now nation’s largest environmental cleanup project.   The team involved with paper 3 is from Brown University, Lawrence Berkeley National Lab, Pacific Northwest National Lab, Nvidia, Julia Computing and MIT.  They look at a large 2-D version of the PDE discussed above.   In this case u(x,y) is hydraulic head of the subsurface flow, k(x,y) is depth-averaged hydraulic conductivity and f is the infiltration from the earth to the flow.   These quantities are measured by sensors at a large number of sites on the Hanford reservation.

Because k and u are both stochastic and are supported only by data at the sensor points the generator in their GAN must produce both k and u (actually  log(k) and u) as shown in Figure 10.

f30

Figure 10.  GAN architecture from Yang, et. al, “Highly-scalable, physics-informed GANs for learning solutions of stochastic PDEs”, arXiv:1910.13444v1.

In order to tackle a problem of the size of the Hanford problem, they partitioned the domain into a hierarchy of subdomains  and had a separate discriminator for each subdomain.   The top levels (1 and 2) capture long range characteristics while the lower levels capture properties that correspond to short range interactions.   They parallelized the computing utilizing over 2700 GPUs on the ORNL Summit machine so that it maintained 1.2 exaflops.

Neural Ordinary Differential Equations

In the previous section we saw how neural networks can solve differential equations.   In this section we look at the other side of this coin: how can differential equation solvers simplify the design, accuracy, and memory footprint of neural nets.   Good papers and blogs include the following.

  1. Chen, Rubanova, Bettencourt, Duvenaud “Neural Ordinary Differential Equations
  2. Colyer, Neural Ordinary Differential Equations, in the Morning Paper, Jan 9, 2029, Colyer’s amazing blog about computer science research.
  3. Duvenaud, comments in Hacker News about the Chen, et. al. paper.
  4. Chen, “PyTorch Implementation of Differentiable ODE Solvers”.
  5. Rackauckas, “Mixing Differential Equations and Machine Learning” .
  6. He, et. al. “Deep Residual Learning for Image Recognition”.
  7. Gibson, “Neural networks as Ordinary Differential Equations
  8. Holländer, “Paper Summary: Neural Ordinary Differential Equations
  9. Surtsukov, “Neural Ordinary Differential Equations”

Chen et. al. in [1] and others have observed that the deep residual networks that made training very deep networks possible [6] had a form that looked like Euler’s method for solving a differential equation.  Residual networks use blocks of network layers where the input is transformed by a residual before sending it to the next layer by the simple equation

f31

So that what we are training is the sequence or residual layers Ni  .  If you have a differential equation dy/dx = f(x) then Euler’s method is to compute y from a sequence of small steps of based on an approximation of dy/dx by

f32

By analogy, our residual network now looks like this

f33

With delta i is equal to 1.   If we abstract the sequence of networks into a single network that depends upon t as well as y, we can define a neural ODE to be of the form

f34

Where theta represents the network parameters.

The interesting idea here is that if residual networks are basically Euler’s method applied then why not use a much more modern and accurate differential equation solver?  If we have an initial value y0 at  time t0, we can integrate forward to time tn to obtain

f35

To illustrate how we can train a neural ordinary differential equation we look at a slightly modified version of the ode_demo.py example provided by the authors. we will take our sample data from the solution of a simple ODE that generates spirals in 2-D given by

f36

Where y is a 2-d row vector.   Given the starting point y0 = (2.0, 0.0) and evolving this forward 1000 steps the values are plotted below.

f37

Figure 11.  Spiral Training Data

Our neural network is extremely simple.

f38

You will notice that while the network takes a tuple (t, y) as input, the value t is unused.  (This is not normally the case.) The training algorithm is quite simple.  Recall that the function is a derivative, so to predict new values we must integrate forward in time.

f39

The authors use Hinton’s Root Mean Square Propagation optimizer. And the function get_batch() returns a batch of 20 points on the spiral as starting points as batch_y0.  Batch_t is always the 10 time points between 0 and 0.225 and batch_y is a batch of 20 10-step paths along the spiral starting from the corresponding batch_y0 point.   For example, shown below is a sample batch.

f40

Figure 12.  Sample training data batch

An incredibly important point is a property of the ODE solver we use “odeint”.

f41

It comes from the author’s  torchdiffeq package as an adjoint integrator.  The actual integrator we want to use is scipy.integrate.odeint which is a wrapper for an old, but sophisticated method written in the 1980s and from ODEPACK.  But we need to be able to compute the symbolic derivative of the loss operation so we can do the backpropagation.   But we can’t do the symbolic integration back through the old ODEPACK solver.  To get around this problem the Chen and the team uses the integrator to solve an adjoint problem which, very cleverly, allows us to compute all the derivatives without trying to differentiate the solver.  The details of this step are described in the original paper and in the other references above.  We won’t go into it here.

It is fun to see the result of the training as the it evolves.   We captured the output every 10 iterations.

f42

Figure 13.  Snapshots of training showing the trajectories of the solution and the correct values.

It is worth observing that, once trained our network, when given a point (x, y) in the plane returns a vector of the trajectory of the spiral path through that point.   In other words, it has learned a vector field which can be plotted as on the right in figure 13 and below.

f43

Figure 14.  plot of the vector field generated by the network

Neural ODEs can be applied well beyond applications that resemble simple differential equations.  The paper illustrates that they can be applied to the vision tasks that resnet was invented to solve.  The paper show how it can be applied to build a MNIST hand written digit classifier but with a much smaller memory footprint (one layer  versus many layers in resnet).  There is a very nice implementation in Surtsukov [9].

The solution (detailed in Surtsukov’s notebook in his gitub repo) also contains all the details of the solution to the adjoint problem for computing the gradients of the loss.   The model to classify the MNIST images is almost identical to the classical Resnet solution.   A slightly modified version that combines parts of Surtsukov’s solution with Chen’s solution  is in our Github repo.

The input to the model is passed through an initial down-sampling followed by the residual blocks that compute the sequence

f44

More specifically the Resnet version consists

  • A down sampling layer
  • 6 copies of a residual block layer
  • And a layer that does the final reduction to 10 scores.

The neural ODE  replaces the six residual layers Ni(yi ) with the continuous derivative  N(y,t).  The forward method of the NeuralODE invokes the adjoint solution to the integration by using the odeint() function described in the previous example.   You can experiment with this in the Jupyter notebook in the repo.  With one epoch of training the accuracy on the test set was 98.6%.  Two epoch put it over 99%.

One of the other applications of Neural ODEs is to time series data.  A challenge for many tradition neural network approaches to time series data analysis is that they require uniformly spaced samples for training.  Because the Neural ODE is continuous,  the sampling can be flexible and adaptive.

Final Thoughts

This report is an attempt to illustrate a striking duality that seems to exist between deep neural networks and differential equations.   On the one hand, neural networks are non-linear functions that, with the right choice of activation functions, can have smooth first and second derivatives and, consequently, they can be trained to solve complex differential equations.    Generative adversarial networks can even model the solution to stochastic differential equations that fully satisfy the governing laws of physics.  Seen from another perspective, many deep networks are just discrete approximations to continuous operators that can be solved with advanced differential equations packages.   These Neural ODEs have much smaller memory requirements and are more adaptive in their execution and may more accurately solve difficult problems.

It is going to be interesting to see where these approaches lead in the years ahead.

The github repository for this paper contains four Jupyter notebooks:  the solution to Burger’s equation, the simple non-linear and generative adversarial example, and the simple Neural ODE solution to the spiral described above and the MNIST solver.

Postscript. I realize now that I overlooked another significant contribution to this discussion.  “Deep Learning Based Integrators for Solving Newton’s Equations with Large Timesteps” arXiv:2004.06493v2 by Geoffrey Fox and colleagues show how RNN can be use to vastly improve the performance of the integration of Newton’s equation.

Accelerating Deep Learning Inference with Hardware and Software Parallelism

Abstract

A persistent problem when using deep neural networks in production is the speed of evaluating the network (known as inference) on a single input.   While neural network inference has ample opportunities for using parallelism to gain speedup, these techniques are not as easy to exploit as when training the network. In this short report we will look how several new system/chip designs from companies like Groq, Celebras, Graphcore and Samba Nova are approaching the inference performance problem and we will also explore the software challenge in compiling Neural Nets to run on this parallel computer hardware.

Why is Inference Harder to speedup than Training?

Training Deep learning systems require vast computational resources and data. Fortunately, the algorithms used for training are highly parallelizable and hardware that supports either data parallel and/or highly multithreaded execution can make a huge difference in the training time.   In a previous post, we described how simple GPUs and clusters of CPUs can be used to train networks. However, once a network has been trained it must be deployed so one can use it to make inferences. Making an inference involves taking a single input (an image, query or sound clip) and pushing it through the many layers of the network. The trained model may be hosted in the cloud to support image identification or search, or to do natural language translation or question answering. Doing this fast for on-line applications is essential as the load on the application increases. The speed of inference is also critically important in robotics applications such as self-driving vehicles or controlling complex critical hardware which may involve life-support.

While the inference process still involves operations that can be parallelized, the challenge in using this parallelism to gain performance is different than it is when training the network.  The reason that this is the case is easy to understand. The metric of performance for inference is latency (the time it takes to push an item through the network) while the metric of performance for training is throughput (the volume of training data per second that you can manage). Training a network involves pushing batches of training data through the pipeline of network layers. GPUs are effective when you can reuse data that has already been loaded into their local memories.   Evaluating a batch of input data allows the GPU to load the layer weights once and reuse them for each item in the batch.

It is a well-known fact of life in high performance computing that the latency involved moving data is a performance killer unless you can hide that latency by using your hardware to do other useful computation. Because inference has fewer opportunities to reuse data, the best way to reduce inference latency is to reduce the amount or cost of data movement.

The new architectures we will look at all use extraordinary amounts of parallelism, but they also depend very heavily on compilers that can translate the neural network designs into the low level code and optimized data movements that will achieve the performance goals. In fact most, if not all, of the systems here were co-design efforts involving the simultaneous planning for the hardware and software. Hence as we present hardware details, we will need also to describe the compiler and runtime.   The last section of this report will focus on the general challenges of compilers for this class of computing system.

Hardware Advances

In the following paragraphs we will outline the advanced deep learning processor designs that are now coming on the market. While they all address the issues of training and inference, there are several that has put the issue of inference performance as a prime design objective.   The descriptions below vary greatly in the level of detail. The reason for this is that some are still a work in progress or highly proprietary. For example, all we know about the Huawei Ascend 910 is that it “performs much better than we expected”.

Groq

Groq.com is a Silicon Valley company co-founded by Johnathan Ross who was on the team that designed the Google Tensor Processing Unit.   The Groq Tensor Streaming Processor (TSP) is very different from the other systems which rely on massive scale multi-core parallelism. Instead the TSP can be classified as a Very Long Instruction Word (VLIW) single core, single instruction stream system.   The design is very unusual but there is a good description in the Linley Group Microprocessor Report January 2020. We will only give a capsule summary here.

The TPU is a type of systolic processor in that it has horizontal data flows with instructions streaming from the main issue engine down through 20 data layers called superlanes.   Each Superlane is composed of 16 parallel lanes of 8 byte wide data paths.   The superlanes have blocks for Matrix accumulators, transpose and permute operations and vector ALUs as shown in Figure 1 below.   Note that memory is imbedded directly in the superlanes.   Notice also that each superlane has is duplicated around a central axis so data moves between units in both directions

fig1

Figure 1.   Groq Architecture.

Instruction issue is also systolic.   The first instruction is executed on superlane 0 in one cycle. In the next cycle that instruction is executed on superlane 1 while the 2nd instruction is executed on superlane 0. In the next cycle the first instruction is executed on superlane 2, the 2nd instruction is now on superlane 2 and the 3rd instruction is on superlane 0. So, in 20 cycles an instruction has been executed on all superlanes and each subsequent instruction is complete on all superlanes, etc.   (Note: this description may not be totally accurate. We do not have the detailed Groq technical specs.)

fig2

Figure 2. Groq TSP

The Groq TSP is designed to deliver a 1000 trillion operations per second and live up to a major design goal: on the Resnet-50 deep learning model it delivers 20,000 inferences per second with a latency of 0.04ms on a batch size of 1.

A big challenge in creating a system like the Groq TSP is building a compiler that can generate an efficient instruction stream to keep all that hardware busy.   Because there are no caches, locality is not an issue and advanced architectural features like prefetch and branch prediction are not needed, so the computation is completely deterministic, and performance is completely predictable. We will return to the compiler issues below.

Habana Goya

Habana Labs was one of the first to introduce a fast inference processor.   In 2018 they announced the Goya processor.   By the end of 2019 they were acquired by Intel. The Goya architecture has an array of Tensor Processor Core (TPC) compute engines which are VLIW single-instruction-multiple-data processors. It also has a general matrix multiply engine. The TPC engines have local memory but there is a fast, shared static RAM.

One very interesting thing about the Habana design team is their work with Facebook on the Glow compiler back end for Pytorch.   More on that later.

Alibaba Hanguang 800

Another newcomer to the race for faster inference is the Alibaba Hanguang 800.   Alibaba is not planning on selling this new chip and it is intended solely for internal use in its cloud servers. There is very little that is published about its internal architecture.   In the table below we see some interesting performance numbers including one that indicates that the Alibaba system has better inference performance than the Groq TSP. However, we do not know if this IPS number is for a batch size of 1.

fig3

Figure 3. From https://www.techarp.com/computer/alibaba-hanguang-800-details/

A Digression about Compilers, ResNet-18 and ONNX.

Before we continue discussing interesting new architectures, it is helpful to stop and discuss some general issues related to the compilers and benchmarks.

One of the big problems you encounter when writing a compiler for a new architecture is that there are several very good deep learning frameworks that are used to build deep neural networks. These include MxNet, Caffe, CNTK, Tensorflow, Torch, Theano, and Keras. One could write a compiler for each, but, given that they all build very similar network models, it makes sense to have a “standard” high-level, graph intermediate form that captures the properties of a large fraction of all neural nets.  Then, if third parties build translators from the high-level frameworks to this intermediate form, the chip architect’s job is half done: all they need to do is write a code generator mapping that intermediate to their architecture.

The Open Neural Network Exchange (ONNX) may become that standard intermediate form. Originally developed by Microsoft and Facebook, it has been taken over as a community project involving 20 companies. While we do not know how Groq, or some of the other hardware companies described here, are building their proprietary compilers, looking at ONNX as it relates to a real example can give a clue of how compilers like these do work.

In the last three hardware descriptions, performance number were often cited in terms of ResNet-50. Resnet is one of a family of very deep convolutional neural networks. Originally presented by He, Zhang, Ren and Sun in their 2015 paper, they describe a clever way to improve the ability to train very deep networks.   You can think of each level of a deep neural network as learning more subtle and abstract features of the training images than were detected by the previous layers.   A residual network is one where you “subtract” the features discovered by previous layers so the following layers can work on learning the properties of the residual.  Doing this subtraction is a way to focus the learning on what remains and helps solve a problem known as the vanishing gradient that makes it hard to train very deep networks. Mathematically If your training goal is to learn a function H(X), then the residual at some layer is F(X) = H(X)-X. Hence we want the following layers to learn F(X)+X to recover H(X). Creating F(X)+X in the network is easy and it is shown in Figure 4.

fig4

Figure 4. From Zhang, Ren and Sun in their 2015 paper.

We can construct such residual network with Torch or TensorFlow and then we can look at the ONNX intermediate. In Torch, the code is summarized below. (The complete code is in a Jupyter notebook that accompanies this post.) There are two networks. One is the residual block as illustrated in Figure 4 above and the other is the full model that incorporates a sequence of residual blocks.

fig4.5

In the image above, we created an instance of the model as “resnet” and then set it to the eval() state. Using the Torch built-in ONNX export operator we can save a copy of the model in the file resnet.onnx.   Doing so gives an output like Figure 5 below.   On the right we have fragments of the ONNX intermediate code and on the left, a graph that is generated from the ONNX code with a tool called netron. What is shown here is only a small part of the ONNX graph. The top is just a list of all the model variables. Following that we have actual code for the graph.

The ONNX exporter will build the graph from the internal Torch model.   There are two ways in which it does this.   One is to directly “unroll” the graph by interpreting the execution of the forward(input) eval operator. In some cases, if the definition of the model contains conditionals, it will insert conditional code in the graph, but these are rare cases.

In this case the code consists of an initial convolutional layer followed by a batch normalization which is based on the mean and variance of previously seen batches.   This is followed by the first instance of the Residual block model.

fig5

Figure 5. Fragment of the ONNX output for the Resnet18 model.

As you can see, the ONNX graph consists of nodes that are parameterized operators with inputs that are the model tensors and they produce one output that is a well-defined tensor. Generating code for a specific architecture can be a simple as building well-tuned native versions of the ONNX operators and then managing the required data movement to ensure the input tensors are in the right place at the right time for their associated operation nodes. On the other hand, there are a number of important optimization that can be made as we “lower” the ONNX graph to a form that is executed. We will return to this point after we complete the descriptions of the new architectures.

Cerebras Systems

Cerebras Systems has taken the parallelism to an extreme. The power of their approach is most evident during network training rather than inference, but it is interesting enough to describe it in more detail. Their CS-1 system is based on a wafer-scale chip that consists of a 2d grid of 400,000 compute cores interconnected by a 2d-mesh network capable of 100 Petabits/sec bisection bandwidth that delivers single word active messages between individual cores.

fig6

Figure 6. Cerebras WSE

The Cerebras software contains the Cerebras Graph Compiler that maps deep learning models to the hardware. Their approach is an extreme form of model parallelism where each layer of the network is mapped to as many compute cores as is required to contain it. Their philosophy is nicely described in the post “Neural Network Parallelism at Wafer Scale”​ by Natalia Vassilieva and in their product overview.

fig7

Figure 7. Cerebras Software Stack

The training uses pipelined back-propagation.   The graph compiler takes the source description of the network and extracts a static graph representation of the problem and converts it into the Cerebras Linear Algebra Intermediate Representation (CLAIR). This is then converted into a “Kernel graph” and mapped to the hardware as shown in Figure 7. In their approach the entire network is mapped onto the computing fabric, so pipelining batches through has no points of congestion.

Graphcore

Graphcore is a U.K. startup started shipping their accelerator, called the Intelligence Processing Unit (IPU), in 2018 and it is now available on Azure for evaluation.   Like Cerebras, the architecture is based on massively parallel processing. Each IPU contains 1,216 processing elements called tiles; a tile consists of one computing core plus 256 KiB of local memory.   There is no shared memory, but the local memory is SRAM and faster than the DRAM in most CPU servers.  To hide latencies, the IPU cores are multithreaded.   Each core has 6 execution contexts that served in round-robin style. In terms of computational performance, each IPU is 31.1 TFlops/s in single precision.

fig8

Figure 8.   Graphcore IPU

There is an inter-processor communication switch called the exchange that provides processing element data communication and multiple IPUs are connected via a fast, off-chip interface network.   Citadel has published an excellent performance analysis by Jai et.al. “Dissecting the Graphcore IPU Architecture via Microbenchmarking”.   They have measured “On a per-tile basis, each of the 1,216 tiles can simultaneously use 6.3 GB/s of bandwidth to transfer data to an arbitrary destination on chip. The latency of an on-chip tile-to-tile exchange is 165 nanoseconds or lower and does not degrade under load above that value.” They also measured the latencies between cores that reside on different IPU boards where several boards had to be crossed to deliver the message. This increased the latency to about 700 nanoseconds.  Their report provides a very complete analysis of the data traffic performance under a variety of conditions.

Ilyes Kacher, et.al. from the European Search engine company Quant have also produced an analysis: “Graphcore C2 Card performance for image-based deep learning application: A Report”. Their interest in Graphcore was to improve performance of their image search product. In their study they considered the image analysis network ResneXt101. Inference experiments for batch sizes of 1 and 2 are 1.36 ms. Their benchmarks claim this is 40time lower latency than an Nvidia V100 GPU.   They also compare performance on BERT and they measure 30% lower latency with 3 time higher throughput.

The programming model is based on Bulk Synchronous Parallelism in which computation is dived into phases, with each phase consisting of a computation step followed by a communication step and then a barrier synchronization.

fig9

Figure 9. Graphcore BSP execution (from Graphcore IPU Programmers Guide.)

https://www.graphcore.ai/products/poplar is a discussion of their stack.   More significantly they have open sourced the entire software stack.  They have a runtime environment called the Poplar Advanced Run Time (PopART) that can be used to load a ONNX model from python and run it on their hardware. For Tensoflow they have a separate compiler and runtime.

Graphcore hardware is now available on Azure for evaluation.

SambaNova

SambaNova is a bay area startup founded by two Stanford professors and a former executive from Sun and Oracle. They have not yet announced a product, but they have an interesting background that may indicate a very novel approach to the design of an AI accelerator.

Reconfigurable computing is an idea that has been around for since the 1960s. Field Programmable Gate Array are in common use today to configure a processor to execute a new algorithm, but these usually take 10s to 100s of milliseconds to “reprogram”. Suppose you could configure the logic elements on the chip to perform a needed transformation on a block of tensors just as that block emerges from a previous operation? The SambaNova team has looked at specialized programming languages that allow them to generate streams of high-level templated instructions such as map, reduce, shuffle and transpose that are natural elements of deep network kernels.   This is clearly a talented, well-funded team and it will be interesting to see what is eventually released.

Tenstorrent

A Toronto startup called Tenstorrent has built a device called GraySkull.   The chip has 120 small processing nodes, called tensix, and two toroidal mesh networks that can be extended off-chip to build larger clusters. There is no shared memory.   In various articles about Tenstorrent they emphasize their approach to dealing with sparsity in large neural net models is key to high performance on big models.   Like several of the other startups, their compiler translates ONNX graphs into tensix primitive operators which are mapped to the nodes. They claim 22,431 IPS on resnet 50 and 23,345 sentences/sec on BERT.

fig10

Figure 10. From Tenstorrent.

NLVDA

Finally, we include NLVDA from NVIDIA. Called a Deep Learning Accelerator, this is an open source modular architecture for building inference accelerators. There is a hardware instance called Xavier that NVIDIA has produced to support inference for autonomous transportation applications.

Compiling Neural Nets for Parallel execution.

In the remainder of this report we will look at the techniques that are used in modern compilers to optimize performance on neural network training and inference. Many of the basic techniques have been used in compilers for 50 years. These techniques evolved as CPU arithmetic and logical units became so fast that many operations were dominated by the time it took to move data from main memory through layers of faster and faster caches. Data locality was critical: if an item of data was going to be reused you needed to keep it in fast cache as long as possible.

Almost all of the operations in a neural network involve matrix and vector arithmetic.   If you consider the most basic type of network layer, an n by n full connection, it is just an n by n matrix and a vector of offsets. Applying such a network to a single vector of n inputs is just a matrix-vector multiply and a vector addition. The problem with matrix-vector multiply is that the matrix elements must be fetched from memory and are used only once. On the other hand, if the computation is properly blocked so that small chunks of the array are loaded into the GPU or CPU caches, then if you have a batch of n vectors, each element of the array can be fetched once and used n times.   This method of improving matrix-matrix computation is the basis of the standard library known as the Level-3 Blas developed by Jack Dongarra and others in the 1980s.

A more interesting example of how locality can be used to speed up performance in a neural network is 2-D convolutions that are used in deep learning image networks.   Figure 11 below shows a 2-D convolution operating on a 6×7 image data with 3 color channels and outputs a new 6×7 array with 6 channels.   Each output channel is produced by using, in this case, 3 filters of size 3×3. Each filter is applied to a channel of the input (which has been expanded with an extra border of ghost pixels). The filter moves across the input channel computing the inner product of the filter with the image at that point. The corresponding output point is the sum of the three filter inner products applied to each of the input channels.

fig11

Figure 11. A 2-D convolution applied to a 3 layer, 6 by 7 image producing 6 output images.

The 6 by 3 by 9 tensor of filters, W, is the learned object and the full computation is show in the formula above (we have suppressed the bias terms to simplify the presentation). If we let Cin be the number of input channels and Cout be the number of output channels for a width by height image the full computation takes the form below. (The input is the padded (width+1) by (height+1) array, so that the pixel at position (0,0) is in location (1,1) in the array Input.)

fig11.1

This form of the computation has extremely poor locality and it will run slowly. However, we illustrate below a sequence of program transformations that allow us to simplify this loop nest and “lower” the execution to primitives that will allow this to run up to 400 times faster.

fig11.2

A close inspection of this nest of six loops will convince you that we can execute them in any order. In fact, the addition recurrence is only carried by the inner three loops.   Consequently, we can pull out the three inner loops as a separate function that does not involve repeated writes to memory around the Output tensor. The result is shown below.

fig11.3

The next thing to notice is that if you move the t loop in the kernel function to the innermost position the operation is an inner product. The computation now takes the form

fig11.4

One final reduction can be made when we notice that the two loops in kernel2 are just a pointwise matrix product of the 3×3 filter W with the Input shifted to position (k,j). And the summation can be done with the torch.sum() function, so our function now takes the form below.

We ran these four versions of the function on two machines: an Intel core I7 and an Nvidia Jetson nano. The results are in Tables 1 and 2 below. As you can see, the performance improves substantially for each transformation. In addition, the speed up of the matrix product version over the 6 nested loop version varies from 68 to over 400 times with the greatest speedup occurring when the values of Cin are largest.

6 nested loops Factored Kernel Kernel with dotprod Matrix product Cin Cout W,H SpeedUp
2.24 seconds 1.26 1.0 0.022 16 4 10,10 68
4.47 2.75 0.19 0.047 16 8 10,10 95
8.24 4.98 0.39 0.077 16 16 10,10 107
8.51 4.97 0.20 0.038 32 8 10,10 223
8.66 5.06 0.10 0.020 64 4 10,10 433

Table 1. Execution time on Intel Core i7 for the four versions of the loop with various values of Cin and Cout. The speedup is measured as the ratio of the 6 nested loop time to the time for the matrix product.

6 nested loops Factored Kernel Kernel with dotprod Matrix product Cin Cout W,H SpeedUp
47.9 seconds 28.1 7.02 0.7 16 4 10,10 68
87.9 52 9.7 0.73 16 8 10,10 120
168.9 107 18.9 1.17 16 16 10,10 144
171 107.9 9.8 0.59 32 8 10,10 289
174 104.0 4.38 0.43 64 4 10,10 404

Table 2. Execution time on Nvidia Jetson Nano for the four versions of the loop with various values of Cin and Cout.

The final point to notice about this last version is the final (i,j,k) loops may all be executed in parallel. In other words, if you had a processor for each pixel on each output plane the entire operation can be run in parallel with an addition speedup factor of Cout*width*height. Of course, all of these versions are far slower than the highly optimized conv2d() library function.

The compilers we talk about below do not operate at the level of program transformation on Python loop nests.   They start at a higher level, transforming ONNX-like flow graphs and eventually lowering the granularity to primitive operators and scheduling memory management and communication traffic and eventual code generation.

Deep Neural Network Compilers.

Every one of the hardware projects we described above has a companion compiler capable of mapping high level DNN frameworks like PyTorch or Tensorflow to run on their new machine.   Not all of these have detailed descriptions, but some do, and some are also open source.   Here is a list of a few notable efforts.

  • Intel’s NGraph compiler “Adaptable Deep Learning Solutions with nGraph Compiler and ONNX” with extensive open source and documentation.
  • ONNC: A Compilation Framework Connecting ONNX to Proprietary Deep Learning Accelerators. Also fully open source. It is designed to support NVDLA hardware.
  • Nvidia TensorRT is an SDK for high performance inference that is built on CUDA. A TensorRT backend for ONNX is also available and open source.
  • Apache TVM is an open source compiler stack for deep learning that originated with the Computer Science department of the University of Washington.
  • Google MLIR, a Multi-Level IR Compiler Framework that provides tools for many types of programming language compiler challenges.
  • ONNX Runtime for Transformer Inference from Microsoft has been open sourced.
  • The Facebook Glow compiler “Glow: Graph lowering compiler techniques for neural networks” paper.
  • The GraphCore PopART runtime was discussed in the GraphCore section above.
  1. Sivalingam and N. Mujkanovic of CRAY EMEA has a nice summary of these compilers in this post.

Glow translates the input from ONNX or Caffe2 into a high-level intermediate graph that is very similar to ONNX. Because they are interested in training as well as inference they next differentiate the graph to facilitate gradient decent training. This new graph contains the original and the derivative. TVM also generates a differentiable internal representation.   All of the others have similar high-level internal representation. Where they differ is in the layers of transformation and lowering steps.

High Level Graph Transformations

The next step is to do optimization transformations on the graph.   The nGraph compiler has a Reshape Elimination pass exploits the fact that matmul(A.t, B.t).t = matmul(B,A) and other algebraic identities to make tensor restructuring simplifications.   Common subexpression elimination and constant folding are standard compiler techniques that can be applied as transformations the high-level graph. For example when compiling a model to be used for inference most of the parameters in the high level nodes, such as various tensor dimensions are known integers and expressions involve address arithmetic can be simplified.

An important part of the “lowering” process is where some of the high-level nodes are broken down into more primitive linear algebra operations.   This step depends on the final target architecture: for a GPU certain transformation are appropriate and for a CPU, different choices are made. For example, with ResNet Glow has different strategies for different instances of the convolution operator depending on the size of the filter weights and these require different memory layouts. TVM, Glow and ONNC use a type of layer fusion to combine consecutive operators such as Convolution and Batchnormalization or ReLu into a single special operator.

Low Level Internal Representations

Now the graph is transformed into the Low-level internal representation. This layer is more specific about the representation of memory layout and important optimization can be made there. For example, if there are sequences of operation that must sweep across a large tensor, one can break the tensor into blocks so each block can be loaded once, and the operation sequence can be applied to the block. This is a classic locality optimization. Managing memory can involve other transformations. For example, ONNC uses layer splitting to handle memory constrains as shown in Figure 12 below.

fig12

Figure 12. From Skymizer ONNC tutorial.

Quantization is an issue that several compilers address. Glow also does profile-guided quantization so that floating point networks can be converted into efficient integer-based networks.   Finally, depending upon the backend architecture, code is generated from the final low-level graph.

Runtime Systems

Because the compilation system involves mapping the neural network onto hardware configurations which may have more than one processor that must communicate with each other, there must be a runtime system to handle the coordination of the execution.

Glow has a runtime system that is capable of partition networks into an acyclic graph of subgraphs and scheduled across multiple accelerators. As we have discussed previously the GraphCore PopArt runtime manages BSP-style execution across thousands of processor threads.

The Microsoft ONNX runtime focuses on CPU and CPU + GPU execution on Windows, Linux and Mac OS. For the GPU it supports CUDA, TensorRT and DirctML.   It also supports IOT/Edge applications using Intel OpenVMINO, ARM and Android Neural Networks API.

Final Thoughts

The explosion of computer architecture innovation exemplified by the new systems described here is very impressive. It is reminiscent of the boom in HPC innovation in the 1990s which led to the current generation of parallel supercomputer designs. The density and scale of some of the chips are very impressive.   In the 1980s we considered the impact of wafer-scale integration on parallel computing, so 40 years later, it is interesting to see it come to pass in systems like Cerebras.

There are many details of the compiler infrastructure that we covered here very superficially.   We will return to this topic in the future when we have more access to details and hardware.

 

Anomaly Detection: From the Edge to the AWS and Azure Cloud

There are now billions of sensors that monitor the world around us. Bio sensors are used to monitor every aspect of life. Environmental sensors measure temperature, humidity, pressure, chemical concentrations, vibrations, acceleration, light wavelengths and more. These sensors produce a constant stream of data that must be analyzed and when unusual behavior is detected these anomalies need to be reported. This alarming behavior may consist of spikes in sensor readings or device failures or other activity that should be flagged and logged. Often these sensors communicate with a nearby small edge computing device which can upload summary data to the cloud as illustrated in Figure 1. Typically, the edge computer is responsible for some initial data analysis or, if it has enough computing capacity, it may be responsible for detecting the anomalies in the data stream.

01

Figure 1. Edge sensors connected in clusters to an edge computing device which does initial data analysis prior sending aggregated information to the cloud for further analysis or action.

In this short note we look at two cloud services that provide anomaly detection. One is the Azure Cognitive Service anomaly detector and the other is from the Amazon Sagemaker AI services. In both cases these services can be (mostly) installed as Docker Containers which can be deployed on a modestly endowed edge computer.    We will illustrate them each with three example data streams.   One data stream is from an s02 sensor that was part of an early version of the Chicago-Argonne Array-of-thing edge device. The second is from the Global Summary of the Day (GSOD) weather from the National Oceanographic and Atmospheric Administration (NOAA) for 9,000 weather stations between 1929 and 2016. In particular, we will look at a sensor that briefly failed and we will see how well the anomaly detectors spot the problem. The second example is an artificial signal consisting of a sine wave of gradually lengthening period with several anomalous data spikes.

The two services each use a different algorithm to detect anomalies.   The Sagemaker algorithm uses a machine learning method called Random Cut Forest and the Azure detector uses a method which combines spectral analysis with a convolutional neural network. We will describe both algorithms in more detail at the end of this section, but first we go to the set-up and experiments.

Azure Cognitive Services.

To use the cognitive service, you need to go to the Azure portal and then to cognitive services. There you can use the search bar to look for the “Anomaly Detector” (at the time of this writing it is still in “preview”). You will need to create an instance and that will get you an API key and an endpoint for billing.   (You can use it for free until you use up the free quota.   After that you can switch to payments.   I did this and it did not cost me much: so far $0.75, for the work on this paper. )

Download and Launch the Container

You should go to this page to see what is currently required to launch the container. Assuming you have docker installed on a machine (your laptop or in the cloud), you must first pull the container.

docker pull containerpreview.azurecr.io/microsoft/cognitive-services-anomaly-detector:latest

Next you will use your ApiKey and billing endpoint to launch the container.   This command works:

docker run --rm -it -p 5000:5000 containerpreview.azurecr.io/microsoft/cognitive-services-anomaly-detector:latest Eula=accept Billing={ENDPOINT_URI} ApiKey={API_KEY}

We can now use the anomaly API to directly interact with the algorithm running on the container. We have supplied a Jupyter notebook with the details of the experiment that follows.

02

Rather than use the endpoint for the cloud resident service we need an endpoint on the container.

endpoint = 'http://localhost:5000/anomalydetector/v1.0/timeseries/entire/detect'

(which assumes your container is running on your local machine.   If it is remote you need to make sure port 5000 is open and substitute the host name for local host.) Notice the word “entire” in this endpoint.   The detector operates in two modes: entire and last.   Entire mode considers the entire history of a stream and spots the past anomalies.   Last mode is used to do real-time detection.

To illustrate its behavior on an example we will use a data stream captured from an SO2 sensor on an early version of the Argonne-Chicago “Array-of-Things” edge device. Running the detect method, decoding the output and plotting the results (see code in the Notebook) gives us the graph below. While hard to see, there are three things being plotted.   One is the value of the data, the other is a line of expected values and a region of boundary of uncertainty and finally the anomaly (red dot).   In this case the region of uncertainty is very narrow and not visible. We will see it more clearly in the next example.

03

Now the real-time monitoring case.

To run the algorithm in a continuous mode you need to use the “last” endpoint.

nendpoint = 'http://localhost:5000/anomalydetector/v1.0/timeseries/last/detect'

This allows the algorithm to look at a window of data and make a prediction about the last item in the window.   We can now send a new “sliding” window consisting of the last “window-size” data points to the service every time step.

04

The detector returns a dictionary and, if the last item in the window as anomalous, then the flag result[‘isAnomaly’] is True.   By keeping track of the anomalies (see code in the notebook), we can plot the result.

05

In this case we have spotted the true anomaly (red dot) and another that looks like a false alarm.   By lowering the sensitivity value, we may be able to eliminate some of the false alarms.

The Skagit Valley Temperature

Turning now to the NOAA GSOD data for the temperature sensor in the Skagit valley, WA station, we have one measurement per day for a year.   There are a few days in October where the sensor goes bad and signals a temperature of over 100 degrees Fahrenheit. (We looked at this case using Google’s data analysis tools in our book.   By looking at other nearby sensors we saw that this was not the temperature in this location.)  The figure below shows the batch anomaly detection for this data. In this case the region of expended value uncertainty is very clearly defined and the bad data is easily spotted.

06

Turning to the real-time detection mode, we see below that the red dots show the true anomaly, but it also flagged to other locations. One in the month of May, looks a false alarm and another in late November that is unclear.

07

Synthetic Oscillatory Data

We now look at the case where the data consists of a sine wave that has a slightly increasing period with a few added spikes.   The code to generate the data is

08

In this case the batch detector accurately tracks the sine wave and catches the big spike, but misses the first spike.  However, the real-time detector does a very good job.

09

10

The AWS Sagemaker Anomaly Detector.

The AWS Sagemaker service can be completely managed inside a container, but one difference with the Azure service is that the Sagemaker version does all the analysis in the cloud.   We have provided in the github repository the Docker file you need to create a container that will run in the edge device that will make the calls to the cloud.  In this case the job of the container is to gather the data, interact with the cloud service to set up the algorithm training, deploy a server that will host the trained model return inference results.   This container-based component is a script that makes calls to aws sagemaker. To better illustrate the details, we have a Jupyter notebook.   You can use the following script to build the container, run it and launch jupyter from inside.

docker build -t="yourname/sagemake" .
….   a great deal of build output follows
docker run -it -p 8888:8888 yourname/sagemake
…. Once container starts we are now running as ec2-user in the container.
ec2-user@29e378df61a9:~$ jupyter notebook
….. the output will tell you how to point your browser to see the notebook.

Note: to run the container, it must have your AWS credentials and your Sagemaker execution role identifier. Consequently, do not push your container to the docker repository and delete it when you are finished.

To get started you must log in to the AWS Sagemaker portal and create a user and an execution role. You will also need to create a bucket where Sagemaker will store your model data. The notebook shows the details for how to use this information to create a “session” and “role” object. We will use these to train the algorithm.

One of the ways the Sagemaker documents suggest for presenting data to the algorithm in cases where the size of the collection may be small is to create a “shingled” version. In other words, we take the data stream and for each time instance, create a set of the next “shingle-size” data values as follows.   Using the data from our Array-of-things S02 sensor.

11

To train the model we use the session and role objects as follows:

12

Next create a cloud instance that can be used for doing inference from this model.

13

We can now do a “real-time” analysis by doing a sequence of inferences on sliding windows of shingles. We will use 25 shingles each of width 48 so the window covers 73 time units (we used windows of size 100 in Azure examples). The code now looks like

14

Plotting the anomalous points with black dots we get the picture below.

15

As you can see,   it detected the anomaly at 400, but it flagged four other points.   Notice that if flags any point that has an anomaly score greater than 3 standard deviations from the others in the sliding window. Raising the threshold above 3 caused it to lose the actual anomaly but retained the three false alarms.

Applying the same procedure to the Skagit Valley temperature sensor and the artificial sinusoid al signal we get similar results.

16

17

Comparing the two anomaly detectors, we found it was easier to get accurate results from the Azure cognitive service than the AWS Sagemaker.   One the other hand, the Sagemaker method has a number of hyper-parameters that, if tuned with greater care than we have given them, may yield results superior to the Azure experience.

Another important difference between the two detectors is that the Azure detector can be completely deployed in the container while the AWS detector relies on cloud hosted analysis.   (Of course the Azure system still keeps a record of your use for billing purposes, but it was not expensive: $0.14 for the experiments above. For the work using Sagemaker it was a total of $6.41) We expect that the Sagemaker team will make a container available that will run entirely in the edge device. They may have already done so, but I missed it. If a reader can help me find it, I will happily amend this article. One possibility is their excellent greengrass framework.

The Algorithms

The Random Cut Forest

An excellent github site with good details about using the Sagemaker service is here.   This is also what we modified to create our Jupyter notebook. A basic description of the algorithm is in “Machine Learning for Business” by Doug Hudgeon and Richard Nichol and available on-line.    However, for a full technical description one should turn to the paper by Guha, Mishra, Roy and Schrijvers, “Robust Random Cut Forest Based Anomaly Detection On Streams” published in 2016.

At the risk of over simplifying, Figure 2 below illustrates how a random cut forest can be built for one-dimensional data values.   We start by picking a point somewhere near the middle of the data set and then divide everything above that point into one group and everything below that in another group. Then for each group repeat the process until the points are each in groups of size 1. Now, for each point count the number of tree divisions it took to divide the tree to get isolate that point. Low counts indicate possible outliers.   Now do the tree construction many times. Compute the average of scores.  Again low indicates possible anomaly.

18

Figure 2.   Forest of Random Trees for a one-dimensional data collection.

Unfortunately, if you apply this technique to a large window of a data stream that has a wide range of values it will only capture the extreme ends of the values. If there is an anomaly in the mid range of values it may not be seen as anomalous when taken as a whole.   For example in the image below we considered the example of the SO2 sensor and apply the algorithm across the whole data set we see it completely misses the anomaly at 400 and flags the overall highs and lows.

19

But, as we showed above, when we applied a sequence of small windows of data to the same forest of trees it did capture the anomalous spike at 400.

To better illustrate this point, consider a variation on the sine wave fake data example. The random forest tree algorithm was able to detect the anomaly when the spikes were introduced. But a variation on this example can show its failure. Instead of introducing the spikes, we simply flatten part of the sine wave for a segment of the range.   The result is that that no anomaly is detected, and for the modified range, the anomaly score actually drops.   The situation is not helped by the sliding window test.

20

The Azure Spectral Residue CNN Anomaly Detector

Hansheng Ren, et al. published “Time-Series Anomaly Detection Service at Microsoft” at the KDD 2019 conference.  The paper describes the core algorithm in the Azure cognitive service anomaly detector. There are two part to the algorithm.

Part 1. Spectral Residual and Salience

Salience in image analysis is the property that allows some parts of an image to stand out and be easily identified. It is a technique that is often used in image segmentation.   Spectral residue is computed as follows. Applying an FFT to the stream sequence yields a measure of the frequency spectrum of the data.   The spectral residue is the difference between the log of the spectrum and an averaged version of the same.   Using the inverse FFT transforms the spectral residue back to physical space.   That result is the saliency map of the signal and locates the potentially anomalous parts.

Part 2. Applying a CNN to the saliency map

The novel feature of the algorithm is that it uses a convolutional neural network to do the final anomaly detection.   The network is trained on saliency maps that are generated by injecting artificial anomalies into a variety of real signal types. The paper describes this process in more detail.

To illustrate the power of this method, consider the flattened sine wave that defeated the Forest of Random Trees example above.   As shown below, the SR-CNN method captures this obvious anomaly perfectly.   As you can see, it projected the sinusoidal oscillations into its “expected value” window and the flat region certainly did not match this “expected” feature.

21

While this example is amusing, we note that in the cases (Skagit weather, So2) where we looked at the real-time sliding window analysis, both methods found the anomaly, even though the random tree method has more false alarms.

All of the data and Jupyter notebooks for these examples are in GitHub.

A “Chatbot” for Scientific Research: Part 2 – AI, Knowledge Graphs and BERT.

Abstract.

In 2018 I published a blog about building a cloud-resident “Research Assistant” (RA) chatbot that would be the companion of each scientist. The RA would be responsible for managing scientific data, notes and publication drafts. It could create intelligent summaries and search for important related scientific articles. That post demonstrated a simple prototype that provided spoken English input and simple dialog responses to search for available, relevant research. But it did not address the important issues of data management and textual analysis required to make the RA real. In a short, invited “vision talk” I gave at the e-Science 2019 conference I tried to address the technology that, in 2030, we would need to solve these problems.   This article does not describe an implementation. Rather it is a survey of the missing pieces I alluded to in the talk in terms of the current, related literature.

Introduction

2017 was the year of the smart on-line bot and smart speaker. These are cloud based services that used natural language interfaces for both input and output to query knowledge graphs and search the web. The smart speakers, equipped with microphones listen for trigger phrases like “Hello Siri” or “hello Google” or “Alexa” and recorded a query in English, extracted the intent and replied within a second. They could deliver weather reports, do web searches, keep your shopping list and keep track of your online shopping. The impact of this bot technology will hit scientific research when the AI software improves to the point that every scientist, graduate student and corporate executive has a personal cloud-based research assistant. Raj Reddy calls these Cognition Amplifiers and Guardian Angels. We call it a research assistant.

Resembling a smart speaker or desktop/phone app, the research assistant is responsible for the following tasks:

  1. Cataloging research data, papers and articles associated with its owner’s projects.
  2. The assistant will monitor the research literature looking for papers exploring the same concepts seen in the owner’s work.
  3. Automatically sifting through open source archives like GitHub that may be of potential use in current projects.
  4. Understanding the mathematical analysis in the notes generated by the scientist and using that understanding to check proofs and automatically propose simulators and experiment designs capable of testing hypotheses implied by the research.

Understanding the implications of these 4 properties will be the central theme of this post.

In 2017 we published a short article about how we could build a chatbot for research. In that paper we presented a short overview of chatbot software circa 2017 and demonstrated a very simple toy example meta-search engine that used spoken commands about research interests and the bot would respond with matching documents from Bing, Wikipedia and ArXiv. To illustrate this, consider the sentence “Research papers by Michael Eichmair about the gannon-lee singularity are of interest.” This required out Bot, called the Research Assistant, to understand that the main topic of the sentence was the gannon-lee singularity (an obscure reference to a paper from the 1970s that I happen to know about) and the fact that we want related papers by Michael Eichmair. The result obtained by our Bot shown in Figure 1.

ra-figure

Figure 1.   The results (shortened) from our original 2017 Science Bot to the Eichmair question.

In 2019 the same results can now be obtained by directly inserting this sentence into Google or Bing. We suspect one reason for this is the use of vastly improved language models based on Transformers (that we will briefly describe below).   Our bot is not only obsolete, we will argue in this article that it completely misses the boat on what is needed to make something truly useful.   This report will not present any new research results.   Instead it will try to outline the types of tasks required to make the research assistant capable of demonstrating the capabilities listed above.   We will try to also give a survey of the best published work leading in these directions. (This report is an expansion of a paper that was originally written for an invited “vision” talk entitled “eScience 2050: a look Back” for the eScience 2019 conference held in San Diego, Sept. 2019.)

Knowledge Graphs

If we look at the first two items in the RA capabilities list above, we see that they go well beyond simple meta search. These tasks imply that the research assistant will need to keep an organized research archive of data, notes and papers and have the ability to extract knowledge from the literature. We will assume that the individual items the RA manages will be cloud-resident objects that are described by a searchable, heterogeneous database of metadata.   One such database structure that can be used for this purpose is a Knowledge Graph (KG).   KGs are graphs where the nodes are entities and the links between nodes are relations. Often these node-edge-node triples are represented using Resource Description Framework (RDF) which consist of a subject, a relationship and an object. Each element of the triple has a unique identifier. The triple also has an identifier so that it can also be subjects or objects.

Having a KG that is based on scientific ontological terms and facts that can be augmented with the content of the individual scientist would be the proper foundation for our RA. To help explain this we need to take a diversion into the existing triple store KGs to see if there is one we can build upon.

There are several dozen implementations of RDF triple stores and many are open source. In addition, there are a number of commercial products available including

  • Ontotext which produces GraphDB a commercial RDF knowledge graph used by commercial custormers in publishing ( BBC and Elsevier), pharmaceuticals (AstraZeneca) and libraries (Mellon funded projects for the British Museum and the US National Galery of Art)
  • Grakn Labs in the UK had a knowledge graph Grakn that has special versions such as BioGrakn for life science apps.
  • Cambridge Semantics has a product called AnzoGrapDB which has numerous customers in the pharmaceutical domain.
  • And, of course, Oracle has a version of its database called “Spatial and Graph” that supports very large triple stores.

If you search for Knowledge Graph on the web or in Wikipedia you will lean that the KG is the one introduced by Google in 2012 and it is simply known as “Knowledge Graph”. In fact, it is very large (over 70 billion nodes) and is consulted in a large fraction of searches.   Having the KG available means that a search can quickly surface many related items by looking at nearby nodes linked to the target of the search.   This is illustrated in Figure 2 for the result of a search for “differential equation” which is displayed an information panel to the right of the search results.

googlekd1

Figure 2. Google information panel that appears on the right side of the page. In this case the search was for “differential equation”. (This image is shortened as indicated by …).

Google’s Knowledge Graph is not as good for science topics as the example in Figure 2 suggests. In fact, it is extremely good with pop culture, but for science applications like our RA, Google’s KG often just takes information from Wikipedia. In its earliest form Google KG was based on another KG known as Freebase. In 2014 Google began the process of shutting down Freebase and moving content to a KG associated Wikipedia called Wikidata. However, the Freebase archive is still on-line had has some excellent science content.

Launched in 2012 with a grant from Allen Institute, Google and the Gordon and Betty Moore Foundation Wikidata information is used in 58.4% of all English Wikipedia articles.   Items in Wikidata each have an identifier (the letter Q and a number) and each item has a brief description and a list of alias names. (For example, the item for Earth (Q2) has alternative names: Blue Planet, Terra Mater, Terra, Planet Earth, Tellus, Sol III, Gaia, The world, Globe, The Blue Gem, and more.) each item has a list of affiliated “statements” which are the “object-relation-object” triples that are the heart of the KG.   Relations are predicates and are identified with a P and a number. For example, Earth is an “instance of” (P32) “inner planet” (Q3504248). Figure 3 shows an illustration of the item “Hubble Telescope”.   There are currently 68 million items in Wikidata and, like Wikipedia it can be edited by anyone.

Having a KG is not very useful unless you have a way to search it.   In the case of Wikidata (and other RDF KGs) the language for making queries is called SPARQL. Combined with Wikidata, SPARQL queries are a very powerful way to search the KG.   To give a trivial example of what a SPARQL query look like let’s search for all the scientific instruments carried on the Hubble Telescope.

hubblefinal

Figure 3.   Wikidata object Q2513, Hubble Telescope. This is an edited version of the full entry which has dozens of property statements.

To write the query we need to know that Hubble had id wd:q2513 and that the predicate “carries scientific instrument” is wdt:P1202. The query and results are shown below. To read the query note there are two unbound variables ?Inst and ?InstLabel.   The only significant part of the request is a match for tuples of the form (Hubble telescope, carries scientific instrument, ?Inst).

SELECT ?Inst ?InstLabel WHERE {
SERVICE wikibase:label { bd:serviceParam wikibase:language “[AUTO_LANGUAGE],en”. }wd:Q2513 wdt:P1202 ?Inst.
} LIMIT 100

The table below shows the output.

sparql

This example does not do justice to the power of the search capabilities.   A look at the example in the Wikidata Query Service will illustrate that point.

One of the more impressive KGs for science is the Springer Nature SciGraph which has over 2 billion triples related to scientific topics. While the content contains the full Springer content, it goes well beyond that such patents and grant awards.   Zhang et.al [zhang] have demonstrated the use of knowledge graphs for recommendations in the NASA Science Knowledge Graph (SKG) .

Building specialized KGs for science domains has been going on for a while.   In 2009, the Hanalyzer (short for high-throughput analyzer) system uses natural language processing to automatically extract a semantic network from all PubMed papers relevant to a specific scientist.

Where’s the Data?

This brings us to the question is Wikidata a place to store experimental data sets? The usual approach to data set description is via Schema.org.   However recent work by one of the Wikidata founders, Denny Vrandecic, and presented at the workshop Advanced Knowledge Technologies for Science in a FAIR World (AKTS) entitled Describing datasets in Wikidata described how this can be done when schema.org may not be sufficient. At that same workshop Daniel Garijo, Pedro Szekely described a way the extended Wikidata to support external collection in a presentation entitled WDPlus: Leveraging Wikidata to Link and Extend Tabular Data.   We shall ague below that this is an important possible component of the research assistant.

The Semantic Scholar Literature Graph

There is a very different approach to the problem of storing information about research papers than Wikidata.   The Allen Institute for Artificial Intelligence (AI2) has built the Semantic Scholar, a graph of the scientific literatures that has a structure that is tightly focused on research paper, their authors and the concepts in the papers that link them together.   More specifically, the Semantic Scholar Literature Graph, as described by Waleed Ammar, et. al has the following node types:

  • Author – a person record
  • Paper – a paper has a title, venue, year, etc.
  • Entities – unique scientific concepts like “deep learning” or “natural language processing”.
  • Mentions – references to entities from text

The nodes are linked by edges including author-to-paper, paper-citations, and mentions which are references in the text to entities.   Between mentions, edges link mention in the same sentence and between entities that are somehow related.   Figure 4 illustrates the graph.

literature-graph

Figure 4.  An approximate subgraph of the Literature Graph for a paper in Semantic Scholar.

Many of the entity node are associated with items in Wikimedia.

Another related project from AI2 is the GraphAL [GraphAL] query system for the knowledge graph. The query system can be accessed on-line. The types of queries that can be expressed are powerful. For example, finding the papers that mention certain pairs of entities, or all authors of papers that mention a certain entity. We shall return to this capability below.

Building the Research Assistant

If we consider the first of two tasks on our requirements list for the RAs functionality

  1. Cataloging research data, papers and articles associated with its owner’s projects

we see that this challenge may be well met by having the RA possess a copy of Wikidata together with the extensions described by Denny Vrandecic discussed above. If not that then Garijo and Szekely’s WDPlus Wikidata extension may be a perfect solution.

Turning now to the second task:

  1. The assistant will monitor the research literature looking for papers exploring the same concepts seen in the owner’s work

we see the nature of the challenge is very different, but progress has been made on this task. Xiaoyi et.al have shown it is possible to use a combination of neural networks and rule-based reasoning to identify semantic entities and even implicitly cited datasets in earth science papers.

Given a set of research notes, grant proposal or draft of research papers, we need a way the way to identify the concepts in the user’s documents and then insert them into a version of the Semantic Scholar Literature graph. To do we can use a language model to scan the documents looking for interesting literature terms.   The state of the art for language parsing has made great strides over the last few years and we will look at one called Bidirectional Encoder Representation from Transformers (called BERT)

Using BERT to extract knowledge from documents

Most older language analysis models were built from deep LSTM networks (which we discussed in our book on cloud computing). These models were unidirectional in that the processed text from right to left or left to right in order to train the network. Devlin et.al published the BERT paper in 2018 and revised it in 2019. BERT is unique in several respects. First it is designed so that it can be “pre-trained” on plane text to build a special encoder.   Then, for various language analysis tasks, such as question answering, paraphrasing and language inference, an additional layer is added so that the encoder plus the new layer can be tuned to address the task. (This is a type of transfer learning we have described before.) What makes this possible is the way BERT uses an encoder that captures a whole sentence at a time. The training is done by masking out a small number of words (15%) in the input and then using a loss function that measures how well the network predicts the correct masked word. Figure 5 below illustrates this.   The core of the encoder is based on transformers which have been shown to be powerful ways to capture context. (See the harvardNLP Annotated Transformer for a detailed walk through of building transformers.

bert-diagramFigure 5. Bert training of the encoder based on masking random words for the loss function. This figure taken from “BERT – State of the Art Language Model for NLP” by Rani Horev in Lyrn.

Another good blog explaining BERT and the transformers is by Ranko Mosic. The AllenNLP group has an excellent demo using the masked language model and this is illustrated in Figure 6. This shows the result of using a sentence “Multicore processors allow multiple threads of execution to run in parallel on the various cores.” with processors, execution and parallel masked. You can note that it did a rather good job (code is a reasonable substitute for execution here.)

bert-example2

Figure 6. The AI2 masked language model demo using the sentence “Multicore processors allow multiple threads of execution to run in parallel on the various cores.” with processors, execution and parallel masked.

Another application of a BERT based language model is semantic role labeling. This is good for analyzing sentences and identifying a subject verb and object. For our purposes this is important.   We would like to extract from the scientists document key scientific terms and the implied relations between them. With this we can query the literature graph for matches, or we can use it to extend the scientist private version of the literature graph or knowledge graph.

For example, a researcher working on papers related to programming of parallel computers may have key phrases that include, multicore programming, data parallel, multithreaded programs, synchronization, map reduce, BSP, etc.   The type of triples we may discover by mining the documents may include

(map reduce, used in, data parallel programming)

(multicore processors, speedup, multithreaded execution)

(synchronization problems, encountered in, multithreaded programs)

(locking mechanisms, solve, synchronization problems)

(bulk synchronous parallel, solve, synchronization problems)

(BSP, alias, bulk synchronous parallel)

(map reduce, type of, parallel algorithm)

The first and third elements of the triples correspond to entities that are associated with mentions in the document. The verbs are potential labels for entity-entity edges in the graph.

To demonstrate the capability the AI2 implementation of sematic role labeling we downloaded the language model and used it in a Jupyter notebook.   We tested it with a sentence related to general relativity:

A gravitational singularity is a place where gravity has caused an event horizon which created a black hole in space, but a naked singularity is a gravitational singularity that does not have an event horizon and, therefore naked singularities do not create a black hole.

Running this through the “predictor” function of the bert-base-srl-2019.06.17 model gives the output in Figure 7.

bart-output

Figure 7. Output of “predictor” function of the bert-base-srl-2019.06.17 AI2 model. The code to download the model for python is given in the document associated with the demo. The function pullTripples is a post processor which removes annotations not essential for this illustration and formats the output.

As can be seen in the figure the model identified the key noun phrases (naked singularity, gravitational singularity, event horizon, black hole and gravity) as well as a sequence of reasonable triples. It should be possible to use the GraphAL query system to find associated entities on the literature graph. Indeed, a simple search in Semantic scholar for these terms will find dozens of related papers. From these results, one can build a personal literature graph for each of the owner’s documents with links to the discovered material.

The Really Hard Problems

The final two requirements for the research assistant pose some really hard problems.

  1. Automatically sifting through open source archives like GitHub that may be of potential use in current projects.
  2. Understanding the mathematical analysis in the notes generated by the scientist and using that understanding to check proofs and automatically propose simulators and experiment designs capable of testing hypotheses implied by the research.

Github already has a very good search interface that can be used to discover resources related to specific general topics. For example, searching for “multicore programming” retrieves an excellent collection of archives that address the relevant to topics of parallelism and synchronization.

The Github machine learning group (yes, every organization these days has a ML or AI group) has done some nice work on using LSTM networks to translate English language text such as “ Read csv file into Pandas dataframe”, into the corresponding Python code.   This is done by building a good vector space embedding of the English statement and a trained LSTM that creates English summaries of code fragments. By associating the summaries with the original English question, they can map the question to the associated code. The Github team is also collaborating with Microsoft Research Cambridge where a team is working on Program Understanding. While all of this is still very early work it appears to be very promising.

Automatically “understanding” mathematical analysis

The fourth property in our RA list reaches way beyond current capabilities. The work from the GitHub team described above can make translating English program requirements into code very similar to natural language translation, but anything involving “understanding” is, for now, out of reach.   However, there have been some interesting early attempts to bridge language and models of scientific theory. Eureka (now DataRobot) does automatic AI based time series analysis and DataRobot is also a tool for automatically building ML models given only data. Michael Schmidt and Hod Lipson consider the problem of deriving theoretical mathematical models directly from experimental data (see Figure 8).

schmidt-lipson

Figure 8. From Michael Schmidt and Hod Lipson, Distilling Free-Form Natural Laws from Experimental Data. (SCIENCE VOL 324 3 APRIL 2009)

Automatic theorem checking research has been going on for years, but these systems require formal statements of the theorem to be checked and are usually designed for human-machine collaboration. If it were possible to create a system that could take a journal paper and automatically extract a formal expression of the mathematical content in a form that a checker could input, then we would be getting close to the goal.

The most impressive work on using the advanced deep learning technology to “comprehend” scientific text comes again from the AI2 team. Their system Aristo is “an intelligent system that reads, learns, and reasons about science”. Aristo recently got an “A” on the N.Y. Regents 8th grade science exams. This exam consists of multiple-choice questions such as the following:

Which object in our solar system reflect light and is a satellite that orbits around one planet? (A) Moon, (B) Earth, (C) Mercury, (D) Sun.

Aristo works by combining a number of component solvers to bear on the problem. Information retrieval and statistics for am important layer.   Pointwise mutual information is used to measure the likely hood of each Question-Answer pair against information retrieved from the text corpus. A quantitative reasoning solver is used to address questions that involved basic quantitative statements.   As shown in Figure 9, a tuple inference solver builds graphs that connect tuples from a scientific knowledge base to the terms in the question and the answers. Dalvi, Tandon and Clark have constructed an excellent knowledge base of science-related triples called the Aristo Tuple KB/.

The graphs with the most promising connection to one of the answers is the winner.

aristo

Figure 9. From Clark, et.al, From ‘F’ to ‘A’ on the N.Y. Regents Science Exams: An Overview of the Aristo Project. https://allenai.org/content/docs/Aristo_Milestone.pdf Aristo Tuple Inference Solver. Key terms in the question and answer candidates are linked to triples from the knowledge base.

While the Aristo work is a real milestone, it has far to go.   In particular, it does not yet have the ability to relate technical diagrams and equations in the text into its deductive (or abductive) analysis. I expect AI2 is working on this now. The bigger challenge, being able to classify documents by the content of the mathematical arguments used, is very hard when reasoning is spread over many pages. There is some interesting automatic document summarization work, but it is not up to this challenge.

Final Thoughts

This discussion is far too incomplete to warrant a “conclusions” section. The ability of the research assistant to take and idea and run with it is central to what we need.   The idea may be a theory expressed in a draft technical paper or research proposal. Finding all the related publication is certainly a start, but first the RA must be able to abstract the important original ideas and not just the keywords and phrases.   It may be that the key idea is a more of a metaphor for a larger truth that manifests itself in research in various disciplines. But this is probably more than any RA can grasp.

There is going to be amazing progress over the next 30 years.   This is obvious when one looks at the state of computing 30 years ago.   Much of what we have today was then only a dream.

This post contains a look at many active research projects, and I am sure I am missing some very important ones.   Please contact me if I have mischaracterized any of them or if I have missed something really important.

References

Most of the citations to literature in this blog are linked in-line.   Here are two for which I found it easier to provide explicit reference.

[grapAL] Christine Betts, Joanna Power, Waleed Ammar, GrapAL: Connecting the Dots in Scientific Literature, arXiv:1902.05170v2 [cs.DB] 19 May 2019

[zhang] Jia Zhang, Maryam Pourreza, Rahul Ramachandran, Tsengdar J. Lee, Patrick Gatlin, Manil Maskey, and Amanda Marie Weigel, “Facilitating Data-Centric Recommendation in Knowledge Graph”, in Proceedings of The 4th IEEE International Conference on Collaboration and Internet

Computing (CIC), Philadelphia, PA, USA, Oct. 18-20, 2018, pp. 207-216.

A Very Gentle Introduction to Probabilistic Programming Languages

Abstract.   Probabilistic programming languages (PPLs) allow us to model the observed behavior of probabilistic systems in terms its underlying latent variables. Using these models, the PPL provides tools to make inferences concerning the latent variables that give rise to specific observed behaviors. In this short report, we look at two such programming languages: Gen, a language based on Julia from a team at MIT and PyProb which is based on Python and Torch from the Probabilistic Programming Group at the University of Oxford.   These are not the only PPls nor are they the first, but they illustrate the concepts nicely and they are easy to describe. To fully understand the concepts behind these systems requires a deep mathematical exploration of Bayesian statistics and we won’t go there in this report. We will use a bit of math, but the beauty of these languages is that you can get results with a light overview of the concepts.

Introduction

In science we build theories that tell us how nature works.   We then construct experiments that allow us to test our theories.   Often the information we want to learn from the experiments is not directly observable from the results and we must infer it from what we measure.    For example, consider the problem of inferring the masses of subatomic particles based on the results of collider experiments,   or inferring the distribution of dark matter from the gravitational lensing effects on nearby galaxies, or finding share values that optimize financial portfolios subject to market risks, or unravelling complex models of gene expression that manifest as disease.

Often our theoretical models lead us to build simulation systems which generate values we can compare to the experimental observations.   The simulation systems are often programs that draw possible values for unknowns, call them x, from random number generators and these simulations use these values to generate outcomes y.   In other words, given values for x, our simulation is a “generative” function F which produces values y = F(x).     If our experiments give us values y’, we can think of the inference task as solving the inverse problem x = F-1(y’), i.e. finding values for the hidden variables x that give rise to the observed outcomes y’.   The more mathematical way to say this to say that our simulation gives us a probability distribution of values of y given the distribution associated with the random draws for x, which we write as p(y | x ). What we are interested in is the “posterior” probability p(x | y’) which is the distribution of x given the evidence y’. In other words, we want samples for values of x that generate values close to our experimental values y’. These probabilities are related by Bayes Theorem as

bayes-thm

Without going into more of the probability theory associated with this equation, suffice it to say that the right-hand side of this equation can be very difficult to compute when F is associated with a simulation code.   To get a feel for how we can approach this problem, consider the function F defined by our program as a generative process: each time we run the program it makes a series of decisions based on random x values it draws and then generates a value for y. What we will do is methodically trace the program, logging the values of x and the resulting ys. To get a good feel for the behavior of the program, we will do this a million time.

Begin by labeling each point in the program where a random value is drawn. Suppose we now trace the flow of the program so that each time a new random value is drawn we record the program point and the value drawn. As shown in Figure 1, we define a trace of the program to be the sequence [(a1, x1), (a2, x2), …(an, xn), y] of program address points and random values we encounter along the way.

program-trace

Figure 1. Illustration of tracing random number draws from a simulation program. A trace is composed of a list of address, value tuples in the order they are encountered. ( If there are loops in the program we add an instance count to the tuple.)

If we can trace all the paths through the program and compute the probabilities of their traversal, we could begin to approximate the joint distribution p(x,y)=p(y|x)*p(y) but given that the x’s are drawn from continuous distributions this may be computationally infeasible. If we want to find those traces that lead to values of y near to y’, we need to use search algorithms that allow us to modify the x’s to construct the right traces.   We will say a bit more about these algorithms later, but this is enough to introduce some of the programming language ideas.

To illustrate our two probabilistic programming languages, we will use an example from the book “Bayesian Methods for Hackers” by Cameron Davidson-Pilon. (There are some excellent on-line resources for the book.   This includes Jupyter notebooks for each chapter that have been done with two other PPLs: PyMC3 and Tensorflow Probability.) The example comes from chapter 1.   It concerns the logs of text messages from a user. More specifically, it is the number of text messages sent per day over a period of 74 days.   Figure 2 shows bar graph of the daily message traffic.  Looking at the data, Davidson-Pilon made a conjecture that the traffic changes in some way at some point so that the second half of the time period has a qualitative difference from the first half. Data like this is usually Poisson distributed. If so, there is an average event rate such that the probability of k events in a single time slot is given by

poisson

If there really are two separate distributions the let us say the event rate is for the first half and for the second half and a day such that for all days before that  date the first rate applies and it is the second rate after that. (This is all very well explained in the Davidson-Pilon book and you should look at the solution there that uses PPL PyMC3. The solutions here are modeled on that one.)

textingdata

              Figure 2. From Chapter 1 of “Bayesian Methods for Hackers” by Cameron Davidson-Pilon.

Gen

Gen is a language that is built on top of Julia and Tensorflow by Marco Cusumano-Towner, Feras A Saad, Alexander K Lew and Vikash K Mansinghka at MIT and described in their recent POPL paper [1]. In addition they have a complete on-line resource where you can download the package and read tutorials.

We gave a brief introduction to Julia in a previous article, but it should not be hard to understand the following even if you have never used Julia.   To cast this computation into Gen we need to build a model that captures the discussion above.   Shown below we call it myModel.

mymodel

The first thing you notice about this code are the special annotations @gen and @trace.   This tells the Gen system that this is a generative model and that it will be compiled so that we can gather the execution traces that we discussed above.   We explicitly identify the random variables we want traced by the @trace annotation.   The argument to the function is a vector xs of time intervals from 1 to 74.   We create it when we read the data from Figure 2 from a csv file (which is shown in detail in the full Jupyter notebook for this example). Specifically, xs = [1.0, 2.0, 3.0 …, 74.0] and we set a vector ys so that ys[i] is the number of text messages on day i.

If our model process is driven by a Poisson to generate y value, then the math says we should assume that the time interval between events is exponentially distributed. Gen does not have an exponential distribution function, but it does have a Gamma distribution and  gamma(1, alpha) = exponential(1.0/alpha) . The statement

lambda1 = @trace(gamma(1, alpha), :lambda1)

tells Gen to pull lambda1 values from the exponential with mean alpha and we have initialized alpha to be the mean of the ys values (which we had previously computed to be 19.74…). Finally note we have used a special Julia labeling technique :variable-name to label this to be :lambda1.   This is effectively the address in the code of the random number draw that we will use to build traces.

We draw tau from a uniform distribution (and trace and label it) and then for each x[i] <= tau we assign the variable lambda to lambda1 and for each x[i] > tau we assign lambda to lambda2.   We use that value of lambda to draw a variable from the Poisson distribution and label that draw with a string “y-i”.

We can now generate full traces of our model using the Gen function simulate() and pull values from the traces with the get_choices() function as shown below.

generate-trace

The values for the random variable draws are from our unconstrained model, i.e.   they reflect the joint probability p(x,y) and not the desired posterior probability p(x | y’) that we seek. To reach that goal we need to run our model so that we can constrain the y values to y’ and search for those traces that lead the model in that direction.   For that we will use a variation of a Markov Chain Monte Carlo (MCMC) method called Metropolis-Hastings (MH). There is a great deal of on-line literature about MH so we won’t go into it here, but the basic idea is simple. We start with a trace and then make some random mods to the variable draws. If those mods improve the result, we continue. If not, we reject it and try again.   This is a great oversimplification, but Gen and the other PPLs provide library functions that allow us to easily use MH (and other methods.)   The code below shows the how we can invoke this to make inferences.

inference_prog

The inference program creates a map from the labels for the y values to the actual constraints from our data.   It then generates an initial trace and iteratively applies the MH algorithm to improve the trace. It then returns the choices for our three variables from the final trace. Running the algorithm for a large number of iterations yields the result below.

inference_result

This result is just one sample from the posterior probabilities.   If we repeat this 100 times we can get a good look at the distribution of values.   Histograms of these values are shown below in Figure 3.

historgrams

Figure 3.   Histograms of the tau and lambda values.  While difficult to read, the values are clustered near 44, 18, 24 respectively.

If we compare these results to the Davidson-Pilon book results which used the PyMC3 (and Tensorflow Probability) PPL, we see they are almost identical with the exception of the values of tau near 70 and 5. We expect these extreme values represent traces where the original hypothesis of two separate alphas was not well supported.

There is a great deal about Gen we have not covered here including combinators which allow us to compose generative function models together.   In addition, we have not used one of the important inference mechanisms called importance sampling.   We shall return to that topic later.

PyProb

Tuan Anh Le, Atılım Günes Baydin, Frank Wood first published an article about PyProb in 2017 [3] and another very important paper was released in 2019 entitled “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale” [4] which we will describe in greater detail later.   PyProb is built on top of the deep learning toolkit PyTorch which was developed and released by Facebook research.

Many concepts of PyProb are very similar to Gen, but PyProb is Python based so it looks a bit different. Our generative model in this case is an instance of a Python class as shown below. The main requirement of the is that it subclass Model and have a method called forward() that describes how to generate our traces.   Instead of the trace annotation used in Gen, we use PyProb sample and observe functions.   The random number variables in PyProb are all Torch tensors, so to we need to apply the method numpy() to extract values. The functions Normal, Exponential and Uniform are all imported from PyProb. Other than that, our generator looks identical to the Gen example.

pyprob-model

Also note we have used the name mu1 and mu2 instead of alpha1 and alpha2 (for no good reason.) Running the MH algorithm on this model is almost identical to doing it in Gen.

pyprob-infer

Again, this is just a sample from the posterior.   You will notice that the posterior result function also tells us what percent of the traces were accepted by the MH algorithm.   PyProb has its own histogram methods and the results are shown in Figure 4 below.  The legend in the figure is difficult to read. It shows that the tau value is clusters near 44 with a few traces showing between 5 and 10.   The mu1 values are near 17 and mu2 values are near 23.   In other words, this agrees with our Gen results and the PyMC3 results in the Davidson-Pilon book.

pyprob-histograms

Figure 4. Histogram of tau, mu1 and mu2 values.

Building a PyProb Inference LSTM network.

There are several additional features of PyProb that are worth describing. While several of these are also part of Gen, they seem to be better developed in PyProb. More specifically PyProb has been designed so that our generative model can be derived from an existing scientific simulation code and it has an additional inference method, called Inference Compilation, in which a deep recurrent neural network is constructed and trained so that it can give us a very good approximation of our posterior distribution.   In fact the neural network is a Long Short Term Memory (LSTM) network that that is trained using traces from out model or simulation code.   The training, which can take a long time, produces a “distribution” q(x | y) that approximates our desired p(x | y). More of the details are given in the paper “Inference Compilation and Universal Probabilistic Programming” by Anh le, Gunes Baydin and Wood [3]. Once trained, as sketched in Figure 5, when the network is fed our target constraints y’ and trace addresses, the network will generate the sequence of components needed to make q(x|y= y’).

rnn

Figure 5. Recurrent NNet compiled and trained from model. (see [3, 4])

Building and training the network is almost automatic. We had one problem. The compiler does not support the exponential distribution yet, so we replaced it with a normal distribution.   To do create and train the RNN was one function call as shown below.

trainnetwork

Once trained (which took a long time), using it was also easy. In this case we use the importance sampling algorithm which is described in reference [3] and elsewhere.

usetrained

Figure 6 illustrates the histograms of values drawn from the posterior.

trainedhisto

Figure 6.   Using the trained network with our data. As can be seen, the variance of the results is very small compared to the MH algorithm.

The fact that the training and evaluation took so much longer with our trivial example is not important, because the scalability of importance sampling using the compiled LSTM network. In the excellent paper “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale” [4] Güneş Baydin, et. Al. describe the use of PyProb with a very large simulation code that models LHC experiments involving the decay of the tau lepton. They used 1024 nodes of the Cori supercomputer at LBNL to train and run their IC system. To do this required using PyProb’s ability to link a PyProb model to a C++ program. Using the IC LSTM network, they were able achieve a speed-up of over 200 over a baseline MCMC algorithm. The paper describes the details of the implementation and testing.

Conclusion

The goal of this paper was to introduce the basic ideas behind Probabilistic Programming Languages by way of two relatively new PPLs, Gen and PyProb.   The example we used was trivial, but it illustrated the concepts and showed how the basic ideas were expressed (in very similar terms) in both languages.   Both languages are relatively new and they implementations are not yet fully mature.   However, we are certain that probabilistic programming will become a standard tool of data science in the future. We have put the source Jupyter Notebooks for both examples on GitHub.   Follow the installation notes for Gen and PyProb on their respective webpages and these should work fine.

https://github.com/dbgannon/probablistic-programming

The traditional way computer science is taught involves the study of algorithms, based on cold, hard logic which, when turned into software, runs in a deterministic path from input to output. The idea of running a program backward from output to the input does not make sense. You can’t “unsort” a list of number. The problem is even more complicated if our program is a scientific simulation or data science involving machine learning. In these cases, we learn to think about the results of a computation as representatives of internally generated probability distributions.

Some of the most interesting recent applications of AI to science have been the result of work on generative neural networks.   These systems are trained to perfectly mimic the statistical distribution of scientific data sets.   They can allow us to build “fake” human faces or perfect, but artificial spiral galaxies, or mimic the results of laboratory experiments. They can be extremely useful but, in the case of science, they tell us little about the underlying laws of nature.  PPLs allow us to begin to rescue the underlying science in the generative computation.

References

Some of these are link.   Two can be found on arXiv and the Gen paper can be found in the ACM archive.

eScience 2050: A Look Back

Abstract— This was originally written as an invited “vision” presentation for the eScience 2019 conference, but I decided to present something more focused for that event. Never wanting to let words go to waste I decided to put a revised version here.   It was written as a look back at the period of eScience from 2019 to 2050.   Of course, as it is being published in 2019, it is clearly a work of science fiction, but it is based on technology trends that seem relatively clear. Specifically, I consider the impact of four themes on eScience: the explosion of AI as an eScience enabler, quantum computing as a service in the cloud, DNA data storage in the cloud, and neuromorphic computing.

I.  Introduction

Predictions of the future are often so colored by the present that they miss the boat entirely.   The future, in these visions, tends to look like today but a bit nicer. Looking at images of the future from the 1950s we see pictures of personal helicopters (Figure 1) that clearly missed the mark.   But in some cases, the ideas are not so far off, as in the case of the woman ordering shirts “online” as in Figure 2.

hilicopters5

Fig 1. Future personal transportation. Mechanics Illustrated, Jan 1951

videophone2

Fig 2. Shopping “online” vision.

To look at the evolution of eScience in the period from 2019 to 2050, it can help to look at the state of the art in computing in 1960 and ask how much of todays technology can we see emerging there.   My first computer program was written when I was in high school. It was a Fortran version of “hello world” that consited of one line of Fortran per punched card as shown in Figure 3 and the entire program consisted of a deck of 12 cards .

Punched_card2

Fig 3.   The Fortran statement CALL RCLASS(AAA,21,NNC, PX3,PX4)

If asked to predict the future then, I might have said that software and data in the years ahead would be stored on a punched paper tape over ten miles long.

Fifty years ago, the high points of computing looked like:

  • The programming language FORTRAN IV was created in 1961.
  • IBM introduced its System/360. A monster machine.
  • Gordon Moore makes an observation about scale in 1965 (that became Moore’s law).
  • IBM created the first floppy disk in 1967.
  • The SABRE database system that was used by IBM to help American Airlines manage its reservations data.
  • In 1970 Edgar Codd came up with Relation Database Idea. It was not implemented until 1974.
  • The Internet: a UCLA student, tries to send “login,” the first message over ARPANET, at 10:30 P.M. on October 29, 1969.  (The system transmitted “l” and then “o” …. and then crashed.)

     What we can extrapolate from this is that mainframe IBM computers will continue to dominate, databases are going to emerge as a powerful tool and there will be a future for computer-to-computer networking, but PCs and web search were not remotely visible on the horizon. The profound implication of Moore’s law would not be evident until the first microprocessors appear in 1979 and the personal computer revolution of the 1980s followed. Those same microprocessors led to the idea that dozens of them could be packaged into a “parallel computer”. This resulted in a period of intense research in universities and startups to build scalable parallel systems. While the first massively parallel computer was the Illiac IV which was deployed at NASA Ames in 1974, it was the work on clusters of microprocessors in the 1980s [1] that led to the supercomputers of 2019. Looking at computing in 1970, few would have guessed this profound transition in computing from mainframes like the System/360 to the massively parallel systems of 2019.

II. eScience from 2019 Forward

A.   The Emergence and Evolution of the Cloud

   The World Wide Web appeared in 1991 and the first search engines appeared in the mid 90s. As these systems evolved, so did the need to store large segments of the web on vast server farms. Ecommerce also drove the need for large data centers. These data centers quickly evolved to offering data storage and virtual machines as a service to external users.   These clouds, as they became known, were designed to serve thousands to millions of concurrent clients in interactive sessions.

   For eScience, clouds provided an important alternative to traditional batch-oriented supercomputers by supporting interactive, data exploration and collaboration at scale. The cloud data centers evolved in several important way. Tools like Google’s Kubernetes was widely deployed to support computation in the form of swarms of microservices built from software containers. Deploying Kubernetes still required allocating resources and configuring systems and microservices are continuously running processes. Serverless computing is a newer cloud capability that avoids the need to configure VMs to support long running processes when they are not needed.

   Serverless computing is defined in terms of stateless functions that respond to events such as signals from remote instruments or changes in the state of a data archive. For examples, an eScience workflow can be automatically triggered when instrument data arrives in the data storage archive. The analysis in the workflow can trigger other serverless function to complete additional tasks.

   The Cloud if 2019 is defined in terms of services and not raw hardware and data storage. These services include

  • Streaming data analytics tools that allow the users to monitor hundreds of live streams from remote instruments.   Edge computing services evolved as a mechanism to off-load some of the analytics to small devices that interposed between the instruments and the cloud.
  • Planet scale databases allow a user to store and search data collections that are automatically replicated across multiple continents with guaranteed consistency.
  • Services to support applications of machine learning became widely available across all the commercial public clouds. These services include speech recognition and generation, automatic language translation, computer vision tasks such as image recognition and automatic captioning.

   The architecture of the cloud data centers continued to evolve toward ideas common in modern supercomputers. The first major change was the introduction of software defined networking throughout the data centers. Amazon began introducing GPU accelerators in 2012. This was followed by Google’s introduction of TPU chips to support the computational needs of deep learning algorithms coded in the Tensorflow system. The TPU (figure 4) is a device designed to accelerate matrix multiplication. It is based on systolic computation ideas of the 1980s and has been applied primarily to Google machine learning projects.

Fig 4.   Google TPU v3 architecture. See N. Jouppi, C. Young, N. Patil,  D. Patterson [2, 3].

The most radical evolutionary step has been taken by Microsoft in the Azure cloud. Project Brainwave represents an introduction of a mesh connected layer of FPGA devices that span the entire data center (figure 5).   These devices can be programed to cooperate on parallel execution of lage computational task including serving deep learning models.brainwave-archFig 5. Microsoft FPGA based Project Brainwave [4, 5]

A.   Major Themes Going Forward from 2019

The following pages will describe the four major themes that are going to dominate the ten years forward from 2019.   They are

  • The explosion of AI as an eScience enabler.
  • Quantum computing as a service in the cloud.
  • DNA data storage in the cloud.
  • Neuromorphic computing

 III. AI and eScience

     Machine Learning went through a revolution in the years from 2008 to 2019 due to the intensive work at universities and in companies.   The cloud providers including Google, Microsoft and others were driven by the need improve their search engines. Along the way they had amassed vast stores of text and images. Using this data and the new deep learning technologies they were able to build very powerful language translations systems and image recognition and captioning services that were mentioned above. By 2017, applications to eScience were beginning to emerge.

A.   Generative Neural Networks and Scientific Discovery

     One of the more fascinating developments that arose from the deep learning research was the rise of a special class of neural networks called generative models. The important property of these networks is that if you train them with a sufficiently, large and coherent collection of data samples, the network can be used to generate similar samples. These are most often seen in public as pernicious tools to create fake images and videos of important people saying things they never said.

     What these networks are doing is creating a statistical distribution of data that, when sampled, produces data that has properties that match nicely with the input data. When a scientist creates a simulation based on assumed parameters of nature, the simulations is evaluated against the statistical properties of the experimental observations.   Generative models can be used to validate the simulation output in case the data is sparse. One of the most commonly used generative model is the Generative Adversarial Network (GAN) in which to networks are pitted against each other. As shown in Figure 6, a discriminator is trained to recognize the data that is from the real data set and a generator is designed to fool the discriminator. When converged the generator mimics the data distribution of the real data. There are several interesting examples of how generative models have been used in science. We list several in [6], but two that stand out are an application in astronomy and one in drug design.

GAN

Fig 6. Generative Adversarial Network (GAN) from this site.

    M. Mustafa, and colleagues demonstrate how a slightly-modified standard Generative Adversarial Network (GAN) can be used generate synthetic images of weak lensing convergence maps derived from N-body cosmological simulations [7]. The results illustrate how the generated images match their validation tests, but what is more important, the resulting images also pass a variety of statistical tests ranging from tests of the distribution of intensities to power spectrum analysis.

    However, generative methods do not, by themselves, help with the problem of inferring the posterior distribution of the inputs to scientific simulations conditioned on the observed results, but there is value of in creating ‘life-like’ samples. F. Carminati, G. Khattak, S. Vallecorsa make the argument that designing and testing the next generation of sensors requires test data that is too expensive to compute with simulation . A well-tuned GAN can generate the test cases that fit the right statistical model at the rate needed for deployment [8].

     In the area of drug design, Shahar Harel and Kira Radinsky have used a generative model to suggest chemical compounds that may be used as candidates for study [9]. They start with a prototype compound known to have some desirable properties. This is expressed as sequence and fed to a layer of a convolutional network that allow local structures to emerge as  shorter vectors that are concatenated. A final all-to-all layer is used to generate sequence of mean and variance vectors for the prototype. his is fed to a “diversity layer” which add randomness as shown in Figure 7.

drug-network

Fig 7. Shahar Harel and Kira Radinsky multi-layer generative netowork for drug design.

       The decoder is an LSTM-based recurrent network which generates the new molecule. The results they report are impressive. In a one series of experiments they took as prototypes compounds from drugs that were discovered years ago, and they were able to generate more modern variations that are known to be more powerful and effective. No known drugs were used in the training.

B.   Probabilistic Programming and Bayesian Inference

     For very large experiments such as those conducted in the Large Hadron Collider, the scientists are interested in testing their models of the universe based the record of particle collisions that were generated.   It is often the case that the scientists have a detailed simulation of the experimental system based on the current theoretical models of physics that are involved. These models are driven by setting parameters that correspond to assumptions about nature. When they run the experiment, they see new behaviors in the data, such as new particles emerging from collisions, and they would like to see which parameters when fed to the simulation can produce the output that corresponds to the experimental results. In other words, the simulation is a function taking parameters x to outputs y and the interesting question is given the outputs what the corresponding values for x. This is an inverse problem that is notoriously difficult to solve for large scientific problems. In terms of Bayesian statistics, they are interested in the posterior distribution P(x | y) where y is the experimental statistics.

     In 2018 and later researchers began to study programming languages designed to express and solve problems like these. Pyro [10], Tensorflow Probability [11], and the Julia-based Gen [12] are a few of the Probabilistic Programing Languages (PPLs) introduced in 2019. One of the ideas in these languages is to build statistical models of behavior where program variables represent probability distributions and then, by looking at traces of the random choices associated with these variables during execution, one can make inferences about the random behaviors that influence outcomes.

     These languages have been put in use in industries such as finance, their role in eScience is now becoming apparent.    A. Güneş Baydin, et al. showed that PPLs can be used in very large Bayesian analysis of data from an LHC use case. Their approach allows a PPL too couple directly to existing scientific simulators through a cross-platform probabilistic execution protocol [13].

     Probabilistic programming with PPLs will go on to become a standard tool used by eScientists.

C.   AI-based Research Assistant

      2018 was the year of the smart on-line bot and smart speaker. These were cloud based services that used natural language interfaces for both input and output. The smart speakers, equipped with microphones listen for trigger phrases like “Hello Siri” or “hello Google” or “Alexa” and recorded a query in English, extracted the intent and replied within a second. They could deliver weather reports, do web searches, keep your shopping list and keep track of your online shopping.

     The impact of this bot technology will hit eScience when the AI software improves to the point that every scientist, graduate student and corporate executive had a person cloud-based research assistant. Raj Reddy calls these Cognition Amplifiers and Guardian Angels. Resembling a smart speaker or desktop/phone app, the research assistant is responsible for the following tasks:

  • Cataloging research data, papers and articles associated with its owner’s projects. The assistant will monitor the research literature looking for papers exploring the same concepts seen in the owner’s work.
  • Coordinate meetings among collaborators and establish coordination among other research assistants.
  • Automatically sifting through other data from collaborators and open source archives that may be of potential use in current projects.
  • Understanding the mathematical analysis in the notes generated by the scientist and using that understanding to check proofs and automatically propose simulators and experiment designs capable of testing hypotheses implied by the research.

     From the perspective of 2019 eScience, the claim that the Research Assistant of the future will be able to derive simulations and plan experiments may seem a bit “over the top”. However, progress on automatic code completion and generation was making great strides in 2019. Transformers [16] provide a second generation of sequence-to-sequence AI tools that outpaced the recurrent neural networks previously used.   Advances in programming tools based on similar technology show that AI can be used to generate code completions for C# method calls based on deep semantic mining of online archives of use cases.

     We built a toy demo of the Research Assistant in [14] that could take natural language (spoken or typed) input and find related research articles from Bing, Wikipedia and ArXiv (see Figure 6.)

res-asst4

    Fig 6. Demo cloud-based “research assistant” [14]

     This demo prototype research assistant was built by composing a few cloud tools as shown in Figure 7. A much more sophisticated version can be built using a modern chatbot tool like RASA, but that will not address everything that is needed to reach the final goal.

Ra-components

Fig 7. The toy research assistant demo is hosted an “elegant” cardboard container from Google’s voice kit running a python script on raspberry Pi. It invokes Googles speech to text, a text analysis service running on Algorithmia and uses Amazon’s Lex to generate voice. Depending on context, calls are made to Bing, Wikipedia or ArXiv.

     While this demo was a toy, there has been serious work over the last few years to make progress on this topic. In 2018, Google introduced its Dataset Search service to provide easy data set discovery. The work of the Allen Institute for AI stands out [15]. Their Semantic Sanity project is a sophisticated research search engine that allows you to tell it basic topics that interest you and it will monitor ArXiv looking for important related contributions.   Aristo is “an intelligent system that reads, learns, and reasons about science”. It can reason about diagrams, math and understand and answer elementary school science questions.

     One important problem that must be solved is extracting non-trivial mathematical concepts from research papers and notes so that similar papers can be found. In addition to Semantic Sanity, there is some other early related work on this problem. For example, researchers are using unsupervised neural net to sift through PubChem and uses NLP techniques to identify potential components for materials synthesis [22].

 IV. The Rise of Quantum in the Cloud

     Long viewed as a subject of purely theoretical interest, quantum computing emerged in 2018 as a service in the cloud. By 2019, Rigetti, IBM, D-Wave and Alibaba had live quantum computing services available.   Google and Microsoft followed soon after that.   The approaches taken to building a quantum computer differed in many respects.

     The basic unit of quantum computing is the qubit which obeys some important and non-intuitive laws of physics. Multiple qubits can be put into an “entangled” state in which an observation about one can effect the state of the others even when they are physically separated.   One can apply various operators to qubits and pairs of qubits to form “logical” circuits and these circuits are the stuff of quantum algorithms. A fact of life about quantum states is that once you measure them, they are reduced to ordinary bits. However, the probability that a bit is reduced to a zero or a one is determined by the unmeasured quantum state of the system. A good quantum algorithm is one in which the quantum circuit produces outputs with a probability distribution that makes solving a specific problem very fast.

     The first important quantum computations involved solving problems in quantum chemistry. In fact, this was the role that Richard Feynman had suggested for a quantum computer when he first came up with the idea in the 1980s. The other area of eScience that quantum computers excelled at was certain classes of optimization problems.

     Early quantum computers are program by literally composing the quantum gates as illustrated in Figure 8.quantum-circuit

Fig 8. A simple quantum circuit to create an entangled pair of two qubits and measure the result

     Programming tools to build such circuits include IBM’s IBM-Q qiskit and Microsoft’s Q# language and compiler. The IBM hardware was based what they call “noisy intermediate-scale quantum computers (NISQ)”. As shown in Figure 9, the induvial qubits are non-linear oscillators. Tuned superconducting resonator channels are used for readout. Also, the qubits are tied together by additional superconducting channelsibm-qubits

Fig 9 An early 5-qubit IBM-Q computational unit. Source: IBM-Q website.

     One problem with early quantum systems is that the qubits are very susceptible to noise and degrade quickly.   In other words, the error rate of quantum algorithms can be very high. The deeper the quantum circuit (the number of layers of “gates”), the more noise. A way around this is to add error-correcting redundancy into the circuit, but this means more qubits are required.   Quantum volume refers to a measure of the number of qubits needed for the error-corrected circuit times the depth of the corrected circuit. Unfortunately, early quantum computers had very low bounds on achievable quantum volume.

Microsoft took a different approach to building qubits that are based on the topological property of woven threads. These are much more noise resistant and hence capable of building deeper circuits with less error correction.

IV.  DNA-Based Storage

     There are two problems with 2019 era storage technologies.   First, it was not very stable. Data degrades and storage devices fail. Second, there was not enough storage capacity to capture the data that was being generated.   2.5 quintillion bytes of data was generated each day in 2018. Even if you cast out 99.9% of this as being of no long-term value, the remainder still leaves 2.5e+15 bytes per day. The amount of data stored in the cloud in 2018 was 1.0e+18 (an exabyte). In other words, the cloud only held 400 days’ worth of the important data. Obviously, another order or magnitude or more of the data needed to be discarded to allow the data centers time to expand to contain this growth. A new storage technology was needed.

     DNA storage was first proposed in the 1960, but not much happened with the idea until 1988 when researchers from Harvard stored an image in the DNA of e.coli. By 2015 researchers at Columbia University published a method that allowed the storage of 215 petabytes (2.15e+17) of data per gram of DNA. And DNA is very stable over long periods of time. While this was promising, there was still a big problem.   The encoding and decoding of the data were still an expensive manual process and it was not practical to have lab scientists running around in the back rooms of data centers managing wet processes.

        In 2019, researchers at the University of Washington and Microsoft demonstrated an automated lab that could encode and decode data to DNA without human hands. The system works by converting the ones and zeros of digital data into the As, Ts, Cs and Gs that make up the building blocks of DNA. These encoded sequences are then fed into synthesizing systems. Their next breakthrough was to reduce the entire “lab” to a chip that uses microfluidics to route droplets of water around a grid to enact the needed chemistry steps. They also produced a sophisticated software stack called puddle that allow the scientist to program this with conventional high-level languages [20].

      Other research has demonstrated ways in which DNA encoded data could be searched and structured in ways like relational databases.   As the costs came down, this became the standard cloud storage technology.

VI. Neuromorphic Computing

     It was long a dream of computer designers to build systems that mimicked the connectome of the biological brain. A major research initiative in Europe, the Human Brain Project, looked at the possibility of simulation a brain on traditional supercomputers. As that proved unrealistic, they turned to the study of special hardware devices that can simulate the behavior of neuron. Technology to build artificial neurons progressed in university and industry labs. The Stanford neurogrid can simulate six billion synapses.   In 2017 Intel introduced the Loihi neuromorphic research test chip. The device is a many-core mesh of 128 neuromorphic cores and each of which contains 1,024 primitive spiking neural units. The neural units are current-based synapse leaky integrate-and-fire neurons [21].

     In addition to the Loihi chip, Intel has released Pohoiki Beach, a system comprised of 64 Loihi chips and a substantial software stack to allow application development. Because overall power consumption is 100 times lower than GPU based neural networks, Intel’s application target is autonomous vehicles and robots. While full realization of true biological brain-like functionality may not be realized until the 2nd half of the 21st century, it is none the less an exciting step forward.

 VII.   Conclusions

     EScientists in 2019 found themselves at an important inflection point in terms of the technology they could deploy in doing science.   The cloud has evolved into a massive on-line, on-demand, heterogeneous supercomputer. It not only supports traditional digital simulation; it will also support hybrid quantum-digital computation.   It will soon allow applications to interact with robotic sensor nets controlled by neuromorphic fabrics. Storage of research data will be limitless and backed up by DNA based archives.

     One of the most remarkable features of computing in the 21st century has been the evolution of software. Programming tools have evolved into very deep stacks that have used and exploited AI methods to enable scientific programmer to accomplish more with a few lines of Julia in a Juypter notebook than was remotely possible programming computers of 1980. New probabilistic programming languages that combine large scale simulation with deep neural networks show promise of making Bayesian inference an easy to use used tool in eScience.

     The role of AI was not limited to the programming and execution of eScience experiments. Users of the AI research assistants in 2050 can look back at the “smart speakers” of 2019 and view them as laughably primitive as the users of the 2019 World Wide Web looked back at the 1970s era where computers were driven by programs written on decks of punched cards.

References

  1. G. Fox, R. Williams, P. Massina, “Parallel Computing Works!”, Morgan Kauffman, 1994.
  2. N. Jouppi, C. Young, N. Patil, D. Patterson, “A Domain-Specific Architecture for Deep Neural Networks”, Communications of the ACM, September 2018, Vol. 61 No. 9, Pages 50-59
  3. https://cloud.google.com/tpu/docs/images/tpu-sys-arch3.png
  4. Chung, et al. “Serving DNNs in Real Time at Datacenter Scale with Project Brainwave.” IEEE Micro 38 (2018): 8-20.
  5. Fowers et al., “A Configurable Cloud-Scale DNN Processor for Real-Time AI,” 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, 2018, pp. 1-14
  6. D. Gannon, “Science Applications of Generative Neural Networks”, https://esciencegroup.com/2018/10/11/science-applications-of-generative-neural-networks/, 2018
  7. Mustafa, et. al. “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018)
  8. Carminati, G. Khattak, S. Vallecorsa, “3D convolutional GAN for fast Simulation”, https://www.ixpug.org/images/docs/IXPUG_Annual_ Spring_Conference_2018/11-VALLECORSA-Machine-learning.pdf
  9. Harel and K. Radinsky, “Prototype-Based Compound Discovery using Deep Generative Models” http://kiraradinsky.com/files/acs-accelerating-prototype.pdf
  10. Bingham, et al, “Pyro: Deep universal probabilistic programming.” Journal of Machine Learning Research (2018).
  11. Tensorflow Probability, https://medium.com/tensorflow/an-introduction-to-probabilistic-programming-now-available-in-tensorflow-probability-6dcc003ca29e
  12. Gen, https://www.infoq.com/news/2019/07/mit-gen-probabilistic-programs
  13. Güneş Baydin, et al, “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale”, arXiv:1907.03382v1
  14. Gannon, “Building a ‘ChatBot’ for Scientific Research”, https://esciencegroup.com/2018/08/09/building-a-chatbot-for-scientific-research
  15. Allen Institute for AI, Semantic Sanity and Aristo, https://allenai.org/demos
  16. Vaswani et al., “Attention Is All You Need”, arXiv:1706.03762v5
  17. Chong, Hybrid Quantum-Classical Computing https://www.sigarch.org/hybrid-quantum-classical-computing/
  18. Shaydulin, et al., “A Hybrid Approach for Solving Optimization Problems on Small Quantum Computers,” in Computer, vol. 52, no. 6, pp. 18-26, June 2019.
  19. Takahashi, B. Nguyen, K. Strauss, L. Ceze, “Demonstration of End-to-End Automation of DNA Data Storage”, Nature Scientific Reports, vol. 9, no. 1, pp. 2045-2322, 2019
  20. Wellsey, et al., “Puddle: A Dynamic, Error-Correcting, Full-Stack Microfluidics Platform”, ASPLOS’19 , April 13–17, 2019.
  21. Intel Loihi Neurmorphic Chip. https://en.wikichip.org/wiki/intel/loihi
  22. E. Kim, et al., “Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning”, Chem. Mater. 2017 29219436-944

Scientific Workflow in the Cloud using Serverless Functions

Introduction

Wikipedia has a pretty good definition of workflow: an orchestrated and repeatable pattern of activity, enabled by the systematic organization of resources into processes that transform materials, provide services, or process information.   Doing Science involves the workflow of repeating and documenting experiments and the related data analysis. Consequently, managing workflow is critically important for the scientific endeavor. There have been dozens of projects that have built tools to simplify the process of specifying and managing workflow in science. We described some of these in our 2007 book “Workflows for e-Science” and another Wikipedia article gives another list of over a dozen scientific workflow system and this page lists over 200 systems. Many of these systems were so narrowly focused on a single scientific domain or set of applications that they have not seen broad adoption.   However, there are a few standouts and some of these have demonstrated they can manage serious scientific workflow.

Pegasus from ISI is the most well-known and used framework for managing science workflow.   To use it on the cloud you deploy a Condor cluster on a set of virtual machines. Once that is in place, Pegasus provides the tools you need to manage large workflow computations.   Makeflow from Notre Dame University is another example. Based on a generalization of the execution model for the old Unix Make system, Makeflow uses Condor but it also has a native distributed platform called WorkQueue. Makeflow is commonly used on large HPC Supercomputers but they also claim an implementation on AWS Lambda.

Workflow in the Cloud.

Doing scientific computing in the cloud is different from the traditional scientific data centers built around access to a large supercomputer. A primary difference is that the cloud support models of computing not associated with traditional batch computing frameworks.   The cloud is designed to support continuously running services. These may be simple web services or complex systems composed of hundreds of microservices. The cloud is designed to scale to your needs (and budget) on-demand. The largest commercial public clouds from Amazon, Google and Microsoft are based on far more than providing compute cycles. They offer services to build streaming applications, computer vision and speech AI services, scalable data analytics and database services, managing edge devices, robotics and now attached quantum processors. While there are tools to support batch computing (Microsoft Azure even has an attached Cray), the cloud is also an excellent host for interactive computational experimentation.

Christina Hoffa, et. al. “On the Use of Cloud Computing for Scientific Workflows” describe some early 2008 experiments using cloud technology for scientific workflow. The cloud of 2019 presents many possibilities they did not have access to.   Two of these are “cloud native” microservice frameworks such as Kubernetes and serverless computing models.

Kubernetes has been used for several workflow systems. An interesting example is Reana from CERN. Reana is a research data analysis platform that runs on your desktop or on a cloud Kubernetes cluster. Reana uses several workflow languages but the one that is most frequently used is CWL, the Common Workflow Language, which is rapidly becoming an industry standard.   CWL is used in a number of other cloud workflow tools including AVADOS from Veritas Genetics, a version of the popular Apache Airflow workflow tools and several other systems with implementations “in progress”.   Argo is another workflow took that is designed to work with Kubernetes.

Workflow Using Serverless Computing

Serverless computing is a natural fit for workflow management.   Serverless allows applications to run on demand without regard to compute resource reservation or management.   Serverless computations are triggered by events. Typical among the list of event types are:

  • Messages arriving on Message Queues
  • Changes in Databases
  • Changes in Document Stores
  • Service APIs being invoked
  • Device sensor data sending new updates

These event types are each central to workflow automation. AWS Lambda was the first production implementation of a serverless platform, but not the last. Implementations from Microsoft, IBM and Google are now available and the open source implementation from OpenWhisk is available for OpenStack and other platforms.

Serverless computing is built around the concept of “function as a service” where the basic unit of deployment is not a VM or container, but the code of a function.   When the function is deployed it is tied to a specific event category such as one of those listed above.    These functions are intended to be relatively light weight (not a massive memory footprint and a short execution time).   The semantics of the function execution dictate that they are stateless.   This allows many instances of the same function to be responding to events at the same time without conflict.  The function instances respond to an event, execute and terminate.   While the function itself is stateless, it can affect the state of other object during its brief execution.   It can write files, modify databases, and invoke other functions.

Workflow State Management

Most scientific workflows can be described as a directed acyclic graph where the nodes are the computational steps. An arc in the graph represents a completion of a task that signals another it may start.   For example, the first task writes a file in a storage container and that triggers an event which fires the subsequent task that is waiting for the data in the file. If the graph takes the shape of a tree where one node creates events which trigger one or more other nodes, the translation to serverless is straightforward: each node of the graph can be compiled into one function. (We show an example of this case in the next section.)

One of the challenges of using serverless computing for workflow is state management. If the “in degree” of a node is greater than one, then it requires more than one event to trigger the event.   Suppose there are two events that must happen before a node is triggered. If the function is stateless it cannot remember that the one of the conditions has already been met.  The problem is that the graph itself has state defined by which nodes have been enabled for execution. For example, Figure 1 is a CWL-like specification of such a case. NodeC cannot run until NodeA and NodeB both complete.

cwl

Figure 1. A CWL-like specification of a three step workflow where the third step requires that both the first step and second step are complete.   The first and second step can run simultaneously or in any order.

One common solution to this problem is to assume there is a persistent, stateful instance of a workflow manager that holds the graph and keeps track of its global state.   However, it is possible to manage the workflow with a single stateless function. To see how this can be done notice that in the above specification each of the step nodes requires the existence of one or more input files and the computation at that node produces one or more output files.  As shown in Figure 2 below, workflow stateless function listens to the changes to the file system (or a database).

lambda_plus_kub

Figure 2.   A workflow manager/listener function responds to events in the file system created by the execution of the applications.   As shown in the next session, if the app is small, it may be embedded in the manager, but otherwise it can be containerized and run elsewhere in the cloud.

Here we assume that the application invocations, which are shown as command-line calls in Figure 1, are either executed in the lambda function or by invocations to the application wrapped as a Docker container running as a microservice in Kubernetes. When an application terminates it deposits the output file to the file system which triggers an event for the workflow manager/listener function.

The workflow manager/listener must then decide which step nodes were affected by this event and then verify that all the conditions for that node are satisfied. For example, if the node requires two file, it much check that both are there before invoking the associated application. There is no persistent state in the manager/listener as it is all part of the file system. To run multiple workflow instances concurrently each event and file must have an instance number ID as part of its metadata.

A Simple Example of Workflow using AWS Lambda

In the following paragraphs we describe a very simple implementation of a document classification workflow using AWS Lambda. The workflow is a simple tree with two levels and our goal here is to demonstrate the levels of concurrency possible with a serverless approach. More specifically we demonstrate a system that looks for scientific documents in an AWS S3 bucket and classifies them by research topic. The results are stored in an Amazon DynamoDB table. The documents each consist of a list of “scientific phrases”. An example document is below.

“homology is part of algebraic topology”,
‘The theory of evolution animal species genes mutations’,
“the gannon-lee singularity tells us something about black holes”,
‘supercomputers are used to do very large simulations’,
‘clifford algebras and semigroup are not fields and rings’,
‘galaxies have millions of stars and some are quasars’,
‘deep learning is used to classify documents and images’,
‘surgery on compact manifolds of dimension 2 yields all possible embedded surfaces’

(In the experiments the documents are the titles of papers drawn from ArXiv.)

The first step of the workflow classifies each statement according to basic academic field: Physics, Math, Biology and Computer Science. (Obviously there more fields than this, but this covered most of the collection we used to train the classifiers.) Once a sentence is classified as to topic it is then passed to the second stage of the workflow where it is classified as to subcategory. For example if a sentence is determined to belong to Biology, the subcategories that are recognized include Neuro, Cell Behavior, Genomics, Evolution, Subcellular, Health-Tissues&Organs and Molecular Networks. Physics sub areas are Astro, General Relativity and Quantum Gravity, Condensed Matter, High Energy, Mathematical Physics, Quantum mechanics and educational physics. Math is very hard, so the categories are simple: algebra, topology, analysis and other.   The output from the DynamoDB table for this list of statements is shown in Figure 3 below.

sample-table-output

Figure 3. Output from the sample document in an AWS DynamoDB table. The “cmain predict” column is the output of the first workflow phase and the “cpredicted” column is the output of the second phase of the workflow.

The AWS Lambda details.

A Lambda function is a very simple program that responds to an event, does some processing and exits. There is a complete command line interface to Lambda, but AWS has a very nice portal interface to build Lambda functions in a variety of standard languages.   I found the web portal far superior to the command line because it gives you great debugging feedback and easy access to you function logs that are automatically generated each time your lambda function is invoked.

The example below is a very simple Python lambda function that waits for a document to arrive in a S3 storage bucket.   When a document is placed in the bucket, the “lambda_handler” function is automatically invoked. In this simple example the function does three things.   It grabs the name of the new S3 object and the bucket name. It then opens and reads the document (as text). If the document is not text, the system throws an error and the evidence is logged in the AWS CloudWatch log file for this function. In the last step, it saves the result in a DynamoDB table called “blanklambda”.

To make this work you need to assign the correct permissions policies to the lambda function. In this case we need access to S3, the DynamoDB and the basic Lambda execution role which includes permission to create the execution logs.

To tell the system which bucket to monitor you most go to the S3 bucket you want to monitor and add to the property called “Events”. Follow the instructions to a reference to your new Lambda function.

lambda-code

In our workflow example we used 5 lambda functions: a main topic classifier, and a classifier lambda function for each of the subtopics.   It is trivial to make one Lambda function create an event that will trigger another. We send each document as a string json document encoding our list of statements.

The call function is

Lam = boto3.client("lambda",region_name="us-east-1")
resp = Lam.invoke(FunctionName="bio-function", 
                  InvocationType="RequestResponse",
                  Payload=json.dumps(statementlist))

The “bio-function” Lambda code receives the payload as a text string and convert it back to a list.

The workflow is pictured in Figure 4 below.  When a document containing a list of statements lands in S3 it invokes an instance of the workflow. The main classifier lambda function invokes one instance each of the sub-classifiers and those four are all running concurrently.   As illustrated in Figure 5, when a batch of 20 documents files land in s3 as many as 100 Lambda instances are running in parallel.

workflow

Figure 4. When a document list file lands in a specified S3 bucket it triggers the main lambda function which determines the general topic of each individual statement in the document list. The entire document is sent to each of the subtopic specialist lambda function that classifies the them into subtopics and places the result in the DynamoDB table.

multiple-lambda

Figure 5. As multiple documents land in S3 new instances of the workflow are run in parallel. A batch of 20 documents arriving all at once will generate 5*20 concurrent Lambda invocations.

AWS Lambda’s Greatest Limitation

Lambda functions are intended to be small and execute quickly.   In fact, the default execution time limit is 3 seconds, but that is easily increased. What cannot be increased beyond a is the size of the package. The limit is 230Mbytes.   To import python libraries that are not included in their basic package you must add “layers”. These layers are completely analogous to the layers in the Docker Unified File System. In the case of Python a layer is a zipped file containing the libraries you need. Unfortunately, there is only one standard library layer available for python 3 and that includes numpy and scipy.   However, for our classification algorithms we need much more.   A grossly simplified (and far less accurate) version of our classifier requires only sklearn, but that is still a large package.

It takes a bit of work to build a special Lambda layer.   To create our own version of a Scikit learn library we turn to an excellent blog “How to create an AWS Lambda Python Layer” by Lucas.    We were able to do this and have a layer that also included numpy.   Unfortunately, we could not also include the model data, but we had the lambda instance dynamically load that data from S3. Because our model is simplified the load takes less than 1 second.

We tested this with 20 document files each containing 5 documents uniformly distributed over the 4 major topics. To understand the performance, we captured time stamps at the start and end of each instance of the main Lambda invocation.   Another interesting feature of AWS Lambda execution is that the amount of CPU resource assigned to each instance of a function invocation is governed by the amount of memory you allocate to it.   We tried our experiments with two different memory configurations: 380MB and 1GB.

sklearn_both-endssklearn-1G-data

Figure 6. Each horizontal line represents the execution of one Lambda workflow for one document.   Lines are ordered by start time from bottom to top. The figure on the top show the results when the lambda memory configuration was set to 380MB. The figure on the bottom shows the results with 1GB of memory allocated (and hence more cpu resource).

The results are shown in Figure 6. There are two interesting points to note.   First, the Lambda functions do not all start at the same time.   The system seems to see that there are many events that were generated by S3 and after a brief start it launches an additional set of workflow instances. We don’t know the exact scheduling mechanism used. The other thing to notice is that the increase in memory (and CPU resource made a profound difference.   While the total duration of each execution varied up to 50% between invocations the mean execution time for the large memory case was less than half that of the small memory case. In the 1GB case the system ran all 20 documents with a speed-up of 16 over the single execution case.

Having to resort to the simplified (and far less accurate) small classifier was a disappointment.   An alternative, which in many ways is more appropriate for scientific workflows, is to use the Lambda functions as a the coordination and step-trigger mechanism and have the lambda functions invoke remote services to do the real computation.   We ran instances of our full model as docker containers on the AWS LightSail platform and replaced the local invocations to the small classifier model with remote invocations to the full model as illustrated in Figure 7.

full-service

Figure 7.   Invoking the remote classifiers as web services.

Results in this configuration were heavily dependent on the scale-out of each of the classifier service. Figure 8 illustrate the behavior.

results-standard-oregon

Figure 8. Running 20 documents simultaneously pushed to S3 using remote webservice invocations for the classification step. Once again these are ordered from bottom to top by lambda start time.

As can be seen in the diagram the fastest time using a remote service for the full classifier was nearly three times faster than the small classifier running on the lambda function.   However, the requests to the service caused a bottleneck which slowed down others.   Our version of the service ran on two servers with a very primitive random scheduler for load balance.   A better design would involve many more servers dynamically allocated to meet the demand.

Conclusion

Serverless computing is a technology that can simplify the design of some scientific workflow problems. If you can define the workflow completely in terms of the occurrence of specific triggers and the graph of the execution is a simple tree, it is easy to setup a collection of function that will be triggered by the appropriate events. If the events don’t happen, the workflow is not run and you do not have to pay for servers to host it. (The example from the previous section was designed, debugged on AWS over a few weeks and run many times. The total bill was about $5.)

There are two downsides to using serverless as part of a workflow execution engine.

  1. The complexity of handling a workflow consisting of a complex non-tree DAG is large. A possible solution to this was described in Figure 2 and the accompanying text. The task of compiling a general CWL specification into such a workflow manager/listener is non-trivial because CWL specification can be very complex.
  2. A topic that was only briefly discussed here is how to get a Lambda function to execute the actual application code at each step.   In most workflow systems the work steps are command line invoked applications that take input files and produce output files. To invoke these from a serverless lambda function in the cloud requires that each worker application is rendered as a service that can be invoked from the lambda function. This can be done by containerizing the applications and deploying them in pods on Kubernetes or other microservice architecture as we illustrated in Figure 2.   In other scenarios the worker applications may be submitted to a scheduler and run on a batch system.

While we only looked at AWS Lambda here we will next consider Azure functions.   More on that later.

The source code for the lambda functions used here can be found at this GitHub repo

Note: there are various tricks to getting lambda to use larger ML libraries, but we didn’t go down that road. One good example is described in a blog by Sergei on Medium. Another look at doing ML on lambda is described by Alek Glikson in this 2018 article.