Explainable Deep Learning and Guiding Human Intuition with AI.

In July 2021 Alex Davies and a team from DeepMind, Oxford and the University of Sydney published a paper entitled “Advancing mathematics by guiding human intuition with AI”.  The paper addresses the question: how can machine learning be used to guide intuition in mathematical discovery?  The formal approach they take proceeds as follows.  Let Z be a collection of objects.  Suppose that for each instance z in Z we have two distinct mathematical representations of z:  X(z) and Y(z).  We can then ask: is there a mathematical function f : X -> Y such that, without knowing z,  f(X(z)) = Y(z)?  Suppose the mathematician builds a machine learning model trained on many instances of X(z) and Y(z).  That model can be thought of as a function f^ : X -> Y such that f^(X(z)) ~ Y(z).  The question then becomes: can we use properties of that model to give us clues on how to construct the true f?

A really simple example that the authors give is to let Z be the set of convex polyhedra (cube, tetrahedron, octahedron, etc.).  If we let X(z) be the tuple consisting of the number of edges, the number of vertices, the volume and the surface area of z, and let Y(z) be the number of faces,  then, without knowing z, the question becomes: is there a function f: R^4 -> R such that f(X(z)) = Y(z)?   Euler answered this question in the affirmative long ago:  he proved that

f(edges, vertices, volume, surface area)  =  edges – vertices + 2 = faces.
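A three-line check of the formula on solids we can count by hand:

```python
# Euler's formula f = edges - vertices + 2, checked on three convex polyhedra.
# Each entry is (edges, vertices, faces).
solids = {"tetrahedron": (6, 4, 4), "cube": (12, 8, 6), "octahedron": (12, 6, 8)}
for name, (e, v, f) in solids.items():
    print(name, e - v + 2 == f)   # True for all three
```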

Now suppose we did not have Euler to help us.   Given a big table where each row corresponds to (edges, vertices, volume, surface area, faces) for some convex polyhedron, we can select a subset of rows as a training set and try to build a model to predict faces given the other values. Should our AI model prove highly accurate on the test set consisting of the unselected rows, that may lead us to suspect that such a function exists.   In a statistical sense, the learned model is such a function, but it may not be exact and, worse, by itself it may not lead us to a formula as satisfying as Euler’s.

This leads us to Explainable AI.   This is a topic that has grown in importance over the last decade as machine learning has been making more and more decisions “on our behalf”, such as which movies we should rent and which social media articles we should read.  We wonder, “Why did the recommender conclude that I would like that movie?”  This is now a big area of research (the Wikipedia article on Explainable AI has a substantial bibliography).  One outcome of this work has been a set of methods that can be applied to trained models to help us understand which parts of the data are most critical to the model’s decision making.    Davies and his team are interested in identifying the most “salient” features of X(z) in determining Y(z) and using this knowledge to inspire the mathematician’s intuition in the search for f.   We return to their mathematical examples later, but first let’s look closer at the concept of salience.

Salience and Integrated Gradients

Our goal is to understand how important each feature of the input to a neural network is to the outcome. The features that are most important are often referred to as “salient” features.  In a very nice paper from 2017, Axiomatic Attribution for Deep Networks, Sundararajan, Taly and Yan frame this as the question of attribution.  When considering the attribution of input features to the output of a DNN, they propose two reasonable axioms.   The first is Sensitivity: if changing a feature of the input changes the output of the network, then that feature should have a non-zero attribution.  In other words, it is salient. Represent the network as a function F: R^n -> [0,1] for n-dimensional data.  To make the discussion more precise we need to pick a baseline input x’ that represents the network in an inactivated state: F(x’) = 0.    For example, in a vision system, an image that is all black will do.  We are interested in finding the features of x that are critical when F(x) is near 1.

The second axiom is more subtle. Implementation Invariance:  If two neural networks are equivalent (i.e. they give the same results for the same input), the attribution of a feature should be the same for both.

The simplest form of salience computation is to look at the gradient of the network.    For each i in 1 .. n,  we can look at the components of the gradient and define

salience_i(x) = ∂F(x)/∂x_i

This definition satisfies implementation invariance, but unfortunately it fails the sensitivity test.  The problem is that the value of F(x) for some input x may be 1 while the gradient is 0 at that point. We will show an example of this below.  On the other hand, if we think about “increasing” the input from the baseline x’ to x, there should be a transition of the gradient from 0 to non-zero as F(x) increases towards 1.  That motivates the definition of Integrated Gradients: we are going to add up the values of the gradient along a path from the baseline to a value that causes the network to change.

Let γ = (γ1, . . . , γn) : [0, 1] → R^n be a smooth function specifying a path in R^n from the baseline x’ to the input x, i.e., γ(0) = x’ and γ(1) = x.   It turns out that it doesn’t matter which path we take, because we will be approximating a path integral, and by the fundamental theorem of calculus applied to path integrals we have

∫_γ ∇F(γ(t)) · dγ(t)  =  F(x) − F(x’).

Expanding the integral out in terms of the components of the gradient,

F(x) − F(x’)  =  Σ_{i=1..n} ∫_0^1 (∂F/∂x_i)(γ(t)) (dγ_i/dt) dt.

Now, picking the path that represents the straight line between x’ and x as

γ(t) = x’ + t(x − x’),  so that dγ_i/dt = x_i − x’_i,

Substituting this in the right-hand side and simplifying, we can set the attribution for the ith component as

IntegratedGrads_i(x)  =  (x_i − x’_i) ∫_0^1 (∂F/∂x_i)(x’ + t(x − x’)) dt.

To compute the attribution of feature i for input x, we need only evaluate the gradient at several points along the path to approximate the integral.   In the examples below we show how salience in this form and others may be used to shed light on the problem we are trying to understand.
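Concretely, the straight-line approximation looks like this (a sketch on a toy differentiable function F of our own choosing, not a neural network):

```python
import numpy as np

# Toy function and its analytic gradient (our own choices, for illustration).
def F(x):
    return x[0] ** 2 + 3.0 * x[1]

def grad_F(x):
    return np.array([2.0 * x[0], 3.0])

def integrated_gradients(x, baseline, steps=100):
    # Riemann-sum (midpoint) approximation of the path integral along the
    # straight line from the baseline x' to the input x.
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_F(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
attr = integrated_gradients(x, baseline)
# Completeness: the attributions sum to F(x) - F(baseline).
print(attr, attr.sum(), F(x) - F(baseline))
```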

Digression: Computing Salience and Integrated Gradients using Torch

Readers not interested in how to do the computation of salience in PyTorch can skip this section and go on to the next section on Mathematical Intuition and Knots.

A team from Facebook AI introduced Captum in 2019 as a library designed to compute many types of salience models.  It is designed to work with the PyTorch deep learning tools.  To illustrate it, we will look at a simple example where simple gradient salience breaks down yet integrated gradients works fine.  The complete details of this example are in this notebook on GitHub.

We start with the following really dumb neural network consisting of one relu operator on two input parameters.
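The notebook’s code is not reproduced here, so the following is our guess at a network of the shape described: a single relu applied to a linear combination in which input1 carries ten times the weight of input2 (the exact coefficients in the notebook may differ).

```python
import torch

# A stand-in for the notebook's toy model (coefficients are our assumption):
# input1 has ten times the weight of input2, under a single relu.
def toy_model(input1, input2):
    return 1.0 - torch.relu(1.0 - 10.0 * input1 - input2)

# At the baseline (0, 0) the model is inactive ...
print(toy_model(torch.tensor(0.0), torch.tensor(0.0)).item())   # 0.0
# ... and for large positive inputs it saturates at 1.
print(toy_model(torch.tensor(2.0), torch.tensor(2.0)).item())   # 1.0
```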

A quick look at this suggests that the most salient parameter is input1 (because it has 10 times the influence on the result that input2 has).   Of course, the relu operator tells us that the result is flat for large positive values of input1 and input2.  We see that as follows.

We can directly compute the gradient using basic automatic differentiation in Torch.  When we evaluate the partial derivatives for these values of input1 and input2 we see they are zero.
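Using a stand-in model of the kind just described (our assumption about the network’s coefficients), basic autograd shows the zero gradient at a saturated point:

```python
import torch

# Evaluate the model where the relu is saturated: the output is 1 but both
# partial derivatives are zero.
input1 = torch.tensor(2.0, requires_grad=True)
input2 = torch.tensor(2.0, requires_grad=True)
out = 1.0 - torch.relu(1.0 - 10.0 * input1 - input2)   # our stand-in model
out.backward()
print(out.item(), input1.grad.item(), input2.grad.item())   # 1.0 0.0 0.0
```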

From Captum we can grab the IntegratedGradients and Saliency operators and apply them as follows.

The IntegratedGradients approximation shows that indeed input1 has 10 times the attribution strength of input2, and the sum of these is m(input1, input2) plus an error term.  As we already expect, the simple gradient method of computing salience fails here.

Of course this extreme example does not mean simple gradient salience will fail all the time.   We will return to Captum (but without the details) in another example later in this article.

Mathematical Intuition and Knots

Returning to mathematics, the Davies team considered two problems.   The methodology they used is described in the figure below which roughly corresponds to our discussion in the introduction.

Figure 1.   Experimental methodology used by Davies, et al. (Figure 1 from “Advancing mathematics by guiding human intuition with AI”, Nature, Vol 600, 2 December 2021)

They began with a conjecture about knot theory.   Specifically, they were interested in the conjecture that the geometric invariants of knots (playing the role of X(z) in the scenario above)  could determine some of the algebraic invariants (as Y(z)).  See Figure 2 below.

Figure 2.   Geometric and algebraic invariants of hyperbolic knots. (Figure 2 from “Advancing mathematics by guiding human intuition with AI”, Nature, Vol 600, 2 December 2021)

The authors of the paper had access to 18 geometric invariants for 243,000 knots and built a custom deep learning stack to try to identify the salient invariants that determine the signature of the knot (a snapshot of the data is shown below).   Rather than describing their model, we decided to apply one generated by the AutoML tool provided by the Azure ML Studio.

Figure 3.  Snapshot of the Knot invariant data rendered as a python pandas dataframe.

We used a Jupyter notebook to interact with the remote instances of the Azure ML Studio.  The notebook is on GitHub here: dbgannon/math: Notebooks from Explainable Deep Learning and Guiding Human Intuition with AI (github.com).  The notebook also contains links to the dataset.

Because we described Azure ML Studio in a previous post, we will not go into it in detail here.  We formulated the computation as a straightforward regression problem.    AutoML completed the model selection and training, and it also computed the factor salience with the results shown below.

Figure 4.  Salience of features computed by Azure AutoML using their Machine Learning services.  (To see the salience factors one has to look at the output details on the ML Studio page for the computation.)

The three top factors were the real and imaginary components of the meridinal_translation, and the longitudinal translation.   These are the same top factors that were revealed in the authors’ study, but in a different order.

Based on this hint the authors proposed a conjecture: for a hyperbolic knot K, define slope(K) to be the real part of the fraction longitudinal translation/meridinal_translation.  Then there exist constants c1 and c2 such that

In Figure 5 we show scatter plots of the predicted signature versus the real signature for each of the knots in the test suite.   As you can see, the predicted signatures form a reasonably tight band around the diagonal (the true signatures). The mean squared error of the formula slope(K)/2 from the true signature was 0.86, and the mean squared error of the model predictions was 0.37.   This suggests that the conjecture may need some small correction terms, but that is up to the mathematician to prove.  Otherwise, we suspect the bounds in the inequality are reasonable.

Figure 5.  On the left is a scatter plot of the slope(K)/2 computed from the geometric data vs the signature.  On the right is the signature predicted by the model vs the signature.  

Graph Neural Nets and Salient Subgraphs.

The second problem that the team looked at involved representation theory.    In this case they are interested in pairs of elements of the symmetric group Sn, represented as permutations.   For example, in S5 an instance z might be {(03214), (34201)}.  An interesting question to study is how to transform the first permutation into the second by simple 2-element exchanges (transpositions) such as 03214->13204->31204->34201.  In fact, there are many ways to do this, and we can build a directed graph showing the various paths of exchanges that get from the first to the second.   This graph is called the unlabeled Bruhat interval, and it is their X(z).    The Y(z) is the Kazhdan–Lusztig (KL) polynomial of the permutation pair.   To go any deeper into this topic is way beyond the scope of this article (and beyond the knowledge of this author!). Rather, we shall jump to their conclusion and then consider a different problem related to salient subgraphs.   By looking at salient subgraphs of the Bruhat interval graph for a pair in Sn, they discovered that it contained a hypercube and a subgraph isomorphic to an interval in Sn−1.  This led to a formula for computing the KL polynomial.  A very hard problem solved!

An important observation the authors used in designing the neural net model was that information conveyed along the Bruhat interval is similar to message passing in graph neural networks.  These GNNs have become powerful tools for many problems.   We will use a different example to illustrate the use of salience in understanding graph structures.    The example is one of the demo cases for the excellent PyTorch Geometric library; more specifically, it is one of their example Colab Notebooks and Video Tutorials — pytorch_geometric 2.0.4 documentation (pytorch-geometric.readthedocs.io).  The example illustrates the use of graph neural networks for classification of molecular structures for use as drug candidates.

Mutagenicity is a property of a chemical compound that hampers its potential to become a safe drug.  Specifically, there are often substructures of a compound, called toxicophores, that can interact with proteins or DNA and lead to changes in normal cellular biochemistry.  An article from J. Med. Chem. 2005, 48, 1, 312–320, describes a collection of 4337 compounds (2401 mutagens and 1936 nonmutagens).  These are included in the TUDataset collection and used here.

The PyTorch Geometric example uses the Captum library (which we illustrated above) to identify the salient substructures that are likely toxicophores.  We will not go into great detail about the notebook because it is in their Colab space, but if you want to run this on your own machine we have put a copy in our GitHub folder for this project.

The TUDataset data set encodes molecules, such as the graph below, as an object of the form

Data( edge_index=[2, 26], x=[13, 14], y=[1] )

In this object x represents the 13 vertices (atoms).  There are 14 properties associated with each atom, but they are just ‘one-hot’ encodings of the names of the 14 possible elements in the dataset: ‘C’, ‘O’, ‘Cl’, ‘H’, ‘N’, ‘F’, ’Br’, ‘S’, ‘P’, ‘I’, ‘Na’, ‘K’, ‘Li’, ‘Ca’.  The edge index represents the 26 edges, each identified by the indices of its 2 end atoms.   The value y is 0 if the molecule is known to be mutagenic and 1 otherwise.

We must first train a graph neural network that will learn to recognize the mutagenic molecules.  Once we have that trained network, we can apply Captum’s IntegratedGradient to identify the salient subgraphs that most implicate the whole graph as mutagenic. 

The neural network is a five-layer graph convolutional network.  A convolutional graph layer works by adding together information from each graph neighbor of each node and multiplying it by a trainable matrix.   More specifically, assume that each node w has a vector x_w^(l) of values at level l.  We then compute a new vector x_v^(l+1) of values for node v at level l+1 by

x_v^(l+1)  =  W^(l+1) Σ_{w in N(v) ∪ {v}} (1/c_{w,v}) x_w^(l)

where W^(l+1) denotes a trainable weight matrix of shape [num_outputs, num_inputs] and c_{w,v} refers to a fixed normalization coefficient for each edge.  In our case c_{w,v} is the number of edges coming into node v divided by the weight of the edge.   Our network is described below.

The forward method uses the x node property and the edge index map to guide the convolutional steps.   Note that we will use batches of molecules in training, and the parameter batch is how we distinguish one molecule from another, so that the vector x is finally a pair of values for each element of the batch.  With edge_weight set to None, the weight of each edge is a constant 1.0.

The training step is a totally standard PyTorch training loop.  With the dim variable set to 64 and 200 epochs later, the training accuracy is 96% and the test accuracy is 84%.   To compute the salient subgraphs using integrated gradients, we have to compute the partial derivative of the model with respect to each of the edge weights.   To do so, we replace the edge weight with a variable tensor whose value is 1.0.

Looking at a sample from the set of mutagenic molecules, we can view the integrated gradients for each edge.   In Figure 6 below, on the left, we show a sample with the edges labeled by their IG values.  To make this easier to see, the example code replaces the IG values with a color code in which large IG values have thicker, darker lines.

The article from J. Med. Chem notes that the occurrence of NO2 is often a sign of mutagenicity, so we are not surprised to see it in this example.   In Figure 7, we show several other examples that illustrate different salient subgraphs. 

Figure 6.   The sample from the mutagenic set with the IntegratedGradients values labeling the edges.  On the right we have a version with dark, heavy lines representing large IG values.

Figure 7.  Four more samples.  The subgraph NO2 in the upper left is clearly salient, but in the three other examples we see COH bonds showing salient subgraphs.  However, these also occur in the non-mutagenic set, so the significance is not clear to us non-experts.


The issue of explainability of ML methods is clearly important when we let ML make decisions about people’s lives.    Salience analysis lies at the heart of the contribution to mathematical insight described above.   We have attempted to illustrate where it can be used to help us learn about the features in training data that drive classification systems to draw conclusions.  However, it takes the inspiration of a human expert to understand how those features are fundamentally related to the outcome.

ML methods are having a far-reaching impact throughout scientific endeavors.  Deep learning has become a new tool used in particle physics research to improve the search for signals in data, in astronomy, where generative methods can create galaxy models for dark energy experiments, and in biochemistry, where RoseTTAFold and DeepMind’s AlphaFold have used deep learning models to revolutionize protein folding and protein-protein interaction.  The models constructed in these cases are composed from well-understood components such as GANs and Transformers, where issues of explainability have more to do with optimization of the model and its resource usage.  We will return to that topic in a future study.

A Look at Cloud-based Automated Machine Learning Services


AI is the hottest topic in the tech industry.   While it is unclear if this is a passing infatuation or a fundamental shift in the IT industry, it certainly has captured the attention of many.   Much of the writing in the popular press about AI involves wild predictions or dire warnings.  However, for enterprises and researchers the immediate impact of this evolving technology concerns the more prosaic subset of AI known as Machine Learning.    The reason for this is easy to see.   Machine learning holds the promise of optimizing business and research practices.    The applications of ML range from improving the interaction between an enterprise and its clients/customers (How can we better understand our clients’ needs?), to speeding up advanced R&D discovery (How do we improve the efficiency of the search for solutions?).

Unfortunately, it is not that easy to deploy the latest ML methods without access to experts who understand the technology and how to best apply it.  The successful application of machine learning methods is notoriously difficult.   If one has a data collection or sensor array that may be useful for training an AI model, the challenge is how to clean and condition that data so that it can be used effectively.  The goal is to build and deploy a model that can be used to predict behavior or spot anomalies.   This may involve testing a dozen candidate architectures over a large space of tuning hyperparameters.  The best method may be a hybrid model derived from standard approaches.   One such hybrid is ensemble learning, in which many models, such as neural networks or decision trees, are trained in parallel to solve the same problem.  Their predictions are combined linearly when classifying new instances.  Another approach (called stacking) is to use the results of the sub-models as input to a second-level model which selects the combination dynamically.  It is also possible to use AI methods to simplify labor-intensive tasks such as collecting the best features from the input data tables (called feature engineering) for model building.    In fact, the process of building the entire data pipeline and workflow to train a good model is itself a task well suited to AI optimization.  The result is automated machine learning.   The cloud vendors now provide expert autoML services that can lead the user to the construction of solid and reliable machine learning solutions.

Work on autoML has been going on for a while.  In 2013, Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown introduced Auto-WEKA, and many others followed.  In 2019, the AutoML research groups led by Frank Hutter at the University of Freiburg and Prof. Marius Lindauer at the Leibniz University of Hannover published Automated Machine Learning: Methods, Systems, Challenges (which can be accessed on the AutoML website).

For an amateur looking to use an autoML system, the first step is to identify the problem that must be solved.  These systems support a surprising number of capabilities. For example, one may be interested in image-related problems like image identification or object detection.   Another area is text analysis.  It may also be regression or prediction from streaming data.  One of the biggest challenges involves building models that can handle tabular data which may contain not only columns of numbers but also images and text.    All of these are possible with the available autoML systems.

While all autoML systems are different in details of use, the basic idea is they automate a pipeline like the one illustrated in Figure 1 below.

Figure 1.   Basic AutoML pipeline workflow to generate an optimal model based on the data available.

Automating the model test and evaluation is a process that involves exploring the search space of model combinations and parameters.   This search is a non-trivial process that involves intelligent pruning of combinations that seem likely to be poor performers.   As we shall show below, the autoML system may test dozens of candidates before ranking them and picking the best.

Amazon AWS, Microsoft Azure, Google Cloud and IBM Cloud all provide automated machine learning services to their customers.  In the following paragraphs we will look at two of these:   Amazon’s AutoGluon, which is both open source and part of their SageMaker service,   and the Microsoft Azure AutoML service, which is part of the Azure Machine Learning Studio.   We will also provide a very brief look at Google’s Vertex AI cloud service.   We will not provide an in-depth analysis of these services, but give a brief overview and an example from each.

AWS and AutoGluon

AutoGluon was developed by a team at Amazon Web Services and has also been released as open source.  Consequently, it can be used as part of their SageMaker service or completely separately.  An interesting tutorial on AutoGluon is here.   While the range of problems AutoGluon can be applied to is extremely broad, we will illustrate it on only a tiny classical problem:  regression on tabular input.

The table we use is from the Kaggle bike share challenge.   The input is a pandas data frame with records of bike shares per day for about two years.  For each day, there are indicators to say whether the day is a holiday and a workday.  There is weather information consisting of temperature, humidity and windspeed.  The last column is the “count” of the number of rentals for that day.     The first few rows are shown below in Figure 2.    Our experiment differs from the Kaggle competition in that we use a small sample (27%) of the data to train a regression model and the remainder for test, so that we can easily illustrate the fit to the true data.
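Our sampling step amounts to the following (a sketch; the CSV file name is our assumption):

```python
import pandas as pd

# Read the bike-share table and split it into a 27% training sample and a
# held-out test remainder (the path is an assumption).
def split_bike_data(path, frac=0.27, seed=42):
    df = pd.read_csv(path)
    train = df.sample(frac=frac, random_state=seed)
    test = df.drop(train.index)
    return train, test
```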

Figure 2.  Sample of the bike rental data used in this and the following example.

While AutoGluon, we believe, can be deployed on Windows, we will use Linux because it deploys easily there.   We used Google Colab and Ubuntu 18.04 deployed on Windows 11.  In both cases the installation from a Jupyter notebook was very easy and went as follows.  First we need to install the packages.

The full notebook for this experiment is here and we encourage the reader to follow along as we will only sketch the details below. 

As can be seen from the data in Figure 2, the  “count” number jumps wildly from day to day.   Plotting the count vs time we can see this clearly.

A more informative way to look at this is a “weekly” average shown below.

The training data that is available is a random selection of about 70% of the complete dataset, so this is not a perfect weekly average, but it is based on seven consecutive days of the data.

Our goal is to compute the regression model based on a small training sample and then use the model to predict the “count” values for the test data.   We can then compare these with the actual test data “count” values.    Invoking AutoGluon is now remarkably easy.

We have given this a time limit of 20 minutes, and the predictor finishes well before that.   We can now ask to see how well the different models did (Figure 3) and also ask for the best.

Running the predictor on our test data is also easy.   We first drop the “count” column from the test data and invoke the predict method on the predictor.

Figure 3.    The leaderboard shows the performance of the various methods tested

One trivial graphical way to illustrate the fit of the prediction to the actual data is a simple scatter plot.

As should be clear to the reader, this is far from perfect.  Another simple visualization is to plot the two “count” values along the time axis.  As we did above, the picture is clearer if we plot a smoothed average; in this case each point is an average of the following 100 points.  The result, which shows the true data in blue over the prediction in orange, does indicate that the model captures the qualitative trends.
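The smoothing itself is one line of pandas (the function name here is ours):

```python
import pandas as pd

# Each plotted point is the mean of the following `window` samples:
# rolling() gives a trailing mean, so shift it back to make it leading.
def smooth(series, window=100):
    return series.rolling(window).mean().shift(-(window - 1))
```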

The mean squared error is 148.   Note: we also tried training with a larger fraction of the data and the result was similar. 

Azure Automated Machine Learning

The Azure AutoML system is also designed to support classification, regression, forecasting and computer vision.   There are two basic modes in which Azure AutoML works:  use the ML Studio on Azure for the entire experience,  or use the Python SDK, running in Jupyter on your laptop with remote execution in the Azure ML Studio.   (In simple cases you can run everything on your laptop, but having the studio manage a cluster for you in the background is a big win.) We will use the Azure Studio for this example.  We will run a Jupyter notebook locally and connect to the studio remotely.  To do so we must first install the Python libraries.  Starting with Anaconda on Windows 10 or 11, it can be challenging to find libraries that all work together.   The following combination will work with our example.

conda create -n azureml python=3.6.13

conda activate azureml

pip install azureml-train-automl-client

pip install numpy==1.18

pip install azureml-train-automl-runtime==1.35.1

pip install xgboost==0.90

pip install jupyter

pip install pandas

pip install matplotlib

pip install azureml.widgets

jupyter notebook

Next, clone Azure/MachineLearningNotebooks from GitHub and grab the notebook configuration.ipynb.   If you don’t have an Azure subscription, you can create a new free one.  Running the configuration successfully in your Jupyter notebook will set up your connection to the Azure ML Studio.

The example we use is a standard regression demo from the AzureML collection.  In order to better illustrate the results, we use the same bike-share demand data from the Kaggle competition used above, where we sample both the training and test data from the official test data.  The training data we use is 27% of the total and the remainder is used for test.  As we did with the AutoGluon example, we delete two columns: “registered” and “casual”.

You can see the entire notebook and results here:

azure-automl/bike-regression-drop-.3.ipynb at main · dbgannon/azure-automl (github.com)

If you want to understand the details, you will need to look at the notebook; in the following we only provide a sketch of the process and results.

We are going to rely on autoML to do the entire search for the best model, but we do need to give it some basic configuration parameters as shown below.

We have given it a much longer execution time than is needed.   One line is then used to send the job to the Azure ML Studio.
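The configuration and submission look roughly like this (a sketch: ws, train_dataset and compute_target come from your own workspace setup via the configuration notebook, and the parameter values shown are ours):

```python
from azureml.core import Experiment
from azureml.train.automl import AutoMLConfig

# Basic configuration handed to autoML (ws, train_dataset and compute_target
# are assumed to exist in your workspace).
automl_config = AutoMLConfig(task="regression",
                             training_data=train_dataset,
                             label_column_name="count",
                             primary_metric="normalized_root_mean_squared_error",
                             compute_target=compute_target,
                             experiment_timeout_hours=1.0,
                             n_cross_validations=5)

# One line sends the job to the Azure ML Studio.
remote_run = Experiment(ws, "bike-regression").submit(automl_config)
remote_run.wait_for_completion(show_output=True)
```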

After waiting for the experiment to run, we see the results of the search

As can be seen, the search progressed through various methods and combinations, with a stack ensemble finally providing the best results.

We can now use the trained model to make our predictions as follows.  We begin by extracting the fitted model.  We can then drop the “count” column from the test file and feed it to the model.   The result can be plotted as a scatter plot.
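In code, the extraction and prediction amount to a couple of lines (a sketch, assuming remote_run is the completed autoML run and test is the held-out dataframe):

```python
# Pull the best child run and its fitted model from the completed autoML
# run, then score the held-out data (remote_run and test are assumed).
best_run, fitted_model = remote_run.get_output()
y_pred = fitted_model.predict(test.drop(columns=["count"]))
```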

As before we can now use a simple visualization based on a sliding window average of 100 points to “smooth” the data and show the results of the true values against the prediction. 

As can be seen, the fit is pretty good.  Of course, this is not a rigorous statistical analysis, but it does show that the model captures the trends of the data fairly well.

In this case the mean squared error was 49.

Google Vertex AI

Google introduced their autoML service, called Vertex AI, in 2020.    Like AutoGluon and Azure AutoML, it has a Python binding, in which a function aiplatform.TabularDataset.create() can be used to initiate a training job in a manner similar to AutoMLConfig() in the Azure API.  Rather than use that, we decided to use the full Vertex AI cloud service on the same dataset and regression problem described above.

The first step was to upload our dataset, here called “untitled_1637280702451”.   The Vertex AI system steps us through the process in a very deliberate and simple manner.   It begins by asking whether we want to do regression (the other choice for this dataset was classification).

The next step is to identify the target column and the columns to include in the training.  We used the default data split of 80% for training, 10% for validation and 10% for testing.

After that, there is a button to launch the training.   We gave it one hour; it took two hours and produced a model.

Once complete, we can deploy the model in a container and attach an endpoint.  The root mean squared error of 127 is in line with the AutoGluon result and larger than the Azure AutoML value.   One problem with the graphical interactive view is that we could not see the calculation, so we cannot be sure the Vertex AI RMSE is computed the same way as the RMSE for the others.


Among the three autoML systems used here, the easiest to deploy was Vertex AI, because we only used the graphical interface on the Google Cloud.   AutoGluon was trivial to deploy on Google Colab and on a local Ubuntu installation.   Azure AutoML was installable on Windows 11, but it took some effort to find the right combination of libraries and Python versions.   While we did not study the performance of the Vertex AI model, the performance of the Azure AutoML model was quite good.

As is likely obvious to the reader, we did not push these systems to produce the best results; our goal was to see what was easy to do. Consequently, this brief evaluation of the three autoML offerings does not do justice to any of them.   All three have capabilities that go well beyond simple regression:  they can handle streaming data, image classification and recognition, as well as text analysis and prediction.  If time permits, we will follow up this article with more interesting examples.

IEEE Symposium on Cloud HPC

On September 7th, 8th and 9th, we held the IEEE Services 2021 – IEEE International Symposium on Cloud HPC (CloudHPC) (computer.org) as part of IEEE CLOUD 2021 (computer.org).   The program consisted of a plenary panel and 8 sessions with three speakers in each.   Each session was recorded, and in this note we describe four of these sessions. The videos for two sessions are still unavailable, but we will update this document when we can access them. The other organizers of this event were Rong N. Chang, Geoffrey Fox, James Sexton, Christoph Hagleitner, Bruce D’Amora, Ian Foster and Andrew Lumsdaine. In the paragraphs below we provide links to the videos and brief thumbnail abstracts to introduce the talks.

The first session was a plenary panel that was part of the IEEE Cloud 2021 conference.


Katherine Yelick, UC Berkeley and Lawrence Berkeley National Laboratory
Ian Foster, Argonne National Laboratory, University of Chicago
Geoffrey Fox, University of Virginia
Kate Keahey, Argonne National Laboratory, University of Chicago

This group of panelists has been involved in every aspect of HPC and HPC in the cloud. Kathy Yelick has tons of experience, ranging from managing an HPC center to designing programming languages for parallel computing and, now, being a partner in running CloudBank, a major NSF program to help with cloud access. Ian Foster was a pioneer who conceived of the Grid, a network of supercomputer services that in many ways foreshadowed the cloud concept, and he continues to lead research in HPC and the cloud.  Geoffrey Fox has been in the HPC business from the early days of MPI, and he is now doing groundbreaking work using cloud-native middleware with MPI to design new deep learning paradigms for science.   Kate Keahey is a leader in high performance distributed computing; she designed one of the first open-source cloud platforms and now runs a major cloud devoted to academic research for NSF.

The link to the panel discussion is SERVICES Plenary Panel 1: Cloud HPC – Exploring Growing Synergy btw Cloud & HPC – YouTube.

The overriding theme of the discussion was an exploration of how cloud HPC is, and will be, different from traditional supercomputer-based HPC, and where the opportunities for exploiting the strengths of both can advance research and science.

Ian started the discussion by making the case that there are some fundamental differences between the cloud and traditional HPC systems.   Most significant of these is that the cloud makes it possible to deliver an ecosystem of on-demand services.  Among these services are tools to interactively explore large data collections and services that enable users to easily build machine learning solutions for difficult data analysis problems. HPC, on the other hand, is about providing the most powerful computing platform for very large scientific applications.  Kathy observed that what the cloud lacks in high-end power, it makes up for in resilience. This resilience is critical for teaching, such as when student assignments require just-in-time access to resources. Kate made the additional point that there are classes of HPC problems where the interactivity clouds provide can be very important for exploring new areas where existing HPC application codes are not yet available.  Geoffrey made the point that access to HPC systems capable of doing some of the most advanced neural net training experiments is out of reach of most academics. That leaves clouds, but a national funding model for research is clearly needed. Kathy noted that the new NSF CloudBank provides a way for universities to access commercial cloud platforms with NSF and other funding. Foster argued that the future challenge is to figure out how to combine the capabilities of both supercomputer centers and clouds for the next generation of scientific endeavors. The panel also addressed a number of important issues including economics, advances in technology and scalability.  It was an extremely lively and interesting discussion.  Check it out.

Session 2.  HPCI in Biology & Medicine in the Cloud  (Chair: Dennis Gannon)

There are three talks in this session.

  • Computational Biology at the Exascale
    Katherine Yelick, UC Berkeley and Lawrence Berkeley National Laboratory
  • HySec-Flow: Privacy-Preserving Genomic Computing with SGX-based Big-Data Analytics Framework
    Judy Fox, Professor, University of Virginia
  • An automated self-service multi-cloud HPC platform applied to the simulation of cardiac valve disease with machine learning
    Wolfgang Gentzsch, UberCloud, Founder & President

Unfortunately, the recording of this session on YouTube is flawed (my fault).   Consequently, I have collected the videos together elsewhere.   A summary of the session and links to the individual talks are here: Talks from the first IEEE Symposium on Cloud & HPC | The eScience Cloud (esciencegroup.com)

Session 3. Using HPC to Enable AI at Scale (Chair: Dennis Gannon)

Session 3 included three talks that addressed how the private sector was approaching Cloud HPC to build new tools to solve critical global problems.  The session recording is here: CloudHPC Symposium – Session 3 – YouTube.  The talks were

  • Grand Challenges for Humanity: Cloud Scale Impact and Opportunities
    Debra Goldfarb, Amazon, Director HPC Products & Strategy
  • Enabling AI at scale on Azure
    Prabhat Ram, Microsoft, Azure HPC
  • Benchmarking for AI for Science in the Cloud: Challenges and Opportunities
    Jeyan Thiyagalingam, STFC, UK, Head of SciML Group

Debra Goldfarb passionately addressed the importance of bringing HPC and the cloud to bear on many important global challenges.   Chief among these is the health crisis brought on by viral pandemics.   She went on to describe work done in collaboration with the research community on providing the computation and data resources needed.  Of particular importance is the fact that the cloud can deliver scale that goes beyond single data centers.   One of the exciting outcomes was the work AWS did with the COVID-19 HPC Consortium to provide the resources that let Moderna deliver their mRNA vaccine in an incredibly short time.   Debra also discussed the work on building the national strategic computing reserve, which will play an important role in solving future problems.

Prabhat Ram, who recently moved from the DOE’s LBNL facility to Microsoft to build a team to address AI and HPC on Azure, provided an in-depth overview of the current state of Azure HPC cloud capabilities.   He pointed out that for scientific computing problems of modest scale it is easy to get good speed-up on batch jobs with up to 128 cores.  Focusing on AI workloads, Prabhat described their impressive performance on the HPL-AI benchmark.  Microsoft now has a deep collaboration with OpenAI and is building on the results from a few years ago with GPT-3.   The results now demonstrate that the cloud can be classified as leadership-class computing.

Jeyan Thiyagalingam addressed the problem of designing benchmarks for AI and machine learning for science applications.   This is important for many reasons.  We have hundreds of different ML models, and the applications and platforms differ widely.   Having a reasonable set of benchmarks will enable us to understand the applications better and provide a guide to tell us when our solutions are improving.   Developing these benchmarks is not easy.  One of the big problems is the size of realistic data sets for realistic applications. If a data set requires 100 terabytes, how do we make it available?  The answer, it would seem, is the cloud.  However, Thiyagalingam observes that there are still many challenges to making that work in a practical way.   He makes a very strong case that the science community needs help from the cloud providers and other institutions to make this possible.

Session 4. Applications of Cloud Native Technology to HPC in the Cloud (Chair: Christoph Hagleitner)

Session 4 focused more on the Cloud software tools that are important for science.   There were three talks and the recorded video is here: CloudHPC Symposium – Session 4 – YouTube.

  • Serverless Supercomputing: High Performance Function as a Service
    Kyle Chard, Professor, University of Chicago
  • Minding the Gap: Navigating the transition from traditional HPC to cloud native development
    Bruce D’Amora, IBM Research
  • Composable Systems: An Early Application Experience
    Ilkay Altintas, SDSC, Chief Data Science Officer

Kyle’s talk was about how the function-as-a-service (FaaS) model can be applied to large scale science.   In the talk he points out that the cloud computing industry has provided a number of important ideas, some of which have been adopted by HPC (virtualization, containers, etc.).  He discussed FaaS and introduced the FuncX system that they have built. (We have described FuncX elsewhere in this blog.)  He also described a recent application of FuncX to drug discovery pipelines for COVID-19 therapeutics.

Kubernetes was developed by Google and released to the open-source community, where it has become a cloud standard.  Bruce D’Amora considers the important issue of using Kubernetes in large scale scientific environments.  More specifically, he described how they are using it as a control plane for bare-metal systems with high performance networks.

Ilkay Altintas discussed a very ambitious project involving dynamic composability of computational services, AI and distributed real-time data integration.   This work goes well beyond static service integration into workflow.   The applications she discussed involve challenges such as the monitoring, measurement and prediction of the spread of wildfires from rapidly changing input supplied by widely distributed sensor networks.   Large teams of participants need to be interacting with the system, so the workflow (or teamflow) requires dynamic scaling, on-demand interactive access and performance analysis.  The AI involved goes beyond the optimization of simulations to the optimization of a rapidly changing configuration of services and data streams.   The complexity of this task and its solution is extraordinary.

Session 5.  Distributed Computing Issues for HPC in the Cloud (Chair: Geoffrey Fox)

Session 5 contains 3 papers and the videos are available at the following link: CloudHPC Symposium – Session 5 – YouTube

The papers are:

  • Challenges of Distributed Computing for Pandemic Spread Prediction based on Large Scale Human Interaction Data
    Haiying Shen, Professor, University of Virginia
  • GreenDataFlow: Minimizing the Energy Footprint of Cloud/HPC Data Movement
    Tevfik Kosar, Professor, University at Buffalo & NSF
  • IMPECCABLE: A Dream Pipeline for High-Throughput Virtual Screening, or a Pipe Dream?
    Shantenu Jha, Professor, Rutgers University

Haiying Shen describes the challenges of epidemic simulation using distributed computing.  In the case of epidemic simulations, the population is represented as a very large graph with edges representing interactions of human communities.   If we use a static partition of this graph to distribute the simulation over a distributed network of servers, we encounter the problem that the graph is dynamic, so the partition must reflect this fact.  Optimizing the performance requires careful repartitioning and some replication.

Tevfik Kosar’s team has been looking at the problem of reducing the cost of moving data over the Internet, which on a global basis amounts to $40 billion per year.  Their approach is to use application-level tuning of cross-layer parameters, together with a clever clustering optimization, to save significant amounts of power.   They build a model that captures real-time network conditions and applied it in a cloud-based experiment, achieving 80% higher throughput and up to 48% lower energy consumption.

Shantenu Jha described a massive effort by the DOE and other HPC laboratories to build a drug discovery pipeline, called IMPECCABLE, that combines traditional HPC simulations with new AI methods that can greatly reduce the search spaces involved in molecular dynamics modeling.   IMPECCABLE is clearly vastly more sophisticated and complete than any of the current cloud-based high-throughput virtual screening systems.  Professor Jha makes the point that more work is needed on system software to support heterogeneous workflows for large scale drug discovery.

Final Sessions

There are two sessions for which we are still waiting for the IEEE to upload the videos.   We will update this document when we have access to recordings.

Session 1 Cloud & Heterogeneous Architectures & Opportunities for HPC (Chair: Ian Foster)

  • Advancing Hybrid Cloud HPC Workflows Across State of the Art Heterogeneous Infrastructures
    Steve Hebert, Nimbix Founder and CEO
  • The impact of the rise in cloud-based HPC
    Brent Gorda, ARM Director HPC Business
  • HPC in a box: accelerating research with Google Cloud
    Alexander Titus, Google Cloud

Session 6.  Cloud HPC Barriers & Opportunities (Chair: Bruce D’Amora)

  • The Future of OpenShift
    Carlos Eduardo Arango Gutierrez, Red Hat, HPC OpenShift Manager
  • Scientific Computing On Low-cost Transient Cloud Servers
    Prateek Sharma, Indiana University
  • HW-accelerated HPC in the cloud: Barriers and Opportunities
    Christoph Hagleitner, IBM Research

A Brief Introduction to Ray Distributed Objects, Ray Tune, and a Small Comparison to Parsl.


This short report describes the core features of the Ray distributed object framework.  To illustrate Ray’s actor model, we construct a simple microservice-like application with a web server frontend.  We compare Ray to similar features in Parsl and provide a small performance analysis.   We also provide an illustration of how Ray Tune can optimize the hyperparameters of a neural network.


Ray is a distributed object framework first developed by the amazingly productive Berkeley RISE Lab. It consists of a number of significant components beyond the core distributed object system, including a cluster builder and autoscaler, a webserver framework, a library for hyperparameter tuning, a scalable reinforcement learning library, a set of wrappers for distributed ML training and much more.  Ray has been deeply integrated with many standard tools such as Torch, Tensorflow, Scikit-learn, XGBoost, Dask, Spark and Pandas.    In this short note we will only look at the basic cluster, the distributed object system, the webserver and the Ray Tune hyperparameter optimizer.  We also contrast the basics of Ray with Parsl, which we described in a previous post.  Both Ray and Parsl are designed to be used on clusters of nodes or a multi-core server, and they overlap in many ways.  Both have Python bindings, make heavy use of futures for the parallel execution of functions, and work well with Kubernetes, but as we shall see they also differ in several important respects.   Ray supports an extensive and flexible actor model, while Parsl is primarily functional.  Parsl is arguably more portable, as it supports massive supercomputers in addition to cloud cluster models.

In the next section we will look at Ray basics and an actor example.  We will follow that with the comparison to Parsl and discuss several additional Ray features.  Finally, we turn to Ray Tune, the hyperparameter tuning system built on top of Ray.

Ray Basics

Ray is designed to exploit the parallelism in large scale distributed applications and, like Parsl and Dask, it uses the concept of futures as a fundamental mechanism.  Futures are objects returned from function invocations that are placeholders for “future” returned values.   The calling context can go about other business while the function is executing in another thread, perhaps on another machine.   When the calling function needs the actual value computed by the function, it makes a special call and suspends until the value is ready.  Ray is easy to install on your laptop.  Here is a trivial example.

The result is

As you can see, the “remote” invocation of f returns immediately with the object reference for the future value, but the actual returned value is not available until 2 seconds later.  The calling thread is free to do other work before it invokes ray.get(future) to wait for the value.  

As stated above, Parsl uses the same mechanism.   Here is the same program fragment in Parsl.

Ray Clusters and Actors

Ray is designed to run in Ray clusters, which can be launched on a single multicore node, a cluster of servers, or as pods in Kubernetes.   In the trivial example above we started a small Ray environment with ray.init(), but that goes away when the program exits.   To create an actual cluster on a single machine we can issue the command

$ ray start --head

If our program now uses ray.init(address=’auto’) for the initialization, the program runs in the new Ray cluster.  Now, when the program exits, some of the objects it created can persist in the cluster.  More specifically, consider the case of actors, which, in Ray, are instances of classes that have persistent internal state and methods that may be invoked remotely.

To illustrate Ray actors and the Ray web server, we will now describe an “application” that dynamically builds a small hierarchy of actors.    This application will classify documents based on the document title.  It will first classify them into top-level topics and, for each top-level topic, subdivide further into subtopics as shown in Figure 1. Our goal here is not actually to classify documents, but to illustrate how Ray works and how we might find ways to exploit parallelism.  Our top-level topics are “math”, “physics”, “compsci”, “bio” and “finance”.   Each of these has subtopics.   For example, “math” has the subtopics “analysis”, “algebra” and “topology”.

Figure 1.  Tree of topics and a partial list of subtopics

We will create an actor for each top-level topic that will hold lists of article titles associated with the subtopics.   We will call these actors SubClassifiers and they are each instances of the following class.

 We can create an instance of the “math” actor with the call

cl = SubClassifier.options(name="math", lifetime="detached").remote("math")

There are several things happening here.  The SubClassifier initializer is “remotely” called with the topic “math”.  The initializer reads a configuration file (read_config) to load the subtopic names for the topic “math”, and from these the subtopic dictionaries are constructed.  We have also included options instructing that this instance be “detached”, living in the cluster beyond the lifetime of the program that created it, and that it have the name “math”.   Another program running in the cluster can access this actor with the call

math_actor = ray.get_actor("math")

Another program running in the same Ray cluster can now add new titles to the “math” sub-classifier with the call


In an ideal version of this program we would use a clever machine learning method to extract the subtopics, but here we simply use a tag attached to the title.   This is accomplished by the function split_titles().
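The split_titles() code figure is not shown in the text. A sketch of what such a function might look like, assuming titles carry an ArXiv-style "[topic.subtopic]" tag as in the example cited later, is:

```python
def split_titles(title):
    """Split 'Some Title [topic.subtopic]' into (title, topic, subtopic)."""
    text, _, tag = title.rpartition("[")
    topic, _, subtopic = tag.rstrip("] ").partition(".")
    return text.strip(), topic, subtopic

print(split_titles("Spectral Measures on Locally Fields [math.FA]"))
# → ('Spectral Measures on Locally Fields', 'math', 'FA')
```

The exact return shape in the original code may differ; the point is simply that the tag, not an ML model, supplies the classification.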

We can return the contents of the sub-classifier dictionaries with


A Full “Classifier” Actor Network.

To illustrate how we can deploy a set of actors as a collection of microservice-like components, we will create a “document classifier” actor that allocates document titles to the SubClassifier actors.  We first classify each document into the top-level categories “math”, “physics”, “compsci”, “bio” and “finance”, and then send it to the corresponding SubClassifier actor.

The Classifier actor is shown below.   It works as follows.   A classifier instance has a method send(title) which uses the utility function split_titles to extract the top-level category of the document.  To get the subcategory, we need to discover the subcategory topic tag and attach it to the title; a separate function compute_subclass does that.   For each top-level category, we get the actor instance by name or, if it does not exist yet, we create it.   Because computing the subclass may require the computational effort of an ML algorithm, we invoke compute_subclass as a remote task, store the future to that computation in a list, and go on to process the next title.  If the list reaches a limit (given by the parameter queue_size), we empty the list and start again.   Emptying the list requires waiting until all the futures are resolved.   This is also a way to throttle concurrency.

By varying the list capacity, we can explore the potential for concurrency.   If queue_size == 0, the documents are handled sequentially.   If queue_size == 10, there can be as many as 10 documents being processed concurrently.  We shall see from the experiments below what levels of concurrency we can obtain.

In the ideal version of this program the function compute_subclass would invoke an ML model to compute the subclassification of the document, but for our experiments here we cheat, because we are interested in measuring potential concurrency.   In fact, the subclassification of the document is done by the split_titles() function in the SubClassifier actor.  (The document titles we are using all come from ArXiv and have a top-level tag and subclass already attached.  For example, the title ‘Spectral Measures on Locally Fields [math.FA]’ is math with subtopic analysis.)

The simplified function is shown below.  The work is simulated by a sleep command.   We will use different sleep values for the final analysis.

The Ray Serve webserver

To complete the picture we need a way to feed documents to the top-level classifier.  We do that by putting a webserver in front and using another program outside the cluster to send documents to the server as fast as possible.    The diagram in Figure 2 below illustrates our “application”.

Figure 2.   The microservice-like collection of Ray actors and function invocations.  The clients send titles to the webserver “server” as http put requests.  The server makes remote calls to the classifier instance, which spawns compute_subclass invocations (labeled “classify()” here), and these invoke the send method on the subclassifiers.

The Ray Serve library is defined in terms of backend objects and web endpoints.   In our case the backend object is an instance of the class Server.

Note that this is not a ray.remote class.   Ray Serve creates an instance of the Server which, in turn, grabs an instance of the Classifier actor.  The Server object has an async/await coroutine of the kind now used in ASGI, the Asynchronous Server Gateway Interface for Python.   The call function invokes the send operation on the classifier.  To create and tie the backend to the endpoint we do the following.

We can invoke this with

Serve is very flexible.  You can have more than one backend object tied to the same endpoint and you can specify what fraction of the incoming traffic goes to each backend. 

Some performance results

One can ask what the performance implication of the parallel compute_subclass operations is.   The amount of parallelism is defined by the parameter queue_size.  It is also limited by the size of the Ray cluster and the amount of work compute_subclass does before terminating.   In our simple example the “work” is the length of time the function sleeps.   We ran the following experiments on AWS with a single multicore server with 16 virtual cores.   Setting queue_size to 0 makes the operation serial, and we could measure the time per web invocation for each sleep time.   For a sleep of t=3 seconds the round trip time per invocation was an average of 3.4 seconds.   For t=5 the time was 5.4, and for t=7 it was 7.4.  The overhead of just posting the data and returning was 0.12 seconds.  Hence, beyond sleeping, there is about 0.28 seconds of Ray processing overhead per document.  We can now try different queue sizes and compute the speed-up over the sequential time.

Figure 3.  Speed up over sequential for 40 documents sent to the server. 

These are not the most scientific results: each measurement was repeated only 3 times.  However, the full code is available for tests of your own.   The maximum speed-up was 9.4 when the worker delay was 7 seconds and the queue length was 20.   The code and data are in GitHub.

Another simple test, one that involves real computation, is to compute the value of Pi in parallel.   This involves no I/O, so the core is truly busy computing.   The algorithm is the classic Monte Carlo method of throwing X dots into a square 2 on a side and counting the number that land inside the unit circle.  The fraction inside the circle is Pi/4.   In our case we set X to 10**6 and compute the average of 100 such trials.   The function to compute Pi is
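The Pi code figure is not shown in the text; a sketch of such a function, sampling n points in the square [-1, 1] x [-1, 1] and counting how many land inside the unit circle, might look like this:

```python
import random

def compute_pi(n):
    """Monte Carlo estimate of Pi from n random points in the square."""
    inside = sum(
        1 for _ in range(n)
        if random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1.0
    )
    # the fraction of points inside the circle is approximately Pi/4
    return 4.0 * inside / n

estimate = compute_pi(10**6)
print(estimate)            # ≈ 3.14
```

The function name and exact sampling details are assumptions; the article's version is the same classic algorithm.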

The program that does the test is shown below.   It partitions the 100 Pi(10**6) tasks into blocks, and each block is executed by a single thread.

As can be seen, the best time is when we compute 100 independent blocks.  The sequential time is when it is computed as a sequential series of 100 Pi tasks.   The speedup is 9.56.  We also did the same computation using Dask and Parsl.  The chart below shows the relative performance. The graph shows the execution time on the vertical axis, and the horizontal axis gives the block sizes 4, 5, 10, 20, 100.   As you can see, Dask is fastest, but Ray is about the same when we have 100 blocks in parallel.

Figure 4.  Execution time of 100 invocations of Pi(10**6).  Blue=Parsl, Green=Ray and  Orange=Dask.

More Ray


One interesting extension of Ray is Modin, a drop-in replacement for Pandas.   Data scientists using Python have, to a large extent, settled on Pandas as a de facto standard for data manipulation.   Unfortunately, Pandas does not scale well.   Other alternatives out there are Dask DataFrames, Vaex, and the NVIDIA-backed RAPIDS tools.   For a comparison, see Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS (datarevenue.com) and the Modin view of scaling Pandas.

There are two important features of Modin.   The first is the fact that it is a drop-in replacement for Pandas.  Modin duplicates 80% of the massive Pandas API, including all of the most commonly used functions, and it defaults to the original Pandas versions for the rest.  This means you only need to change one line:

import pandas as pd

to

import modin.pandas as pd

and your program will run as before.  The second important feature of Modin is performance.   Modin’s approach is based on an underlying algebra of operators that can be combined to build the Pandas library.  They also make heavy use of lessons learned in the design of very large databases.  In our own experiments with Modin, on the same multicore server used in the experiments above, Modin performed between 2 and 8 times better for standard Pandas tasks.  However, it underperformed Pandas for a few.   Where Modin performance really shines is on DataFrames that are many tens of gigabytes in size running on Ray clusters of hundreds of cores.  We did not verify this claim, but the details are in the original Modin paper.

Ray Tune: Scalable Hyperparameter Tuning

Ray Tune is one of the most used Ray subsystems.   Tune is a framework for using parallelism to tune the hyperparameters of an ML model.   The importance of hyperparameter optimization is often overlooked when explaining ML algorithms.   The goal is to turn a good ML model into a great ML model.    Ray Tune is a system that uses asynchronous parallelism to explore a parameter configuration space to find the set of parameters that yields the most accurate model given the test and training data.  We will illustrate Tune with a simple example.   The following neural network has two parameters: l1 and l2.
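The network figure is not reproduced in the text. A sketch of a three-layer network whose hidden-layer widths l1 and l2 are the tunable parameters might look like the following; the 4-input / 3-output shape is an assumption matching the Iris data used later, and the author's exact architecture may differ.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, l1=16, l2=8):
        super().__init__()
        # l1 and l2 set the widths of the two hidden layers
        self.fc1 = nn.Linear(4, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 3)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

out = Net()(torch.randn(5, 4))
print(out.shape)
```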

These parameters describe the shape of the linear transforms at each of the three layers of the network.  When training the network, we can easily isolate two more parameters: the learning rate and the batch size.    We can describe this parameter space in Tune with the following.
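The search-space figure is not shown here. A sketch of such a Tune configuration, with the exact bounds and batch sizes treated as assumptions, might be:

```python
import numpy as np
from ray import tune

config = {
    # l1 and l2: powers of 2 between 4 and 64
    "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 7)),
    "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 7)),
    # learning rate drawn from a log-uniform distribution
    "lr": tune.loguniform(1e-4, 1e-1),
    # batch size chosen from an explicit list
    "batch_size": tune.choice([2, 4, 8, 16]),
}
```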

This says that l1 and l2 are powers of 2 between 4 and 64, the learning rate lr is drawn from a log uniform distribution, and the batch size is one of those listed.  Tune will extract instances of the configuration parameters and train the model with each instance by using Ray’s asynchronous concurrent execution.

To run Tune on the model, you need to wrap it in a function that encapsulates the model with instances of the configuration.  The wrapped model executes a standard (in this case Torch) training loop followed by a validation loop that computes an average loss and accuracy for that set of parameters.

One of the most interesting parts of Tune is how it schedules the training-validation tasks to search the space for the optimal results.   The scheduler choices include HyperBand, Population Based Training and more.  The one we use here is the Asynchronous Successive Halving Algorithm (ASHA) (Li et al., “[1810.05934v5] A System for Massively Parallel Hyperparameter Tuning (arxiv.org)”) which, as the name suggests, implements a halving scheme that rejects regions that do not show promise and uses a type of genetic algorithm to create promising new avenues.  We tell the scheduler we want to minimize the loss function and invoke that with the tune.run() operator as follows.

We provide the complete code in a Jupyter notebook.  The model we are training solves the classic and trivial Iris classification problem, so our small neural network converges rapidly and Tune gives the results

Note that the test set accuracy was measured after Ray completed, by loading the model from the checkpoint that Tune saved (see the notebook for details).

Final Thoughts

There are many aspects of Ray that we have not discussed here.   One major omission is the way Ray deploys and manages clusters.   You can build a small cluster on AWS with a single command.  Launching applications from the head node allows Ray to autoscale the cluster by adding new nodes if the computation demands more resources, and it then releases those nodes when they are no longer needed.

In the paragraphs above we focused on two Ray capabilities that are somewhat unique.   The first is Ray’s actor model, in which actors can persist in the cluster beyond the lifetime of the program that created them.   The second contribution from Ray that we found exciting is Tune.   A use case that is more impressive than our little Iris demo is the use of Tune with Hugging Face’s BERT model. See their blog.

Deep Learning with Real Data and the Chaotic Double Pendulum

In this note we provide a simple illustration of how the ‘classic’ deep learning tool LSTM can be used to model the solution to time dependent differential equations.  We begin with an amazingly simple case: computing the trajectory of a projectile launched into the air, encountering wind resistance as it flies, and returning to earth.  Given a single sample trajectory, the network can demonstrate very realistic behavior for a variety of initial conditions. The second case is the double pendulum, which is well known to generate chaotic behavior for certain initial conditions.   We show that an LSTM can effectively model the pendulum in the space of initial conditions up until it encounters the chaotic regions.

Using Neural Nets to Generate Streams of Data

Much of the recent excitement in deep learning has come from the ability of very large language models (such as GPT-3 from OpenAI) to generate streams of text to solve impressive tasks such as document summarization, writing poetry, engaging in dialog and even generating computer code.      In the case of scientific applications, deep learning models have been used to tackle a number of interesting forecasting challenges.

There are two major differences between natural language processing and physical system modeling.  First, natural language has a finite vocabulary, and the model must include an embedding that maps words into vectors and a decoder that maps the neural network outputs into a probability distribution over the vocabulary.   In the case of physical systems, the data often consists of streams of vectors of real numbers, and the embedding mapping may not be needed.  The second difference is that physical systems can often be modeled by differential equations derived from the laws of physics.   There are often strong parallels between the formulation of the differential equation and the structure of the neural network.   We examined this topic in a previous article where we described the work on Physically Inspired Neural Networks and Neural Ordinary Differential Equations.   Solving a time dependent differential equation is often stated as an initial value problem:  given the equations and a consistent set of k critical variables at some time t0, a solution is an operator D that defines a trajectory through Rk that satisfies the equations.

This image has an empty alt attribute; its file name is Picture2.png

Unless a differential equation has a known, closed form solution, the standard approach is numerical integration. We can describe this as an operator Integrate which takes vector of  initial variable values at t0 ,V0 in Rk  Integrate( V0to a sequence Vi in Rk 

This image has an empty alt attribute; its file name is Picture3.png

so that the appropriate finite difference operations when applied to the sequence of Vi approximately satisfy the differential equations. The Integrate operator is normally based on a carefully crafted numerical differential equation solver and   the number of time steps (Tn) required to advance the solution to a desired end configuration may be very large and the computation may be very costly.   Alternatively, the Integrate operator may be based on a pretrained neural network which may be as accurate and easier to compute.   We are now seeing excellent examples of this idea.

Rajat Sen, Hsiang-Fu Yu, and Inderjit Dhillon consider the problem of forecasting involving large numbers of time series.  By using a clever matrix factorization they extract a basis set of vectors that can be used with a temporal convolutional network to solve some very large forecasting problems.  JCS Kadupitiya, Geoffrey C. Fox, and Vikram Jadhao consider classical molecular dynamics problems and they show that an integrator based on Long Short Term Memory (LSTMs) can solve the Newton differential equations of motion with time-steps that are well beyond what a standard integrator can accomplish.   J. Quetzalcóatl Toledo-Marín, Geoffrey Fox, James Sluka and James A. Glazier  describe how a convolutional neural network can be used as a “surrogate” for solving diffusion equations. 

In his draft “Deep Learning for Spatial Time Series”,  Geoffrey Fox generalizes the LSTM based differential equation integrators described above (and in our examples below) to the case of spatial as well as temporal time series.   To do this he shows how to integrate the encoder of a Transformer with and LSTM decoder.   We will return to the discussion of Fox’s report at the end of this post.

There are many variations of neural networks we could use to approach the problem of forecasting streaming data.   Transformers are the state-of-the-art for NLP, having totally eclipsed recurrent neural nets like the Long Short-term Memory (LSTM).   When put to the task of generating sentences LSTM tend to lose context very quickly and the results are disappointing.  However, as we shall see, in the case of real data streams that obey an underlying differential equation they can do very well in cases where the solutions are non-chaotic.   We will look at two  very simple  examples to illustrate these points

Case1.  The Trajectory of a Ball Thrown Encountering Wind Resistance.

Consider the path of a ball thrown with initial speed v and at an elevation angle θ.  As it flies it encounters a constant drag and eventually falls back to earth.  With the initial position as the origin, the differential equation that governs the path of our ball is given by  

This image has an empty alt attribute; its file name is Picture4.png

where g is the gravitational acceleration and u is the drag caused by the air resistance as illustrated in the figure  below.

This image has an empty alt attribute; its file name is Picture5-1.png

Figure 1.   The path of our ball given initial speed v and elevation angle θ

Wikipedia provides a derivation of these equations and a Python program that generates the numerical solution illustrated above.   We will use a modified version of that program here.   A standard ordinary differential equation solver (odeint) can be used to generate the path of the ball. 

This image has an empty alt attribute; its file name is Picture6-1-1024x326.png

We will use a simple LSTM network to learn to reproduce these solution.  The network will be trained so that if it is given 10 consecutive trajectory points (xi,yi) for i in [0,9] it will generate the next point (x10,y10).   To make the LSTM generate a path for a given value of v and θ we provide the first 10 points of the solution and then allow the LSTM to generate the next 100 points.

 We will use the most trivial form of a LSTM as shown below to define our model.

This image has an empty alt attribute; its file name is Picture7-1.png

The differential equation requires 4 variables (x, y, vx, vy)  to make a well posed initial condition,  however our model has an input that is only two dimensional.  To provide the velocity vector we give it a sequence of 10  (x,y) pairs, so our model is

This image has an empty alt attribute; its file name is Picture8-1.png

The way the model is trained is that we take a trajectory computed by the ODE Solver computed by the projectile_motion function and break that into segments of length 11.   The LSTM is given the first 10 pairs of points and it is trained to predict the 11th point.  This is a training set of size 100.  (The details of the training are in the Jupyter Notebook in the github repo for this article.  )

We have a simple function makeX(angle, speed) that will compute the ODE solution and time to ground.  The training set was generated with the path p = makeX(30, 500).   A second function  OneStep(path) that takes a computed path and uses the first 10 steps to start the model which goes until it reaches the ground.  OneStep (less a plotting component and the ground hitting computation) is shown below.   The data is first normalized to fit in the (-1, 1) range that neural net expects.  We then take the first 10 elements to start a loop which predicts the 11th step.  That is appended to the list and we continue to predict the next step until done. 

This image has an empty alt attribute; its file name is Picture1-1024x378.png

  If we run OneStep with the path from the training data it is a perfect fit.  This, of course is not surprising. However if we give it a different initial angle or different initial speed the network does an very good job on a range of initial conditions.  This is illustrated in Figures 2 and 3 below.

This image has an empty alt attribute; its file name is Picture10-1.png

Figure 2.  Comparing the ODE solver to the LSTM for a fixed speed and a range of angles.

This image has an empty alt attribute; its file name is Picture11-1-1024x386.png

Figure 3.   Comparing the ODE solver to the LSTM for a fixed angle and a range of speeds.

The further we move from our initial training path, the less accurate the result.  This is shown in Figure 4.

This image has an empty alt attribute; its file name is Picture12-1.png

Figure 4.   angle = 10, speed = 500.  The LSTM overshoots slightly but ends correctly.

We also trained the network on 4 different paths.  This improved the results slightly.

Case 2. The Double Pendulum

The double pendulum is another fascinating case with some interesting properties.  It consists of two pendulums, with the second suspended from the end of the first.  As shown in figure 5, it can be characterized by two angles θ1 and θ2 that measure the displacements from vertical, the length of the two pendulum arms L1 and L2 and the masses M1 and M2

This image has an empty alt attribute; its file name is Picture13-1.png

Figure 5.  The double pendulum

The equations of motion are in terms of the angles, the angular velocities and the accelerations.  The

This image has an empty alt attribute; its file name is Picture15-1-1024x290.png

We have adapted a very nice program by Mohammad Asif Zaman (github) by adding our LSTM network as above.   The initial condition is to hold the configure vector in a position for a given value for the two angles with the angular velocities both zero and release it.   We have a function
double_pendulum(V,t,m1,m2,L1,L2,g) that evaluates the differential equations given the configuration and we use the standard ODE solvers to integrate the solution forward from the initial configuration with

This image has an empty alt attribute; its file name is Picture16-1-1024x228.png

Using different values for parameters for d1 and d2 we can generate a variety of behaviors including some that are very chaotic. 

Our LSTM network is as before except its initial and final dimension is 4 rather than 2.  We train this as before but integrate much farther into the future.   (Again, the details are in the notebooks in github.)

This image has an empty alt attribute; its file name is Picture17-1.png

We have created a version of the OneStep function that allows us to see the trajectories.  It is more interesting to look at the trajectory of the lower mass.   In Figure 6 below we see on the right the positions of the double pendulum with the initial condition angles θ1= -60 degrees and θ2 = 60 degrees.   We then show the configurations at T = 20 and T = 80.

This image has an empty alt attribute; its file name is Picture18-1.png

Figure 6.   Double Pendulum.   On the left is the path of the ODE solution in red and the LSTM solution in green.

By looking at the path traces on the left the two solutions for this configuration diverge rapidly.   Using smaller angles we get much more stable solutions.   Below is the trace for θ1= -20 degrees and θ2 = 35 degrees.

This image has an empty alt attribute; its file name is Picture20-1.png

Figure 7.  Blue mark is starting point, yellow is when the two solutions begin to diverge.

Below we have the animated versions of the ODEsoltion and the LSTMsolution for this case. Below that in Figure 9 is the illustration of the chaotic case (-50, 140)

Figure 8 Left is ODE solution for -20 degrees, 35 degrees. Right is LSTM solution

Figure 9. Left is chaotic ODEsoltion for -50 degrees, 140 degrees. Right is failed LSTMsolution


The experiment described here used only the most trivial LSTM network and tiny training set sizes.   The initial value problems we considered were a two dimensional submanifold of the 4 dimensional space:  in the case of the projectile, the origin was fixed and the speed and angle were free variables and in the pendulum, we had the two angles free and the initial angular velocities were set to 0.  Each point of the initial conditions is a point on the configuration manifold (which is a cylinder defined by angle and speed for the projectile and a torus defined by the two angles of the pendulum).  The differential equations specify a unique path from that initial point along the configuration manifold.  The LSTM defines another path from the same initial point along the manifold.  The question is  how are these two paths related?

We noticed that training the LSTM with one solution allowed the LSTM to generate reasonably good solutions starting from points “near” the training initial value point.   It is interesting to ask about the topology of set of solutions S to the differential equations, in the region containing the solution generated by an initial value point x.   Consider the case of the projectile when the wind resistance is zero.  In this case the solutions are all parabolas with a second derivative that is -g.   This implies that any two solutions differ by a function that has a second derivative that is zero: hence they differ by a simple line.  In this case he topology of S is completely flat.   If the wind resistance is non-zero, the non-linearity of the equations tells us that the submanifold S is not completely flat.   If we consider the submanifold of solutions S’x for the LSTM trained on the single initial value point x, we noticed that values near x generated solutions in S’x that were very close to the solutions in S near those generated  by points near x.  We suspect that the two submanifolds and S’x share the same tangent plane at x.   We attempted to illustrate this in Figure 9 below.

This image has an empty alt attribute; its file name is Picture22.png

Figure 9.  Depiction of the tangent planes to two submanifolds that have the same solution for a given initial condition x.

In the case of the double pendulum,  the  shape of S is much more complex.  The LSTM manifold is a good approximation of the pendulum in the stable region, but once a starting point is found at the edge of the chaotic region the LSTM solution fall apart. 

In his draft ”Deep Learning for Spatial Time Series”, Fox shows a very general and more sophisticated application of neural networks to time series.   In particular, he addresses sequences that are not limited to time series as we have discussed here.  He describes a concept he calls a “spatial” bag which refers to a space-time grid of cells with some number of series values in each cell.  For example, cells may be locations where rainfall, runoff and other measures are collected over a period of time.   So location in the spatial bag is one axis and time is the other.   Making prediction along the spatial direction is  analogous to sequence to sequence transforms  and prediction along the time axis is similar to what we have described here.  The network his team used is hybrid with a Transformer as encoder and LSTM as a decoder.   Because the bag is an array of discrete cells,  the Transformer is a natural choice because the attention mechanism can draw out the important correlations between cells.  The LSTM decoder can generate streams of output in a manner similar to what we have done.   The paper shows convincing results for hydrology applications as well as COVID-19 studies.   I expect much more to come from this work.

Interacting with a Running FuncX Instance

This short report is an addendum to the report “A Look at Parsl and FuncX: Two Excellent Parallel Scripting Tools for Clouds and Supercomputers.”  At the conclusion of that report it was stated that one of the missing pieces of our analysis was a description of how distributed FuncX function instances can communicate with other objects to handle remote interactions or to process streaming data.  Here is another way to state the problem.   FuncX gives you a way to package and launch a function on a remote resource, but you have no direct way to interact with that executing function until it returns or otherwise terminates.  We present three different scenarios that show how to do this.

Consider the following scenario.   You have a function that loads a deep learning vision model and you want it to run on a remote CUDA-capable device and then use the camera on that device to capture a picture and return the analysis and the image to you.   Even if the model has been previously cached on the remote device, loading it, moving the model to the CUDA engine and then analyzing the image can take over 10 seconds. However, once it has been run through the CUDA pipeline, subsequent images can be processed 100 times faster.    How can we interact with the function execution to tell it “Take another picture now” (taking advantage of 100 fold speed up) and return the result of the inference and the image without having to terminate the function’s execution?    

This image has an empty alt attribute; its file name is Picture1.png

Figure 1. Interacting with a remote execution.

In Figure 1 above we have a small desktop client application that communicates with our function execution.  There is a button at the top to send a message to the waiting function to activate the camera and do the inference to recognize the image.   The first image was a picture of a chair and it took 21.0 seconds.  The next was a picture of a desk.  Third image was of a book filled office which was labeled “library”.   These last two images only took about 0.16 seconds.   The full transaction recorded on the client was

This image has an empty alt attribute; its file name is Picture2.png

The original function was launched from a Jupyter notebook on the same desktop that is running the client app.  When the client app is closed it sends  a “quit”  message to the function which causes it to terminate normally and return to the jupyter notebook.   The key to making this work is the communication between the client app and the executing function.  The communication mechanism used here is asynchronous queueing as illustrated below.

This image has an empty alt attribute; its file name is Picture3-1024x343.png

Figure 2.   Client and running instance communicating asynchronously through queues

It is important to note that this is not a ‘request-response’ scenario.  It is fully asynchronous.  Both the client and the function instance run loops that monitor their input queue.  The client sends either a “quit” action or a “take picture” action depending on the user input. A separate thread continuously monitors the input stream of messages from the function instance.   Messages coming from the function instance are either informational, such as “model loaded” or “model cuda-ized” meaning that the model has been moved to the Nvidia Cuda cores.   The client displays these in the text box.  The other message that the instance sends are the inference classifications such as “library” followed by the time it took for the inference.   When it sees a “class” message it uses secure copy (scp) to pull the image and display it.

We implemented this scenario with two different queue systems: RabbitMQ with Pika and AWS Simple Queue Service.   RabbitMQ is installed as a separate service on a host visible to the network that has the client and the Jetson device.  Pika is the AMQP protocol python implementation that allows communication with the RabbitMQ service.   Message consumers based on Pika use a continuation-passing style in which the consumer provides a callback function that will be invoked when the queue has a message to deliver. Our FuncX function that is run on the Jetson device takes the form below.

This image has an empty alt attribute; its file name is Picture4-1024x591.png

When invoked by FuncX on the remote device it first establishes a connection and channel back to the RabbitMQ service.  It then goes about loading and initializing the Resnet18 computer vision model and moving it to the Cuda Nvidia cores.  It next registers the callback function and issues a “start_consuming” action.   At this point the function will wait for messages on the “command” queue.  When it receives a “quit” command it stops consuming and returns to the continuation point which is the return from the function back to the FuncX calling Jupyter notebook. 

The solution using AWS Simple Queue Service is conceptually easier to grasp.  SQS allows us to read a message from a queue if there is a message there.   If not, the read operation waits for a prescribed interval of time and if nothing arrives in the queue, it will return.  The main body of the function is as follows,

This image has an empty alt attribute; its file name is Picture5.png

The function goes into a loop that start with a receive-message on its input queue.  It asks for 1 message and the maximum wait time is 5 seconds.  If a message arrives in that interval it is either “quit” or “take picture”.   If it is “quit” it send a signal back to the client program signaling it to quit and it then return from the FuncX invocation.   

The source code for both solutions is in the dbgannon/pars-funcx github repository as funcx-interactive-camera-final.ipynb (and html).   The desktop client program are also there.   You need to supply an AWS account information to run the aws example.  To run the rabbitmq version you need to have an instance of rabbitmq running.  It doesn’t cost you anything to download and run it, but it is not a fun installation. 

Dealing with Streams

If we want to monitor a data stream from an instrument on the remote resource it is easy enough to write a function that will go into a continuous loop gathering that data, doing needed analysis and sending results to some remote database.   The problem is that we may need to stop the function, perhaps to replace it with a better version, but the FuncX client does not give us a way to do so.   There are several solutions to sending graceful termination signals to such a function and we discuss one below.   

Another challenge is designing a protocol that allows to or more remotely executing functions to exchange messages reliably without entering deadlock states. 

The scenarios we are interested in are

  1. A function generates a stream of messages that can be sent to zero or more listeners.   For example, the sending function may be drawing samples from an instrument.  If the sender is sending message it should not wait on a “blocking send” for a listener to show up because the instrument may be generating interesting values that will be missed.   Consequently, it may be best to just push the messages to some “device” that allows listeners to pick up the most recent values.   We will look at a sample where multiple senders are sending messages to 0 or more listeners and we will use a publish-subscribe model.  Listeners select a “channel” to subscribe to and they will receive only the messages that are sent on that channel.  You may have multiple listeners on a channel and listeners may come and go as needed.   Senders can send messages on any channel and the send operations are non-blocking.   This example uses ZMQ service for the messaging.  The routing device is a small 10-line program that runs on a server exposed to the network.
This image has an empty alt attribute; its file name is Picture6.png
  • In the case above we use a routing device to channel messages from senders to listeners and if there is no listener on a channel, we just drop messages that come in on that channel.  In the second case we want to avoid dropping messages.  To do this we encapsulate collections of function invocations together with a queue service into a single “component”.  Messages sent to the component queue are processed in a first-in-first-out manner by one of the function instances.  In our example we consider the case of two “components”: a front-end that receives messages from our control program and a backend that receives messages from the front-end.  The front-end function processes the message and then save the result in an external table or it forwards the message to a back-end component that has multiple function instances to process the request.   This design allows messages that require more cpu intensive processing to be handled by a pool of back-end workers as shown below.
This image has an empty alt attribute; its file name is Picture7.png

Pub-Sub Routing with ZMQ in FuncX

The first case above is illustrated by a  demo in notebook funcx-zmq-final.ipynb that shows how four funcx instances an communicate streams through a zmq pub-sub routing filter. The routing filter is a simple program that runs on a server with a “public” IP. In this case it is a small NVIDIA jetson device and the access is via the local area network at address and it is listening on port 5559 for the listeners to subscribe and on port 5560 for the publishers.  This is a standard ZMQ example and the core of the program is shown below.

context = zmq.Context(1)
# Socket facing clients
frontend = context.socket(zmq.SUB)
frontend.setsockopt(zmq.SUBSCRIBE, "")

  # Socket facing services
backend = context.socket(zmq.PUB)
zmq.device(zmq.FORWARDER, frontend, backend)

In the demo we have two listeners. One listener subscribes to messages on channel 5 and the other on channel 3. We have two “sender” functions that send a stream of messages to the router. One “Sally” runs on a small server and the other “Fred” is invoked from the notebook. The only difference is that Sally sends a message every 0.2 seconds and Fred sends messages twice as often. Both alternate messages between channels 3 and 5.  The code for Fred is below.

This image has an empty alt attribute; its file name is Picture8.png

In this case it sends only 22 messages of the form


For x in the range 0 to 21 alternating between channels 3 and 5.   It then sends a “Stop” message on both channels.

The stop message causes all listeners on channels 3 and 5 to terminate.  The listeners are also quite simple.

This image has an empty alt attribute; its file name is Picture9.png

The listener subscribes on the supplied topic channel and processes messages until the “Stop” message is received.  It is also easy to stop a specific listener if we provide the listener with a unique id so that it can stop when it sees a Stop message with that ID attached.  This is important for the case when a listener needs to be replaced by an upgraded instance without missing messages.  One starts the new version and then stop the old version with its specific kill signal.   The Jupyter notebook illustrates this by showing how the listeners receive the messages from the senders in interleaved order. 

Reliable Messaging Using a Component with Queue Model

To implement the solution in case 2 above we need a queue management system with the following properties

  1. It must allow FIFO queues to be easily created and destroyed.
  2. To ensure the solution remains deadlock free and termination guarantees it must be possible for a process to read from the head of the queue and, if there is nothing there the reader is released in a bounded amount of time. 
  3. The queues should be able to hold an unbounded number of messages.

Of course, 3 is impossible, so we satisfy ourselves with queues built into substantial storage backends.  The easiest way to do this is to use Azure storage queues or the AWS simple queue service SQS.  SQS is simple to use and it is very inexpensive.  (For the work on this project my expenses were far less than $1.) 

For the demo we have two components:

  1. A Front-end component that receives messages and processes them in FIFO order. If the message is “forward” it passes the message to a Back-end component. Otherwise if the message is not “Stop”, it processes message and stores the result in a table. The table we use is in Azure Storage Service because it is cheap and reliable.
  2. The Back-end component consists of one or more instances of a backend processor functions which pull messages from the input queue for that component. We can control throughput of the back-end component by increasing or decreasing the number of functions servicing the queue. When the back-end processing functions complete and execution they store the result in the queue.

The function we incorporate into the front end is as follows.

This image has an empty alt attribute; its file name is Picture10-918x1024.png

In this demo every message is a simple python dictionary with a field called “action” which tell the function what action to take.   

The details of the auxiliary function are in the full code in the notebook aws-sqs-example in the repository.   The back end function is similar

This image has an empty alt attribute; its file name is Picture11-1024x922.png

The following simple wrapper creates an instance of the components.   The parameters are:

  1. the base name of the component (like “Front” or “Back”)
  2. the name of the input queue for this component.
  3. the output name which could be another queue name or the table name.
  4. repl_factor: the number of function instances of the type of func_id to create.
  5. the end_point that will be the host service for the function instances.
  6. func_id the uuid of the function resulting from funcx registration as above.
This image has an empty alt attribute; its file name is Picture12-1024x173.png

This launcher creates repl_factor copies of our function instance.  Running this on Kubernetes launches one pod per instance with each listening to the input queue.   The return value is the list of funcx return values for each instance. 

Final thoughts.

The solutions above are somewhat ad hoc and the programming patterns are not new.  A possible improvement to FuncX would be to make a dedicated message queue service available.  This could be an extension of existing Globus functionality already being used in FuncX.       

A Look at Parsl and FuncX: Two Excellent Parallel Scripting Tools for Clouds and Supercomputers.

In 2019, Yadu N Babuji, Anna  Woodard, Zhuozhao  Li,  Daniel S. Katz, Ben  Clifford, Rohan  Kumar, Lukasz  Lacinski, Ryan Chard, Justin Michael Joseph Wozniak, Michael  Wilde and Kyle  Chard published a paper in HPDC ’19 entitled Parsl: Pervasive Parallel Programming in Python.  I have been looking forward to finding the time to dig into it and give it a try.  The time did arrive and, as I started to dig, I discovered some of this same group, along with Tyler Skluzacek, Anna Woodard, Ben Blaiszik and Ian Foster published  funcX: A Federated Function Serving Fabric for Science in HPDC ’20.  In the following paragraphs we look at both and show examples of FuncX running on Kubernetes on Azure and on a tiny Nvidia Jetson Nano.

An Overview of Parsl

Parsl is best thought of as a tool for constructing and managing distributed parallel scientific workflows. For example, suppose you have to do data processing on a collection of files in a remote Kubernetes cluster at the same time manage a large simulation that will run  in parallel on a supercomputer and finally channel the results to a visualization system.  As sketched in the diagram in Figure 1, you want the main thread of the workflow to be managed from a python Jupyter notebook session. Parsl can do this.

Figure 1.  Hypothetical parallel distributed workflow involving remote resources managed from a Jupyter session on a laptop.

The list of actual examples of scientific applications studied using Parsl is impressive and it is documented in their case studies page.   They include examples from chemistry, physics, cosmology, biology and materials science. 

Programming Parsl is based on the concept of futures.   This is an old idea in which function invocations returns immediately with an object that represents the “future value” that the function will compute.  The calling thread can go about other work while the function computation takes place in another thread of execution.   The calling thread can later wait for the function to complete and retrieve the result.  To illustrate this here is an example of a function that computes Pi

This image has an empty alt attribute; its file name is Picture2.png

The decoration @python_app indicates that this function will return a future.    We can check to see if the computation is complete by calling done() on the future object.   When done() returns true we can get the result value with the result() function.

This image has an empty alt attribute; its file name is Picture3.png

Parsl will allow functions returning futures to be composed into graphs that can be scheduled and executed in “dataflow” style.   For example if we have to additional functions F(x,y) and G(a,b) that return futures then the graph in Figure 2

This image has an empty alt attribute; its file name is Picture4.png

Figure 2.   Parsl dataflow-style scheduling

will be scheduled and executed so that F and G are not invoked until the values of its arguments are available. 

Parsl also has a data provider class that facilitates access to remote files.  Parsl has a way to handle futures for file objects called dataFutures which is a mechanism to guarantee synchronization around file reads and writes from remote threads.

The true strength of Parsl is in how it separates the execution and runtime from the high-level language Python script.   Parsl is unique among parallel programming systems in that it allows a Parsl program to run and SCALE across laptops, shared memory multicore servers,  small HPC Clusters,  Kubernetes in the cloud, and supercomputers.   It accomplishes this by allowing to programmer to specify an Executor class to manage concurrency in the running of the application.   There are currently four executors: ThreadPool, HighTroughPut, WorkQueue, and ExtremeScale.   Each executor must have a Provider that provides the mechanism to connect to the local resource for launching tasks.   There is a simple LocalProvider for shared memory multiprocessors and a provider for Kubernetes in the cloud.   In addition, there are special providers for a host of supercomputers including

  • Argonne’s Theta and Cooley
  • ORNL’s Summit
  • Open Science Grid Condor Clusters
  • University of Chicago Midway Cluster
  • TACC’s Frontera
  • NERSC’s Cori
  • SDSC’s Comet
  • NCSA’s Blue Waters
  • NSCC’s Aspire 1

To illustrate executors,  we have allocated an Ubuntu “data science” VM on Azure that is an 8-core (4 real cores) server.   We will run 100 instances of the pi program from above, but we will do this with different levels of concurrency. We would like to do this with maximum throughput, so we will use the HighThroughputExecutor.

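The configuration screenshots are not reproduced here, but a typical HighThroughputExecutor setup with a LocalProvider looks roughly like the following sketch (the label and worker count are illustrative choices, not the post’s exact values):

```python
# Sketch of a Parsl configuration for a single multicore server.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",
            max_workers=8,             # one worker per hardware thread (illustrative)
            provider=LocalProvider(),  # run workers on the local machine
        )
    ]
)
# parsl.load(config) activates this executor for subsequent @python_app calls.
```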

We will first run pi(10**6) sequentially 100 times.   Next, we launch two instances of pi repeated 50 times, then 4 concurrent instances for 25 repetitions, and finally 5, 10 and 100 concurrent instances.  The timings are shown below.

[Figure: pi execution time at each level of concurrency.]

Compared to Dask

In many ways Parsl is like Python Dask (which we wrote about in a 2018 blog article).   Dask is heavily integrated into the Python stack: NumPy and Pandas interoperate smoothly with Dask.  For embarrassingly parallel bag-of-tasks applications, Dask has a feature called dask.bag.  We can compare Dask on the same 8-core Azure Ubuntu machine that we used above.   We create a list of 100 copies of the value 10**6, create a bag sequence from it, partition the bag into “nparts” partitions, and invoke it in parallel.

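Dask may not be installed here, so this standard-library sketch just mimics the partition-and-map pattern that db.from_sequence(...).map(pi).compute() performs (the partition helper and the smaller task sizes are illustrative):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def pi(num_points):
    inside = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_points

def partition(seq, nparts):
    # split seq into nparts nearly equal chunks, as dask.bag partitions a bag
    k, m = divmod(len(seq), nparts)
    chunks, i = [], 0
    for p in range(nparts):
        size = k + (1 if p < m else 0)
        chunks.append(seq[i:i + size])
        i += size
    return chunks

tasks = [10**4] * 100   # 100 copies of the task size (smaller than 10**6, for speed)
nparts = 4
with ThreadPoolExecutor(max_workers=nparts) as ex:
    # one worker maps pi over each partition, then we flatten the results
    mapped = ex.map(lambda chunk: [pi(n) for n in chunk], partition(tasks, nparts))
    results = [r for chunk in mapped for r in chunk]
print(len(results))  # -> 100
```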

Running this with the same set of partitions as above we get the following.

[Figure: Dask timings for each partition count.]

Note that the best performance is achieved when there is one execution of pi(10**6) per partition. The graph below illustrates the relative performance of Parsl and Dask.


Figure 3.  Dask (orange) vs Parsl (blue) execution time for block sizes 1, 2, 4, 5,  10, 20,  100.

Without further tuning of the executor and provider, I believe Dask is the best performer for running on your multicore laptop.  (The Jupyter notebook that contains these tests is in the repo dbgannon/parsl-funcx (github.com).)  But this is certainly not the whole story.  You can use your laptop to debug a Parsl script and then run it with a different executor on a massive cluster or supercomputer to achieve remarkable results.   In the Parsl HPDC ’19 paper, the authors provide ample evidence that Parsl outperforms every other alternative except perhaps a custom MPI program. 

FuncX – a function as a service fabric.

Serverless computing is one of the standard computing paradigms that cloud computing supports.  The AWS Lambda service was the first to introduce the idea of having your function deployed and invoked without requiring you to deploy VMs or other infrastructure.    In addition to Lambda, Azure has a serverless offering called Azure Functions, Google has Cloud Functions, and IBM supports Apache OpenWhisk.  One of the major shortcomings of these serverless FaaS products is that they are designed to support relatively lightweight computations (small memory requirements and short-lived execution). 

FuncX is a FaaS fabric that is designed for science.  Technically, FuncX is not a serverless platform: FuncX requires that the user acquire or deploy the needed physical resources.   To submit a function to a resource you need an instance of the FuncX client and an endpoint string, which is the key to the host resource.   Creating an instance of the FuncX client and binding a function to it takes only a few lines.

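A sketch of those steps with the funcx SDK as it existed at the time of this post; it requires a Globus Auth login and a live endpoint, so it is illustrative rather than runnable here, and the endpoint uuid is a placeholder:

```python
from funcx.sdk.client import FuncXClient

fxc = FuncXClient()          # prompts for Globus Auth on first use

def pi(num_points):
    import random            # imports must live inside the function body
    inside = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_points

func_uuid = fxc.register_function(pi)

endpoint_uuid = "<your-endpoint-uuid>"   # placeholder from your funcx-endpoint
task_id = fxc.run(10**6, endpoint_id=endpoint_uuid, function_id=func_uuid)
# poll the task status, then fetch the value when it is ready
result = fxc.get_result(task_id)
```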

Once we have registered the function with the FuncX client and have the FuncX endpoint, we can run the function.


The run method returns a uuid for the task, which can be passed to the client object to obtain the execution status and, eventually, the result.   What is not obvious from this example is where the host endpoint comes from and how the client conveys the task to the host for execution.

The FuncX paper describes the architecture in detail, so we will not dwell on it here.  In simple terms, requests from the client (fxc above) to run a function are sent to an AWS-based web service, which puts the request into a Redis task queue for the specified endpoint. A “forwarder process” in AWS monitors the Redis queue and sends the request to the remote endpoint agent (the funcx-endpoint process running on the endpoint host). The endpoint process, called the funcx-manager, distributes tasks to processes called funcx-workers that carry out the execution and return the result.    The fact that this is all managed through a cloud-based web service is very impressive.  It is fast and reliable.

FuncX on Nvidia Jetson Nano

The Nvidia Jetson Nano is a tiny but powerful computer for embedded applications and AI. It has a quad-core ARM Cortex-A57 MPCore processor, 128 NVIDIA CUDA® cores and 4GB of memory.  Installing a FuncX endpoint on the Nano is easy: just follow the steps in the docs.  At the end of the installation, you will have the endpoint uuid. An important application of FuncX is to probe an edge device, read its instruments, or use special features that the device provides.  Because the Nano has an Nvidia GPU (and my laptop doesn’t), why not ship computation to the Nano?  Here is a trivial example.   KMeans clustering is a standard ML algorithm that clusters points in space into a given number of nearby sets.  KMeans also involves lots of vector operations, so it is suitable for execution on a GPU.  We have a program kmeans.py that contains four functions:

  • random_init, which initializes a random partition of the points by giving each point a cluster code.
  • update_centers, which calculates the “center of gravity” of each set of points.
  • compute_codes, which uses the current set of center points to reclassify each point according to the new centers.
  • cluster, the main function, which iterates over updates of the centers and codes.

For FuncX we will execute cluster remotely from our laptop.

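The remote cluster function appears only as a screenshot, but the kmeans iteration it performs can be sketched in plain Python. This stand-in mirrors the four functions named above while omitting the GPU path and the remote-import boilerplate:

```python
import random

def random_init(npoints, nclusters):
    # assign every point a random cluster code
    return [random.randrange(nclusters) for _ in range(npoints)]

def update_centers(points, codes, nclusters):
    # "center of gravity" of each cluster (empty clusters fall back to the origin)
    sums = [[0.0, 0.0] for _ in range(nclusters)]
    counts = [0] * nclusters
    for (x, y), c in zip(points, codes):
        sums[c][0] += x
        sums[c][1] += y
        counts[c] += 1
    return [(sx / max(n, 1), sy / max(n, 1)) for (sx, sy), n in zip(sums, counts)]

def compute_codes(points, centers):
    # reclassify each point to its nearest center
    def nearest(p):
        return min(range(len(centers)),
                   key=lambda c: (p[0] - centers[c][0])**2 + (p[1] - centers[c][1])**2)
    return [nearest(p) for p in points]

def cluster(points, nclusters, iters=20):
    codes = random_init(len(points), nclusters)
    for _ in range(iters):
        centers = update_centers(points, codes, nclusters)
        codes = compute_codes(points, centers)
    return centers, codes

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(1000)]
centers, codes = cluster(pts, 7)
print(len(centers), len(codes))  # -> 7 1000
```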

When writing such a remote function, two points are important.  First, we must import all the needed libraries inside the scope of the function because they may not be available in the worker execution environment on the remote host.  Second, we need to import the three additional functions from kmeans.py, which is stored as the file “/home/jetbot/kmeans.py” on the remote machine.  The inputs to the cluster function consist of the array of data, the number of partitions, and a Boolean flag that signals whether to use the GPU or the CPU. 

We ran the function with 400000 random points in 2-d space, looking for 7 partitions.


With results shown below.


Figure 4.  Jetbot device on the left and kmeans plot on the right.

Running it again with the GPU gives a similar result, but the execution time is 3.6 seconds.  A scatter plot of the clustering is shown in Figure 4.  The GPU yields a speed-up of 7.8 over the CPU. (This is low because the kmeans algorithm consists mostly of vector operations rather than matrix-matrix operations; algorithms with the latter property can yield speedups in the hundreds.) A significant shortcoming of the approach taken above is that we passed the input array of points to the function.  This limits the size we can handle to about 1 million single-precision floats.  A more appropriate solution is to pass the location of the data to the remote computer and have it fetch the data.  An even better solution is to use the properties of this edge device to gather the data from its on-board instruments for processing. 

The Jetson board has a camera, so as an experiment we decided to write a function that captures an image from the camera and returns it for display.  Unfortunately, this camera requires a complex initialization, and only one process can initialize and own the camera at a time.   But FuncX allows multiple instances of a function to execute concurrently, so we need some way to mediate access to the camera. The solution was to create a simple micro web service that runs continuously.  It initializes the camera and has a single method, “hello”.   Invoking that method causes the service to grab an image and store it in a file, and the path to the file is returned to the FuncX function, which can read the file and return the image to the caller. In this case the function is now very simple: the camera service is waiting on a port on the local host where the function is executing, and the function reads the local file and returns the image. 


The result is the view from the camera of part of the office where it sits.


FuncX on Kubernetes

In the examples below we will demonstrate the use of FuncX with a small Kubernetes cluster.  More specifically, we will deploy a deep learning (BERT-based) document classifier as a function and do some simple performance analysis.  The appendix at the end of this document gives the instructions for installing Kubernetes with the Docker distribution on your laptop and also on Azure.  The typical application of Kubernetes involves running Docker-style containers on each node of the cluster. We are going to create a special container, holding our BERT classifier model, that will be loaded as the worker.   Our goal is to run as many instances of the classifier in parallel as possible. 

The BERT classifier

The classifier that we are going to use as our test case is described in a previous post.  The classifier takes as input a text abstract of a science paper posted on ArXiv and classifies it as being either Math, Physics, Computer Science, Biology or Finance.   The classifier was trained on a subset of the document abstracts.  In this case we start with a pretrained BERT model, so for classification we only have to fine-tune an extra layer.


Figure 5.  BERT modified with classifier layer.

Using the simpletransformers library, the training was done with two lines of code: one to construct a classification model from the pretrained BERT weights and one to train it.


The model is trained on a pandas dataframe of 4500 (abstract, classification) pairs.  We will test it on 2600 additional abstracts. 

The model, along with the Python code that loads the data and does the classification, was saved in a directory “classifier”.  To build the Docker container we will run on the cluster, we start with a container that has a good Python 3 implementation, install the torch and simpletransformers libraries, copy in our classifier directory, copy the classifier.py program to the root level, and finally install the funcx-endpoint code.


If you want to experiment with this container it is in the Docker repo as dbgannon/classify.

The classifier.py program has two methods for doing inference:

  • classifys( list-of-abstract-texts ), which takes a list of one or more abstract texts and returns a predicted classification for each.
  • classify( list-of-abstract-IDs ), which takes a list of abstract IDs and returns a predicted classification together with the actual classification, looked up by the abstract ID.

The FuncX function that sends a list of abstract strings to the classifier is simple.


If we let “s” be the string

‘We show that effective theories of matter that classically violate the null energy condition cannot be minimally coupled to Einstein gravity without being inconsistent with both string theory and black hole thermodynamics. We argue however that they could still be either non-minimally coupled or coupled to higher-curvature theories of gravity.’

and run this on the classifier through FuncX, we get back 2, the correct classification (Physics).  (The True and False values are the status results of the pending queries.)   To do the performance evaluation we will use the other function, which allows us to send a list of document IDs.  This lets us send longer lists (because FuncX has a limit on the size of messages).

The reply gives the predicted classification and the actual classification.   Notice they agree except in position 6 which corresponds to document 890:

'For a paradigmatic model of chemotaxis, we analyze the effect how a nonzero affinity driving receptors out of equilibrium affects sensitivity. This affinity arises whenever changes in receptor activity involve ATP hydrolysis. The sensitivity integrated over a ligand concentration range is shown to be enhanced by the affinity, providing a measure of how much energy consumption improves sensing. With this integrated sensitivity we can establish an intriguing analogy between sensing with nonequilibrium receptors and kinetic proofreading: the increase in integrated sensitivity is equivalent to the decrease of the error in kinetic proofreading. The influence of the occupancy of the receptor on the phosphorylation and dephosphorylation reaction rates is shown to be crucial for the relation between integrated sensitivity and affinity. This influence can even lead to a regime where a nonzero affinity decreases the integrated sensitivity, which corresponds to anti-proofreading.'

The classifier predicted 3 = Biology.  The correct classification (according to the authors who submitted it to ArXiv) is 2 = Physics.  (I too would have gotten that one wrong.)

Now that we have our Kubernetes cluster, we can run many invocations of this function in parallel and see how it performs.   The cluster we are using has only five nodes, each with 2 cores, and only one copy of the container can fit on each node.  One of the beauties of FuncX is that when it gets too many requests it automatically spawns new workers (which are called pods in Kubernetes).   Using the Kubernetes dashboard we can see what it looks like when the system is loaded with parallel requests.


We have only 5 nodes, so each node has one copy of the classifier container running as a funcx-… pod. One node is also running the endpoint server.   We should also note that when it is not receiving new requests, the endpoint manager starts killing off unneeded workers and the number of deployed pods drops.  

A Performance Experiment.

If you divide a fixed number of independent tasks between P processors in parallel, the total time to execute them should drop by a factor of P.   Let’s check that with our classifier.  We consider our set of tasks to be 250 document abstracts.  We have created an array of document index lists called vals_string; vals_string[n-1] contains 250/n of the documents, for n in the range 1 to 10.  We launch p instances of our funcx_impl2, each working on vals_string[p-1], and then wait for them all to finish.  We do this for p in the range 1 to 10.  

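The experiment code appears only as a screenshot; its scatter-gather shape can be sketched with the standard library, with a sleep standing in for the remote classify call (funcx_impl2 and the block sizes here are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def funcx_impl2(doc_ids):
    # stand-in for the remote classify call: pretend each inference takes 2 ms
    for _ in doc_ids:
        time.sleep(0.002)
    return len(doc_ids)

NDOCS = 250
timings = {}
with ThreadPoolExecutor(max_workers=10) as ex:
    for p in range(1, 5):
        # the 250 docs split into p roughly equal blocks (like vals_string[p-1])
        blocks = [list(range(i, NDOCS, p)) for i in range(p)]
        start = time.time()
        futures = [ex.submit(funcx_impl2, b) for b in blocks]
        done_counts = [f.result() for f in futures]   # wait for all p instances
        timings[p] = time.time() - start
        assert sum(done_counts) == NDOCS
print(timings[1] > timings[2] > 0)  # -> True: more parallelism, less wall time
```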

In the first case, one pod computes the entire set of 250 docs in 54 seconds. Next, two pods working in parallel complete the task in 29 seconds. The optimal case occurs when 4 pods work in parallel on the 4 blocks.  After that, the pods suffer scheduling delays from trying to execute more tasks than there are workers. Recall that we had only 1 CPU per pod, and while 5 pods are available, the endpoint is also running on one of them.

[Figure: total execution time for 1 through 10 parallel pods.]

A second question is to look at the average time per inference achieved.  In each case we are asking the classifier to classify a set of documents, and each classification requires a BERT inference, so what is the inference rate?  

Another factor is that every execution of the classify function must load the model. If the model load time is C and the time for each classify operation is r, then for N classifications the total time is

Time(N) = C + N*r

So the average time over all N is

Time(N)/N = C/N + r

Hence for large N,  C/N is small and the average time per inference approaches r.  If the load is divided among p processors in parallel, then the total time is

Time(N, p) = C + (N*r)/p

Looking at the actual measured inference time from the experiments above we see the average time per inference in the graph below.

[Figure: measured average time per inference for each number of concurrent tasks.]

In a separate experiment we measured the load time C to be about 3 seconds.  Looking at the first item in the graph above, we see that r is close to 0.2 for a single thread.   Using the formula above, we can plot approximate expected values, where we add a small scheduling penalty for task counts greater than our number of available servers:

Time(N, p) = C + (N*r)/p + penalty(p)

The added scheduling penalty is just a guess. By plotting  this we get a graph that looks similar to the data.

[Figure: modeled average time per inference, using the formula above.]

Final Observations

Parsl and FuncX are excellent contributions to computational science.   The only parallel computing paradigm that they don’t explicitly cover is distributed parallel streams, such as AWS Kinesis, Apache Storm or Kafka. On the other hand, we experimented with using FuncX to launch functions that communicated with other functions using message brokers like AWS Simple Queue Service and Azure storage queues as well as ZeroMQ.  For example, you can use FuncX to launch a function that connects to an instrument and establishes a stream of values that can be pumped to a queue or remote file system.  We didn’t include those examples here, because this is already too long.    However, we were convinced we could emulate the capability of Kafka and Storm stream parallelism with FuncX.   In fact, it seems FuncX uses ZeroMQ internally for some of its functions.

I want to thank Ryan Chard, Zhuozhao Li and Ben Galewsky for guiding me through some rough spots in my understanding of FuncX deployment.   In the following appendix I have documented the steps I learned while deploying FuncX on Azure and Docker.   All of the code described here will be in the repo dbgannon/parsl-funcx (github.com).  

In summary,  FuncX is Fun!

Appendix:  Getting Kubernetes up on Windows, Mac and Azure

If you are running the latest Windows 10 or MacOS, you can install Docker and, with it, Kubernetes.  The latest version on Windows 10 will use the Linux subsystem as part of the installation, and with that you get a version of Kubernetes already installed.  Click on the Docker icon, and in the control settings you will see the Kubernetes panel with an “Enable Kubernetes” button.  Start up Kubernetes.

You will also need Helm, a package manager for Kubernetes that is used by FuncX.  (To install Helm on Windows you need to install chocolatey.)   In a shell run

>choco install kubernetes-helm

On the mac you can install helm with

$brew install kubernetes-helm

Followed by

$ helm init

A good way to see what is going on inside a Kubernetes cluster is to run the Kubernetes dashboard.  First you must start a proxy server on Kubernetes with

>kubectl proxy

Starting to serve on 127.0.0.1:8001

In another shell run

kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.4/aio/

To access the dashboard you will need a special token.   To get that run

kubectl -n kubernetes-dashboard describe secret $(kubectl -n kubernetes-dashboard get secret | grep admin-user | awk '{print $1}')

On Windows, the “grep” won’t work, but the token is displayed anyway as part of the output.  

Go to the dashboard link on localhost:8001 and plug in the token.   You will see the important categories of services, pods and deployments.  

Now we are ready for FuncX!  Go to https://github.com/funcx-faas/funcX and download the file.   Store it in some place like your home directory C:\Users/You or /home/you.  (I am indebted to Ryan Chard, Zhuozhao Li and Ben Galewsky for walking me through the next few steps!)

First you will need a .funcx directory with a credentials subdirectory containing the file funcx_sdk_tokens.json.

To get this do

>pip install funcx_endpoint

>funcx-endpoint configure

You will be asked to authenticate with Globus Auth. From the funcx.org website:

We require authentication in order to associate endpoints with users and enforce authentication and access control on the endpoint. As part of this step we request access to your identity information (to retrieve your email address) and Globus Groups management. We use Groups information to facilitate sharing of functions and endpoints by checking the Group membership of a group associated with a function.

Once you have completed that, cd to the credentials directory and execute

>kubectl create secret generic funcx-sdk-tokens --from-file=funcx_sdk_tokens.json

and then pick a name for your endpoint and do

>helm repo add funcx   http://funcx.org/funcx-helm-charts/

And then

>helm install yourEndPointname   ./funcx-dev/helm/funcx_endpoint

if you do

>kubectl get pods

you will see yourEndPoint.   Given its pod ID you can do

>kubectl logs YourEndPointPodID

You will find the endpoint uuid in the logs.   Alternatively, you can go to the logs directly on the Kubernetes dashboard.  You are now ready, and you can try the example above to test it out.  To deploy a special container as the worker, such as the container created for the BERT classifier described above, you need a special yaml file of deployment values that points to your container.

Let’s call it Myvalues.yaml. The final helm deployment step is the command

>helm install -f Myvalues.yaml myep4 funcx-dev/helm/funcx_endpoint

You can now grab the endpoint uuid as described above and invoke functions that use the environment contained in the container.

Kubernetes on Azure

If you want to install it on an Azure Kubernetes cluster, you need an Azure account and must first create a resource group and a container registry.   To deploy the cluster, it is best to work from the command-line interface.   You can get the instructions for that here.  

To manage the Azure container registry, go here.  Suppose the container registry is called “myContainers” and the resource group is called “parl_test”.    To create the cluster named “funcx-cluster”  in the Azure region “northcentralus”, do the following.

>az aks create -g parl_test -n funcx-cluster --location northcentralus --attach-acr myContainers --generate-ssh-keys --node-count 5 --node-vm-size Standard_D2s_v3

You can monitor the creation status on the portal.    Next the following steps get you endpoint setup for the cluster

>az aks get-credentials --resource-group parl_test --name funcx-cluster
>cd .\.funcx\credentials\
>kubectl create secret generic funcx-sdk-tokens --from-file=funcx_sdk_tokens.json
>helm repo add funcx http://funcx.org/funcx-helm-charts/
>cd ..
>cd ..
>helm install -f Myvalues.yaml myep4 funcx-dev/helm/funcx_endpoint
>kubectl get pods

Look for the name of the endpoint pod, called myep4-endpoint- followed by a long key.

Grab it and do

 >kubectl logs myep4-endpoint-6f54d9549-6hkmj

At this point you may have two clusters: the local Docker Kubernetes cluster and the one on Azure.  To list them, do

>kubectl config get-clusters

Use “kubectl config current-context” to see which cluster is currently responding to kubectl commands and “kubectl config use-context” to change it.  When you are done, you should delete the deployment for your endpoint; otherwise it will persist as long as your cluster does and will automatically restart after crashes.

Building a Tiny Knowledge Graph with BERT and Graph Convolutions.


Knowledge graphs (KGs) have become an important tool for representing knowledge  and accelerating search tasks.   Formally, a knowledge graph is a graph database formed from entity triples of the form (subject, relation, object) where the subject and object are entity nodes in the graph and the relation defines the edges.  When combined with natural language understanding technology capable of generating these triples from user queries, a knowledge graph can be a fast supplement to the traditional web search methods employed by the search engines.    In this tutorial, we will show how to use Google’s Named Entity Recognition to build a tiny knowledge graph based on articles about scientific topics.  To search the KG  we will use BERT to build vectors from English queries and graph convolutions to optimize the search. The result is no match for the industrial strength KGs from the tech giants, but we hope it helps illustrate some core concepts.


We have explored the topic of KGs in previous articles on this blog.  In “A ‘Chatbot’ for Scientific Research: Part 2 – AI, Knowledge Graphs and BERT.”, we postulated how a KG could be the basis for a smart digital assistant for science. If you search for Knowledge Graph on the web or in Wikipedia, you will learn that the best-known KG is the one introduced by Google in 2012, known simply as “Knowledge Graph”.  It is very large (over 70 billion nodes) and is consulted in a large fraction of searches.   Having the KG available means that a search can quickly surface many related items by looking at nearby nodes linked to the target of the search.   This is illustrated in Figure 1, the result of a Google search for “differential equation”, which displays an information panel to the right of the search results.  


Figure 1.  Google information panel that appears on the right side of the page.  In this case the search was for “differential equation”.  (This image has been shortened a bit.)

The Google KG is extremely general, so it is not as good for all science queries, especially those that clash with popular culture.   For example, if you search for the term that describes the surface of a black hole, an “event horizon”, you get an image from the bad 1997 movie of that name.   Of course, there is also a search engine link to the Wikipedia page describing the real scientific thing.   Among the really giant KGs is the Facebook entity graph, which is nicely described in “Under the Hood: The Entities Graph” by Eric Sun and Venky Iyer in 2013.  The entity graph has 100+ billion connections.  An excellent KG for science topics is the Microsoft Academic graph.  See “Microsoft Academic Graph: When experts are not enough” Wang et.al. 2019, which we will describe in more detail in the last section of this note.   In its earliest form, Google’s KG was based on another KG known as Freebase.  In 2014 Google began the process of shutting down Freebase and moving content to a KG associated with Wikipedia called Wikidata.  We will use Wikidata extensively below.

Wikidata was launched in 2012 with a grant from the Allen Institute, Google and the Gordon and Betty Moore Foundation, and it now has information that is used in 58.4% of all English Wikipedia articles.   Items in Wikidata each have an identifier (the letter Q and a number), and each item has a brief description and a list of alias names.  (For example,  the item for Earth (Q2) has alternative names: Blue Planet, Terra Mater, Terra, Planet Earth, Tellus, Sol III, Gaia, The world, Globe, The Blue Gem, and more.)  Each item has a list of affiliated “statements”, which are the “subject-relation-object” triples that are the heart of the KG.   Relations are predicates and are identified with a P and a number.  For example, Earth is an “instance of” (P31) “inner planet” (Q3504248).  There are currently 68 million items in Wikidata and, like Wikipedia, it can be edited by anyone.

In what follows we will show how to build a tiny knowledge graph for two narrow scientific topics and then, using some simple deep learning techniques, illustrate how we can query the KG and get approximate answers. 

As mentioned above, the ideal way to construct a KG from text is to use NLP methods to extract triples from the text.  In a very nice tutorial by Chris Thornton, this approach is used to extract information from financial news articles.   If the relations in the subject-relation-object triples are rich enough, one may be able to answer questions about the data more accurately.   For example, if the triples are as shown in the graph in Figure 2 below,


Figure 2.  Graph created from the triples “Mary attended Princeton” and “Princeton is located in New Jersey”

one may have a good chance at answering the question “Where has Mary lived?”. There has been a great deal of research on the challenge of building relations for KGs (for example, see Sun, et.al.).     Unfortunately, scientific technical documents are not as easy to parse into useful triples.  Consider this sentence.

The precise patterns prevalent during the Hangenberg Crisis are complicated by several factors, including difficulties in stratigraphic correlation within and between marine and terrestrial settings and the overall paucity of plant remains.

Using AllenAI’s NLP package, we can pull out several triples, including this one:

[ARG1: the precise patterns prevalent during the Hangenberg Crisis]

[V: complicated]

[ARG0: by several factors , including difficulties in stratigraphic correlation within and between marine and terrestrial settings and the overall paucity of plant remains]

But the nodes represented by ARG0 and ARG1 are not very concise and are unlikely to fit into a graph with connections to other nodes, and the verb ‘complicated’ is hard to use when hunting for facts. More sophisticated techniques than we can describe here are required to extract usable triples from documents. Our approach is to use a simpler type of relation in our triples.   

Building our Tiny Knowledge Graph

The approach we take below is to consider scientific documents to be composed of blocks of sentences, such as paragraphs, and to look at the named entities mentioned in each block. If a block has named entities, we create a node for that block, called an Article node, which is connected to a node for each named entity.  In some cases we can learn that a named entity is an instance of an entity class.   In other words, our triples are of the form

(Article  has named-entity)  or (named-entity instance-of  entity-class)

If entities, such as “Hangenberg Crisis”, occur in other blocks from the same paper or other papers we have an indirect connection between the articles. 
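As a sketch of how little machinery these two triple forms need, they can live in plain Python structures (the articles, entities, and class below are illustrative):

```python
from collections import defaultdict

# adjacency structures for the two triple forms:
#   (article, "has", named_entity) and (named_entity, "instance-of", entity_class)
has_entity = defaultdict(set)
instance_of = {}

def add_article(article, entities):
    for e in entities:
        has_entity[article].add(e)

add_article("Art0", {"General Relativity", "Black Holes", "Event Horizon"})
add_article("Art1", {"Einstein field equations", "General Relativity"})
instance_of["General Relativity"] = "physical theory"

# two articles are indirectly connected if they share a named entity
shared = has_entity["Art0"] & has_entity["Art1"]
print(sorted(shared))  # -> ['General Relativity']
```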

To illustrate what the graph looks like consider the following tiny text fragment.  We will use it to generate a graph with two article nodes.

The Theory of General Relativity demonstrates that Black Holes are hidden by an Event Horizon. Soon after publishing the special theory of relativity in 1905, Einstein started thinking about how to incorporate gravity into his new relativistic framework.

In 1907, beginning with a simple thought experiment involving an observer in free fall, he embarked on what would be an eight-year search for a relativistic theory of gravity. After numerous detours and false starts, his work culminated in the presentation to the Prussian Academy of Science in November 1915 of what are now known as the Einstein field equations, which form the core of Einstein’s famous Theory of General Relativity.

Each sentence is sent to the Google Named-Entity Recognition service with the following function.

[Image: Picture3.png]
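The original post shows this function as a screenshot. A minimal sketch of what such a call looks like with the Google Cloud Natural Language API is below; the helper names (`google_entities`, `keep_useful`) are ours, not the post's, and the call requires the google-cloud-language package and configured credentials.

```python
def google_entities(text):
    """Send one sentence to the Google NER service; return (name, wikipedia_url) pairs."""
    from google.cloud import language_v1  # lazy import: needs credentials to actually run
    client = language_v1.LanguageServiceClient()
    doc = {"content": text, "type_": language_v1.Document.Type.PLAIN_TEXT}
    resp = client.analyze_entities(request={"document": doc})
    # Entities Google can link carry a wikipedia_url in their metadata.
    return [(e.name, e.metadata.get("wikipedia_url", "")) for e in resp.entities]

def keep_useful(entities):
    """Drop single-word entities with no Wikipedia link; they are too generic (e.g. 'space')."""
    return [(name, url) for name, url in entities
            if url or len(name.split()) > 1]
```

The filtering step mirrors the policy described below: single nouns are discarded, while linked entities and multiword phrases are kept.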

The NER service responds with two types of entities.   Some are simply nouns or noun phrases, and some are entities that Google NER recognizes as having Wikipedia entries.  In those cases, we also use the Wikipedia API to pull out the Wikidata identifier that is the key into Wikidata.  In our example we see three entity node types in the graph.   The blue entity nodes are the ones that have Wikipedia entries.  The green nodes are noun phrases.  We discard returned entities consisting of a single noun, like “space”, because there are too many of them, but multiword phrases are more likely to suggest technical content that may appear in other documents.  In the case of the Wikidata (blue) nodes, we can use the Wikidata identifier to find out if the entity is an instance of a class in Wikidata.  If so, these classes are retrieved and represented as gray nodes.   The first pair of sentences is used to create an article node labeled ‘Art0’, and the second pair of sentences forms node ‘Art1’.

[Image: Picture4-1024x630.png]

Figure 3.  The rendering of our two-article graph using NetworkX built-in graphing capabilities.

In this case we see that Einstein is an instance of human and an event horizon is an instance of hypersurface (and not a movie).   The elegant way to look for information in Wikidata is to use the SPARQL query service.   The code to pull the instanceOf property (wdt:P31)  of an entity given its wikiID is

[Image: fig2-1024x345.jpg]
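The screenshot above shows that code. A hedged sketch of the same lookup is below; the function names are ours, the query itself is standard Wikidata SPARQL, and actually running it needs the SPARQLWrapper package plus a responsive endpoint (the time-out problem noted below applies).

```python
def instance_of_query(wiki_id):
    """Build a SPARQL query for the wdt:P31 (instance of) classes of a Wikidata item."""
    return """
    SELECT ?classLabel WHERE {
      wd:%s wdt:P31 ?class .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }""" % wiki_id

def run_query(query):
    from SPARQLWrapper import SPARQLWrapper, JSON  # lazy import: needs network access
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["classLabel"]["value"] for b in results["results"]["bindings"]]
```

For example, `run_query(instance_of_query("Q937"))` asks for the classes of Einstein's Wikidata item.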

Unfortunately, we were unable to use this very often because of time-outs with the SPARQL server, so we had to resort to scraping the Wikidata pages.   The data in the graph associated with each entity is reasonably large.   We have a simple utility function showEntity that will display it.   For example,

[Image: fig3-1024x438.jpg]

The full Jupyter notebook to construct this simple graph in figure 3 is called build-simple-graph.ipynb in the repository https://github.com/dbgannon/knowledge-graph.

We will build our tiny KG from 14 short documents which provide samples in the topics climate change, extinction, human-caused extinction, relativity theory, black holes, quantum gravity and cosmology.  In other words, our tiny KG will have a tiny bit of expertise in two specialized topics: relativistic physics and climate-driven extinction.

English Language Queries, Node Embeddings and Graph Convolutions.

Our tiny KG was built with articles about climate change, so it should be able to take a query like ‘The major cause of climate change is increased carbon dioxide levels.’ and respond with the appropriate related items.   The standard way to do this is to take our library of text articles stored in the KG, build a list of sentences (or paragraphs), and then use a document embedding algorithm to map each one to a vector in RN for some large N, so that semantically similar sentences are mapped to nearby vectors.    There are several standard algorithms for sentence embedding.   One is Doc2Vec, and many more have been derived from Transformers like BERT. As we shall see, it is important that we have one vector for each article node in our KG.  In the example illustrated in Figure 3, we used two sentences for each article. Doc2Vec is designed to encode articles of any size, so two is fine.   A newer and more accurate method is based on the BERT Transformer; it is designed for single sentences but also works with short multi-sentence blocks.   To create the BERT sentence embedding mapping we need to first load the pretrained model.

[Image: fig4-1024x89.jpg]

The next step is to use the model to encode all of the sentences in our list.  Once that is done, we create a matrix mar where mar[i] contains the sentence embedding vector for the i-th sentence, normalized to unit length.

[Image: Picture-new2.png]
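The two screenshots above cover loading the model and building the normalized matrix. A sketch of both steps with the sentence-transformers package is below; the model name is one common BERT sentence model, not necessarily the one the post used, and the encoding step needs the package installed.

```python
import numpy as np

def embed_sentences(sentences):
    """Map a list of sentences to a matrix of BERT sentence embeddings."""
    from sentence_transformers import SentenceTransformer  # lazy: large download
    model = SentenceTransformer("bert-base-nli-mean-tokens")
    return np.array(model.encode(sentences))

def normalize_rows(mar):
    """Scale each embedding vector to unit length, giving the post's mar[] matrix."""
    norms = np.linalg.norm(mar, axis=1, keepdims=True)
    return mar / norms
```

With unit-length rows, a dot product between two rows is exactly their cosine similarity, which is what the query step below relies on.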

Now if we have an arbitrary sentence Text and we want to see which sentences are closest to it, we simply encode the Text, normalize it and compute the dot product with all the sentences.   To print the k best fits, starting with the best, we invoke the following function.

[Image: Picture11.png]
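A sketch of that function is below, assuming the normalized matrix mar from the previous step and an encode() function that maps text into the same vector space; the exact signature in the post's notebook may differ.

```python
import numpy as np

def find_best(k, text, mar, sentences, encode):
    """Return the k sentences closest to text, best first, with their scores."""
    v = encode(text)
    v = v / np.linalg.norm(v)
    scores = mar @ v                  # cosine similarity, since rows of mar are unit length
    best = np.argsort(-scores)[:k]    # indices of the k largest dot products
    return [(sentences[i], float(scores[i])) for i in best]
```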

For example,  if we ask a simple question:

Is there a sixth mass extinction?

We can invoke find_best to look for the parts of the KG that are the best fit.

[Image: fig5-1024x606.jpg]

This is a reasonably good answer drawn from our small selection of documents.   However, there are many instances where the response is not as good.   An important thing to note is that we have not used any properties of the graph structure in this computation.

Using the Structure of the Graph:  A Convolutional Approach.

In our previous tutorial “Deep Learning on Graphs” we looked at the graph convolutional network as a way to improve graph node embeddings for classification.   We can do the same thing here, using the structure of the graph to augment the BERT embedding.   First, for each article node x in the graph we collect all its immediate neighbors, where we define an immediate neighbor to be any other article node linked to an entity node shared with x.   We then compute a weighted sum (using a parameter lambda in [0,1]) of the BERT embedding vectors of the neighbors with the BERT embedding of x.   We normalize this new vector, and we have a new embedding matrix mar2.

[Image: Picture13.png]

Figure 4.  Illustrating the convolution operation
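A numpy sketch of this one-layer convolution follows. It assumes neighbor vectors are averaged before mixing (the post does not show whether it sums or averages) and that neighbors[i] lists the indices of article i's immediate neighbors as defined above.

```python
import numpy as np

def graph_convolve(mar, neighbors, lam=0.75):
    """One graph-convolution layer: mix each node's vector with its neighbors', renormalize."""
    mar2 = np.zeros_like(mar)
    for i in range(len(mar)):
        if neighbors[i]:
            nbr = np.mean([mar[j] for j in neighbors[i]], axis=0)
            v = lam * mar[i] + (1.0 - lam) * nbr   # lam weights the node itself
        else:
            v = mar[i]                             # isolated nodes are unchanged
        mar2[i] = v / np.linalg.norm(v)
    return mar2
```

Note that with lam = 1 the layer leaves every vector unchanged, which matches the remark later that lambda = 1 degenerates to no convolution.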

Intuitively, the new embedding captures more of the local properties of the graph. We create a new function find_best2 which we can use for question answering.   There is one problem.   In the original find_best function we convert the query text to a vector using the BERT model encoder, but we have no encoder for the convolved model. (Using the original BERT encoder didn’t work; we tried.)  We decided to use properties of the graph instead.  We ask the Google NER service for all the named entities in our question and compare them to the entities in the graph, searching for the article node that shares the largest number of named entities with the query.  We can then use the convolved mar2 vector for that node as a reasonable encoding of our query.   The problem comes when there is no clear winner in this search.   In that case, we use the article that find_best says is the best fit and use that article’s mar2 vector as our encoding.   Pseudocode for our new find_best2, based on the convolution matrix mar2, is

[Image: Picture14-1024x479.png]

To illustrate how the convolution changes the output consider the following cases.

[Image: fig6-1024x783.jpg]

The first two responses are excellent.    Now applying find_best2 we get the following.

[Image: fig7-1024x729.jpg]

In this case there were 18 article nodes with named entities matching the entity in the text: “carbon dioxide”.   With no clear winner, find_best2 just uses the first value returned by find_best, and after that it selects the next three best based on mar2.  The second response is the same, but the last two are different and arguably a better fit than the results from find_best.

To see a case where find_best2 does not invoke find_best to start, consider the following.

[Image: fig8-1024x555.jpg]
[Image: fig10-1024x785.jpg]

Both fail to address the “who” part of the question (several people who answer it are mentioned in the documents used to construct our KG).  However, the convolutional version find_best2 addresses “solve” and “field equations” better than find_best.

The process of generating the BERT embedding vectors mar[ ] and the result of the convolution transformation mar2[ ] is illustrated in Figure 5 as a sequence of two operators.

BertEncode : list(sentence_1 .. sentence_N) -> R^(N x 768)
GraphConv : R^(N x 768) -> R^(N x 768)

[Image: Picture19.png]

Figure 5.  Going from a list of N sentences to embedding vectors followed by graph convolution.  Additional convolution layers may be applied.

There is no reason to stop with one layer of graph convolutions.    To measure how additional layers impact the performance we set up a simple experiment, computing a score for each invocation of the find_best function.   As can be seen from the results, some of the replies are correct and others are way off.   One way to quantify this is to see how far the responses are from the query.  Our graph was built from 14 documents which provide samples in the topics climate change, extinction, human-caused extinction, relativity theory, black holes, quantum gravity and cosmology.   Allocating each of the 14 documents to an appropriate one of the six topics, we can create a very simple scoring function to compare the documents in the find_best response.  We say score(doc1, docN) = 1 if both doc1 and docN belong to documents associated with the same topic; the score is 0 otherwise.  To arrive at a score for a single find_best invocation, we assume that the first response is likely the most accurate and we compute the score in relation to the remaining responses:

(score(doc1, doc1) + score(doc1, doc2) + score(doc1, doc3) + score(doc1, doc4))/4.

The choice of measuring four responses is arbitrary.   It is usually the case that responses 1 and 2 are good and 3 and 4 may be of lower quality. 

The six topics are  climate, extinction, human-related extinction, black holes, quantum gravity and cosmology. 

Notice that we are not measuring the semantic quality of the responses; we only have a measure of their consistency.  We return to the topic of measuring the quality of responses in the final section of the paper. The algorithm to get a final score for find_best is to simply compute the score for each node in the graph.

[Image: Picture20.png]
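That scoring loop can be sketched directly from the formula above; topic_of, mapping a document id to its topic, is our own name for the allocation described in the text.

```python
def response_score(docs, topic_of):
    """Score one invocation: docs are the returned doc ids, best first."""
    first = topic_of[docs[0]]
    # score(doc1, docI) is 1 when both share a topic; averaged over the responses.
    return sum(1 for d in docs if topic_of[d] == first) / len(docs)

def overall_score(all_responses, topic_of):
    """Average the per-invocation score over every query (one per graph node)."""
    return sum(response_score(r, topic_of) for r in all_responses) / len(all_responses)
```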

We do the same for find_best2 with different numbers of convolution layers.  As can be seen from the chart below, one convolution does improve the performance, and the performance peaks at three convolutional layers.

[Image: Picture21-1024x656.png]

Recall that the convolution operator was defined by a parameter lambda in the range 0 to 1.   We also ran a grid search to find the best value of lambda for 1, 2 and 3 convolutional layers, and found that a value of 0.75 gave reasonable results.   (Note that when lambda = 1 each convolution leaves the embedding unchanged, so the layers degenerate to no layers.)

[Image: Picture22.png]

Plotting the Subgraph of Nodes Involved in a Query

It is an amusing exercise to look at the subgraph of our KG that is involved in the answer to a query.   It is relatively easy to generate the nodes that connect to the article nodes returned by our query function.  We built a simple function to do this.  Here is a sample invocation.

[Image: Picture23-1024x67.png]

The resulting subgraph is shown below.

[Image: Picture24-1024x496.png]

Figure 6.   Subgraph generated by the statement “best-known cause of a mass extinction is an Asteroid impact that killed off the Dinosaurs.”

The subgraph shown in Figure 6 was created as follows.  We first generate an initial subgraph using the article nodes returned from find_best2 or find_best and the entity nodes they are connected to.   We then compute a ‘closure’ of this subgraph by adding all the graph nodes that are connected to nodes in the initial subgraph.
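The two-step construction can be sketched with plain adjacency sets (a simplification of the NetworkX graph the notebook actually uses; the function name is ours).

```python
def closure_subgraph(adj, article_nodes):
    """adj: node -> set of neighbors.  Start from the returned article nodes,
    add the entity nodes they touch, then close over everything connected to those."""
    sub = set(article_nodes)
    for a in article_nodes:            # initial subgraph: articles plus their entities
        sub |= adj[a]
    closed = set(sub)
    for n in sub:                      # one closure step over the initial subgraph
        closed |= adj[n]
    return closed
```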

Looking at the subgraph we can see interesting features.  An amusing exercise is to follow a path from one node in the graph to another and see what the connections tell us.   The results can form an interesting “story”.   We wrote a simple path-following algorithm.

[Image: Picture25.png]
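A path follower in the same spirit can be written as a breadth-first search over the adjacency sets; this is our own sketch, not the notebook's code, and it returns one shortest path between the two nodes.

```python
from collections import deque

def find_path(adj, start, goal):
    """Return a shortest path from start to goal in the subgraph, or None."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:    # walk the predecessor chain back to start
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in adj[node]:
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    return None
```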

Invoking this with the path from node Art166 to Art188

[Image: Picture26.png]

gives the following.

[Image: fig11-1024x768.jpg]

Trying another path

[Image: Picture28.png]

yields the following.

[Image: fig12-1024x895.jpg]

There is another, perhaps more interesting, reason to look at the subgraph.  This time we use the initial subgraph prior to the ‘closure’ operation.   If we look at the query

[Image: Picture30-1024x54.png]

The results include the output from find_best2 which are:

[Image: fig13-1024x680.jpg]

You will notice that only two responses seem to talk about dark energy.   Art186 is related, but Art52 is off topic.  Looking at the graph, we see why.

[Image: Picture32-1024x499.png]

Figure 7.  Subgraph for “What is dark energy”

There are three connected components, and one is clearly the “dark energy” component.   This suggests that a further refinement of the query processing could be to  filter out connected components that are “off topic”. 

We conducted a simple experiment.   In calls to find_best2(4, text) we searched the ten best responses and eliminated those that were not in the same connected component as the first response.   Unfortunately, this sometimes resulted in fewer than k responses, but the average score was now 83%.
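The filtering step of that experiment reduces to a one-line check once each node's connected component is known; component_of, mapping a node to its component id, is our own stand-in for whatever component labeling the notebook computes.

```python
def filter_by_component(responses, component_of):
    """Keep only the responses in the same connected component as the top response."""
    top = component_of[responses[0]]
    return [r for r in responses if component_of[r] == top]
```

As noted above, this can return fewer than k items when most candidates fall in other components.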


In this little tutorial we have illustrated how to build a simple knowledge graph based on a few scientific documents.    The graph was built by using Google’s Named Entity Recognition service to create simple entity nodes.   Article nodes were created from the sentences in the documents.   We then used BERT sentence embeddings to enable a basic English language query capability to pull out relevant nodes in the graph.   Next we showed how we can optimize the BERT embedding by applying a graph convolutional transformation.

The knowledge graph described here is obviously a toy.   The important questions are how well the ideas here scale and how accurate such a query answering system can be when the graph is massive.  Years of research and countless hours of engineering have gone into the construction of the state-of-the-art KGs.   In the case of science, these are Google Scholar and Microsoft Academic.   Both of these KGs are built around whole documents as the basic node.   In the case of the Microsoft Academic graph, there are about 250 million documents and only about 36% come from traditional journals; the rest come from a variety of sources discovered by Bing.   In addition to article nodes, other entities in the graph are authors, affiliations, concepts (fields of study), journals, conferences and venues.  (See “A Review of Microsoft Academic Services for Science of Science Studies”, Wang et al., for an excellent overview.)  They use a variety of techniques beyond simple NER, including reinforcement learning, to improve the quality of the graph entities and the connections.   This involves metrics like eigencentrality and statistical saliency to measure the quality of the tuples and nodes.   They have a system, MAKES, that transforms user queries into queries for the KG.

If you want to scale from 17 documents in your database to 1,700,000, you will need better infrastructure than Python and NetworkX.  In terms of the basic graph-building infrastructure there are some good choices.

  • Parallel Boost Graph Library.  Designed for graph processing on supercomputers using MPI communication. 
  • Pregel: “A System for Large-Scale Graph Processing”, developed by Google for their cloud.
  • GraphX: A Resilient Distributed Graph System on Spark.
  • AWS Neptune is fully managed, which means that database management tasks like hardware provisioning, software patching, setup, configuration, and backups are taken care of for you.  Building KGs with Neptune is described here.  Gremlin and SPARQL are the APIs.
  • JanusGraph is an open-source graph database that is supported on Google’s Bigtable.
  • Neo4J is a commercial product that is well supported, and it has a SQL-like API.
  • Azure Cosmos DB is Azure’s globally distributed, extreme-scale database that supports graphs spread across planet-scale resources.

If we turn to the query processing challenge: given the approach we took in our toy KG, where document nodes were created from as little as a single sentence, it is obvious why the Microsoft KG and Google Scholar focus on entire documents as the basic unit.  For the Microsoft graph, a variety of techniques are used to extract “factoids” from documents that are part of the concept and taxonomy hierarchy in the graph.  This is what the MAKES system searches to answer queries.

It is unclear whether the convolutional operator applied to BERT sentence embeddings described here would have any value when applied to the task of representing knowledge at the scale of MAG or Google Scholar.   It may be of value in more specialized, domain-specific KGs.   In the future we will experiment with increasing the scale of the graph.

All of the code for the examples in this article is in the repository https://github.com/dbgannon/knowledge-graph.

Deep Learning on Graphs (a Tutorial)


This tutorial gives an overview of some of the basic work that has been done over the last five years on the application of deep learning techniques to data represented as graphs.  Convolutional neural networks and transformers have been instrumental in the progress on computer vision and natural language understanding.    Here we look at generalizations of these methods to solving problems where the data is represented as a graph.   We illustrate this with examples including predicting research topics using the Microsoft co-author graph and the more heterogeneous ACM author-paper-venue citation graph.   This latter case is of interest because it allows us to discuss how these techniques can be applied to the massive heterogeneous knowledge networks being developed and used by search engines and smart, interactive digital assistants.   Finally, we look at how knowledge is represented by families of graphs.   The example we use here is from the Tox21 dataset of chemical compounds and their interactions with important biological pathways and targets.


Graphs have long been used to describe the relationships between discrete items of information.  A basic graph consists of a set of nodes and a set of edges that connect pairs of nodes.   We can annotate the nodes and edges with the information that gives the graph meaning.   For example, the nodes may be people and the edges may be the relationships (married, child, parent) that exist between them.   Or the nodes may be cities and the edges the roads that connect them.   Graphs like knowledge networks tend to be heterogeneous: they have more than one type of node and more than one type of edge.   We are also interested in families of related graphs.  For example, chemical compounds can be described by graphs where the nodes are atoms and the edges represent the chemical bonds between them.

The most famous deep learning successes involve computer vision tasks, such as recognizing objects in two-dimensional images, or natural language tasks, such as understanding linear strings of text.   But these can also be described in graph terms.  Images are grids of pixels: the nodes are individual pixels and the edges are neighbor relations.   A string of text is a one-dimensional graph whose nodes represent words and whose edges are the predecessor/successor temporal relationships between them.   One of the most important tools in image understanding is the convolutional network layer, which exploits the local neighborhood of each pixel to create new views of the data.  In the case of text analysis, Transformers manipulate templates related to the local context of each word in the string.  As we shall see, the same concepts of locality are essential to many of the graph deep learning algorithms that have been developed.

The Basics

Tools to manage graphs have been around for a long time.   Systems like Amazon’s Neptune and Azure CosmosDB scale to very large graphs.  The W3C standard Resource Description Framework (RDF) model and its standard query language, SPARQL are commonly used and well supported.  Apache Tinkerpop is based on the Gremlin graph traversal language.  The excellent Boost Graph Library is based on  C++ generics.

These pages rely heavily on the tools developed to manage graphs in Python: more specifically, NetworkX, originally developed by Aric Hagberg, Dan Schult and Pieter Swart, and the Deep Graph Library (DGL), led by Minjie Wang and Da Zheng from AWS and the Amazon AI Lab, Shanghai, with a large team of collaborators (see this paper).   These two packages are complementary and interoperate nicely.   NetworkX is an excellent tool for building graphs, traversing them with standard algorithms and displaying them.  DGL is designed to integrate Torch deep learning methods with data stored in graph form.   Most of our examples are derived from the excellent DGL tutorials.

To begin, let’s build a simple graph with 5 nodes and a list of edges stored in a file ‘edge_list_short.txt’.  (The complete notebook is stored in the archive as basics-of-graphs.ipynb.)  We will label the nodes with their node numbers.
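A minimal sketch of that build step is below; the helper name and the property values are ours, but the NetworkX calls are the standard ones, and the same loop patterns cover the "weight" and "value" property assignment discussed shortly.

```python
import networkx as nx

def build_graph(edge_lines):
    """Build a directed graph from lines of the form 'src dst'."""
    G = nx.DiGraph()
    for line in edge_lines:
        src, dst = line.split()
        G.add_edge(int(src), int(dst))
    # Label each node with its own number, as in the post.
    for n in G.nodes:
        G.nodes[n]["label"] = n
    # An edge property, set by iterating over the edges (here a placeholder weight).
    for u, v in G.edges:
        G.edges[u, v]["weight"] = 1.0
    return G
```

With a real file one would pass `open('edge_list_short.txt')` as edge_lines, then call `nx.draw(G)` to render it.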

In the figure below we have the content of the edge list file on the left and the output of the nx.draw() on the right.

Figure 1. A simple NetworkX graph on the right derived from the short edge list on the left.

As you can see, this is a directed graph.  There is an edge from 4 to 0 but not one from 0 to 4.   NetworkX has a very extensive library of standard algorithms, for example shortest-path computation and graph cycle detection.

Graph nodes and edges can have properties that are identified with a property name.   For example, we can add an edge property called “weight” by iterating through the edges and, for each, assigning a value to the property.   In a similar manner we can assign a “value” to each node.

DGL graphs

It is easy to go back and forth between a NetworkX graph and a DGL graph.   Starting with our NetworkX graph G above, we can create a DGL graph dG as follows.

Notice that the properties “weight” and “value” are translated over to the DGL graph as Torch tensors.  We can add node properties directly or by using DGL functions like out_degrees and in_degrees.   Here we assign the value 1 to a property called “one” on each node, and the in-degree of each node to a property called “deg”.

DGL Message Passing.

One of the most important low-level features of DGL is the process by which messages are passed from one node to another.  This is how we implement operations like convolutions and transformers on graphs.

Messages are associated with edges.   Each message takes a value from the source node of the edge and delivers it to the “mailbox” of the destination of the edge.   A reduce function then empties each mailbox, does a computation, and modifies a property of the node.   For example, the following message function just sends 2 times the value “one” from the source of each edge to the destination.  After the messages are sent, each node’s mailbox (labeled ‘x’) holds a message containing the value 2 from each node that has an outgoing edge to it.  The reduce function adds up all of those 2’s (which totals twice the in-degree of the node) and then adds the in-degree again, so the value is now 3 times the in-degree of the node.

To execute these functions, we simply register the functions with our graph and invoke them with broadcast send and receive messages (with an implicit barrier between the sends and receives.)

A shorthand for the send, receive pair is just dG.update_all().   The notebook for the examples above is basics-of-graphs.ipynb in the GitHub archive.
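The send/reduce pattern can be simulated in pure Python, which makes the arithmetic in the example above concrete (this is an illustration of the idea, not DGL's actual API).

```python
def update_all(nodes, edges):
    """Simulate DGL message passing: each edge delivers 2 * src['one'] to the
    destination's mailbox; the reduce step sets x = sum(mailbox) + in_degree."""
    mailbox = {n: [] for n in nodes}
    for src, dst in edges:                 # message function, one message per edge
        mailbox[dst].append(2 * nodes[src]["one"])
    for n in nodes:                        # reduce function, one pass per node
        in_deg = len(mailbox[n])
        nodes[n]["x"] = sum(mailbox[n]) + in_deg   # = 3 * in_degree when one == 1
    return nodes
```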

The MS co-author Graph

Microsoft provided a graph based on their Microsoft Academic Graph from the KDD Cup 2016 challenge.  It is a graph of 18333 authors. Edges represent coauthorship: two nodes have an edge if the authors co-authored a paper.  Node features represent paper keywords for each author’s papers, and class labels indicate each author’s most active field of study.  There are 6805 keywords.  The class labels are ‘bioinformatics’, ‘machine learning’, ‘computer vision’, ‘NLP’, ‘graphics’, ‘networks’, ‘security’, ‘databases’, ‘datamining’, ‘game theory’, ‘HCI’, ‘information theory’, ‘medical informatics’, ‘robotics’, ‘theoretical_cs’.   There are two ways to get at the data.   One way is from the DGL data library.

The other way is to pull the raw data collection.

There are two parts of the cofull file that are important to us.   One is the array class_names and the other is the array of attributes which are the text phrases associated with each paper.

As mentioned above, the author node label is an assignment of the most likely topical research area based on publications.   In the following section we will use a graph convolutional network to predict the label.

The co-author graph coau has a feature vector “feat”  that describes the features for each node.  This alone should be enough to predict the label.

For each node this is a vector of length 6805 with a non-zero entry for each feature attribute associated with that author. 

Using the cofull[‘attr_names’] list we can show what these attributes are as text.   It is an interesting exercise to see what attributes are common to neighboring nodes.   In the following we convert our coau graph to a NetworkX graph and look at the neighbors of node 0.   Then we extract the features that several of the neighbors share with node 0. Since the feature vectors are binary tensors, multiplying them together elementwise gives the intersection of the common keywords.

The way to read this result is that author 0  has co-authored papers with authors 5111, 12716 and 12963.  Collectively they share the following phrases in their research profile.
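The elementwise-multiply trick can be sketched as follows, with attr_names standing in for the cofull['attr_names'] list.

```python
import numpy as np

def shared_attributes(feat_a, feat_b, attr_names):
    """Return the attribute names two authors have in common.
    feat_a and feat_b are binary keyword-indicator vectors."""
    product = feat_a * feat_b            # 1 only where both authors have the keyword
    return [attr_names[i] for i in np.nonzero(product)[0]]
```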

We can create a new feature vector for each author node by extracting the list of keywords from the ‘feat’ vector and constructing a Gensim tagged document.

Then, using Gensim’s Doc2Vec, we can create a model with an embedding vector of length 100 for each of these tagged documents.  These are saved in our graph in a node data feature “e-vector”.

The document model does a reasonable job of separating the nodes by class.  To illustrate this we can use TSNE to project the 100-dimensional space into 2 dimensions.   This is shown below for a subset of the data consisting of the nodes associated with the topics ‘bioinformatics’, ‘machine learning’, ‘robotics’, ‘theoretical cs’, ‘networks’, ‘security’ and ‘databases’.  The nodes associated with each topic are plotted with the same color.  As you can see, the topics are fairly well separated and clustered together.   We will use this 100-dimensional feature vector in the convolutional network we construct below.

The full details of the construction of the graph, the Doc2Vec node embedding and this visualization are in the notebook fun-with-cojauthors.ipynb

Figure 2.  Subset of MS co-author graph projected with TSNE.

The Graph Convolutional Network

In computer vision applications, convolutional neural networks are based on applying filters to small local regions around each pixel.  In the simplest case the template just involves the nearest neighbors of the central pixel, and the template operation is a dot product with the pixel values of these neighbors.

Figure 3.   Convolution operations in an image and on a graph.

As shown in the figure, the graph version is the same, and we can use the DGL message system to do the computation.     The standard DGL graph convolutional layer is shown below.

We now create a network with three GCN layers with the first layer of size 100 by 50 because 100 is the size of our new embedded feature vector we constructed with Doc2vec above.  The second layer is 50 by 32 and the third is 32 by 15 because 15 is the number of classes.
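A numpy sketch of that three-layer forward pass is below. It is a simplification of the DGL/Torch version in the notebook: A stands for the (row-normalized, self-loop-augmented) adjacency matrix, each layer mixes neighbor features with A @ H and applies a learned projection W, and the shapes follow the text: 100 -> 50 -> 32 -> 15.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def gcn_forward(A, H, W1, W2, W3):
    """Three GCN layers: each propagates features over edges, then projects them."""
    H = relu(A @ H @ W1)    # layer 1: feature dim 100 -> 50
    H = relu(A @ H @ W2)    # layer 2: 50 -> 32
    return A @ H @ W3       # layer 3: 32 -> 15 logits, one per class
```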

We employ a relu activation between the GCN layers.   To train the network we divide the nodes into a training set and a test set by building a Boolean vector that is True when the node is in the test set and False otherwise.   To evaluate the network on the test and training sets we invoke the following.

The evaluate function first applies the network to the graph with the entire feature set.  This results in a tensor of size 18333 by 15, i.e. a vector of length 15 for each graph node.  The index of the maximum value of this vector is the predicted class index for the node.

To train the network we use a conventional Adam optimizer.   We extract the logits array from the network, compute the log of the softmax along the length-15 axis, and compute the loss using the negative log-likelihood loss function.   The rest of the training is completely ordinary PyTorch training.

The training converged with an accuracy of 92.4%.    Evaluating it on the test_mask gave a value of 86%, which is not great, but it does reflect the fact that the original graph labels are also a “subjective” valuation of each author’s primary research interest.

Another way to look at the results is to plot a confusion matrix as shown below 

Figure 4.   Confusion Matrix for prediction of topics by means of Graph Convolution Network

It is clear from looking at this that the label datamining is easily confused with databases, and the label machine-learning is often confused with computer vision and datamining.    In the notebook co-author-gnc.ipynb we show how to create this matrix.

Heterogeneous Graphs

Knowledge Graphs are the basis of modern search engines and smart assistants.  Knowledge Graphs are examples of directed heterogeneous graphs, where nodes are not all of one type and edges represent relations.   For example, the statement “Joan likes basketball” might connect the person node “Joan” to the sports node “basketball”, and the edge from Joan to basketball would be “likes”.

A really nice and often studied example is the ACM publication dataset.   This is a collection of 12499 papers and 17431 authors and 37055 links between them.   This is still small enough to load and run on your laptop.   (We will discuss a much larger example later.)

The dataset consists of a set of tables and sparse matrices.   The tables are ‘A’ for authors,  ‘C’ for conferences, ‘P’ for papers, ‘T’ for subjects (the ACM classification), ‘V’ for proceedings and ‘F’ for institutions.  The sparse matrices encode relations: ‘PvsA’ links each paper to its authors,  ‘PvsP’ records one paper citing another, and ‘PvsL’ links each paper to its topic.   Downloading and creating a DGL heterograph for a subset of this data is simple.

We have created a graph with three node types: authors, papers and subjects as illustrated in Figure 5 below.

Figure 5. Heterogeneous Graph with nodes for Subject, Paper, Author and links of various types.

The edge types are the link keywords in the triple that is used to identify the edges.   If we want to find the name of an author node we have to do a search in the data table.  That is easy enough.  The notebook for this example has such a trivial function:
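The lookup might look like this trivial sketch (we assume the author table is an array of name strings indexed by node id; the actual table structure in the downloaded data may differ):

```python
def author_name(author_table, node_id):
    # look up the name string stored at the author's node index
    return str(author_table[node_id])
```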

We can use the link labels to pull out a subgraph.  For example, the subgraph “authors writing paper” can be extracted by

We can then ask which papers by Dongarra are included in the dataset by looking for the out edges from node 5100.  Given the id at the end of an out edge, we can get the abstract of the paper (truncated here).

Having this information, we can use the written-by subgraph to get the names of the coauthors.

Or the subject of the paper.

Where C.4 is the cryptic ACM code for performance of systems.

Node classification with the heterogeneous ACM graph.

The classification task will be to match conference papers with the name of the conference each appeared in.   That is, given a paper that appeared in a conference, we train the network to identify the conference.   We have this information in the data[‘PvsC’] component of the dataset, but it was not included in our graph.   There are two solutions we have considered.   One is based on transformers and is described in the excellent paper “Heterogeneous Graph Transformer” by Ziniu Hu, et al.    A notebook version of this approach applied to the ACM graph is called hetero-attention.ipynb in the Github archive.   It is based on the code in their Github archive.  

In the following paragraphs we will describe the version based on “Modeling Relational Data with Graph Convolutional Networks” from Schlichtkrull et al., which is included in the DGL tutorials.   The idea here is a version of the graph convolutional network.  Recall that in a graph convolution each node receives a message from its neighbors and we then form a reduction sum of those messages. However, for a heterogeneous graph things are a bit different.   First, we are interested in only one node type (paper), but each paper has several incoming edge types:  author-writing-paper,  paper-citing-paper, and subject-has-paper.  Hence, to use all the information available we must consider incoming messages from other papers as well as from authors and subjects.  For each edge type r we have a fully connected trainable layer W.  Then for each node i and each edge type r we compute

where h_j^(l) is the message from the j-th r-neighbor and the superscript (l) refers to the l-th layer of the network.   Now we sum that over each edge type and apply an activation function to get
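The two equations, which appeared as images in the original post, can be reconstructed from Schlichtkrull et al. as:

```latex
h_{i,r}^{(l)} = \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}}\, W_r^{(l)} h_j^{(l)},
\qquad
h_i^{(l+1)} = \sigma\!\left( \sum_{r \in R} h_{i,r}^{(l)} \right)
```

where c_{i,r} is a normalization constant, for example the number of r-neighbors of node i.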

The code for this layer is below.
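The notebook's layer follows DGL's HeteroRGCN tutorial. As a DGL-free illustration of the same math, here is a sketch in plain PyTorch using a dense, normalized adjacency matrix per edge type (this simplification is ours, not the notebook's code):

```python
import torch
import torch.nn as nn

class HeteroGCNLayer(nn.Module):
    def __init__(self, in_size, out_size, etypes):
        super().__init__()
        # one fully connected trainable layer W_r for each edge type r
        self.weight = nn.ModuleDict(
            {etype: nn.Linear(in_size, out_size) for etype in etypes})

    def forward(self, adj_dict, feat_dict):
        # adj_dict: edge type -> (src node type, dst node type, adjacency matrix)
        # feat_dict: node type -> feature tensor
        out = {}
        for etype, (src, dst, adj) in adj_dict.items():
            msg = adj @ self.weight[etype](feat_dict[src])  # messages from r-neighbors
            out[dst] = out.get(dst, 0) + msg                # sum over edge types
        return {ntype: torch.relu(h) for ntype, h in out.items()}
```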

Notice that the forward method requires a feature dictionary keyed by node type because there are different numbers of nodes of each type.   We construct the full network using two HeteroGCN layers as follows.

We can create our model with the following.

Where we have 4 output logits corresponding to the 4 conferences we are interested in (SOSP, SODA, SIGCOMM, VLDB).  We next select the rows of “PvsC” that correspond to these 4 conferences and build a train and test set from them.  (The details are in our notebook GCN_hetero.ipynb.)  The core of the training is quite simple and conventional.

Notice that the convolutional operations are applied across the entire graph,  but we only compute the loss on those nodes associated with our 4 selected conferences. The algorithm converges very quickly and the accuracy on the test set is 87%.   The confusion matrix shown below shows that SIGCOMM is harder to differentiate from the others, but the size of our dataset is extremely small.

As mentioned above, we have run the same computation with the heterogeneous graph transformer algorithm.   In this case the computation took much longer, but the results were slightly better as shown in the confusion matrix below.

The graphs we have presented as examples here are all small enough to run on a laptop without a GPU.  However, the techniques presented here are well suited to very large graphs.   For example, Amazon's AWS team has built a large-scale graph based on Wikimedia and the Kensho Derived Dataset, which is based on over 141 billion wikidata statements and 51 billion wikidata items.   The AWS team focuses on graph embeddings at scale (see this blog) for this dataset. They have a notebook showing how to access and use this data.  They also use it for the Drug Repurposing Knowledge Graph (DRKG) to show which drugs can be repurposed to fight COVID-19.

Graph Classification:  The Tox21 Challenge

In the case of life science, Amazon has also contributed the package DGL-LifeSci (dgllife).  To illustrate it, we will load data associated with the Tox21 challenge from the US National Institutes of Health.   This data challenge studies the potential of the chemicals and compounds tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects.  The dataset consists of 7831 chemical compounds that have been run through a compound toxicity screening classification on Nuclear Receptor Signaling (NR) and Stress Response (SR) involving 12 different assay targets: Androgen Receptor (AR, AR-LBD), Aryl Hydrocarbon Receptor (AhR), Estrogen Receptor (ER, ER-LBD), Aromatase Inhibitors (aromatase), Peroxisome Proliferator-activated Receptor gamma (ppar-gamma), Antioxidant Response Element (ARE), luciferase-tagged ATAD5 (ATAD5), Heat Shock Response (HSE), Mitochondrial Membrane Potential (MMP), and Agonists of the P53 Signaling Pathway (P53).  The table below shows, for each molecule, the results of the target tests.  The data is not complete, so in some cases the results are NaNs and must be excluded.   As you can see below, for molecule 0 there was a positive response involving receptors NR-AhR and SR-ARE, and the Aromatase and NR_ER data was bad. 

Table 1.

We can load the full dataset, which includes the data above and a full description of each compound.  In addition, we can load a pretrained graph convolutional network that attempts to predict the toxicology screen described in the table above.  

To extract the data associated with compound 0, we simply query the dataset.  This returns a chemical description string (known as a SMILES string), a DGL graph g, a label corresponding to the row in the table above, and a mask which tells us which of the NaN target values to ignore.

In this case the graph indicates that the chemical has 16 atoms and 34 bonds.   Associated with each atom is a vector of 74  values.  The meaning of these values is given below.

Using the smiles string and a small helper function in our notebook dgllife.ipynb called display_mol, we can see what our compound looks like.

We can also draw the graph by converting it to a networkx graph that is isomorphic to the SMILES graph.

As we can see from Table 1, positions 3 and 4 are no good, so using the mask we can apply the model to classify our compound.

If we compare the result to the label, which holds the correct values, we see that the strongest signal is in position 5, which, after dropping the two masked positions, corresponds to position 7 in the label; this is correct. We should also get a strong signal in position 3, but the model gives an even stronger signal in position 4.

Let’s look at a different compound: #7827 in the data set.  This one has 23 atoms and 53 bonds.  

In this case, when we run the model we get the following.

If we score the predictions so that positive numbers indicate positive reactions (1) and negative numbers the opposite (0), then the predicted values are positive in the first two positions, negative in the next two, and positive in the following two.   These are all correct except for the last positive.   The remaining three unmasked values are negative, which is correct.  Out of the 9 unmasked values 8 are correct, so we can give this one a “score” of 89%.   If we apply this criterion to the entire dataset we get a “score” of 83%.   We did not train this model and we don't know what fraction of the data was used in the training, so this number does not mean much.


The tutorial above just scratches the surface of the work that has been done on deep learning on graphs.  Among the topics currently under investigation is managing the updates of dynamic graphs. Of course, the most interesting topic these days is the construction and use of large knowledge graphs.  The big challenge there is to use a knowledge graph to do question answering.  An interesting example is Octavian’s clevr-graph, which can answer questions about the knowledge graph built from the London Underground.     Extracting knowledge from KGs involves understanding how to extract key phrases from English queries and how to map those phrases to KG nodes and edges.   This is a non-trivial task and there is a lot of work going on. 

All of the notebooks for the examples described here are in Github https://github.com/dbgannon/graphs

Some Additional References

The following are research papers and blogs that I found interesting in writing this tutorial.

Predicting Tomorrow’s Temperature with RNNs and Gaussian Processes


This note provides a gentle introduction to streaming data regression and prediction using Recurrent Neural Networks and Gaussian processes.  We look at two examples.  In the first we look at the daily high temperature over two years at three different, but nearby, NOAA weather stations. In the second example we look at daily confirmed Coronavirus infections in several US states.  This study shows that if you have a predictable natural pattern that changes in predictable ways from season to season, one can make reasonable predictions; but in the case of complex phenomena such as the Covid-19 pandemic, simple models do not always work well.

In a previous post we looked at anomaly detection in streaming data using sophisticated cloud-based tools, but here we use simple “classic” ML tools that you can run on a laptop.  We expect the seasoned data scientist will not get much out of this article, but if the reader is not familiar with recurrent networks or Gaussian processes this may be a reasonable introduction.  At the very least, if you are interested in studying the Coronavirus infection data, we show you how to search it using Google’s BigQuery service.

Building a Very Simple LSTM Recurrent Neural Network

Recurrent Neural Networks were invented to capture the dynamic temporal behavior in streams of data.  They provided some early success in areas related to natural language understanding, but in that area they have been superseded by Transformers, which we discussed in an earlier article.   We will construct an extremely simple recurrent network and ask it to predict the temperature tomorrow given the temperature for the last few days.

We will train it on the average daily temperature over 2 years measured at three nearby weather stations in the Skagit valley in Washington state.   The data comes from the Global Summary of the Day (GSOD) weather from the National Oceanographic and Atmospheric Administration (NOAA) for 9,000 weather stations.  The three stations are Bellingham Intl, Padilla Bay Reserve and Skagit Regional.  The data is available in Google’s BigQuery service named “bigquery-public-data.noaa_gsod.stations”. (We will show how to search BigQuery in the next section.)

Averaging the daily temperature for 6 years (2 years each for 3 stations) we get data like Figure 1 below.


                           Figure 1.  Average daily high temperature for 2 years at 3 stations.

RNNs work by having a “memory” state tensor that encodes the sequence of inputs that have been seen so far.  The input to the RNN is a word or signal, along with the state of the system based on words or signals seen so far; the output is a predicted value and a new state of the system, as shown in figure 2 below.


Figure 2.  Basic Recurrent Neural network with input stream x and output stream h

Many variations of the basic RNN exist. One challenge for RNNs is ensuring that the state tensors retain enough long-term memory of the sequence so that patterns are remembered. Several approaches have been used for this purpose.  One popular method is the Long Short-Term Memory (LSTM) version, which is defined by the following equations, where the input sequence is x, the output is h and the state vector is the pair [c, h].
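The equations (rendered as an image in the original post) are the standard LSTM update rules; the notation here follows common references and may differ slightly from the original figure:

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```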


Where sigma is the sigmoid function and W is the learned tensor.   We are not going to use these equations explicitly because PyTorch has a built-in version that we will use.  All we need to do is provide the dimension of the input (which is a sequence of scalar values, so that is 1) and the dimension of the state; c_h is a tuple of tensors of size (arbitrarily chosen here to be) 100. We also have a linear layer that will map the final value of h down to the one-dimensional output.  Our simple LSTM neural network is
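A sketch of such a network, following the common PyTorch LSTM tutorial pattern (hidden size 100 as in the text; the class and attribute names are our assumptions):

```python
import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=100, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.linear = nn.Linear(hidden_size, output_size)
        # the state pair [c, h], reset before each new sequence
        self.hidden_cell = (torch.zeros(1, 1, hidden_size),
                            torch.zeros(1, 1, hidden_size))

    def forward(self, input_seq):
        out, self.hidden_cell = self.lstm(
            input_seq.view(len(input_seq), 1, -1), self.hidden_cell)
        predictions = self.linear(out.view(len(input_seq), -1))
        return predictions[-1]   # the predicted next value in the sequence
```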


It is important to understand the role of the input x as a sequence.   We want to train this network so that if we set x to be a sequence of consecutive daily high temperatures [t1, t2, … tn] then the forward method will return [tn+1].  To accomplish this training task we build a training set by scanning through the weather record and creating tuples consisting of a sequence of length tw and the next-day temperature as a training label. This is accomplished with this function
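A sketch of that windowing function (the name and signature follow the common tutorial pattern; `tw` is the training window length):

```python
def create_inout_sequences(input_data, tw):
    """Pair each window of tw consecutive values with the next value as label."""
    inout_seq = []
    for i in range(len(input_data) - tw):
        train_seq = input_data[i:i + tw]
        train_label = input_data[i + tw:i + tw + 1]
        inout_seq.append((train_seq, train_label))
    return inout_seq
```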


From this we create a list of such tuples called train_inout_seq and the training loop takes the form
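The loop might look like this sketch (we assume the model keeps its LSTM state in a `hidden_cell` attribute that is reset before each sequence; the names are illustrative):

```python
import torch
import torch.nn as nn

def train_model(model, train_inout_seq, epochs=10, lr=0.001):
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for seq, label in train_inout_seq:
            optimizer.zero_grad()
            # reset the [c, h] state before each new sequence
            model.hidden_cell = (torch.zeros(1, 1, model.hidden_size),
                                 torch.zeros(1, 1, model.hidden_size))
            y_pred = model(seq)
            loss = loss_fn(y_pred, label)
            loss.backward()
            optimizer.step()
    return loss.item()
```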


The complete details are in the notebook lstm-for-stream-final in the Github repository.  This was trained on the average year and on one of the six yearly records.   The results are shown below.  The first two are the years that were training cases.  The original data is printed in blue and the predicted data is printed in orange.   As you can see, virtually no blue shows.  The network has memorized the training data almost perfectly, with an average daily error of less than 0.23 degrees Fahrenheit.



Figure 2.  The results for the average of all the data (Figure 1 above) and for one of the individual stations. The LSTM network has memorized the training data.

In Figure 3 below we show the results when the network is applied to the weather records of two of the other stations.   In these cases, the results are not very impressive.  In both cases the average error was over 3.5 degrees, and it was greater than 10 degrees for more than a dozen days.  However, the predictions for one day ahead did track the general trends.   It looks like it was able to predict today's temperature better than tomorrow's.



Figure 3.   Predicting tomorrow's temperature on two other station records.

Doing Regression and Prediction with Gaussian Processes

Before we define Gaussian Processes let us point to Christopher Bishop’s amazing book “Pattern Recognition and Machine Learning” (Springer 2006) for a complete treatment of the subject.  We will only provide a superficial introduction here.   For on-line resources there is the excellent blog on the subject by Peter Roelants.   We will use many of the code bits from that blog in what follows.  Another fun on-line source for learning about Gaussian Processes is the blog A Visual Exploration of Gaussian Processes by Görtler, Kehlbeck and Deussen.

In the simplest terms, a Gaussian Process is a statistical distribution of functions with some special properties.   In our case, the functions will represent the time evolution of stochastic processes. For example, the temperature at some location as a function of time,  or the number of daily infections of a virus in a community, or the random walk of a particle suspended in a fluid.

The distribution that defines a Gaussian Process is characterized by a mean function u(x) and a covariance function k(x, x′).   A function f(x) drawn from this distribution, which is written


has the property that when we pick a finite set of time points X = { x1 … xn }, which we view as n random variables, the values y = f(X) are normally distributed in a multivariate Gaussian distribution with mean u(X) and a covariance matrix given by the kernel function k(X, X).  Written another way,
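In symbols (the two formulas appeared as images in the original post; this reconstruction follows standard Gaussian-process notation):

```latex
f(x) \sim \mathcal{GP}\big(u(x),\, k(x, x')\big),
\qquad
y = f(X) \sim \mathcal{N}\big(u(X),\, k(X, X)\big)
```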


Given the covariance kernel function k() and mean function u(), we can use this multivariate distribution to visualize what functions drawn from the Gaussian process look like.   Let us pick 300 points on the interval [0,2] and a specific kernel (which we will describe later) and a mean function with constant value 1.   The following numpy function will allow us to draw 10 sample functions.
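A sketch of drawing such sample functions from the Gaussian process prior; the kernel and its length scale here are illustrative assumptions:

```python
import numpy as np

def exponentiated_quadratic(xa, xb, ell=0.3):
    # k(x, x') = exp(-(x - x')^2 / (2 ell^2))
    return np.exp(-((xa[:, None] - xb[None, :]) ** 2) / (2 * ell ** 2))

def draw_samples(n_points=300, n_samples=10, mean_value=1.0):
    X = np.linspace(0, 2, n_points)          # 300 points on [0, 2]
    mu = mean_value * np.ones(n_points)      # constant mean function = 1
    Sigma = exponentiated_quadratic(X, X)    # covariance from the kernel
    return X, np.random.multivariate_normal(mu, Sigma, size=n_samples)
```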


As shown in Figure 4 below, they appear to be like random walks, but they also appear to be not only continuous but also smooth curves.  That is because nearby points on the x axis correspond to highly correlated random variables due to the choice of k().  If we had set Σ to be the identity matrix, the variables at neighboring points would be independent random variables and the path would look like noise.  (We will use that fact below.)


Figure 4. 10 different functions drawn from the Gaussian process.

Now for the interesting part.  What if we have some prior knowledge of the values of y for a sample of x points?   We can then ask: what is the distribution of the functions given that knowledge?

View the n points on the time axis as n random variables.   Partition them into two sets X1 and X2, where we suppose we have values Y1 for the X1 variables.  We can then ask for the posterior distribution p(Y2 | Y1 , X1 , X2 ).  Reordering the variables so that X1 and X2 are contiguous, the equation takes the form




One can prove that our conditional probability distribution p(Y2 | Y1 , X1 , X2) is also a multivariate normal distribution, described by the formulas
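The formulas (shown as an image in the original) are the standard Gaussian conditioning identities:

```latex
\mu_{2|1} = u_2 + \Sigma_{21} \Sigma_{11}^{-1} (y_1 - u_1),
\qquad
\Sigma_{2|1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}
```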


The proof of this is non-trivial.  See this post for details.   The good news is that we can calculate this if we know the prior kernel function k() and mean m().  Picking these functions is a bit of an art.  The usual approach is to take m(x) = 0, so that u2 and u1 in the formulas above vanish.   Picking the kernel is often done by forming a linear combination of well-known standard kernel functions and then formulating a hyper-parameter optimization problem to select the best combination.

To illustrate this, we can return to the weather station data.   We have two years of data from three nearby stations.    We note two properties of the data we must exploit: it is noisy and approximately periodic with a period of 365 days.   We will not bother with the optimization and will instead take a straightforward linear combination of three standard kernels.


The first of these is the exponentiated quadratic, which is a very good default kernel.   The second is the white noise kernel, where the parameter sigma gives the standard deviation of the noise we see in the data.  The third is the periodic kernel; if we map our 365 days onto the unit interval we can set p = 1.   Our kernel of choice (chosen without optimization, but because it seems to work o.k.) is
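A numpy sketch of the three kernels and the combination described above (parameter defaults are illustrative; the tuned values are in the notebook):

```python
import numpy as np

def exponentiated_quadratic(xa, xb, ell=1.0):
    # k(x, x') = exp(-(x - x')^2 / (2 ell^2))
    return np.exp(-((xa[:, None] - xb[None, :]) ** 2) / (2 * ell ** 2))

def white_noise(xa, xb, sigma=0.3):
    # sigma^2 where the same point is compared with itself, zero elsewhere
    return sigma ** 2 * (xa[:, None] == xb[None, :]).astype(float)

def periodic(xa, xb, p=1.0, ell=1.0):
    # periodic kernel with period p; mapping 365 days onto [0, 1] lets us set p = 1
    return np.exp(-2 * np.sin(np.pi * np.abs(xa[:, None] - xb[None, :]) / p) ** 2 / ell ** 2)

def kernel(xa, xb, sigma_noise=0.3):
    # the linear combination used in the text: k_eq + k_p plus a white-noise term
    return exponentiated_quadratic(xa, xb) + periodic(xa, xb) + white_noise(xa, xb, sigma_noise)
```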


Where for the first two terms we have set sigma to one, and we pick the sigma for the noise term to best fit the data at hand.   The figure below illustrates the result of using the average of the six instrument-years as the raw (prior) data.   We then select 46 points in the first 230 days (spaced 5 days apart) as our X1 days.

In the figure, the red dots are those points and the red line is the conditional mean function u2|1.  Three additional lines (blue, green and yellow) are sample functions drawn from the posterior.  The pink zone is two standard deviations in the prediction.   We also calculated the error as the average difference between the mean prediction and the raw data.   For this example, that average error was 3.17 degrees Fahrenheit.  The mean function does a reasonable job of predicting the last 130 days of the year.


Figure 5.   Graph of the raw data,  the mean conditional u2|1 (red line), and three additional functions (blue, yellow and green) drawn from the posterior.

The full details of the computation are in the Jupyter notebook “Gaussian-processes-temps-periodic”,  but the critical function is the one that computes p(Y2 | Y1 , X1 , X2); it is shown below (it is taken from Roelants' blog).
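A sketch of that function, following the version in Roelants' blog (the variable names are his; `sigma_noise` adds the white-noise term to Σ11):

```python
import numpy as np
import scipy.linalg

def GP(X1, y1, X2, kernel_func, sigma_noise=0.3):
    """Posterior mean and covariance p(Y2 | Y1, X1, X2) of a Gaussian process."""
    Sigma11 = kernel_func(X1, X1) + sigma_noise ** 2 * np.eye(len(X1))
    Sigma12 = kernel_func(X1, X2)
    Sigma22 = kernel_func(X2, X2)
    # solve Sigma11 @ Z = Sigma12; then Z.T equals Sigma21 @ inv(Sigma11)
    solved = scipy.linalg.solve(Sigma11, Sigma12, assume_a="pos").T
    mu2 = solved @ y1                      # posterior mean
    Sigma2 = Sigma22 - solved @ Sigma12    # posterior covariance
    return mu2, Sigma2
```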


In this case we invoked it with kernel_function set to keq + kp.   Sigma_noise was 0.3.   The clever part of this code is the use of a standard linear algebra solver to solve for Z in this equation


But because Sigma11 is symmetric and the transpose of Sigma12 is Sigma21 we have
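Reconstructed in LaTeX, the two omitted equations read:

```latex
\Sigma_{11} Z = \Sigma_{12} \;\Rightarrow\; Z = \Sigma_{11}^{-1}\Sigma_{12},
\qquad
Z^{\top} = \Sigma_{12}^{\top} \big(\Sigma_{11}^{-1}\big)^{\top} = \Sigma_{21}\Sigma_{11}^{-1}
```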


Once you have that the rest of the computation is accomplished with the matrix multiply (@) operator.

In the notebook Gaussian-process-temps-periodic in the Github repository you can see the Gaussian processes for the six year samples.

The Coronavirus Data

Another interesting source of data is the daily confirmed cases of coronavirus infections in various states.   We shall see that the troubling recent growth rate is so large that it is very hard for our Gaussian process models to make predictions based on recent past samples.  However, we thought it would be of value to illustrate how to obtain this data and work with it.

The Covid-19 data is in the Google cloud, uploaded from the New York Times. To access it you must have a Google cloud account, which is free for simple first-time use.   We will run Google's BigQuery to extract the data, and we will run it through a client in a Jupyter notebook.    You will need to install the BigQuery libraries.   A good set of instructions is here.    To use Jupyter go here.  You will need to add the JSON file containing your service account key to your environment variables, as described here.  Finally, install the local libraries with this command on your machine.

pip install --upgrade google-cloud-bigquery[pandas]

First load the bigquery library and create the client in a Jupyter notebook with the following


There are a number of covid19 data sets available on BigQuery.   The one we will use is the New York Times collection.   The following query will request the data for the state of Washington, load it into a Pandas dataframe and print it.
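A sketch of such a query (the dataset name `bigquery-public-data.covid19_nyt.us_states` and its column names follow Google's public dataset catalog; we wrap the query in a function taking an already-created `bigquery.Client`, which is our own packaging, not necessarily the notebook's):

```python
def fetch_state_cases(client, state_name="Washington"):
    """Query the NYT covid19 public dataset on BigQuery and return a
    pandas DataFrame; `client` is a google.cloud.bigquery.Client."""
    query = f"""
        SELECT date, confirmed_cases, deaths
        FROM `bigquery-public-data.covid19_nyt.us_states`
        WHERE state_name = '{state_name}'
        ORDER BY date
    """
    return client.query(query).to_dataframe()
```

With a configured client, `df = fetch_state_cases(client)` then returns the Washington data ready for printing or plotting.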



In our notebook bigquery-Covid we have the code that extracts the number of cases per day so that we can fit the Gaussian process to it.    That data is stored in the array ar_wash.   We attempted to make predictions with a sample every 9 days until the last 10 days.  Because of the large range of the data, we scale it down by a factor of 1000.   The result is shown below.   The function make_gaussian is the same one we used for the weather station data, except that the kernel is only the exponentiated quadratic.


As can be seen, the mean function (red line) captures features of the last 10 days reasonably well.   Looking at New York, we see similar results, but the fit for the last few days is not as good.


Where we fail most spectacularly is for those states that experienced a wave of new cases in the first week of July.  Here is Florida.


Changing the prediction window to the last 3 days does a bit better.  But 3 days is not much of a prediction window.


However, it is clear that a much more complex process is going on in Florida than is captured by this simple Gaussian process model.   The approach presented here is not a proper infectious disease model such as those from Johns Hopkins, IHME and other universities.   Those models are far more sophisticated and take into account many factors, including social and human behavior, living conditions, and intervention strategies.


As was pointed out in the introduction, this is a very superficial look at the problem of predicting the behavior of streaming data.  We looked at two approaches: one that focuses on accurate prediction of the next event using neural networks, and one that attempts to capture long-range statistical behavior using Gaussian process models.  The neural net model was able to learn the temperature patterns in the training data very well, but on test data it was much less accurate, with an average error of about 3 or 4 degrees Fahrenheit per day.   (This is about as good as my local weather person.)   On the other hand, the Gaussian process made very good long-range (over 100 days) predictions with only a small number of sample points.  This was possible because the Gaussian process model works well for patterns that have reasonably predictable cycles, such as the weather.   However, the Gaussian process failed to capture changes in the more complex scenario of a viral infection, where the dynamics change because of social and human behavior or by evolutionary means.

If the reader is interested in extracting the data from Google’s BigQuery,  we have included the detail here and in the notebooks in the repository https://github.com/dbgannon/predicting_streams.