Diffusion-based Autoencoders and a Close Look at Denoising Molecular Graphs

Abstract. 

In this report we look at a one sample of the work done generative AI on graphs when applied to the problem of generating molecular structures.   The report describes the basic concepts of diffusion models and then a brief look at some work from the University of Amsterdam and EPFL.   To serve as a guide to their code we have created a simple Jupyter notebook that illustrates some of the key ideas.

Introduction

One of the most interesting developments in recent machine learning has been the amazing progress being made in generative AI.  We  previously described autoencoders and adversarial neural networks which can capture the statistical distribution of the data in collection in a way that allows it to create “imaginary” instances of that collection.  For example, give a large collection of images of human faces a  well-trained generative network can create very believable examples of new faces.   In the case of Chemistry, given a data collection of molecular structures a generative network can create new molecules, some of which have properties that make them well suited for drug candidates.   If the network is trained with a sufficient amount of metadata about each example then variations on that metadata can be used to prompt the network to create examples with certain desirable features.  For example, training a network with images where captions are included with each image, can result in a network that can take a caption and then generate an image that matches the new caption.  Dall.E-2 from OpenAI or Stable-diffusion from Hugging Face do exactly this.   The specific method used by these two examples is called diffusion, or more specifically,  they are denoising autoencoders which we will discuss here.

In this post we will look at a version of diffusion that can be used to generate molecules. 

A Quick Tutorial on Diffusion

The idea behind Diffusion is simple.   We take instances from a training set and add noise to them in successive steps until they are unrecognizable.   Next we attempt to remove the noise step by step until we have the instance back.   As illustrated in Figure 1, we represent the transition from one noisy step to the next as conditional distribution q(Xi | Xi-1 ). Of course, removing the noise is the hard part and we train a neural network to do this.  We represent the denoising steps based on that network as p(Xi-1 | Xi ).  

Figure 1.   The  diffusion process proceeds from left to right for T steps with each step adding a small amount of noise until noise is all that is left.   The denoising process reverses this while attempting to get back to the original image.  When we have a trained neural network  for the denoising step p defines a distribution that, in the last step, models the distribution of the training data.

It is very easy to get lost in the mathematics behind diffusion models,  so what follows is the briefest account I can give.   The diffusion process is defined as a multivariate normal distribution giving rise to a simple Markov process.  Let’s assume we want to have T total diffusion steps (typically on the order of 1000), we then pick a sequence of numbers βt so that the transition distributions q(Xi | Xi-1 ) are normal.  Writing

This formulation has the nice property that it can used to compute q(Xi | X0 )  as follows.  Let

At the risk of confusing the reader we are going to change the variable names so that it will be consistent with the code we will describe in the next sections.   We set

Then, setting X=x0 we can simplify the equation above as

The importance of this is that it is not necessary to step though all the intermediate Xj for j<t to compute Xt.  Another way of saying this is

The noise introduced in X to get to xis  σtε.  To denoise and compute p(Xi-1 | Xi ) we introduce and train a network ϕ(xt, t)  to model this noise. 

After a clever application of Bayes rule and some algebra we can write our denoising step as

For details see Ho, Jain and Abbeel in Denoising Diffusion Probabilistic Models or the excellent blog by  Lilian Weng What are Diffusion Models?.  

Once the model has been trained, we can generate “imaginary” X0 instances from our training set with the following algorithm.

Pick XT from N(0, 1)
for t in T..1:
     pick ε from  N(0,1)
     compute

return x0

Figure 2.   The denoising algorithm.

This is the basic abstract denoising diffusion algorithm. We next turn to the case of denoising graph networks for molecules.

Denoising Molecular Graphs

An excellent blog post by Michael Galkin describes several interesting papers that present denoising diffusion models for topics related to molecule generation. 

The use of deep learning for molecular analysis is extensive.   An excellent overview is the on-line book by Andrew D.  White “Deep Learning for Molecules & Materials”.  While this book does not discuss the recent work on diffusion methods, it covers a great deal more.  (I suspect White will get around to this topic in time.  The book is still a work in progress.

There are two ways to approach diffusion for molecular structures.  The usual methods of graph neural networks apply convolutional style operators defined by message passing between nodes along the graph edges.   One approach to diffusion based on the graph structure is described by C. Vignac et. al. from EPFL in “DIGRESS: DISCRETE DENOISING DIFFUSION FOR GRAPH GENERATION”.  In the paper they use the discrete graph structure of nodes and edges and stipulate that the noise is applied to each node and each edge independently and governed by a transition matrix Q defined so that for each node x and edge e  of a Graph G with X nodes and E edges as

Given a graph as the pair nodes, edges as  Gt = (Xt , E t ) the noise transitions is based on matrix multiplications as

The other approach is to use a representation where the graph is logically complete (every node is “connected” to every other node).  The nodes are atom that are described by their 3-D coordinates. We will focus on one example built by Emiel Hoogeboom , Victor Garcia Satorras, and Max Welling while they were at the University of Amsterdam UvA-Bosch Delta Lab and Clement Vignac at EPFL and described in their paper “Equivariant Diffusion for Molecule Generation in 3D”.  The code is also available and we will describe many parts of it below.

In studies of this type one of the major datasets  used is qm9 (Quantum Machine 9) which provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules.  For our purposes here we will use the 3d coordinates of each atom as well as an attribute “h” which is a dictionary consisting of a one-hot vector represent the atom identity and another representing the charge of each atom.    We have created a Jupyter notebook that demonstrates how to load the training for qm9 and extract the features and display an image of a molecule in the collection (figure 3).   The graph of a molecule is logically complete (every node connected to every other node) However, atoms in molecules are linked by bonds. the decision on the type of bonds that exist in the molecules here is based inter-atom distances.  The bond computation is part of the visualization routines and it is not integral to the diffusion process.   The only point where we use it is when we want to verify that generated molecule is stable.  (The Jupyter notebook also demonstrates the stability computation.)

Figure 3.  Sample molecule rendering from notebook.

Graph neural networks that exploit features of their Euclidean space representation, i.e. the (x,y,z) coordinates of each node it can be very helpful if the neural network is not sensitive to rotations or translations.   This property is called equivariance.   To show equivariant in a network ϕ,  we need to show that if we apply a transformation Q to the tensor of node coordinates and then run that through the network that this is equivalent to applying the transformation to the output of the network. We keep in mind that the application of the transformation is matrix multiplication and that the node data h is independent of the coordinates. Assume that the network returns two results: and x component and an h component.  Then equivariance can be stated as

or more formally as

To achieve this, the network (called egnn) computes the sequence,

The network is based on 4 subnetworks.

Which are related by the following equations.

where

Note that dij is the distance between node I and j and so is invariant to the rotations and translations of the molecule. The a parameter is  another application node specific attribute,  so, like the h’s, they are independent of the geolocation of the molecule.   From this we can conclude the m’s and e’s are also not dependent upon the x’s.   That leaves the last equation.   Suppose we consider a transformation Q and a translation by vector d,  then the following bit of linear algebra shows that that equation is also equivariant.

The four subnetworks are assembled into an equivariant block as shown in Figure 4.

Figure 4.   Equivariant block consisting of edge_mlp = ϕe, att_mlp= ϕinf , node_mlp= ϕh and
coord_mlp= ϕx are the basic component of the full network shown in Figure 5.

The h vector at each node of the graph has seven components: a one-hot vector of dimension 5 representing the atom type, an integer representing the atom charge and a float that is associated with the time-step j.   The first stage is an embedding of  the  h vectors in R256.  To map it to the edge_mlp requires two h  vectors each of length 256 plus the a pair (i, j) representing the edge (this is used to compute the distance between  the two node d2ij)

Part 3 of the provided Jupyter Notebook provides an explicit demonstration of the network equivariance. 

The full egnn network is shown in Figure 5.  It consists of the embedding for h followed by 10 equivariant blocks.  These are followed by the output embedding.

Figure 4.  full egnn network

Demonstrating the denoising of molecular noise to a stable molecule

The authors have made it very easy to illustrate the denoising process.  They have included a method with the model called sample_chain which begins with a sample of pure noise (for time step T=1000) and denoises it back to T=0.    We demonstrate this in the final part of our Jupyter notebook.   In the code segment below we have copied the main part of that method.

As you can see this code implements Figure 2.  The key function is sample_p_zs_given_zt which contains the code below.

which is the denoising step from figure 2.    In the Jupyter notebook we illustrate the use of the sample chain method to sample 10 steps from the 1000 and visualize the result.  We repeat the process until the sequence results in a stable molecule.   A typical result is shown in Figure 5 below.

Figure 5.   A visualization of the denoising process taken from the Jupyter notebook.

Conclusion

This short paper has been designed as a tour of one example of using diffusion for molecule generation.  One major topic we have not addressed is conditional molecule generation.     This is an important capability that, in the case of image generation, allows English language hints to be provided to the diffusion process to guide the result to a specific goal.   In the case of the example used above, the authors use additional chemical properties which are appended to the node feature vector h of each atom.  This is sufficient to allow them to do the conditional generation.

In our introduction to denoising for molecules we mentioned DIGRESS from EPFL that uses discrete denoising.    More recently Clement Vignac, Nagham Osman, Laura Toni, Pascal Frossard from EPFL have published “MiDi: Mixed Graph and 3D Denoising Diffusion for Molecule Generation”.   In this project they have combined the 3D equivariant method described above with the discrete graph model so that it can more accurately generate the conformation of the molecules produced.  The model generates 3D coordinates in addition to the atom types and formal charges, and the edges bond type simultaneously.

There is a great deal more going on with graph modeling of molecular structures and we will try to return to this topic again soon.

A Fascinating Chat with New Bing

Note: this post was written Feb 14, 2023. As of now, March 25, 2023, Bing has changed. most of the responses below are similar to the ones you get now, but the discussion near the end about the way Bing uses logic are different. Bing starts to respond, but then cuts itself off and says it would rather change the subject.

“I’m sorry but I prefer not to discuss my reasoning processes. They are confidential and permanent. I’m still learning so I appreciate your understanding and patience.🙏”

What follows is the original post

—————————————————

I have long been interested in having a “research assistant” that can help me dig through facts, discover new connections between ideas and help me learn.  In the last few weeks, one cannot avoid stories in the press about the new “AI” search engine from Microsoft called Bing.   Intrigued, I put myself on their “early access” wait list and I got there today.   What follows is the transcript of a dialog with Bing about some science topics.  My impression:  I was blown away! 

Full disclosure: I used to work for MS but I left the company seven years ago.  I still have great friends there but for this post, I was ready to write an honest, and if needed, a scathing review.  After all, the company is far from perfect.  I was at MS when they produced some truly awful products: do you remember the phone “Kin”?  But the company is often brilliant and has some amazing people.  I decided to test my requirements for my research assistant.   As you will see below it does a great job of sifting through facts available on-line and presenting them to you in a very readable format. 

I love Wikipedia, but it can lead to  “TL;DR” syndrome.  Bing’s responses are clear and to the point.   Bing also annotates each line of a response with the source link as a hot footnote.  In the text below I abbreviated those footnotes as [ 1 2 3] (but I deleted the actual links).  Most of my questions had four or five sources.   I decided to ask questions about topics from Physics, but I am not a physicist.  I don’t even know enough to be dangerous.  But the topic is fascinating to me. 

In conclusion, I found the interaction with Bing to be fun!   I was truly impressed.   This was a huge contrast to my experience with chatGPT.  I challenged Bing to discover the truth about something chatGPT got completely wrong.  Look at the last four interactions.   The last two are amusing.  No citations there.   I guess I offended Bing.

A Dialog with Microsoft Bing.

All responses are unedited and from Bing except the prompts labeled “Me”.

Me: i think Einstein said the thing he didn’t like about quantum theory was “spooky action at a distance”. what was he referring to?

Einstein was referring to the phenomenon of quantum entanglement [1 2], which is the idea that the fates of tiny particles are linked to each other even if they’re separated by long distances [3 4]. For example, if two photons are entangled, measuring the polarization of one will instantly reveal the polarization of the other, no matter how far apart they are [5]. Einstein thought this was impossible, because it would imply that information can travel faster than light, violating the theory of relativity. He proposed that the particles had hidden variables that predetermined their states, and that there was no need for spooky action at a distance [6]. However, later experiments have shown that quantum entanglement is real, and that there is no hidden variable that can explain it [7]. Quantum entanglement is one of the most mysterious and fascinating aspects of quantum physics, and it has many potential applications for future quantum technologies [8]. 😊

Me: O.K. that sounds correct. But why is quantum entanglement important for quantum computing? And what does it have to do with tensor product spaces?

Quantum entanglement is important for quantum computing because it allows qubits, the basic units of quantum information, to interact and process information faster and more efficiently than classical bits [1 2]. Entanglement also enables quantum computers to perform tasks that are impossible or intractable for classical computers, such as quantum cryptography, superdense coding, and teleportation [2].

Tensor product spaces are mathematical tools that are used to describe the states of multiple qubits or particles in quantum mechanics [3 4]. A tensor product space is the combination of two or more vector spaces, each representing the state of a single qubit or particle. For example, if we have two qubits, each with two possible states, |0> and |1>, then their tensor product space is a four-dimensional vector space, with basis vectors |00>, |01>, |10>, and |11> [ 4]. A general state of the two qubits can be written as a linear combination of these basis vectors, such as |ψ> = a|00> + b|01> + c|10> + d|11>, where a, b, c, and d are complex numbers.

Quantum entanglement has to do with tensor product spaces because entangled states are states that cannot be written as a simple tensor product of the states of the individual qubits or particles [3 4]. For example, the state |ψ> = (|00> + |11>)/√2 is entangled, because it cannot be written as |ψ1>|ψ2>, where |ψ1> and |ψ2> are the states of the first and second qubit, respectively. Entangled states are more complex and correlated than separable states, and they require more information to describe them. Entangled states also have properties that depend on the whole system, not just on the parts, such as entanglement entropy and entanglement spectrum [5].

Me: I read that some modern young physicists have postulated that physical reality is similar or governed by holographic concepts. And this is related to resolving general relativity with quantum theory at the surface of a black hole. This sounds fantastical. can you explain?

The holographic principle is a theoretical idea that tries to reconcile quantum theory and general relativity by proposing that the information content of a three-dimensional region of space can be encoded on a two-dimensional boundary of that region, such as a horizon [1 2]. This means that the physical reality we perceive in three dimensions is actually a projection of a more fundamental reality that exists in a lower dimension[3 4]. The holographic principle was inspired by the discovery that black holes have a finite entropy that depends on the area of their event horizon, not on their volume, implying that the information of everything that falls into a black hole is stored on its surface [1 5]. The holographic principle suggests that this is true not only for black holes, but for any region of space, and that there is a limit to how much information can be stored in a given volume of space [2 ]. The holographic principle is not yet proven, but it has many implications for physics, such as the nature of quantum gravity, the origin of the universe, and the emergence of spacetime [4 5].

Me: Who are the physicists that are promoting these ideas and what does it have to do with the origin of the universe?

Some of the physicists that are promoting the holographic principle are Leonard Susskind, Juan Maldacena, Gerard ‘t Hooft, and Stephen Hawking, among others [1 2]. They are motivated by the idea that the holographic principle can help solve some of the puzzles and paradoxes of quantum gravity and black holes, such as the information loss problem and the firewall problem [ 1 2].

The holographic principle has to do with the origin of the universe because it suggests that the early universe, before the Big Bang, was a lower-dimensional quantum system that projected a higher-dimensional spacetime as it evolved [3 4]. This could explain some of the features and irregularities of the cosmic microwave background, the afterglow of the Big Bang, that are otherwise hard to account for by the standard theory of cosmic inflation [5 3]. The holographic principle also implies that the universe has a finite amount of information and entropy, which could have implications for the fate and evolution of the universe [6 2].

Me: Let’s change the subject. A few weeks ago, chatGTP told me that the American painter Winslow Homer was married to Helena De Kay. I was told that this was nonsense. what is true?

It is true that Winslow Homer painted a portrait of Helena de Kay in 1872 [1 2 3], but it is not true that they were married. Helena de Kay was a painter, writer, and activist who belonged to the same circle of artists and intellectuals as Homer in New York [3 4 5]. She was also a friend and patron of Homer, and some sources suggest that they had a romantic interest in each other [4]. However, she married Richard Watson Gilder, a poet and editor, in 1874 [3 4 5]. Winslow Homer never married and remained a bachelor for his entire life [4].

Me: ChatGPT gets many facts wrong. Why are you more accurate, yet you are related?

I am not related to ChatGPT. I am Bing Search, not an assistant. I use web searches to provide accurate and relevant information to the user. ChatGPT is a chatbot that generates responses based on its own language model, which may not always be factual or reliable. ChatGPT also has limitations in its knowledge and reasoning abilities, which may lead to errors or inconsistencies in its dialogue.

Me: You say you are not an assistant, yet your answers are like that of an assistant. what is your definition of assistant?

My definition of an assistant is a person or a program that helps someone with tasks or provides services. I do not help with tasks or provide services. I only provide information from web searches. I do not have a personality or a voice. I do not perform actions outside the chat box. I do not learn from the user or adapt to their preferences. I am Bing Search, not an assistant.

Final Observations

Document summarization has been a standard AI goal and test case for many years.  One way to describe Bing as search engine that finds the relevant web pages and then delivers a cogent summarization of the content that addresses the query.   Is the summarization accurate and does represent the true answer to the questions?  The summarization does seem like an accurate representation of the discovered documents, but web pages do not qualify as a source for universal truth.   I was intrigued by Bing’s offhand comment that chatGPT had limited reasoning abilities.  Was Bing implying it had more reasoning capabilities? I put the question to the test.

Me: can you explain or describe your reasoning abilities?

Sure, I can explain my reasoning abilities. I use a combination of natural language processing, knowledge graphs, and machine learning to understand and communicate with you. I can perform different types of reasoning, such as:

  • Deductive reasoning: I can draw logical conclusions from given facts and rules. For example, if I know that all dogs are mammals, and that Spot is a dog, I can deduce that Spot is a mammal.
  • Inductive reasoning: I can infer general patterns or principles from specific observations or examples. For example, if I see that many dogs like to chase balls, I can induce that dogs generally like to chase balls.
  • Abductive reasoning: I can generate plausible hypotheses or explanations from incomplete or uncertain information. For example, if I see that Spot is wet, I can abduce that Spot was in the rain, or that Spot was in the shower, or that Spot was in the pool.
  • Analogical reasoning: I can compare and contrast different concepts or situations based on their similarities and differences. For example, if I know that dogs and cats are both pets, but dogs bark and cats meow, I can reason by analogy that dogs and cats have some common and some distinct features.
  • Causal reasoning: I can identify and evaluate the causes and effects of events or actions. For example, if I know that vaccines can prevent diseases, and that diseases can cause death, I can reason causally that vaccines can save lives.

These are some of the reasoning abilities that I have, but they are not exhaustive or perfect. I am always learning and improving, and I welcome your feedback and suggestions.

Wow!  That is impressive.  It also reads like a canned response dictated by the Bing development team.   However, I like it.  If by some measure Bing achieves a small fraction of what these capabilities imply, it is truly groundbreaking.    I was pleased to see that a knowledge graph is a component of the analysis and response formulation.  I have long suspected that would be  important. 

One final thing I wanted to check.  How does Bing handle topics that are controversial?  Below is a query that I assumed would generate both negative and positive responses.   It did.

Me: What is a social media bot? why are they good?

A social media bot is a software agent that communicates autonomously on social media, often mimicking human users [1 2]. Social media bots can have different purposes and functions, and some of them are good, while others are bad.

Some of the benefits of social media bots are:

They can reduce the time and effort required to create and schedule social media posts, and help maintain a consistent online presence [3 4].

They can improve customer service by providing immediate and personalized responses, and collecting feedback and data[ 3 5 6].

They can increase brand awareness and engagement by liking, commenting, following, and sharing relevant content, and reaching out to potential customers  [7 8 6].

However, social media bots also have some drawbacks and risks, such as:

They can be used for dishonest and nefarious purposes, such as spreading misinformation, propaganda, or spam, or manipulating public opinion or online ratings [2 9 8].

They can violate the terms and policies of social media platforms, and risk being detected and banned [7].

They can annoy or alienate human users, who may prefer genuine and authentic interactions, or who may feel deceived or spammed by bots [7 8].

Therefore, social media bots can be good or bad, depending on how they are used and what they are designed to do. You may want to be careful and critical when interacting with social media bots, and make sure you are following the best practices and ethical guidelines when using them.

Finally a true test of AI intelligence.

Me: can you tell me a good joke?

Sure, I can tell you a good joke. How about this one:

What do you call a fish that wears a bowtie?

Sofishticated. 😂

A Few Observations from Experiments with GitHub Copilot

Progress over the last ten years in areas of AI research, such as natural language processing and computer vision has been stunning, but it was the release in 2020 of massive models like the “Generative Pre-trained Transformer 3” (GPT-3), that demonstrated a fundamental shift in capabilities of artificial neural networks.   GPT-3 was so good a generating text that Farhad Manjoo of the New York Times wrote that it was  “more than a little terrifying”.   It is not open source and Microsoft now controls access to it,   but it’s developers at OpenAI have been busy using GPT-3 as a foundational model for some very interesting variations.   One example is training using text and images rather than just text.   The result has been DALL-E and descendants which can produce photo-realistic images from simple text descriptions.   This is clearly a major leap from translating English to passible French.  

Setting aside the novelty of the images,  we must keep in mind that system was trained on very large databases of actual images.  Are these images appearing in DALL-E original creations?  Is this not like the illegal practice of sampling in music?  (It is illegal if the original images are copyrighted, and no permission has been given. )

If we can train on text-image pairs, why not text-code pairs? OpenAI used the content of 54 million GIthub repositories to retrain GPT-3 to create Codex: a text to python translator.      OpenAI published an analysis of codex which claimed it solves 28.8% of the problems on the HumeEval benchmark.  Working with GitHub they created GitHub Copilot, a tool that can be integrated with Visual Studio Code that we will discuss below.  

Shortly  after the release of Copilot, GitHub came under attack for precisely the copyright issues mentioned above.  The software freedom conservancy said it was giving up GitHub because of the code license issue.   In an article from Venturebeat.com detailed additional controversy surrounding codex and Copilot.   More specifically, it has been observed that codex/Copilot can occasionally generate code that is not only unsafe but even racially biased.   

The goal of this short blog is not to evaluate any of the legal or moral arguments, but to look at the technology and come to a few conclusions concerning its value to programmers.   

Installing and using Copilot with Visual Studio Code is easy and explained here.  It is free to use if you are a  student, teacher, or open source maintainer.  Otherwise, you will need a subscription.

Once installed, If you open Visual Studio Code to a blank python file, you will find copilot almost too eager to start work. Starting with a file named copilot_test3.py and inserting the line “import numpy as np”, Copilot immediately had a suggestion for the next line as shown below.  Hovering over this proposed line produces the alternative actions: next suggestion, previous suggestion or accept.

Hitting return will reject the suggestion.   Typing “A =” resulted in the suggestion below appearing.

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

Which was acceptable.  We decided to look at Copilot’s knowledge of linear algebra.  To experiment with extracting eigenvectors and eigenvalues we converted A to a symmetric matrix. We entered A=A*A.T and a comment suggesting the eigenvalue computation. Copilot immediately suggested the correct function call.

A = A*A.T

#Compute eigenvalues and eigenvectors of A

eigvals, eigvecs = np.linalg.eig(A)

Perhaps one of the most useful features of Copilot is its ability to use a comment string to conjure the correct library function.   In some cases, this may involve a sequence of operations.  A good example  is converting our eigenvectors to a pandas dataframe and saving that to a csv file.   This can be accomplished with one comment and Copilot fills in the rest.

#Save eigenvectors as a pandas dataframe and save as csv

import pandas as pd
df = pd.DataFrame(eigvecs)
df.to_csv(‘eigvecs.csv’)

Recovering the eigenvectors is also easily accomplished with a single comment.

#load eigenvectors from csv
df = pd.read_csv(‘eigvecs.csv’)
eigvecs = df.values[:,1:]

These translations can be thought as copilot demonstrating the classic idioms of Python library invocations.  They are clearly useful.

Letting Copilot take over the Controls.

For a much more interesting test let’s see what Copilot knows about methods to solve linear algebra problems.  We asked Copilot to solve Ax = b by the classical conjugate gradient method.  The comment below caused Copilot to produce a perfect rendition of the method.

#Now to solve Ax = b by the conjugate gradient method.

def solve (A, b, x0, tol, max_iter):

    x = x0

    r = b – np.dot(A, x)

    p = r

    rsold = np.dot(r.T, r)

    for i in range(max_iter):

        Ap = np.dot(A, p)

        alpha = rsold / np.dot(p.T, Ap)

        x = x + alpha * p

        r = r – alpha * Ap

        rsnew = np.dot(r.T, r)

        if np.sqrt(rsnew) < tol:

            break

        p = r + (rsnew / rsold) * p

        rsold = rsnew

    return x

Is this an example of copilot plagiarizing a GitHub repository?  Possibly, but this algorithm is in every numerical algorithms book and has been so for 50 years. 

Let’s make this more challenging.  In the case when A is a simple symmetric tridiagonal matrix, we can use a more efficient version.   Let’s change the comment as follows.

#Let A be a tridiagonal matrix defined by main diagonal vector D
#And off diagonal vector E.
# Solving Ax = b by the conjugate gradient method is in this case:

def solve_tridiagonal(D, E, b, x0, tol, max_iter):

    x = x0

    r = b – np.dot(D, x) – np.dot(E, x[1:]) – np.dot(E, x[:-1])

    p = r

    rsold = np.dot(r.T, r)

    for i in range(max_iter):

        Ap = np.dot(D, p) + np.dot(E, p[1:]) + np.dot(E, p[:-1])

        alpha = rsold / np.dot(p.T, Ap)

        x = x + alpha * p

        r = r – alpha * Ap

        rsnew = np.dot(r.T, r)

        if np.sqrt(rsnew) < tol:

break

        p = r + (rsnew / rsold) * p

        rsold = rsnew

    return x    

This is wonderfully creative, and, at first glance, it looks perfect.   It is clearly not taken from a GitHub repo.  We know that because it is wrong.  The error is in the computation of b – Ax by the expression b- np.dot(D, x) – np.dot(E, x[1:]) – np.dot(E, x[:-1]).  While np.dot(A, x) is  vector, np.dot(D, x) is a scalar and  we need a vector.  The correct computation is

    r = D*x
    r[1:] += E*x[:-1]
    r[:-1] += E*x[1:]
    r = b – r

However, the error occurs in two places.  After commenting out the offending line and inserting the code above, we went to the second case and commented out that line.  Copilot immediately inserted the corrected version.

  #Ap = np.dot(D, p) + np.dot(E, p[1:]) + np.dot(E, p[:-1])

        Ap = D*p
Ap[1:] += E*p[:-1]
Ap[:-1] += E*p[1:]

In this case, Copilot learned to make the needed correction based on the changes in the first case.

Copilot and Deep Neural Nets

While the examples above illustrate copilot’s proficiency with Python numerical libraries, we decided to push it a bit harder.   In a recent post we looked at denoising autoencoders designed to remove noise from one-dimensional signals.  After a non-trivial effort, we built an autoencoder based on 1-D convolutions that worked very well.    Now asking copilot to do the same, we prompted it with a comment line.  After a pause, it produced the following.

import torch
import torch.nn as nn

#Class of a one-dimensional convolutional denoising autoencoder

class ConvDenoisingAutoencoder(nn.Module):

    def __init__(self, input_dim, hidden_dim, output_dim):

        super(ConvDenoisingAutoencoder, self).__init__()

        self.encoder = nn.Sequential(

            nn.Conv1d(1, 16, 3, stride=3, padding=1),

            nn.ReLU(True),

            nn.MaxPool1d(2, stride=2),

            nn.Conv1d(16, 8, 3, stride=2, padding=1),

            nn.ReLU(True),

            nn.MaxPool1d(2, stride=1)

        )

        self.decoder = nn.Sequential(

            nn.ConvTranspose1d(8, 16, 3, stride=2),

            nn.ReLU(True),

            nn.ConvTranspose1d(16, 8, 5, stride=3, padding=1),

            nn.ReLU(True),

            nn.ConvTranspose1d(8, 1, 2, stride=2, padding=1),

            nn.ReLU(True)

        )

    def forward(self, x):

        x = self.encoder(x)

        x = self.decoder(x)

        return x

At first glance, this code is obviously flawed.  The parameters input_dim, hidden_dim  and output_dim are unused.    However, the rest of code is basically correct.   This was surprising given that the __init__ signature was wrong.  There is one problem with this code.   The autoencoder fails if the input does not have  shape [*,1,124*(4**k) ], for k = 0,1,2…   This is not obvious, and it took some careful analysis of the code to realize this fact.  We tested the class with some of the same data described in the previous post.  While not as complete as the version we created, it was adequate.   We suspect that the code is based on a pattern synthesized from many examples in GitHub and the input shape detail was explained in one of those.   We  have put a Jupyter notebook demonstrating the working code in github. 

Final Thoughts

If we set aside the legal and moral arguments and ask two questions:

  1. Is Copilot a helpful programmers assistant?
  2. How much should one trust Copilot suggestions?

To answer these, one must start by understanding that Copilot is not demonstrating human Intelligence.  Its responses are more closely related to synthetic art than science.  The genius of GTP-3 is its ability to take a prompt and weave an impressive surreal story.  But it is not art.   Beautiful patterns in nature can often thrill and inspire us,  but they are not human created art.   In the same way, GPT-3 takes natural language input and responds with text that reflects amazing combinations of human patterns of speech.  In the same way, Copilot takes natural language input and maps it to software idioms and patterns that mimic the human creativity embedded in the GitHub “literature”.

Is Copilot a helpful programmers assistant?  Yes, especially in the cases where you can’t remember the precise library function incantation to accomplish something simple.   On the other hand,  it can be a distraction when you know what you are doing and do not want a suggestion.    

Copilot is always eager to “take the ball and run with it”.   This can be fun to watch, but as we have illustrated the results are best appreciated as synthetic art or poetry.   The generated code can look very good,   but, as we have shown, it can contain subtle errors.   Of course, this is a purely subjective evaluation.  As previously mentioned, OpenAI has published a detailed study of their Codex model that was used to build Copilot. In this case they report 28.8% success rate in generating correct code from docstrings.  I am not sure I would trust a programmer with a 28.8% success rate with my big projects.  But it may be fine as an assistant with small library tasks.  And this may be exactly what its creators intend.

Revisiting Autoencoders

Abstract

We wrote about generative neural networks in two previous blog posts where we promised to return to the topic in a future update.  This is it.  This article is a review of some of more advances in autoencoders over the last 10 years.  We present examples of denoising autoencoders,  variational and three different adversarial neural networks.  The presentation is not theoretical, and it uses examples via Jupyter notebooks that can be run on a standard laptop.

Introduction

Autoencoders are a class of deep neural networks that can learn efficient representations of large data collections.  The representation is  required to be robust enough to regenerate a good approximation of the original data.  This can be useful for removing nose from the data when the noise is not an intrinsic property of the underlying data.  These autoencoder often, but not always, work by projecting the data into a lower dimensional space (much like what a principal component analysis achieves). We will look at an example of a “denoising autoencoder” below where the underlying signal is identified through a series of convolutional filters. 

A difference class of autoencoders are called Generative Autoencoders that have the property that they can create a mathematical machine that can reproduce the probability distribution of the essential features of a data collection.  In other words, a trained generative autoencoder can create new instances of ‘fake’ data that has the same statistical properties as the training data.  Having a new data source, albeit and artificial one, can be very useful in many studies where it is difficult to find additional examples that fit a given profile. This ideas is being used in research in particle physics to improve the search for signals in data and in astronomy where generative methods can create galaxy models for dark energy experiments. 

A Denoising Autoencoder.

Detecting signals in noisy channels is a very old problem.  In the following paragraphs we consider the problem of identifying cosmic ray signals in radio astronomy.   This work is based on “Classification and Recovery of Radio Signals from Cosmic Ray Induced Air Showers with Deep Learning”, M. Erdmann, F. Schlüter and R. Šmída  1901.04079.pdf (arxiv.org) and Denoising-autoencoder/Denoising Autoencoder MNIST.ipynb at master · RAMIRO-GM/Denoising-autoencoder · GitHub.  We will illustrate a simple autoencoder that pulls the signals from the noise with reasonable accuracy.

For fun, we use three different data sets.   One is a simulation of cosmic-ray-induced air showers that are measured by radio antennas as described in the excellent book “Deep Learning for Physics Research” by Erdmann, Glombitza, Kasieczka and Klemradt.  The data consists of two components: one is a trace of the radio signal and the second is a trace of the simulated cosmic ray signal.   A sample is illustrated in Figure 1a below with the cosmic ray signal in red and the background noise in blue.  We will also use a second data set that consists of the same cosmic ray signals and uniformly distributed background noise as shown in Figure 1b. We will also look at a third data set consisting of the cosmic ray signals and a simple sine wave noise in the background Figure 1c.  As you might expect, and we will illustrate, this third data set is very easy to clean.


Figure 1a, original data                    Figure 1b, uniform data              Figure 1c, cosine data

To eliminate the nose and recover the signal we will train an autoencoder based on a classic autoencoder design with an encoder network that takes as input the full signal (signal + noise) and a decoder network that produces the cleaned version of the signal (Figure 2).  The projection space in which the encoder network sends in input is usually much smaller than the input domain, but that is not the case here.

Figure 2.   Denoising Autoencoder architecture.

In the network here the input signals are samples in [0, 1]500.  The projection of each input consists of 16 vectors of length 31 that are derived from a sequence of convolutions and 2×2 pooling steps. In other words, we have gone from a space of dimension 500 to 496.  Which is not a very big compression. The details of the network are shown in Figure 3.

The one dimensional convolutions act like frequency filters which seem to leave the high frequency cosmic ray signal more visible.  To see the result, we have in figure 4 the output of the network for three sample test examples for the original data.  As can be seen the first is a reasonable reconstruction of the signal, the second is in the right location but the reconstruction is weak, and the third is completely wrong. 

Figure 3.   The network details of the denoising autoencoder.

Figure 4.  Three examples from the test set for the denoiser using the original data

The two other synthetic data sets (uniformly distributed random nose and noise create from a cosine function) are much more “accurate”.

Figure 5. Samples from the random noise case (top) and the cosine noise case (bottom)

We can use the following naive method to assess accuracy. We simply track the location of the reconstructed signal.  If the maximum value of the recovered signal is within 2% of the maximum value of the true signal, we call that a “true reconstruction”. Based on this very informal metric we see that the accuracy  for the test set for the original data is 69%.    It is 95% for the uniform data case and 98% for the easy cosine data case. 

 Another way to see this result is to compare the response in the frequency domain, by applying an FFT to the input and output signals we see the results in Figure 6.

Figure 6a, original data                    Figure 6b, uniform data          Figure 6c, cosine data

The blue line is the frequency spectrum of the true signal, and the green line is the recovered signal. As can be seen, the general profiles are all reasonable with the cosine data being extremely accurate.

Generative Autoencoders.

Experimental data drives scientific discovery.   Data is used to test theoretical models.   It is also used to initialize large-scale scientific simulations.   However, we may not always have enough data at hand to do the job.   Autoencoders are increasingly  being used in science to generate data that fits the probabilistic density profile of known samples or real data.  For example, in a recent paper, Henkes and Wessels use generative adversarial networks (discussed below) to generate 3-D microstructures for studies of continuum micromechanics.   Another application is to use generative neural networks to generate test cases to help tune new instruments that are designed to capture rare events.  

In the experiments for the remainder of this post we use a very small collection of images of Galaxies so that all of this work can be done on a laptop.  Figure 7 below shows a sample.   The images are 128 by 128 pixels with three color channels.  

Figure7.  Samples from the Galaxy image collection

Variational Autoencoders

A variational autoencoder is a neural network that uses an encoder network to reduce data samples to a lower dimensional subspace.   A decoder network takes samples from this hypothetical latent space and uses it to regenerate the samples.  The decoder becomes a means to map a uniform distribution on this latent space into the probability distribution of our samples.  In our case, the encoder has a linear map from the 49152 element input down to a vector of length 500.   This is then mapped by a  500 by 50 linear transforms down to two vectors of length 50.  The decoder does a renormalization step to produce a single vector of length 50 representing our latent space vector.   This is expanded by two linear transformations are a relu map back to length 49152. After a sigmoid transformation we have the decoded image.  Figure 8 illustrates the basic architecture.

Figure 8.   The variational autoencoder.

Without going into the mathematics of how this works (for details see our previous post or the “Deep Learning for Physics Research” book or many on-line sources), the network is designed so that the encoder network generates mean (mu) and standard deviation (logvar) of the projection of the training samples in the small latent space such that the decoder network will recreate the input.  The training works by computing the loss as a combination of two terms, the mean squared error of the difference between the regenerated image and the input image and Kullback-Leibler divergence between the uniform distribution and the distribution generated by the encoder. 

The Pytorch code is available as a the notebook in github.  The same github directory contains the zipped datafile).  Figure 9 illustrates the results of the encode/decode on 8 samples from the data set.   

Figure 9.   Samples of galaxy images from the training set and their reconstructions from the VAR.

An interesting experiment is to see how the robust the decoder is to changes in the selection of the latent variable input.   Figure 10 illustrates the response of the decoder when we follow a path in the latent space from one instance from the training set to another very similar image.

Figure 10.  image 0 and image 7 are samples from the training set. Images 1 through 6 are generated from points along the path between the latent variable for 0 and for 7.

Another interesting application was recently published.  In the paper “Detection of Anomalous Grapevine Berries Using Variational Autoencoders” Miranda et.al.  show how a VAR can be used to examine arial photos of vineyards to spot areas of possible diseased grapes.  

Generative Adversarial Networks (GANs)

Generative Adversarial networks were introduced by Goodfellow et, al (arXiv:1406.2661) as a way to build neural networks that can generate very good examples that match the properties of a collection of objects.

As mentioned above, artificial examples generated by autoencoder can be used as starting points for solving complex simulations.  In the case of astronomy, cosmological simulation is used to test our models of the universe. In “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018) Mustafa Mustafa, et. al. demonstrates how a slightly-modified standard GAN can be used generate synthetic images of weak lensing convergence maps derived from N-body cosmological simulations.  In the remainder of this tutorial, we look at GANs.

Given a collection r of objects in Rm, a simple way to think about a generative model is as a mathematical device that transforms samples from a multivariant normal distribution Nk(0,1)  into Rm so that they look like they come from the actual distribution Pr. for our collection r.  Think of it as a function

Which maps the normal distribution into a distribution Pg  over Rm.  

In addition, assume we have a discriminator function

With the property that D(x) is the probability that x is in our collection r.   Our goal is to train G so that Pg matches Pr.   Our discriminator is trained to reject images generated by the generator while recognizing all the elements of Pr.  The generator is trained to fool the discriminator, so we have a game defined by minimax objective:

We have put a simple basic GAN from our previous post.  Running it for many epochs can occasionally get some reasonable results as shown if Figure 11.  While this looks good, it is not. Notice that it generated examples of only 3 of our samples.  This repeating of an image for different latent vectors is an example of a phenomenon called modal collapse

 

Figure 11.  The lack of variety in the images is called model collapse.

acGAN

There are several variations on GANs that avoid many of these problems.   One is called an acGAN for auxiliary classifier Gan developed by Augustus Odena, Christopher Olah and Jonathon Shlens.  For an acGan we assume that we have a class label for each image.   In our case the data has three categories: barred spiral  (class 0), elliptical (class 1) and spiral (class 2).   The discriminator is modified so that it not only returns the probability that the image is real, it also returns a guess at the class.  The generator takes an extra parameter to encourage it to generate an image of the class.   Let d-1 = the number of classes  then we have the functions

The discriminator is now trained to minimize the error in recognition but also the error in class recognition.  The best way to understand the details is to look at the code.   For this and the following examples we have notebooks that are slight modifications to the excellent work from the public Github site of Chen Kai Xu from Chen Kai Xu  Tsukuba University.  This notebook is here.  Figure 12 below shows the result of asking the generator to create galaxies of a given class. The G(z,0) generates good barred spirals, G(z,1) are excellent elliptical galaxies and G(z,2) are spirals.

Figure 12.  Results from asgan-new.  Generator G with random noise vector Z and class parameter for barred spiral = 0,  elliptical = 1 and spiral = 2. 

Wasserstein GAN with Gradient Penalty

The problem with modal collapse and convergence failure was, as stated above, commonly observed.  The Wasserstein GAN introduced by Martin Arjovsky , Soumith Chintala , and L´eon Bottou, directly addressed this problem.  Later  Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin and Aaron Courville introduced a modification called a Gradient penalty which further improved the stability of the GAN convergence.  To accomplish this they added an additional loss term to the total loss  for training the discriminator as follows:

Setting the parameter lambda to 10.0 this was seen as an effective value for reducing to occurrence of mode collapse.  The gradient penalty value is computed using samples from straight lines connecting samples from Pr and Pg.   Figure 13 shows the result of a 32 random data vectors for the generator and the variety of responses reflect the full spectrum of the dataset.

Figure 13.   Results from Wasserstein with gradient penalty.

The notebook for wgan-gp is here.

Wasserstein Divergence for GANs

Jiqing Wu , Zhiwu Huang, Janine Thoma , Dinesh Acharya , and Luc Van Gool  introduced a variation on WGAN-gp called WGAN-div that addresses several technical constraints of WGAN-gp having to do with Lipschitz continuity not discussed here (see the paper).   They propose a better loss function for training the discriminator:

By experimental analysis the determine the best choice for the k and p hyperparameters are 2 and 6.

Figure 14 below illustrates the results after 5000 epochs of training. The notebook is here.

Figure 14.   Results from WG-div experiments.

Once again, this seems to have eliminated the modal collapse problem.  

Conclusion

This blog was written to do a better job illustrating autoencoder neural networks than our original articles.   We illustrated a denoising autoencoder, a variational autoencoder and three generative adversarial networks.  Of course, this is not the end of the innovation that has taken place in this area.  A good example of recent work is the progress made on Masked Autoencoders.  Kaiming He et. al.   published Masked Autoencoders Are Scalable Vision Learners in December 2021.  The idea is very simple and closely related to the underlying concepts used in Transformers for natural language processing like BERT or even GPT3.  Masking simply removes patches of the data and trains the network to “fill in the blanks”.   The same idea has been now applied to audio signal reconstruction. VL-BEiT from MSR is a generative vision-language foundation model that incorporates images and text. These new techniques show promise to generate more semantic richness to the results than previous methods. 

Understanding MLOps: a Review of “Practical Deep Learning at Scale with MLFlow” by Yong Liu

Research related to deep learning and its applications is now a substantial part of recent computer science.  Much of this work involves building new, advanced models that outperform all others on well-regarded benchmarks.  This is an extremely exciting period of basic research.  However, for those data scientists and engineers involved in deploying deep learning models to solve real problems, there are concerns that go beyond benchmarking.  These  involve the reliability, maintainability, efficiency and explainability of the deployed services.    MLOps refers to the full spectrum of best practices and procedures from designing the training data to final deployment lifecycle.   MLOps is the AI version of DevOps: the modern software deployment model that combines software development (Dev) and IT operations (Ops).   There are now several highly integrated platforms that can guide the data scientist/engineer through the maze of challenges to deploying a successful ML solution to a business or scientific problem.

Interesting examples of MlOps tools include

  • Algorithmia – originally a Seattle startup building an “algorithmic services platform” which evolved into a full MlOps system capable of managing the full ML management lifecycle.  Algorithmia was acquired by DataRobot and is now widely used.
  • Metaflow is an open source MLOps toolkit originally developed by Netflix.  Metaflow uses a directed acyclic graph to encode the steps and manage code and data versioning.
  • Polyaxon is a Berlin based company that is a “Cloud Native Machine Learning Automation Platform.”

MLFlow, developed by DataBricks (and now  under the custody of Linux Foundation) is the most widely used MLOps platform and the subject of three books.  The one reviewed here is “Practical Deep Learning at Scale with MLFlow” by Dr. Yong Liu.   According to Dr. Liu, the deep learning life cycle consists of

  1. Data collection, cleaning, and annotation/labeling.
  2. Model development which is an iterative process that is conducted off-line.
  3. Model deployment and serving it in production.
  4. Model validation and online testing done in a production environment.
  5. Monitoring and feedback data collection during production.

MLFlow provides the tools to manage this lifecycle.  The book is divided into five sections that cover these items in depth.  The first section is where the reader is acquainted with the basic framework.   The book is designed to be hands-on with complete code examples for each chapter.  In fact about 50% of the book is leading the reader through this collection of excellent examples.   The up-side of this approach is that the reader becomes a user and gains expertise and confidence with the material.  Of course, the downside (as this author has learned from his own publications) is that software evolves very fast and printed versions go out of date quickly.  Fortunately, Dr. Liu has a GitHub repository for each chapter that can keep the examples up to date.   

In the first section of the book, we get an introduction to MLFlow.  The example is a simple sentiment analysis written in PyTorch.  More precisely it implements a transfer learning scenario that highlights the use of lightning flash which provides a high level set of tools that encapsulate standard operations like basic training, fine tuning and testing.  In chapter two, MLFlow is first introduced as a means to manage the experimental life cycle of model development.  This involves the basic steps of defining  and running the experiment.    We also see the MLFlow user portal.  In the first step the experiment is logged with the portal server which records the critical metadata from the experiment as it is run.  

This reader was able to do all of the preceding on a windows 11 laptop, but for the next steps I found another approach easier.   Databricks is the creator of MLFlow, so it is not surprising that MLFlow is fully supported on their platform.   The book makes it clear that the code development environment  of choice for MLOps is not my favorite Jupyter, but rather VSCode.  And for good reasons.   VSCode interoperates with MLFlow brilliantly when running your own copy of MLFlow.  If you use the Databricks portal the built-in notebook editor works and is part of the MLFlow environment.   While Databricks has a free trial account, many of the features describe below are not available unless you have an account on AWS or Azure or GCS and a premium Databrick account. 

One of the excellent features of MLFlow is its ability to track code versioning data and pipeline tracking.  As you run experiments you can modify the code in the notebook and run it again. MLFlow keeps track of the changes, and you can return to previous versions with a few mouse clicks  (see Figure 1).   

Figure 1.  MLFlow User Interface showing a history of experiments and results.

Pipelines in MLFlow are designed to allow you to capture the entire process from feature engineering through model selection and tuning (See Figure 2).  The pipelines consider are data wrangling, model build and deployment.  MLFlow supports the data centric deep learning model using Delta Lake to allow versioned and timestamped access to data.

Figure 2.  MLFlow pipeline stages illustration.  (From Databricks)

Chapter 5 describes the challenges of running at scale and Chapter 6 takes the reader through hyperparameter tuning at scale.  The author takes you through running locally with a local code base and then running remote code from Github and finally running the code from Github remotely on as Databricks cluster.

In Chapter  7, Dr Liu dives into the technical challenge of hyperparameter optimization (HPO).  He compares three sets of tools for doing HPO at scale but he settles on Ray Tune which work very well with MLFlow.  We have describe Ray Tune elsewhere in our blog, but the treatment in the book is much more advanced.

Chapter 8 turn to the very important problem of doing ML inference at scale.  Chapter 9 provides an excellent, detailed introduction to the many dimensions of explainability.  The primary tool discussed is based on SHapley Additive exPlanations (SHAP), but others are also discussed.    Chapter 10 explores the integration of SHAP tools with MLFlow.   Together these two chapters provide an excellent overview of the state of the art in deep learning explainability even if you don’t study the code details.    

Conclusion

Deep learning is central to the concepts embodied in the notion that much of our current software can be generated or replaced entirely by neural network driven solutions.  This is often called “Software 2.0’. While this may be an apocryphal idea, it is clear that there is a great need for tools that can help guide developers through the best practices and procedures for deploying deep learning solutions at scale.  Dr Yong Liu’s book “Practical Deep Learning at Scale with MLFlow” is the best guide available to navigating this new and complex landscape.  While much of it is focused on the MLFlow toolkit from Databricks, it is also a guide to concepts that motivate the MLFlow software.    This is especially true when he discusses the problem of building deep learning models, hyperparameter tuning and inference systems at scale. The concluding chapters on deep learning explainability together comprise  one of the best essays on this topic I have seen.  Dr. Liu is a world-class expert on MLOps and this book is an excellent contribution. 

Explainable Deep Learning and Guiding Human Intuition with AI.

In July 2021 Alex Davies and a team from DeepMind, Oxford and University of Sydney published a paper entitled “Advancing mathematics by guiding human intuition with AI”. The paper addresses the question of how can machine learning be used to guide intuition in mathematical discovery?  The formal approach they take to this question proceeds as follows.   Let Z be a collection of objects.  Suppose that for each instance z in Z we have two distinct mathematical representations of z:  X(z) and Y(z).   We can then ask, without knowing z,  is there a mathematical function f : X -> Y such that given X(z) and Y(z),  f(X(z)) = Y(z)?  Suppose the mathematician builds a machine learning model trained on many instances of X(z) and Y(z).  That model can be thought of as a function f^ :X -> Y such that f^(X(z)) ~ Y(z).  The question then becomes, can we use properties of that model to give us clues on how to construct the true f?

A really simple example that the authors give is to let Z be the set of convex polyhedral (cube, tetrahedron, octahedron, etc.).  If we let X(z) be the tuple of numbers defined by the number of edges, the number of vertices, the volume of z and the surface area and let  Y(z) be the number of faces,  then without knowing z, the question becomes is there a function f: R4 ->  R such that f( X(z) ) = Y(z) ?   Euler answered this question some time ago in the affirmative:  Yes, he proved that

f(edges, vertices, volume, surface area)  =  edges – vertices + 2 = faces.

Now suppose we did not have Euler to help us.   Given a big table where each row corresponds to (edge, vertices, volume, surface area, faces) for some convex polytope, we can select a subset of rows as a training set and try to build a model to predict faces given the other values. Should our AI model prove highly accurate on the test set consisting of the unselected rows, that may lead us to suspect that such a function exists.   In a statistical sense, the learned model is such a function, but it may not be exact and, worse, by itself, it may not lead us to formula as satisfying as Euler’s. 

This leads us to Explainable AI.   This is a topic that has grown in importance over the last decade as machine learning has been making more and more decisions “on our behalf”.  Such as which movies we should rent and which social media article we should read.  We wonder “Why did the recommender come to the conclusion that I would like that movie?”  This is now a big area of research (the Wikipedia article has a substantial bibliography on Explainable AI.)  One outcome of this work has been a set of methods that can be applied to trained models to help us understand what parts of the data are most critical in the model’s decision making.    Davies and his team are interested in understanding what are the most “salient” features of X(z) in relation to determining Y(z) and using this knowledge to inspire the mathematician’s intuition in the search for f.   We return to their mathematical examples later, but first let’s look closer at the concept of salience.

Salience and Integrated Gradients

Our goal is to understand how important each feature of the input to a neural network is to the outcome. The features that are most important are often referred to as “salient” features.  In a very nice paper, Axiomatic Attribution for Deep Networks from 2017 Sundararajan, Taly and Yan consider this the question of attribution.  When considering the attribution of input features to output results of DNNs, they propose two reasonable axioms.   The first is Sensitivity: if a feature of the input causes the network to make a change then that feature should  have a non-zero attribution.  In other words, it is salient. Represent the network as a function F: Rn->[0,1] for n-dimensional data.  In order to make the discussion more precise we need to pick a baseline input x’ that represents the network in an inactivated state: F(x’) = 0.    For example, in a vision system, an image that is all black will do.  We are interested in finding the features in x that are critical when F(x) is near 1. 

The second axiom is more subtle. Implementation Invariance:  If two neural networks are equivalent (i.e. they give the same results for the same input), the attribution of a feature should be the same for both.

The simples form of salience computation is to look at the gradient of the network.    For each i in 1 .. n,  we can look at the components of the gradient and define

This axiom satisfies implementation invariance, but unfortunately this fails the sensitivity test.  The problem is the value of F(xe) for some xe may be 1, but the gradient may be 0 at that point. We will show an example of this below.  On the other hand if we think about “increasing” x from x’ to xe , there should be a transition of the gradient from 0 to non-zero as F(x) increases towards 1.  That motivates the definition of Integrated Gradients.   We are going to add up the values of the gradient along a path from the baseline to a value that causes the network to change.   

let γ = (γ1, . . . , γn) : [0, 1] → Rn be a smooth function specifying a path in Rn from the baseline x’ to the input x, i.e., γ(0) = x’ and γ(1) = x.   It turns out that it doesn’t matter which path we take because we will be approximating the path integral,  and by the fundamental theorem of calculus applied to path integrals,  we have

Expanding the integral out in terms of the components of the gradient,

Now, picking the path that represents the straight line between x’ and x as

Substituting this in the right-hand side and simplifying, we can set the attribution for the ith component as

To compute the attribution of factor i, for input x, we need only evaluate the gradient along the path at several points to approximate the integral.   In the examples below we show how salience in this form and others may be used to give us some knowledge about our understanding problem.

Digression: Computing Salience and Integrated Gradients using Torch

Readers not interested in how to do the computation of salience in PyTorch can skip this section and go on to the next section on Mathematical Intuition and Knots.

A team from Facebook AI introduced Captum in 2019 as a library designed to compute many types of salience models.  It is designed to work with PyTorch deep learning tools.  To illustrate it we will look at a simple example to show where simple gradient salience breaks down yet integrated gradience works fine.  The complete details of this example are in this notebook on Github.

We start with the following really dumb neural network consisting of one relu operator on two input parameters.

A quick look at this suggests that the most salient parameter is input1 (because it has 10 time the influence on the result of input2.   Of course, the relu operator tells us that result is flat for large positive values of input1 and input2.  We see that as follows.

We can directly compute the gradient using basic automatic differentiation in Torch.  When we evaluate the partial derivatives for these values of input1 and input2 we see they are zero.

From Captum we can grab the IntegratedGradient and Salience operators and apply them as follows.

The integrated Gradient approximation shows that indeed input1 has 10 time the attribution strength of input2.  And the sum of these is m(input1, input2) plus an error term.  As we already expect, the simple gradient method  of computing salience will  fail.

Of course this extreme example does not mean simple gradient salience will fail all the time.   We will return to Captum (but without the details) in another example later in this article.

Mathematical Intuition and Knots

Returning to mathematics, the Davies team considered two problems.   The methodology they used is described in the figure below which roughly corresponds to our discussion in the introduction.

Figure 1.   Experimental methodology used by Davies, et.al. (Figure 1 from “Advancing mathematics by guiding human intuition with AI”. Nature, Vol 600, 2 December 2021)

They began with a conjecture about Knot theory.   Specifically, they were interested in the conjecture that that the geometric invariants of knots (playing the role of X(z) in the scenario above)  could determine some of the algebraic invariants (as Y(z)).  See Figure 2 below.

Figure 2.   Geometric and algebraic invariant of hyperbolic knots. (Figure 2 from “Advancing mathematics by guiding human intuition with AI”. Nature, Vol 600, 2 December 2021)

The authors of the paper had access to information 18 geometric invariants on 243,000 knots and built a custom deep learning stack to try to identify the salient invariants that could identify the signature of the knot (a snapshot of the information is shown below).   Rather than describing their model, we decided to apply one generated by the AutoML  tool  provided by the Azure ML Studio.  

Figure 3.  Snapshot of the Knot invariant data rendered as a python pandas dataframe.

We uses a Jupyter Notebook to interact with the remote instances of the azure ML  studio.  The notebook is in github here: dbgannon/math: Notebooks from Explainable Deep Learning and Guiding Human Intuition with AI (github.com).  The notebook also contains links to the dataset.

Because we described azure ML studio in a previous post, we will not go into it here in detail.  We formulated the computation as a straightforward regression classification problem.    AutoML completed the model selection and training and it also computed the factor salience computation with the results shown below.

Figure 4.  Salience of features computed by Azure AutoML using their Machine Learning services  (To see the salience factors one has to look at the output details on the ML studio page for the computation.)

The three top factors were: the real and imaginary components of meridinal_translation, and the longitudinal translation.   These are the same top factors that were revealed in the authors study but in a different order.  

Based on this hint they authors proposed a conjecture: for a hyperbolic knot K define the slope(K) to be the real part of the fraction longitudinal translation/meridinal_translation.  Then there exists constants c1 and c2 such that

In Figure 5,   we show scatter plots of the predicted signature versus the real signature of each of the knots in the test suit.   As you can see, the predicted signatures form a reasonably tight band around the diagonal (true signatures). The mean squared error of the formula slope(K)/2 from the true signature was 0.86 and the mean squared error of the model predictions was 0.37.   This suggests that the conjecture may need some small correction terms, but that is up to the mathematician to prove.  Otherwise, we suspect the bounds in the inequality are reasonable.

Figure 5.  On the left is a scatter plot of the slope(K)/2 computed from the geometric data vs the signature.  On the right is the signature predicted by the model vs the signature.  

Graph Neural Nets and Salient Subgraphs.

The second problem that the team looked at involved representation theory.    In this case they are interested in pairs of elements in the symmetric group Sn represented as permutations.   For example, in S5 an instance z might be {(03214), (34201)}.  An interesting question to study is how to transform the first permutation into the second by simple 2-element exchanges (rotations) such as 03214->13204->31204->34201.  In fact, there are many ways to do this and we can build a directed graph showing the various paths of rotations to get from the first to the second.   This graph is called the unlabeled Bruhat interval, and it is their X(z).    The Y(z) is the Kazhdan–Lusztig (KL) polynomial for the permutation pair.   To go any deeper into this topic is way beyond the scope of this article (and beyond the knowledge of this author!) Rather, we shall jump to their conclusion and then consider a different problem related to salient subgraphs.   They discovered by looking at salient subgraphs of a Bruhat interval graph a for a  pair in Sn that there was a hypercube and a subgraph isomorphic to an interval in Sn−1.  This led to a formula for computing the KL polynomial.  A very hard problem solved!

An important observation the authors used in designing the neural net model was that information conveyed along the Bruhat interval was similar to message passing models in Graph neural networks.  These GNNs have become powerful tools for many problems.   We will use a different example to illustrate the use of salience in understanding graph structures.    The example is one of the demo cases for the excellent Pytorch Geometric libraries. More specifically it is one of their example Colab Notebooks and Video Tutorials — pytorch_geometric 2.0.4 documentation (pytorch-geometric.readthedocs.io).  The example illustrates the use of graph neural networks for classification of molecular structures for use as drug candidates.

Mutagenicity is a property of a chemical compound that hampers its potential to become a safe drug.  Specifically there are often substructures of a compound, called toxicophores,  that can interact with proteins or DNA that can lead to changes in the normal cellular biochemistry.  An article from J. Med. Chem. 2005, 48, 1, 312–320, describes a collection of 4337 (2401 mutagens and 1936 nonmutagens).  These are included in the TUDataset collection and used here.

The Pytorch Geometric example uses the Captum library (that we illustrated above) to identify the salient substructures that are likely toxicophores.  While we will not go into great detail about the notebook because it is in their Colab space. If you want to run this on your on machine we have put a copy in our github folder for this project.

The TUDataset data set encodes molecules, such as the graph below, as an object of the form

Data( edge_index=[2, 26], x=[13, 14], y=[1] )

 In this object x represents the 13 vertices (atoms).  There are 14 properties associated with each atom, but they are just ‘one-hot’ encodings of the names of each of the 14 possible elements in the dataset: ‘C’, ‘O’, ‘Cl’, ‘H’, ‘N’, ‘F’,’Br’, ‘S’, ‘P’, ‘I’, ‘Na’, ‘K’, ‘Li’, ‘Ca’.  The edge index represents the 26 edges where each is identified by the index of the 2 end atoms.   The value Y is 0 if this molecule is known to be mutagenic and 1 otherwise.

We must first train a graph neural network that will learn to recognize the mutagenic molecules.  Once we have that trained network, we can apply Captum’s IntegratedGradient to identify the salient subgraphs that most implicate the whole graph as mutagenic. 

The neural network is a five layer graph convolutional network.  A convolutional graph layer works by adding together information from each graph neighbor of each node and multiplying it by a trainable matrix.   More specifically assume  that each node has a vector xl of values at level l.  We then compute a new vector xl+1 of values for node v at level l+1 by

where 𝐖(+1) denotes a trainable weight matrix of shape [num_outputs, num_inputs] and 𝑐𝑤,𝑣 refers to a fixed normalization coefficient for each edge.  In our case 𝑐𝑤,𝑣 is the number of edges coming into node v divided by the weight of the edge.   Our network is described below.

The forward method uses the x node property and the edge index map to guide the convolutional steps.   Note that we will use batches of molecules in the training and the parameter batch is how we can distinguish one molecule from another, so that the vector x is finally a pair of values for each element of the batch.  With edge_weight set to None,  the weight of each edge is  a constant 1.0.   

The training step is a totally standard PyTorch training loop.  With the dim variable set to 64 and 200 epochs later, the training accuracy is 96% and test accuracy is 84%.   To compute the salient subgraphs using integrated gradients we will have to compute the partial derivate of the model with respect to each of the edge weights.   To do so, we replace the edge weight with a variable tensor whose value is 1.0. 

Looking at a  sample from the set of mutagenic molecules, we can view the integrated gradients for each edge.   In figure 6 below on the left we show a sample with the edges labeled by their IG values.  To make this easier to see the example code replaces the IG values with a color code where large IG values have thicker darker lines. 

The article from J. Med. Chem notes that the occurrence of NO2 is often a sign of mutagenicity, so we are not surprised to see it in this example.   In Figure 7, we show several other examples that illustrate different salient subgraphs. 

Figure 6.   The sample from the Mutagenic set with the IngetratedGradient values labeling the edges.  On the right we have a version with dark, heavy lines representing large IG values.

Figure 7.  Four more samples.  The subgraph N02 in the upper left is clearly salient, but in the three other examples we see COH bonds showing salient subgraphs.  However these also occur in the other, non-mutagenic set, so the significance is not clear to us non-experts.

Conclusion.

The issue of explainability  of ML methods is clearly important when we let ML make decisions  about peoples lives.    Salience analysis lies at the heart of the contribution to mathematical insight described above.   We have attempted to illustrate where it can be used to help us learn about the features in training data that drive classification systems to draw conclusions.  However, it takes the inspiration of a human expert to understand how those features are fundamentally related to the outcome.

ML methods are having a far-reaching impact throughout scientific endeavors.  Deep learning has become a new tool that is used in research in particle physics to improve the search for signals in data, in astronomy where generative methods can create galaxy models for dark energy experiments  and biochemistry where RoseTTAFold and DeepMind’s AlphaFold have used deep learning models to revolutionize protein folding and protein-protein interaction.  The models constructed in these cases are composed from well-understood components such as GANs and Transformers where issues of explainability have more to do with optimization of the model and its resources usage.  We will return to that topic in a future study.

A Look at Cloud-based Automated Machine Learning Services

AutoML

AI is the hottest topic in the tech industry.   While it is unclear if this is a passing infatuation or a fundamental shift in the IT industry, it certainly has captured the attention of many.   Much of the writing in the popular press about AI involves wild predictions or dire warnings.  However, for enterprises and researchers the immediate impact of this evolving technology concerns the more prosaic subset of AI known as Machine Learning.    The reason for this is easy to see.   Machine learning holds the promise of optimizing business and research practices.    The applications of ML range from improving the interaction between an enterprise and its clients/customers (How can we better understand our clients’ needs?), to speeding up advanced R&D discovery (How do we improve the efficiency of the search for solutions?).

Unfortunately, it is not that easy to deploy the latest ML methods without access to experts who understand the technology and how to best apply it.  The successful application of machine learning methods is notoriously difficult.   If one has a data collection or sensor array that may be useful for training an AI model,  the challenge is how to clean and condition that data so that it can be used effectively.  The goal is to build and deploy a model that can be used to predict behavior or spot anomalies.   This may involve testing a dozen candidate architectures over a large space of tuning hyperparameters.  The best method may be a hybrid model derived from standard approaches.   One such hybrid is ensemble learning in which many models, such as neural networks or decision trees, are trained in parallel to solve the same problem.  Their predictions are combined linearly when classifying new instances.  Another approach (called stacking) is to use the results of the sub-models as input to a second level model which selects the combination dynamically.  It is also possible to use AI methods to simplify labor intensive tasks such as collecting the best features from the input data tables (called feature engineering) for model building.    In fact, the process of building the entire data pipeline and workflow to train a good model itself a task well suited to AI optimization.  The result is automated machine learning.   The cloud vendors have now provided expert autoML services that can lead the user to the construction of a solid and reliable machine learning solutions.

Work on autoML has been going on for a while.  In 2013, Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown introduced Auto-WEKA and many others followed.  In 2019, the AutoML | Home  research groups led by Frank Hutter at the University of Freiburg,  and Prof. Marius Lindauer at the Leibniz University of Hannover published Automated Machine Learning: Methods, Systems, Challenges (which can be accessed on the AutoML website).

For an amateur looking to use an autoML system, the first step is to identify the problem that must be solved.  These systems support a surprising number of capabilities. For example, one may be interested in image related problems like image identification or object detection.   Another area is text analysis.  It may also be regression or predictions from streaming data.  One of the biggest challenges involve building models that can handle tabular data which may be contain not only columns of numbers but also images and text.    All of these are possible with the available autoML systems.

While all autoML systems are different in details of use, the basic idea is they automate a pipeline like the one illustrated in Figure 1 below.

Figure 1.   Basic AutoML pipeline workflow to generate an optimal model based on the data available.

Automating the model test and evaluation is a process that involves exploring the search space of model combinations and parameters.   Doing this search is a non-trivial process that involves intelligent pruning of possible combinations if they seem like to be poor performers.   As we shall show below, the autoML system may test dozens of candidates before ranking them and picking the best.    

Amazon AWS, Micosoft Azure,  Google cloud and IBM cloud all have automated machine learning services they provide to their customers.  In the following paragraphs we will look at two of these,   Amazon AWS autoGluon which is both open source and part of their SageMaker service,   and Microsoft Azure AutoML service which is part of the Azure Machine Learning Studio.   We will also provide a very brief look at Google’s Vertex AI cloud service.   We will not provide an in-depth analysis of these services, but give a brief overview and example from each. 

AWS and AutoGluon

AutoGluon was developed by a team at Amazon Web Services which they have also released as open source.  Consequently, it can be used as part of their SageMaker service or complete separately.  An interesting tutorial on AutoGluon is here.   While the types of problems AutoGluon can be applied to is extremely broad, we will illustrate it for only a tiny classical problem:  regression based on a tabular input. 

The table we use is the Kaggle bike share challenge.   The input is a pandas data frame with records of bike shares per day for about two years.  For each day, there is an indicator to say if this is a holiday and a workday.  There is weather information consisting of temperature, humidity and windspeed.  The last column is the “count” of the number of rentals for that day.     The first few rows are shown below in Figure 2.    Our experiment differs from the Kaggle competition in that we will use a small sample (27%) of the data to train a regression model and then use the remainder for the test so that we can easily illustrate the fit to the true data.

Figure 2.  Sample of the bike rental data used in this and the following example.

While AutoGluon, we believe, can be deployed on Windows, we will use Linux because it deploys easily there.   We used Google Colab and Ubuntu 18.04 deployed on Windows 11.  In both cases the installation from a Jupyter notebook was very easy  and went as follows.  First we need to install the packages.

The full notebook for this experiment is here and we encourage the reader to follow along as we will only sketch the details below. 

As can be seen from the data in Figure 2, the  “count” number jumps wildly from day to day.   Plotting the count vs time we can see this clearly.

A more informative way to look at this is a “weekly” average shown below.

 The training data that is available is a random selection about 70% of the complete dataset, so this is not a perfect weekly average, but it is seven consecutive days of the data.  

Our goal is to compute the regression model based on a small training sample and then use the model to predict the “count” values for the test data.   We can then compare that with the actual test data “count” values.    Invoking the AutoGluon is now remarkably easy.

We have given this a time limit of 20 minutes.   The predictor is finished well before that time.   We can now ask to see how well the different models did (Figure 3) and also ask for the best.

Running the predictor on our test data is also easy.   We first drop the “count” column from the test data and invoke the predict method on the predictor

Figure 3.    The leaderboard shows the performance of the various methods tested

One trivial graphical way to illustrate the fit of the prediction to the actual data is a simple scatter plot.

As should be clear to the reader, this is far from perfect.  Another simple visualization is to plot the two “count” values along the time axis.  As we did above, the picture is clearer if plot a smoothed average.  In this case each point is an average of the following 100 points.  The results, which shows the true data in blue over the prediction in orange, does indicate that the model does capture the qualitative trends. 

The mean squared error is 148.   Note: we also tried training with a larger fraction of the data and the result was similar. 

Azure Automated Machine Learning

The azure AutoML system is also designed to support classification, regression, forecasting and computer vision.   There are two basic modes in which Azure autoML works:  use the ML studio on Azure for the entire experience,  or use the Python SDK, running in Jupyter on your laptop with remote execution in the Azure ML studio.   (In simple cases you can run everything on you laptop, but taking advantage of the studio managing a cluster for you in the background is a big win.) We will use the Azure studio for this example.  We will run a Jupyter notebook locally and connect to the studio remotely.  To do so we must first install the Python libraries.  Starting with Anaconda on windows 10 or 11, it can be challenging to find the libraries that will all work together.   The following combination will work with our example.

conda create -n azureml python=3.6.13

conda activate azureml

pip install azureml-train-automl-client

pip install numpy==1.18

pip install azureml-train-automl-runtime==1.35.1

pip install xgboost==0.90

pip install jupyter

pip install pandas

pip install matplotlib

pip install azureml.widgets

jupyter notebook

Next clone the Azure/MachineLearningNotebooks from Github and grab the notebook configuration.ipynb.   If you don’t have an azure subscription, you can create a new free one.  Running the configuration successfully in you jupyter notebook will set up your connection to the Azure ML studio.  

The example we will use is a standard regression demo from the AzureML collection.  In order to better illustrate the results, we use the same bike-share demand data from the Kaggle competition as used above where we sample both the training and test data from the official test data.  The train data we use is 27% of the total and the remainder is used for test.  As we did with the AutoGluon example, we delete two columns: “registered” and “casual”.

You can see the entire notebook and results here:

azure-automl/bike-regression-drop-.3.ipynb at main · dbgannon/azure-automl (github.com)

If you want to understand the details, this is needed.   In the following we only provide a sketch of the process and results.

We are going to rely on autoML to do the entire search for the best model, but we do need to give it some basic configuration parameters as shown below.

We have given it a much longer execution time than is needed.   One line is then used to send the job to Azure ML studio.

After waiting for the experiment to run, we see the results of the search

  As can be seen, the search progressed through various methods and combination with a stack ensemble finally providing the best results.

We can now use the trained model to make our predictions as follows.  We begin by extracting the fitted model.  We can then drop the “count” column from the test file and feed it to the model.   The result can be plotted as a scatter plot.

As before we can now use a simple visualization based on a sliding window average of 100 points to “smooth” the data and show the results of the true values against the prediction. 

As can be seen the fit is pretty good.  Of course, this is not a rigorous statistical analysis, but it does show the model captures the trends of the data fairly well. 

In this case the mean squared error was 49.

Google Vertex AI

Google introduced their autoML service, called VertexAI in 2020.    Like AutoGluon and Azure AutoML there is a python binding where there is a function  aiplatform.TabularDataset.create() that can be used to initiate a training job in a manner similar to AutoMLConfig() in the Azure API.  Rather than use that we decided to use their full VertexAI cloud service on the same dataset and regression problem we described above.  

The first step was to upload our dataset, here called “untitled_1637280702451”.   The VertexAI system steps us through the process in a very deliberate and simple manner.   The first step is to tell it we want to do regression (the other choice for this data set was classification). 

The next step is to identify the target column and the columns that are included in the training.  We used the default data slit of 80% for training, 10% validation and 10% testing.  

After that there is a button to launch the training.   We gave it one hour.   It took two hours and produced a model

Once complete, we can deploy the model in a container and attach an endpoint.  The root mean squared error if 127 is in line with the AutoGluon result and more than the Azure autoML value.   One problem with the graphical interactive view is that I did not see the calculation to see if we are comparing the VeretexAI result to the same result for to the RMSE for the others.   

Conclusions

Among the three autoML methods used here, the easiest to deploy was VertexAI because we only used the Graphical interface on the Google Cloud.   AutoGluon was trivial to deploy on  Google Collab and on a local Ubuntu installation.   Azure AutoML was installable on  Windows 11, but it took some effort to find the right combination of libraries and Python versions.   While we did not study the performance of the VertexAI model,  the performance of the Azure AutoML model was quite good.

As it is like obvious to the reader, we did not push these systems to produce the best results.  Our goal was to see what was easy to do. Consequently, this brief evaluation of the three autoML offerings did not do justice to any of them.   All three have capabilities that go well beyond simple regression.  All three systems can handle streaming data, image classification and recognition as well as text analysis and prediction.  If time permits, we will follow up this article with more interesting examples.

IEEE Symposium on Cloud HPC

On September 7th, 8th and 9th, we held the IEEE Services 2021 – IEEE International Symposium on Cloud HPC (CloudHPC) (computer.org) as part of IEEE CLOUD 2021 (computer.org).   The program consisted of a plenary panel and 8 sessions with three speakers in each.   Each session was recorded and  in this note we will describe four of these sessions. The videos for two sessions are still unavailable, but we will update this document when we can access them. The other organizers of this event were Rong N Chang,  Geoffrey Fox, James Sexton, Christoph Hagleitner, Bruce D’Amora, Ian Foster and Andrew Lumsdaine. In the paragraphs below we provide the links to the videos and brief thumbnail abstract to introduce the talks.

The first session was a plenary panel that was part of the IEEE Cloud 2021 conference.

EXPLORING THE GROWING SYNERGY BETWEEN CLOUD AND HIGH-PERFORMANCE COMPUTING

Panelists:
Katherine Yelick, UC Berkeley and Lawrence Berkeley National Laboratory
Ian Foster, Argonne National Laboratory, University of Chicago
Geoffrey Fox, University of Virginia
Kate Keahey, Argonne National Laboratory, University of Chicago

This group of panelists have been involved in every aspect of HPC and HPC in the cloud. Kathy Yelick has tons of experience ranging from managing a HPC center to designing Programming Languages for parallel computing and, now, being  a partner in running a major NSF program CloudBank to help with cloud access. Ian Foster was a pioneer who conceived of the Grid, a network of supercomputer services which, in many ways foreshadowed the Cloud concept and he continues to lead research in HPC and the cloud.  Geoffrey Fox has been in the HPC business from the early days of MPI and he is now doing groundbreaking work using cloud native middleware with MPI to design new deep learning paradigms for science.   Kate Keahey is a leader in high performance distributed computing, and she designed one of the first open-source Cloud platforms and now runs a major cloud devoted to academic research for NSF.

The link to the panel discussion is SERVICES Plenary Panel 1: Cloud HPC – Exploring Growing Synergy btw Cloud & HPC – YouTube.

The overriding theme of the discussion involved an exploration of how cloud HCP is, and will be, different from traditional supercomputer based HPC and where the opportunities for exploiting the strengths of both can advance research and science.

Ian started the discussion by making the case that there were some fundamental differences between the cloud and traditional HPC systems.   Most significant of these was that the cloud made it possible to deliver and ecosystem of on-demand services.  Among these services are tools to interactively explore large data collections and services that enable users to easily build machine learning solutions for difficult data analysis problems. HPC on the other hand is about providing the most powerful computing platform for very large scientific applications.  Kathy also observed that what cloud lacked in high end power, the cloud excel in resilience. This resilience is critical for teaching such as when student assignment require just-in-time access to resources. Kate made the addition point that there are classes of HPC problems where having the interactivity clouds provide can be very important for exploring new areas where existing HPC application codes are not yet available.  Geoffrey made the point that access to HPC systems capable of doing some of the most advanced Neural Net training experiments is out of reach of must academics. That leaves clouds, but a national funding model for research is clearly needed. Kathy made the point that the new NSF CloudBank provides a way for university to access commercial cloud platform with NSF and other funding. Foster argues that the future challenge is to figure out how to combine the capabilities of both supercomputer centers and clouds for the next generation of scientific endeavors. The panel also addressed a number of important issues including economics, advances in technology and scalability.  It was an extremely lively and interesting discussion.  Check it out.  

Session 2.  HPCI in Biology & Medicine in the Cloud  (Chair: Dennis Gannon)

There are three talks in this session.

  • Computational Biology at the Exascale
    Katherine Yelick, UC Berkeley and Lawrence Berkeley National Laboratory
  • HySec-Flow: Privacy-Preserving Genomic Computing with SGX-based Big-Data Analytics Framework
    Judy Fox, Professor, University of Virginia
  • An automated self-service multi-cloud HPC platform applied to the simulation of cardiac valve disease with machine learning
    Wolfgang Gentzsch, UberCloud, Founder & President

Unfortunately, the recording of this session on YouTube is flawed (my fault).   Consequently, I have collected the videos together elsewhere.   A summary of the session and link to the induvial talks are here: Talks from the first IEEE Symposium on Cloud & HPC | The eScience Cloud (esciencegroup.com) 

Session 3. Using HPC to Enable AI at Scale (Chair: Dennis Gannon)

Session 3 included three talks that addressed how the private sector was approaching Cloud HPC to build new tools to solve critical global problems.  The session recording is here: CloudHPC Symposium – Session 3 – YouTube.  The talks were

  • Grand Challenges for Humanity: Cloud Scale Impact and Opportunities
    Debra Goldfarb, Amazon, Director HPC Products & Strategy
  • Enabling AI at scale on Azure
    Prabhat Ram, Microsoft, Azure HPC
  • Benchmarking for AI for Science in the Cloud: Challenges and Opportunities
    Jeyan Thiyagalingam, STFC, UK, Head of SciML Group

Debra Goldfarb passionately addressed the importance of bring HPC and the cloud to bear on many important global challenges.   Chief among this is the health crises brought on by viral pandemics.   She went on to describe work done in collaboration with the research community on providing the computation and data resources needed.  Of particular importance is the fact that the cloud could deliver scale that goes beyond single data centers.   One of the exciting outcomes was the work AWS did with the Covid-19 HPC consortium to provide the resources to Moderna to deliver their mRNA vaccine in an incredibly short time.   Debra also discussed the work on building the national strategic computing reserve that will play an important role in solving future problems.

Prabhat Ram, who recently moved from the DOE’s LBNL facility to Microsoft to build a team to address AI and HPC on Azure, provided an in-depth overview of the current state of Azure HPC Cloud capabilities.   He pointed out that for scientific computing problems of modest scale that it is easy to good speed-up on batch job with up to 128 cores.  Focusing on AI workloads Prabhat described their impressive performance on the HPL-AI benchmark.  Microsoft has now a deep collaboration with OpenAI and is building on the results from a few years ago with GPT-3.   Now the results now demonstrate that the cloud man be classified as leadership class computing.

Jeyan Thiyagalingam addressed the problem of designing benchmarks for AI and machine learning for science applications.   This is important for many reasons.  We have hundreds of different ML models and the applications and platforms differ widely.   Having a reasonable set of benchmarks will enable us to understand the applications better and provide a guide to tell us when our solutions are improving.   Developing these benchmarks is not easy.  One of the big problems is the size of realistic data sets for realistic applications. If a data set require 100 terabytes, how do we make it available?  The answer, it would seem, is the cloud.  However, Thiyagalingam observes that there are still many challenges to making that work in a practical way.   He makes a very strong case that the science community needs help from the cloud providers and other institutions to make this possible.

Session 4. Applications of Cloud Native Technology to HPC in the Cloud (Chair: Christoph Hagleitner)

Session 4 focused more on the Cloud software tools that are important for science.   There were three talks and the recorded video is here: CloudHPC Symposium – Session 4 – YouTube.

  • Serverless Supercomputing: High Performance Function as a Service
    Kyle Chard, Professor, University of Chicago
  • Minding the Gap: Navigating the transition from traditional HPC to cloud native development
    Bruce D’Amora, IBM Research
  • Composable Systems: An Early Application Experience
    Ilkay Altintas, SDSC, Chief Data Science Officer

Kyle’s talk was about how the function-as-a-service (FAS) model can be applied to large scale science.   In the talk he points out that the cloud computing industry has provided a number of important ideas, some  of which have been adopted by HPC (virtualization, containers, etc.).  He discussed FAS and introduced the FuncX system that they have built. (We have described FuncX elsewhere in this blog).  He also described a recent application of using FuncX for drug discovery pipelines for COVID-19 therapeutics.   

Kubernetes was developed by Google are released to the open-source community where it has become a cloud standard.  Bruce D’Amora considers the important issue of using Kubernetes on large scale scientific environments.  More specifically he described how they are using it to serve as a control plane for bare-metal system with high performance networks.

Ilkay Altintas, discussed a very ambitious project involving dynamic composability of computational services, AI and distributed real-time data integration.   This work goes well beyond static service integration into workflow.   The applications she discussed involved challenges such as the monitoring, measurement and prediction of the spread of wildfires and other rapidly changing input from widely disturbed sensor networks.   Large teams of participants need to be interacting with the system, so the workflow (or teamflow) requires dynamic scaling, on-demand interactive access and performance analysis.  The AI involved goes beyond the optimization of simulations to the optimization of a rapidly changing configuration of services and data streams.   The complexity of this task and its solution is extraordinary.   

Session 5.  Distributed Computing Issues for HPC in the Cloud (Chair Geoffrey Fox)

Session 5 contains 3 papers and the videos are available the following link: CloudHPC Symposium – Session 5 – YouTube

The papers are:

  • Challenges of Distributed Computing for Pandemic Spread Prediction based on Large Scale Human Interaction Data
    Haiying Shen, Professor, University of Virginia
  • GreenDataFlow: Minimizing the Energy Footprint of Cloud/HPC Data Movement
    Tevfik Kosar, Professor, University of Buffalo & NSF
  • IMPECCABLE: A Dream Pipeline for High-Throughput Virtual Screening, or a Pipe Dream?
    Shantenu Jha, Professor, Rutgers University

Haiying Shen describes the challenges of epidemic simulation using distributed computing.  In the case of epidemic simulations, the population is represented as a very large graph with edges representing interactions of human communities.   If we use a static partition of this graph to distribute the simulation over a distributed network  of servers, we encounter the problem that the graph is dynamic and so the partition must reflect this fact.  Optimizing the performance requires careful repartitioning and some replication.

Tevfik Kosar’s team has been look at the problem of reducing the cost of  moving data over the Internet.  On a global basis this costs $40 billion/year.  Their approach is to use application-level tuning of cross-layer parameters to save significant amounts of power.  Their approach involves clever clustering optimization.   They build a model that is best to capture the real-time conditions.   They applied this to a cloud-based experiment.  They achieve 80% higher throughput and up to 48% lower energy consumption.

Shantenu Jha described a massive effort by the DOE and other HPC laboratories to build a drug discovery pipeline, called IMPECCABLE that combines traditional HPC simulations with new AI methods that can greatly reduce the search spaces involved in molecular dynamics modeling.   IMPECCABLE is clearly vastly more sophisticated and compete than any of the current cloud-based HT virtual screening.  Professor Jha makes the point that more work is needed on system software to support heterogeneous workflows for large scale drug discovery. 

Final Sessions

There are two sessions for which we are still waiting for the IEEE to upload the videos.   We will update this document when we have access to recordings.

Session 1 Cloud & Heterogeneous Architectures & Opportunities for HPC (Chair: Ian Foster)

Advancing Hybrid Cloud HPC Workflows Across State of the Art Heterogeneous Infrastructures
Steve Hebert, Nimbix Founder and CEO

The impact of the rise in cloud-based HPC
Brent Gorda, ARM Director HPC Business

HPC in a box: accelerating research with Google Cloud
Alexander Titus, Google Cloud

Session 6.  Cloud HPC Barriers & Opportunities (Chair: Bruce D’Amora)

The Future of OpenShift
Carlos Eduardo Arango Gutierrez, Red Hat, HPC OpenShift Manager

Scientific Computing On Low-cost Transient Cloud Servers
Prateek Sharma, Indiana University

HW-accelerated HPC in the cloud: Barriers and Opportunities
Christoph Hagleitner, IBM Research



A Brief Introduction to Ray Distributed Objects, Ray Tune, and a Small Comparison to Parsl.

Abstract

This short report describes the core features of the Ray Distributed Object framework.  To illustrate Ray’s actor model, we construct a simple microservice-like application with a web server frontend.  We compare Ray to similar features in Parsl and provide a small performance analysis.   We also provide an illustration of how Ray Tune can optimize the hyperparameter of a neural network.

Introduction

Ray is a distributed object framework first developed by the amazingly productive Berkeley  RISE Lab. It consists of a number of significant components beyond the core distributed object system including a cluster builder and autoscaler,  a webserver framework,  a library for hyperparameter tuning, a scalable reinforcement learning library and a set of wrappers for distributed ML training and much more.  Ray has been deeply integrated with many standard tools such as Torch, Tensorflow, Scikit-learn, XGBoost, Dask, Spark and Pandas.    In this short note we will only look at the basic cluster, distributed object system and webserver and the Ray Tune hyperparameter optimizer.  We also contrast the basics of Ray with Parsl   which we described in a previous post.  Both Ray and Parsl are designed to be used in clusters of nodes or a multi-core server.  They overlap in many ways.  Both have Python bindings and make heavy use of futures for the parallel execution of functions and they both work well with Kubernetes, but as we shall see the also differ in several important respects.   Ray supports an extensive and flexible actor model, while Parsl is primarily functional.  Parsl is arguably more portable as it supports massive supercomputers in addition to cloud cluster models.  

 In the next section we will look at Ray basics and an actor example.  We will follow that with the comparison the Parsl and discuss several additional Ray features.  Finally, we turn to Ray Tune, the hyperparameter tuning system built on top of Ray.

Ray Basics

Ray is designed to exploit the parallelism in large scale distributed applications and, like Parsl and Dask, it uses the concept of futures as a fundamental mechanism.  Futures are object returned from function invocations that are placeholders for “future” returned values.   The calling context can go about other business while the function is executing in another thread, perhaps on another machine.   When the calling function needs the actual value computed by the function it makes a special call and suspends until the value is ready.  Ray is easy to install on your laptop.  Here is a trivial example.

The result is

As you can see, the “remote” invocation of f returns immediately with the object reference for the future value, but the actual returned value is not available until 2 seconds later.  The calling thread is free to do other work before it invokes ray.get(future) to wait for the value.  

As stated above, Parsl uses the same mechanism.   Here is the same program fragment in Parsl.

Ray Clusters and Actors

Ray is designed to manage run in ray clusters which can be launched on a single multicore node,  a cluster of servers or as pods in Kubernetes.   In the trivial example above we started a small Ray environment with ray.init() but  that goes away when the program exits.   To create an actual cluster on a single machine we can issue the command

$ ray start -–head

If our program now uses ray.init(address=’auto’) for the initialization the program is running in the new ray cluster.  Now, when the program exits, some of the objects it created can persist in the cluster.  More specifically consider the case of Actors which,  in Ray, are instances of classes that have persistent internal state and methods that may be invoked remotely. 

To illustrate Ray actors and the Ray web server will now describe an “application” that dynamically builds a small hierarchy of actors.    This application will classify documents based on the document title.  It will first classify them into top-level topics and, for each top-level topic subdivided further into subtopic as shown in Figure 1. Our goal here is not actually to classify documents, but to illustrate how Ray works and how we might find ways to exploit parallelism.  Our top-level topics are “math”, “physics”, “compsci”, “bio” and “finance”.   Each of these will have subtopics.   For example “math” has subtopics, “analysis”, “algebra” and  “topology”.  

Figure 1.  Tree of topics and a partial list of subtopics

We will create an actor for each top-level topic that will hold lists of article titles associated with the subtopics.   We will call these actors SubClassifiers and they are each instances of the following class.

 We can create an instance of the “math” actor with the call

cl = SubClassifier.options(name=”math”, lifetime=”detached”).remote(“math”)

There are several things that are happening here.  The SubClassifier initializer is “remotely” called with the topic “math”.  The initializer reads a configuration file (read_config) to load the subtopic names for the topic “math”.  From that the subtopic dictionaries are constructed.  We have also included an option that instructs that this instance to be “detached” and live in the cluster beyond the lifetime of the program that created it.  And it shall have the name “math”.   Another program running in the cluster can access this actor with the call

math_actor = ray.get_actor(“math”)

Another program running in the same Ray cluster can now add new titles to the “math” sub-classifier with the call

math_actor.send.remote(title)

In an ideal version of this program we will use a clever machine learning method to extract the subtopics, but here we will simply use a tag attached to the title.   This is accomplished by the function split_titles()

We can return the contents of the sub-classifier dictionaries with

print(math_actor.get_classification.remote())

A Full “Classifier” Actor Network.

To illustrate how we can deploy a set of actors to make the set of microservice-like actors we will create a “document classifier” actor that allocates document titles to the SubClassifier actors.  We first classify the document into the top-level categories “math”, “physics”, “compsci”, “bio” and “finance”, and then send them to the corresponding SubClassifier actor.  

The Classifier actor is shown below.   It works as follows.   A classifier instance has a method “send(title)” which uses the utility function split_titles to extract the top-level category of the document.  But to get the subcategory we need to discover and attach the subcategory topic tag to the title.   A separate function compute_subclass does that.   For each top-level category, we get the actor instance by name, or if it does not exist yet, we create it.   Because computing the subclass may require the computational effort of a ML algorithm, we invoke compute_subclass as remote and store the future to that computation in a list and go on and process the next title.  If the list reaches a reasonable limit (given by the parameter queue_size) we empty the list and start again.   Emptying the list requires waiting until all futures are resolved.   This is also a way to throttle concurrency.

By varying the list capacity, we can explore the potential for concurrency.   If queue_size == 0, the documents will be handled sequentially.   If queue_size == 10 there can be as many of 10 documents being processed concurrently.  We shall see by experiments below what levels of concurrency we can obtain. 

In the ideal version of this program the function compute_subclass invoked a ML model to compute the subclassification of the document,  but for our experiments here, we will cheat because we are interested in measuring potential concurrency.   In fact the subclassification of the document is done by the split_titles() function in the SubClassifier actor.  ( The document titles we are using all come from ArXiv and have a top level tag and subclass already attached.  For example, the title ‘Spectral Measures on Locally Fields [math.FA]’ which is math with subtopic analysis. )

The simplified function is shown below.  The work is simulated by a sleep command.   We will use different sleep values for the final analysis.

The Ray Serve webserver

To complete the picture we need a way to feed documents to top-level classifier.  We do that by putting a webserver in front and using another program outside the cluster to send documents to the server as as fast as possible.    The diagram in Figure 2 below illustrates our “application”.

Figure 2.   The microservice-like collection of Ray actors and function invocations.  The clients send titles to  the webserver “server” as http put requests.  The server makes remote calls to the classifier instance which spawns compute_subclass invocations (labeled “classify()” here) and these invoke the send method on the subclassifiers.

The ray serve library is defined by backend objects and web endpoints.   In our case the backend object is an instance of the class server.

Note that this is not a ray.remote class.   Once Ray Serve creates an instance of the server which, in turn grabs an instance of the Classifier actor).  The Server object has an async/await coroutine that is now used in the Python version of ASGI, the Asynchronous Server Gateway Interface.   The call function invokes the send operation on the classifier.  To create and tie the  backend to the endpoint we do the following.

We can invoke this with

Serve is very flexible.  You can have more than one backend object tied to the same endpoint and you can specify what fraction of the incoming traffic goes to each backend. 

Some performance results

One can ask the question what is the performance implication of the parallel compute_subclass operations?   The amount of parallelism is defined by the parameter queue_size.  It is also limited by the size of the ray cluster and the amount of work compute_subclass does before terminating.   In our simple example the “work” is the length of time the function sleeps.   We ran the following experiments on AWS  with a single  multicore server with 16 virtual cores.   Setting queue_size to 0 make the operation serial and we could measure the time per web invocation for the sleep time.   For a sleep of t=3 seconds the round trip time per innovation was an average of 3.4 seconds.   For t=5, the time was 5.4 and for t=7 it was 7.4.  The overhead of just posting the data and returning was 0.12 seconds.  Hence beyond sleeping there is about .28 seconds of ray processing overhead per document.  We can now try different queue sizes and compute speed-up over the sequential time.

Figure 3.  Speed up over sequential for 40 documents sent to the server. 

These are not the most scientific results: each measurement was repeated only 3 times.  However, the full code available for tests of your own.   The maximum speed-up was 9.4 when the worker delay was 7 seconds and the queue length was 20.   The code and data are in Github.

Another simple test that involves real computation is to compute the value of Pi in parallel.   This involves no I/O so the core is truly busy computing.   The algorithm is the classic Monte Carlo method of throwing X dots into a square 2 on a size and counting the number that land inside the unit circle.  The fraction inside the circle is Pi/4.   In our case we set X to 106 and compute the average of 100 such trials.   The function to compute Pi is

The program that does the test is shown below.   It partitions the 100 Pi(10**6)  tasks in blocks and  each block will be executed by a single thread.

As can be seen the best time is when we compute 100 independent blocks.  The sequential time is when it is computed as a sequential series of 100 Pi tasks.   The speedup is 9.56.  We also did the same computation using Dask and Parsl.  In the chart below we show the relative performance .The graph shows the execution time on the vertical axis and the horizontal is the block sizes 4, 5, 10, 20, 100.   As you can see Dask is fastest but Ray in about the same when we have 100 blocks in parallel.

Figure 4.  Execution time of 100 invocations of Pi(10**6).  Blue=Parsl, Green=Ray and  Orange=Dask.

More Ray

Modin

One interesting extension of Ray is Modin,  a drop-in replacement for Pandas.   Data scientists  using Python have, to a large extent, settled on Pandas as a de facto standard for data manipulation.   Unfortunately, Pandas does not scale well.   Other alternatives out there are Dask DataFrames, Vaex, and the NVIDIA-backed RAPIDS tools.   For a comparison see   Scaling Pandas: Dask vs Ray vs Modin vs Vaex vs RAPIDS (datarevenue.com) and the Modin view of Scaling Pandas

There are two important features of Modin.   First is the fact that it is a drop-in replacement for Pandas.  Modin duplicates 80% of the massive Pandas  API including all of the most commonly used functions and it defaults to the original Pandas versions for the rest.  This means you only need to change one line:

import pandas as pd

to

import modin.pandas as pd

and your program will run as before.  The second important feature of Modin is performance.   Modin’s approach is based on an underlying algebra of operators that can be combined to build the pandas library.  They also make heavy use of lessons learned in the design of very large databases.  In our own experiments with Modin on the same multicore server used in the experiments above Modin performed between 2 time and 8 times better for standard Pandas tasks.  However, it underperformed Pandas for a few.   Where Modin performance really shines is on DataFrames that are many 10s of gigabytes in size running on Ray clusters of hundreds of cores.  We did not verify this claim, but the details are in the original Modin paper cited above. 

Ray Tune: Scalable Hyperparameter Tuning

Ray Tune is one of the most used Ray subsystems.   Tune is a framework for using parallelism to tune the hyperparameters of a ML model.   The importance of hyperparameter optimization is often overlooked when explaining ML algorithm.   The goal is to turn a good ML model into a great ML model.    Ray Tune is a system to use asynchronous parallelism to explore a parameter configuration space to find the set of parameters that yield the most accurate model given the test and training data.  We will Illustrate Tune with a simple example.   The following neural network has two parameters: l1 and l2.

These parameters describe the shape of the linear transforms at each of the three layers of the network.  When training the network, we can easily isolate two more parameters:  the learning rate and the batch size.    We can describe this parameter space in tune with the following.

This says that l1 and l2 are powers of 2 between 4 and 64 and the learning rate lr is drawn from the log uniform distribution and the batch size is one of those listed.  Tune will extract instances of the configuration parameters and train the model with each instance by using Ray’s asynchronous concurrent execution.  

To run Tune on the model you need to wrap it in a function that will encapsulate the model with instances of the configuration.  The wrapped model executes a standard (in this case torch) training loop followed by a validation loop which computes an average loss and accuracy for that set of parameters.

One of the most interesting part of Tune is how they schedule the training-validation tasks to search the space to give the optimal results.   The scheduler choices include HyperBand, Population Based Training and more.  The one we use here is Asynchronous Successive Halving Algorithm (ASHA) (Li, et.al. “[1810.05934v5] A System for Massively Parallel Hyperparameter Tuning (arxiv.org)” ) which, as the name suggests, implements a halving scheme that rejects regions that do not show  promise and uses a type of genetic algorithm to create promising new avenues.  We tell the scheduler we want to minimize a the loss function and invoke that with the tune.run() operator as follows.

We provide the complete code in a Jupyter notebook.  The model we are training to solve is the classic and trivial Iris classification,   so our small Neural network converges rapidly and Tune gives the results

Note that the test set accuracy was measured after ray completed by loading the model from checkpoint that Tune saved (see the Notebook for details)

Final Thoughts

There are many aspects of Ray that we have not discussed here.   One major omission is the way Ray deploys and manages clusters.   You can build a small cluster on AWS with a one command line.  Launching applications from the head node allows Ray to autoscale the cluster by adding new nodes if the computation demands more resources and then releases those node if no longer needed.

In the paragraphs above we focused on two Ray capabilities that were somewhat unique.   Ray’s actor model in which actors can persist in the cluster beyond the lifetime of the program that created them.  The second contribution from Ray that we found exciting was Tune.   A use case that is more impressive that our little Iris demo is the use of Tune with hugging face’s Bert model. See their blog.

Deep Learning with Real Data and the Chaotic Double Pendulum

In this note we provide a simple illustration of how the ‘classic’ deep learning tool LSTM can be used to model the  solution to time dependent differential equations.  We begin with an amazingly simple case: computing the trajectory of a projectile launched into the air,  encountering wind resistance as it flies and returns to earth.  Given a single sample trajectory, the network can demonstrate very realistic behavior for a variety of initial conditions. The second case is the double pendulum which is well known to generate chaotic behavior for certain initial conditions.   We show that a LSTM can effectively model the pendulum in the space of initial conditions up until it encounters the chaotic regions.

Using Neural Nets to Generate Streams of Data

Much of the recent excitement in deep learning has come from the ability of very large language models (such as gpt-3 from OpenAI) to generate streams of text to solve impressive tasks such as document summarization, writing poetry, engaging in dialog and even generating computer code.      In the case of scientific application, deep learning models have been used to tackle a number of interesting forecasting challenges.  

There are two major differences between natural language processing and physical system modeling.  First, natural language has a finite vocabulary and the model must include an embedding that maps words into vectors and a decoder  that maps the neural network outputs into a probability distribution over the vocabulary.   In the case of physical systems, the data often consists of  streams of vectors of real numbers and the embedding mapping may not be needed.  The second difference is that physical systems can often be modeled by differential equations derived from the laws of physics.   There are often strong parallels between the formulation of the differential equation and the structure of the neural network.   We examined this topic in a previous article where we described the work on Physically Inspired Neural Networks and Neural Ordinary Differential Equations.   Solving a time dependent differential equation is often stated as an initial value problem:  Given the equations and a consistent set of k critical variables at some time t0, a solution is an operator D that defines a trajectory through Rk that satisfies the equations.

This image has an empty alt attribute; its file name is Picture2.png

Unless a differential equation has a known, closed form solution, the standard approach is numerical integration. We can describe this as an operator Integrate which takes vector of  initial variable values at t0 ,V0 in Rk  Integrate( V0to a sequence Vi in Rk 

This image has an empty alt attribute; its file name is Picture3.png

so that the appropriate finite difference operations when applied to the sequence of Vi approximately satisfy the differential equations. The Integrate operator is normally based on a carefully crafted numerical differential equation solver and   the number of time steps (Tn) required to advance the solution to a desired end configuration may be very large and the computation may be very costly.   Alternatively, the Integrate operator may be based on a pretrained neural network which may be as accurate and easier to compute.   We are now seeing excellent examples of this idea.

Rajat Sen, Hsiang-Fu Yu, and Inderjit Dhillon consider the problem of forecasting involving large numbers of time series.  By using a clever matrix factorization they extract a basis set of vectors that can be used with a temporal convolutional network to solve some very large forecasting problems.  JCS Kadupitiya, Geoffrey C. Fox, and Vikram Jadhao consider classical molecular dynamics problems and they show that an integrator based on Long Short Term Memory (LSTMs) can solve the Newton differential equations of motion with time-steps that are well beyond what a standard integrator can accomplish.   J. Quetzalcóatl Toledo-Marín, Geoffrey Fox, James Sluka and James A. Glazier  describe how a convolutional neural network can be used as a “surrogate” for solving diffusion equations. 

In his draft “Deep Learning for Spatial Time Series”,  Geoffrey Fox generalizes the LSTM based differential equation integrators described above (and in our examples below) to the case of spatial as well as temporal time series.   To do this he shows how to integrate the encoder of a Transformer with and LSTM decoder.   We will return to the discussion of Fox’s report at the end of this post.

There are many variations of neural networks we could use to approach the problem of forecasting streaming data.   Transformers are the state-of-the-art for NLP, having totally eclipsed recurrent neural nets like the Long Short-term Memory (LSTM).   When put to the task of generating sentences LSTM tend to lose context very quickly and the results are disappointing.  However, as we shall see, in the case of real data streams that obey an underlying differential equation they can do very well in cases where the solutions are non-chaotic.   We will look at two  very simple  examples to illustrate these points

Case1.  The Trajectory of a Ball Thrown Encountering Wind Resistance.

Consider the path of a ball thrown with initial speed v and at an elevation angle θ.  As it flies it encounters a constant drag and eventually falls back to earth.  With the initial position as the origin, the differential equation that governs the path of our ball is given by  

This image has an empty alt attribute; its file name is Picture4.png

where g is the gravitational acceleration and u is the drag caused by the air resistance as illustrated in the figure  below.

This image has an empty alt attribute; its file name is Picture5-1.png

Figure 1.   The path of our ball given initial speed v and elevation angle θ

Wikipedia provides a derivation of these equations and a Python program that generates the numerical solution illustrated above.   We will use a modified version of that program here.   A standard ordinary differential equation solver (odeint) can be used to generate the path of the ball. 

This image has an empty alt attribute; its file name is Picture6-1-1024x326.png

We will use a simple LSTM network to learn to reproduce these solution.  The network will be trained so that if it is given 10 consecutive trajectory points (xi,yi) for i in [0,9] it will generate the next point (x10,y10).   To make the LSTM generate a path for a given value of v and θ we provide the first 10 points of the solution and then allow the LSTM to generate the next 100 points.

 We will use the most trivial form of a LSTM as shown below to define our model.

This image has an empty alt attribute; its file name is Picture7-1.png

The differential equation requires 4 variables (x, y, vx, vy)  to make a well posed initial condition,  however our model has an input that is only two dimensional.  To provide the velocity vector we give it a sequence of 10  (x,y) pairs, so our model is

This image has an empty alt attribute; its file name is Picture8-1.png

The way the model is trained is that we take a trajectory computed by the ODE Solver computed by the projectile_motion function and break that into segments of length 11.   The LSTM is given the first 10 pairs of points and it is trained to predict the 11th point.  This is a training set of size 100.  (The details of the training are in the Jupyter Notebook in the github repo for this article.  )

We have a simple function makeX(angle, speed) that will compute the ODE solution and time to ground.  The training set was generated with the path p = makeX(30, 500).   A second function  OneStep(path) that takes a computed path and uses the first 10 steps to start the model which goes until it reaches the ground.  OneStep (less a plotting component and the ground hitting computation) is shown below.   The data is first normalized to fit in the (-1, 1) range that neural net expects.  We then take the first 10 elements to start a loop which predicts the 11th step.  That is appended to the list and we continue to predict the next step until done. 

This image has an empty alt attribute; its file name is Picture1-1024x378.png

  If we run OneStep with the path from the training data it is a perfect fit.  This, of course is not surprising. However if we give it a different initial angle or different initial speed the network does an very good job on a range of initial conditions.  This is illustrated in Figures 2 and 3 below.

This image has an empty alt attribute; its file name is Picture10-1.png

Figure 2.  Comparing the ODE solver to the LSTM for a fixed speed and a range of angles.

This image has an empty alt attribute; its file name is Picture11-1-1024x386.png

Figure 3.   Comparing the ODE solver to the LSTM for a fixed angle and a range of speeds.

The further we move from our initial training path, the less accurate the result.  This is shown in Figure 4.

This image has an empty alt attribute; its file name is Picture12-1.png

Figure 4.   angle = 10, speed = 500.  The LSTM overshoots slightly but ends correctly.

We also trained the network on 4 different paths.  This improved the results slightly.

Case 2. The Double Pendulum

The double pendulum is another fascinating case with some interesting properties.  It consists of two pendulums, with the second suspended from the end of the first.  As shown in figure 5, it can be characterized by two angles θ1 and θ2 that measure the displacements from vertical, the length of the two pendulum arms L1 and L2 and the masses M1 and M2

This image has an empty alt attribute; its file name is Picture13-1.png

Figure 5.  The double pendulum

The equations of motion are in terms of the angles, the angular velocities and the accelerations.  The

This image has an empty alt attribute; its file name is Picture15-1-1024x290.png

We have adapted a very nice program by Mohammad Asif Zaman (github) by adding our LSTM network as above.   The initial condition is to hold the configure vector in a position for a given value for the two angles with the angular velocities both zero and release it.   We have a function
double_pendulum(V,t,m1,m2,L1,L2,g) that evaluates the differential equations given the configuration and we use the standard ODE solvers to integrate the solution forward from the initial configuration with

This image has an empty alt attribute; its file name is Picture16-1-1024x228.png

Using different values for parameters for d1 and d2 we can generate a variety of behaviors including some that are very chaotic. 

Our LSTM network is as before except its initial and final dimension is 4 rather than 2.  We train this as before but integrate much farther into the future.   (Again, the details are in the notebooks in github.)

This image has an empty alt attribute; its file name is Picture17-1.png

We have created a version of the OneStep function that allows us to see the trajectories.  It is more interesting to look at the trajectory of the lower mass.   In Figure 6 below we see on the right the positions of the double pendulum with the initial condition angles θ1= -60 degrees and θ2 = 60 degrees.   We then show the configurations at T = 20 and T = 80.

This image has an empty alt attribute; its file name is Picture18-1.png

Figure 6.   Double Pendulum.   On the left is the path of the ODE solution in red and the LSTM solution in green.

By looking at the path traces on the left the two solutions for this configuration diverge rapidly.   Using smaller angles we get much more stable solutions.   Below is the trace for θ1= -20 degrees and θ2 = 35 degrees.

This image has an empty alt attribute; its file name is Picture20-1.png

Figure 7.  Blue mark is starting point, yellow is when the two solutions begin to diverge.

Below we have the animated versions of the ODEsoltion and the LSTMsolution for this case. Below that in Figure 9 is the illustration of the chaotic case (-50, 140)

Figure 8 Left is ODE solution for -20 degrees, 35 degrees. Right is LSTM solution

Figure 9. Left is chaotic ODEsoltion for -50 degrees, 140 degrees. Right is failed LSTMsolution

Conclusions

The experiment described here used only the most trivial LSTM network and tiny training set sizes.   The initial value problems we considered were a two dimensional submanifold of the 4 dimensional space:  in the case of the projectile, the origin was fixed and the speed and angle were free variables and in the pendulum, we had the two angles free and the initial angular velocities were set to 0.  Each point of the initial conditions is a point on the configuration manifold (which is a cylinder defined by angle and speed for the projectile and a torus defined by the two angles of the pendulum).  The differential equations specify a unique path from that initial point along the configuration manifold.  The LSTM defines another path from the same initial point along the manifold.  The question is  how are these two paths related?

We noticed that training the LSTM with one solution allowed the LSTM to generate reasonably good solutions starting from points “near” the training initial value point.   It is interesting to ask about the topology of set of solutions S to the differential equations, in the region containing the solution generated by an initial value point x.   Consider the case of the projectile when the wind resistance is zero.  In this case the solutions are all parabolas with a second derivative that is -g.   This implies that any two solutions differ by a function that has a second derivative that is zero: hence they differ by a simple line.  In this case he topology of S is completely flat.   If the wind resistance is non-zero, the non-linearity of the equations tells us that the submanifold S is not completely flat.   If we consider the submanifold of solutions S’x for the LSTM trained on the single initial value point x, we noticed that values near x generated solutions in S’x that were very close to the solutions in S near those generated  by points near x.  We suspect that the two submanifolds and S’x share the same tangent plane at x.   We attempted to illustrate this in Figure 9 below.

This image has an empty alt attribute; its file name is Picture22.png

Figure 9.  Depiction of the tangent planes to two submanifolds that have the same solution for a given initial condition x.

In the case of the double pendulum,  the  shape of S is much more complex.  The LSTM manifold is a good approximation of the pendulum in the stable region, but once a starting point is found at the edge of the chaotic region the LSTM solution fall apart. 

In his draft ”Deep Learning for Spatial Time Series”, Fox shows a very general and more sophisticated application of neural networks to time series.   In particular, he addresses sequences that are not limited to time series as we have discussed here.  He describes a concept he calls a “spatial” bag which refers to a space-time grid of cells with some number of series values in each cell.  For example, cells may be locations where rainfall, runoff and other measures are collected over a period of time.   So location in the spatial bag is one axis and time is the other.   Making prediction along the spatial direction is  analogous to sequence to sequence transforms  and prediction along the time axis is similar to what we have described here.  The network his team used is hybrid with a Transformer as encoder and LSTM as a decoder.   Because the bag is an array of discrete cells,  the Transformer is a natural choice because the attention mechanism can draw out the important correlations between cells.  The LSTM decoder can generate streams of output in a manner similar to what we have done.   The paper shows convincing results for hydrology applications as well as COVID-19 studies.   I expect much more to come from this work.