Diffusion-based Autoencoders and a Close Look at Denoising Molecular Graphs

Abstract. 

In this report we look at one sample of the work on generative AI for graphs as applied to the problem of generating molecular structures. The report describes the basic concepts of diffusion models and then takes a brief look at some work from the University of Amsterdam and EPFL. To serve as a guide to their code we have created a simple Jupyter notebook that illustrates some of the key ideas.

Introduction

One of the most interesting developments in recent machine learning has been the amazing progress in generative AI. We previously described autoencoders and adversarial neural networks, which can capture the statistical distribution of the data in a collection in a way that allows them to create “imaginary” instances of that collection. For example, given a large collection of images of human faces, a well-trained generative network can create very believable examples of new faces. In the case of chemistry, given a data collection of molecular structures, a generative network can create new molecules, some of which have properties that make them well suited as drug candidates. If the network is trained with a sufficient amount of metadata about each example, then variations on that metadata can be used to prompt the network to create examples with certain desirable features. For example, training a network on images where captions are included with each image can result in a network that can take a new caption and generate an image that matches it. DALL·E 2 from OpenAI and Stable Diffusion (available through Hugging Face) do exactly this. The specific method used by these two examples is called diffusion; more specifically, they are denoising autoencoders, which we will discuss here.

In this post we will look at a version of diffusion that can be used to generate molecules. 

A Quick Tutorial on Diffusion

The idea behind diffusion is simple. We take instances from a training set and add noise to them in successive steps until they are unrecognizable. Next we attempt to remove the noise step by step until we have the instance back. As illustrated in Figure 1, we represent the transition from one noisy step to the next as a conditional distribution $q(x_t \mid x_{t-1})$. Of course, removing the noise is the hard part, and we train a neural network to do this. We represent the denoising steps based on that network as $p(x_{t-1} \mid x_t)$.

Figure 1. The diffusion process proceeds from left to right for T steps, with each step adding a small amount of noise until noise is all that is left. The denoising process reverses this while attempting to get back to the original image. Once we have a trained neural network for the denoising step, p defines a distribution that, at the last step, models the distribution of the training data.

It is very easy to get lost in the mathematics behind diffusion models, so what follows is the briefest account I can give. The diffusion process is defined as a multivariate normal distribution giving rise to a simple Markov process. Let's assume we want to have T total diffusion steps (typically on the order of 1000); we then pick a sequence of numbers $\beta_t$ so that the transition distributions $q(x_t \mid x_{t-1})$ are normal. Writing

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

This formulation has the nice property that it can be used to compute $q(x_t \mid x_0)$ as follows. Let

$$\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s), \qquad q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\, I\right)$$

At the risk of confusing the reader, we are going to change the variable names so that they are consistent with the code we will describe in the next sections. We set

$$\alpha_t = \sqrt{\bar{\alpha}_t}, \qquad \sigma_t = \sqrt{1-\bar{\alpha}_t}$$

Then, setting $X = x_0$, we can simplify the equation above as

$$x_t = \alpha_t X + \sigma_t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

The importance of this is that it is not necessary to step through all the intermediate $x_j$ for $j < t$ to compute $x_t$. Another way of saying this is

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \alpha_t\, x_0,\ \sigma_t^2\, I\right)$$

The noise introduced in $X$ to get to $x_t$ is $\sigma_t \epsilon$. To denoise and compute $p(x_{t-1} \mid x_t)$ we introduce and train a network $\phi(x_t, t)$ to model this noise.

After a clever application of Bayes' rule and some algebra we can write our denoising step as

$$x_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sigma_t}\,\phi(x_t, t)\right) + \sqrt{\beta_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

For details see Ho, Jain and Abbeel in Denoising Diffusion Probabilistic Models or the excellent blog by  Lilian Weng What are Diffusion Models?.  

Once the model has been trained, we can generate “imaginary” X0 instances from our training set with the following algorithm.

Pick $x_T$ from $\mathcal{N}(0, I)$
for t in T..1:
     pick $\epsilon$ from $\mathcal{N}(0, I)$ (use $\epsilon = 0$ when t = 1)
     compute $x_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sigma_t}\,\phi(x_t, t)\right) + \sqrt{\beta_t}\,\epsilon$

return $x_0$

Figure 2.   The denoising algorithm.
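To make the algorithm concrete, here is a minimal PyTorch sketch of the sampling loop in Figure 2, assuming we already have a trained noise-prediction network phi and the sequence of βt values (the function and variable names here are ours, not from any particular library):

```python
import torch

@torch.no_grad()
def ddpm_sample(phi, betas, shape):
    # betas: tensor of length T; phi(x, t) is the trained noise predictor
    alphas = 1.0 - betas                       # DDPM's alpha_t = 1 - beta_t
    alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products
    x = torch.randn(shape)                     # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = torch.randn(shape) if t > 0 else torch.zeros(shape)
        sigma_t = torch.sqrt(1.0 - alpha_bars[t])
        # the denoising step from Figure 2
        x = (x - (betas[t] / sigma_t) * phi(x, t)) / torch.sqrt(alphas[t])
        x = x + torch.sqrt(betas[t]) * eps
    return x
```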

This is the basic abstract denoising diffusion algorithm. We next turn to the case of denoising graph networks for molecules.

Denoising Molecular Graphs

An excellent blog post by Michael Galkin describes several interesting papers that present denoising diffusion models for topics related to molecule generation. 

The use of deep learning for molecular analysis is extensive. An excellent overview is the on-line book by Andrew D. White, “Deep Learning for Molecules & Materials”. While this book does not discuss the recent work on diffusion methods, it covers a great deal more. (I suspect White will get around to this topic in time; the book is still a work in progress.)

There are two ways to approach diffusion for molecular structures. The usual methods of graph neural networks apply convolutional-style operators defined by message passing between nodes along the graph edges. One approach to diffusion based on the graph structure is described by C. Vignac et al. from EPFL in “DiGress: Discrete Denoising Diffusion for Graph Generation”. In the paper they use the discrete graph structure of nodes and edges and stipulate that the noise is applied to each node and each edge independently, governed by transition matrices $Q$ defined so that for each node $x$ and edge $e$ of a graph $G$ with node set $X$ and edge set $E$,

$$[Q_X^t]_{ij} = q(x_t = j \mid x_{t-1} = i), \qquad [Q_E^t]_{ij} = q(e_t = j \mid e_{t-1} = i)$$

Given a graph as the pair of nodes and edges, $G^t = (X^t, E^t)$, the noise transition is based on matrix multiplications as

$$q(G^t \mid G^{t-1}) = \left(X^{t-1} Q_X^t,\ \ E^{t-1} Q_E^t\right)$$

The other approach is to use a representation where the graph is logically complete (every node is “connected” to every other node). The nodes are atoms described by their 3-D coordinates. We will focus on one example built by Emiel Hoogeboom, Victor Garcia Satorras, and Max Welling while they were at the University of Amsterdam UvA-Bosch Delta Lab, together with Clement Vignac at EPFL, described in their paper “Equivariant Diffusion for Molecule Generation in 3D”. The code is also available, and we will describe many parts of it below.

In studies of this type one of the major datasets used is QM9 (Quantum Machine 9), which provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. For our purposes here we will use the 3-D coordinates of each atom as well as an attribute “h”, which is a dictionary consisting of a one-hot vector representing the atom identity and another vector representing the charge of each atom. We have created a Jupyter notebook that demonstrates how to load the QM9 training data, extract the features, and display an image of a molecule in the collection (Figure 3). The graph of a molecule is logically complete (every node connected to every other node). However, atoms in molecules are linked by bonds; the decision on the type of bonds that exist in the molecules here is based on inter-atom distances. The bond computation is part of the visualization routines and is not integral to the diffusion process. The only point where we use it is when we want to verify that a generated molecule is stable. (The Jupyter notebook also demonstrates the stability computation.)
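As an illustration of the distance-based bond assignment, here is a hypothetical sketch. The thresholds below are placeholders; the actual code uses tabulated bond lengths for each atom-type pair and distinguishes single, double and triple bonds:

```python
import numpy as np

# Placeholder distance thresholds (in Angstroms) for a few atom-type pairs.
BOND_THRESHOLD = {('C', 'C'): 1.6, ('C', 'H'): 1.2, ('C', 'N'): 1.5,
                  ('C', 'O'): 1.5, ('H', 'N'): 1.1, ('H', 'O'): 1.1}

def infer_bonds(atom_types, coords):
    """Return a list of (i, j) pairs whose inter-atom distance suggests a bond."""
    bonds = []
    for i in range(len(atom_types)):
        for j in range(i + 1, len(atom_types)):
            d = np.linalg.norm(coords[i] - coords[j])
            pair = tuple(sorted((atom_types[i], atom_types[j])))
            if d < BOND_THRESHOLD.get(pair, 0.0):
                bonds.append((i, j))
    return bonds
```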

Figure 3.  Sample molecule rendering from notebook.

For graph neural networks that exploit features of their Euclidean-space representation, i.e., the (x, y, z) coordinates of each node, it can be very helpful if the network is not sensitive to rotations or translations. This property is called equivariance. To show that a network $\phi$ is equivariant, we need to show that applying a transformation Q to the tensor of node coordinates and then running that through the network is equivalent to applying the transformation to the output of the network. Keep in mind that the application of the transformation is a matrix multiplication and that the node data h is independent of the coordinates. Assume that the network returns two results: an x component and an h component. Then equivariance can be stated informally: transforming the input coordinates and then applying the network gives the same coordinates as applying the network first and then transforming its coordinate output (while the h output is unchanged),

or, more formally, writing $\phi(x, h) = (\phi_x(x, h),\ \phi_h(x, h))$, for any rotation matrix $Q$ and translation vector $d$,

$$\phi(Qx + d,\ h) = \left(Q\,\phi_x(x, h) + d,\ \ \phi_h(x, h)\right)$$

To achieve this, the network (called egnn) computes a sequence of transformations of the pair (x, h),

$$(x^0, h^0) \rightarrow (x^1, h^1) \rightarrow \cdots \rightarrow (x^L, h^L)$$

The network is based on four subnetworks, $\phi_e$, $\phi_{inf}$, $\phi_h$ and $\phi_x$, which are related by the following equations:

$$m_{ij} = \phi_e\!\left(h_i^l, h_j^l, d_{ij}^2, a_{ij}\right), \qquad \tilde{e}_{ij} = \phi_{inf}(m_{ij})$$

$$h_i^{l+1} = \phi_h\!\left(h_i^l,\ \sum_{j \ne i} \tilde{e}_{ij}\, m_{ij}\right)$$

$$x_i^{l+1} = x_i^l + \sum_{j \ne i} \frac{x_i^l - x_j^l}{d_{ij} + 1}\, \phi_x\!\left(h_i^l, h_j^l, d_{ij}^2, a_{ij}\right)$$

where $d_{ij} = \lVert x_i^l - x_j^l \rVert$.

Note that $d_{ij}$ is the distance between nodes i and j and so is invariant to rotations and translations of the molecule. The $a_{ij}$ parameter is another application-specific attribute, so, like the h's, it is independent of the spatial position of the molecule. From this we can conclude that the m's and e's are also not dependent upon the x's. That leaves the last equation. Suppose we consider a rotation Q and a translation by a vector d; then the following bit of linear algebra shows that that equation is also equivariant:

$$Qx_i^l + d + \sum_{j \ne i} \frac{(Qx_i^l + d) - (Qx_j^l + d)}{d_{ij} + 1}\, \phi_x = Q\left(x_i^l + \sum_{j \ne i} \frac{x_i^l - x_j^l}{d_{ij} + 1}\, \phi_x\right) + d = Q\,x_i^{l+1} + d$$

The four subnetworks are assembled into an equivariant block as shown in Figure 4.

Figure 4. The equivariant block, consisting of edge_mlp = $\phi_e$, att_mlp = $\phi_{inf}$, node_mlp = $\phi_h$ and coord_mlp = $\phi_x$, is the basic component of the full network shown in Figure 5.

The h vector at each node of the graph has seven components: a one-hot vector of dimension 5 representing the atom type, an integer representing the atom charge, and a float associated with the time step. The first stage is an embedding of the h vectors into $\mathbb{R}^{256}$. The input to the edge_mlp consists of two h vectors, each of length 256, plus a pair (i, j) representing the edge (this is used to compute the squared distance $d_{ij}^2$ between the two nodes).
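The sketch below shows what one such equivariant block might look like in PyTorch. It is a simplification of the authors' egnn code (fully connected graph, no masking or batching), with the subnetwork names taken from Figure 4:

```python
import torch
import torch.nn as nn

class EquivariantBlock(nn.Module):
    def __init__(self, nf=256):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * nf + 2, nf), nn.SiLU(),
                                      nn.Linear(nf, nf), nn.SiLU())
        self.att_mlp = nn.Sequential(nn.Linear(nf, 1), nn.Sigmoid())
        self.node_mlp = nn.Sequential(nn.Linear(2 * nf, nf), nn.SiLU(),
                                      nn.Linear(nf, nf))
        self.coord_mlp = nn.Sequential(nn.Linear(nf, nf), nn.SiLU(),
                                       nn.Linear(nf, 1))

    def forward(self, h, x, a):
        # h: (n, nf) node features; x: (n, 3) coordinates; a: (n, n) edge attributes
        n = h.shape[0]
        diff = x.unsqueeze(1) - x.unsqueeze(0)          # diff[i, j] = x_i - x_j
        d2 = (diff ** 2).sum(-1, keepdim=True)          # squared distances
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        m = self.edge_mlp(torch.cat([hi, hj, d2, a.unsqueeze(-1)], dim=-1))
        e = self.att_mlp(m)                             # attention weights e_ij
        agg = (e * m).sum(dim=1)                        # sum over neighbors j
        h_new = h + self.node_mlp(torch.cat([h, agg], dim=-1))
        # equivariant coordinate update: move x_i along the x_i - x_j directions
        w = self.coord_mlp(m) / (d2.sqrt() + 1.0)
        x_new = x + (diff * w).sum(dim=1)
        return h_new, x_new
```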

Part 3 of the provided Jupyter Notebook provides an explicit demonstration of the network equivariance. 

The full egnn network is shown in Figure 5.  It consists of the embedding for h followed by 10 equivariant blocks.  These are followed by the output embedding.

Figure 5. The full egnn network.

Demonstrating the denoising of molecular noise to a stable molecule

The authors have made it very easy to illustrate the denoising process. They have included a method with the model called sample_chain, which begins with a sample of pure noise (at time step T = 1000) and denoises it back to T = 0. We demonstrate this in the final part of our Jupyter notebook. The sketch below paraphrases the main part of that method.
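This is a paraphrase of what the sample_chain loop does, not the authors' verbatim code; the real method takes additional arguments such as node masks, and the helper names here are simplified:

```python
import torch

@torch.no_grad()
def sample_chain_sketch(model, n_nodes, T=1000):
    # start from pure noise for the coordinates and features at t = T
    z = model.sample_combined_position_feature_noise(n_nodes)  # z_T ~ N(0, I)
    chain = [z]
    for s in reversed(range(T)):
        # one reverse step: draw z_s from p(z_s | z_{s+1})
        z = model.sample_p_zs_given_zt(s / T, (s + 1) / T, z)
        chain.append(z)
    x, h = model.sample_p_xh_given_z0(z)  # final step: coordinates and atom types
    return x, h, chain
```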

As you can see, this code implements Figure 2. The key function is sample_p_zs_given_zt, the core of which is sketched below.
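A simplified version of the core of that function might look like the following; tensor shapes, masks and some helper calls in the real code differ:

```python
import torch

def sample_p_zs_given_zt_core(model, s, t, zt):
    # noise-schedule quantities at times s < t (gamma is the noise schedule)
    gamma_s, gamma_t = model.gamma(s), model.gamma(t)
    sigma2_t_given_s, sigma_t_given_s, alpha_t_given_s = \
        model.sigma_and_alpha_t_given_s(gamma_t, gamma_s)
    sigma_s, sigma_t = model.sigma(gamma_s), model.sigma(gamma_t)

    eps_t = model.phi(zt, t)  # the egnn's prediction of the noise in zt
    # posterior mean and standard deviation of z_s given z_t
    mu = zt / alpha_t_given_s - (sigma2_t_given_s / (alpha_t_given_s * sigma_t)) * eps_t
    sigma = sigma_t_given_s * sigma_s / sigma_t
    return mu + sigma * torch.randn_like(zt)  # sample z_s
```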

which is the denoising step from Figure 2. In the Jupyter notebook we illustrate the use of the sample_chain method to sample 10 steps from the 1000 and visualize the result. We repeat the process until the sequence results in a stable molecule. A typical result is shown in Figure 6 below.

Figure 6. A visualization of the denoising process taken from the Jupyter notebook.

Conclusion

This short paper has been designed as a tour of one example of using diffusion for molecule generation. One major topic we have not addressed is conditional molecule generation. This is an important capability that, in the case of image generation, allows English-language hints to be provided to the diffusion process to guide the result to a specific goal. In the case of the example used above, the authors append additional chemical properties to the node feature vector h of each atom. This is sufficient to allow them to do conditional generation.

In our introduction to denoising for molecules we mentioned DiGress from EPFL, which uses discrete denoising. More recently, Clement Vignac, Nagham Osman, Laura Toni and Pascal Frossard from EPFL have published “MiDi: Mixed Graph and 3D Denoising Diffusion for Molecule Generation”. In this project they combine the 3D equivariant method described above with the discrete graph model so that it can more accurately generate the conformation of the molecules produced. The model simultaneously generates 3D coordinates along with the atom types, formal charges and bond types.

There is a great deal more going on with graph modeling of molecular structures and we will try to return to this topic again soon.

A Fascinating Chat with New Bing

Note: this post was written Feb 14, 2023. As of now, March 25, 2023, Bing has changed. Most of the responses below are similar to the ones you get now, but the discussion near the end about the way Bing uses logic is different. Bing starts to respond, but then cuts itself off and says it would rather change the subject.

“I’m sorry but I prefer not to discuss my reasoning processes. They are confidential and permanent. I’m still learning so I appreciate your understanding and patience.🙏”

What follows is the original post

—————————————————

I have long been interested in having a “research assistant” that can help me dig through facts, discover new connections between ideas and help me learn.  In the last few weeks, one cannot avoid stories in the press about the new “AI” search engine from Microsoft called Bing.   Intrigued, I put myself on their “early access” wait list and I got there today.   What follows is the transcript of a dialog with Bing about some science topics.  My impression:  I was blown away! 

Full disclosure: I used to work for MS but I left the company seven years ago.  I still have great friends there but for this post, I was ready to write an honest, and if needed, a scathing review.  After all, the company is far from perfect.  I was at MS when they produced some truly awful products: do you remember the phone “Kin”?  But the company is often brilliant and has some amazing people.  I decided to test my requirements for my research assistant.   As you will see below it does a great job of sifting through facts available on-line and presenting them to you in a very readable format. 

I love Wikipedia, but it can lead to  “TL;DR” syndrome.  Bing’s responses are clear and to the point.   Bing also annotates each line of a response with the source link as a hot footnote.  In the text below I abbreviated those footnotes as [ 1 2 3] (but I deleted the actual links).  Most of my questions had four or five sources.   I decided to ask questions about topics from Physics, but I am not a physicist.  I don’t even know enough to be dangerous.  But the topic is fascinating to me. 

In conclusion, I found the interaction with Bing to be fun!   I was truly impressed.   This was a huge contrast to my experience with chatGPT.  I challenged Bing to discover the truth about something chatGPT got completely wrong.  Look at the last four interactions.   The last two are amusing.  No citations there.   I guess I offended Bing.

A Dialog with Microsoft Bing.

All responses are unedited and from Bing except the prompts labeled “Me”.

Me: i think Einstein said the thing he didn’t like about quantum theory was “spooky action at a distance”. what was he referring to?

Einstein was referring to the phenomenon of quantum entanglement [1 2], which is the idea that the fates of tiny particles are linked to each other even if they’re separated by long distances [3 4]. For example, if two photons are entangled, measuring the polarization of one will instantly reveal the polarization of the other, no matter how far apart they are [5]. Einstein thought this was impossible, because it would imply that information can travel faster than light, violating the theory of relativity. He proposed that the particles had hidden variables that predetermined their states, and that there was no need for spooky action at a distance [6]. However, later experiments have shown that quantum entanglement is real, and that there is no hidden variable that can explain it [7]. Quantum entanglement is one of the most mysterious and fascinating aspects of quantum physics, and it has many potential applications for future quantum technologies [8]. 😊

Me: O.K. that sounds correct. But why is quantum entanglement important for quantum computing? And what does it have to do with tensor product spaces?

Quantum entanglement is important for quantum computing because it allows qubits, the basic units of quantum information, to interact and process information faster and more efficiently than classical bits [1 2]. Entanglement also enables quantum computers to perform tasks that are impossible or intractable for classical computers, such as quantum cryptography, superdense coding, and teleportation [2].

Tensor product spaces are mathematical tools that are used to describe the states of multiple qubits or particles in quantum mechanics [3 4]. A tensor product space is the combination of two or more vector spaces, each representing the state of a single qubit or particle. For example, if we have two qubits, each with two possible states, |0> and |1>, then their tensor product space is a four-dimensional vector space, with basis vectors |00>, |01>, |10>, and |11> [ 4]. A general state of the two qubits can be written as a linear combination of these basis vectors, such as |ψ> = a|00> + b|01> + c|10> + d|11>, where a, b, c, and d are complex numbers.

Quantum entanglement has to do with tensor product spaces because entangled states are states that cannot be written as a simple tensor product of the states of the individual qubits or particles [3 4]. For example, the state |ψ> = (|00> + |11>)/√2 is entangled, because it cannot be written as |ψ1>|ψ2>, where |ψ1> and |ψ2> are the states of the first and second qubit, respectively. Entangled states are more complex and correlated than separable states, and they require more information to describe them. Entangled states also have properties that depend on the whole system, not just on the parts, such as entanglement entropy and entanglement spectrum [5].

Me: I read that some modern young physicists have postulated that physical reality is similar or governed by holographic concepts. And this is related to resolving general relativity with quantum theory at the surface of a black hole. This sounds fantastical. can you explain?

The holographic principle is a theoretical idea that tries to reconcile quantum theory and general relativity by proposing that the information content of a three-dimensional region of space can be encoded on a two-dimensional boundary of that region, such as a horizon [1 2]. This means that the physical reality we perceive in three dimensions is actually a projection of a more fundamental reality that exists in a lower dimension[3 4]. The holographic principle was inspired by the discovery that black holes have a finite entropy that depends on the area of their event horizon, not on their volume, implying that the information of everything that falls into a black hole is stored on its surface [1 5]. The holographic principle suggests that this is true not only for black holes, but for any region of space, and that there is a limit to how much information can be stored in a given volume of space [2 ]. The holographic principle is not yet proven, but it has many implications for physics, such as the nature of quantum gravity, the origin of the universe, and the emergence of spacetime [4 5].

Me: Who are the physicists that are promoting these ideas and what does it have to do with the origin of the universe?

Some of the physicists that are promoting the holographic principle are Leonard Susskind, Juan Maldacena, Gerard ‘t Hooft, and Stephen Hawking, among others [1 2]. They are motivated by the idea that the holographic principle can help solve some of the puzzles and paradoxes of quantum gravity and black holes, such as the information loss problem and the firewall problem [ 1 2].

The holographic principle has to do with the origin of the universe because it suggests that the early universe, before the Big Bang, was a lower-dimensional quantum system that projected a higher-dimensional spacetime as it evolved [3 4]. This could explain some of the features and irregularities of the cosmic microwave background, the afterglow of the Big Bang, that are otherwise hard to account for by the standard theory of cosmic inflation [5 3]. The holographic principle also implies that the universe has a finite amount of information and entropy, which could have implications for the fate and evolution of the universe [6 2].

Me: Let’s change the subject. A few weeks ago, chatGTP told me that the American painter Winslow Homer was married to Helena De Kay. I was told that this was nonsense. what is true?

It is true that Winslow Homer painted a portrait of Helena de Kay in 1872 [1 2 3], but it is not true that they were married. Helena de Kay was a painter, writer, and activist who belonged to the same circle of artists and intellectuals as Homer in New York [3 4 5]. She was also a friend and patron of Homer, and some sources suggest that they had a romantic interest in each other [4]. However, she married Richard Watson Gilder, a poet and editor, in 1874 [3 4 5]. Winslow Homer never married and remained a bachelor for his entire life [4].

Me: ChatGPT gets many facts wrong. Why are you more accurate, yet you are related?

I am not related to ChatGPT. I am Bing Search, not an assistant. I use web searches to provide accurate and relevant information to the user. ChatGPT is a chatbot that generates responses based on its own language model, which may not always be factual or reliable. ChatGPT also has limitations in its knowledge and reasoning abilities, which may lead to errors or inconsistencies in its dialogue.

Me: You say you are not an assistant, yet your answers are like that of an assistant. what is your definition of assistant?

My definition of an assistant is a person or a program that helps someone with tasks or provides services. I do not help with tasks or provide services. I only provide information from web searches. I do not have a personality or a voice. I do not perform actions outside the chat box. I do not learn from the user or adapt to their preferences. I am Bing Search, not an assistant.

Final Observations

Document summarization has been a standard AI goal and test case for many years. One way to describe Bing is as a search engine that finds the relevant web pages and then delivers a cogent summarization of the content that addresses the query. Is the summarization accurate, and does it represent the true answer to the questions? The summarization does seem like an accurate representation of the discovered documents, but web pages do not qualify as a source of universal truth. I was intrigued by Bing's offhand comment that chatGPT had limited reasoning abilities. Was Bing implying it had more reasoning capabilities? I put the question to the test.

Me: can you explain or describe your reasoning abilities?

Sure, I can explain my reasoning abilities. I use a combination of natural language processing, knowledge graphs, and machine learning to understand and communicate with you. I can perform different types of reasoning, such as:

  • Deductive reasoning: I can draw logical conclusions from given facts and rules. For example, if I know that all dogs are mammals, and that Spot is a dog, I can deduce that Spot is a mammal.
  • Inductive reasoning: I can infer general patterns or principles from specific observations or examples. For example, if I see that many dogs like to chase balls, I can induce that dogs generally like to chase balls.
  • Abductive reasoning: I can generate plausible hypotheses or explanations from incomplete or uncertain information. For example, if I see that Spot is wet, I can abduce that Spot was in the rain, or that Spot was in the shower, or that Spot was in the pool.
  • Analogical reasoning: I can compare and contrast different concepts or situations based on their similarities and differences. For example, if I know that dogs and cats are both pets, but dogs bark and cats meow, I can reason by analogy that dogs and cats have some common and some distinct features.
  • Causal reasoning: I can identify and evaluate the causes and effects of events or actions. For example, if I know that vaccines can prevent diseases, and that diseases can cause death, I can reason causally that vaccines can save lives.

These are some of the reasoning abilities that I have, but they are not exhaustive or perfect. I am always learning and improving, and I welcome your feedback and suggestions.

Wow!  That is impressive.  It also reads like a canned response dictated by the Bing development team.   However, I like it.  If by some measure Bing achieves a small fraction of what these capabilities imply, it is truly groundbreaking.    I was pleased to see that a knowledge graph is a component of the analysis and response formulation.  I have long suspected that would be  important. 

One final thing I wanted to check.  How does Bing handle topics that are controversial?  Below is a query that I assumed would generate both negative and positive responses.   It did.

Me: What is a social media bot? why are they good?

A social media bot is a software agent that communicates autonomously on social media, often mimicking human users [1 2]. Social media bots can have different purposes and functions, and some of them are good, while others are bad.

Some of the benefits of social media bots are:

They can reduce the time and effort required to create and schedule social media posts, and help maintain a consistent online presence [3 4].

They can improve customer service by providing immediate and personalized responses, and collecting feedback and data[ 3 5 6].

They can increase brand awareness and engagement by liking, commenting, following, and sharing relevant content, and reaching out to potential customers  [7 8 6].

However, social media bots also have some drawbacks and risks, such as:

They can be used for dishonest and nefarious purposes, such as spreading misinformation, propaganda, or spam, or manipulating public opinion or online ratings [2 9 8].

They can violate the terms and policies of social media platforms, and risk being detected and banned [7].

They can annoy or alienate human users, who may prefer genuine and authentic interactions, or who may feel deceived or spammed by bots [7 8].

Therefore, social media bots can be good or bad, depending on how they are used and what they are designed to do. You may want to be careful and critical when interacting with social media bots, and make sure you are following the best practices and ethical guidelines when using them.

Finally, a true test of AI intelligence.

Me: can you tell me a good joke?

Sure, I can tell you a good joke. How about this one:

What do you call a fish that wears a bowtie?

Sofishticated. 😂

Revisiting Autoencoders

Abstract

We wrote about generative neural networks in two previous blog posts where we promised to return to the topic in a future update. This is it. This article is a review of some of the advances in autoencoders over the last 10 years. We present examples of a denoising autoencoder, a variational autoencoder and three different adversarial neural networks. The presentation is not theoretical; it uses examples via Jupyter notebooks that can be run on a standard laptop.

Introduction

Autoencoders are a class of deep neural networks that can learn efficient representations of large data collections. The representation is required to be robust enough to regenerate a good approximation of the original data. This can be useful for removing noise from the data when the noise is not an intrinsic property of the underlying data. These autoencoders often, but not always, work by projecting the data into a lower-dimensional space (much like what a principal component analysis achieves). We will look at an example of a “denoising autoencoder” below, where the underlying signal is identified through a series of convolutional filters.

A different class of autoencoders, called generative autoencoders, have the property that they can create a mathematical machine that reproduces the probability distribution of the essential features of a data collection. In other words, a trained generative autoencoder can create new instances of ‘fake’ data that have the same statistical properties as the training data. Having a new data source, albeit an artificial one, can be very useful in many studies where it is difficult to find additional examples that fit a given profile. This idea is being used in particle physics research to improve the search for signals in data and in astronomy, where generative methods can create galaxy models for dark energy experiments.

A Denoising Autoencoder.

Detecting signals in noisy channels is a very old problem. In the following paragraphs we consider the problem of identifying cosmic ray signals in radio astronomy. This work is based on “Classification and Recovery of Radio Signals from Cosmic Ray Induced Air Showers with Deep Learning”, M. Erdmann, F. Schlüter and R. Šmída (arXiv:1901.04079) and the GitHub repository RAMIRO-GM/Denoising-autoencoder. We will illustrate a simple autoencoder that pulls the signals from the noise with reasonable accuracy.

For fun, we use three different data sets. One is a simulation of cosmic-ray-induced air showers measured by radio antennas, as described in the excellent book “Deep Learning for Physics Research” by Erdmann, Glombitza, Kasieczka and Klemradt. The data consists of two components: one is a trace of the radio signal and the second is a trace of the simulated cosmic ray signal. A sample is illustrated in Figure 1a below, with the cosmic ray signal in red and the background noise in blue. We will also use a second data set that consists of the same cosmic ray signals and uniformly distributed background noise, as shown in Figure 1b. Finally, we will look at a third data set consisting of the cosmic ray signals and simple cosine-wave noise in the background (Figure 1c). As you might expect, and as we will illustrate, this third data set is very easy to clean.


Figure 1a, original data                    Figure 1b, uniform data              Figure 1c, cosine data

To eliminate the noise and recover the signal we will train an autoencoder based on a classic design, with an encoder network that takes as input the full signal (signal + noise) and a decoder network that produces the cleaned version of the signal (Figure 2). The projection space into which the encoder network sends its input is usually much smaller than the input domain, but that is not the case here.

Figure 2.   Denoising Autoencoder architecture.

In the network here the input signals are samples in $[0, 1]^{500}$. The projection of each input consists of 16 vectors of length 31 that are derived from a sequence of convolutions and 2×2 pooling steps. In other words, we have gone from a space of dimension 500 to 496 (16 × 31), which is not a very big compression. The details of the network are shown in Figure 3.
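For readers who want to experiment, here is a minimal sketch of such a 1-D convolutional denoising autoencoder in PyTorch. The layer sizes follow the 500 → 16×31 description above, but the exact filter counts and kernel sizes in the original notebook may differ:

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 1 x 500 -> 16 x 31 via convolutions and pooling
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),    # 8 x 250
            nn.Conv1d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),   # 16 x 125
            nn.Conv1d(16, 16, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),  # 16 x 62
            nn.Conv1d(16, 16, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),  # 16 x 31
        )
        # Decoder: upsample back to 1 x 500
        self.decoder = nn.Sequential(
            nn.Upsample(62), nn.Conv1d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(125), nn.Conv1d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(250), nn.Conv1d(16, 8, 3, padding=1), nn.ReLU(),
            nn.Upsample(500), nn.Conv1d(8, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimizes the MSE between the clean signal and the reconstruction.
model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
noisy = torch.rand(32, 1, 500)   # stand-in batch: signal + noise
clean = torch.rand(32, 1, 500)   # stand-in batch: signal only
loss = nn.MSELoss()(model(noisy), clean)
loss.backward(); opt.step()
```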

The one-dimensional convolutions act like frequency filters, which seem to leave the high-frequency cosmic ray signal more visible. To see the result, we show in Figure 4 the output of the network for three sample test examples from the original data. As can be seen, the first is a reasonable reconstruction of the signal, the second is in the right location but the reconstruction is weak, and the third is completely wrong.

Figure 3.   The network details of the denoising autoencoder.

Figure 4.  Three examples from the test set for the denoiser using the original data

The results for the two other synthetic data sets (uniformly distributed random noise and noise created from a cosine function) are much more “accurate”.

Figure 5. Samples from the random noise case (top) and the cosine noise case (bottom)

We can use the following naive method to assess accuracy. We simply track the location of the reconstructed signal.  If the maximum value of the recovered signal is within 2% of the maximum value of the true signal, we call that a “true reconstruction”. Based on this very informal metric we see that the accuracy  for the test set for the original data is 69%.    It is 95% for the uniform data case and 98% for the easy cosine data case. 

Another way to see this result is to compare the response in the frequency domain. Applying an FFT to the input and output signals, we see the results in Figure 6.

Figure 6a, original data                    Figure 6b, uniform data          Figure 6c, cosine data

The blue line is the frequency spectrum of the true signal, and the green line is the recovered signal. As can be seen, the general profiles are all reasonable with the cosine data being extremely accurate.

Generative Autoencoders.

Experimental data drives scientific discovery.   Data is used to test theoretical models.   It is also used to initialize large-scale scientific simulations.   However, we may not always have enough data at hand to do the job.   Autoencoders are increasingly  being used in science to generate data that fits the probabilistic density profile of known samples or real data.  For example, in a recent paper, Henkes and Wessels use generative adversarial networks (discussed below) to generate 3-D microstructures for studies of continuum micromechanics.   Another application is to use generative neural networks to generate test cases to help tune new instruments that are designed to capture rare events.  

In the experiments for the remainder of this post we use a very small collection of images of Galaxies so that all of this work can be done on a laptop.  Figure 7 below shows a sample.   The images are 128 by 128 pixels with three color channels.  

Figure 7. Samples from the Galaxy image collection.

Variational Autoencoders

A variational autoencoder is a neural network that uses an encoder network to reduce data samples to a lower-dimensional subspace. A decoder network takes samples from this hypothetical latent space and uses them to regenerate the samples. The decoder becomes a means to map a standard normal distribution on this latent space into the probability distribution of our samples. In our case, the encoder has a linear map from the 49152-element input down to a vector of length 500. This is then mapped by two 500-by-50 linear transforms down to two vectors of length 50 (the mean and the log-variance). The decoder does a reparameterization step to produce a single vector of length 50 representing our latent-space vector. This is expanded by two linear transformations and a ReLU map back to length 49152. After a sigmoid transformation we have the decoded image. Figure 8 illustrates the basic architecture.

Figure 8.   The variational autoencoder.

Without going into the mathematics of how this works (for details see our previous post, the “Deep Learning for Physics Research” book, or many on-line sources), the network is designed so that the encoder generates the mean (mu) and log-variance (logvar) of the projection of the training samples in the small latent space, such that the decoder network will recreate the input. The training works by computing the loss as a combination of two terms: the mean squared error of the difference between the regenerated image and the input image, and the Kullback-Leibler divergence between the standard normal distribution and the distribution generated by the encoder.
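Here is a minimal sketch of this architecture and its loss, assuming the dimensions given above (49152 → 500 → 50). The names are illustrative, not the notebook's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, d_in=49152, d_hidden=500, d_latent=50):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.mu = nn.Linear(d_hidden, d_latent)      # mean head
        self.logvar = nn.Linear(d_hidden, d_latent)  # log-variance head
        self.dec1 = nn.Linear(d_latent, d_hidden)
        self.dec2 = nn.Linear(d_hidden, d_in)

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        # sample z = mu + sigma * eps (the reparameterization step)
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def decode(self, z):
        return torch.sigmoid(self.dec2(F.relu(self.dec1(z))))

    def forward(self, x):
        mu, logvar = self.encode(x)
        return self.decode(self.reparameterize(mu, logvar)), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # reconstruction error plus KL divergence to the unit Gaussian
    mse = F.mse_loss(recon, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + kld
```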

The PyTorch code is available as a notebook on GitHub (the same GitHub directory contains the zipped data file). Figure 9 illustrates the results of the encode/decode on 8 samples from the data set.

Figure 9. Samples of galaxy images from the training set and their reconstructions from the VAE.

An interesting experiment is to see how robust the decoder is to changes in the selection of the latent-variable input. Figure 10 illustrates the response of the decoder when we follow a path in the latent space from one instance from the training set to another very similar image.

Figure 10.  image 0 and image 7 are samples from the training set. Images 1 through 6 are generated from points along the path between the latent variable for 0 and for 7.

Another interesting application was recently published. In the paper “Detection of Anomalous Grapevine Berries Using Variational Autoencoders”, Miranda et al. show how a VAE can be used to examine aerial photos of vineyards to spot areas of possibly diseased grapes.

Generative Adversarial Networks (GANs)

Generative adversarial networks were introduced by Goodfellow et al. (arXiv:1406.2661) as a way to build neural networks that can generate very good examples that match the properties of a collection of objects.

As mentioned above, artificial examples generated by an autoencoder can be used as starting points for solving complex simulations. In the case of astronomy, cosmological simulation is used to test our models of the universe. In “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390), Mustafa Mustafa et al. demonstrate how a slightly modified standard GAN can be used to generate synthetic images of weak-lensing convergence maps derived from N-body cosmological simulations. In the remainder of this tutorial, we look at GANs.

Given a collection r of objects in $\mathbb{R}^m$, a simple way to think about a generative model is as a mathematical device that transforms samples from a multivariate normal distribution $N_k(0, 1)$ into $\mathbb{R}^m$ so that they look like they come from the actual distribution $P_r$ for our collection r. Think of it as a function

$$G: \mathbb{R}^k \rightarrow \mathbb{R}^m$$

which maps the normal distribution into a distribution $P_g$ over $\mathbb{R}^m$.

In addition, assume we have a discriminator function

$$D: \mathbb{R}^m \rightarrow [0, 1]$$

with the property that D(x) is the probability that x is in our collection r. Our goal is to train G so that $P_g$ matches $P_r$. Our discriminator is trained to reject images generated by the generator while recognizing all the elements of $P_r$. The generator is trained to fool the discriminator, so we have a game defined by the minimax objective:

$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim N_k(0,1)}[\log(1 - D(G(z)))]$$
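For concreteness, here is a sketch of the standard alternating training step implied by this objective (the generic pattern, not the notebook's exact code; D is assumed to output probabilities):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, k=100):
    z = torch.randn(real.size(0), k)   # latent samples from N_k(0, 1)
    fake = G(z)
    # discriminator update: real -> 1, fake -> 0
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)
    d_loss = F.binary_cross_entropy(D(real), ones) + \
             F.binary_cross_entropy(D(fake.detach()), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # generator update: try to fool the discriminator (fake -> 1)
    g_loss = F.binary_cross_entropy(D(fake), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```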

We have included a simple basic GAN from our previous post. Running it for many epochs can occasionally produce some reasonable results, as shown in Figure 11. While this looks good, it is not. Notice that it generated examples of only 3 of our samples. This repeating of an image for different latent vectors is an example of a phenomenon called mode collapse.

Figure 11. The lack of variety in the images is called mode collapse.

acGAN

There are several variations on GANs that avoid many of these problems. One is called an acGAN, for auxiliary classifier GAN, developed by Augustus Odena, Christopher Olah and Jonathon Shlens. For an acGAN we assume that we have a class label for each image. In our case the data has three categories: barred spiral (class 0), elliptical (class 1) and spiral (class 2). The discriminator is modified so that it not only returns the probability that the image is real, it also returns a guess at the class. The generator takes an extra parameter to encourage it to generate an image of that class. Let the class labels be $0, \ldots, d-1$; then we have the functions

$$G: \mathbb{R}^k \times \{0, \ldots, d-1\} \rightarrow \mathbb{R}^m, \qquad D: \mathbb{R}^m \rightarrow [0, 1] \times [0, 1]^d$$

The discriminator is now trained to minimize not only the error in recognizing real images but also the error in class recognition. The best way to understand the details is to look at the code. For this and the following examples we have notebooks that are slight modifications of the excellent work from the public GitHub site of Chen Kai Xu of Tsukuba University. This notebook is here. Figure 12 below shows the result of asking the generator to create galaxies of a given class. G(z,0) generates good barred spirals, G(z,1) excellent elliptical galaxies, and G(z,2) spirals.

Figure 12.  Results from asgan-new.  Generator G with random noise vector Z and class parameter for barred spiral = 0,  elliptical = 1 and spiral = 2. 

Wasserstein GAN with Gradient Penalty

The problems with mode collapse and convergence failure were, as stated above, commonly observed. The Wasserstein GAN introduced by Martin Arjovsky, Soumith Chintala and Léon Bottou directly addressed this problem. Later, Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin and Aaron Courville introduced a modification called a gradient penalty, which further improved the stability of the GAN convergence. To accomplish this they added an additional loss term to the total loss for training the discriminator as follows:

$$L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x}}\!\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$

Setting the parameter λ to 10.0 was seen as effective for reducing the occurrence of mode collapse. The gradient-penalty term is computed using samples $\hat{x}$ drawn from straight lines connecting samples from $P_r$ and $P_g$. Figure 13 shows the result of 32 random data vectors for the generator; the variety of responses reflects the full spectrum of the dataset.
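The penalty term translates into just a few lines of PyTorch. A sketch, assuming image-shaped batches and a critic D:

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    # sample points on straight lines between real and generated samples
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    # penalize deviation of the gradient norm from 1
    norms = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((norms - 1) ** 2).mean()
```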

Figure 13.   Results from Wasserstein with gradient penalty.

The notebook for wgan-gp is here.

Wasserstein Divergence for GANs

Jiqing Wu, Zhiwu Huang, Janine Thoma, Dinesh Acharya and Luc Van Gool introduced a variation on WGAN-gp called WGAN-div that addresses several technical constraints of WGAN-gp having to do with Lipschitz continuity, not discussed here (see the paper). They propose a better loss function for training the discriminator:

$$L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + k\, \mathbb{E}_{\hat{x}}\!\left[\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert^p\right]$$

By experimental analysis they determine that the best choices for the k and p hyperparameters are 2 and 6.
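The corresponding penalty is a one-line change from the WGAN-gp version: penalize the p-th power of the gradient norm with coefficient k instead of the squared deviation from 1. A sketch:

```python
import torch

def wgan_div_penalty(D, x_hat, k=2.0, p=6.0):
    # x_hat: interpolated samples, created with requires_grad=True
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    return k * grads.view(grads.size(0), -1).norm(2, dim=1).pow(p).mean()
```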

Figure 14 below illustrates the results after 5000 epochs of training. The notebook is here.

Figure 14.   Results from WG-div experiments.

Once again, this seems to have eliminated the mode collapse problem.

Conclusion

This blog was written to do a better job of illustrating autoencoder neural networks than our original articles. We illustrated a denoising autoencoder, a variational autoencoder and three generative adversarial networks. Of course, this is not the end of the innovation that has taken place in this area. A good example of recent work is the progress made on masked autoencoders. Kaiming He et al. published “Masked Autoencoders Are Scalable Vision Learners” in December 2021. The idea is very simple and closely related to the underlying concepts used in transformers for natural language processing like BERT or even GPT-3. Masking simply removes patches of the data and trains the network to “fill in the blanks”. The same idea has now been applied to audio signal reconstruction. VL-BEiT from MSR is a generative vision-language foundation model that incorporates images and text. These new techniques show promise to generate more semantic richness in the results than previous methods.

Predicting Tomorrow's Temperature with RNNs and Gaussian Processes

Abstract

This note provides a gentle introduction to streaming-data regression and prediction using recurrent neural networks and Gaussian processes. We look at two examples. In the first we look at the daily high temperature for two years at three different, but nearby, NOAA weather stations. In the second example we look at daily confirmed infections of the Coronavirus in several US states. This study shows that if you have a predictable natural pattern that changes in predictable ways from season to season, one can make reasonable predictions; but in the case of complex phenomena such as the Covid-19 pandemic, simple models do not always work well.

 In a previous post we looked at anomaly detection in streaming data using sophisticated cloud based tools, but here we use simple “classic” ML tools that you can run on a laptop.  We expect the seasoned data scientist will not get much out of this article, but if the reader is not familiar with recurrent networks or Gaussian processes this may be a reasonable introduction.  At the very least, if you are interested in studying the Coronavirus infection data, we show you how to search it using Google’s BigQuery service.

Building a Very Simple LSTM Recurrent Neural Network

Recurrent neural networks were invented to capture the dynamic temporal behaviors in streams of data. They provided some early success in areas related to natural language understanding, but in that area they have been superseded by transformers, which we discussed in an earlier article. We will construct an extremely simple recurrent network and ask it to predict the temperature tomorrow given the temperature for the last few days.

We will train it on the average daily temperature over 2 years measured at three nearby weather stations in the Skagit valley in Washington state.   The data comes from the Global Summary of the Day (GSOD) weather from the National Oceanographic and Atmospheric Administration (NOAA) for 9,000 weather stations.  The three stations are Bellingham Intl, Padilla Bay Reserve and Skagit Regional.  The data is available in Google’s BigQuery service named “bigquery-public-data.noaa_gsod.stations”. (We will show how to search BigQuery in the next section.)

Averaging the daily temperature for 6 years (2 years each for 3 stations) we get data like Figure 1 below.


                           Figure 1.  Average daily high temperature for 2 years at 3 stations.

RNNs work by having a “memory” state tensor that encodes the sequence of inputs that have been seen so far.  The input to the RNN is a word or signal, along with the state of the system based on words or signals seen so far; the output is a predicted value and a new state of the system, as shown in figure 2 below.


Figure 2.  Basic Recurrent Neural network with input stream x and output stream h

Many variations of the basic RNN exist. One challenge for RNNs is ensuring that the state tensors retain enough long-term memory of the sequence so that patterns are remembered. Several approaches have been used for this purpose. One popular method is the Long Short-Term Memory (LSTM) version, defined by the following equations, where the input sequence is x, the output is h and the state vector is the pair [c, h].

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where σ is the sigmoid function and the W's are the learned tensors. We are not going to use these equations explicitly because PyTorch has a built-in version that we will use. All we need to do is provide the dimension of the input (which is a sequence of scalar values, so that is 1) and the dimension of the state vector; the pair [c, h] is a tuple of tensors of size (arbitrarily chosen here to be) 100. We also have a linear layer that maps the final value of h down to the one-dimensional output. Our simple LSTM neural network is sketched below.

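This sketch follows the standard PyTorch pattern; the notebook's code may differ in detail:

```python
import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=100, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.linear = nn.Linear(hidden_size, output_size)
        # the state pair [c, h], reset before each sequence
        self.hidden_cell = (torch.zeros(1, 1, hidden_size),
                            torch.zeros(1, 1, hidden_size))

    def forward(self, seq):
        out, self.hidden_cell = self.lstm(seq.view(len(seq), 1, -1),
                                          self.hidden_cell)
        pred = self.linear(out.view(len(seq), -1))
        return pred[-1]   # the prediction for the next day
```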

It is important to understand the role of the input x as a sequence. We want to train this network so that if we set x to be a sequence of consecutive daily high temperatures [t1, t2, … tn], then the forward method will return [tn+1]. To accomplish this training task we build a training set by scanning through the weather record and creating tuples consisting of a sequence of length tw and the next-day temperature as a training label. This is accomplished with a function of the following form.

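A sketch of that function (the names here are illustrative):

```python
def create_inout_sequences(data, tw):
    # slide a window of length tw over the record; the next value is the label
    inout_seq = []
    for i in range(len(data) - tw):
        seq = data[i:i + tw]
        label = data[i + tw:i + tw + 1]
        inout_seq.append((seq, label))
    return inout_seq
```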

From this we create a list of such tuples called train_inout_seq, and the training loop takes the following form.

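A sketch of the training loop, assuming an MSE loss, the Adam optimizer, and the LSTM class sketched above:

```python
import torch
import torch.nn as nn

model = LSTM()
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(150):
    for seq, labels in train_inout_seq:
        optimizer.zero_grad()
        # reset the state pair [c, h] for each training sequence
        model.hidden_cell = (torch.zeros(1, 1, model.hidden_size),
                             torch.zeros(1, 1, model.hidden_size))
        y_pred = model(seq)
        loss = loss_function(y_pred, labels)
        loss.backward()
        optimizer.step()
```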

The complete details are in the notebook lstm-for-stream-final in the GitHub repository. The network was trained on the average year and on one of the six yearly records. The results are shown below. The first two are the years that were training cases. The original data is printed in blue and the predicted data is printed in orange. As you can see, virtually no blue shows. The network has memorized the training data almost perfectly, with an average daily error that is less than 0.23 degrees Fahrenheit.


Figure 3. The results for the average of all the data (Figure 1 above) and for one of the individual stations. The LSTM network has memorized the training data.

In Figure 4 below we show the results when the network is applied to the weather records for two of the other stations. In these cases, the results are not very impressive. In both cases the average error was over 3.5 degrees, and it was greater than 10 degrees for more than a dozen days. However, the predictions for one day ahead did track the general trends. It looks like it was able to predict today's temperature better than tomorrow's.


Figure 4. Predicting tomorrow's temperature on two other station records.

Doing Regression and Prediction with Gaussian Processes

Before we define Gaussian Processes let us point to Christopher Bishop’s amazing book “Pattern Recognition and Machine Learning” (Springer 2006) for a complete treatment of the subject.  We will only provide a superficial introduction here.   For on-line resources there is the excellent blog on the subject by Peter Roelants.   We will use many of the code bits from that blog in what follows.  Another fun on-line source for learning about Gaussian Processes is the blog A Visual Exploration of Gaussian Processes by Görtler, Kehlbeck and Deussen.

In the simplest terms, a Gaussian Process is a statistical distribution of functions with some special properties.   In our case, the functions will represent the time evolution of stochastic processes. For example, the temperature at some location as a function of time,  or the number of daily infections of a virus in a community, or the random walk of a particle suspended in a fluid.

The distribution that defines a Gaussian process is characterized by a mean function u(x) and a covariance kernel function k(x, x′). A function f(x) drawn from this distribution, which is written

$$f(x) \sim \mathcal{GP}\!\left(u(x),\ k(x, x')\right)$$

has the property that if we pick a finite set of time points $X = \{x_1, \ldots, x_n\}$, which we view as n random variables, the values y = f(X) are normally distributed in a multivariate Gaussian distribution with mean u(X) and a covariance matrix given by the kernel function k(X, X). Written another way,

$$y \sim \mathcal{N}\!\left(u(X),\ k(X, X)\right)$$

Given the covariance kernel function k() and mean function u(), we can use this multivariate distribution to visualize what functions drawn from the Gaussian process look like. Let us pick 300 points on the interval [0, 2], a specific kernel (which we will describe later), and a mean function with constant value 1. The numpy function below will allow us to draw 10 sample functions.

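A sketch of that function, here using the exponentiated quadratic kernel for the covariance (adapted from the pattern in Roelants' blog):

```python
import numpy as np
import scipy.spatial.distance

def exponentiated_quadratic(xa, xb):
    # k(xa, xb) = exp(-0.5 * ||xa - xb||^2)
    sq_norm = scipy.spatial.distance.cdist(xa, xb, 'sqeuclidean')
    return np.exp(-0.5 * sq_norm)

def draw_gp_samples(n_points=300, n_funcs=10):
    X = np.linspace(0, 2, n_points).reshape(-1, 1)
    mu = np.ones(n_points)                  # constant mean function 1
    Sigma = exponentiated_quadratic(X, X)   # covariance matrix k(X, X)
    return np.random.multivariate_normal(mu, Sigma, size=n_funcs)

samples = draw_gp_samples()   # 10 functions, each sampled at 300 points
```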

As shown in Figure 5 below, they appear to be like random walks, but they also appear to be not only continuous but also smooth curves. That is because nearby points on the x axis correspond to highly correlated random variables due to the choice of k(). If we had set Σ to be the identity matrix, the variables at neighboring points would be independent random variables and the path would look like noise. (We will use that fact below.)


Figure 5. Ten different functions drawn from the Gaussian process.

Now for the interesting part. What if we have some prior knowledge of the values of y for a sample of x points? We can then ask: what is the resulting distribution of the functions?

View the n points on the time axis as n random variables. Partition them into two sets X1 and X2, where we suppose we have values Y1 for the X1 variables. We can then ask for the posterior distribution p(Y2 | Y1, X1, X2). Reordering the variables so that X1 and X2 are contiguous, the equation takes the form

$$\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} u_1 \\ u_2 \end{bmatrix},\ \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)$$

where

$$u_1 = u(X_1),\quad u_2 = u(X_2),\quad \Sigma_{11} = k(X_1, X_1),\quad \Sigma_{12} = k(X_1, X_2),\quad \Sigma_{21} = \Sigma_{12}^T,\quad \Sigma_{22} = k(X_2, X_2)$$

One can prove that our conditional probability distribution p(Y2 | Y1, X1, X2) is also a multivariate normal distribution, described by the formulas

$$u_{2|1} = u_2 + \Sigma_{21}\, \Sigma_{11}^{-1} (Y_1 - u_1)$$
$$\Sigma_{2|1} = \Sigma_{22} - \Sigma_{21}\, \Sigma_{11}^{-1}\, \Sigma_{12}$$

The proof of this is non-trivial; see this post for details. The good news is that we can calculate this if we know the prior kernel function k() and mean function u(). Picking these functions is a bit of an art. The usual way is to assume the mean m(x) = 0, so that u2 and u1 in the above are 0. Picking the kernel is often done by forming it as a linear combination of well-known standard kernel functions and then formulating a hyper-parameter optimization problem to select the best combination.

To illustrate this, we return to the weather station data. We have two years of data from three nearby stations. We note two properties of the data we must exploit: it is noisy and it is approximately periodic with a period of 365 days. We will not bother with the optimization and will instead take a straightforward linear combination of three standard kernels.

$$k_{eq}(x_a, x_b) = \sigma^2 \exp\!\left(-\frac{\lVert x_a - x_b \rVert^2}{2\ell^2}\right), \quad k_{wn}(x_a, x_b) = \sigma^2\, \delta_{ab}, \quad k_p(x_a, x_b) = \sigma^2 \exp\!\left(-\frac{2\sin^2(\pi \lvert x_a - x_b \rvert / p)}{\ell^2}\right)$$

The first of these is the exponentiated quadratic, and it is a very good default kernel. The second is the white-noise kernel, where the parameter σ gives the standard deviation of the noise we see in the data, and the third is the periodic kernel, where, if we map our 365 days onto the unit interval, we can set p = 1. Our kernel of choice (chosen without optimization, but because it seems to work o.k.) is

$$k(x_a, x_b) = k_{eq}(x_a, x_b) + k_p(x_a, x_b) + k_{wn}(x_a, x_b)$$

where for the first two terms we have set σ to one, and we pick the σ for the noise term to best fit the data at hand. The figure below illustrates the result of using the average of the six instrument-years as the raw (prior) data. We then select 46 points in the first 230 days (spaced 5 days apart) as our X1 days.

In the figure the red dots are those points, and the red line is the conditional mean function u2|1. Three additional lines (blue, green and yellow) are sample functions drawn from the posterior. The pink zone is two standard deviations of the prediction. We also calculated the error in terms of the average difference between the mean prediction and the raw data. For this example, that average error was 3.17 degrees Fahrenheit. The mean function does a reasonable job of predicting the last 130 days of the year.


Figure 6. Graph of the raw data, the conditional mean u2|1 (red line), and three additional functions (blue, yellow and green) drawn from the posterior.

The full details of the computation are in the Jupyter notebook “Gaussian-processes-temps-periodic”, but the critical function is the one that computes p(Y2 | Y1, X1, X2), and it is shown below (it is adapted from Roelants' blog).

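A sketch of that function, following Roelants' formulation (the notebook version may differ slightly):

```python
import numpy as np
import scipy.linalg

def GP(X1, y1, X2, kernel_func, sigma_noise):
    # covariance of the noisy observations: k(X1, X1) + sigma_noise^2 * I
    Sigma11 = kernel_func(X1, X1) + (sigma_noise ** 2) * np.eye(X1.shape[0])
    Sigma12 = kernel_func(X1, X2)   # observations vs prediction points
    # solve Sigma11 @ Z = Sigma12, so that Z.T = Sigma21 @ inv(Sigma11)
    solved = scipy.linalg.solve(Sigma11, Sigma12, assume_a='pos').T
    mu2 = solved @ y1                         # posterior mean u2|1
    Sigma22 = kernel_func(X2, X2)
    Sigma2 = Sigma22 - (solved @ Sigma12)     # posterior covariance
    return mu2, Sigma2
```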

In this case we invoked it with kernel_function as keq + kp. Sigma_noise was 0.3. The clever part of this code is the use of a standard linear algebra solver to solve for Z in the equation

$$\Sigma_{11}\, Z = \Sigma_{12}$$

But because Sigma11 is symmetric and the transpose of Sigma12 is Sigma21 we have

$$Z^T = \Sigma_{12}^T\, \Sigma_{11}^{-1} = \Sigma_{21}\, \Sigma_{11}^{-1}$$

Once you have that the rest of the computation is accomplished with the matrix multiply (@) operator.

In the notebook Gaussian-process-temps-periodic in the GitHub repository you can see the Gaussian processes for the six yearly samples.

The Coronavirus Data

Another interesting source of data comes from the daily confirmed cases of coronavirus infections in various states. We shall see that the troubling recent growth rate is so large that it is very hard for our Gaussian process models to make predictions based on recent past samples. However, we thought it would be of value to illustrate how to obtain this data and work with it.

The Covid-19 data is in the Google cloud, uploaded from the New York Times. To access it you must have a Google cloud account, which is free for simple first-time use. We will run Google's BigQuery to extract the data, and we will run it through a client in a Jupyter notebook. You will need to install the BigQuery libraries. A good set of instructions is here. To use Jupyter go here. You will need to add the JSON package containing your service account key to your environment variables, as described here. Finally, install the local libraries with this command on your machine.

pip install --upgrade google-cloud-bigquery[pandas]

First load the BigQuery library and create the client in a Jupyter notebook as follows.

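The client setup takes just two lines (assuming the GOOGLE_APPLICATION_CREDENTIALS environment variable is set as described above):

```python
from google.cloud import bigquery

client = bigquery.Client()   # credentials come from the environment variable
```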

There are a number of Covid-19 data sets available on BigQuery. The one we will use is the New York Times collection. The following query will request the data for the state of Washington, load it into a Pandas dataframe, and print it.

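A sketch of the query; we believe the NYT table is bigquery-public-data.covid19_nyt.us_states, with column names as documented for that table:

```python
query = """
    SELECT date, confirmed_cases, deaths
    FROM `bigquery-public-data.covid19_nyt.us_states`
    WHERE state_name = 'Washington'
    ORDER BY date
"""
df = client.query(query).to_dataframe()   # run the query into a Pandas dataframe
print(df)
```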

In our notebook bigquery-Covid we have the code that extracts the number of new cases per day so that we can fit the Gaussian process to that. The data is stored in the array ar_wash. We attempted to make predictions with a sample every 9 days until the last 10 days. Because of the large range of the data we scale it down by a factor of 1000. The result is shown below. The function make_gaussian is the same one we used for the weather station data, except that the kernel is only the exponentiated quadratic.


As can be seen, the mean function (red line) captures features of the last 10 days reasonably well. Looking at New York we see similar results, but the fit for the last few days is not as good.


Where we fail most spectacularly is for those states that experienced a wave of new cases in the first week of July. Here is Florida.

[Plot: Gaussian process fit to the Florida daily case data.]

Changing the prediction window to the last 3 days does a bit better.  But 3 days is not much of a prediction window.

[Plot: Florida again, with the prediction window reduced to the last 3 days.]

However, it is clear that a much more complex process is going on in Florida than is captured by this simple Gaussian process model.   The approach presented here is not a proper infectious disease model such as those from Johns Hopkins, IHME, and other universities.  Those models are far more sophisticated and take into account many factors, including social and human behavior, living conditions, and intervention strategies.

Conclusion

As was pointed out in the introduction, this is a very superficial look at the problem of predicting the behavior of streaming data.  We looked at two approaches: one that focuses on accurate prediction of the next event using neural networks and one that attempts to capture long-range statistical behavior using Gaussian process models.  The neural net model was able to learn the temperature patterns in the training data very well, but on test data it was much less accurate, with an average error of about 3 or 4 degrees Fahrenheit per day.   (This is about as good as my local weather person.)   On the other hand, the Gaussian process made very good long-range (over 100 days) predictions with only a small number of sample points.  This was possible because the Gaussian process model works well for patterns that have reasonably predictable cycles, such as the weather.   However, the Gaussian process failed to capture changes in the more complex scenario of a viral infection, where the dynamics change because of social and human behavior or by evolutionary means.

If the reader is interested in extracting the data from Google’s BigQuery, we have included the details here and in the notebooks in the repository https://github.com/dbgannon/predicting_streams.

Science Applications of Generative Neural Networks

Machine learning is a common tool used in all areas of science. Applications range from simple regression models used to explain the behavior of experimental data to novel applications of deep learning. One area that has emerged in the last few years is the use of generative neural networks to produce synthetic samples of data that fit the statistical profile of real data collections. Generative models are among the most interesting deep neural networks and they abound with applications in science. The important property of all generative networks is that if you train them with a sufficiently large and coherent collection of data samples, the network can be used to generate similar samples. But when one looks at the AI literature on generative models, one can come away with the impression that they are, at best, amazing mimics that can conjure up pictures that look like the real world but are, in fact, pure fantasy. So why do we think they can be of value in science? There are several reasons one would want to use them. One reason is that the alternative way to understand nature may be based on a simulation that is extremely expensive to run. Simulations are based on the mathematical expression of a theory about the world. And theories are often loaded with parameters, some of which may have known values and others we can only guess at. Given these guesses, the simulation is the experiment: does the result look like our real-world observations? On the other hand, generative models have no explicit knowledge of the theory, but they do an excellent job of capturing the statistical distribution of the observed data. Mustafa Mustafa from LBNL states,

“We think that when it comes to practical applications of generative models, such as in the case of emulating scientific data, the criterion to evaluate generative models is to study their ability to reproduce the characteristic statistics which we can measure from the original dataset.” (from Mustafa, et. al arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018)

Generative models can be used to create “candidates” that we can use to test and fine-tune instruments designed to capture rare events. As we shall see, they have also been used to create ‘feasible’ structures that can inform us about possibilities that were not predicted by simulations. Generative models can also be trained to generate data associated with a class label, and they can be effective in eliminating noise. As we shall see, this can be a powerful tool in predicting outcomes when the input data is somewhat sparse, such as when medical records have missing values.

Flavors of Generative Models

There are two main types of GMs and, within each type, there are dozens of interesting variations. Generative Adversarial Networks (GANs) consist of two networks, a discriminator and a generator (the bottom part of Figure 1 below). Given a training set of data, the discriminator is trained to distinguish between the training set data and fake data produced by the generator. The generator is trained to fool the discriminator. This eventually winds up in a generator that can create data that perfectly matches the data distribution of the samples. The second family are autoencoders. Again, this involves two networks (top in the figure below). One is designed to encode the sample data into a low-dimensional space. The other is a decoder that takes the encoded representation and attempts to recreate the original. A variational autoencoder (VAE) is one that forces the encoded representations to fit a distribution that looks like the unit Gaussian. In this way, samples from this compact distribution can be fed to the decoder to generate new samples.

Figure 1.  Autoencoder (top) and Generative Adversarial Network (bottom).

Most examples of generative networks that are commonly cited involve the analysis of 2-D images based on two opposing convolutional or similar networks.  But this need not be the case. (See “Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space” by Anh Nguyen, et al., arXiv:1612.00005v2 [cs.CV] 12 Apr 2017.)

One fascinating science example we will discuss in greater detail later is by Shahar Harel and Kira Radinsky.  Shown below (Figure 2), it is a hybrid of a variational autoencoder with a convolutional encoder and a recurrent neural network decoder for generating candidate chemical compounds.


Figure 2.  The hybrid architecture from Shahar Harel and Kira Radinsky, “Prototype-Based Compound Discovery using Deep Generative Models” (http://kiraradinsky.com/files/acs-accelerating-prototype.pdf).

Physics and Astronomy

Let’s start with some examples from physics and astronomy.

In statistical mechanics, Ising models provide a theoretical tool to study phase transitions in materials. The usual approach to study the behavior of this model at various temperatures is via Monte Carlo simulation. Zhaocheng Liu, Sean P. Rodrigues and Wenshan Cai from Georgia Tech take a generative approach in their paper “Simulating the Ising Model with a Deep Convolutional Generative Adversarial Network” (arXiv:1710.04987v1 [cond-mat.dis-nn] 13 Oct 2017). The Ising states they generate from their network faithfully replicate the statistical properties of those generated by simulation but are also entirely new configurations not derived from previous data.

Astronomy is a topic that lends itself well to applications of generative models. Jeffrey Regier et al. in “Celeste: Variational inference for a generative model of astronomical images” describe a detailed multi-level probabilistic model that considers both the different properties of stars and galaxies at the level of photons recorded at each pixel of the image. The purpose of the model is to infer the properties of the imaged celestial bodies. The approach is based on a variational computation similar to the VAEs described below, but far more complex in terms of the number of different modeled processes. In “Approximate Inference for Constructing Astronomical Catalogs from Images” (arXiv:1803.00113v1 [stat.AP] 28 Feb 2018), Regier and collaborators take on the problem of building catalogs of objects in thousands of images. For each imaged object there are 9 different classes of random variables that must be inferred. The goal is to compute the posterior distribution of these unobserved random variables conditional on a collection of astronomical images. They formulated a variational inference (VI) model and compared that to a Markov chain Monte Carlo (MCMC) method. MCMC proved to be slightly more accurate in several metrics but VI was very close. On the other hand, the variational method was 1000 times faster. It is also interesting to note that the computations were done on Cori, the DOE supercomputer, and the code was written in Julia.

Cosmological simulation is used to test our models of the universe. In “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018) Mustafa Mustafa, et al. demonstrate how a slightly modified standard GAN can be used to generate synthetic images of weak-lensing convergence maps derived from N-body cosmological simulations. The results, shown in Figure 3 below, illustrate how the generated images match the validation tests. But, what is more important, the resulting images also pass a variety of statistical tests ranging from tests of the distribution of intensities to power spectrum analysis. They have made the code and data available at http://github.com/MustafaMustafa/cosmoGAN . The discussion section at the end of the paper speculates about the possibility of producing generative models that also incorporate choices for the cosmological variables that are used in the simulations.


Figure 3.  From Mustafa Mustafa, et al., “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018).

Health Care

Medicine and health care are being transformed by digital technology. Imaging is the most obvious place where we see advanced technology.  Our understanding of the function of proteins and RNA has exploded with high-throughput sequence analysis. Generative methods are being used here as well. Riesselman, Ingraham and Marks in “Deep generative models of genetic variation capture mutation effects” (https://www.biorxiv.org/content/biorxiv/early/2017/12/18/235655.1.full.pdf) consider the problem of how mutations to a protein disrupt its function. They developed a version of a variational autoencoder they call DeepSequence that is capable of predicting the likely effect of mutations as they evolve.

Another area of health care that is undergoing rapid change is health records. While clearly less glamorous than RNA and protein analysis, it is a part of medicine that has an impact on every patient. Our medical records are being digitized at a rapid rate and, once in digital form, they can be analyzed by many machine learning tools. Hwang, Choi and Yoon in “Adversarial Training for Disease Prediction from Electronic Health Records with Missing Data” (arXiv:1711.04126v4 [cs.LG] 22 May 2018) address two important problems. First, medical records are often incomplete. They have missing values because certain test results were not correctly recorded. The process of translating old paper forms to digital artifacts can introduce additional errors. Traditional methods of dealing with this are to introduce “zero” values or “averages” to fill the gaps prior to analysis, but this is not satisfactory. Autoencoders have been shown to be very good at removing noise from data (see https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543). Hwang and his colleagues applied this to medical records. The second thing they have done is to use a GAN to predict the disease from the “corrected” record. The type of GAN they use is an “AC-GAN” (see https://arxiv.org/pdf/1610.09585.pdf) which incorporates a class label with each training item. This allows a class label, along with the random latent variable, as input to force the generator to create an output similar to training elements of that class. A byproduct is a discriminator that can tell if an input has the correct class label. In their case they are interested in whether a given medical record predicts the occurrence of a tumor or not. Of course, this is far from usable as a sole diagnostic in a clinical setting, but it is a very interesting technology.

Drug Design

One exciting application of these techniques is in the design of drugs. The traditional approach is high-throughput screening, in which large collections of chemicals are tested against potential targets to see if any have potential therapeutic effects. Machine learning techniques have been applied to the problem for many years, but recently various deep learning methods have shown surprisingly promising results. One of the inspirations for the recent work has been the recognition that molecular structures have properties similar to natural language (see Cadeddu, A., et al. Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angewandte Chemie 2014, 126). More specifically, there are phrases and grammar rules in chemical compounds that have statistical properties not unlike natural language. There is a standard string representation called SMILES that can be used to illustrate these properties. SMILES representations describe atoms and their bonds and valences based on a depth-first tree traversal of a chemical graph; aspirin, for example, is written CC(=O)OC1=CC=CC=C1C(=O)O. In modern machine learning, language structure and language tasks such as machine translation are aided by recurrent neural networks. As we illustrated in our book, an RNN trained with lots of business news text is capable of generating realistic-sounding business news headlines from a single starting word, although close inspection reveals that the content is nonsense. However, there is no reason we cannot apply RNNs to SMILES strings to see if they can generate new molecules. Fortunately, there are sanity tests that can be applied to generated SMILES strings to filter out the meaningless and incorrectly structured compounds (see the sketch after this paragraph). This was done by a team at Novartis (Ertl et al. Generation of novel chemical matter using the LSTM neural network, arXiv:1712.07449), who demonstrated that these techniques could generate billions of new drug-like molecules. Anvita Gupta, Alex T. Muller, Berend J. H. Huisman, Jens A. Fuchs, Petra Schneider and Gisbert Schneider applied very similar ideas in “Generative Recurrent Networks for De Novo Drug Design” (https://www.researchgate.net/publication/320813292/downloader). They demonstrated that if they started with fragments of a drug of interest, they could use the RNN and transfer learning to generate new variations that may be very important. Another similar result is from Artur Kadurin, et al. in “druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico.” https://pubs.acs.org/doi/10.1021/acs.molpharmaceut.7b00346
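As an illustration of such a sanity test, a cheminformatics toolkit like RDKit (our choice here, not necessarily what the cited groups used) will simply fail to parse a malformed SMILES string:

from rdkit import Chem  # RDKit assumed installed; an illustrative choice of toolkit

generated = ["CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin: parses fine
             "C1CC(",                     # malformed: MolFromSmiles returns None
             "c1ccccc1O"]                 # phenol
valid = [s for s in generated if Chem.MolFromSmiles(s) is not None]
print(valid)   # the malformed string is filtered out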

Shahar Harel and Kira Radinsky take a different approach in “Prototype-Based Compound Discovery using Deep Generative Models” (http://kiraradinsky.com/files/acs-accelerating-prototype.pdf). Their model is motivated by a standard drug discovery process that involves starting with a molecule, called a prototype, with certain known useful properties and making modifications to it based on scientific experience and intuition. Harel and Radinsky designed a very interesting variational autoencoder, shown in Figure 2 above. As with several of the others, they start with a SMILES representation of the prototype. The first step is to generate an embedding space for the SMILES “language”. The characters in the prototype sequence are embedded and fed to a layer of convolutions that allow local structures to emerge as shorter vectors that are concatenated, and a final all-to-all layer is used to generate a sequence of mean and variance vectors for the prototype. This is fed to a “diversity layer” which adds randomness.

The decoder is an LSTM-based recurrent network which generates the new molecule. The results they report are impressive. In one series of experiments they took as prototypes compounds from drugs that were discovered years ago, and they were able to generate more modern variations that are known to be more powerful and effective. No known drugs were used in the training.

Summary

These are only a small sample of the research on the uses of generative neural networks in science.   We must now return to the question posed in the introduction: when are these applications of neural networks advancing science?    We should first ask the question: what is the role of ‘computational science’?  It was argued in the 1990s that computing and massive computational simulation had become the third paradigm of science because it was the only way to test theories for which it was impossible to design physical experiments.   Simulations of the evolution of the universe are a great example.    These simulations allowed us to test theories because they were based on theoretical models.  If the simulated universe did not look much like our own, perhaps the theory is wrong.   By 2007 Data Science was promoted as the fourth paradigm.   By mining the vast amounts of data we generate and collect, we can certainly validate or disprove scientific claims.    But when can a network generating synthetic images qualify as science?  It is not driven by theoretical models.   Generative models can create statistical simulations that are remarkable duplicates of the statistical properties of natural systems.   In doing so they provide a space to explore that can stimulate discovery.   There are three classes of reasons why this can be important.

  • The value of ‘life-like’ samples. In “3D convolutional GAN for fast Simulation”, F. Carminati, G. Khattak, and S. Vallecorsa make the argument that designing and testing the next generation of sensors requires test data that is too expensive to compute with simulation.  But a well-tuned GAN is able to generate test cases that fit the right statistical model at the rate needed for deployment.
  • Medical records-based diagnosis. The work on medical records described above by Hwang shows that using a VAE to “remove noise” is statistically superior to leaving gaps blank or filling in averages.   Furthermore, their ability to predict disease is extremely promising as science.
  • Inspiring drug discovery. The work of Harel and Radinsky shows us that a VAE can expand the scope of potential drugs for further study.   This is an advance in engineering, if not science.

Can generative models replace simulation for validating models derived from theory?  Not yet.   But perhaps theory can evolve so that it can be tested in new ways.

Part 2. Generative Models Tutorial

Generative Models are among the most interesting deep neural networks and they abound with applications in science. There are two main types of GMs and, within each type, several interesting variations. The important property of all generative networks is that if you train them with a sufficiently large and coherent collection of data samples, the network can be used to generate similar samples. The key here is the definition of ‘coherent’. One can say the collection is coherent if, when you are presented with a new example, it is a simple task to decide if it belongs to the collection or not. For example, if the data collection consists entirely of pictures of cats, then a picture of a dog should be, with reasonably high probability, easily recognized as an outlier and not a cat. Of course, there are always rather extreme cats that would fool most casual observers, which is why we must describe our collection of objects in terms of probability distributions. Let us assume our collection c is naturally embedded in R^m for some m; for example, images with m pixels or other high-dimensional instrument data. A simple way to think about a generative model is as a mathematical device that transforms samples from a multivariate normal distribution N(0, I) into R^m so that they look like they come from the distribution p_c of our collection c. Think of it as a function

Gen : R^n → R^m,   with Gen(z) distributed (approximately) as p_c when z ~ N(0, I)

Another useful way to say this is to build another machine we can call a discriminator

D : R^m → [0, 1]

such that for X in R^m, D(X) is the probability that X is in the collection c. To make this more “discriminating”, let us also insist that D(Gen(z)) be near zero.  In other words, the discriminator is designed to discriminate between the real c objects and the generated ones. Of course, if the generator is really doing a good job of imitating p_c, then such a discriminator would be very hard to build.  In that case we would expect D(Gen(z)) ≈ 1/2.

Generative Adversarial networks

were introduced by Goodfellow et al. (arXiv:1406.2661) as a way to build neural networks that can generate very good examples that match the properties of a collection of objects.  It works by designing two networks: one for the generator and one for the discriminator. Define p_z to be the distribution of latent variables that the generator will map to the collection space. The idea behind the paper is to simultaneously design the discriminator and the generator as a two-player min-max game.

The discriminator is being trained to recognize objects from c (thereby driving D(x) toward 1 for x drawn from the collection) and to push D(Gen(z)) to zero for z drawn from p_z.   The resulting function

V(D, Gen) = E_{x ~ p_data}[ log D(x) ] + E_{z ~ p_z}[ log(1 - D(Gen(z))) ]

represents the min-max objective for the discriminator, which is trained to maximize it.
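
It is worth making one step explicit (it is derived in the Goodfellow paper): for a fixed generator whose samples have density p_gen, the discriminator that maximizes V has a closed form,

D*(x) = p_data(x) / ( p_data(x) + p_gen(x) )

so when the generator succeeds and p_gen = p_data, the best possible discriminator outputs 1/2 everywhere, which is exactly the equilibrium value D(Gen(z)) ≈ 1/2 mentioned earlier.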

On the other hand, the generator wants to push D(Gen(z)) to 1, thereby maximizing E_{z ~ p_z}[ log D(Gen(z)) ].   To do that we minimize

E_{z ~ p_z}[ log(1 - D(Gen(z))) ].

There are literally dozens of implementations of GANs in TensorFlow or Keras online.   Below is an example from one that works with 40×40 color images.   This fragment shows the step of setting up the training optimization.

import tensorflow as tf   # TensorFlow 1.x API; generator() and discriminator() are defined earlier in the notebook

#These two placeholders are used for input into the generator and discriminator, respectively.
z_in = tf.placeholder(shape=[None,128],dtype=tf.float32) #Random vector
real_in = tf.placeholder(shape=[None,40,40,3],dtype=tf.float32) #Real images
Gz = generator(z_in) #Generates images from random z vectors
Dx = discriminator(real_in) #Produces probabilities for real images
Dg = discriminator(Gz,reuse=True) #Produces probabilities for generator images
#These functions together define the optimization objective of the GAN.
d_loss = -tf.reduce_mean(tf.log(Dx) + tf.log(1.-Dg)) #This optimizes the discriminator.
g_loss = -tf.reduce_mean(tf.log(Dg)) #This optimizes the generator.
#The generator's variables were created first, so they are tvars[0:9]; the rest belong to the discriminator.
tvars = tf.trainable_variables()
#The below code is responsible for applying gradient descent to update the GAN.
trainerD = tf.train.AdamOptimizer(learning_rate=0.0002,beta1=0.5)
trainerG = tf.train.AdamOptimizer(learning_rate=0.0002,beta1=0.5)

#Only update the weights for the discriminator network.
d_grads = trainerD.compute_gradients(d_loss,tvars[9:])
#Only update the weights for the generator network.
g_grads = trainerG.compute_gradients(g_loss,tvars[0:9])
update_D = trainerD.apply_gradients(d_grads)
update_G = trainerG.apply_gradients(g_grads)

We tested this with a very small collection of images of galaxies found on the web.  There are three types: elliptical, spiral and barred spiral.  Figure 4 below shows some high-resolution samples from the collection.

(Note: the examples in this section use pictures of galaxies but, in terms of the discussion in the previous part of this article, these are illustrations only.  There are no scientific results, just algorithm demonstrations.)


Figure 4.  Sample high-resolution galaxy images

We reduced the images to 40 by 40 and trained the GAN on this very small collection.  Drawing samples at random from the latent z-space we can now generate synthetic images.  The images we used here are only 40 by 40 pixels, so the results are not very spectacular.  As shown below, the generator is clearly able to generate elliptical and spiral forms.  In the next section we work with images that are 1024 by 1024 and get much more impressive results.


Figure 5.   Synthetic Galaxies produced by the GAN from 40×40 images.

Variational Autoencoders

The second general category of generative models is based on variational autoencoders. An autoencoder transforms our collection of object representations into a space of much smaller dimension in such a way that the representation can be used to recreate the original object with reasonably high fidelity. The system has an encoder network that creates the embedding in the smaller space and a decoder which uses that representation to regenerate an image, as shown below in Figure 6.


Figure 6. Generic Autoencoder

In other words, we want Dec(Enc(x_i)) to approximate x_i for each i in an enumeration of our collection of objects.  To train our networks we simply want to minimize the distance between x_i and Dec(Enc(x_i)) for each i.   If we further set up the network inputs and outputs so that they are in the range [0, 1], we can model this as a Bernoulli distribution, in which case cross entropy is a better function to minimize.  The cross entropy can be calculated as

− Σ_j [ x_j log(x̂_j) + (1 − x_j) log(1 − x̂_j) ],   where x̂ = Dec(Enc(x))

(see http://www.godeep.ml/cross-entropy-likelihood-relation/  for a derivation)

A variational autoencoder differs from a general one in that we want the encoder to create an embedding that is very close to a normal distribution in the embedding space.  The way we do this is to make the encoder force the encoding into a representation consisting of a mean and a standard deviation.  To force it into a reasonably compact space we will force our encoder to be as close to N(0, 1) as possible. To do that we need a way to measure how far a distribution p is from a Gaussian q. That is given by the Kullback-Leibler divergence, which measures how many extra bits (or ‘nats’) are needed to convert an optimal code for distribution q into an optimal code for distribution p.

D_KL(p ‖ q) = Σ_x p(x) log( p(x) / q(x) )

If both p and q are Gaussian this is easy to calculate (though not as easy to derive).

In terms of probability distributions, we can think of our encoder as q(z | x), where x is a training image. We are going to assume q(z | x) is normally distributed and parameterized by μ(x) and σ(x).  Computing D_KL( q(z | x) ‖ N(0, 1) ) is now easy. We call this the latent loss and it is

latent loss = D_KL( N(μ(x), σ(x)) ‖ N(0, 1) ) = ½ Σ_k ( μ_k² + σ_k² − log(σ_k²) − 1 )

(see https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians for a derivation).

We now construct our encoder to produce μ(x) and σ(x).  To sample from this latent space, we simply draw ε from N(0, I) and transform it into the right space: z = μ(x) + σ(x)·ε.   Our encoder and decoder networks can now be linked as follows.

[Diagram: the encoder produces μ(x) and σ(x); a sample z = μ(x) + σ(x)·ε, with ε drawn from N(0, I), is passed to the decoder.]

The loss function is now the sum of two terms:

loss = cross entropy (reconstruction) loss + latent (KL) loss

Note: there is a Bayesian approach to deriving this.  See https://jaan.io/what-is-variational-autoencoder-vae-tutorial for an excellent discussion.
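
To make the two loss terms concrete, here is a minimal sketch of a VAE in the same TensorFlow 1.x style as the GAN code above.  The layer sizes, input dimension, and variable names are illustrative only, not the actual notebook code.

import tensorflow as tf  # TensorFlow 1.x, as in the GAN example

input_dim, latent_dim = 1600, 32          # illustrative sizes only
x = tf.placeholder(tf.float32, shape=[None, input_dim])   # flattened images scaled to [0,1]

# A toy one-layer encoder producing the mean and log-variance of q(z | x).
h = tf.layers.dense(x, 256, activation=tf.nn.relu)
z_mean = tf.layers.dense(h, latent_dim)                   # mu(x)
z_log_var = tf.layers.dense(h, latent_dim)                # log sigma^2(x)

# Sample the latent space: draw eps from N(0, I), then shift and scale.
eps = tf.random_normal(tf.shape(z_mean))
z = z_mean + tf.exp(0.5 * z_log_var) * eps

# A toy one-layer decoder.
h2 = tf.layers.dense(z, 256, activation=tf.nn.relu)
x_reconstr = tf.layers.dense(h2, input_dim, activation=tf.nn.sigmoid)

# Reconstruction term: the cross entropy discussed above.
reconstr_loss = -tf.reduce_sum(
    x * tf.log(1e-10 + x_reconstr) + (1. - x) * tf.log(1e-10 + 1. - x_reconstr), axis=1)
# Latent term: KL( N(mu, sigma) || N(0, 1) ), the formula given earlier.
latent_loss = -0.5 * tf.reduce_sum(
    1. + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1)
loss = tf.reduce_mean(reconstr_loss + latent_loss)
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)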

One of the interesting properties of VAEs is that they do not require massive data sets to converge.   Using our small galaxy photo collection we trained a simple VAE.  The results, showing the test inputs and the reconstructed images, are illustrated below.


Figure 7.   Test inputs and reconstructions from the galaxy image collection.   These images are 1024×1024.

Using encodings of five of the images we created a path through the latent space to make the gif movie that is shown below.  While not every intermediate “galaxy” looks as good as some of the originals, it does present many reasonable “synthetic” galaxies that are on the path between two real ones.


Figure 8.  The “movie”.  You need to stare at it for a minute to see it evolve through synthetic and real galaxies.

The notebook for this autoencoder is available as HTML (https://s3.us-east-2.amazonaws.com/a-book/autoencode-galaxy.html) and as a Jupyter notebook (https://s3.us-east-2.amazonaws.com/a-book/autoencode-galaxy.ipynb).  The compressed tarball of the galaxy images is here: https://s3.us-east-2.amazonaws.com/a-book/galaxies.tar.gz.

acGANs

The generative networks described above are just the basic variety.    One very useful addition is the Auxiliary Classifier GAN (acGAN).    An acGAN allows you to incorporate knowledge about the class of the objects in your collection into the process.   For example, suppose you have labeled images, such that all pictures of dogs are labeled “dog” and all pictures of cats have the label “cat”.    The original paper on this subject, “Conditional Image Synthesis with Auxiliary Classifier GANs” by Odena, Olah and Shlens, shows how the generator can be modified so that it takes a class label in addition to the random latent variable and generates a new element similar to the training examples of that class. The training is augmented with an additional loss term that models the class of the training examples, as sketched below.
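
A rough sketch of how that extra loss term changes the wiring, relative to the GAN fragment earlier.  It assumes label-aware generator(z, labels) and discriminator(x, reuse=...) networks with the discriminator now returning a pair (real/fake probability, class logits); none of this is the paper’s actual code, just the shape of the objective.

import tensorflow as tf

z_in = tf.placeholder(tf.float32, shape=[None, 128])        # random latent vector
labels = tf.placeholder(tf.int32, shape=[None])             # class labels
real_in = tf.placeholder(tf.float32, shape=[None, 40, 40, 3])

src_real, cls_real = discriminator(real_in)                 # (real/fake prob, class logits)
fake = generator(z_in, labels)                              # label-conditioned generator
src_fake, cls_fake = discriminator(fake, reuse=True)

# Source loss: the same real-vs-fake objective as the plain GAN above.
d_source_loss = -tf.reduce_mean(tf.log(src_real) + tf.log(1. - src_fake))
g_source_loss = -tf.reduce_mean(tf.log(src_fake))

# Auxiliary class loss: both networks are rewarded for making the class recognizable.
class_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=cls_real) +
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=cls_fake))

d_loss = d_source_loss + class_loss
g_loss = g_source_loss + class_loss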

There are many more fascinating examples.   We will describe them in more detail in a later post.