Revisiting Autoencoders


We wrote about generative neural networks in two previous blog posts where we promised to return to the topic in a future update.  This is it.  This article is a review of some of more advances in autoencoders over the last 10 years.  We present examples of denoising autoencoders,  variational and three different adversarial neural networks.  The presentation is not theoretical, and it uses examples via Jupyter notebooks that can be run on a standard laptop.


Autoencoders are a class of deep neural networks that can learn efficient representations of large data collections.  The representation is  required to be robust enough to regenerate a good approximation of the original data.  This can be useful for removing nose from the data when the noise is not an intrinsic property of the underlying data.  These autoencoder often, but not always, work by projecting the data into a lower dimensional space (much like what a principal component analysis achieves). We will look at an example of a “denoising autoencoder” below where the underlying signal is identified through a series of convolutional filters. 

A difference class of autoencoders are called Generative Autoencoders that have the property that they can create a mathematical machine that can reproduce the probability distribution of the essential features of a data collection.  In other words, a trained generative autoencoder can create new instances of ‘fake’ data that has the same statistical properties as the training data.  Having a new data source, albeit and artificial one, can be very useful in many studies where it is difficult to find additional examples that fit a given profile. This ideas is being used in research in particle physics to improve the search for signals in data and in astronomy where generative methods can create galaxy models for dark energy experiments. 

A Denoising Autoencoder.

Detecting signals in noisy channels is a very old problem.  In the following paragraphs we consider the problem of identifying cosmic ray signals in radio astronomy.   This work is based on “Classification and Recovery of Radio Signals from Cosmic Ray Induced Air Showers with Deep Learning”, M. Erdmann, F. Schlüter and R. Šmída  1901.04079.pdf ( and Denoising-autoencoder/Denoising Autoencoder MNIST.ipynb at master · RAMIRO-GM/Denoising-autoencoder · GitHub.  We will illustrate a simple autoencoder that pulls the signals from the noise with reasonable accuracy.

For fun, we use three different data sets.   One is a simulation of cosmic-ray-induced air showers that are measured by radio antennas as described in the excellent book “Deep Learning for Physics Research” by Erdmann, Glombitza, Kasieczka and Klemradt.  The data consists of two components: one is a trace of the radio signal and the second is a trace of the simulated cosmic ray signal.   A sample is illustrated in Figure 1a below with the cosmic ray signal in red and the background noise in blue.  We will also use a second data set that consists of the same cosmic ray signals and uniformly distributed background noise as shown in Figure 1b. We will also look at a third data set consisting of the cosmic ray signals and a simple sine wave noise in the background Figure 1c.  As you might expect, and we will illustrate, this third data set is very easy to clean.

Figure 1a, original data                    Figure 1b, uniform data              Figure 1c, cosine data

To eliminate the nose and recover the signal we will train an autoencoder based on a classic autoencoder design with an encoder network that takes as input the full signal (signal + noise) and a decoder network that produces the cleaned version of the signal (Figure 2).  The projection space in which the encoder network sends in input is usually much smaller than the input domain, but that is not the case here.

Figure 2.   Denoising Autoencoder architecture.

In the network here the input signals are samples in [0, 1]500.  The projection of each input consists of 16 vectors of length 31 that are derived from a sequence of convolutions and 2×2 pooling steps. In other words, we have gone from a space of dimension 500 to 496.  Which is not a very big compression. The details of the network are shown in Figure 3.

The one dimensional convolutions act like frequency filters which seem to leave the high frequency cosmic ray signal more visible.  To see the result, we have in figure 4 the output of the network for three sample test examples for the original data.  As can be seen the first is a reasonable reconstruction of the signal, the second is in the right location but the reconstruction is weak, and the third is completely wrong. 

Figure 3.   The network details of the denoising autoencoder.

Figure 4.  Three examples from the test set for the denoiser using the original data

The two other synthetic data sets (uniformly distributed random nose and noise create from a cosine function) are much more “accurate”.

Figure 5. Samples from the random noise case (top) and the cosine noise case (bottom)

We can use the following naive method to assess accuracy. We simply track the location of the reconstructed signal.  If the maximum value of the recovered signal is within 2% of the maximum value of the true signal, we call that a “true reconstruction”. Based on this very informal metric we see that the accuracy  for the test set for the original data is 69%.    It is 95% for the uniform data case and 98% for the easy cosine data case. 

 Another way to see this result is to compare the response in the frequency domain, by applying an FFT to the input and output signals we see the results in Figure 6.

Figure 6a, original data                    Figure 6b, uniform data          Figure 6c, cosine data

The blue line is the frequency spectrum of the true signal, and the green line is the recovered signal. As can be seen, the general profiles are all reasonable with the cosine data being extremely accurate.

Generative Autoencoders.

Experimental data drives scientific discovery.   Data is used to test theoretical models.   It is also used to initialize large-scale scientific simulations.   However, we may not always have enough data at hand to do the job.   Autoencoders are increasingly  being used in science to generate data that fits the probabilistic density profile of known samples or real data.  For example, in a recent paper, Henkes and Wessels use generative adversarial networks (discussed below) to generate 3-D microstructures for studies of continuum micromechanics.   Another application is to use generative neural networks to generate test cases to help tune new instruments that are designed to capture rare events.  

In the experiments for the remainder of this post we use a very small collection of images of Galaxies so that all of this work can be done on a laptop.  Figure 7 below shows a sample.   The images are 128 by 128 pixels with three color channels.  

Figure7.  Samples from the Galaxy image collection

Variational Autoencoders

A variational autoencoder is a neural network that uses an encoder network to reduce data samples to a lower dimensional subspace.   A decoder network takes samples from this hypothetical latent space and uses it to regenerate the samples.  The decoder becomes a means to map a uniform distribution on this latent space into the probability distribution of our samples.  In our case, the encoder has a linear map from the 49152 element input down to a vector of length 500.   This is then mapped by a  500 by 50 linear transforms down to two vectors of length 50.  The decoder does a renormalization step to produce a single vector of length 50 representing our latent space vector.   This is expanded by two linear transformations are a relu map back to length 49152. After a sigmoid transformation we have the decoded image.  Figure 8 illustrates the basic architecture.

Figure 8.   The variational autoencoder.

Without going into the mathematics of how this works (for details see our previous post or the “Deep Learning for Physics Research” book or many on-line sources), the network is designed so that the encoder network generates mean (mu) and standard deviation (logvar) of the projection of the training samples in the small latent space such that the decoder network will recreate the input.  The training works by computing the loss as a combination of two terms, the mean squared error of the difference between the regenerated image and the input image and Kullback-Leibler divergence between the uniform distribution and the distribution generated by the encoder. 

The Pytorch code is available as a the notebook in github.  The same github directory contains the zipped datafile).  Figure 9 illustrates the results of the encode/decode on 8 samples from the data set.   

Figure 9.   Samples of galaxy images from the training set and their reconstructions from the VAR.

An interesting experiment is to see how the robust the decoder is to changes in the selection of the latent variable input.   Figure 10 illustrates the response of the decoder when we follow a path in the latent space from one instance from the training set to another very similar image.

Figure 10.  image 0 and image 7 are samples from the training set. Images 1 through 6 are generated from points along the path between the latent variable for 0 and for 7.

Another interesting application was recently published.  In the paper “Detection of Anomalous Grapevine Berries Using Variational Autoencoders” Miranda  show how a VAR can be used to examine arial photos of vineyards to spot areas of possible diseased grapes.  

Generative Adversarial Networks (GANs)

Generative Adversarial networks were introduced by Goodfellow et, al (arXiv:1406.2661) as a way to build neural networks that can generate very good examples that match the properties of a collection of objects.

As mentioned above, artificial examples generated by autoencoder can be used as starting points for solving complex simulations.  In the case of astronomy, cosmological simulation is used to test our models of the universe. In “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018) Mustafa Mustafa, et. al. demonstrates how a slightly-modified standard GAN can be used generate synthetic images of weak lensing convergence maps derived from N-body cosmological simulations.  In the remainder of this tutorial, we look at GANs.

Given a collection r of objects in Rm, a simple way to think about a generative model is as a mathematical device that transforms samples from a multivariant normal distribution Nk(0,1)  into Rm so that they look like they come from the actual distribution Pr. for our collection r.  Think of it as a function

Which maps the normal distribution into a distribution Pg  over Rm.  

In addition, assume we have a discriminator function

With the property that D(x) is the probability that x is in our collection r.   Our goal is to train G so that Pg matches Pr.   Our discriminator is trained to reject images generated by the generator while recognizing all the elements of Pr.  The generator is trained to fool the discriminator, so we have a game defined by minimax objective:

We have put a simple basic GAN from our previous post.  Running it for many epochs can occasionally get some reasonable results as shown if Figure 11.  While this looks good, it is not. Notice that it generated examples of only 3 of our samples.  This repeating of an image for different latent vectors is an example of a phenomenon called modal collapse


Figure 11.  The lack of variety in the images is called model collapse.


There are several variations on GANs that avoid many of these problems.   One is called an acGAN for auxiliary classifier Gan developed by Augustus Odena, Christopher Olah and Jonathon Shlens.  For an acGan we assume that we have a class label for each image.   In our case the data has three categories: barred spiral  (class 0), elliptical (class 1) and spiral (class 2).   The discriminator is modified so that it not only returns the probability that the image is real, it also returns a guess at the class.  The generator takes an extra parameter to encourage it to generate an image of the class.   Let d-1 = the number of classes  then we have the functions

The discriminator is now trained to minimize the error in recognition but also the error in class recognition.  The best way to understand the details is to look at the code.   For this and the following examples we have notebooks that are slight modifications to the excellent work from the public Github site of Chen Kai Xu from Chen Kai Xu  Tsukuba University.  This notebook is here.  Figure 12 below shows the result of asking the generator to create galaxies of a given class. The G(z,0) generates good barred spirals, G(z,1) are excellent elliptical galaxies and G(z,2) are spirals.

Figure 12.  Results from asgan-new.  Generator G with random noise vector Z and class parameter for barred spiral = 0,  elliptical = 1 and spiral = 2. 

Wasserstein GAN with Gradient Penalty

The problem with modal collapse and convergence failure was, as stated above, commonly observed.  The Wasserstein GAN introduced by Martin Arjovsky , Soumith Chintala , and L´eon Bottou, directly addressed this problem.  Later  Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin and Aaron Courville introduced a modification called a Gradient penalty which further improved the stability of the GAN convergence.  To accomplish this they added an additional loss term to the total loss  for training the discriminator as follows:

Setting the parameter lambda to 10.0 this was seen as an effective value for reducing to occurrence of mode collapse.  The gradient penalty value is computed using samples from straight lines connecting samples from Pr and Pg.   Figure 13 shows the result of a 32 random data vectors for the generator and the variety of responses reflect the full spectrum of the dataset.

Figure 13.   Results from Wasserstein with gradient penalty.

The notebook for wgan-gp is here.

Wasserstein Divergence for GANs

Jiqing Wu , Zhiwu Huang, Janine Thoma , Dinesh Acharya , and Luc Van Gool  introduced a variation on WGAN-gp called WGAN-div that addresses several technical constraints of WGAN-gp having to do with Lipschitz continuity not discussed here (see the paper).   They propose a better loss function for training the discriminator:

By experimental analysis the determine the best choice for the k and p hyperparameters are 2 and 6.

Figure 14 below illustrates the results after 5000 epochs of training. The notebook is here.

Figure 14.   Results from WG-div experiments.

Once again, this seems to have eliminated the modal collapse problem.  


This blog was written to do a better job illustrating autoencoder neural networks than our original articles.   We illustrated a denoising autoencoder, a variational autoencoder and three generative adversarial networks.  Of course, this is not the end of the innovation that has taken place in this area.  A good example of recent work is the progress made on Masked Autoencoders.  Kaiming He et. al.   published Masked Autoencoders Are Scalable Vision Learners in December 2021.  The idea is very simple and closely related to the underlying concepts used in Transformers for natural language processing like BERT or even GPT3.  Masking simply removes patches of the data and trains the network to “fill in the blanks”.   The same idea has been now applied to audio signal reconstruction. VL-BEiT from MSR is a generative vision-language foundation model that incorporates images and text. These new techniques show promise to generate more semantic richness to the results than previous methods. 

Understanding MLOps: a Review of “Practical Deep Learning at Scale with MLFlow” by Yong Liu

Research related to deep learning and its applications is now a substantial part of recent computer science.  Much of this work involves building new, advanced models that outperform all others on well-regarded benchmarks.  This is an extremely exciting period of basic research.  However, for those data scientists and engineers involved in deploying deep learning models to solve real problems, there are concerns that go beyond benchmarking.  These  involve the reliability, maintainability, efficiency and explainability of the deployed services.    MLOps refers to the full spectrum of best practices and procedures from designing the training data to final deployment lifecycle.   MLOps is the AI version of DevOps: the modern software deployment model that combines software development (Dev) and IT operations (Ops).   There are now several highly integrated platforms that can guide the data scientist/engineer through the maze of challenges to deploying a successful ML solution to a business or scientific problem.

Interesting examples of MlOps tools include

  • Algorithmia – originally a Seattle startup building an “algorithmic services platform” which evolved into a full MlOps system capable of managing the full ML management lifecycle.  Algorithmia was acquired by DataRobot and is now widely used.
  • Metaflow is an open source MLOps toolkit originally developed by Netflix.  Metaflow uses a directed acyclic graph to encode the steps and manage code and data versioning.
  • Polyaxon is a Berlin based company that is a “Cloud Native Machine Learning Automation Platform.”

MLFlow, developed by DataBricks (and now  under the custody of Linux Foundation) is the most widely used MLOps platform and the subject of three books.  The one reviewed here is “Practical Deep Learning at Scale with MLFlow” by Dr. Yong Liu.   According to Dr. Liu, the deep learning life cycle consists of

  1. Data collection, cleaning, and annotation/labeling.
  2. Model development which is an iterative process that is conducted off-line.
  3. Model deployment and serving it in production.
  4. Model validation and online testing done in a production environment.
  5. Monitoring and feedback data collection during production.

MLFlow provides the tools to manage this lifecycle.  The book is divided into five sections that cover these items in depth.  The first section is where the reader is acquainted with the basic framework.   The book is designed to be hands-on with complete code examples for each chapter.  In fact about 50% of the book is leading the reader through this collection of excellent examples.   The up-side of this approach is that the reader becomes a user and gains expertise and confidence with the material.  Of course, the downside (as this author has learned from his own publications) is that software evolves very fast and printed versions go out of date quickly.  Fortunately, Dr. Liu has a GitHub repository for each chapter that can keep the examples up to date.   

In the first section of the book, we get an introduction to MLFlow.  The example is a simple sentiment analysis written in PyTorch.  More precisely it implements a transfer learning scenario that highlights the use of lightning flash which provides a high level set of tools that encapsulate standard operations like basic training, fine tuning and testing.  In chapter two, MLFlow is first introduced as a means to manage the experimental life cycle of model development.  This involves the basic steps of defining  and running the experiment.    We also see the MLFlow user portal.  In the first step the experiment is logged with the portal server which records the critical metadata from the experiment as it is run.  

This reader was able to do all of the preceding on a windows 11 laptop, but for the next steps I found another approach easier.   Databricks is the creator of MLFlow, so it is not surprising that MLFlow is fully supported on their platform.   The book makes it clear that the code development environment  of choice for MLOps is not my favorite Jupyter, but rather VSCode.  And for good reasons.   VSCode interoperates with MLFlow brilliantly when running your own copy of MLFlow.  If you use the Databricks portal the built-in notebook editor works and is part of the MLFlow environment.   While Databricks has a free trial account, many of the features describe below are not available unless you have an account on AWS or Azure or GCS and a premium Databrick account. 

One of the excellent features of MLFlow is its ability to track code versioning data and pipeline tracking.  As you run experiments you can modify the code in the notebook and run it again. MLFlow keeps track of the changes, and you can return to previous versions with a few mouse clicks  (see Figure 1).   

Figure 1.  MLFlow User Interface showing a history of experiments and results.

Pipelines in MLFlow are designed to allow you to capture the entire process from feature engineering through model selection and tuning (See Figure 2).  The pipelines consider are data wrangling, model build and deployment.  MLFlow supports the data centric deep learning model using Delta Lake to allow versioned and timestamped access to data.

Figure 2.  MLFlow pipeline stages illustration.  (From Databricks)

Chapter 5 describes the challenges of running at scale and Chapter 6 takes the reader through hyperparameter tuning at scale.  The author takes you through running locally with a local code base and then running remote code from Github and finally running the code from Github remotely on as Databricks cluster.

In Chapter  7, Dr Liu dives into the technical challenge of hyperparameter optimization (HPO).  He compares three sets of tools for doing HPO at scale but he settles on Ray Tune which work very well with MLFlow.  We have describe Ray Tune elsewhere in our blog, but the treatment in the book is much more advanced.

Chapter 8 turn to the very important problem of doing ML inference at scale.  Chapter 9 provides an excellent, detailed introduction to the many dimensions of explainability.  The primary tool discussed is based on SHapley Additive exPlanations (SHAP), but others are also discussed.    Chapter 10 explores the integration of SHAP tools with MLFlow.   Together these two chapters provide an excellent overview of the state of the art in deep learning explainability even if you don’t study the code details.    


Deep learning is central to the concepts embodied in the notion that much of our current software can be generated or replaced entirely by neural network driven solutions.  This is often called “Software 2.0’. While this may be an apocryphal idea, it is clear that there is a great need for tools that can help guide developers through the best practices and procedures for deploying deep learning solutions at scale.  Dr Yong Liu’s book “Practical Deep Learning at Scale with MLFlow” is the best guide available to navigating this new and complex landscape.  While much of it is focused on the MLFlow toolkit from Databricks, it is also a guide to concepts that motivate the MLFlow software.    This is especially true when he discusses the problem of building deep learning models, hyperparameter tuning and inference systems at scale. The concluding chapters on deep learning explainability together comprise  one of the best essays on this topic I have seen.  Dr. Liu is a world-class expert on MLOps and this book is an excellent contribution.