# Autoencoder

## Neural network that learns efficient data encoding in an unsupervised manner / From Wikipedia, the free encyclopedia

An **autoencoder** is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning).^{[1]}^{[2]} An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction.

Variants exist, aiming to force the learned representations to assume useful properties.^{[3]} Examples are regularized autoencoders (*Sparse*, *Denoising* and *Contractive*), which are effective in learning representations for subsequent classification tasks,^{[4]} and *Variational* autoencoders, with applications as generative models.^{[5]} Autoencoders are applied to many problems, including facial recognition,^{[6]} feature detection,^{[7]} anomaly detection and acquiring the meaning of words.^{[8]}^{[9]} Autoencoders are also generative models which can randomly generate new data that is similar to the input data (training data).^{[7]}

### Definition

An autoencoder is defined by the following components:

- Two sets: the space of decoded messages ${\mathcal {X}}$ and the space of encoded messages ${\mathcal {Z}}$. Almost always, both ${\mathcal {X}}$ and ${\mathcal {Z}}$ are Euclidean spaces, that is, ${\mathcal {X}}=\mathbb {R} ^{m},{\mathcal {Z}}=\mathbb {R} ^{n}$ for some $m,n$.
- Two parametrized families of functions: the encoder family $E_{\phi }:{\mathcal {X}}\rightarrow {\mathcal {Z}}$, parametrized by $\phi$, and the decoder family $D_{\theta }:{\mathcal {Z}}\rightarrow {\mathcal {X}}$, parametrized by $\theta$.

For any $x\in {\mathcal {X}}$, we usually write $z=E_{\phi }(x)$, and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any $z\in {\mathcal {Z}}$, we usually write $x'=D_{\theta }(z)$, and refer to it as the (decoded) message.

Usually, both the encoder and the decoder are defined as multilayer perceptrons. For example, a one-layer-MLP encoder $E_{\phi }$ is:

- $E_{\phi }(x)=\sigma (Wx+b)$

where $\sigma$ is an element-wise activation function such as a sigmoid function or a rectified linear unit, $W$ is a matrix called "weight", and $b$ is a vector called "bias".
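As a rough illustration of the one-layer encoder formula, the following is a minimal sketch in plain Python. The function names `sigmoid` and `encode` and the example values of `W`, `b` and `x` are assumptions chosen for illustration; they are not part of the definition.

```python
import math

def sigmoid(t):
    """Logistic sigmoid, one common choice for the activation sigma."""
    return 1.0 / (1.0 + math.exp(-t))

def encode(x, W, b, activation=sigmoid):
    """One-layer encoder E(x) = sigma(W x + b).

    x: list of m floats, W: list of n rows with m floats each, b: list of n floats.
    """
    return [
        activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]

# Hypothetical example values: m = 3 inputs, n = 2 code dimensions.
W = [[0.5, -1.0, 0.0],
     [1.0, 1.0, 1.0]]
b = [0.0, -3.0]
x = [2.0, 1.0, 0.0]
z = encode(x, W, b)  # the code (latent vector) for x
```

With these values both pre-activations are zero, so each entry of `z` is the sigmoid of zero, that is, 0.5.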

### Training an autoencoder

An autoencoder, by itself, is simply a tuple of two functions. To judge its *quality*, we need a *task*. A task is defined by a reference probability distribution $\mu _{ref}$ over ${\mathcal {X}}$, and a "reconstruction quality" function $d:{\mathcal {X}}\times {\mathcal {X}}\to [0,\infty ]$, such that $d(x,x')$ measures how much $x'$ differs from $x$.

With those, we can define the loss function for the autoencoder as

- $L(\theta ,\phi ):=\mathbb {E} _{x\sim \mu _{ref}}[d(x,D_{\theta }(E_{\phi }(x)))]$

The *optimal* autoencoder for the given task $(\mu _{ref},d)$ is then $\arg \min _{\theta ,\phi }L(\theta ,\phi )$. The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder".

In most situations, the reference distribution is just the empirical distribution given by a dataset $\{x_{1},...,x_{N}\}\subset {\mathcal {X}}$, so that

- $\mu _{ref}={\frac {1}{N}}\sum _{i=1}^{N}\delta _{x_{i}}$

where $\delta _{x_{i}}$ is the Dirac measure, and the quality function is just the L2 loss: $d(x,x')=\|x-x'\|_{2}^{2}$, where $\|\cdot \|_{2}$ is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization:

- $\min _{\theta ,\phi }L(\theta ,\phi ),\quad {\text{where }}L(\theta ,\phi )={\frac {1}{N}}\sum _{i=1}^{N}\|x_{i}-D_{\theta }(E_{\phi }(x_{i}))\|_{2}^{2}$
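The least-squares training problem can be sketched with plain gradient descent. The example below is a toy under stated assumptions, not a reference implementation: the encoder and decoder are linear with no bias or activation (`z = w . x`, `x' = v * z`), the code is a single number, the data lie on a line in the plane, and the learning rate and step count are arbitrary choices. All names are hypothetical.

```python
def reconstruct(x, w, v):
    """Linear autoencoder: scalar code z = w . x, reconstruction x' = v * z."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return [vi * z for vi in v], z

def loss_and_grads(data, w, v):
    """Mean squared reconstruction error and its gradients w.r.t. w and v."""
    n = len(data)
    loss = 0.0
    grad_w = [0.0] * len(w)
    grad_v = [0.0] * len(v)
    for x in data:
        x_rec, z = reconstruct(x, w, v)
        r = [xr - xi for xr, xi in zip(x_rec, x)]       # residual x' - x
        loss += sum(ri * ri for ri in r) / n
        v_dot_r = sum(vi * ri for vi, ri in zip(v, r))
        for j in range(len(w)):
            grad_v[j] += 2.0 * r[j] * z / n             # d/dv_j of ||v z - x||^2
            grad_w[j] += 2.0 * v_dot_r * x[j] / n       # d/dw_j via z = w . x
    return loss, grad_w, grad_v

def train(data, w, v, lr=0.05, steps=500):
    """Plain gradient descent on the reconstruction loss."""
    for _ in range(steps):
        _, gw, gv = loss_and_grads(data, w, v)
        w = [wi - lr * g for wi, g in zip(w, gw)]
        v = [vi - lr * g for vi, g in zip(v, gv)]
    return w, v

# Toy dataset: five points on the line spanned by (0.6, 0.8).
data = [[0.6 * t, 0.8 * t] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w0, v0 = [0.1, 0.1], [0.1, 0.1]
initial_loss, _, _ = loss_and_grads(data, w0, v0)
w, v = train(data, w0, v0)
final_loss, _, _ = loss_and_grads(data, w, v)
```

Because the toy data are exactly one-dimensional, a one-dimensional code suffices and the reconstruction loss can be driven close to zero.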

### Interpretation

An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function $d$.

The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space ${\mathcal {Z}}$ usually has fewer dimensions than the message space ${\mathcal {X}}$.

Such an autoencoder is called *undercomplete*. It can be interpreted as compressing the message, or reducing its dimensionality.^{[1]}^{[10]}

At the limit of an ideal undercomplete autoencoder, every possible code $z$ in the code space is used to encode a message $x$ that really appears in the distribution $\mu _{ref}$, and the decoder is also perfect: $D_{\theta }(E_{\phi }(x))=x$. This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code $z$ and obtaining $D_{\theta }(z)$, which is a message that really appears in the distribution $\mu _{ref}$.

If the code space ${\mathcal {Z}}$ has dimension larger than that of the message space ${\mathcal {X}}$ (*overcomplete*) or equal to it, or if the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features.^{[11]}

In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below.^{[3]}

The autoencoder was first proposed as a nonlinear generalization of principal components analysis (PCA) by Kramer.^{[1]} The autoencoder has also been called the autoassociator,^{[12]} or Diabolo network.^{[13]}^{[11]} Its first applications date to the early 1990s.^{[3]}^{[14]}^{[15]} Its most traditional application was dimensionality reduction or feature learning, but the concept became widely used for learning generative models of data.^{[16]}^{[17]} Some of the most powerful AIs in the 2010s involved autoencoders stacked inside deep neural networks.^{[18]}

### Regularized autoencoders

Various techniques exist to prevent autoencoders from learning the identity function and to improve their ability to capture important information and learn richer representations.

#### Sparse autoencoder (SAE)

Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders are variants of autoencoders, such that the codes $E_{\phi }(x)$ for messages tend to be *sparse codes*, that is, $E_{\phi }(x)$ is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time.^{[18]} Encouraging sparsity improves performance on classification tasks.^{[19]}

There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the **k-sparse autoencoder**.^{[20]}

The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder:

- $f_{k}(x_{1},...,x_{n})=(x_{1}b_{1},...,x_{n}b_{n})$

where $b_{i}=1$ if $|x_{i}|$ ranks in the top k, and 0 otherwise.

Backpropagating through $f_{k}$ is simple: set gradient to 0 for $b_{i}=0$ entries, and keep gradient for $b_{i}=1$ entries. This is essentially a generalized ReLU function.^{[20]}
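The k-sparse function and its backward pass can be sketched as follows. The function names and the tie-breaking rule (earlier index wins) are assumptions; the text does not specify how ties are handled.

```python
def k_sparse(x, k):
    """Keep the k entries of x with the largest |x_i| and zero out the rest.

    Returns (f_k(x), b), where b is the 0/1 mask.
    Ties are broken by index order, which is an assumption of this sketch.
    """
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    top = set(order[:k])
    b = [1 if i in top else 0 for i in range(len(x))]
    return [xi if bi else 0.0 for xi, bi in zip(x, b)], b

def k_sparse_backward(grad_out, b):
    """Pass the gradient through where b_i = 1 and block it where b_i = 0."""
    return [g if bi else 0.0 for g, bi in zip(grad_out, b)]

x = [0.2, -3.0, 1.5, 0.0, -0.7]
y, b = k_sparse(x, 2)
grad_in = k_sparse_backward([1.0] * len(x), b)
```

Here the two largest-magnitude entries, -3.0 and 1.5, survive, and only their positions receive gradient.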

The other way is a relaxed version of the k-sparse autoencoder. Instead of forcing sparsity, we add a **sparsity regularization loss**, then optimize for

- $\min _{\theta ,\phi }L(\theta ,\phi )+\lambda L_{sparse}(\theta ,\phi )$

where $\lambda >0$ measures how much sparsity we want to enforce.^{[21]}

Let the autoencoder architecture have $K$ layers. To define a sparsity regularization loss, we need a "desired" sparsity ${\hat {\rho }}_{k}$ for each layer, a weight $w_{k}$ for how much to enforce each sparsity, and a function $s:[0,1]\times [0,1]\to [0,\infty ]$ to measure how much two sparsities differ.

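For each input $x$, let the actual sparsity of activation in each layer $k$ be

- $\rho _{k}(x)={\frac {1}{n}}\sum _{i=1}^{n}a_{k,i}(x)$

where $n$ is the number of neurons in the $k$-th layer and $a_{k,i}(x)$ is the activation in the $i$-th neuron of the $k$-th layer upon input $x$. The sparsity loss upon input $x$ for one layer is $s({\hat {\rho }}_{k},\rho _{k}(x))$, and the sparsity regularization loss for the entire autoencoder is the expected weighted sum of sparsity losses:

- $L_{sparse}(\theta ,\phi )=\mathbb {E} _{x\sim \mu _{ref}}\left[\sum _{k=1}^{K}w_{k}s({\hat {\rho }}_{k},\rho _{k}(x))\right]$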

Typically, the function $s$ is the Kullback-Leibler (KL) divergence, as^{[19]}^{[21]}^{[22]}^{[23]}

- $s(\rho ,{\hat {\rho }})=KL(\rho ||{\hat {\rho }})=\rho \log {\frac {\rho }{\hat {\rho }}}+(1-\rho )\log {\frac {1-\rho }{1-{\hat {\rho }}}}$

or the L1 loss, as $s(\rho ,{\hat {\rho }})=|\rho -{\hat {\rho }}|$, or the L2 loss, as $s(\rho ,{\hat {\rho }})=|\rho -{\hat {\rho }}|^{2}$.
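The three choices of $s$ can be compared on a small example. The sketch below assumes activations in $[0,1]$ (for instance sigmoid outputs), evaluates the weighted sum for a single input rather than taking the expectation, and passes the desired sparsity as the first argument of $s$. All names and numbers are hypothetical.

```python
import math

def mean_activation(activations):
    """Actual sparsity of one layer: the average activation over its neurons."""
    return sum(activations) / len(activations)

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q); needs 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def l1(p, q):
    return abs(p - q)

def l2(p, q):
    return (p - q) ** 2

def sparsity_loss(layer_activations, targets, weights, s=kl_bernoulli):
    """Weighted sum over layers of s(target_k, rho_k(x)) for one input x."""
    return sum(
        w * s(t, mean_activation(a))
        for a, t, w in zip(layer_activations, targets, weights)
    )

# Two layers; the first already matches its target sparsity exactly.
layer_activations = [[0.25, 0.0, 0.25, 0.0], [0.5, 0.5]]
targets = [0.125, 0.25]
weights = [1.0, 2.0]

kl_loss = sparsity_loss(layer_activations, targets, weights)
l1_loss = sparsity_loss(layer_activations, targets, weights, s=l1)
l2_loss = sparsity_loss(layer_activations, targets, weights, s=l2)
```

Only the second layer contributes in this example, since the first layer's mean activation equals its target.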

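Alternatively, the sparsity regularization loss may be defined without reference to any "desired sparsity", and instead simply force as much sparsity as possible. In this case, one can define the sparsity regularization loss as

- $L_{sparse}(\theta ,\phi )=\mathbb {E} _{x\sim \mu _{ref}}\left[\sum _{k=1}^{K}w_{k}\|h_{k}\|\right]$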

where $h_{k}$ is the activation vector in the $k$-th layer of the autoencoder. The norm $\|\cdot \|$ is usually the L1 norm (giving the L1 sparse autoencoder) or the L2 norm (giving the L2 sparse autoencoder).

#### Denoising autoencoder (DAE)

Denoising autoencoders (DAE) try to achieve a *good* representation by changing the *reconstruction criterion*.^{[3]}^{[4]}

A DAE, originally called a "robust autoassociative network",^{[2]} is trained by intentionally corrupting the inputs of a standard autoencoder during training. A noise process is defined by a probability distribution $\mu _{T}$ over functions $T:{\mathcal {X}}\to {\mathcal {X}}$. That is, the function $T$ takes a message $x\in {\mathcal {X}}$, and corrupts it to a noisy version $T(x)$. The function $T$ is selected randomly, with a probability distribution $\mu _{T}$.

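Given a task $(\mu _{ref},d)$, the problem of training a DAE is the optimization problem:

- $\min _{\theta ,\phi }L(\theta ,\phi )=\mathbb {E} _{x\sim \mu _{ref},T\sim \mu _{T}}[d(x,(D_{\theta }\circ E_{\phi }\circ T)(x))]$

That is, the optimal DAE should take any noisy message and attempt to recover the original message without noise, thus the name "denoising".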

Usually, the noise process $T$ is applied only during training and testing, not during downstream use.

The use of DAE depends on two assumptions:

- There exist representations of the messages that are relatively stable and robust to the type of noise we are likely to encounter;
- The said representations capture structures in the input distribution that are useful for our purposes.^{[4]}

Example noise processes include:

- additive isotropic Gaussian noise,
- masking noise (a fraction of the input is randomly chosen and set to 0),
- salt-and-pepper noise (a fraction of the input is randomly chosen and randomly set to its minimum or maximum value).^{[4]}
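The three noise processes listed above can be sketched as simple corruption functions. This is an illustrative sketch only: the function names, the use of a seeded `random.Random` generator, and the default minimum and maximum values of 0 and 1 are assumptions.

```python
import random

def gaussian_noise(x, sigma, rng):
    """Additive isotropic Gaussian noise with standard deviation sigma."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def masking_noise(x, fraction, rng):
    """Set a randomly chosen fraction of the entries to 0."""
    k = round(fraction * len(x))
    chosen = set(rng.sample(range(len(x)), k))
    return [0.0 if i in chosen else xi for i, xi in enumerate(x)]

def salt_and_pepper_noise(x, fraction, rng, lo=0.0, hi=1.0):
    """Set a randomly chosen fraction of the entries to lo or hi at random."""
    k = round(fraction * len(x))
    chosen = set(rng.sample(range(len(x)), k))
    return [rng.choice((lo, hi)) if i in chosen else xi
            for i, xi in enumerate(x)]

rng = random.Random(0)   # seeded so that runs are repeatable
x = [0.5] * 10
noisy = gaussian_noise(x, 0.1, rng)
masked = masking_noise(x, 0.3, rng)
peppered = salt_and_pepper_noise(x, 0.4, rng)
```

During DAE training, the corrupted vector would be fed to the encoder while the reconstruction is compared against the clean `x`.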

#### Contractive autoencoder (CAE)

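A contractive autoencoder adds the contractive regularization loss to the standard autoencoder loss:

- $\min _{\theta ,\phi }L(\theta ,\phi )+\lambda L_{contractive}(\theta ,\phi )$

where $\lambda >0$ measures how much contractive-ness we want to enforce. The contractive regularization loss itself is defined as the expected squared Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input:

- $L_{contractive}(\theta ,\phi )=\mathbb {E} _{x\sim \mu _{ref}}\|\nabla _{x}E_{\phi }(x)\|_{F}^{2}$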

To understand what $L_{contractive}$ measures, note the fact

- $\|E_{\phi }(x+\delta x)-E_{\phi }(x)\|_{2}\leq \|\nabla _{x}E_{\phi }(x)\|_{F}\|\delta x\|_{2}$

for any message $x\in {\mathcal {X}}$, and small variation $\delta x$ in it. Thus, if $\|\nabla _{x}E_{\phi }(x)\|_{F}^{2}$ is small, it means that a small neighborhood of the message maps to a small neighborhood of its code. This is a desired property, as it means small variation in the message leads to small, perhaps even zero, variation in its code, like how two pictures may look the same even if they are not exactly the same.
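For a one-layer sigmoid encoder $E_{\phi }(x)=\sigma (Wx+b)$, the Jacobian has entries $z_{i}(1-z_{i})W_{ij}$, so the squared Frobenius norm can be computed in closed form. The sketch below assumes that specific encoder, evaluates the penalty at a single input rather than in expectation, and uses hypothetical names and values. It also checks numerically that a small perturbation of the input moves the code by no more than the Jacobian norm allows.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def encode(x, W, b):
    """One-layer sigmoid encoder E(x) = sigmoid(W x + b)."""
    return [sigmoid(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def jacobian_frobenius_sq(x, W, b):
    """Squared Frobenius norm of dE/dx for the sigmoid encoder.

    J_ij = z_i (1 - z_i) W_ij, so the sum of J_ij^2 factorizes per row.
    """
    z = encode(x, W, b)
    return sum((zi * (1.0 - zi)) ** 2 * sum(w * w for w in row)
               for zi, row in zip(z, W))

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

# Hypothetical example values.
W = [[0.5, -1.0, 0.0],
     [1.0, 1.0, 1.0]]
b = [0.0, -3.0]
x = [2.0, 1.0, 0.0]
penalty = jacobian_frobenius_sq(x, W, b)

# Small perturbation: compare the change in the code with the bound.
dx = [1e-4, -2e-4, 1e-4]
x_shifted = [xi + di for xi, di in zip(x, dx)]
lhs = norm([a - c for a, c in zip(encode(x_shifted, W, b), encode(x, W, b))])
rhs = math.sqrt(penalty) * norm(dx)
```

In training, `penalty` would be averaged over the dataset and added to the reconstruction loss with weight $\lambda$.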

The DAE and the CAE are closely related: in the limit of small Gaussian input noise, the denoising reconstruction error acts like a contractive penalty on the reconstruction function. DAEs make the reconstruction function resist small but finite-sized input perturbations, while CAEs make the extracted features resist infinitesimal input perturbations.

#### Minimal description length autoencoder

^{[24]}

### Concrete autoencoder

The concrete autoencoder is designed for discrete feature selection.^{[25]} A concrete autoencoder forces the latent space to consist only of a user-specified number of features. The concrete autoencoder uses a continuous relaxation of the categorical distribution to allow gradients to pass through the feature selector layer, which makes it possible to use standard backpropagation to learn an optimal subset of input features that minimizes reconstruction loss.
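One common continuous relaxation of the categorical distribution is the Concrete (Gumbel-softmax) distribution: Gumbel noise is added to the logits and the result is passed through a softmax with a temperature. The sketch below assumes that relaxation and shows a single selector node. The function names, the clamping constant and the example logits are hypothetical, and a real selector layer would learn the logits and lower the temperature during training.

```python
import math
import random

def sample_relaxed_one_hot(logits, temperature, rng):
    """Draw a relaxed one-hot vector: softmax((logits + Gumbel noise) / T)."""
    eps = 1e-12  # keeps the uniform sample strictly inside (0, 1)
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), eps), 1.0 - eps)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)                        # subtract the max for stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def select_feature(x, weights):
    """Soft selection: a weighted average of the input features."""
    return sum(w * xi for w, xi in zip(weights, x))

rng = random.Random(0)
logits = [0.0, 0.0, 50.0]   # strongly prefers the third feature
weights = sample_relaxed_one_hot(logits, temperature=0.1, rng=rng)
x = [1.0, 2.0, 3.0]
selected = select_feature(x, weights)
```

At a low temperature and with one dominant logit, the weights are nearly one-hot, so the node effectively copies one input feature while remaining differentiable in the logits.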

### Variational autoencoder (VAE)

Variational autoencoders (VAEs) belong to the family of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs have different goals and a completely different mathematical formulation. The latent space is in this case composed of a mixture of distributions instead of a fixed vector.

Given an input dataset $x$ characterized by an unknown probability function $P(x)$ and a multivariate latent encoding vector $z$, the objective is to model the data as a distribution $p_{\theta }(x)$, with $\theta$ defined as the set of the network parameters so that $p_{\theta }(x)=\int _{z}p_{\theta }(x,z)dz$.
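To make the marginalization $p_{\theta }(x)=\int _{z}p_{\theta }(x,z)dz$ concrete, the following sketch estimates it by Monte Carlo for a toy model that is an assumption of this example, not part of the VAE definition: a standard normal prior on a scalar $z$ and a likelihood $x\mid z\sim {\mathcal {N}}(z,1)$. For this toy model the marginal is known in closed form, ${\mathcal {N}}(0,2)$, which gives a reference value.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def marginal_estimate(x, num_samples, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by sampling z from the prior."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)            # prior: z ~ N(0, 1)
        total += normal_pdf(x, z, 1.0)     # likelihood: x | z ~ N(z, 1)
    return total / num_samples

rng = random.Random(0)
estimate = marginal_estimate(0.0, 20000, rng)
exact = normal_pdf(0.0, 0.0, 2.0)          # closed-form marginal N(0, 2)
```

Sampling from the prior becomes very inefficient when $z$ is high-dimensional, which is one motivation for the variational approach.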