# Vanishing gradient problem

## Machine learning model training problem / From Wikipedia, the free encyclopedia

#### Dear Wikiwand AI, let's keep it short by simply answering these key questions:

Can you list the top facts and stats about Vanishing gradient problem?

Summarize this article for a 10 year old

In machine learning, the **vanishing gradient problem** is encountered when training recurrent neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training each of the neural networks weights receives an update proportional to the partial derivative of the error function with respect to the current weight.^{[1]} The problem is that as the sequence length increases, the gradient magnitude typically is expected to decrease (or grow uncontrollably), slowing the training process.^{[1]} In the worst case, this may completely stop the neural network from further training.^{[1]} As one example of the problem cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range [-1,1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the early layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n while the early layers train very slowly.

Back-propagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem",^{[2]}^{[3]} which not only affects many-layered feedforward networks,^{[4]} but also recurrent networks.^{[5]} The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. (The combination of unfolding and backpropagation is termed backpropagation through time.)

When activation functions are used whose derivatives can take on larger values, one risks encountering the related **exploding gradient problem**.