# How does learning take place in deep learning?

Short answer: By gradient descent on the error function.

Longer answer: In the following I assume supervised learning (the “typical” case of deep learning), i.e. there are training data in the form of [math]n[/math] different pairs [math](x, y)[/math], where [math]x[/math] is an input vector and [math]y[/math] is the corresponding “correct” output vector.

Depending on the network structure selected, the “neural” network contains a set [math]W[/math] of variable weights.

Let’s say there are [math]m[/math] weights in total, i.e.

[math]W = \{w_1, \ldots, w_m\}[/math]

For a given input vector [math]x[/math], the network calculates an output

[math]o = f(W, x)[/math]

which may differ significantly from the correct output [mathy[/math, especially before training.

The deviations are summed up in an error function, e.g. the summed squared error (SSE):

[math]E(W) = \displaystyle\sum_{i=1}^{n} \left\| f(W, x_i) - y_i \right\|^2[/math]

The smaller [math]E(W)[/math], the more accurately the network maps the training data, and hopefully also new data (simplified). So the primary goal is to minimize the error [math]E(W)[/math].
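As a concrete illustration, here is a minimal sketch of the SSE computation for a hypothetical one-layer linear network [math]f(W, x) = Wx[/math]; the weights and data below are made up for the example:

```python
import numpy as np

# Hypothetical tiny network: a single linear layer f(W, x) = W @ x.
# The names (f, W, x, y) follow the text; the numbers are illustrative.
def f(W, x):
    return W @ x

W = np.array([[0.5, -0.2]])                        # 1 output, 2 inputs
xs = [np.array([1.0, 2.0]), np.array([0.0, 1.0])]  # training inputs x_i
ys = [np.array([1.0]), np.array([-0.5])]           # "correct" outputs y_i

# Summed squared error over all n training pairs:
# E(W) = sum_i || f(W, x_i) - y_i ||^2
E = sum(np.sum((f(W, x) - y) ** 2) for x, y in zip(xs, ys))
print(E)  # 0.9 for these example numbers
```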

Changing the weights [math]W[/math] (and that’s the learning!) also changes the error [math]E(W)[/math]. In order to reduce the error as quickly as possible, one moves a small step in the [math]m[/math]-dimensional space of the weights (we had assumed that the network has [math]m[/math] variable weights) in the direction in which the error decreases the most. The easiest way to imagine this is with two weights and the error as a third dimension (see figure above). Then you move like a good skier standing on a mountain who always heads straight down the steepest slope.

Mathematically, the direction of steepest error decrease is given by the negative gradient [math]-\nabla E[/math], i.e. the negated vector of the partial derivatives of the error function [math]E[/math] (which must be differentiable for this). The weights [math]W[/math] are then changed in each learning step as follows:

[math]\Delta W = -\eta \nabla E = -\eta \frac{\partial E}{\partial W}[/math]

Here, [math]\eta[/math] is the learning rate, which determines the step size and is constant or variable depending on the optimizer used.
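A single learning step for the same kind of hypothetical linear model can be sketched like this; the learning rate and the data are illustrative, and the gradient of the SSE is written out by hand:

```python
import numpy as np

# One gradient-descent step Delta W = -eta * dE/dW for the linear model
# f(W, x) = W @ x with SSE error. All numbers are illustrative.
def E(W, xs, ys):
    return sum(np.sum((W @ x - y) ** 2) for x, y in zip(xs, ys))

def grad_E(W, xs, ys):
    # dE/dW = sum_i 2 * (W @ x_i - y_i) x_i^T  (outer product per pair)
    g = np.zeros_like(W)
    for x, y in zip(xs, ys):
        g += 2.0 * np.outer(W @ x - y, x)
    return g

W = np.array([[0.5, -0.2]])
xs = [np.array([1.0, 2.0]), np.array([0.0, 1.0])]
ys = [np.array([1.0]), np.array([-0.5])]
eta = 0.1                                # learning rate

E_before = E(W, xs, ys)
W = W - eta * grad_E(W, xs, ys)          # the learning step
E_after = E(W, xs, ys)
print(E_before, E_after)                 # the error decreases
```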

The well-known backpropagation learning method is an efficient way of calculating the partial derivatives for all weights of a multi-layer network (using the chain rule of differential calculus). The specific formula for the error [math]E[/math], and thus also for the gradient, depends among other things on the activation function used (logistic, tanh, ReLU). By the way, common deep learning frameworks such as TensorFlow or PyTorch calculate the gradient automatically, so one can concentrate on designing a suitable network structure and the training data.
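To make the chain-rule idea concrete, here is a minimal hand-written backpropagation sketch for an assumed two-layer tanh network, checked against a finite-difference approximation; all shapes and values are illustrative:

```python
import numpy as np

# Tiny two-layer network o = W2 @ tanh(W1 @ x) with squared error,
# for one training pair. Sizes and data are made up for the example.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))
x = np.array([0.5, -1.0])
y = np.array([0.2])

# Forward pass
h_pre = W1 @ x                       # hidden pre-activation
h = np.tanh(h_pre)                   # hidden activation
o = W2 @ h                           # network output
err = o - y
E = np.sum(err ** 2)                 # squared error for this pair

# Backward pass (chain rule), as backpropagation computes it
dE_do = 2.0 * err
dE_dW2 = np.outer(dE_do, h)
dE_dh = W2.T @ dE_do
dE_dhpre = dE_dh * (1.0 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
dE_dW1 = np.outer(dE_dhpre, x)

# Sanity check one entry against a finite-difference approximation
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
E_p = np.sum((W2 @ np.tanh(W1p @ x) - y) ** 2)
approx = (E_p - E) / eps
print(abs(approx - dE_dW1[0, 0]))    # difference should be tiny
```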

In practice, one usually does not use all training data for a learning step, but only a small subset, which is selected differently over and over again. This is called “stochastic gradient descent” (SGD), and it often leads to good results faster than the full gradient descent described above.
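A sketch of stochastic gradient descent on synthetic linear data, where each step uses a random mini-batch instead of all [math]n[/math] pairs; the sizes, learning rate, and data are illustrative:

```python
import numpy as np

# Stochastic gradient descent: each step computes the gradient on a
# small random mini-batch instead of the full training set.
rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
Y = X @ true_w                            # synthetic "correct" outputs

w = np.zeros(d)                           # weights to be learned
eta, batch_size = 0.01, 16
for step in range(500):
    idx = rng.choice(n, size=batch_size, replace=False)  # random subset
    xb, yb = X[idx], Y[idx]
    grad = 2.0 * xb.T @ (xb @ w - yb)     # SSE gradient on the batch
    w -= eta * grad                       # learning step

print(np.max(np.abs(w - true_w)))         # close to the true weights
```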

With regard to minimizing the error, one should keep in mind that ultimately it is not the error on the (already known) training data that matters, but the best possible processing of new data, i.e. “generalization”. Keywords for further research: overfitting, regularization, validation set, test set. Good performance on the training data is usually necessary but not sufficient for good performance on new data.
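As a small illustration of why a held-out set matters, one can fit on a training split and report the error on a validation split as well; the data here are synthetic, and a least-squares fit stands in for network training:

```python
import numpy as np

# Train/validation split: report the error on held-out data too,
# since low training error alone says little about generalization.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# 80/20 train/validation split
X_train, X_val = X[:80], X[80:]
Y_train, Y_val = Y[:80], Y[80:]

# Least-squares fit on the training set only
w, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

train_sse = np.sum((X_train @ w - Y_train) ** 2)
val_sse = np.sum((X_val @ w - Y_val) ** 2)
print(train_sse, val_sse)   # compare both, not just the training error
```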

Gradient descent is used for almost all common network types:

• Multi-Layer Perceptron (MLP)
• Convolutional Network (ConvNet)
• Generative Adversarial Network (GAN)
• Long Short-Term Memory (LSTM)