Regularization techniques: In short

Rolando Quiroz
Sep 2, 2020
Figure 1: Overfitting

When training neural networks, what matters is not so much the performance on the training set, but rather that the network can apply the knowledge acquired during training to new data. This ability is known as generalization, and there are techniques to improve it. Collectively these techniques are called regularization, and they are the subject of this post.

What is regularization? What is its purpose?

Regularization is a method of limiting the effects of what is called overfitting.

Overfitting is a phenomenon that occurs especially when the model’s capacity is too large relative to the data set. In other words, the model has become so specialized on the training set that it has, in a sense, memorized it.

The model is able to memorize the training set precisely because its capacity is too large relative to the data, i.e. the model is too complex.

Figure 2: Model complexity

This is why regularization is used. It takes the form of a term that is added to the cost function in order to penalize certain behaviors of the algorithm.

In a nutshell: regularization is a penalty on the complexity of a model, and it helps prevent overfitting.
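To make this concrete, here is a minimal sketch of a cost function with an added penalty term (the names are illustrative and not from any particular library); the different choices of penalty term give the techniques described below.

```python
import numpy as np

def regularized_cost(w, X, y, alpha):
    """Data-fit term plus a penalty term on the weights."""
    data_loss = np.sum((X @ w - y) ** 2)   # how well the model fits the training data
    penalty = np.sum(w ** 2)               # one possible penalty: sum of squared weights
    return data_loss + alpha * penalty     # alpha controls the strength of the penalty
```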

Some of the different types of regularization include the following:

  • L1 regularization

More specifically called LASSO, which stands for Least Absolute Shrinkage and Selection Operator. There are two key words here: 'absolute' and 'selection'.

Lasso regression performs L1 regularization, i.e. it adds a penalty equal to the sum of the absolute values of the coefficients to the optimization objective. Thus, lasso regression optimizes the following:

Objective = RSS + α * (sum of absolute value of coefficients)

Here, RSS refers to the 'Residual Sum of Squares', which is simply the sum of squared errors between the predicted and actual values on the training data set. The parameter α (alpha) works like its counterpart in ridge regression (described in the next section): it provides a trade-off between minimizing RSS and keeping the magnitude of the coefficients small, and it can take various values:

α = 0: Same coefficients as simple linear regression

α = ∞: All coefficients zero (by the same logic explained for ridge regression below)

0 < α < ∞: coefficients between 0 and those of simple linear regression

Figure 3: L1 regularization does both shrinkage and variable selection

In a nutshell, L1 regularization penalizes the absolute sum of the coefficients (the L1 norm). As α becomes larger, the coefficient estimates shrink towards zero; unlike ridge, lasso can set some coefficients exactly to zero, which performs variable selection.
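As an illustration, here is a minimal sketch using scikit-learn's Lasso on synthetic data. Note that scikit-learn's objective scales the RSS term by 1/(2n), so its alpha is not numerically identical to the α above, but it plays the same role.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first 2 of 10 features actually matter.
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)

# Larger alpha -> stronger shrinkage -> more coefficients driven to exactly zero.
for alpha in (0.01, 0.1, 1.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(lasso.coef_ == 0)
    print(f"alpha={alpha}: {n_zero} of 10 coefficients are exactly zero")
```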

  • L2 regularization

Ridge regression performs L2 regularization, i.e. it adds a penalty equal to the sum of the squares of the coefficients to the optimization objective. Thus, ridge regression optimizes the following:

Objective = RSS + α * (sum of square of coefficients)

Here, α is the parameter that balances the emphasis given to minimizing RSS versus minimizing the sum of squared coefficients, and it can take various values:

α = 0:

  • The objective becomes same as simple linear regression.
  • We’ll get the same coefficients as simple linear regression.

α = ∞:

  • The coefficients will be zero. Why? Because of the infinite weight on the squared coefficients, any nonzero coefficient would make the objective infinite.

0 < α < ∞:

  • The magnitude of α decides the weight given to the different parts of the objective.
  • The coefficients will be somewhere between 0 and those of simple linear regression.

Figure 4: L2 regularization keeps all variables and shrinks the coefficients towards zero

In short, ridge regression decreases the complexity of the model. It penalizes coefficients that are far from zero, forcing them to stay small, but (unlike lasso) it does not set them exactly to zero. As α gets bigger, variance decreases and bias increases.
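For comparison, here is a minimal sketch with scikit-learn's Ridge, whose objective matches the formula above (RSS plus alpha times the sum of squared coefficients); the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)

# Larger alpha shrinks the coefficients towards zero, but keeps them all nonzero.
for alpha in (0.1, 10.0, 1000.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: max |coef| = {np.max(np.abs(ridge.coef_)):.3f}")
```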

  • Dropout

Regularization methods like L2 and L1 reduce overfitting by modifying the cost function. Dropout, on the other hand, modifies the network itself.

Dropout is a method that randomly disables a number of neurons in a neural network. In each training iteration, dropout deactivates a different set of neurons; the deactivated neurons are ignored in both the forward and the backward pass, which forces the remaining neurons not to depend so much on them. This helps reduce overfitting: nearby neurons often learn related patterns, and these relationships can encode something very specific to the training data. With dropout this dependence between neurons is weakened throughout the network, so each neuron has to learn features that are useful on its own rather than relying on its neighbors.

Figure 5: Dropout regularization

Dropout has a parameter that indicates the probability that a neuron remains activated (the keep probability), taking values from 0 to 1. A value of 0.5 is often used by default, meaning that on average half of the neurons remain active; values close to 1 deactivate few neurons, while values close to 0 deactivate many. Dropout is only used during the training phase: in the test phase no neurons are deactivated, but the activations are scaled by the keep probability to compensate for the neurons that were deactivated during training.

A different keep probability can be set for each layer, depending on what we need in each layer. For the input layer a high keep probability (e.g. 0.7) is usually used so that most inputs remain active, while the hidden layers often use a keep probability of around 0.5.
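Here is a minimal numpy sketch of a dropout layer. It implements the "inverted dropout" variant, which divides by the keep probability during training so that nothing needs to change at test time; the function name and arguments are illustrative.

```python
import numpy as np

def dropout_forward(activations, keep_prob=0.5, training=True):
    """Randomly zero out units during training (inverted dropout)."""
    if not training or keep_prob >= 1.0:
        return activations                      # test time: use all units as-is
    # Each unit is kept independently with probability keep_prob.
    mask = np.random.rand(*activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged,
    # so no extra scaling is needed at test time.
    return activations * mask / keep_prob

# Example: roughly half of the hidden units are zeroed out.
hidden = np.random.randn(4, 8)
print(dropout_forward(hidden, keep_prob=0.5, training=True))
```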

  • Data Augmentation

Data augmentation consists of artificially increasing the size of the training set by adding new examples created from distortions of the initial examples. The goal is for the network to learn descriptors specific to the classes of objects considered, rather than image artifacts such as differences in illumination. For example, for natural images, it is clear that an object does not change if the ambient lighting changes or if the observer changes position.

Figure 6: Example of data augmentation applied to a natural image

The best regularization option is to get more data, but collecting more is not always possible. Creating synthetic examples from the real ones works particularly well for visual classification, because it builds in invariance to certain transformations: rotation, translation, horizontal or vertical reflection, scale, light intensity, etc.
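As a minimal sketch (pure numpy, with illustrative function names, assuming images are stored as H×W×C arrays), a few of these transformations can be generated like this:

```python
import numpy as np

def augment(image, rng):
    """Return a randomly distorted copy of an H x W x C image."""
    out = image.copy()
    if rng.rand() < 0.5:
        out = np.fliplr(out)                             # horizontal reflection
    if rng.rand() < 0.5:
        out = np.flipud(out)                             # vertical reflection
    shift = rng.randint(-5, 6)
    out = np.roll(out, shift, axis=1)                    # small horizontal translation
    out = np.clip(out * rng.uniform(0.8, 1.2), 0, 255)   # change in light intensity
    return out

rng = np.random.RandomState(0)
image = rng.randint(0, 256, size=(32, 32, 3)).astype(float)
augmented = [augment(image, rng) for _ in range(10)]     # 10 new training examples
```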

  • Early Stopping

Early stopping limits how far the model fits the training set, based on its performance on a validation set.

It may be too expensive to try different penalties on the weights of the net, so it may be advisable to start training with very small weights and let them grow until the performance on the validation set starts to worsen.

The capacity of the net is limited by preventing the weights from growing too much.

Figure 7: Early stopping point on the training and validation error curves

Why does it work?

When the weights are very small, the hidden layer neurons stay in their linear range: A network with hidden layers of linear neurons has no more capacity than a linear network without hidden layers.

As the weights grow, the hidden neurons begin to behave in a non-linear way (and the capacity of the network increases).
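A minimal sketch of the early-stopping loop, assuming hypothetical train_one_epoch and validation_loss helpers (any framework's equivalents would do):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_model = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)                  # one pass over the training set
        val_loss = validation_loss(model)       # error on held-out validation data
        if val_loss < best_loss:
            best_loss = val_loss
            best_model = copy.deepcopy(model)   # remember the best model so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # validation error keeps worsening
    return best_model
```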

Conclusion

In this article we have explained what regularization is and how we can use it to improve the generalization ability of our machine learning models. In general, regularization helps us reduce the overfitting of machine learning models.

