Normalization and Residual Connections



In the previous chapters, we explored the architecture of the Transformer, its sub-layers, and several key components that contribute to its efficiency and effectiveness. At the heart of the Transformer's design there is another crucial component called "Add & Norm", which is a residual connection immediately followed by layer normalization. It boosts the performance of the Transformer model by mitigating vanishing gradients and stabilizing the training process.

In this chapter, we will understand the role of layer normalization and residual connections, how they work, their benefits, and some practical considerations when implementing them in a Transformer model.

Role of Layer Normalization and Residual Connections

Before we look at the role of normalization and residual connections, let's recall the basics of the Transformer architecture.

The Transformer consists of two parts: an encoder and a decoder. Both the encoder and the decoder are composed of multiple layers, and each layer includes two primary sub-layers: a multi-head attention mechanism and a fully connected feed-forward neural network. In addition, residual connections and layer normalization are applied around these sub-layers to maintain stability and improve training performance.

Residual Connections

Residual connections, also known as skip connections, are introduced to address the vanishing gradient problem. They bypass one or more layers and allow the gradient to flow directly through the network. In simple words, residual connections help the network learn more effectively by letting the gradient pass through the layers without losing much information.

Mathematically, a residual connection can be represented as −

$$\mathrm{Output \: = \: Layer \: Output \: + \: Input}$$

The above equation shows that we add the output of a layer to its input. This helps the model learn the difference (or residual) between the input and the output rather than an entirely new transformation. In this way, residual connections make model training easier and more effective.
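The idea translates directly into code. Below is a minimal PyTorch sketch; the class name ResidualConnection, the wrapped feed-forward sub-layer, and the dimensions are illustrative assumptions, not part of any specific library −

```python
import torch
import torch.nn as nn

class ResidualConnection(nn.Module):
    """Wraps any sub-layer and adds its input back to its output."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        # Output = Layer Output + Input
        return self.sublayer(x) + x

# Example: a residual connection around a simple feed-forward sub-layer
ff = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = ResidualConnection(ff)
x = torch.randn(2, 10, 512)          # (batch, sequence length, model dimension)
print(block(x).shape)                # torch.Size([2, 10, 512])
```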

Layer Normalization

Layer normalization is a technique that normalizes the activations of a layer across the feature dimension, keeping them within a consistent range throughout training. This normalization step keeps the training process stable, especially when dealing with deep neural networks.

Mathematically, the formula of layer normalization for a given input vector x is −

$$\mathrm{\hat{x} \: = \: \frac{x \: - \: \mu}{\sigma}}$$

where μ is the mean and σ is the standard deviation (SD) of the input vector. After normalization, the output is scaled and shifted using learnable parameters as shown below −

$$\mathrm{y \: = \: \gamma \: \cdot \: \hat{x} \: + \: \beta}$$

The benefits of this scaling and shifting mechanism are as follows −

  • It allows the network to retain its representational capacity.
  • It also ensures that the activations remain within a certain range throughout the training process.
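The formulas above can be implemented directly. The following sketch mirrors what PyTorch's built-in nn.LayerNorm provides: it normalizes over the feature dimension and then applies the learnable scale γ and shift β. The small epsilon is an implementation detail, not shown in the formula above, that avoids division by zero −

```python
import torch
import torch.nn as nn

class LayerNormalization(nn.Module):
    """Layer normalization following the formulas above."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # scale, initialized to 1
        self.beta = nn.Parameter(torch.zeros(d_model))   # shift, initialized to 0
        self.eps = eps                                   # avoids division by zero

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)                # mean over the feature dimension
        sigma = x.std(dim=-1, keepdim=True)              # standard deviation over features
        x_hat = (x - mu) / (sigma + self.eps)            # x_hat = (x - mu) / sigma
        return self.gamma * x_hat + self.beta            # y = gamma * x_hat + beta

x = torch.randn(2, 10, 512)
print(LayerNormalization(512)(x).shape)   # torch.Size([2, 10, 512])
```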

Working of Normalization and Residual Connections

In the Transformer architecture, normalization and residual connections are applied to both the multi-head attention and feed-forward neural network (FFNN) sub-layers. Let's see how they work −

  • First, the input x is passed through the multi-head attention sub-layer.
  • Then the output of the self-attention mechanism is added to the original input x. This forms the first residual connection.
  • After that, Layer Normalization (LN) is applied to the sum of the input and the multi-head attention output. Mathematically, this operation can be summarized as follows −

$$\mathrm{Norm1 \: (x \: + \: MultiHeadAttention(x))}$$

  • Now the output of the first Add & Norm step is passed through the FFNN sub-layer.
  • Then the output of the FFNN is added to the input of the FFNN sub-layer, i.e., the output of the first Add & Norm step. This once again forms a residual connection.

After that, Layer Normalization (LN) is applied to the sum of the FFNN input and the FFNN output. Writing the output of the first Add & Norm step as z, the complete operation for one layer can be summarized as follows −

$$\mathrm{z \: = \: Norm1 \: (x \: + \: MultiHeadAttention(x))}$$

$$\mathrm{Output \: = \: Norm2 \: (z \: + \: FFNN(z))}$$
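The two Add & Norm steps can be put together in a short PyTorch sketch of one encoder layer. It uses PyTorch's built-in nn.MultiheadAttention and nn.LayerNorm; the class name EncoderLayer and the chosen dimensions (512 model dimension, 8 heads, 2048 hidden units) are illustrative defaults, not a fixed requirement −

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer with the Add & Norm steps described above."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffnn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # First Add & Norm: z = Norm1(x + MultiHeadAttention(x))
        attn_out, _ = self.attention(x, x, x)
        z = self.norm1(x + attn_out)
        # Second Add & Norm: Output = Norm2(z + FFNN(z))
        return self.norm2(z + self.ffnn(z))

x = torch.randn(2, 10, 512)
print(EncoderLayer()(x).shape)   # torch.Size([2, 10, 512])
```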

Benefits of Normalization and Residual Connections

The combination of residual connections and layer normalization provides the following benefits −

  • Stabilizes Training − Layer normalization keeps the training process stable by ensuring that the activations remain within a consistent range, which helps prevent the vanishing gradient problem.
  • Allows Construction of Deeper Networks − Residual connections allow the construction of deeper networks, which is essential for capturing complex patterns.
  • Improves Learning Speed − Residual connections allow gradients to flow directly through the network. This improves the convergence rate of the model, leading to faster training and better performance.
  • Enhances Model Performance − The combination of Layer Normalization (LN) and residual connections enhances the model's ability to learn complex functions, resulting in improved accuracy and generalization.

Considerations for Normalization and Residual Connections

When implementing the residual connection and layer normalization combination in a Transformer model, we should consider the following −

  • Initialization − We should initialize the network weights properly, for example with He or Xavier initialization; the layer normalization parameters γ and β are typically initialized to one and zero, respectively, as sketched after this list.
  • Hyperparameters − We need to tune hyperparameters such as the size of the hidden layers, the number of attention heads, and the dropout rate carefully, as they impact the performance of the model.
  • Computational Efficiency − We should balance the complexity of the model with the available computational resources, as residual connections and layer normalization add some computational overhead.
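One possible way to apply these initializations in PyTorch is sketched below; the helper name initialize and the example model are illustrative assumptions −

```python
import torch.nn as nn

def initialize(module):
    """Xavier initialization for linear weights; gamma = 1, beta = 0 for layer normalization."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)    # gamma (scale)
        nn.init.zeros_(module.bias)     # beta (shift)

# Apply recursively to every sub-module of a model, e.g. a small stack of layers
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512), nn.LayerNorm(512))
model.apply(initialize)
```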

Conclusion

The "Add & Norm" component which is a residual connection immediately followed by layer normalization is a fundamental aspect of the Transformer architecture.

In this chapter, we discussed the role of layer normalization and residual connections in the Transformer model. By implementing them, the Transformer can train deeper networks effectively, mitigate the vanishing gradient problem, and speed up training. However, proper weight initialization and careful hyperparameter tuning are necessary when putting them into practice.

Understanding the concept of Add & Norm component is important for someone who wants to work with advanced NLP tasks. As research progresses, we can expect further improvements in normalization and residual connections, enhancing the capabilities of Transformer-based architectures.
