Detailed Explanation and Code Implementation of the Self-Attention Mechanism in the Transformer

The self-attention mechanism is the core component of the Transformer model: it allows the model to capture global dependencies when processing sequence data. Instead of treating the sequence as a block of fixed size, it computes how relevant each element of the input sequence is to every other element, which greatly enhances the model's ability to understand and exploit contextual information. In a simple Transformer model, we first define an encoder layer that receives an input sequence and outputs a sequence of encoding vectors. Then we use a decoder layer, which receives the encoded vectors as input and outputs the predicted values of the sequence. Inside these layers we place a self-attention layer, which computes the correlation between each element of the input sequence and the rest of the sequence. The self-attention layer works as follows:

1. For each element in the input sequence, compute its relevance score with respect to every other element, usually as the (scaled) dot product of a query vector and a key vector.
2. Normalize these scores with a softmax so that they become attention weights that sum to one.
3. Use the attention weights to form a weighted sum of the value vectors, producing a new representation for the current element.
4. Pass the new element representations on to the next layer.

In this way, the self-attention mechanism captures global dependencies in the sequence data, which is why the Transformer model performs so well on complex tasks.
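To make these steps concrete, here is a minimal, self-contained sketch of scaled dot-product attention on a toy sequence (the tensor sizes are illustrative assumptions, and the learned query/key/value projections are omitted for brevity):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 4, 8                  # a toy sequence of 4 tokens
x = torch.randn(seq_len, d_model)        # input embeddings

# In a real model, Q, K, V come from learned linear projections of x;
# here we use x directly to keep the sketch short.
Q, K, V = x, x, x

scores = Q @ K.T / d_model ** 0.5        # step 1: scaled relevance scores
weights = F.softmax(scores, dim=-1)      # step 2: attention weights sum to one
out = weights @ V                        # step 3: weighted sum of value vectors

print(weights.shape, out.shape)          # torch.Size([4, 4]) torch.Size([4, 8])
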
The self-attention mechanism is an important technique in deep learning that can capture global dependencies in sequence data.

In fields such as natural language processing and computer vision, the Transformer model has achieved remarkable results by using the self-attention mechanism.

This article will describe how to implement a simple Transformer model and show how it works with code examples.

First, let's understand what the self-attention mechanism is.

The self-attention mechanism calculates the relationship between each element in a sequence and every other element.

It helps us understand how important and relevant each element is within the context of the whole sequence.

Compared with traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs), self-attention captures long-distance dependencies more easily and does not rely on an explicitly defined processing order for the sequence (positional information has to be supplied separately).
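A quick way to see the "no built-in order" property is the following sketch (purely illustrative, using plain dot-product attention rather than the module defined later): permuting the input tokens simply permutes the outputs, which is exactly why Transformers add positional information on top.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, 8)                    # 5 tokens with embedding size 8
perm = torch.randperm(5)                 # a random reordering of the tokens

def attend(t):
    weights = F.softmax(t @ t.T / 8 ** 0.5, dim=-1)
    return weights @ t

# Shuffling the inputs just shuffles the outputs: self-attention itself
# has no notion of token order.
print(torch.allclose(attend(x)[perm], attend(x[perm]), atol=1e-6))  # True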

The following is an implementation of a simplified version of the Transformer model; we will focus on the self-attention mechanism:


import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Split the embedding into self.heads different pieces
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # Perform scaled dot-product attention
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by the square root of the embedding size before the softmax
        attention = torch.softmax(energy / (self.embed_size ** (1 / 2)), dim=3)
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )

        out = self.fc_out(out)
        return out

In this simplified version of the Transformer model, we implement a self-attention layer called SelfAttention.

This class accepts input values, keys, and queries, and calculates attention weights between them.

The attention weights are computed with scaled dot-product attention, and an optional mask is applied to ignore information at certain positions.

Finally, we obtain the output by taking the weighted sum of the value vectors and passing it through a final linear projection.
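As a quick sanity check, here is a hypothetical usage sketch (the batch size, sequence length, and embedding size are arbitrary choices): we feed the same random tensor in as values, keys, and queries, together with an all-ones mask that keeps every position visible.

import torch

embed_size, heads = 256, 8
attention = SelfAttention(embed_size, heads)

x = torch.randn(2, 10, embed_size)       # (batch, seq_len, embed_size)
mask = torch.ones(2, 1, 1, 10)           # 1 = visible; a padding or causal mask would put 0 at ignored positions

out = attention(x, x, x, mask)           # self-attention: values = keys = queries
print(out.shape)                         # torch.Size([2, 10, 256])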

Now, let's look at the implementation of a more complete Transformer model, including components such as multi-head self-attention, feedforward neural network, and layer normalization:

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value, key, query, mask)

        # Add a skip connection and apply layer normalization
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out

In the TransformerBlock class, we first apply the self-attention layer, then add its output to the original query (a residual connection) and apply layer normalization.

Next, we pass the result through a feedforward neural network, add its output to the previous result, and apply layer normalization again.

Finally, we apply dropout to reduce the risk of overfitting.
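Putting it together, a minimal forward pass through one block might look like the following sketch (hyperparameters such as dropout=0.1 and forward_expansion=4 are assumptions, not values from the article):

import torch

embed_size, heads = 256, 8
block = TransformerBlock(embed_size, heads, dropout=0.1, forward_expansion=4)

x = torch.randn(2, 10, embed_size)       # (batch, seq_len, embed_size)
out = block(x, x, x, mask=None)          # value = key = query, no masking in this toy example
print(out.shape)                         # torch.Size([2, 10, 256])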

This is only a simple implementation of the Transformer model; practical applications need to consider more details, such as positional encoding, word embeddings, and stacking multiple layers.
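For illustration, a hypothetical encoder that adds word embeddings, learned positional embeddings, and a stack of TransformerBlock layers might look roughly like this (the Encoder class and its parameters, such as max_length, are assumptions rather than part of the original article):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, num_layers, heads,
                 forward_expansion, dropout, max_length):
        super(Encoder, self).__init__()
        self.word_embedding = nn.Embedding(vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)
        self.layers = nn.ModuleList(
            [TransformerBlock(embed_size, heads, dropout, forward_expansion)
             for _ in range(num_layers)]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # x contains token indices of shape (batch, seq_len)
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N, seq_length)
        out = self.dropout(self.word_embedding(x) + self.position_embedding(positions))

        # Each block attends over the output of the previous one
        for layer in self.layers:
            out = layer(out, out, out, mask)
        return out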

However, this simplified implementation already lets us better understand how the self-attention mechanism works and how it helps the Transformer model capture global dependencies in sequence data.