PyTorch-based Transformer

The Annotated Transformer: Harvard NLP

import copy
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Transformer(nn.Module):

    def __init__(self, encoder, decoder):
        super(Transformer, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

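    def encode(self, x):
        # Run the input sequence through the encoder to get the context
        out = self.encoder(x)
        return out

    def decode(self, z, c):
        # Delegate to the decoder module, passing z together with the context c
        out = self.decoder(z, c)
        return out

    def forward(self, x, z):
        c = self.encode(x)  # context produced by the encoder
        y = self.decode(z, c)  # decoder output given z and the context
        return y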

class Encoder(nn.Module):

    def __init__(self, encoder_block, n_layer):  # n_layer: number of encoder blocks
        super(Encoder, self).__init__()
        # Stack n_layer independent copies of encoder_block. nn.ModuleList
        # registers each copy as a submodule, so its parameters are visible
        # to .parameters(), .to(), and state_dict().
        self.layers = nn.ModuleList(
            [copy.deepcopy(encoder_block) for _ in range(n_layer)]
        )

    def forward(self, x):
        out = x  # start from the input
        for layer in self.layers:
            out = layer(out)  # pass through each block in order
        return out
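
How the stacked layers are stored matters for training. The following is a minimal sketch, not part of the original notes, that compares a plain Python list with `nn.ModuleList`. The `nn.Linear(4, 4)` block and the class names `ListStack` and `ModuleListStack` are stand-ins invented for the illustration.

```python
import copy

import torch.nn as nn

block = nn.Linear(4, 4)  # hypothetical stand-in for an encoder block (20 parameters)


class ListStack(nn.Module):
    def __init__(self, block, n_layer):
        super().__init__()
        # Plain Python list: the copies are NOT registered as submodules
        self.layers = [copy.deepcopy(block) for _ in range(n_layer)]


class ModuleListStack(nn.Module):
    def __init__(self, block, n_layer):
        super().__init__()
        # nn.ModuleList: each copy is registered as a submodule
        self.layers = nn.ModuleList(
            [copy.deepcopy(block) for _ in range(n_layer)]
        )


n_list = sum(p.numel() for p in ListStack(block, 3).parameters())
n_modlist = sum(p.numel() for p in ModuleListStack(block, 3).parameters())
print(n_list, n_modlist)  # 0 60
```

With a plain list, an optimizer built from `model.parameters()` would see none of the layer weights, and `model.to(device)` would not move them.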

class EncoderBlock(nn.Module):

    def __init__(self, self_attention, position_ff):
        super(EncoderBlock, self).__init__()
        self.self_attention = self_attention  # self-attention layer
        self.position_ff = position_ff  # position-wise feed-forward layer

    def forward(self, x):
        out = x  # start from the input

        # Pass through the self-attention layer
        out = self.self_attention(out)

        # Pass through the position-wise feed-forward layer
        out = self.position_ff(out)

        return out

def calculate_attention(query, key, value, mask):
    # query, key, value: (n_batch, seq_len, d_k)
    # mask: (n_batch, seq_len, seq_len)

    d_k = key.shape[-1]  # dimension of the key vectors

    # Compute Q x K^T (pairwise dot products)
    # Result: attention score matrix (n_batch, seq_len, seq_len)
    attention_score = torch.matmul(query, key.transpose(-2, -1))

    # Scale by sqrt(d_k) (scaled dot-product attention)
    attention_score = attention_score / math.sqrt(d_k)

    # If a mask is given, fill positions where mask == 0 with a large
    # negative value so they receive ~0 weight after softmax
    if mask is not None:
        attention_score = attention_score.masked_fill(mask == 0, -1e9)

    # Softmax over the key dimension gives the attention weights
    # Result: attention probabilities (n_batch, seq_len, seq_len)
    attention_prob = F.softmax(attention_score, dim=-1)

    # Weighted sum of the values
    # Result: attention output (n_batch, seq_len, d_k)
    out = torch.matmul(attention_prob, value)

    return out
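
To see the shapes and the mask in action, here is a small usage sketch. It repeats the computation of `calculate_attention` in condensed form so that the block runs on its own. The tensor sizes and the causal (lower-triangular) mask are assumptions chosen for illustration and do not come from the original notes.

```python
import math

import torch
import torch.nn.functional as F


def calculate_attention(query, key, value, mask=None):
    # Same computation as the function above, condensed
    d_k = key.shape[-1]
    score = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        score = score.masked_fill(mask == 0, -1e9)
    prob = F.softmax(score, dim=-1)
    return torch.matmul(prob, value)


torch.manual_seed(0)
n_batch, seq_len, d_k = 2, 5, 8  # illustrative sizes
q = torch.randn(n_batch, seq_len, d_k)
k = torch.randn(n_batch, seq_len, d_k)
v = torch.randn(n_batch, seq_len, d_k)

# Causal mask: 1 where attention is allowed (key position <= query position)
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).expand(n_batch, -1, -1)

out = calculate_attention(q, k, v, causal_mask)
print(out.shape)  # torch.Size([2, 5, 8])
```

Under this mask the first position can attend only to itself, so `out[:, 0]` equals `v[:, 0]`.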

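Scaled dot-product attention

As the name says, it is dot-product attention with the scores scaled.

Reason: the gradient of softmax is expressed as a product of its output probabilities, so when some of those probabilities are very close to 0, the gradient can vanish. This happens when the scores are widely spread: if the components of Q and K are independent with mean 0 and variance 1, each dot product has variance $d_k$, so a large $d_k$ pushes softmax toward saturation. Dividing by $\sqrt{d_k}$ reduces the variance of the Q·K dot-product matrix back to about 1.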
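
The variance argument can be checked numerically. This is a minimal sketch using only the standard library. It assumes the entries of q and k are i.i.d. standard normal, and the helper name `dot_product_variance` is made up for the example.

```python
import math
import random
import statistics

random.seed(0)


def dot_product_variance(d_k, n_samples=2000, scale=False):
    """Sample variance of q.k for q, k with i.i.d. standard normal entries."""
    dots = []
    for _ in range(n_samples):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        # Optionally divide by sqrt(d_k), as scaled dot-product attention does
        dots.append(dot / math.sqrt(d_k) if scale else dot)
    return statistics.variance(dots)


raw_var = dot_product_variance(64)
scaled_var = dot_product_variance(64, scale=True)
print(f"d_k=64 raw: {raw_var:.1f}, scaled: {scaled_var:.2f}")
```

The raw variance should come out close to `d_k` (here 64), and the scaled variance close to 1.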

Dot-Product Attention

$\mathrm{score}(s_t, h_i) = s_t^\top h_i$


<aside> 📢 Build an English-German translation model using the Transformer model provided by PyTorch

</aside>

Dataset used for training: Multi30k, a multilingual dataset

Multi30k

An English-German parallel corpus with about 30,000 examples

The torchdata and torchtext libraries make it easy to download

pip install torchdata torchtext portalocker
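
Once the packages are installed, loading the data might look like the sketch below. It assumes a torchtext version that still ships `torchtext.datasets.Multi30k`, and network access on first use. The helper name `load_multi30k_pairs` and its `limit` parameter are hypothetical, introduced only for this example.

```python
def load_multi30k_pairs(split="train", limit=3):
    """Return the first `limit` (English, German) sentence pairs of Multi30k.

    Sketch only: requires torchtext and torchdata, and downloads the
    dataset the first time it is called.
    """
    # Imported lazily so that defining the helper does not require torchtext
    from torchtext.datasets import Multi30k

    data_iter = Multi30k(split=split, language_pair=("en", "de"))
    pairs = []
    for i, (src, tgt) in enumerate(data_iter):
        if i >= limit:
            break
        pairs.append((src.strip(), tgt.strip()))
    return pairs
```

Calling `load_multi30k_pairs()` should return a short list of `(english, german)` string tuples that can be inspected before building a vocabulary.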