Attention Mechanism

📖 Detailed Explanation

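The attention mechanism assigns a different importance (weight) to each part of an input sequence so that the model "attends" to the most relevant information. Just as a person reading a long sentence focuses on the key words rather than giving every word equal attention, the model gives more weight to the parts that matter.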

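Self-Attention, introduced as the core of the Transformer architecture in Google's 2017 paper "Attention Is All You Need", computes the relationships between all positions in a sequence in parallel. Each token is projected into Query (Q), Key (K), and Value (V) vectors. The similarity between Q and K gives the attention scores, which are normalized with softmax and used to weight V, producing a context-aware output.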
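The computation described above is usually written as the scaled dot-product attention formula from that paper, where $d_k$ is the dimension of the key vectors:

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```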

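Multi-Head Attention runs several attention operations in parallel to capture relationships from different perspectives. For example, one head may learn syntactic relations while another learns semantic similarity. Large models use many heads: the largest GPT-3 model uses 96 attention heads per layer.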

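The attention mechanism overcame the sequential-processing bottleneck of RNNs and LSTMs and made it practical to learn long-range dependencies even in long sequences. Today it is a core component of nearly every modern language model, including GPT, BERT, and Claude, as well as vision models such as the Vision Transformer (ViT).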

💻 Code Example

An example implementation of Scaled Dot-Product Attention and Multi-Head Attention in PyTorch.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Scaled Dot-Product Attention
def scaled_dot_product_attention(Q, K, V, mask=None):
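    """
    Q, K, V: (..., seq_len, d_k); leading batch/head dimensions are allowed.
    mask: broadcastable to the score shape; positions where mask == 0 are hidden.
    """
    d_k = Q.size(-1)
    # Attention scores: dot product of Q and K, scaled by sqrt(d_k)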
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Normalize the scores into weights with softmax
    attention_weights = F.softmax(scores, dim=-1)

    # Apply the weights to V
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

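# Multi-Head Attention module
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        # Each head gets an equal slice of d_model, so it must divide evenly
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads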

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # Linear projection + reshape to heads
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Apply attention to all heads at once (the weights are not needed here)
        x, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Concat heads + final linear
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(x)

# Usage example
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)  # (batch=2, seq_len=10, d_model=512)
output = mha(x, x, x)  # Self-Attention: Q, K, and V all come from x
print(f"Output shape: {output.shape}")  # torch.Size([2, 10, 512])

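📊 Comparison of Attention Variants

Attention type	Time complexity	Characteristics	Used in
Full Attention	O(n²)	Computes relationships between all token pairs	BERT, GPT-3
Sparse Attention	O(n√n)	Computes only selected attention patterns	GPT-3 (some layers)
Flash Attention	O(n²) compute, O(n) memory	IO-aware exact attention, lower memory use	LLaMA, Mistral
Linear Attention	O(n)	Uses a kernel approximation	Performer
Multi-Query Attention	O(n²), smaller KV cache	K and V shared across heads	PaLM, Falcon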
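To make the "smaller KV cache" entry for Multi-Query Attention concrete, here is a rough back-of-the-envelope sketch. The function name `kv_cache_bytes` and the model dimensions (32 layers, 32 heads, head dimension 128, fp16) are hypothetical values chosen for illustration, not figures for any specific model.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   batch=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer (the factor 2 is K plus V)."""
    return 2 * batch * num_layers * num_kv_heads * seq_len * head_dim * bytes_per_elem

# Hypothetical model: 32 layers, 32 query heads, head_dim 128, fp16 (2 bytes)
mha_bytes = kv_cache_bytes(seq_len=4096, num_layers=32, num_kv_heads=32, head_dim=128)
# Multi-Query Attention keeps a single K/V head shared by all query heads
mqa_bytes = kv_cache_bytes(seq_len=4096, num_layers=32, num_kv_heads=1, head_dim=128)

print(f"MHA: {mha_bytes / 2**30:.2f} GiB")    # MHA: 2.00 GiB
print(f"MQA: {mqa_bytes / 2**30:.4f} GiB")    # MQA: 0.0625 GiB
print(f"ratio: {mha_bytes // mqa_bytes}x")    # ratio: 32x
```

Under these assumptions the cache shrinks by a factor equal to the number of heads. The compute cost of the attention scores themselves stays O(n²).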

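🗣️ How to Say It at Work

When discussing model architecture

"Standard attention will run out of memory if we try to handle 128K tokens. We need to apply Flash Attention 2 or Sliding Window Attention."

When debugging or analyzing

"Visualizing the attention map shows the model is ignoring the negation word 'not'. Let's revise the prompt."

In a performance optimization meeting

"Inference is slow because of the attention computation. If we add a KV cache and switch to Multi-Query Attention, we expect roughly a 30% speedup."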

⚠️ Common Mistakes & Cautions

💾 Memory blowup (OOM)

Standard attention materializes an n × n score matrix, so its memory use grows as O(n²). For long sequences, apply Flash Attention or gradient checkpointing.
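A quick estimate shows how fast the O(n²) score matrix grows. This is a minimal sketch assuming fp16 scores, 32 heads, and a batch size of 1. The helper name `attention_scores_bytes` and those values are illustrative assumptions.

```python
def attention_scores_bytes(seq_len, num_heads, batch=1, bytes_per_elem=2):
    """Memory for the (batch, heads, n, n) score matrix of ONE attention layer."""
    return batch * num_heads * seq_len * seq_len * bytes_per_elem

for n in (2_048, 32_768, 131_072):
    gib = attention_scores_bytes(n, num_heads=32) / 2**30
    print(f"n={n:>7}: {gib:,.2f} GiB per layer")
```

With these assumptions, going from 2K to 128K tokens (64 times longer) multiplies the score-matrix memory by 4,096, which is why implementations that avoid materializing the full matrix matter for long contexts.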

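🔢 Missing scaling

If you do not divide by √d_k, the dot products grow with d_k and softmax saturates toward a near one-hot distribution, which leaves very small gradients. Always use Scaled Dot-Product attention.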
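The effect of the √d_k factor can be seen with a small standard-library experiment. This is only an illustrative sketch: the random Gaussian vectors, `d_k = 512`, and the helper `softmax` are assumptions made for the demo, not part of the original example.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
d_k, seq_len = 512, 10
q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(seq_len)]

# Raw dot products have a standard deviation of about sqrt(d_k)
raw = [sum(a * b for a, b in zip(q, k)) for k in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

max_raw = max(softmax(raw))
max_scaled = max(softmax(scaled))
print(f"largest weight without scaling: {max_raw:.3f}")
print(f"largest weight with scaling:    {max_scaled:.3f}")
```

Dividing the logits by a constant greater than 1 always flattens the softmax, so the unscaled version concentrates more weight on a single key.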

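🎭 Masking mistakes

If the decoder can see future tokens during training, training breaks down because the model learns to copy the answer. Always apply a causal mask.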
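A causal mask can be sketched as a lower-triangular matrix. The example below uses the same convention as the `mask == 0` check in `scaled_dot_product_attention` earlier on this page (0 means "hidden"). The tensor sizes are arbitrary, and `float("-inf")` is used here in place of `-1e9` as one possible choice.

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 5, 16
# Lower-triangular matrix: position i may attend to positions 0..i only
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

torch.manual_seed(0)
q = torch.randn(1, seq_len, d_k)
k = torch.randn(1, seq_len, d_k)
v = torch.randn(1, seq_len, d_k)

scores = q @ k.transpose(-2, -1) / (d_k ** 0.5)
# Hide future positions before softmax so they receive zero weight
scores = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)
output = weights @ v

print(weights[0])  # entries above the diagonal are all zero
```

Because the mask has shape (seq_len, seq_len), it broadcasts over the batch and head dimensions, so the same tensor could be passed as the `mask` argument of the multi-head module.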

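📊 Too many heads

With too many heads, each head's dimension (d_k = d_model / num_heads) shrinks and its expressive power drops. As a rule of thumb, keep d_model / num_heads at 64 or higher.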

📚 Learn More

📄 Attention Is All You Need - the original Transformer paper
🎨 The Illustrated Transformer - a visual explanation (Jay Alammar)
⚡ FlashAttention paper - an efficient attention implementation