This post focuses on implementing the model, supplemented with the necessary mathematical formulas. A deeper treatment of the underlying mathematics is covered in Transformer. Before reading on, make sure you have a thorough understanding of the mathematics of the Transformer decoder.

The code in this post is based primarily on karpathy/nanoGPT and karpathy/build-nanogpt, and also draws on the Hugging Face transformers implementation of GPT-2. The Reference section at the end lists the full set of sources. The main differences from the nanoGPT implementation are:

  • Module initialization parameters: nanoGPT and Hugging Face mostly pass a config object directly, whereas the __init__ methods in this post spell out every parameter explicitly, which can be thought of as unpacking the config dict
  • Number and names of hyperparameters: following the Hugging Face implementation, layer_norm_epsilon is added and block_size is renamed to n_positions

Transformer Decoder and GPT-2

A Transformer decoder is a stack of identical blocks, each consisting of three sublayers:

  1. Causal (i.e. Masked) Multi-Head Self-Attention
  2. Encoder-Decoder Attention
  3. Position-wise Feed-Forward Network

Before entering the first decoder block, the input passes through:

  1. Input Embedding
  2. Positional Embedding

After each sublayer's output, Add & Norm is applied:

  1. Residual Connection
  2. Layer Normalization

GPT-2 is a decoder-only model: it keeps only the decoder and removes the encoder. Its decoder is similar to the Transformer's, with the following main differences:

| Components | Transformer | GPT-2 |
| --- | --- | --- |
| Encoder-Decoder Attention | Present, attends to the encoder output | Absent; only self-attention is used |
| Position-wise Feed-Forward Network | ReLU activation | GELU activation |
| Layer Normalization | Applied after each sublayer's output (post-norm) | Applied to each sublayer's input (pre-norm) |
| Positional Encoding | Fixed sinusoidal encoding | Learned positional embedding |
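
The LayerNorm placement difference is easiest to see as code. Below is a minimal sketch contrasting the two residual patterns; sublayer and ln are placeholders standing for an attention/FFN module and a LayerNorm, not part of the implementation in this post:

def post_norm_step(x, sublayer, ln):
    # original Transformer: Add & Norm applied after the sublayer
    return ln(x + sublayer(x))


def pre_norm_step(x, sublayer, ln):
    # GPT-2: LayerNorm applied to the sublayer input, residual added afterwards
    return x + sublayer(ln(x))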

The GPT architecture is shown below:

The full architecture of a generative pre-trained transformer (GPT) model

  • Blue: Function with trainable parameters
  • Yellow: Function without trainable parameters
  • Orange: Transformer block
  • Green: Function activated during inference, inactive during training

This post introduces the GPT-2 architecture in the following order:

  1. GPT Block Layer: the Transformer decoder layer, including the core attention mechanism
  2. Full GPT: stacking multiple block layers and combining them with the embedding layers and the prediction head to form the complete GPT model

Before that, we first go over the GPT-2 hyperparameters.

Hyperparameters Configuration

GPT-2 comes in four model sizes, distinguished by parameter count:

| #Parameters | #Decoders | $d_{\text{model}}$ |
| --- | --- | --- |
| 117M - 124M | 12 | 768 |
| 345M - 350M | 24 | 1024 |
| 762M - 774M | 36 | 1280 |
| 1542M - 1558M | 48 | 1600 |

In fact, the original GPT-2 paper miscounted the model sizes. In the #Parameters column above, the left value of each row is taken from the paper, while the right value is the actual size. The official OpenAI GPT-2 GitHub repository notes the erratum as follows:

Note that our original parameter counts were wrong due to an error (in our previous blog posts and paper). Thus you may have seen small referred to as 117M and medium referred to as 345M.
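
As a quick sanity check on the 124M figure, the parameter count can be reproduced from the hyperparameters introduced below; the breakdown assumes the weight-tied LM head (which therefore adds no extra parameters) and biases everywhere:

V, T, C, L = 50257, 1024, 768, 12  # vocab size, context length, width, number of layers

embeddings = V * C + T * C                   # wte + wpe
attn = (C * 3 * C + 3 * C) + (C * C + C)     # c_attn + c_proj (weights + biases)
mlp = (C * 4 * C + 4 * C) + (4 * C * C + C)  # c_fc + c_proj (weights + biases)
ln = 2 * (2 * C)                             # ln_1 + ln_2 (weight + bias each)
per_block = attn + mlp + ln

total = embeddings + L * per_block + 2 * C   # + final layer norm
print(f"{total:,}")                          # 124,439,808, i.e. ~124M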

Taking GPT-2 124M (hereafter simply GPT-2) as an example, the hyperparameters are configured as follows:

| Components | Hyperparameters | Value | Notation | Description |
| --- | --- | --- | --- | --- |
| Input Embedding | vocab_size | 50257 | $V$ | number of tokens |
| Positional Embedding | n_positions | 1024 | $T$ | maximum sequence length |
| Nearly every sublayer | n_embd | 768 | $C$ | embedding dimension |
| - | n_layer | 12 | $N$ | number of block layers |
| Self Attention | n_head | 12 | $H$ | number of attention heads |
| Residual Connections | resid_pdrop | 0.1 | $p_{\text{resid}}$ | dropout probability for residual connections |
| Embedding Layer | embd_pdrop | 0.1 | $p_{\text{embd}}$ | dropout probability for embedding layer |
| Self Attention | attn_pdrop | 0.1 | $p_{\text{attn}}$ | dropout probability for attention weights |
| Layer Norm | layer_norm_epsilon | 1e-5 | $\epsilon$ | layer norm epsilon |
| - | initializer_range | 0.02 | $\sigma$ | standard deviation of weight initializer |
| Layer Norm and Linears | bias | True | - | whether to include bias in the Linears and LayerNorms |

The implementation:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    # number of block layers
    n_layer: int = 12
    # number of attention heads
    n_head: int = 12
    # embedding dimension
    n_embd: int = 768
    # number of tokens: 50,000 BPE merges + 256 bytes tokens + 1 <|endoftext|> token
    vocab_size: int = 50257
    # maximum sequence length
    n_positions: int = 1024
    # dropout probability for embedding layer
    embd_pdrop: float = 0.1
    # dropout probability for residual connections
    resid_pdrop: float = 0.1
    # dropout probability for attention weights
    attn_pdrop: float = 0.1
    # layer norm epsilon
    layer_norm_epsilon: float = 1e-5
    # std of weight initializer
    initializer_range: float = 0.02
    # whether to include bias in the Linears and LayerNorms
    bias: bool = True
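
As a small usage sketch (the names here are only for illustration), a config instance can be turned into a plain dictionary with asdict and unpacked into keyword arguments, which is the dict-unpacking analogy mentioned at the beginning of this post:

from dataclasses import asdict

config = GPTConfig()                   # GPT-2 124M defaults
args = asdict(config)                  # plain dict of all hyperparameters
print(args["n_embd"], args["n_head"])  # 768 12
# the full model defined later is built the same way, e.g. GPT(**asdict(config))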

GPT Block Layer

We first focus on the core GPT Block Layer, i.e. the Transformer decoder layer. It consists of:

  • Causal Multi-Head Self-Attention
  • Feed-Forward Networks
  • Layer Norm
  • Residual Connection

Causal Multi-Head Self-Attention

Recall the Transformer decoder self-attention formulas:

$$\begin{aligned} \mathbf{Z} = \mathrm{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) &= \mathrm{Concat}(\mathrm{head}_{1}, \ldots, \mathrm{head}_{H}) \mathbf{W}^{O} \\ \text{where} \ \mathrm{head}_{i} &= \mathrm{Attention}(\mathbf{Q}_{i}, \mathbf{K}_{i}, \mathbf{V}_{i}) \\ &= \mathrm{softmax} \bigg(\frac{\mathbf{Q}_{i}\mathbf{K}_{i}^{\top}}{\sqrt{D_{k}}} \bigg)\mathbf{V}_{i} \end{aligned}$$

where:

  • $\mathbf{X} \in \mathbb{R}^{T \times C}$: token embedding matrix
  • $\mathbf{Q}_{i} = \mathbf{X}\mathbf{W}^{Q_{i}} \in \mathbb{R}^{T \times \frac{C}{H}}$: query matrix for head $i$
  • $\mathbf{K}_{i} = \mathbf{X}\mathbf{W}^{K_{i}} \in \mathbb{R}^{T \times \frac{C}{H}}$: key matrix for head $i$
  • $\mathbf{V}_{i} = \mathbf{X}\mathbf{W}^{V_{i}} \in \mathbb{R}^{T \times \frac{C}{H}}$: value matrix for head $i$
  • $\mathbf{W}^{O} \in \mathbb{R}^{H\frac{C}{H} \times C}$: output projection matrix

The implementation:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F  # used by the Flash Attention variant below


class CausalSelfAttention(nn.Module):

    def __init__(
        self,
        n_embd: int,
        n_positions: int,
        n_head: int,
        attn_pdrop: float = 0.1,
        resid_pdrop: float = 0.1,
        bias: bool = True,
    ) -> None:
        """Initialize the module.

        Args:
            n_embd (int): Embedding dimension.
            n_positions (int): Maximum sequence length.
            n_head (int): Number of attention heads.
            attn_pdrop (float, optional):
                Dropout probability for attention weights. Defaults to 0.1.
            resid_pdrop (float, optional):
                Dropout probability for residual connections. Defaults to 0.1.
            bias (bool, optional):
                Whether to include bias terms when calculating k, q, v projections.
                Defaults to True.
        """
        super().__init__()
        assert n_embd % n_head == 0  # n_embd must be divisible by n_head
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(n_embd, 3 * n_embd, bias=bias)
        # output projection
        self.c_proj = nn.Linear(n_embd, n_embd, bias=bias)
        self.c_proj.NANOGPT_SCALE_INIT = 1  # special scaled initialization
        # regularization
        self.attn_dropout = nn.Dropout(attn_pdrop)
        self.resid_dropout = nn.Dropout(resid_pdrop)
        self.n_head = n_head
        # precompute and cache mask
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(n_positions, n_positions)).view(
                1, 1, n_positions, n_positions
            ),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, n_embd).

        Returns:
            torch.Tensor: Output tensor of the same shape as input.
        """
        # B: batch size, T: sequence length, C: embedding dimension (=n_embd)
        B, T, C = x.size()

        # calculate q, k, v for all heads in batch
        # (B, T, C) -> (B, T, 3C) -> (B, T, C) x 3
        q, k, v = self.c_attn(x).split(C, dim=-1)
        # move head dim forward to be the batch dim
        # (B, T, C) -> (B, T, nh, hs) -> (B, nh, T, hs)
        # C = nh * hs, where nh: number of heads, hs: head size
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        attn_weights = (q @ k.transpose(-2, -1)) * (
            1.0 / math.sqrt(k.size(-1))  # scaling factor
        )  # (B, nh, T, hs) x (B, nh, hs, T) = (B, nh, T, T)
        attn_weights.masked_fill_(self.mask[:, :, :T, :T] == 0, float("-inf"))
        attn_weights = torch.softmax(attn_weights, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)
        # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = attn_weights @ v
        # re-assemble all head outputs side by side
        y = (
            y.transpose(1, 2)  # (B, T, nh, hs)
            .contiguous()      # equivalent to `.reshape(B, T, C)`
            .view(B, T, C)     # (B, T, C)
        )

        # output projection
        return self.resid_dropout(self.c_proj(y))  # (B, T, C)

This implementation does not match the formulas above line by line; the main differences lie in how $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ are computed:

  • q, k, v = self.c_attn(x).split(C, dim=-1): the linear layer self.c_attn maps the input dimension from $C$ to $3C$ in one shot, which amounts to computing the concatenated $\mathbf{QKV}$ directly; split then separates it into individual $\mathbf{Q}, \mathbf{K}, \mathbf{V}$
  • q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2): the $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ computed above contain all heads; view and transpose split them into separate heads, as the shape check below illustrates
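
To make the tensor bookkeeping concrete, here is a small standalone shape check under the GPT-2 124M sizes; B=2 and T=16 are arbitrary values chosen only for illustration:

import torch
import torch.nn as nn

B, T, C, n_head = 2, 16, 768, 12
x = torch.randn(B, T, C)

c_attn = nn.Linear(C, 3 * C)
q, k, v = c_attn(x).split(C, dim=-1)  # three tensors, each of shape (B, T, C)
q = q.view(B, T, n_head, C // n_head).transpose(1, 2)
print(q.shape)                        # torch.Size([2, 12, 16, 64]), i.e. (B, nh, T, hs)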

PyTorch provides ready-made implementations:

  • F.scaled_dot_product_attention: dispatches to Flash Attention (FlashAttention-2) kernels when available, greatly accelerating the attention computation
  • nn.MultiheadAttention: built on top of F.scaled_dot_product_attention, with a richer set of options; a single API covers several attention variants:
    • Encoder-Decoder Cross Attention: query, key, value can come from different inputs (query from the decoder, key and value from the encoder output); is_causal stays False (the default)
    • Encoder Self-Attention: query, key, value come from the same input; set is_causal=False in the forward pass so that no causal mask is applied
    • Decoder Causal Self-Attention: query, key, value come from the same input; set is_causal=True in the forward pass
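
As a standalone illustration (not part of the model code in this post), the causal and non-causal variants can be expressed directly with these APIs; the tensors below are random placeholders and the sizes mirror GPT-2 124M:

import torch
import torch.nn.functional as F

B, H, T, hs = 2, 12, 16, 64
q = torch.randn(B, H, T, hs)
k = torch.randn(B, H, T, hs)
v = torch.randn(B, H, T, hs)

# decoder-style causal self-attention vs. encoder-style bidirectional attention
y_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)
y_full = F.scaled_dot_product_attention(q, k, v, is_causal=False)
print(y_causal.shape, y_full.shape)  # both (B, H, T, hs)

# the same causal pattern with nn.MultiheadAttention (self-attention: q = k = v = x)
mha = torch.nn.MultiheadAttention(embed_dim=H * hs, num_heads=H, batch_first=True)
x = torch.randn(B, T, H * hs)
causal_mask = torch.nn.Transformer.generate_square_subsequent_mask(T)
y, _ = mha(x, x, x, attn_mask=causal_mask)
print(y.shape)                       # (B, T, H * hs)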

To take advantage of Flash Attention, the code is modified as follows:

class CausalSelfAttention(nn.Module):

    def __init__(
        self,
        n_embd: int,
        n_positions: int,
        n_head: int,
        attn_pdrop: float = 0.1,
        resid_pdrop: float = 0.1,
        bias: bool = True,
    ) -> None:
        """Initialize the module.

        Args:
            n_embd (int): Embedding dimension.
            n_positions (int): Maximum sequence length.
            n_head (int): Number of attention heads.
            attn_pdrop (float, optional):
                Dropout probability for attention weights. Defaults to 0.1.
            resid_pdrop (float, optional):
                Dropout probability for residual connections. Defaults to 0.1.
            bias (bool, optional):
                Whether to include bias terms when calculating k, q, v projections.
                Defaults to True.
        """
        super().__init__()
        assert n_embd % n_head == 0  # n_embd must be divisible by n_head
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(n_embd, 3 * n_embd, bias=bias)
        # output projection
        self.c_proj = nn.Linear(n_embd, n_embd, bias=bias)
        self.c_proj.NANOGPT_SCALE_INIT = 1  # special scaled initialization
        # regularization
        self.attn_dropout = nn.Dropout(attn_pdrop)
        self.attn_pdrop = attn_pdrop  # save for Flash Attention
        self.resid_dropout = nn.Dropout(resid_pdrop)
        self.n_head = n_head
        # flash attention, supported only in PyTorch >= 2.0
        self.flash = hasattr(F, "scaled_dot_product_attention")
        if not self.flash:
            print(
                "WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0"
            )
        # precompute and cache mask
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(n_positions, n_positions)).view(
                1, 1, n_positions, n_positions
            ),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, n_embd).

        Returns:
            torch.Tensor: Output tensor of the same shape as input.
        """
        # B: batch size, T: sequence length, C: embedding dimension (=n_embd)
        B, T, C = x.size()

        # calculate q, k, v for all heads in batch
        # (B, T, C) -> (B, T, 3C) -> (B, T, C) x 3
        q, k, v = self.c_attn(x).split(C, dim=-1)
        # move head dim forward to be the batch dim
        # (B, T, C) -> (B, T, nh, hs) -> (B, nh, T, hs)
        # C = nh * hs, where nh: number of heads, hs: head size
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = F.scaled_dot_product_attention(
                q,
                k,
                v,
                dropout_p=self.attn_pdrop if self.training else 0,
                is_causal=True,
            )
        else:
            attn_weights = (q @ k.transpose(-2, -1)) * (
                1.0 / math.sqrt(k.size(-1))  # scaling factor
            )  # (B, nh, T, hs) x (B, nh, hs, T) = (B, nh, T, T)
            attn_weights.masked_fill_(
                self.mask[:, :, :T, :T] == 0, float("-inf")
            )
            attn_weights = torch.softmax(attn_weights, dim=-1)
            attn_weights = self.attn_dropout(attn_weights)
            # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
            y = attn_weights @ v
        # re-assemble all head outputs side by side
        y = (
            y.transpose(1, 2)  # (B, T, nh, hs)
            .contiguous()      # equivalent to `.reshape(B, T, C)`
            .view(B, T, C)     # (B, T, C)
        )

        # output projection
        return self.resid_dropout(self.c_proj(y))  # (B, T, C)

In addition, self.c_proj.NANOGPT_SCALE_INIT = 1 marks c_proj for a special, scaled weight initialization, which is carried out in the GPT._init_params method. The same marker also appears on MLP.c_proj.
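
The rationale behind this scaled initialization (following nanoGPT) is that every block contributes two residual additions, so after N blocks the variance of the residual stream grows roughly by a factor of 2N; scaling the std of the residual-output projections by (2N)^{-0.5} keeps it roughly constant. A tiny simulation of this argument (the numbers are illustrative only):

import torch

n_layer, C = 12, 768

x = torch.zeros(C)
for _ in range(2 * n_layer):    # two residual additions per block
    x += torch.randn(C) * 0.02  # unscaled contributions
print(round(x.std().item(), 3)) # ~0.02 * sqrt(24) ≈ 0.098

x = torch.zeros(C)
for _ in range(2 * n_layer):
    x += torch.randn(C) * 0.02 * (2 * n_layer) ** -0.5  # scaled contributions
print(round(x.std().item(), 3)) # ~0.02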

Feed-Forward Networks

The FFN in GPT-2 differs from the Transformer's only in the activation function: GPT-2 uses GELU while the Transformer uses ReLU. The GPT-2 FFN is computed as:

$$\begin{aligned} \mathrm{FFN}(\mathbf{Z}) &= \mathrm{MLP}(\mathbf{Z}) \\ &= \mathrm{GELU}(\mathbf{Z}\mathbf{W}_{1} + \mathbf{b}_{1})\mathbf{W}_{2} + \mathbf{b}_{2} \end{aligned}$$

where $\mathbf{W}_{1} \in \mathbb{R}^{C \times 4C}$, $\mathbf{W}_{2} \in \mathbb{R}^{4C \times C}$, $\mathbf{b}_{1} \in \mathbb{R}^{4C}$, $\mathbf{b}_{2} \in \mathbb{R}^{C}$.

The implementation:

class MLP(nn.Module):

    def __init__(
        self, n_embd: int, resid_pdrop: float = 0.1, bias: bool = True
    ) -> None:
        """Initialize the module.

        Args:
            n_embd (int): Embedding dimension.
            resid_pdrop (float, optional):
                Dropout probability for residual connections. Defaults to 0.1.
            bias (bool, optional):
                Whether to include bias terms in the linear layers.
                Defaults to True.
        """
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd, bias)
        self.act = nn.GELU("tanh")
        self.c_proj = nn.Linear(4 * n_embd, n_embd, bias)
        self.c_proj.NANOGPT_SCALE_INIT = 1  # special scaled initialization
        self.dropout = nn.Dropout(resid_pdrop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, n_embd).

        Returns:
            torch.Tensor: Output tensor of the same shape as input.
        """
        x = self.c_fc(x)
        x = self.act(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

Note: the official GPT-2 implementation and Hugging Face transformers implement the MLP (and the attention projections) with Conv1D rather than Linear. The difference is that Linear transposes its weight matrix during the forward computation, while Conv1D does not. Consequently, when loading pretrained GPT-2 weights from Hugging Face transformers, the corresponding weight matrices must be transposed. See the discussion in Shouldn't GPT2 use Linear instead of Conv1D? · Issue #311 · huggingface/transformers; the concrete handling appears in the Load Pretrained Weights From HuggingFace section.
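
The following self-contained sketch shows why the transpose is needed. HFConv1D here is a simplified stand-in written for this post; it mimics the storage convention of transformers' Conv1D (weight stored as (in_features, out_features)) and is not the actual transformers class:

import torch
import torch.nn as nn


class HFConv1D(nn.Module):
    """Simplified stand-in for transformers' Conv1D: y = x @ W + b, with W stored as (in, out)."""

    def __init__(self, n_in: int, n_out: int) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_in, n_out) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight + self.bias


conv = HFConv1D(768, 3 * 768)
linear = nn.Linear(768, 3 * 768)          # weight stored as (out, in), computes x @ W.T + b
with torch.no_grad():
    linear.weight.copy_(conv.weight.t())  # transpose when importing Conv1D-style weights
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 16, 768)
print(torch.allclose(conv(x), linear(x), atol=1e-5))  # expected: True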

Full Implementation of GPT Block Layer

With self-attention and the FFN in place, we can combine them with LayerNorm and residual connections to build the complete block layer. The GPT block and the Transformer decoder place LayerNorm differently:

  • GPT Block: applied to the input of each sublayer (pre-norm)
  • Transformer Decoder: applied to the output of each sublayer (post-norm)

The implementation:

class Block(nn.Module):

    def __init__(
        self,
        n_embd: int,
        n_positions: int,
        n_head: int,
        attn_pdrop: float = 0.1,
        resid_pdrop: float = 0.1,
        layer_norm_epsilon: float = 1e-5,
        bias: bool = True,
    ) -> None:
        """Initialize the module.

        Args:
            n_embd (int): Embedding dimension.
            n_positions (int): Maximum sequence length.
            n_head (int): Number of attention heads.
            attn_pdrop (float, optional):
                Dropout probability for attention weights. Defaults to 0.1.
            resid_pdrop (float, optional):
                Dropout probability for residual connections. Defaults to 0.1.
            layer_norm_epsilon (float, optional):
                Layer norm epsilon. Defaults to 1e-5.
            bias (bool, optional):
                Whether to include bias terms in the layers. Defaults to True.
        """
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd, eps=layer_norm_epsilon, bias=bias)
        self.attn = CausalSelfAttention(
            n_embd, n_positions, n_head, attn_pdrop, resid_pdrop, bias
        )
        self.ln_2 = nn.LayerNorm(n_embd, eps=layer_norm_epsilon, bias=bias)
        self.mlp = MLP(n_embd, resid_pdrop, bias)  # forward the bias flag as well

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, seq_len, n_embd).

        Returns:
            torch.Tensor: Output tensor of the same shape as input.
        """
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

Full GPT

With the GPT Block Layer in place, we can stack multiple blocks and combine them with the embedding layers and the prediction head to form the complete GPT model. This section first gives the full implementation and then walks through it step by step:

from typing import Any  # for the type hints in from_pretrained


class GPT(nn.Module):

    def __init__(
        self,
        vocab_size: int,
        n_positions: int,
        n_embd: int,
        n_layer: int,
        n_head: int,
        embd_pdrop: float = 0.1,
        resid_pdrop: float = 0.1,
        attn_pdrop: float = 0.1,
        layer_norm_epsilon: float = 1e-5,
        initializer_range: float = 0.02,
        bias: bool = True,
    ) -> None:
        """Initialize the module.

        Args:
            vocab_size (int): Number of tokens.
            n_positions (int): Maximum sequence length.
            n_embd (int): Embedding dimension.
            n_layer (int): Number of block layers.
            n_head (int): Number of attention heads.
            embd_pdrop (float, optional):
                Dropout probability for embedding layer. Defaults to 0.1.
            resid_pdrop (float, optional):
                Dropout probability for residual connections. Defaults to 0.1.
            attn_pdrop (float, optional):
                Dropout probability for attention weights. Defaults to 0.1.
            layer_norm_epsilon (float, optional):
                Layer norm epsilon. Defaults to 1e-5.
            initializer_range (float, optional):
                Std of weight initializer. Defaults to 0.02.
            bias (bool, optional):
                Whether to include bias terms in the Linears and LayerNorms.
                Defaults to True.
        """
        super().__init__()
        self.n_positions = n_positions
        self.transformer = nn.ModuleDict(
            dict(
                # token and position embeddings
                wte=nn.Embedding(vocab_size, n_embd),
                wpe=nn.Embedding(n_positions, n_embd),
                drop=nn.Dropout(embd_pdrop),  # dropout for the embeddings
                # transformer blocks
                h=nn.ModuleList(
                    [
                        Block(
                            n_embd,
                            n_positions,
                            n_head,
                            attn_pdrop,
                            resid_pdrop,
                            layer_norm_epsilon,
                            bias,
                        )
                        for _ in range(n_layer)
                    ]
                ),
                # final layer norm before the classifier
                ln_f=nn.LayerNorm(n_embd, eps=layer_norm_epsilon, bias=bias),
            )
        )
        # language model head, bias is set to False to support the weight sharing scheme
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

        # weight sharing scheme
        self.transformer.wte.weight = self.lm_head.weight

        # init params
        self.initializer_range = initializer_range
        self.n_layer = n_layer
        self.apply(self._init_params)

    def _init_params(self, module: nn.Module) -> None:
        """Initialize the parameters of modules.

        1. Linear:
            - Weights: Normal(mean=0.0, std=self.initializer_range)
              If module has `NANOGPT_SCALE_INIT` (e.g., `c_proj` layers in
              `CausalSelfAttention` and `MLP`), `std` will be scaled by
              `(2 * self.n_layer) ** -0.5`
            - Biases: Zeros if present
        2. Embedding:
            - Weights: Normal(mean=0.0, std=self.initializer_range)
        3. LayerNorm:
            - Weights: Ones
            - Biases: Zeros if present

        Args:
            module (nn.Module): Modules to initialize.
        """
        if isinstance(module, nn.Linear):
            std = self.initializer_range
            # special scaled initialization
            if hasattr(module, "NANOGPT_SCALE_INIT"):
                std *= (2 * self.n_layer) ** -0.5
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=self.initializer_range)
        elif isinstance(module, nn.LayerNorm):
            nn.init.ones_(module.weight)
            if module.bias is not None:  # bias is None when bias=False
                nn.init.zeros_(module.bias)

    def forward(
        self, idx: torch.Tensor, targets: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor | None]:
        """Forward pass.

        Args:
            idx (torch.Tensor): Token indices of shape (B, T), where:
                - B: batch size
                - T: sequence length.
            targets (torch.Tensor, optional): Ground truth token indices of shape (B, T).
                If provided, the loss is calculated using cross entropy. Defaults to None.

        Returns:
            tuple[torch.Tensor, torch.Tensor | None]: A tuple containing:
                - logits (torch.Tensor): Output tensor of shape (B, T, vocab_size) containing
                  the unnormalized log probabilities for each token in the vocabulary.
                - loss (torch.Tensor | None): The computed cross entropy loss if targets is
                  provided, otherwise None.
        """
        T = idx.size(1)  # (B, T)
        assert (
            T <= self.n_positions
        ), f"Cannot forward sequence of length {T}, block size is only {self.n_positions}"
        # forward the position embeddings
        pos = torch.arange(T, device=idx.device)  # (T,)
        pos_emb = self.transformer.wpe(pos)  # (T, C)
        # forward the token embeddings
        tok_emb = self.transformer.wte(idx)  # (B, T, C)
        # sum the embeddings and apply the embedding dropout
        x = self.transformer.drop(tok_emb + pos_emb)  # (B, T, C)
        # forward the blocks of the transformer
        for block in self.transformer.h:
            x = block(x)
        # forward the final layernorm and the classifier
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        # calculate loss if targets are provided
        loss: torch.Tensor | None = None
        if targets is not None:
            loss = F.cross_entropy(
                input=logits.view(-1, logits.size(-1)),  # (B*T, vocab_size)
                target=targets.view(-1),  # (B*T,)
            )
        return logits, loss

    @classmethod
    def from_pretrained(cls, model_type: str) -> "GPT":
        """Load pretrained GPT-2 model weights from huggingface.

        Args:
            model_type (str):
                Model type to load.
                Must be one of {"gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"}

        Returns:
            GPT: Pretrained GPT-2 model.
        """
        assert model_type in {"gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"}
        from transformers import GPT2LMHeadModel
        from dataclasses import asdict

        print("loading weights from pretrained gpt: %s" % model_type)

        # n_layer, n_head and n_embd are determined from model_type
        cfg_args: dict[str, Any] = {
            "gpt2": dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
            "gpt2-medium": dict(
                n_layer=24, n_head=16, n_embd=1024
            ),  # 350M params
            "gpt2-large": dict(
                n_layer=36, n_head=20, n_embd=1280
            ),  # 774M params
            "gpt2-xl": dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
        }[model_type]
        # create a from-scratch initialized minGPT model
        model = GPT(**asdict(GPTConfig(**cfg_args)))
        sd = model.state_dict()
        sd_keys = sd.keys()
        sd_keys = [
            k for k in sd_keys if not k.endswith(".attn.mask")
        ]  # discard this mask / buffer, not a param

        # init a huggingface/transformers model
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        # copy while ensuring all of the parameters are aligned and match in names and shapes
        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [
            k for k in sd_keys_hf if not k.endswith(".attn.masked_bias")
        ]  # ignore these, just a buffer
        sd_keys_hf = [
            k for k in sd_keys_hf if not k.endswith(".attn.bias")
        ]  # same, just the mask (buffer)

        # basically the openai checkpoints use a `Conv1D` module, but we only want to use a vanilla Linear
        # this means that we have to transpose these weights when we import them
        transposed = [
            "attn.c_attn.weight",
            "attn.c_proj.weight",
            "mlp.c_fc.weight",
            "mlp.c_proj.weight",
        ]
        assert len(sd_keys_hf) == len(
            sd_keys
        ), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])
        return model

Positional and Input Embedding

Suppose the input sequence is:

$$X = \{ x_{1}, x_{2}, \ldots, x_{T} \}$$

where:

  • $x_{i} \in \{ 0, 1, \ldots, V-1 \}$: the $i$-th token index in a sequence of length $T$
  • $V$: vocabulary size

Unlike the Transformer, GPT-2 uses a learnable nn.Embedding layer for the positional embedding, while the input embedding is the same as in the Transformer. The relevant definitions:

self.transformer = nn.ModuleDict(
    dict(
        # token and position embeddings
        wte=nn.Embedding(vocab_size, n_embd),
        wpe=nn.Embedding(n_positions, n_embd),
        # ...
    )
)

The input embedding matrix is $\mathbf{E} = [\mathbf{e}_{1}, \mathbf{e}_{2}, \ldots, \mathbf{e}_{V}]^{\top} \in \mathbb{R}^{V \times C}$; given a token index $x_{i}$, it is mapped to the word vector $\mathbf{e}_{i} = \mathbf{E}[x_{i}] \in \mathbb{R}^{C}$. The positional embedding matrix is $\mathbf{P} = [\mathbf{p}_{1}, \mathbf{p}_{2}, \ldots, \mathbf{p}_{T}]^{\top} \in \mathbb{R}^{T \times C}$, where $\mathbf{p}_{i} \in \mathbb{R}^{C}$ is the embedding vector of position $i$ and $C$ is the embedding dimension. For simplicity, $T$ here denotes the maximum sequence length, whereas T in the code is the actual sequence length. The positional embeddings are independent of the token indices and depend only on the position $i$, so in the implementation they do not take $X$ as input:

def forward(
    self, idx: torch.Tensor, targets: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor | None]:
    T = idx.size(1)  # (B, T)
    # forward the position embeddings
    pos = torch.arange(T, device=idx.device)  # (T,)
    pos_emb = self.transformer.wpe(pos)  # (T, C)
    # forward the token embeddings
    tok_emb = self.transformer.wte(idx)  # (B, T, C)
    x = self.transformer.drop(tok_emb + pos_emb)  # (B, T, C), with embedding dropout
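
Note that pos_emb has shape (T, C) while tok_emb has shape (B, T, C); the addition works through broadcasting, which implicitly expands pos_emb across the batch dimension. A tiny standalone check (shapes chosen arbitrarily):

import torch

B, T, C = 2, 5, 8
tok_emb = torch.randn(B, T, C)
pos_emb = torch.randn(T, C)
x = tok_emb + pos_emb  # pos_emb is broadcast to (B, T, C)
print(x.shape)         # torch.Size([2, 5, 8])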

Stacked Block Layers

The input embedding and positional embedding modules produce the embedded matrix $\mathbf{X} \in \mathbb{R}^{T \times C}$. Next, multiple blocks are stacked:

self.transformer = nn.ModuleDict(
    dict(
        # ...
        # transformer blocks
        h=nn.ModuleList(
            [
                Block(
                    n_embd,
                    n_positions,
                    n_head,
                    attn_pdrop,
                    resid_pdrop,
                    layer_norm_epsilon,
                    bias,
                )
                for _ in range(n_layer)
            ]
        ),
        # ...
    )
)

During the forward pass, the input simply flows through the blocks one by one, producing the final hidden states $\mathbf{H} \in \mathbb{R}^{T \times C}$:

# forward the blocks of the transformer
for block in self.transformer.h:
    x = block(x)

Language Model Prediction Head

Finally, the language model prediction head sits after the last decoder layer and turns the decoder's hidden states into a probability distribution. Before that, the hidden states pass through one more LayerNorm. The relevant modules are defined as follows:

self.transformer = nn.ModuleDict(
    dict(
        # ...
        ln_f=nn.LayerNorm(n_embd, eps=layer_norm_epsilon, bias=bias),
    )
)
# language model head, bias is set to False to support the weight sharing scheme
self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
# weight sharing scheme
self.transformer.wte.weight = self.lm_head.weight

First, the hidden states go through a fully connected layer without bias $\mathbf{b}$: $\mathbf{O} = \mathbf{H}\mathbf{W}$, where $\mathbf{W} \in \mathbb{R}^{C \times V}$ maps the hidden states to the vocabulary size. In the actual implementation, this weight matrix is shared with the input embedding matrix; the two are transposes of each other, i.e. $\mathbf{W} = \mathbf{E}^{\top}$. Weight sharing has the following advantages:

  • Fewer parameters
  • Consistent semantics between the input and output embedding spaces: this is also why the prediction head is created with bias=False

In the implementation, self.transformer.wte.weight = self.lm_head.weight assigns one weight to the other directly, with no transpose. The reason is that a Linear layer stores its weight with shape (out_features, in_features), while an Embedding layer stores its weight with shape (num_embeddings, embedding_dim); in our model both are (vocab_size, n_embd), i.e. $V \times C$, so no transpose is needed.
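
A quick check that the tying behaves as described, using the GPT class defined above (the hyperparameters are the GPT-2 124M values):

model = GPT(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12, n_head=12)

print(model.transformer.wte.weight.shape)  # torch.Size([50257, 768]), i.e. (V, C)
print(model.lm_head.weight.shape)          # torch.Size([50257, 768]), same layout, no transpose
# both names refer to the very same Parameter object
print(model.lm_head.weight is model.transformer.wte.weight)  # True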

During generation, to obtain a probability distribution, $\mathbf{o}_{t} \in \mathbb{R}^{V}$ is normalized with softmax to get the conditional probability of token $y_{t}$ at time step $t$:

$$P(y_{t} = i \mid y_{<t}) = \mathrm{softmax}(\mathbf{o}_{t})_{i} = \frac{\exp(o_{t,i})}{\sum_{j=1}^{V}\exp(o_{t,j})}$$
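
Autoregressive generation repeatedly applies this softmax to the logits of the last position and samples the next token. The model above does not define a generation method; the following is a minimal sampling-loop sketch written on top of it for illustration (no temperature or top-k handling):

import torch
import torch.nn.functional as F


@torch.no_grad()
def generate(model: GPT, idx: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Sample `max_new_tokens` tokens given a (B, T) tensor of token indices."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.n_positions:]       # crop to the maximum sequence length
        logits, _ = model(idx_cond)                  # (B, T, V)
        probs = F.softmax(logits[:, -1, :], dim=-1)  # distribution over the next token
        next_token = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx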

Load Pretrained Weights From HuggingFace

Training GPT-2 from scratch requires a huge amount of data, compute, and time. This section explains how to load the weights of the Hugging Face transformers GPT2LMHeadModel into our custom model. Its architecture differs slightly from ours; the typical difference, already mentioned above, is that the MLP and attention modules use Conv1D in Hugging Face, whereas we use Linear.

Initialize GPT Models

First, initialize our custom GPT model:

# n_layer, n_head and n_embd are determined from model_type
cfg_args: dict[str, Any] = {
    "gpt2": dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
    "gpt2-medium": dict(
        n_layer=24, n_head=16, n_embd=1024
    ),  # 350M params
    "gpt2-large": dict(
        n_layer=36, n_head=20, n_embd=1280
    ),  # 774M params
    "gpt2-xl": dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
}[model_type]
# create a from-scratch initialized minGPT model
model = GPT(**asdict(GPTConfig(**cfg_args)))
sd = model.state_dict()
sd_keys = sd.keys()
sd_keys = [
    k for k in sd_keys if not k.endswith(".attn.mask")
]  # discard this mask / buffer, not a param

  1. Based on model_type, set the hyperparameters for each GPT-2 size, mainly n_layer, n_head, and n_embd
  2. Build a GPTConfig object from cfg_args and convert it to a dictionary with asdict
  3. Unpack the dictionary into the GPT constructor to initialize the model
  4. Get the model's state dict (weights and buffers) and filter out unnecessary keys (such as .attn.mask)

Loading the Hugging Face GPT2LMHeadModel proceeds in the same way and is not repeated here.
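
Usage is then a one-liner. As a smoke test (assuming the transformers package is installed), a forward pass on dummy token indices should produce logits of shape (B, T, 50257):

import torch

model = GPT.from_pretrained("gpt2")    # 124M
idx = torch.randint(0, 50257, (1, 8))  # dummy token indices of shape (B, T)
logits, loss = model(idx)
print(logits.shape, loss)              # torch.Size([1, 8, 50257]) None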

Transpose Weights of Conv1D

# basically the openai checkpoints use a `Conv1D` module, but we only want to use a vanilla Linear
# this means that we have to transpose these weights when we import them
transposed = [
    "attn.c_attn.weight",
    "attn.c_proj.weight",
    "mlp.c_fc.weight",
    "mlp.c_proj.weight",
]
assert len(sd_keys_hf) == len(
    sd_keys
), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
for k in sd_keys_hf:
    if any(k.endswith(w) for w in transposed):
        # special treatment for the Conv1D weights we need to transpose
        assert sd_hf[k].shape[::-1] == sd[k].shape
        with torch.no_grad():
            sd[k].copy_(sd_hf[k].t())
    else:
        # vanilla copy over the other parameters
        assert sd_hf[k].shape == sd[k].shape
        with torch.no_grad():
            sd[k].copy_(sd_hf[k])
return model

  1. Define the list transposed of weight names that need special handling; these weights come from Conv1D modules in the Hugging Face implementation and must be transposed
  2. Copy the weights key by key:
    • Weights listed in transposed are transposed before copying
    • All other weights are copied as-is
  3. No-grad operation: torch.no_grad() ensures that copying the weights does not affect gradient computation
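
As a final sanity check that the copied (and transposed) weights are correct, our model's logits can be compared against Hugging Face's GPT2LMHeadModel on the same input; both models are put in eval mode so that dropout is disabled. The tolerance is a judgment call for floating-point differences:

import torch
from transformers import GPT2LMHeadModel

model = GPT.from_pretrained("gpt2").eval()
model_hf = GPT2LMHeadModel.from_pretrained("gpt2").eval()

idx = torch.randint(0, 50257, (1, 8))
with torch.no_grad():
    logits, _ = model(idx)
    logits_hf = model_hf(idx).logits
print(torch.allclose(logits, logits_hf, atol=1e-4))  # expected: True (up to numerical precision)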

Reference