Magnicord
Home | Archives | Tags | Categories
X-Enhanced Contrastive Decoding Strategies for Large Language Models
Created 2025-05-30 | Updated 2025-05-30 | NLP | deep-learning • LLM • NLP
Notations The following notation is used for LLM inference: $t$: the current decoding time step; $\mathcal{V}$: the token vocabulary; $\mathbf{x}$: the input sequence; $y \in \mathcal{V}$: a candidate next token; $y_{t} \in \mathcal{V}$: the token selected at time step $t$; $\mathbf{y}_{<t} = [y_{1}, y_{2}, \ldots, y_{t-1}]$: the output sequence generated before time step $t$; $\mathbf{l}(y \mid \mathbf{x}, \mathbf{y}_{<t})$: the logit scores of the next token, before softmax normalization, given the context (i.e., the input sequence and the already generated output); $p(y \mid$...
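The link between the logits $\mathbf{l}$ and the probabilities $p$ in this notation is the softmax. A minimal sketch over a hypothetical three-token vocabulary (the logit values are made up):

```python
import math

def softmax(logits):
    """Convert raw logits l(y | x, y_<t) into probabilities p(y | x, y_<t)."""
    m = max(logits)                            # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logit scores for a toy vocabulary of 3 tokens.
probs = softmax([2.0, 1.0, 0.5])
print(probs)  # probabilities sum to 1, ordering follows the logits
```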
Policy-Based Methods
Created 2025-03-05 | Updated 2025-03-05 | Reinforcement Learning Basics | reinforcement-learning • deep-learning
Before reading this article, you should be familiar with the basic concepts of reinforcement learning. The article covers the following topics in order: the objective of policy-based methods; deriving the vanilla policy gradient; a simple training procedure for a policy network; the reward-to-go policy gradient; the general form of the policy gradient. In the deep learning setting, the policy function is approximated by a neural network, written $\pi_{\theta}(a \mid s)$, where $\theta$ denotes the parameters of the policy network. Related Material This article is mainly based on OpenAI Spinning Up Documentation Part 3: Intro to Policy Optimization, with some content trimmed; interested readers are encouraged to read the original. To keep the mathematical formulas rigorous, it also draws on Policy Gradient Algorithms | Lil'Log, which focuses on the mathematical theory. A similar article is Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients. Optimization...
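As a toy illustration of the vanilla policy gradient mentioned above, here is a hedged sketch of REINFORCE on a two-armed bandit with a softmax policy. The bandit, its rewards, the learning rate, and the step count are all made up for illustration; real policy networks replace the hand-coded gradient with autodiff.

```python
import math
import random

# pi_theta(a) = softmax(theta)[a]; for a softmax policy,
# grad_theta log pi(a) = onehot(a) - pi.
random.seed(0)
theta = [0.0, 0.0]
rewards = [1.0, 0.0]   # arm 0 is the better arm (illustrative)
lr = 0.1

def policy(theta):
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(500):
    pi = policy(theta)
    a = 0 if random.random() < pi[0] else 1   # sample an action from pi
    r = rewards[a]
    for i in range(2):                        # theta += lr * r * grad log pi(a)
        grad = (1.0 if i == a else 0.0) - pi[i]
        theta[i] += lr * r * grad

print(policy(theta)[0])  # probability of the better arm grows toward 1
```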
Key Concepts in Reinforcement Learning
Created 2025-03-05 | Updated 2025-03-05 | Reinforcement Learning Basics | reinforcement-learning • deep-learning
Related Material This article is mainly excerpted and translated from OpenAI Spinning Up Documentation Part 1: Key Concepts in RL, omitting the discussion of concrete implementations and giving more rigorous formal definitions of concepts such as return and value functions. A (Long) Peek into Reinforcement Learning | Lil'Log explains these concepts in greater theoretical depth. Overview of Reinforcement Learning The main characters in reinforcement learning (RL) are the agent and the environment. The agent lives in the environment and interacts with it. At each step of interaction, the agent observes (possibly incomplete) information about the current state of the environment and then decides on an action to take. The agent's actions change the environment, but the environment may also change on its own. In turn, the environment gives the agent...
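The agent-environment interaction loop described above can be sketched as follows. ToyEnv is a hypothetical environment invented for this sketch, not something from the article; the Gym-style reset/step interface is one common convention.

```python
class ToyEnv:
    """Walk on integers 0..4; reaching 4 yields reward 1 and ends the episode."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):            # action: +1 or -1
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = ToyEnv()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = +1                        # a trivial "always move right" policy
    state, reward, done = env.step(action)   # environment returns next state and reward
    total_reward += reward
print(total_reward)  # → 1.0
```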
LoRA: From Principle to Implementation
Created 2025-01-19 | Updated 2025-03-05 | NLP | deep-learning • LLM • NLP • Python • Pytorch • PEFT
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique designed to reduce the number of trainable parameters in LLMs while maintaining high performance. It achieves this by decomposing weight updates into low-rank matrices, significantly reducing memory and computational costs. Motivation: Intrinsic Dimension Hypothesis The Intrinsic Dimension Hypothesis states that the parameters of high-dimensional models effectively lie in a much lower-dimensional subspace. Similarly, LoRA...
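A minimal sketch of the low-rank decomposition, assuming the usual LoRA parameterization $W' = W + BA$ with $B$ initialized to zero so the update starts from the frozen weights; the matrix sizes are illustrative.

```python
import random

random.seed(0)
d_out, d_in, r = 6, 8, 2               # illustrative sizes, rank r << min(d_out, d_in)

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero-init

delta = matmul(B, A)                   # rank-r weight update, zero at initialization
W_adapted = [[W[i][j] + delta[i][j] for j in range(d_in)] for i in range(d_out)]

full = d_out * d_in                    # parameters in a full update
lora = d_out * r + r * d_in            # parameters LoRA actually trains
print(lora, "trainable vs", full, "full parameters")
```

Because B starts at zero, the adapted weights initially equal the frozen ones, and only the 28-vs-48 parameter count changes at these toy sizes.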
Additive PEFT: From Adapter to Proxy Tuning
Created 2025-01-19 | Updated 2025-03-05 | NLP | deep-learning • LLM • NLP • PEFT
PEFT Taxonomy

Taxonomy | Idea | Examples
Additive PEFT | Introduce extra trainable parameters while freezing the original ones. | Adapter Tuning, Prefix Tuning, Prompt Tuning, Proxy Tuning
Selective PEFT | Update a subset of the original parameters while freezing the rest. | BitFit, Child Tuning
Reparameterized PEFT | Transform existing parameters for efficient training, then revert them for inference. | LoRA

Based on where the trainable parameters are introduced, additive PEFT can be further...
Building GPT-2 from Scratch: A Detailed Implementation Guide
Created 2025-01-13 | Updated 2025-03-07 | NLP | deep-learning • LLM • NLP • Python • Pytorch
This article focuses on implementing the model, supplemented by the necessary mathematical formulas; for an in-depth treatment of the underlying math, see Transformer. Before reading this article, make sure you have a thorough understanding of the mathematics of the Transformer decoder. The code here is based mainly on karpathy/nanoGPT and karpathy/build-nanogpt, and also references the Hugging Face transformers implementation of GPT-2; the Reference section lists the full references. The main differences from the nanoGPT implementation are: module initialization parameters, where nanoGPT and Hugging Face mostly pass a config dict directly, whereas each module's __init__ method here lists every parameter explicitly, analogous to unpacking the dict; and the number and names of hyperparameters, where, following the Hugging Face implementation, layer_norm_epsilon is added and block_size is renamed to n_positions. Transformer Decoder and GPT-2 The Transformer decoder is a stack of identical blocks; each block...
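The hyperparameter naming discussed above can be sketched as a small config object. The field names follow the Hugging Face GPT-2 convention (including layer_norm_epsilon and n_positions), and the default values are those of GPT-2 small; this is a sketch, not the article's actual config class.

```python
from dataclasses import dataclass

@dataclass
class GPT2Config:
    vocab_size: int = 50257
    n_positions: int = 1024          # nanoGPT calls this block_size
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    layer_norm_epsilon: float = 1e-5 # absent in nanoGPT, present in Hugging Face

cfg = GPT2Config()
print(cfg.n_positions)  # → 1024
```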
Transformer: From Principle to Implementation
Created 2025-01-11 | Updated 2025-03-07 | NLP | deep-learning • LLM • NLP • Python • Pytorch
Overview The Transformer was proposed mainly to address three problems with RNNs: minimizing per-layer computational complexity; minimizing the path length between any pair of tokens (an RNN encodes left to right, so distant tokens need $\mathcal{O}(N)$ steps to interact, which makes long-range dependencies hard to learn because of gradient problems); and maximizing the amount of parallelizable computation (both the forward and backward passes of an RNN involve $\mathcal{O}(N)$ sequential steps, so GPUs and TPUs cannot be fully utilized). Let $N$ be the sequence length and $D$ the representation dimension. The per-layer complexities of recurrent and self-attention layers are:

Layer Type | Complexity per Layer
Self-Attention | $\mathcal{O}(N^{2} \cdot D)$
Recurrent | $\mathcal{O}(N \cdot D^{2})$

When $N \ll D$, the per-layer complexity of the Transformer is lower than that of an RNN...
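The complexities in the table above can be compared directly by counting multiplications, dropping constant factors; the sequence lengths and dimensions below are illustrative.

```python
def self_attention_cost(N, D):
    """~ N^2 * D multiplications per self-attention layer."""
    return N * N * D

def recurrent_cost(N, D):
    """~ N * D^2 multiplications per recurrent layer."""
    return N * D * D

# Short sequence, wide model (N << D): attention is cheaper per layer.
print(self_attention_cost(128, 1024) < recurrent_cost(128, 1024))   # → True
# Long sequence, narrower model (N >> D): the comparison flips.
print(self_attention_cost(4096, 512) > recurrent_cost(4096, 512))   # → True
```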
Linear Regression: From Principle to Implementation
Created 2024-01-17 | Updated 2025-03-05 | Deep Learning Basics | deep-learning • Python • Pytorch
Introduction Suppose we have a dataset giving the area and age of some houses; how can we predict future house prices? We introduce linear regression to tackle this prediction problem. The linear regression model assumes that: $\textrm{price} = w_{\textrm{area}} \cdot \textrm{area} + w_{\textrm{age}} \cdot \textrm{age} + b$ Example Concepts $\textrm{area}$, $\textrm{age}$: features (a.k.a....
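The model above as code, with made-up weights and inputs purely for illustration:

```python
def predict_price(area, age, w_area, w_age, b):
    """price = w_area * area + w_age * age + b"""
    return w_area * area + w_age * age + b

# Hypothetical weights: larger houses cost more, older houses cost less.
price = predict_price(area=100.0, age=5.0, w_area=2.0, w_age=-1.0, b=10.0)
print(price)  # → 205.0
```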
Python Basic Data Types: Dictionary
Created 2023-10-21 | Updated 2025-03-05 | Python Basics | Python • data-structure
This note mainly summarizes knowledge from Corey Schafer's Python Tutorial. A dictionary is a collection of key-value pairs. Creating Dictionaries We use curly-brace notation to represent a dictionary.

empty_dict = {}  # create an empty dictionary
student = {'name': 'John', 'age': 25, 'course': ['Math', 'CompSci']}
print(student)

{'name': 'John', 'age': 25,...
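Continuing the student example, a few common dictionary operations (a sketch; the phone key and its value are made up):

```python
student = {'name': 'John', 'age': 25, 'course': ['Math', 'CompSci']}

print(student['name'])                 # access by key → John
print(student.get('phone', 'N/A'))     # .get avoids a KeyError → N/A
student['phone'] = '555-5555'          # add a new key-value pair
print(len(student))                    # → 4
```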
Python Basic Data Types: Lists, Tuples and Sets
Created 2023-10-21 | Updated 2025-03-05 | Python Basics | Python • data-structure
This note mainly summarizes knowledge from Corey Schafer's Python Tutorial. Lists A list is a collection which is: ordered; changeable. Creating Lists We use square-bracket notation to represent a list.

empty_list = []  # create an empty list
courses = ['History', 'Math', 'Physics', 'CompSci']
print(courses)

['History', 'Math', 'Physics', 'CompSci']

Similar to a string, we can use len to get the length...
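Continuing the courses example, a few common list operations (the added course names are made up):

```python
courses = ['History', 'Math', 'Physics', 'CompSci']

print(len(courses))            # → 4
courses.append('Art')          # add to the end
courses.insert(0, 'Biology')   # add at index 0
print(courses[0])              # → Biology
print('Math' in courses)       # membership test → True
```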
Magnicord
Re: Deep Learning From Scratch
Articles
10
Tags
8
Categories
4
Recent Posts
X-Enhanced Contrastive Decoding Strategies for Large Language Models · 2025-05-30
Policy-Based Methods · 2025-03-05
Key Concepts in Reinforcement Learning · 2025-03-05
LoRA: From Principle to Implementation · 2025-01-19
Additive PEFT: From Adapter to Proxy Tuning · 2025-01-19
Categories
  • Deep Learning Basics (1)
  • NLP (5)
  • Python Basics (2)
  • Reinforcement Learning Basics (2)
Tags
LLM NLP data-structure Python Pytorch reinforcement-learning deep-learning PEFT
Archives
  • May 2025 1
  • March 2025 2
  • January 2025 4
  • January 2024 1
  • October 2023 2
Website Info
Article Count : 10
Total Word Count : 33k
©2023 - 2025 By Magnicord
Framework Hexo 7.3.0 | Theme Butterfly 5.3.5
Welcome to the Journey of Deep Learning