Transformer: From Principle to Implementation
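Overview

The Transformer was proposed mainly to address three limitations of RNNs. Its design goals are:

- Minimize the computational complexity per layer.
- Minimize the path length between any pair of words: an RNN encodes the sequence from left to right, so it takes $\mathcal{O}(N)$ steps for two distant words to interact. This makes it hard for an RNN to learn long-range dependencies, because of gradient problems (vanishing or exploding gradients).
- Maximize the amount of parallelizable computation: both the forward and the backward pass of an RNN contain $\mathcal{O}(N)$ steps that cannot be parallelized, so an RNN cannot make full use of GPUs, TPUs, and similar hardware.

Let $N$ be the sequence length and $D$ the representation dimension. The per-layer complexity of recurrent and self-attention layers is shown in the table below:

| Layer Type | Complexity per Layer |
| --- | --- |
| Self-Attention | $\mathcal{O}(N^{2} \cdot D)$ |
| Recurrent | $\mathcal{O}(N \cdot D^{2})$ |

When $N \ll D$, the per-layer complexity of the Transformer is lower than that of an RNN. Taking machine translation as an example, T...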
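The per-layer complexity comparison in this excerpt can be checked numerically. The following is a minimal sketch that only evaluates the leading-order terms $N^{2} \cdot D$ and $N \cdot D^{2}$; the function names and the sizes tried are hypothetical and chosen purely for illustration, and constant factors are ignored.

```python
def self_attention_cost(n: int, d: int) -> int:
    """Leading-order operation count for one self-attention layer: N^2 * D."""
    return n * n * d


def recurrent_cost(n: int, d: int) -> int:
    """Leading-order operation count for one recurrent layer: N * D^2."""
    return n * d * d


# Hypothetical (N, D) pairs, not taken from the original post.
for n, d in [(32, 512), (512, 512), (4096, 512)]:
    attn = self_attention_cost(n, d)
    rnn = recurrent_cost(n, d)
    if attn < rnn:
        verdict = "self-attention is cheaper"
    elif attn > rnn:
        verdict = "recurrent is cheaper"
    else:
        verdict = "both cost the same"
    print(f"N={n}, D={d}: attention={attn:,}, recurrent={rnn:,} -> {verdict}")
```

Under these assumptions the crossover sits at $N = D$: self-attention is cheaper per layer for sequences shorter than the representation dimension and more expensive for longer ones.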
Linear Regression: From Principle to Implementation
Introduction

Suppose we have a dataset giving the area and age of some houses. How can we predict future house prices? We introduce linear regression to tackle this prediction problem. The linear regression model assumes that:

$$\textrm{price} = w_{\textrm{area}} \cdot \textrm{area} + w_{\textrm{age}} \cdot \textrm{age} + b$$

| Example | Concepts |
| --- | --- |
| $\textrm{area}$, $\textrm{age}$ | features (a.k.a. inputs) |
| $\textrm{price}$ | ... |
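The model assumption in this excerpt can be written as a one-line function. This is a minimal sketch: the function name `predict_price` and every numeric value below are hypothetical, chosen only to show how the weights and the bias combine, and they are not fitted to any data.

```python
def predict_price(area: float, age: float,
                  w_area: float, w_age: float, b: float) -> float:
    """Evaluate price = w_area * area + w_age * age + b."""
    return w_area * area + w_age * age + b


# Hypothetical parameters and features, for illustration only.
price = predict_price(area=100.0, age=10.0, w_area=3.0, w_age=-0.5, b=20.0)
print(price)  # 3.0 * 100.0 + (-0.5) * 10.0 + 20.0 = 315.0
```

A negative `w_age` is used here so that an older house lowers the predicted price; in practice the sign and size of each weight would come from fitting the model to the dataset.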
Python Basic Data Types: Dictionary
This note summarizes key points from Corey Schafer’s Python Tutorial.

A dictionary is a collection of key-value pairs.

Creating Dictionaries

We use curly-brace notation to represent a dictionary.

    empty_dict = {}  # create an empty dictionary
    student = {'name': 'John', 'age': 25, 'course': ['Math', 'CompSci']}
    print(student)

    {'name': 'John', 'age': 25, 'co...
Python Basic Data Types: Lists, Tuples and Sets
This note summarizes key points from Corey Schafer’s Python Tutorial.

Lists

A list is a collection that is:

- ordered
- changeable

Creating Lists

We use square-bracket notation to represent a list.

    empty_list = []  # create an empty list
    courses = ['History', 'Math', 'Physics', 'CompSci']
    print(courses)

    ['History', 'Math', 'Physics', 'CompSci']

As with strings, we can use len to get the length o...