Introduction

Suppose we have a dataset giving the area and age of some houses. How can we predict their prices? We introduce linear regression to tackle this prediction problem.

The linear regression model assumes that:

$$\textrm{price} = w_{\textrm{area}} \cdot \textrm{area} + w_{\textrm{age}} \cdot \textrm{age} + b$$

| Example | Concept |
| --- | --- |
| $\textrm{area}$, $\textrm{age}$ | features (a.k.a. inputs) |
| $\textrm{price}$ | target (a.k.a. outputs) |
| $w_{\textrm{area}}$, $w_{\textrm{age}}$ | weights (a.k.a. parameters) |
| $b$ | bias (a.k.a. offset, intercept) |
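
For example, with hypothetical parameters $w_{\textrm{area}} = 100$, $w_{\textrm{age}} = -500$, and $b = 20000$ (made-up numbers purely for illustration), a house with $\textrm{area} = 100$ and $\textrm{age} = 10$ would be priced at:

$$\textrm{price} = 100 \cdot 100 + (-500) \cdot 10 + 20000 = 25000$$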

Basics

Notations

To clarify the model, we establish notation below:

| Concept | Notation |
| --- | --- |
| features (a.k.a. inputs) | $\mathbf{x}$ |
| target (a.k.a. outputs) | $y$ |
| estimated value | $\hat{y}$ |
| weights (a.k.a. parameters) | $\mathbf{w}$ |
| bias (a.k.a. offset, intercept) | $b$ |

To represent different training examples, we attach the superscript $(i)$, as in $\mathbf{x}^{(i)}$. In the case above, $x_1^{(i)}$ is the living area of the $i^{\textrm{th}}$ house in the training set, and $x_2^{(i)}$ is its age.

Model

Suppose we have $n$ examples and $d$ features. The model can then be generalized to:

$$\begin{aligned} \hat{y} & = w_1 x_1 + \cdots + w_d x_d + b \\ & = \mathbf{w}^\top \mathbf{x} + b. \end{aligned}$$

where:

$$\begin{aligned} \mathbf{w} & = \begin{bmatrix} w_{1} & w_{2} & \cdots & w_{d} \end{bmatrix}^\top \in \mathbb{R}^d \\ \mathbf{x} & = \begin{bmatrix} x_{1} & x_{2} & \cdots & x_{d} \end{bmatrix}^\top \in \mathbb{R}^d \end{aligned}$$

In the formula above, $\mathbf{x}$ corresponds to the features of a single example. We can use the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ to represent our entire dataset of $n$ examples.

Thus the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed as:

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b$$

where:

$$\begin{aligned} \hat{\mathbf{y}} & = \begin{bmatrix} \hat{y}_{1} & \hat{y}_{2} & \cdots & \hat{y}_{n} \end{bmatrix}^{\top} \\ \mathbf{X} & = \begin{bmatrix} \mathbf{x}_{1} & \mathbf{x}_{2} & \cdots & \mathbf{x}_{n} \end{bmatrix}^{\top} \end{aligned}$$
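
Here the scalar bias $b$ is broadcast and added to every element of $\mathbf{X}\mathbf{w}$. As a minimal PyTorch sketch of the vectorized prediction (the shapes and values below are made up purely for illustration):

import torch

n, d = 5, 3                      # hypothetical numbers of examples and features
X = torch.randn(n, d)            # design matrix, one example per row
w = torch.randn(d)               # weight vector
b = 4.2                          # bias, broadcast over all examples

y_hat = torch.matmul(X, w) + b   # predictions, shape (n,)
print(y_hat.shape)               # torch.Size([5])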

Task

Our task is to find the best model for predicting $y$ given $\mathbf{x}$, that is, to find the best parameters $\mathbf{w}$ and $b$.

Before searching for them, we need two more things:

  1. A measure of the quality of some given model (i.e. a loss function)
  2. A procedure for updating the model to improve its quality (i.e. an optimization algorithm)

Loss Function

Loss functions quantify the distance between the real and predicted values of the target. For regression problems, the most common loss function is the squared error. When our prediction for example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the squared error is given by:

$$\ell^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^{2}$$

The constant $\frac{1}{2}$ makes no real difference but proves to be notationally convenient when taking the derivative of the loss.
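
Concretely, differentiating $\ell^{(i)}$ with respect to the prediction leaves no stray constant behind:

$$\frac{\partial \ell^{(i)}}{\partial \hat{y}^{(i)}} = \hat{y}^{(i)} - y^{(i)}$$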

To measure the model's quality on the entire dataset, we simply average (or equivalently, sum) the losses on the training set:

$$\mathcal{L}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \ell^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$$

When training the model, we seek parameters $(\mathbf{w}^*, b^*)$ that minimize the total loss across all training examples:

$$\mathbf{w}^{*}, b^{*} = \arg\min_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b)$$

Minibatch Stochastic Gradient Descent

In each iteration $t$:

  1. Randomly sample a minibatch $\mathcal{B}_t$ consisting of a fixed number $|\mathcal{B}|$ of training examples
  2. Compute the gradient of the average loss on the minibatch (i.e. $\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \nabla_{(\mathbf{w},b)} \ell^{(i)}(\mathbf{w},b)$)
  3. Multiply the gradient by a predetermined learning rate $\eta$
  4. Subtract the resulting term from the current parameter values $(\mathbf{w},b)$.

We can express the update as follows:

$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \nabla_{(\mathbf{w},b)} \ell^{(i)}(\mathbf{w},b)$$

Note that the minibatch size and learning rate are user-defined. Such tunable parameters that are not updated in the training loop are called hyperparameters.
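
As a minimal numeric sketch of a single update (the parameter values, gradients, and hyperparameters below are made up purely for illustration):

import torch

lr, batch_size = 0.03, 10              # hypothetical hyperparameters
w = torch.tensor([0.5, -0.2])          # current parameter values
grad_sum = torch.tensor([2.0, -1.0])   # sum of per-example gradients over the minibatch

w = w - (lr / batch_size) * grad_sum   # one minibatch SGD step
print(w)                               # tensor([0.4940, -0.1970])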

Linear Regression Implementation from Scratch

Synthetic Regression Data

Generating the Dataset

In this part, we will generate:

  1. $\mathbf{X} \in \mathbb{R}^{1000 \times 2}$: $1000$ examples with $2$-dimensional features
  2. $\mathbf{y} \in \mathbb{R}^{1000}$: $1000$ labels

We generate each label by applying a ground-truth linear function and corrupting the result via additive noise $\boldsymbol{\epsilon}$:

$$\mathbf{y} = \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}$$

where $\mathbf{X} \sim \mathcal{N}(0, 1^2)$ and $\boldsymbol{\epsilon} \sim \mathcal{N}(0, 0.01^2)$.

import torch

def synthetic_data(w, b, num_examples=1000):
    """
    Synthetic data for linear regression.

    Args:
        w (torch.Tensor): weight vector
        b (float): bias term
        num_examples (int): number of examples

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: features (X) and labels (y)
    """
    X = torch.randn(num_examples, len(w))  # sample features from the standard normal distribution
    y = torch.matmul(X, w) + b             # compute the noise-free labels
    y += torch.normal(0, 0.01, y.shape)    # add the Gaussian noise
    return X, y.reshape((-1, 1))           # reshape the labels into a column vector

Then we set the ground-truth values $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$.

features, labels = synthetic_data(w=torch.tensor([2, -3.4]), b=4.2)
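
As a quick sanity check (an extra step, not part of the original pipeline), we can confirm the shapes of the generated data:

print(features.shape, labels.shape)  # torch.Size([1000, 2]) torch.Size([1000, 1])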

We can observe the linear relationship between the second feature $\mathbf{x}_2$ (i.e. features[:, 1]) and the labels $\mathbf{y}$ (i.e. labels) in a scatter plot.

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.scatter(features[:, 1].detach().numpy(), labels.detach().numpy(), s=1)
plt.show()

Reading the Dataset

import random

def data_iter(batch_size, features, labels):
    """
    A generator that provides batches of data.

    Args:
        batch_size (int): The batch size.
        features (torch.Tensor): The features of the data.
        labels (torch.Tensor): The labels of the data.

    Yields:
        Tuple[torch.Tensor, torch.Tensor]: A batch of features and labels.
    """
    num_examples = len(features)

    # Shuffle the indices so that examples are read in random order
    indices = list(range(num_examples))
    random.shuffle(indices)

    for i in range(0, num_examples, batch_size):
        # Convert the index slice into a tensor for indexing the features and labels
        batch_indices = torch.tensor(indices[i:i + batch_size])
        yield features[batch_indices], labels[batch_indices]

We can read the first batch of data and print it.

batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break

Initializing the Parameters

We initialize the weights by sampling $\mathbf{w} \sim \mathcal{N}(0, 0.01^2)$ and set the bias to $b = 0$.

w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

Defining the Model

def linear_regression(X, w, b):
    """
    The linear regression model.

    Args:
        X (torch.Tensor): The input data, a tensor of shape (n_examples, n_features)
        w (torch.Tensor): The weight vector, a tensor of shape (n_features, 1)
        b (torch.Tensor): The bias term, a scalar

    Returns:
        torch.Tensor: The output of the linear regression, a tensor of shape (n_examples, 1)
    """
    return torch.matmul(X, w) + b  # the scalar bias is added via the broadcasting mechanism
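
As a quick sanity check (an extra step, not required for training), the model should produce one prediction per example when applied to a few features with the freshly initialized parameters:

print(linear_regression(features[:3], w, b).shape)  # torch.Size([3, 1])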

Defining the Loss Function

def squared_loss(y_hat, y):
    """
    Calculates the squared loss between the predicted and actual values.

    Args:
        y_hat (torch.Tensor): The predicted values, of shape (n_examples, 1)
        y (torch.Tensor): The actual values, of shape (n_examples, 1)

    Returns:
        torch.Tensor: The per-example squared losses, of shape (n_examples, 1)
    """
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
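
For instance, a prediction that is off by 1 incurs a loss of 0.5, while a perfect prediction incurs 0 (the toy tensors below are for illustration only):

y_hat = torch.tensor([[2.0], [4.0]])
y = torch.tensor([[1.0], [4.0]])
print(squared_loss(y_hat, y))  # tensor([[0.5000], [0.0000]])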

Defining the Optimization Algorithm

def SGD(params, lr, batch_size):
    """
    Stochastic Gradient Descent (SGD) is an optimization algorithm for minimizing a loss function.
    It updates the parameters of a model based on the gradient of the loss with respect to the parameters.

    Args:
        params (list[torch.Tensor]): The parameters of the model to be updated.
        lr (float): The learning rate.
        batch_size (int): The size of the minibatch.

    Returns:
        None
    """
    with torch.no_grad():  # parameter updates must not be tracked by autograd
        for param in params:
            param -= lr * param.grad / batch_size  # update the parameters using the gradient and learning rate
            param.grad.zero_()                     # reset the gradient to zero

Training

First, we set the hyperparameters.

lr = 0.03                # learning rate
num_epochs = 3           # number of epochs
net = linear_regression  # model architecture (net)
loss = squared_loss      # loss function

Then we run the training loop for num_epochs epochs.

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)    # minibatch loss
        l.sum().backward()           # compute gradients of the summed loss
        SGD([w, b], lr, batch_size)  # update the parameters using their gradients

    # Observe the loss on the full training set
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
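
Since we synthesized the data ourselves, we can compare the learned parameters with the ground truth $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$; after training they should be very close (this check is an extra step, not part of the loop above):

with torch.no_grad():
    true_w = torch.tensor([2, -3.4])
    print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
    print(f'error in estimating b: {4.2 - float(b)}')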

Concise Implementation of Linear Regression

Synthetic Regression Data

Generating the Dataset

This part is the same as before.

def synthetic_data(w, b, num_examples=1000):
    """
    Synthetic data for linear regression.

    Args:
        w (torch.Tensor): weight vector
        b (float): bias term
        num_examples (int): number of examples

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: features (X) and labels (y)
    """
    X = torch.randn(num_examples, len(w))  # sample features from the standard normal distribution
    y = torch.matmul(X, w) + b             # compute the noise-free labels
    y += torch.normal(0, 0.01, y.shape)    # add the Gaussian noise
    return X, y.reshape((-1, 1))           # reshape the labels into a column vector

features, labels = synthetic_data(w=torch.tensor([2, -3.4]), b=4.2)

Reading the Dataset

from torch.utils import data

def load_array(data_arrays, batch_size, is_train=True):
    """
    Construct a PyTorch data iterator.

    Args:
        data_arrays (tuple): a tuple of tensors containing the data
        batch_size (int): the batch size for training or inference
        is_train (bool, optional): whether to shuffle the data, as during training (default: True)

    Returns:
        DataLoader: a PyTorch DataLoader over the data
    """
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

We can read the first batch of data and print it.

batch_size = 10
data_iter = load_array((features, labels), batch_size)
next(iter(data_iter))

Defining the Model

We can use PyTorch's predefined layers, which allows us to focus on which layers to use rather than on how to implement them.

Now we define net, an instance of the Sequential class. The Sequential class chains multiple layers together: given input data, a Sequential instance passes it to the first layer, then uses the output of the first layer as the input of the second layer, and so on.

In this case, our model contains only one layer, so Sequential is not actually needed. However, since almost all models we will encounter later are multi-layered, using Sequential here makes you familiar with the "standard pipeline".

from torch import nn

net = nn.Sequential(nn.Linear(2, 1))

In PyTorch, the fully connected layer is defined in the Linear and LazyLinear classes. The latter allows users to specify merely the output dimension, while the former additionally requires the input dimension. Here we use Linear(2, 1) and specify both dimensions explicitly, so that we can initialize the parameters directly in the next step.
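
For reference, a sketch of the LazyLinear alternative (the name lazy_net is illustrative; its weights stay uninitialized until the first forward pass, so the explicit initialization in the next step would only work after data has flowed through it):

# Only the output dimension is specified; the input dimension is
# inferred from the first batch passed through the layer.
lazy_net = nn.Sequential(nn.LazyLinear(1))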

Initializing the Parameters

net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)

Defining the Loss Function

The MSELoss class computes the mean squared error (without the $\frac{1}{2}$ factor). By default, MSELoss returns the average loss over examples.

loss = nn.MSELoss()

Defining the Optimization Algorithm

trainer = torch.optim.SGD(net.parameters(), lr=0.03)

Training

num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)  # minibatch loss
        trainer.zero_grad()  # clear previous gradients
        l.backward()         # compute gradients
        trainer.step()       # update the parameters

    # Observe the loss on the full training set
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l:f}')
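
As in the from-scratch version, we can compare the parameters learned by the Linear layer with the ground truth (an extra check, not part of the training loop):

w = net[0].weight.data
b = net[0].bias.data
print(f'error in estimating w: {torch.tensor([2, -3.4]) - w.reshape(-1)}')
print(f'error in estimating b: {4.2 - float(b)}')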
