Introduction

Suppose we have a dataset giving the area and age of some houses. How can we predict their prices? We introduce linear regression to tackle this prediction problem.

The linear regression model assumes that:

$$\textrm{price} = w_{\textrm{area}} \cdot \textrm{area} + w_{\textrm{age}} \cdot \textrm{age} + b$$

| Example | Concept |
| --- | --- |
| $\textrm{area}$, $\textrm{age}$ | features (a.k.a. inputs) |
| $\textrm{price}$ | target (a.k.a. outputs) |
| $w_{\textrm{area}}$, $w_{\textrm{age}}$ | weights (a.k.a. parameters) |
| $b$ | bias (a.k.a. offset, intercept) |
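
For example, with hypothetical parameters $w_{\textrm{area}} = 100$, $w_{\textrm{age}} = -500$, and $b = 20000$ (made-up numbers purely for illustration), a house with $\textrm{area} = 100$ and $\textrm{age} = 10$ would be priced at:

$$\textrm{price} = 100 \cdot 100 + (-500) \cdot 10 + 20000 = 25000$$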

Basics

Notations

To clarify the model, we establish notation below:

| Concept | Notation |
| --- | --- |
| features (a.k.a. inputs) | $\mathbf{x}$ |
| target (a.k.a. outputs) | $y$ |
| estimated value | $\hat{y}$ |
| weights (a.k.a. parameters) | $\mathbf{w}$ |
| bias (a.k.a. offset, intercept) | $b$ |

To represent different training examples, we attach the superscript $(i)$, as in $\mathbf{x}^{(i)}$. In the case above, $x_1^{(i)}$ is the living area of the $i^{\textrm{th}}$ house in the training set, and $x_2^{(i)}$ is its age.

Model

Suppose we have $n$ examples and $d$ features. The model can then be generalized to:

$$\begin{aligned} \hat{y} & = w_1 x_1 + \cdots + w_d x_d + b \\ & = \mathbf{w}^\top \mathbf{x} + b. \end{aligned}$$

where:

$$\begin{aligned} \mathbf{w} & = \begin{bmatrix} w_{1} & w_{2} & \cdots & w_{d} \end{bmatrix}^\top \in \mathbb{R}^d \\ \mathbf{x} & = \begin{bmatrix} x_{1} & x_{2} & \cdots & x_{d} \end{bmatrix}^\top \in \mathbb{R}^d \end{aligned}$$

In the formula above, $\mathbf{x}$ corresponds to the features of a single example. We can use the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ to represent our entire dataset of $n$ examples.

Thus the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed as:

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b$$

where:

$$\begin{aligned} \hat{\mathbf{y}} & = \begin{bmatrix} \hat{y}_{1} & \hat{y}_{2} & \cdots & \hat{y}_{n} \end{bmatrix}^{\top} \\ \mathbf{X} & = \begin{bmatrix} \mathbf{x}_{1} & \mathbf{x}_{2} & \cdots & \mathbf{x}_{n} \end{bmatrix}^{\top} \end{aligned}$$
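
Here the scalar bias $b$ is broadcast and added to every element of $\mathbf{X}\mathbf{w}$. As a minimal PyTorch sketch of the vectorized prediction (the shapes and values below are made up purely for illustration):

import torch

n, d = 5, 3                      # hypothetical numbers of examples and features
X = torch.randn(n, d)            # design matrix, one example per row
w = torch.randn(d)               # weight vector
b = 4.2                          # bias, broadcast over all examples

y_hat = torch.matmul(X, w) + b   # predictions, shape (n,)
print(y_hat.shape)               # torch.Size([5])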

Task

Our task is to find the best model for predicting $y$ given $\mathbf{x}$, that is, to find the best parameters $\mathbf{w}$ and $b$.

Before searching for them, we need two more things:

  1. A measure of the quality of some given model (i.e. a loss function)
  2. A procedure for updating the model to improve its quality (i.e. an optimization algorithm)

Loss Function

Loss functions quantify the distance between the real and predicted values of the target. For regression problems, the most common loss function is the squared error. When our prediction for example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the squared error is given by:

$$\ell^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^{2}$$

The constant $\frac{1}{2}$ makes no real difference but proves to be notationally convenient when taking the derivative of the loss.
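
Concretely, differentiating $\ell^{(i)}$ with respect to the prediction leaves no stray constant behind:

$$\frac{\partial \ell^{(i)}}{\partial \hat{y}^{(i)}} = \hat{y}^{(i)} - y^{(i)}$$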

To measure the model's quality on the entire dataset, we simply average (or equivalently, sum) the losses on the training set:

$$\mathcal{L}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \ell^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$$

When training the model, we seek parameters $(\mathbf{w}^*, b^*)$ that minimize the total loss across all training examples:

$$\mathbf{w}^{*}, b^{*} = \arg\min_{\mathbf{w}, b} \mathcal{L}(\mathbf{w}, b)$$

Minibatch Stochastic Gradient Descent

In each iteration $t$:

  1. Randomly sample a minibatch $\mathcal{B}_t$ consisting of a fixed number $|\mathcal{B}|$ of training examples
  2. Compute the gradient of the average loss on the minibatch (i.e. $\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \nabla_{(\mathbf{w},b)} \ell^{(i)}(\mathbf{w},b)$)
  3. Multiply the gradient by a predetermined learning rate $\eta$
  4. Subtract the resulting term from the current parameter values $(\mathbf{w},b)$.

We can express the update as follows:

$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \nabla_{(\mathbf{w},b)} \ell^{(i)}(\mathbf{w},b)$$

Note that the minibatch size and learning rate are user-defined. Such tunable parameters that are not updated in the training loop are called hyperparameters.
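
As a minimal numeric sketch of a single update (the parameter values, gradients, and hyperparameters below are made up purely for illustration):

import torch

lr, batch_size = 0.03, 10              # hypothetical hyperparameters
w = torch.tensor([0.5, -0.2])          # current parameter values
grad_sum = torch.tensor([2.0, -1.0])   # sum of per-example gradients over the minibatch

w = w - (lr / batch_size) * grad_sum   # one minibatch SGD step
print(w)                               # tensor([0.4940, -0.1970])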

Linear Regression Implementation from Scratch

Synthetic Regression Data

Generating the Dataset

In this part, we will generate:

  1. $\mathbf{X} \in \mathbb{R}^{1000 \times 2}$: $1000$ examples with $2$-dimensional features
  2. $\mathbf{y} \in \mathbb{R}^{1000}$: $1000$ labels

We generate each label by applying a ground-truth linear function and corrupting the result via additive noise $\boldsymbol{\epsilon}$:

$$\mathbf{y} = \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}$$

where $\mathbf{X} \sim \mathcal{N}(0, 1^2)$ and $\boldsymbol{\epsilon} \sim \mathcal{N}(0, 0.01^2)$.

import torch

def synthetic_data(w, b, num_examples=1000):
    """
    Synthetic data for linear regression.

    Args:
        w (torch.Tensor): weight vector
        b (float): bias term
        num_examples (int): number of examples

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: features (X) and labels (y)
    """
    X = torch.randn(num_examples, len(w))  # sample features from the standard normal distribution
    y = torch.matmul(X, w) + b             # compute the noise-free labels
    y += torch.normal(0, 0.01, y.shape)    # add the Gaussian noise
    return X, y.reshape((-1, 1))           # reshape the labels into a column vector

Then we set the ground-truth values $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$.

features, labels = synthetic_data(w=torch.tensor([2, -3.4]), b=4.2)
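
As a quick sanity check (an extra step, not part of the original pipeline), we can confirm the shapes of the generated data:

print(features.shape, labels.shape)  # torch.Size([1000, 2]) torch.Size([1000, 1])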

We can observe the linear relationship between the second feature $\mathbf{x}_2$ (i.e. features[:, 1]) and the labels $\mathbf{y}$ (i.e. labels) in a scatter plot.

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.scatter(features[:, 1].detach().numpy(), labels.detach().numpy(), s=1)
plt.show()

Reading the Dataset

import random

def data_iter(batch_size, features, labels):
    """
    A generator that provides batches of data.

    Args:
        batch_size (int): The batch size.
        features (torch.Tensor): The features of the data.
        labels (torch.Tensor): The labels of the data.

    Yields:
        Tuple[torch.Tensor, torch.Tensor]: A batch of features and labels.
    """
    num_examples = len(features)

    # Shuffle the indices so that examples are read in random order
    indices = list(range(num_examples))
    random.shuffle(indices)

    for i in range(0, num_examples, batch_size):
        # Convert the index slice into a tensor for indexing the features and labels
        batch_indices = torch.tensor(indices[i:i + batch_size])
        yield features[batch_indices], labels[batch_indices]

We can read the first batch of data and print it.

batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break

Initializing the Parameters

We initialize the weights by sampling $\mathbf{w} \sim \mathcal{N}(0, 0.01^2)$ and set the bias to $b = 0$.

w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

Defining the Model

def linear_regression(X, w, b):
    """
    The linear regression model.

    Args:
        X (torch.Tensor): The input data, a tensor of shape (n_examples, n_features)
        w (torch.Tensor): The weight vector, a tensor of shape (n_features, 1)
        b (torch.Tensor): The bias term, a scalar

    Returns:
        torch.Tensor: The output of the linear regression, a tensor of shape (n_examples, 1)
    """
    return torch.matmul(X, w) + b  # the scalar bias is added via the broadcasting mechanism
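
As a quick sanity check (an extra step, not required for training), the model should produce one prediction per example when applied to a few features with the freshly initialized parameters:

print(linear_regression(features[:3], w, b).shape)  # torch.Size([3, 1])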

Defining the Loss Function

def squared_loss(y_hat, y):
    """
    Calculates the squared loss between the predicted and actual values.

    Args:
        y_hat (torch.Tensor): The predicted values, of shape (n_examples, 1)
        y (torch.Tensor): The actual values, of shape (n_examples, 1)

    Returns:
        torch.Tensor: The per-example squared losses, of shape (n_examples, 1)
    """
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
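
For instance, a prediction that is off by 1 incurs a loss of 0.5, while a perfect prediction incurs 0 (the toy tensors below are for illustration only):

y_hat = torch.tensor([[2.0], [4.0]])
y = torch.tensor([[1.0], [4.0]])
print(squared_loss(y_hat, y))  # tensor([[0.5000], [0.0000]])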

Defining the Optimization Algorithm

def SGD(params, lr, batch_size):
    """
    Stochastic Gradient Descent (SGD) is an optimization algorithm for minimizing a loss function.
    It updates the parameters of a model based on the gradient of the loss with respect to the parameters.

    Args:
        params (list[torch.Tensor]): The parameters of the model to be updated.
        lr (float): The learning rate.
        batch_size (int): The size of the minibatch.

    Returns:
        None
    """
    with torch.no_grad():  # parameter updates must not be tracked by autograd
        for param in params:
            param -= lr * param.grad / batch_size  # update the parameters using the gradient and learning rate
            param.grad.zero_()                     # reset the gradient to zero

Training

First, we set the hyperparameters.

lr = 0.03                # learning rate
num_epochs = 3           # number of epochs
net = linear_regression  # model architecture (net)
loss = squared_loss      # loss function

Then we run the training loop for num_epochs epochs.

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)    # minibatch loss
        l.sum().backward()           # compute gradients of the summed loss
        SGD([w, b], lr, batch_size)  # update the parameters using their gradients

    # Observe the loss on the full training set
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
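
Since we synthesized the data ourselves, we can compare the learned parameters with the ground truth $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$; after training they should be very close (this check is an extra step, not part of the loop above):

with torch.no_grad():
    true_w = torch.tensor([2, -3.4])
    print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
    print(f'error in estimating b: {4.2 - float(b)}')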

Concise Implementation of Linear Regression

Synthetic Regression Data

Generating the Dataset

This part is the same as before.

def synthetic_data(w, b, num_examples=1000):
    """
    Synthetic data for linear regression.

    Args:
        w (torch.Tensor): weight vector
        b (float): bias term
        num_examples (int): number of examples

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: features (X) and labels (y)
    """
    X = torch.randn(num_examples, len(w))  # sample features from the standard normal distribution
    y = torch.matmul(X, w) + b             # compute the noise-free labels
    y += torch.normal(0, 0.01, y.shape)    # add the Gaussian noise
    return X, y.reshape((-1, 1))           # reshape the labels into a column vector

features, labels = synthetic_data(w=torch.tensor([2, -3.4]), b=4.2)

Reading the Dataset

from torch.utils import data

def load_array(data_arrays, batch_size, is_train=True):
    """
    Construct a PyTorch data iterator.

    Args:
        data_arrays (tuple): a tuple of tensors containing the data
        batch_size (int): the batch size for training or inference
        is_train (bool, optional): whether to shuffle the data, as during training (default: True)

    Returns:
        DataLoader: a PyTorch DataLoader over the data
    """
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

We can read the first batch of data and print it.

batch_size = 10
data_iter = load_array((features, labels), batch_size)
next(iter(data_iter))

Defining the Model

We can use PyTorch's predefined layers, which allows us to focus on which layers to use rather than on how to implement them.

Now we define net, an instance of the Sequential class. The Sequential class chains multiple layers together: given input data, a Sequential instance passes it to the first layer, then uses the output of the first layer as the input of the second layer, and so on.

In this case, our model contains only one layer, so Sequential is not actually needed. However, since almost all models we will encounter later are multi-layered, using Sequential here makes you familiar with the "standard pipeline".

from torch import nn

net = nn.Sequential(nn.Linear(2, 1))

In PyTorch, the fully connected layer is defined in the Linear and LazyLinear classes. The latter allows users to specify merely the output dimension, while the former additionally requires the input dimension. Here we use Linear(2, 1) and specify both dimensions explicitly, so that we can initialize the parameters directly in the next step.
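
For reference, a sketch of the LazyLinear alternative (the name lazy_net is illustrative; its weights stay uninitialized until the first forward pass, so the explicit initialization in the next step would only work after data has flowed through it):

# Only the output dimension is specified; the input dimension is
# inferred from the first batch passed through the layer.
lazy_net = nn.Sequential(nn.LazyLinear(1))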

Initializing the Parameters

net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)

Defining the Loss Function

The MSELoss class computes the mean squared error (without the $\frac{1}{2}$ factor). By default, MSELoss returns the average loss over examples.

loss = nn.MSELoss()

Defining the Optimization Algorithm

trainer = torch.optim.SGD(net.parameters(), lr=0.03)

Training

num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)  # minibatch loss
        trainer.zero_grad()  # clear previous gradients
        l.backward()         # compute gradients
        trainer.step()       # update the parameters

    # Observe the loss on the full training set
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l:f}')
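
As in the from-scratch version, we can compare the parameters learned by the Linear layer with the ground truth (an extra check, not part of the training loop):

w = net[0].weight.data
b = net[0].bias.data
print(f'error in estimating w: {torch.tensor([2, -3.4]) - w.reshape(-1)}')
print(f'error in estimating b: {4.2 - float(b)}')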
