Suppose we have a dataset giving the area, age, and price of some houses. How can we predict the price of a house we have not seen yet? We introduce linear regression to tackle this prediction problem.
The linear regression model assumes that:

$$\text{price} = w_{\text{area}} \cdot \text{area} + w_{\text{age}} \cdot \text{age} + b$$
| Example | Concepts |
| --- | --- |
| area, age | features (a.k.a. inputs) |
| price | target (a.k.a. outputs) |
| $w_{\text{area}}$, $w_{\text{age}}$ | weights (a.k.a. parameters) |
| $b$ | bias (a.k.a. offset, intercept) |
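For instance, plugging in some purely made-up numbers (these are not learned values, just an illustration of the formula):

```python
# Hypothetical weights and bias, chosen only to illustrate the formula
w_area, w_age, b = 90.0, -10.0, 5000.0
area, age = 100.0, 5.0  # features of one house

price = w_area * area + w_age * age + b  # weighted sum plus bias
print(price)  # 13950.0
```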
Basics
Notations
To clarify the model, we establish notation below:
| Concepts | Notation |
| --- | --- |
| features (a.k.a. inputs) | $\mathbf{x}$ |
| target (a.k.a. outputs) | $y$ |
| estimated value | $\hat{y}$ |
| weights (a.k.a. parameters) | $\mathbf{w}$ |
| bias (a.k.a. offset, intercept) | $b$ |
To represent different training examples, we put the superscript $(i)$ on $\mathbf{x}$, as in $\mathbf{x}^{(i)}$. In the case above, $x_1^{(i)}$ is the living area of the $i$-th house in the training set, and $x_2^{(i)}$ is its age.
Model
Suppose we have $n$ examples and $d$ features. The model can then be generalized to:
$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b = \mathbf{w}^\top \mathbf{x} + b$$

where:

$$\mathbf{w} = [w_1, w_2, \cdots, w_d]^\top \in \mathbb{R}^d, \qquad \mathbf{x} = [x_1, x_2, \cdots, x_d]^\top \in \mathbb{R}^d$$
In the formula above, $\mathbf{x}$ corresponds to the features of a single example. We can use the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ to represent our entire dataset of $n$ examples.
Thus the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed as:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$$

where:

$$\hat{\mathbf{y}} = [\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_n]^\top, \qquad \mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n]^\top$$
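The vectorized form maps directly to code. Below is a minimal sketch (with arbitrary sizes and values) of computing $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$ in PyTorch; broadcasting adds the scalar bias to every prediction:

```python
import torch

n, d = 4, 2                      # arbitrary: 4 examples, 2 features
X = torch.randn(n, d)            # design matrix, one example per row
w = torch.tensor([2.0, -3.4])    # weight vector
b = 4.2                          # bias

y_hat = torch.matmul(X, w) + b   # broadcasting adds b to each entry
print(y_hat.shape)               # torch.Size([4])
```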
Task
Our task is to find the best model for predicting $y$ given $\mathbf{x}$, that is, to find the best parameters $\mathbf{w}$ and $b$.
Before searching for them, we need two more things:
A measure of the quality of some given model (i.e. loss function)
A procedure for updating the model to improve its quality (i.e. optimization algorithm)
Loss Function
Loss functions quantify the distance between the real and predicted values of the target. For regression problems, the most common loss function is the squared error. When our prediction for an example i is y^(i) and the corresponding true label is y(i), the squared error is given by:
$$\ell^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
The constant $\frac{1}{2}$ makes no real difference but proves to be notationally convenient when taking the derivative of the loss.
Considering the entire dataset, we simply average (or equivalently, sum) the losses on the training set:

$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \ell^{(i)}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$$
When training the model, we seek parameters (w∗,b∗) that minimize the total loss across all training examples:
$$\mathbf{w}^*, b^* = \underset{\mathbf{w}, b}{\operatorname{argmin}}\; L(\mathbf{w}, b)$$
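As a quick sketch of what this objective measures, the average squared loss for one candidate $(\mathbf{w}, b)$ can be computed directly (all values below are arbitrary placeholders):

```python
import torch

X = torch.randn(5, 2)              # 5 examples, 2 features (placeholder data)
y = torch.randn(5)                 # true targets (placeholder data)
w = torch.zeros(2)                 # candidate weights
b = 0.0                            # candidate bias

y_hat = torch.matmul(X, w) + b     # predictions of the candidate model
L = ((y_hat - y) ** 2 / 2).mean()  # average squared loss L(w, b)
print(L)
```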
Minibatch Stochastic Gradient Descent
In each iteration t:
Randomly sample a minibatch $\mathcal{B}_t$ consisting of a fixed number $|\mathcal{B}|$ of training examples
Compute the gradient of the average loss on the minibatch (i.e. $\frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}_t} \nabla_{(\mathbf{w},b)} \ell^{(i)}(\mathbf{w}, b)$)
Multiply the gradient by a predetermined learning rate (i.e. $\eta$)
Subtract the resulting term from the current parameter values (i.e. $(\mathbf{w}, b)$).
We can express the update as follows:
$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \nabla_{(\mathbf{w},b)} \ell^{(i)}(\mathbf{w}, b)$$
Note that the minibatch size and the learning rate are user-defined. Such tunable parameters that are not updated in the training loop are called hyperparameters.
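The update rule translates almost verbatim into code. Here is a minimal sketch of a single update step (parameter names and hyperparameter values are placeholders; in practice the gradients would be filled in by a preceding backward pass):

```python
import torch

w = torch.zeros(2, requires_grad=True)  # placeholder parameters
b = torch.zeros(1, requires_grad=True)
lr, batch_size = 0.03, 10               # user-defined hyperparameters

# ... compute the minibatch loss and call .backward() here ...
with torch.no_grad():
    for param in (w, b):
        if param.grad is not None:
            param -= lr * param.grad / batch_size  # gradient step
            param.grad.zero_()                     # reset for the next iteration
```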
Linear Regression Implementation from Scratch
Synthetic Regression Data
Generating the Dataset
In this part, we will generate:
$\mathbf{X} \in \mathbb{R}^{1000 \times 2}$: 1000 examples with 2-dimensional features
$\mathbf{y} \in \mathbb{R}^{1000}$: 1000 labels
We generate each label by applying a ground-truth linear function and corrupting it via additive noise $\epsilon$:
$$\mathbf{y} = \mathbf{X}\mathbf{w} + b + \epsilon$$

where: $\mathbf{X} \sim \mathcal{N}(0, 1^2)$ and $\epsilon \sim \mathcal{N}(0, 0.01^2)$
```python
import torch

def synthetic_data(w, b, num_examples=1000):
    """
    Synthetic data for linear regression.

    Args:
        w (torch.Tensor): weight vector
        b (float): bias term
        num_examples (int): number of examples

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: features (X) and labels (y)
    """
    X = torch.randn(num_examples, len(w))  # generate a matrix from the standard normal distribution
    y = torch.matmul(X, w) + b             # calculate the labels
    y += torch.normal(0, 0.01, y.shape)    # integrate the noise
    return X, y.reshape((-1, 1))           # reshape the labels to be a column vector
```
Then we set the ground-truth values: $\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$.
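For example, naming the ground-truth parameters true_w and true_b (the names are our choice), the dataset can be generated as follows:

```python
true_w = torch.tensor([2.0, -3.4])   # ground-truth weights
true_b = 4.2                         # ground-truth bias

features, labels = synthetic_data(true_w, true_b)
print(features.shape, labels.shape)  # torch.Size([1000, 2]) torch.Size([1000, 1])
```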
To train the model, we need to iterate over the dataset in minibatches. The following generator shuffles the examples and yields one batch at a time.

```python
import random

def data_iter(batch_size, features, labels):
    """
    A generator that provides batches of data.

    Args:
        batch_size (int): The batch size.
        features (torch.Tensor): The features of the data.
        labels (torch.Tensor): The labels of the data.

    Yields:
        Tuple[torch.Tensor, torch.Tensor]: A batch of features and labels.
    """
    num_examples = len(features)

    # Shuffle the indices so that examples are selected in random order
    indices = list(range(num_examples))
    random.shuffle(indices)

    for i in range(0, num_examples, batch_size):
        # Convert indices into a tensor: for GPU acceleration and autograd
        batch_indices = torch.tensor(indices[i:i + batch_size])
        yield features[batch_indices], labels[batch_indices]
```
We can read the first batch of data and print it.
```python
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
```
Initializing the Parameters
Assume that $\mathbf{w} \sim \mathcal{N}(0, 0.01^2)$ and $b = 0$.
```python
w = torch.normal(0, 0.01, size=(2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
```
Defining the Model
```python
def linear_regression(X, w, b):
    """
    Args:
        X (torch.Tensor): The input data, a tensor of shape (n_examples, n_features)
        w (torch.Tensor): The weight vector, a tensor of shape (n_features, 1)
        b (torch.Tensor): The bias term, a scalar

    Returns:
        torch.Tensor: The output of the linear regression, a tensor of shape (n_examples, 1)
    """
    return torch.matmul(X, w) + b  # the scalar bias is added via the broadcasting mechanism
```
Defining the Loss Function
```python
def squared_loss(y_hat, y):
    """
    Calculates the squared loss between the predicted and actual values.

    Args:
        y_hat (torch.Tensor): The predicted values, of shape (n_examples, 1)
        y (torch.Tensor): The actual values, of shape (n_examples, 1)

    Returns:
        torch.Tensor: The element-wise squared loss, of shape (n_examples, 1)
    """
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
```
Defining the Optimization Algorithm
```python
def SGD(params, lr, batch_size):
    """
    Stochastic Gradient Descent (SGD) is an optimization algorithm for minimizing
    a loss function. It updates the parameters of a model based on the gradient
    of the loss with respect to the parameters.

    Args:
        params (list[torch.Tensor]): The parameters of the model to be updated.
        lr (float): The learning rate.
        batch_size (int): The size of the minibatch.

    Returns:
        None
    """
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size  # update the parameters using the gradient and learning rate
            param.grad.zero_()                     # reset the gradient to zero
```
Training
First, we set the hyperparameters.
```python
lr = 0.03                # learning rate
num_epochs = 3           # number of epochs
net = linear_regression  # model architecture (net)
loss = squared_loss      # loss function
```
Then we run the training loop until all epochs are done.
```python
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)
        l.sum().backward()
        SGD([w, b], lr, batch_size)

    # Observe the loss
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
```
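Since we synthesized the data ourselves, we can also check how close the learned parameters are to the ground truth (assuming true_w and true_b as defined when generating the data):

```python
with torch.no_grad():
    print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
    print(f'error in estimating b: {true_b - b}')
```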
Concise Implementation of Linear Regression
Synthetic Regression Data
Generating the Dataset
This part is the same as before.
```python
def synthetic_data(w, b, num_examples=1000):
    """
    Synthetic data for linear regression.

    Args:
        w (torch.Tensor): weight vector
        b (float): bias term
        num_examples (int): number of examples

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: features (X) and labels (y)
    """
    X = torch.randn(num_examples, len(w))  # generate a matrix from the standard normal distribution
    y = torch.matmul(X, w) + b             # calculate the labels
    y += torch.normal(0, 0.01, y.shape)    # integrate the noise
    return X, y.reshape((-1, 1))           # reshape the labels to be a column vector
```
```python
from torch.utils import data

def load_array(data_arrays, batch_size, is_train=True):
    """
    Construct a PyTorch data iterator.

    Args:
        data_arrays (tuple): a tuple of tensors containing the data
        batch_size (int): the batch size for training or inference
        is_train (bool, optional): whether the data is for training or inference (default: True)

    Returns:
        DataLoader: a PyTorch DataLoader containing the data
    """
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)
```
We can read the first batch of data and print it.
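A minimal sketch, reusing the synthetic features and labels from before and a batch size of 10 (both choices mirror the from-scratch version):

```python
batch_size = 10
data_iter = load_array((features, labels), batch_size)

# The DataLoader is an iterable; grab and inspect the first minibatch
X, y = next(iter(data_iter))
print(X, '\n', y)
```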
We can use PyTorch's predefined layers, which allows us to focus on which layers to use rather than on how to implement them.
Now we define net, an instance of Sequential class. The Sequential class connects multiple layers together. When given input data, the Sequential instance passes the data to the first layer, then uses the output of the first layer as the input of the second layer, and so on.
In this case, our model contains only one layer, so Sequential is not actually needed. However, since almost all the models we will encounter in the future are multi-layered, using Sequential here gets you familiar with the "standard pipeline".
```python
from torch import nn

net = nn.Sequential(nn.Linear(2, 1))
```
In PyTorch, the fully connected layer is defined in the Linear and LazyLinear classes. The latter allows users to specify merely the output dimension, while the former additionally requires the input dimension. For simplicity, we will use LazyLinear layers whenever we can.
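The training loop below also needs a loss function and an optimizer. A minimal sketch using PyTorch's built-in squared (MSE) loss and SGD, with the same learning rate as in the from-scratch version (that value is our choice, not mandated by the model):

```python
import torch

loss = nn.MSELoss()                                   # averaged squared error
trainer = torch.optim.SGD(net.parameters(), lr=0.03)  # minibatch SGD on the layer's parameters
```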
```python
num_epochs = 3

for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)
        trainer.zero_grad()
        l.backward()
        trainer.step()

    # Observe the loss
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l:f}')
```
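As in the from-scratch version, we can compare the parameters learned by the layer with the ground truth (assuming true_w and true_b from the data-generation step):

```python
w = net[0].weight.data
b = net[0].bias.data
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')
```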