This tutorial was written by Jacob Valdez (1001628688) without any human assistance or copying any code. It represents a powerful introduction to regularization for any ML student. Minor changes have been made since submission. Please see git log for details.

# A Simple Demonstration of OverfittingÂ¶

We've all done it. Have you ever studied all night for an exam but then realized that you were studying the wrong material? You were overfitting. The same can happen with machine learning models. Whether you're looking at an extremely simple linear regression model or a 1000-layer residual neural network, if not propperly regularized, the model can begin to systematically bias its performance around the training dataset. In this notebook, I want to show you how this can happen in the simpler case (linear regression model) and present one way to avoid minimizing overfitting.

Let's start by importing the libraries we'll need and building a plotting utility.

```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from IPython import display
def plot(true_points=None, pred_points=None, data_points=None, ax=None):
if ax is None:
fig, ax = plt.subplots(1, 1)
if true_points is not None:
x_true, y_true = true_points
ax.plot(x_true, y_true, 'g')
if pred_points is not None:
x_pred, y_pred = pred_points
ax.plot(x_pred, y_pred, 'r')
if data_points is not None:
x_data, y_data = data_points
ax.plot(x_data, y_data, 'bo')
ax.set_xlabel('x')
ax.set_ylabel('y')
```

Great. Imports and utilities are set up. Let's start by building a simple linear regression model. Formally, a linear regression model is a function of the form:

$$f(x) = w_0 + w_1 x + w_2 x^2 + ... + w_n x^n$$

For example, a zeroeth order polynomial is just a constant $f(x) = w_0$. A first order polynomial is a linear function $f(x) = w_0 + w_1 x$. A second order polynomial is a quadratic function $f(x) = w_0 + w_1 x + w_2 x^2$. And so on. The highest exponent determines the order of the polynomial.

Since we're dealing with linear combinations of the exponented value, we can represent our linear regression model $f(x)$ succinctly as a vector dot-product:

$$f(x) = \vec{w} \cdot \vec{x}$$

where $\vec{w} = \langle w_0 \ w_1 \ w_2 \ ... \ w_n \rangle$ is the vector of coefficients and $\vec{x} = \langle 1 \ x \ x^2 \ ... \ x^n \rangle$ is the vector of exponentiated values. Besides being succinct for in math, the dot-product operation is also a convenient way to perform linear regression model in code:

```
def model(x, w):
"""
Linear regression model
Computes f(x) = w_0 + w_1 x + w_2 x^2 + ... + w_n x^n
for a given x and weights w.
"""
# compute increasing exponents of x
X = np.stack([x**i for i in range(w.shape[0])])
# compute dot-product of weights and x
return w.dot(X)
```

Let's try out this function with a simple example.

```
x = 2
w = np.array([3, 4, 5])
y = model(x, w)
x, w, y
```

(2, array([3, 4, 5]), 31)

Does the math look correct? $\vec{w} \cdot \vec{x} = 3 + 4 \cdot 2 + 5 \cdot 2^2 = 31 = y$. Correct!

Before moving on, I forgot to emphasize that our linear regression model is going to be operating on lots and lots of data. Currently, `model`

must be called for each data point. This is inefficient. We can speed up the process by using a matrix multiplication:

```
def model_matmul(xs, w):
"""
Linear regression model
Computes f(x) = w_0 + w_1 x + w_2 x^2 + ... + w_n x^n
for a given set of xs and weights w.
"""
# compute increasing exponents of x
X = np.stack([xs**i for i in range(w.shape[0])], axis=1)
# compute dot-product of weights and *all* xs using matrix multiplication
return X @ w
```

Let's make sure our new model is correct by comparing it to the old one.

```
xs = np.array([1, 2, 3, 4, 5, 6, 7])
w = np.array([3, 4, 5])
ys = [model(x, w) for x in xs]
ys_matmul = model_matmul(xs, w)
ys==ys_matmul
```

array([ True, True, True, True, True, True, True])

Looks good! Now its time to build vectors for our data points and targets. In the real-world, you'd be loading your data from a database or file, but I'm going to keep it simple with:

$$x_{data} \sim \mathcal{U}(0,1)$$ and $$y_{data} = \sin(2 \pi x_{data}) + \zeta$$

with $\zeta \sim \mathcal{N}(0,1)$ and 20 data points.

Ideally, our data points approximate the true function $$y_{true} = \sin(2 \pi x_{true})$$ for $x_{true} \in [0,1]$. Let's see this in code:

```
np.random.seed(0) # set seed for reproducibility
x_data = np.random.uniform(0, 1, size=20)
y_data = np.sin(2 * np.pi * x_data) + np.random.normal(0, 0.1, size=20)
x_true = np.linspace(0, 1, 100)
y_true = np.sin(2 * np.pi * x_true)
# plot data points and true function
fig, ax = plt.subplots(1, 2)
plot(true_points=(x_true, y_true), data_points=(x_data, y_data), ax=ax[0])
# plot kde of data points
_x_data_kde = np.linspace(0, 1, 10000)
_y_data_kde = np.sin(2 * np.pi * _x_data_kde) + np.random.normal(0, 0.1, size=10000)
sns.kdeplot(x=_x_data_kde, y=_y_data_kde, ax=ax[1], shade=True)
plt.show()
```

Now its time to train our model. We'll split the data into training and testing sets.

```
x_train = x_data[:10]
y_train = y_data[:10]
x_test = x_data[10:]
y_test = y_data[10:]
```

Training the model is easy. There are faster methods in the one-dimensional case, but I'm going to use backpropagation to illustrate the overfitting problem. If you aren't familiar with backpropagation, it treats the backwards propagation of error as a linear algebra problem solved locally at each node in the computation graph. We're fortunate to have a simple model: a linear combination of the exponented values combined with some loss function (e.g. MSE).

$$f(x) = \vec{w} \cdot \vec{x}$$ $$\mathcal{L}(f(x), y) = (f(x) - y)^2$$

Now we train by finding the vector $\vec{w}$ that minimizes the loss function $\mathcal{L}$. Formally, this is

$$\min_{\vec{w}} (\vec{w} \cdot \vec{x} - y)^2$$

Going backwards, we derive our parameter gradients:

$$d\mathcal{L} = 1 $$ $$de = 2 (f(x) - y) \mathcal{L}$$ $$df(x) = de$$ $$d\vec{w} = \vec{x}^T \cdot df(x)$$

Just to stay consistant with deep learning conventions, we'll run an iterative $\eta$-weighted update step instead of moving the parameters all at once. In code this becomes:

```
def backprop(xs, y_true, w, lr=0.01):
# build power-bases of xs
X = np.stack([xs**i for i in range(w.shape[0])], axis=1) # [n_samples, order]
# forward pass
y_pred = X @ w # [n_samples, order] x [order, 1] = [n_samples]
err = y_pred - y_true # [n_samples]
loss = np.sum(err**2) # []
# compute gradient
dloss = 1 # []
derr = 2 * err * dloss # [n_samples]
dy_pred = derr # [n_samples]
dw = X.T @ dy_pred # [order, n_samples] x [n_samples] = [order]
# update weights
w = w - lr * dw
return w, loss
```

Let's run a few iterations of the update step. If it's working, we should see the weights begin to converge on the true function.

```
def train_loop(w_init, lr=0.01, n_epochs=100, quiet=False):
xs = np.linspace(0, 1, 100)
w = w_init
if not quiet:
fig, ax = plt.subplots(2, 5, figsize=(20, 10))
for i in range(1, 101):
w, loss = backprop(x_train, y_train, w, lr=lr)
if i % 10 == 0:
y_pred = model_matmul(xs, w)
_, val_loss = backprop(x_test, y_test, w)
if not quiet:
print(f'Epoch: %d\tLoss: %6.3f\tVal. Loss: %6.3f\t' % (i, loss, val_loss))
n = i//10 - 1
plot(true_points=(x_true, y_true),
pred_points=(xs, y_pred),
data_points=(x_test, y_test),
ax=ax[n//5, n%5])
if not quiet:
plt.suptitle('Linear Regression Training')
plt.show()
return w
np.random.seed(0) # set seed for reproducibility
w = np.random.normal(0, 0.1, size=11)
_ = train_loop(w, lr=0.01)
```

Epoch: 10 Loss: 2.266 Val. Loss: 2.337 Epoch: 20 Loss: 2.149 Val. Loss: 2.114 Epoch: 30 Loss: 2.039 Val. Loss: 1.927 Epoch: 40 Loss: 1.935 Val. Loss: 1.758 Epoch: 50 Loss: 1.836 Val. Loss: 1.607 Epoch: 60 Loss: 1.743 Val. Loss: 1.471 Epoch: 70 Loss: 1.655 Val. Loss: 1.349 Epoch: 80 Loss: 1.572 Val. Loss: 1.240 Epoch: 90 Loss: 1.493 Val. Loss: 1.144 Epoch: 100 Loss: 1.419 Val. Loss: 1.060

Everything looks good! Now let's crank up the linear model order and see what happens.

```
orders = [0, 1, 6, 9, 16, 23, 47, 93, 127]
fig, ax = plt.subplots(1, len(orders), figsize=(20, 5))
xs = np.linspace(0, 1, 100)
for i, order in enumerate(orders):
np.random.seed(0) # set seed for reproducibility
w_init = np.random.normal(0, 0.1, size=(order+1))
w_fin = train_loop(w_init, lr=0.01/(1+0.001*order), quiet=True)
plot(true_points=(x_true, y_true),
pred_points=(xs, model_matmul(xs, w_fin)),
data_points=(x_test, y_test),
ax=ax[i])
fig.suptitle('Various Poly Orders')
plt.show()
```

Note that as we increase the order, the model becomes more complex. When pitted up against simple linear regression, the model started overfitting. What can we do to prevent this?

Regularization to the rescue! We'll add a small amount of L2 regularization to our loss function to reduce the overfitting. It goes like this:

$$\mathcal{L}(f(x), y) = (f(x) - y)^2 + \lambda \sum_{i=1}^n w_i^2$$

Backpropagation now includes the additional term:

$$d\vec{w} = \vec{x}^T \cdot df(x) + \underbrace{2 \lambda \vec{w}}_{\text{L2 loss penalty}}$$

In code, this becomes:

```
def backprop_L2(xs, y_true, w, lambda_L2=0.0, lr=0.01):
# build power-bases of xs
X = np.stack([xs**i for i in range(w.shape[0])], axis=1) # [n_samples, order]
# forward pass
y_pred = X @ w # [n_samples, order] x [order, 1] = [n_samples]
err = y_pred - y_true # [n_samples]
loss = np.sum(err**2) # []
# compute gradient
dloss = 1 # []
derr = 2 * err * dloss # [n_samples]
dy_pred = derr # [n_samples]
dw = X.T @ dy_pred # [order, n_samples] x [n_samples] = [order]
dw = dw + 2 * lambda_L2 * w # L2 regularization
# update weights
w = w - lr * dw
return w, loss
```

Let's see if this makes any difference ($\lambda=1$) in our training loop. I'm going to first run a quick sanity check to make sure the new code didn't break our loss function.

```
def train_loop(w_init, lr=0.01, lambda_L2=0.0, n_epochs=100, quiet=False):
xs = np.linspace(0, 1, 100)
w = w_init
n = 0
if not quiet:
fig, ax = plt.subplots(2, 5, figsize=(20, 10))
for i in range(1, n_epochs+1):
w, loss = backprop_L2(x_train, y_train, w, lambda_L2=lambda_L2, lr=lr)
if i % (n_epochs//10) == 0:
y_pred = model_matmul(xs, w)
_, val_loss = backprop_L2(x_test, y_test, w, lambda_L2=0.0)
if not quiet:
print(f'Epoch: %d\tLoss: %6.3f\tVal. Loss: %6.3f\t' % (i, loss, val_loss))
plot(true_points=(x_true, y_true),
pred_points=(xs, y_pred),
data_points=(x_test, y_test),
ax=ax[n//5, n%5])
n += 1
if not quiet:
plt.suptitle('Linear Regression Training with L2 Regularization')
plt.show()
return w
np.random.seed(0) # set seed for reproducibility
w = np.random.normal(0, 5, size=9)
_ = train_loop(w, lr=0.02, lambda_L2=0.1)
```

Epoch: 10 Loss: 6.302 Val. Loss: 7.193 Epoch: 20 Loss: 5.250 Val. Loss: 6.005 Epoch: 30 Loss: 4.697 Val. Loss: 5.432 Epoch: 40 Loss: 4.237 Val. Loss: 5.015 Epoch: 50 Loss: 3.836 Val. Loss: 4.680 Epoch: 60 Loss: 3.483 Val. Loss: 4.405 Epoch: 70 Loss: 3.174 Val. Loss: 4.177 Epoch: 80 Loss: 2.902 Val. Loss: 3.986 Epoch: 90 Loss: 2.661 Val. Loss: 3.825 Epoch: 100 Loss: 2.449 Val. Loss: 3.688

Good. Okay, let's compare various combinations of poly order with and without regularization. Then we'll make some conclusions on the L2 approach to combat overfitting.

```
orders = [0, 1, 6, 9, 16, 23, 47, 93]
lambda_L2 = [0.0, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
fig, ax = plt.subplots(len(lambda_L2), len(orders), figsize=(20, 20))
for i, l2 in enumerate(lambda_L2):
for j, order in enumerate(orders):
np.random.seed(0) # set seed for reproducibility
w_init = np.random.normal(0, 0.1, size=(order+1))
w_fin = train_loop(w_init, lr=0.02, lambda_L2=l2, quiet=True)
plot(true_points=(x_true, y_true),
pred_points=(xs, model_matmul(xs, w_fin)),
data_points=(x_test, y_test),
ax=ax[i,j])
ax[i,j].title.set_text(f'L2={l2} order={order}')
fig.suptitle('Various poly orders and L2 penalties')
plt.show()
```