contrib/chapter_machine-learning-fundamentals/underfit-overfit.md
%load_ext d2lbook.tab
tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
:label:`sec_polynomial`
In this section we test out some of the concepts that we saw previously. To keep matters simple, we use polynomial regression as our toy example.
%%tab mxnet
from d2l import mxnet as d2l
from mxnet import gluon, np, npx
from mxnet.gluon import nn
import math
npx.set_np()
%%tab pytorch
from d2l import torch as d2l
import torch
from torch import nn
import math
%%tab tensorflow
from d2l import tensorflow as d2l
import tensorflow as tf
import math
First we need data. Given $x$, we will use the following cubic polynomial to generate the labels on training and test data:

$$y = 5 + 1.2x - 3.4\frac{x^2}{2!} + 5.6 \frac{x^3}{3!} + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, 0.1^2).$$
The noise term $\epsilon$ obeys a normal distribution with a mean of 0 and a standard deviation of 0.1. For optimization, we typically want to avoid very large values of gradients or losses. This is why the features are rescaled from $x^i$ to $\frac{x^i}{i!}$: it prevents the values from blowing up for large exponents $i$. We will synthesize 200 samples each for the training set and the test set.
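To get a feel for why the rescaling matters, here is a small illustration (an addition, not part of the original notebook) comparing raw monomials with their factorial-rescaled counterparts at a moderate value of $x$:

%%tab all
# Added illustration: without the 1/i! rescaling, high-degree monomials
# quickly dwarf the low-degree ones.
x = 3.0
for i in [1, 3, 6, 10]:
    print(f'x^{i} = {x**i:.1f},  x^{i}/{i}! = {x**i / math.factorial(i):.4f}')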
%%tab all
class Data(d2l.DataModule):
    def __init__(self, num_train, num_val, num_inputs, batch_size):
        self.save_hyperparameters()
        p, n = max(3, self.num_inputs), num_train + num_val
        # True coefficients of the cubic polynomial, padded with zeros
        w = d2l.tensor([1.2, -3.4, 5.6] + [0] * (p - 3))
        if tab.selected('mxnet') or tab.selected('pytorch'):
            x = d2l.randn(n, 1)
            noise = d2l.randn(n, 1) * 0.1
        if tab.selected('tensorflow'):
            x = d2l.normal((n, 1))
            noise = d2l.normal((n, 1)) * 0.1
        # Monomials x^(i+1) rescaled by (i+1)! via the gamma function
        X = d2l.concat([x ** (i + 1) / math.gamma(i + 2) for i in range(p)], 1)
        # Labels follow y = 5 + Xw + noise; the constant 5 is the bias
        self.y = d2l.matmul(X, d2l.reshape(w, (-1, 1))) + 5 + noise
        self.X = X[:, :num_inputs]

    def get_dataloader(self, train):
        i = slice(0, self.num_train) if train else slice(self.num_train, None)
        return self.get_tensorloader([self.X, self.y], train, i)
Again, the monomials stored in `X` are rescaled by the gamma function, where $\Gamma(n)=(n-1)!$.
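For readers unfamiliar with the gamma function, a quick sanity check (an added illustration, not from the original) confirms that `math.gamma(n)` indeed equals $(n-1)!$ for positive integers $n$:

%%tab all
# Added illustration: math.gamma(n) == (n - 1)! for positive integers n.
for n in range(1, 6):
    print(f'gamma({n}) = {math.gamma(n):.0f},  ({n}-1)! = {math.factorial(n - 1)}')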
Take a look at the first two samples from the generated dataset. Note that the constant term of the polynomial corresponds to the bias: the value 1 would technically be a feature too, namely the constant feature, but here we omit it from `X` and let the linear model's bias parameter play that role.
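As a concrete check, one way to peek at the data (a sketch added here; the instantiation values mirror those used in `train` below) is:

%%tab all
# Added sketch: instantiate the dataset and print the first two samples.
data = Data(num_train=200, num_val=200, num_inputs=3, batch_size=20)
print('features:', data.X[:2])
print('labels:', data.y[:2])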
We begin by using a third-order polynomial function, which has the same order as the data generation function. The results show that this model's training and test losses can both be effectively reduced. The learned model parameters are also close to the true values $w = [1.2, -3.4, 5.6], b = 5$.
%%tab all
def train(p):
    if tab.selected('mxnet') or tab.selected('tensorflow'):
        model = d2l.LinearRegression(lr=0.01)
    if tab.selected('pytorch'):
        model = d2l.LinearRegression(p, lr=0.01)
    model.board.ylim = [1, 1e2]
    # 200 training and 200 validation samples, p input features, batch size 20
    data = Data(200, 200, p, 20)
    trainer = d2l.Trainer(max_epochs=10)
    trainer.fit(model, data)
    print(model.get_w_b())

train(p=3)
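If you want to quantify the fit rather than eyeball the printed values, a small helper (hypothetical, not part of the original notebook; it assumes you pass in the tensors returned by `get_w_b`) could compute the deviation from the generating polynomial:

%%tab all
# Hypothetical helper: deviation of the learned parameters from the true
# generating coefficients w = [1.2, -3.4, 5.6] and bias b = 5.
def param_error(w, b, true_w=(1.2, -3.4, 5.6), true_b=5.0):
    w = d2l.reshape(w, (-1,))
    w_err = [float(w[i]) - true_w[i] for i in range(len(true_w))]
    b_err = float(b) - true_b
    return w_err, b_err

To use it, `train` would have to return its `model` (or call `get_w_b` internally and pass the result along).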
Let's take another look at linear function fitting. After the decline in early epochs, it becomes difficult to further decrease this model's training loss. After the last epoch has completed, the training loss is still high. When used to fit nonlinear patterns (like the third-order polynomial function here), linear models are liable to underfit.
%%tab all
train(p=1)
Now let's try to train the model using a polynomial of excessively high degree. Here, there is insufficient data to learn that the higher-degree coefficients should have values close to zero. As a result, our overly complex model is susceptible to being influenced by noise in the training data. Although the training loss can be effectively reduced, the test loss remains much higher, indicating that the complex model overfits the data.
%%tab all
train(p=10)
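Beyond the three settings above, you can probe the whole capacity spectrum by reusing `train`; the sweep below is an added suggestion rather than part of the original experiments:

%%tab all
# Added sketch: sweep the polynomial degree to watch the transition from
# underfitting (low degree) to overfitting (high degree).
for degree in [1, 2, 3, 5, 10, 20]:
    print(f'--- fitting with degree {degree} ---')
    train(p=degree)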
In the subsequent sections, we will continue to discuss overfitting problems and methods for dealing with them, such as weight decay and dropout.
:begin_tab:`mxnet`
Discussions
:end_tab:
:begin_tab:`pytorch`
Discussions
:end_tab:
:begin_tab:`tensorflow`
Discussions
:end_tab: