Curve-fitting regression problem

Just trying out the basics of the scikit-learn machine learning package. Scikit-learn's API makes it easy to define, train, and test models.

Here, we work through a basic example of linear regression in the context of curve fitting. This toy example is an exact copy of the tutorial available here. It will illustrate concepts such as linear models, overfitting, underfitting, regularization, and cross-validation. Let's begin.

In [1]:
# all necessary imports
import numpy as np
import scipy.stats as st
import sklearn.linear_model as lm
import matplotlib.pyplot as plt

Let's define the deterministic function underlying our generative model.

In [2]:
f = lambda x: np.exp(3 * x)

Now, let's generate the values along the curve on [0, 2].

In [3]:
x_tr = np.linspace(0., 2, 200)
y_tr = f(x_tr)

Next, we generate our data points within [0, 1]. We use the function f and add some Gaussian noise (why not?).

In [6]:
x = np.array([0, .1, .2, .5, .8, .9, 1])
y = f(x) + np.random.randn(len(x))

# plot our data points on [0,1]
plt.figure(figsize=(8,4));
plt.plot(x_tr[:100], y_tr[:100], '--k');
plt.plot(x,y, 'ok', ms=10);

Finally, we use scikit-learn to fit a linear model to the data. The three steps are:

  1. Create the model, i.e. an instance of the LinearRegression class
  2. Fit the model to our data
  3. Predict the values from our trained model

In [7]:
# create the model
lr = lm.LinearRegression()

# train the model on our training dataset
lr.fit(x[:, np.newaxis], y);

# predict the points with our trained model
y_lr = lr.predict(x_tr[:, np.newaxis])

Note the use of np.newaxis above: it is a general convention in scikit-learn that observations are rows and features are columns, so we convert x and x_tr to column vectors. We have seven observations with one feature.
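As a quick sanity check (just illustrating the reshaping above, not part of the original tutorial), you can inspect the shapes before and after adding the new axis:

# x is a flat array of 7 values; x[:, np.newaxis] turns it into a 7 x 1 column vector
print(x.shape)                 # (7,)
print(x[:, np.newaxis].shape)  # (7, 1)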

In [8]:
# plot the result of the trained linear model
plt.figure(figsize=(8,4));
plt.plot(x_tr, y_tr, '--k');
plt.plot(x_tr, y_lr, 'g');
plt.plot(x, y, 'ok', ms=10);
plt.xlim(0, 1);
plt.ylim(y.min()-1, y.max()+1);
plt.title("Linear regression");

Vandermonde matrix

The linear fit is not well adapted here as the data points are generated according to a non-linear model (an exponential curve). Therefore, we are now going to fit a nonlinear model. More precisely, we will fit a polynomial function to our data points. We can still use linear regression for that by precomputing the powers of our data points. This is done by generating a Vandermonde matrix, using the np.vander function.
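To see what this produces on a tiny example (not from the original tutorial), np.vander stacks decreasing powers of each input value as columns, ending with a column of ones:

# for three sample points and degree 2, each row is [x**2, x**1, x**0]
np.vander(np.array([1., 2., 3.]), 3)
# array([[1., 1., 1.],
#        [4., 2., 1.],
#        [9., 3., 1.]])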

In [9]:
lrp = lm.LinearRegression()
plt.figure(figsize=(8,4));
plt.plot(x_tr, y_tr, '--k');

for deg, s in zip([2,5], ['-', '.']):
    lrp.fit(np.vander(x, deg + 1), y);
    y_lrp = lrp.predict(np.vander(x_tr, deg + 1));
    plt.plot(x_tr, y_lrp, s, label='degree ' + str(deg));
    plt.legend(loc=2);
    plt.xlim(0,1.4);
    plt.ylim(-10, 40);
    
    # print the model's coefficients
    print(' '.join(['%.2f' % c for c in lrp.coef_]))

plt.plot(x, y, 'ok', ms=10);
plt.title("Linear Regression");
    
26.37 -7.83 0.00
66.67 -22.16 -125.81 125.85 -24.77 0.00

We have fitted two polynomial models, of degree 2 and degree 5. The degree 2 polynomial appears to fit the data points less precisely than the degree 5 polynomial. However, we could say the degree 5 polynomial is overfitting and less robust, simply because it would not fit as well outside our current dataset. Also, note that the coefficients of the degree 5 polynomial are much larger, which is generally a sign of overfitting.

This plot is interesting because, in the original tutorial, the degree 5 polynomial has much larger coefficients and the curve turns over after our largest x-value, making it even less robust.

Ridge Regression model

Now, we will use a different learning model, called ridge regression. It works like linear regression, except that it prevents the polynomial's coefficients from exploding. By adding a regularization term to the loss function, ridge regression imposes some structure on the underlying model.
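Concretely, the regularization term is a penalty on the squared L2 norm of the coefficients. Here is a minimal sketch of the objective in plain NumPy (illustrative only, not scikit-learn's actual implementation):

def ridge_loss(w, X, y, alpha):
    # squared prediction error plus alpha times the squared L2 norm of the coefficients
    residual = X @ w - y
    return residual @ residual + alpha * (w @ w)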

The ridge regression model has a meta-parameter which represents the weight of the regularization term. We could try different values by trial and error, using the Ridge class. However, we will use the scikit-learn model called RidgeCV, which includes a parameter search with cross-validation. In practice, it means that we don't have to tweak this parameter by hand: scikit-learn does it for us. Since the models of scikit-learn always follow the fit-predict API, all we have to do is replace lm.LinearRegression by lm.RidgeCV in our earlier code.
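For instance (a hypothetical sketch, not part of the original tutorial), trial and error with Ridge versus a cross-validated search with RidgeCV could look like this:

# trying a few regularization weights by hand
for alpha in (0.01, 0.1, 1.0):
    model = lm.Ridge(alpha=alpha)
    model.fit(np.vander(x, 6), y)   # degree 5 fit for this alpha
    # ...inspect model.coef_ or a held-out score for each alpha

# or let scikit-learn pick an alpha by cross-validation over the same grid
ridge_cv = lm.RidgeCV(alphas=(0.01, 0.1, 1.0))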

In [11]:
ridge = lm.RidgeCV()
plt.figure(figsize=(8,4));
plt.plot(x_tr, y_tr, '--k');

for deg, s in zip([2,5], ['-', '.']):
    ridge.fit(np.vander(x, deg + 1), y);
    y_ridge = ridge.predict(np.vander(x_tr, deg + 1))
    plt.plot(x_tr, y_ridge, s, label='degree ' + str(deg));
    plt.legend(loc=2);
    plt.xlim(0, 1.5);
    plt.ylim(-5, 80);
    
    # print the model's coefficients
    print(' '.join(['%.2f' % c for c in ridge.coef_]))
    
plt.plot(x, y, 'ok', ms=10);
plt.title("Ridge regression");
12.44 5.54 0.00
4.45 4.03 3.81 3.92 3.41 0.00

Now, we see that the degree 5 polynomial is more precise than the simpler degree 2 polynomial, which now underfits. Ridge regression reduces the overfitting issue here (also notice the much smaller coefficients of the degree 5 polynomial).
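If you are curious which regularization weight the cross-validation settled on, RidgeCV stores it in the alpha_ attribute after fitting (a quick check using the ridge object from the cell above):

# the regularization weight selected by cross-validation
print(ridge.alpha_)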

Additional Notes

Scikit-learn API

scikit-learn implements a clean and coherent API for supervised and unsupervised learning. Our data points should be stored in an N x D matrix X, where N is the number of observations and D is the number of features. In other words, each row is an observation. The first step in a machine learning task is to define what our matrix X is exactly.

In a supervised learning setup, we also have a target: an N-long vector y with a scalar value for each observation. This value is continuous or discrete, depending on whether we have a regression or classification problem, respectively.

In scikit-learn, models are implemented in classes that have fit and predict methods. The fit method accepts the data matrix X as input (and y as well for supervised learning models). This method trains the model on the given data.

The predict method also takes data points as input (an M x D matrix). It returns the labels or transformed points, according to the trained model.
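To make the convention concrete, here is a minimal sketch of the generic fit/predict workflow on made-up data (the names and numbers are purely illustrative, reusing the np and lm imports from the first cell):

# N = 10 observations with D = 3 features each
X = np.random.rand(10, 3)
y = X @ np.array([1., 2., 3.])   # continuous target, so this is a regression problem

model = lm.LinearRegression()
model.fit(X, y)                  # train on the (N, D) matrix and the N-long target

X_new = np.random.rand(4, 3)     # M = 4 new points with the same D = 3 features
y_pred = model.predict(X_new)    # returns an M-long vector of predictions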