Just trying out the basics of scikit-learn machine learning package. Scikit-learn's API makes it easy to define, train, and test models.
Here, we do a basic example of linear regression, in the context of curve fitting. This toy example is an exact copy of tutorial available here. This example will illustrate concepts such as linear models, overfitting, underfiting, regularization, and cross-validation. Let's begin.
# all necessary imports
import numpy as np
import scipy.stats as st
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
Let's define the deterministic function underlying our generative model
f = lambda x: np.exp(3 * x)
Now, let's generate the values along the curve on [0, 2]
x_tr = np.linspace(0., 2, 200)
y_tr = f(x_tr)
Next, we generate our data points within [0,1]. We use the function f and add some Guassian noise (why not?)
x = np.array([0, .1, .2, .5, .8, .9, 1])
y = f(x) + np.random.randn(len(x))
# plot our data points on [0,1]
plt.figure(figsize=(8,4));
plt.plot(x_tr[:100], y_tr[:100], '--k');
plt.plot(x,y, 'ok', ms=10);
Finally, we use scikit-learn to fit a linear model to the data. The three steps are:
LinearRegression
class# create the model
lr = lm.LinearRegression()
# train the model on our training dataset
lr.fit(x[:, np.newaxis], y);
# predict the points with our trained model
y_lr = lr.predict(x_tr[:, np.newaxis])
It is a general convention in scikit-learn that observations are rows and features are columns. So, let's convert x
and x_tr
to column vectors. We have seven observations with 1 feature.
# plot the result of the trained linear model
plt.figure(figsize=(8,4));
plt.plot(x_tr, y_tr, '--k');
plt.plot(x_tr, y_lr, 'g');
plt.plot(x, y, 'ok', ms=10);
plt.xlim(0, 1);
plt.ylim(y.min()-1, y.max()+1);
plt.title("Linear regression");
The linear fit is not well adapted here as the data points are generated according to a non-linear model (an exponential curve). Therefore, we are now going to fit a nonlinear model. More precisely, we will fit a polynomial function to our data points. We can still use linear regression for that by precomputing the exponents of our data points. This is done by generating a Vandermonde matrix, using the np.vander
function.
lrp = lm.LinearRegression()
plt.figure(figsize=(8,4));
plt.plot(x_tr, y_tr, '--k');
for deg, s in zip([2,5], ['-', '.']):
lrp.fit(np.vander(x, deg + 1), y);
y_lrp = lrp.predict(np.vander(x_tr, deg + 1));
plt.plot(x_tr, y_lrp, s, label='degree ' + str(deg));
plt.legend(loc=2);
plt.xlim(0,1.4);
plt.ylim(-10, 40);
# print the model's coefficients
print(' '.join(['%.2f' % c for c in lrp.coef_]))
plt.plot(x, y, 'ok', ms=10);
plt.title("Linear Regression")
We have fitted two polynomial models of degree 2 and 5. The degree 2 polynomial appears to fit the data points less precisely than the degree 5 polynomial. However, we could say the degree 5 polynomial is overfitting and less robust simply because it would not be able to acheive a better fit outside our current dataset. Also, note the coefficients of the degree 5 polynomial is much larger which is generally a sign of overfitting.
This plot was interesting because if you look at the original tutorial, the degree 5 polynomial has much larger coefficients and the plot inverts after our largest x-value
thereby making it less robust.
Now, we will use a different learning model, called the ridge regression. It works like linear regression, except that it prevents the polynomial's coefficients to explode. By adding a regularization term in the loss function, ridge regression imposes some structure on the underlying model.
The ridge regression model has a meta-parameter which represents the weight of the regularization term. We could try the different values with trials and errors, using the Ridge
class. However, we will use the scikit-learn model called RidgeCV
which includes a parameter search with cross-validation. In practise, it means that we don't have to tweak this parameter by hand: scikit-learn does it for us. Since the models of scikit-learn always follow the fit-predict API, all we have to do it replace lm.LinearRegression
by lm.RidgeCV
in our earlier code.
ridge = lm.RidgeCV()
plt.figure(figsize=(8,4));
plt.plot(x_tr, y_tr, '--k');
for deg, s in zip([2,5], ['-', '.']):
ridge.fit(np.vander(x, deg + 1), y);
y_ridge = ridge.predict(np.vander(x_tr, deg + 1))
plt.plot(x_tr, y_ridge, s, label='degree ' + str(deg));
plt.legend(loc=2);
plt.xlim(0, 1.5);
plt.ylim(-5, 80);
# print the model's coefficients
print(' '.join(['%.2f' % c for c in ridge.coef_]))
plt.plot(x, y, 'ok', ms=10);
plt.title("Ridge regression");
Now, we see the degree 5 polynomial is more precise than the simpler degree 2 polynomial which now causes underfitting. Ridge regression reduces the overfitting issue here (also notice the much smaller coefficients for degree 5 polynomial).
scikit-learn implements a clean and coherent API for supervised and unsupervised learning. Our data points should be a X
, where N
is the number of observations, and D
is the number of features. In other words, each row is an observation. The first step in a machine learning task is to define what our matrix X
is exactly.
In a supervised learning setup, we also have a target, a N
-long vector y witha scalar value for each observations. This value is continuous or discrete, depending on whether we have a regression or classification problem, respectively.
In scikit-learn, models are implemented in classes that have fit
and predict
methods. The fit
method accepts the data matrix X as input and y
as well for supervised learning models. This method trains the model on the given data.
The predict
method also takes the data points as input (as a matrix