What is Cross Validation (CV)

Let’s say we have 50 data points to train a model. If we use all of them for training, we have none left for testing. A sensible approach is to split the data 80/20, where 80% is used to train the model and the remaining 20% is used to test it.

  • data points 1-40: used for training
  • data points 41-50: used for testing

However, we don’t know for sure that data points 1-40 are the best choice for training the model, and we are not using all of the available data to train it.

Cross Validation addresses this problem. The idea is to divide the training data into K groups (K folds), for example the 5 groups below. In each iteration, 4 of the 5 groups are used to train the model and the remaining one is used to test it, while data points 41-50 are kept as a separate hold-out set.

  Dataset 1-8    Dataset 9-16   Dataset 17-24  Dataset 25-32  Dataset 33-40  Dataset 41-50
  Train          Train          Train          Train          TEST           HoldOut
  Train          Train          Train          TEST           Train          HoldOut
  Train          Train          TEST           Train          Train          HoldOut
  Train          TEST           Train          Train          Train          HoldOut
  TEST           Train          Train          Train          Train          HoldOut
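The fold assignment in the table can be sketched with scikit-learn’s KFold. This is a minimal sketch: indices 0-39 stand in for data points 1-40, and it only prints which points each fold holds out for testing.

```python
import numpy as np
from sklearn.model_selection import KFold

# indices 0-39 stand in for data points 1-40; points 41-50 stay held out
indices = np.arange(40)
kf = KFold(n_splits=5)  # no shuffling, so each fold is a contiguous block of 8
for fold, (train_idx, val_idx) in enumerate(kf.split(indices), start=1):
    print(f"Fold {fold}: train on {len(train_idx)} points, "
          f"test on points {val_idx[0] + 1}-{val_idx[-1] + 1}")
```

Each of the 5 iterations trains on 32 points and tests on the remaining 8, exactly as in the table (only the order of the rows differs).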

Hypothesis: Model trained with CV performs BETTER!

Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
# create 50 data points for the experiment
X, y = make_regression(n_samples=50, n_features=1, noise=5, random_state=42)
plt.scatter(X, y)

50-regression-points

Split the data 80/20, with 20% held out for testing

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("x_train: ", x_train.size)
print("x_test: ", x_test.size)
# output
x_train:  40
x_test:  10

Train model as baseline

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
# fit linear regression model to training set
linear = LinearRegression()
linear.fit(x_train, y_train)
# predict the training set and test set using the model
y_train_pred = linear.predict(x_train)
y_test_pred = linear.predict(x_test)
scores = {}
scores["training set"] = {"r_2": r2_score(y_train, y_train_pred), "mse": mean_squared_error(y_train, y_train_pred)}
scores["test set"] = {"r_2": r2_score(y_test, y_test_pred), "mse": mean_squared_error(y_test, y_test_pred)}
print(scores)
# output
{'training set': {'r_2': 0.9082211858809777, 'mse': 20.77136285727775},
 'test set': {'r_2': 0.6761171340902208, 'mse': 36.679634983592294}}

Let’s use Cross Validation

from sklearn.model_selection import cross_validate
# CV is set to 5-fold
# R^2 and MSE are used as scoring metrics
scores = cross_validate(linear, X, y, cv=5, return_estimator=True, scoring=('r2', 'neg_mean_squared_error'))
print(scores)
# output
{'fit_time': array([0.00079298, 0.00041986, 0.00042582, 0.00053477, 0.00041199]),
 'score_time': array([0.00068283, 0.00073123, 0.00070882, 0.00072026, 0.00058603]),
 'estimator': (LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
  LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
  LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
  LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
  LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)),
 'test_r2': array([0.83294462, 0.87513271, 0.85425891, 0.95260985, 0.86880153]),
 'test_neg_mean_squared_error': array([-28.66454405, -15.60107755, -47.62231841,  -6.01442767,
        -34.90107707])}
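Each entry above is the score on one held-out fold. A common way to summarize a CV run is the mean and standard deviation across folds; below is a small sketch that re-runs the same experiment (renamed to cv_scores so it doesn’t clobber the scores dict used later):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=50, n_features=1, noise=5, random_state=42)
cv_scores = cross_validate(LinearRegression(), X, y, cv=5,
                           scoring=('r2', 'neg_mean_squared_error'))
# summarize the five per-fold scores as mean +/- std
print(f"R^2: {cv_scores['test_r2'].mean():.3f} +/- {cv_scores['test_r2'].std():.3f}")
# negate to turn neg_mean_squared_error back into an ordinary MSE
print(f"MSE: {-cv_scores['test_neg_mean_squared_error'].mean():.3f}")
```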

We can see that the 4th model (index 3) performs the best, with R^2 = 0.95260985 and the smallest error (the scores are reported as negative MSE, so -6.01442767 corresponds to an MSE of about 6.01). Using the best model to predict on the original train/test split:

best_model = scores["estimator"][3]
y_train_pred_cv = best_model.predict(x_train)
y_test_pred_cv = best_model.predict(x_test)
from sklearn.metrics import r2_score
scores_cv = {}
scores_cv["training set"] = {
    "r_2": r2_score(y_train, y_train_pred_cv),
    "mse": mean_squared_error(y_train, y_train_pred_cv)
}
scores_cv["test set"] = {
    "r_2": r2_score(y_test, y_test_pred_cv),
    "mse": mean_squared_error(y_test, y_test_pred_cv)}
print(scores_cv)
# output
{'training set': {'r_2': 0.9051641686456118, 'mse': 21.463226386636407},
 'test set': {'r_2': 0.725591176478181, 'mse': 31.076714894393582}}
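The index 3 above was read off the output by eye; the best fold can also be picked programmatically with np.argmax. A self-contained sketch that re-runs the same CV:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=50, n_features=1, noise=5, random_state=42)
cv_scores = cross_validate(LinearRegression(), X, y, cv=5,
                           return_estimator=True,
                           scoring=('r2', 'neg_mean_squared_error'))
best_idx = int(np.argmax(cv_scores['test_r2']))  # fold with the highest R^2
best_model = cv_scores['estimator'][best_idx]
print(best_idx, cv_scores['test_r2'][best_idx])
```

This avoids hard-coding the index, which would silently pick the wrong model if the data or the number of folds changed.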

Conclusion

The model trained with CV performs better than the baseline on the test set:

  • R^2 is higher, 0.725 vs 0.676
  • MSE is lower, 31.1 vs 36.7

Below is the visualization; the red line is the CV model. 50-regression-compare