Model Validation for Supervised Models

Poshan Pandey
Jan 3, 2022

When implementing a supervised machine learning model, there are four core steps:

  1. Choose a class of model
  2. Choose model hyperparameters
  3. Fit the model to the training data
  4. Use the model to predict labels for new data

Among these, the first two steps are perhaps the most important part of using a supervised model. To make an informed choice, we need to validate that our model and our hyperparameters are a good fit to the data. While this may sound simple, there are some pitfalls you must avoid to do it effectively.

Exploring Model Validation

The principle behind model validation is very simple: after choosing a model and its hyperparameters, we can estimate how effective it is by applying it to some of the training data and comparing the predictions to the known values. There are two main ways to validate a model:

Holdout Sets

In this approach we hold back some subset of the data from the training of the model, and then use this holdout set to check the model's performance. We can perform the split with the train_test_split utility in Scikit-Learn.

Let's demonstrate this using a dataset from the sklearn library. We will start by loading the data:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

Now, we choose a model and hyperparameters. Here we'll use a k-neighbors classifier with n_neighbors=1. This is a very simple and intuitive model that says "the label of an unknown point is the same as the label of its closest training point":

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=1)

Now that we have our data and model ready, let's split the data and run our model:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0, train_size=0.5)

# fit the model on one set of data
model.fit(X1, y1)

# evaluate the model on the second set of data
y2_model = model.predict(X2)
accuracy_score(y2, y2_model)

The output of the above code is:

0.9066666666666666

This is a reasonable result: the nearest-neighbor classifier is about 90% accurate on this holdout set.

Cross-Validation

One disadvantage of using a holdout set for model validation is that we have lost a portion of our data to model training. In the previous case, half the dataset does not contribute to the training of the model. This is not optimal, and can cause problems, especially if the initial set of training data is small.

One way to address this is to use cross-validation; that is, to do a sequence of fits where each subset of the data is used both as a training set and as a validation set.

In the simplest two-fold case, we do two validation trials, alternately using each half of the data as a holdout set. Using the split data from before, we could implement it like this:

y2_model = model.fit(X1, y1).predict(X2)
y1_model = model.fit(X2, y2).predict(X1)
accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)

(0.96, 0.9066666666666666)

What comes out are two accuracy scores, which we could combine, for instance by taking the mean, to get a better measure of the global performance. This particular form of cross-validation is two-fold cross-validation, in which we split the data into two sets and use each in turn as a validation set.
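As a minimal illustration of that last step (reusing y1, y2, y1_model, and y2_model from the snippet above), averaging the two fold scores might look like this:

import numpy as np
from sklearn.metrics import accuracy_score

# average the two fold accuracies into a single estimate
fold_scores = [accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)]
print(np.mean(fold_scores))  # about 0.93 for the two scores above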

We could expand on this idea to use even more trials, and more folds in the data. Consider, for example, five-fold cross-validation:

Here we split the data into five groups, and use each of them in turn to evaluate the model fit on the other 4/5 of the data. This would be rather tedious to do by hand, and so we can use Scikit-Learn's cross_val_score convenience routine to do it succinctly:

from sklearn.model_selection import cross_val_score

cross_val_score(model, X, y, cv=5)

array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1. ])

Repeating the validation across different subsets of the data gives us an even better idea of the performance of the algorithm.
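If we want a single summary number from these five scores, one common choice (not shown in the snippet above) is to report their mean and standard deviation:

from sklearn.model_selection import cross_val_score

# summarize the five fold scores by their mean and spread; the mean here works out to 0.96
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())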

Scikit-Learn implements a number of cross-validation schemes that are useful in particular situations; these are implemented via iterators in the model_selection module. For example, we might wish to go to the extreme case in which the number of folds is equal to the number of data points; that is, we train on all points but one in each trial. This type of cross-validation is known as leave-one-out cross-validation, and can be used as follows:

from sklearn.model_selection import LeaveOneOut

scores = cross_val_score(model, X, y, cv=LeaveOneOut())
scores

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Because we have 150 samples, leave-one-out cross-validation yields scores for 150 trials, and each score indicates either a successful (1.0) or unsuccessful (0.0) prediction. Taking the mean of these gives an estimate of the accuracy:

scores.mean()

0.96

Other cross-validation schemes can be used similarly. For a description of what is available in Scikit-Learn, use IPython to explore the sklearn.model_selection submodule, or take a look at Scikit-Learn's online cross-validation documentation.
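As one example beyond the schemes shown here, the ShuffleSplit iterator draws a number of random train/test splits rather than fixed folds; the sketch below (assuming the same model, X, and y as above) passes it to cross_val_score through the cv argument:

from sklearn.model_selection import ShuffleSplit, cross_val_score

# five random 80/20 splits instead of five fixed folds
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cross_val_score(model, X, y, cv=cv)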

Selecting the Best Model

Now that we’ve seen the basics of validation and cross-validation, we will go into a little more depth regarding model selection and selection of hyperparameters. These issues are some of the most important aspects of the practice of machine learning, and I find that this information is often glossed over in introductory machine learning tutorials. Of core importance is the following question: if our estimator is underperforming, how should we move forward? There are several possible answers:

  • Use a more complicated/more flexible model
  • Use a less complicated/less flexible model
  • Gather more training samples
  • Gather more data to add features to each sample

The answer to this question is often counterintuitive. In particular, sometimes using a more complicated model will give worse results, and adding more training samples may not improve your results! The ability to determine what steps will improve your model is what separates the successful machine learning practitioners from the unsuccessful.
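As a small illustration of this point, we can reuse the iris data from above and compare a more flexible setting of our classifier (n_neighbors=1) against a less flexible one (n_neighbors=5) with cross-validation; which one scores higher is an empirical question, not something we can assume in advance:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# compare a more flexible and a less flexible version of the same model family
for k in (1, 5):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())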

We will write more about selecting the best model in another article, so please stay tuned.

The Google Colab link to this article can be found here.
