Cross Validation

Cross Validation and different types of splitting techniques


In machine learning, we cannot simply fit the model on the training data and claim that it will work accurately on real, unseen data. We must make sure that our model has learned the correct patterns from the data and is not picking up too much noise. For this purpose, we use the cross-validation technique:

Cross-Validation

Cross-validation is a technique in which we train our model using a subset of the dataset and then evaluate it using the complementary subset of the dataset.


The three steps involved in cross-validation are as follows:

A. Reserve some portion of the sample dataset.

B. Train the model using the rest of the dataset.

C. Test the model using the reserved portion of the dataset.


Example:


Here, we are trying to find the relationship between size and price. To achieve this, we have fitted the data in three different ways; the corresponding plots are described below:


We establish the relationship using a linear equation, for which the plots were drawn. The first plot has a high error on the training data points. This is an example of “Underfitting”: the model fails to capture the underlying trend of the data.

In the second plot, we found just the right relationship between price and size, i.e., low training error and good generalization of the relationship.

In the third plot, we found a relationship which has almost zero training error. This is because the relationship is developed by considering each deviation in the data point (including noise), i.e., the model is too sensitive and captures random patterns which are present only in the current dataset. This is an example of “Overfitting”. 
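To make the example concrete, here is a minimal sketch of the same idea, assuming a small synthetic size/price dataset (not the original example's data) whose true relationship is quadratic with noise added. Fitting polynomials of degree 1, 2 and 10 mirrors the three plots: the training error shrinks with degree, while the error on held-out points rises once the model starts chasing noise.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical synthetic data: quadratic size-price relationship plus noise
rng = np.random.RandomState(0)
size = np.sort(rng.uniform(0, 2, 40)).reshape(-1, 1)
price = 1.0 + 2.0 * size.ravel() + 1.5 * size.ravel() ** 2 + rng.normal(0, 0.3, 40)

X_train, X_val, y_train, y_val = train_test_split(size, price, test_size=0.3,
                                                  random_state=0)

for degree in (1, 2, 10):  # underfit, good fit, overfit candidates
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print("degree", degree,
          "train MSE:", mean_squared_error(y_train, model.predict(X_train)),
          "validation MSE:", mean_squared_error(y_val, model.predict(X_val)))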

A few common methods used for cross-validation:

1. The Validation Set Approach

In this approach, we reserve 50% of the dataset for validation and the remaining 50% for model training. However, a major disadvantage of this approach is that, since we are training the model on only 50% of the dataset, we may miss out on interesting information about the data, which leads to higher bias.

Code :

from sklearn.model_selection import train_test_split

# 50/50 split: half of the data is held out for validation
train, validation = train_test_split(data, test_size=0.50, random_state=5)
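As a minimal follow-up sketch, assuming data is a pandas DataFrame with a hypothetical numeric target column named "price" (not specified in the original), the held-out half is then used for evaluation like this:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit on the training half, score on the validation half
model = LinearRegression()
model.fit(train.drop(columns="price"), train["price"])

preds = model.predict(validation.drop(columns="price"))
print("Validation MSE:", mean_squared_error(validation["price"], preds))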

2. K-Fold Cross Validation

As there is never enough data to train your model, removing a part of it for validation poses a problem of underfitting. By reducing the training data, we risk losing important patterns/trends in the dataset, which in turn increases the error induced by bias. So, what we require is a method that provides ample data for training the model and also leaves ample data for validation. K-Fold cross-validation does exactly that.



In K-Fold cross-validation, the data is divided into k subsets. The holdout method is then repeated k times, such that each time one of the k subsets is used as the test/validation set and the other k-1 subsets are put together to form the training set. The error estimate is averaged over all k trials to get the total effectiveness of the model.

As can be seen, every data point gets to be in the validation set exactly once and in the training set k-1 times. This significantly reduces bias, as we are using most of the data for fitting, and also reduces variance, as most of the data is also used in the validation set. Interchanging the training and test sets adds to the effectiveness of this method. As a general rule and from empirical evidence, K = 5 or 10 is generally preferred, but nothing is fixed and it can take any value.

Code :

from sklearn.model_selection import KFold

# X is the feature set and y is the target (NumPy arrays)
kf = KFold(n_splits=5)

for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
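Since the fold errors are averaged to judge the model, a convenient shortcut is sklearn's cross_val_score, which runs the k folds and returns one score per fold. The sketch below assumes the same X and y as above and a LinearRegression estimator, which is an assumption for illustration:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# One score per fold; their mean is the cross-validated performance estimate
scores = cross_val_score(LinearRegression(), X, y, cv=KFold(n_splits=5))
print("Fold scores:", scores)
print("Mean score:", scores.mean())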

3. Stratified K-Fold Cross Validation

In some cases, there may be a large imbalance in the response variable. For example, in a dataset concerning house prices, a large number of houses might have a high price; or, in the case of classification, there might be several times more negative samples than positive samples. For such problems, a slight variation of the K-Fold cross-validation technique is made, such that each fold contains approximately the same percentage of samples of each target class as the complete set, or, in the case of regression problems, such that the mean response value is approximately equal in all the folds. This variation is known as Stratified K-Fold.



The validation techniques explained above are also referred to as non-exhaustive cross-validation methods: they do not compute all possible ways of splitting the original sample; you just decide how many subsets to make. They are approximations of the exhaustive methods explained below, which compute all possible ways the data can be split into training and test sets.

Code :

from sklearn.model_selection import StratifiedKFold

# X is the feature set and y is the target
skf = StratifiedKFold(n_splits=5)

for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
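To see the stratification at work, a small hedged check (assuming y is a binary 0/1 NumPy array, so its mean is the positive-class rate) can compare the class proportion in each validation fold against the full dataset:

# Overall positive rate vs. positive rate in each validation fold
print("Overall positive rate:", y.mean())
for train_index, test_index in skf.split(X, y):
    print("Fold positive rate:", y[test_index].mean())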

4. Leave-P-Out Cross Validation

This approach leaves p data points out of the training data, i.e., if there are n data points in the original sample, then n-p samples are used to train the model and p points are used as the validation set. This is repeated for all combinations in which the original sample can be separated this way, and the error is then averaged over all trials to give the overall effectiveness.

This method is exhaustive in the sense that it needs to train and validate the model for all possible combinations, and for moderately large p, it can become computationally infeasible.

A particular case of this method is when p = 1, known as Leave-One-Out cross-validation. This method is generally preferred over the previous one because it does not suffer from intensive computation: the number of possible combinations is simply the number of data points in the original sample, n.

Code :

# importing libraries
import numpy as np
from sklearn.model_selection import LeaveOneOut

# creating the data
X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])

# Independent variable
print("\nIndependent variable :")
print(X)

# Dependent variable
print("\nDependent variable :")
print(y)

# creating the LeaveOneOut splitter
loo = LeaveOneOut()
loo.get_n_splits(X)

# printing the training and validation data for each split
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("\ntraining set:", X_train, y_train)
    print("\nvalidation set :", X_test, y_test)



