When we build a machine learning model, our primary goal is for it to generalize well to new, unseen data. A common way to estimate this is to split our data into a training set and a testing set. We train the model on the training set and evaluate it on the testing set.
But this approach has a weakness. What if we were just unlucky (or lucky) with our split? What if the testing set happened to contain a lot of easy-to-predict data points, making our model seem better than it is? Or what if it contained a lot of unusual outliers, making it seem worse?
The performance score from a single train/test split can be highly variable. To get a more reliable estimate of our model’s performance, we can use a technique called cross-validation.
K-Fold Cross-Validation
The most common type of cross-validation is K-Fold Cross-Validation. It works as follows:
- Split: Randomly split the entire dataset into ‘K’ equal-sized subsets, or “folds.”
- Iterate: For each fold k, from 1 to K:
  a. Hold out fold k and use it as the validation set.
  b. Use all the other K-1 folds as the training set.
  c. Train a new model on the training set and evaluate it on the validation set.
  d. Keep the evaluation score and discard the model.
- Average: After K iterations, you will have K different performance scores. The final performance of your model is the average of these K scores.
Image from the Scikit-Learn documentation.
A common value for K is 5 or 10. This process gives a much more robust estimate of the model’s performance on unseen data because it uses every data point for both training and validation over the course of the K iterations.
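To make the procedure concrete, here is a minimal sketch of the K-Fold loop written out by hand with scikit-learn's KFold splitter. The toy dataset and logistic regression model are just placeholders; the point is the split/train/score/average structure.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
import numpy as np
# Placeholder dataset for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Hold out the current fold as the validation set,
    # train a fresh model on the remaining K-1 folds.
    model = LogisticRegression(solver='liblinear')
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))
# The final estimate is the average of the K fold scores.
print(f"Mean accuracy over 5 folds: {np.mean(fold_scores):.4f}")
The cross_val_score helper shown later in this section wraps exactly this loop.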
When to Use Cross-Validation
- Model Selection: If you are comparing different algorithms (e.g., Logistic Regression vs. Random Forest), cross-validation can give you a better estimate of which one will perform better in the real world.
- Hyperparameter Tuning: As we’ve seen previously, cross-validation is a core component of GridSearchCV and RandomizedSearchCV, ensuring that we select hyperparameters that generalize well.
- Reliable Performance Estimate: When you need a trustworthy measure of your final model’s accuracy before deploying it.
The main drawback is that it’s more computationally expensive, as you need to train K models instead of just one.
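As a concrete illustration of the model-selection use case above, the following sketch compares two classifiers with the same cross-validation setup. The generated dataset is a stand-in for your own data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
candidates = [("Logistic Regression", LogisticRegression(solver='liblinear')),
              ("Random Forest", RandomForestClassifier(random_state=42))]
for name, clf in candidates:
    # Evaluate each candidate with the same 5-fold cross-validation
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    # Compare mean scores (and their spread) rather than a single split
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")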
Cross-Validation in Scikit-Learn
scikit-learn provides a simple way to perform cross-validation with the cross_val_score function.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# --- 1. Generate Data ---
X, y = make_classification(n_samples=1000, n_features=20,
n_informative=10, n_redundant=0,
random_state=42)
# --- 2. Compare a single Train/Test split vs. Cross-Validation ---
# --- Single Split Evaluation ---
print("--- Single Train/Test Split ---")
# Create one specific split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Accuracy on one specific test set: {score:.4f}")
# --- Cross-Validation Evaluation ---
print("\\n--- K-Fold Cross-Validation ---")
# Use the same model type
model_cv = LogisticRegression(solver='liblinear')
# Perform 10-fold cross-validation (cv=10)
# The function handles all the splitting, training, and scoring
scores = cross_val_score(model_cv, X, y, cv=10, scoring='accuracy')
print(f"Scores for each of the 10 folds: \\n{np.round(scores, 4)}")
print(f"\\nMean CV Accuracy: {scores.mean():.4f}")
print(f"Standard Deviation of CV Accuracy: {scores.std():.4f}")
# The result is a more reliable estimate of the model's true performance.
# The standard deviation gives us an idea of how much the performance varies across different data subsets.
# A model with a mean accuracy of 0.85 +/- 0.05 is more reliable than one with 0.85 +/- 0.2.
What the Code Does
- Single Split: We perform a standard 70/30 train/test split and calculate the accuracy of a LogisticRegression model. The result is a single number.
- Cross-Validation: We use cross_val_score to perform 10-fold cross-validation on the entire dataset.
  - estimator=model_cv: The model object to use.
  - X, y: The data and labels.
  - cv=10: The number of folds (K).
  - scoring='accuracy': The metric to use for evaluation.
- Results: The function returns an array of 10 scores, one for each fold. We can then calculate the mean and standard deviation of these scores. The mean gives us our robust performance estimate, and the standard deviation tells us how stable that estimate is.
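If you need more control over how the folds are built, or want several metrics at once, you can pass an explicit splitter and use cross_validate instead of cross_val_score. A brief sketch, using the same kind of generated data as above; the splitter and metric choices here are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=0, random_state=42)
model = LogisticRegression(solver='liblinear')
# StratifiedKFold keeps the class proportions similar in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(model, X, y, cv=skf, scoring=['accuracy', 'f1'])
print(f"Mean accuracy: {results['test_accuracy'].mean():.4f}")
print(f"Mean F1 score: {results['test_f1'].mean():.4f}")
Note that for classifiers, passing an integer cv already uses stratified folds by default, so the explicit splitter mainly matters when you want shuffling or a fixed random_state.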
Conclusion
While a simple train/test split is a good first step, K-Fold Cross-Validation is the industry standard for getting a reliable estimate of a model’s generalization performance. It’s an essential technique for robustly comparing models and for tuning hyperparameters, ensuring that you build models that are not just accurate on your current data, but on future data as well.



