One of the central challenges in machine learning is building a model that performs well not just on the data it was trained on, but on new, unseen data. This is the concept of generalization. The biggest threat to generalization is overfitting.
What is Overfitting?
Overfitting happens when a model learns the training data too well. Instead of learning the general patterns in the data, it starts to memorize the noise and random fluctuations. An overfit model might have fantastic accuracy on the training set, but it will perform poorly when it encounters new data because it has failed to learn the underlying trend.
Imagine a student who memorizes the answers to every question in a textbook. They would ace a test that uses those exact questions, but they would fail a test that asks conceptual questions about the same topics. The student hasn’t learned the concepts; they’ve only memorized the data.
The opposite problem is underfitting, where a model is too simple to capture the underlying structure of the data. An underfit model performs poorly on both the training data and new data. The goal is to find the “sweet spot” right in the middle.
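To make that sweet spot concrete, here is a minimal sketch (synthetic data; the exact numbers vary from run to run) that fits polynomials of increasing degree to noisy samples of a sine curve. The degree-1 fit typically underfits and does poorly everywhere, while the degree-9 fit drives the training error to nearly zero but does much worse on the held-out points.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(20)

# Interleave the points: even indices for training, odd indices held out
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
The moderate-degree fit is usually the one that keeps the two errors close together.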
How to Fight Overfitting: Regularization
There are several ways to combat overfitting, such as gathering more training data or choosing a simpler model. But one of the most powerful techniques is regularization.
Regularization works by adding a penalty term to the model’s loss function. This penalty discourages the model from learning overly complex patterns. It forces the model’s weights (coefficients) to be small. The two most common types of regularization are L1 and L2 regularization.
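In code, the idea is simply "old loss plus a penalty on the weights". Here is a minimal sketch (the helper names are illustrative, not from any library):
import numpy as np

def regularized_loss(w, X, y, lam, penalty):
    # Data-fit term (mean squared error) plus a weighted penalty on the weights
    data_loss = np.mean((X @ w - y) ** 2)
    return data_loss + lam * penalty(w)

# Toy data and a sum-of-squares penalty (the L2 penalty defined in the next section)
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(30)
w = np.array([0.9, 0.1, -1.8, 0.4])
print(regularized_loss(w, X, y, lam=0.1, penalty=lambda w: np.sum(w ** 2)))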
L2 Regularization (Ridge Regression)
L2 regularization adds a penalty equal to the sum of the squared values of the model’s weights. This is the penalty term:
[ \lambda \sum_{i=1}^{n} w_i^2 ]
- ( w_i ) is the weight of the i-th feature.
- ( \lambda ) (lambda) is the regularization parameter. It controls the strength of the penalty. A larger ( \lambda ) results in smaller weights and a simpler model.
L2 regularization pushes the weights toward small values, but it does not force them to be exactly zero. In linear models, this is known as Ridge Regression.
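As a rough illustration of this shrinkage effect (not scikit-learn's implementation; just plain least squares with no intercept), the L2-penalized problem has the closed-form solution ( w = (X^T X + \lambda I)^{-1} X^T y ), and increasing ( \lambda ) pulls every weight toward zero:
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((50, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 3.0]) + 0.5 * rng.standard_normal(50)

def ridge_weights(X, y, lam):
    # Closed-form ridge solution (no intercept): (X^T X + lambda * I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

for lam in (0.0, 1.0, 100.0):
    w = ridge_weights(X, y, lam)
    print(f"lambda = {lam:6.1f}  sum of squared weights = {np.sum(w ** 2):.3f}")
The weights get smaller as ( \lambda ) grows, but none of them lands exactly on zero.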
L1 Regularization (Lasso Regression)
L1 regularization adds a penalty equal to the sum of the absolute values of the model’s weights:
[ \lambda \sum_{i=1}^{n} |w_i| ]
The key difference is that L1 regularization can shrink some weights to be exactly zero. This means it can perform feature selection, effectively removing irrelevant features from the model by making their weights zero. In linear models, this is known as Lasso Regression.
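One way to see where the exact zeros come from is the soft-thresholding step used by coordinate-descent Lasso solvers: shrink each weight by ( \lambda ) and clip it to zero once its magnitude falls below ( \lambda ). This is a sketch of the mechanism, not a full Lasso implementation:
import numpy as np

def soft_threshold(w, lam):
    # Shrink each weight by lam; anything with |w_i| <= lam becomes exactly 0
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, -0.2, 0.05, -1.5, 0.4])
print(soft_threshold(w, lam=0.5))  # the small weights are set to exactly zero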
Regularization in Scikit-Learn
Let’s see how to apply Ridge and Lasso regularization to a linear regression model.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
import numpy as np
# --- 1. Generate Data and Add Noise ---
X, y, coef = make_regression(n_samples=100, n_features=20, n_informative=10,
                             noise=15, coef=True, random_state=42)
# Add some irrelevant features (noise)
X_noisy = np.concatenate([X, np.random.randn(100, 80)], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_noisy, y, test_size=0.3, random_state=42)
# --- 2. Train Different Models ---
# Standard Linear Regression (no regularization)
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_mse = mean_squared_error(y_test, lr.predict(X_test))
print(f"Linear Regression MSE: {lr_mse:.2f}")
# Ridge Regression (L2)
# alpha is the regularization strength (lambda)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
ridge_mse = mean_squared_error(y_test, ridge.predict(X_test))
print(f"Ridge (L2) Regression MSE: {ridge_mse:.2f}")
# Lasso Regression (L1)
lasso = Lasso(alpha=0.5)
lasso.fit(X_train, y_train)
lasso_mse = mean_squared_error(y_test, lasso.predict(X_test))
print(f"Lasso (L1) Regression MSE: {lasso_mse:.2f}")
# --- 3. Inspect Coefficients ---
print("\\nNumber of non-zero coefficients:")
print(f" - Linear Regression: {np.sum(lr.coef_ != 0)}")
print(f" - Ridge (L2): {np.sum(ridge.coef_ != 0)}")
print(f" - Lasso (L1): {np.sum(lasso.coef_ != 0)}") # Note how Lasso drives many to zeroWhat the Code Does
- Data: We create a regression dataset with 20 features, but only 10 of them are actually informative. We then add 80 completely random, useless features to demonstrate the power of regularization.
- Models: We train three models: a standard LinearRegression, a Ridge (L2) model, and a Lasso (L1) model. The alpha parameter corresponds to ( \lambda ); a short sketch for tuning it with cross-validation follows this list.
- Evaluation: We compare the Mean Squared Error (MSE) of the three models on the test set. You’ll likely see that the regularized models perform better because they are less influenced by the noisy, irrelevant features.
- Coefficients: We check how many of the learned coefficients are non-zero. Notice that Lasso has likely set many of the coefficients to exactly zero, effectively performing feature selection and ignoring the useless features we added.
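In the listing above, alpha=1.0 and alpha=0.5 were chosen by hand. In practice you would normally tune them; a brief sketch using scikit-learn's cross-validated estimators (reusing X_train and y_train from the example above) might look like this:
from sklearn.linear_model import RidgeCV, LassoCV
import numpy as np

# Reuses X_train and y_train from the example above
alphas = np.logspace(-3, 3, 13)

ridge_cv = RidgeCV(alphas=alphas)  # uses an efficient leave-one-out CV by default
ridge_cv.fit(X_train, y_train)

lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=42)
lasso_cv.fit(X_train, y_train)

print(f"Best Ridge alpha: {ridge_cv.alpha_:.3f}")
print(f"Best Lasso alpha: {lasso_cv.alpha_:.3f}")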
Conclusion
Overfitting is a constant challenge in machine learning. Regularization provides an elegant mathematical framework for controlling model complexity, reducing overfitting, and building models that generalize better to new data. Techniques like Ridge and Lasso are fundamental tools, and the same principles are applied to more complex models like neural networks (where it’s often called “weight decay”).
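For instance, in a typical deep-learning framework the same idea shows up as a single optimizer argument; here is a PyTorch-style sketch, separate from the scikit-learn example above:
import torch
import torch.nn as nn

model = nn.Linear(20, 1)  # a tiny stand-in model, just for illustration
# weight_decay adds an L2-style penalty on the weights at every update step
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)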



