Not all machine learning models are black boxes. Some, like Decision Trees, are highly interpretable and work much like we do when making decisions. They are simple, powerful, and form the basis for more complex algorithms like Random Forests.
A Decision Tree is a supervised learning algorithm that looks like a flowchart. It’s used for both classification and regression tasks. The tree is built by splitting the data into smaller and smaller subsets based on its features, creating a tree structure with decision nodes and leaf nodes.
- Decision Nodes: These are the points where the data is split based on a certain feature (e.g., “Is the email longer than 500 words?”).
- Leaf Nodes: These are the terminal nodes that represent the final outcome or decision (e.g., “Spam” or “Not Spam”).
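To make the flowchart idea concrete, here is a tiny hand-written sketch of the spam example; the features and thresholds are hypothetical, chosen only to show how decision nodes and leaf nodes fit together:

def classify_email(word_count, num_links):
    # Hypothetical features and thresholds, for illustration only
    if word_count > 500:            # decision node
        if num_links > 10:          # decision node
            return "Spam"           # leaf node
        return "Not Spam"           # leaf node
    return "Not Spam"               # leaf node

print(classify_email(word_count=800, num_links=15))  # -> Spam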
The algorithm chooses the best feature to split the data at each step, typically by measuring how much a split increases the “purity” of the resulting nodes (e.g., using metrics like Gini impurity or information gain).
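As a minimal sketch of the Gini criterion (written directly from its definition, not using scikit-learn), the impurity of a node is 1 minus the sum of squared class proportions, so a perfectly pure node scores 0:

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # 0.0   (pure node)
print(gini_impurity([0, 0, 1, 1]))  # 0.5   (maximally mixed for two classes)
print(gini_impurity([0, 0, 0, 1]))  # 0.375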
The Overfitting Problem
Decision Trees are easy to understand and visualize. However, a single decision tree is prone to overfitting. This means it can learn the training data too well, capturing all its noise and peculiarities. A tree that is too deep and complex will perform brilliantly on the data it was trained on, but it will fail to generalize to new, unseen data.
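Here is a short illustrative sketch of that gap, using a synthetic dataset (so the exact numbers will vary): an unconstrained tree typically scores close to 100% on its training data but noticeably lower on held-out data.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# With no depth limit, the tree keeps splitting until the training data is memorized
deep_tree = DecisionTreeClassifier(random_state=0)
deep_tree.fit(X_train, y_train)

print(f"Train accuracy: {deep_tree.score(X_train, y_train):.2f}")  # typically 1.00
print(f"Test accuracy:  {deep_tree.score(X_test, y_test):.2f}")    # noticeably lower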
The Solution: Random Forests
To overcome the overfitting problem, we can use an ensemble method called a Random Forest. As the name suggests, a Random Forest is made up of a large number of individual decision trees that operate as a committee.
Here’s how it works:
- Bagging (Bootstrap Aggregating): The Random Forest algorithm builds multiple decision trees by training each one on a different random subset of the training data (a “bootstrap” sample).
- Feature Randomness: When splitting a node, each tree only considers a random subset of the available features. This ensures that the trees in the forest are different from one another.
To make a prediction, the Random Forest gets a prediction from every tree in the forest. It then makes its final decision by taking the majority vote (for classification) or the average (for regression) of all the individual tree predictions.
This process of combining many different, decorrelated trees reduces the overall variance and makes the model much more robust and accurate than any single decision tree.
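For intuition, here is a minimal hand-rolled sketch of that procedure built from scikit-learn's DecisionTreeClassifier; the real RandomForestClassifier used in the next section does all of this (and more) internally:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []

for _ in range(n_trees):
    # Bagging: draw a bootstrap sample (rows sampled with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Feature randomness: restrict each split to a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote: every tree predicts, and the most common label wins
all_preds = np.array([t.predict(X_test) for t in trees])    # shape (n_trees, n_samples)
majority_vote = (all_preds.mean(axis=0) > 0.5).astype(int)  # labels here are 0/1
print(f"Hand-rolled ensemble accuracy: {(majority_vote == y_test).mean():.4f}")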
Decision Trees and Random Forests in Scikit-Learn
Let’s see how to use scikit-learn to train both a single Decision Tree and a Random Forest for a classification task.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# --- 1. Generate Synthetic Data ---
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, n_redundant=0,
                           random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# --- 2. Train a Single Decision Tree ---
# We limit the depth to prevent severe overfitting
tree_clf = DecisionTreeClassifier(max_depth=10, random_state=42)
tree_clf.fit(X_train, y_train)
tree_preds = tree_clf.predict(X_test)
tree_accuracy = accuracy_score(y_test, tree_preds)
print(f"Single Decision Tree Accuracy: {tree_accuracy:.4f}")
# --- 3. Train a Random Forest ---
# n_estimators is the number of trees in the forest
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
forest_clf.fit(X_train, y_train)
forest_preds = forest_clf.predict(X_test)
forest_accuracy = accuracy_score(y_test, forest_preds)
print(f"Random Forest Accuracy: {forest_accuracy:.4f}")
# --- 4. Feature Importance ---
# Random forests can also tell us which features were most important
importances = forest_clf.feature_importances_
print("\\nTop 5 Most Important Features (according to Random Forest):")
for i in importances.argsort()[-5:]:
print(f" - Feature {i}")What the Code Does
- Data Setup: We create a synthetic dataset for a binary classification problem and split it into training and testing sets.
- Decision Tree: We train a single DecisionTreeClassifier. We set max_depth=10 to stop it from growing too deep and overfitting too much. We then evaluate its accuracy on the unseen test data.
- Random Forest: We train a RandomForestClassifier with 100 trees (n_estimators=100). We then evaluate its accuracy. You’ll typically see that the Random Forest outperforms the single Decision Tree.
- Feature Importance: A great feature of tree-based models is that they can calculate the importance of each feature in making predictions. We can inspect the feature_importances_ attribute of our trained forest to see which features it relied on most.
Conclusion
Decision Trees provide a transparent and intuitive way to model decisions, but their tendency to overfit can be a significant drawback. Random Forests solve this problem by averaging the predictions of many diverse trees, creating a robust and high-performing model that is one of the most widely used algorithms in machine learning today. They offer a great balance of performance, interpretability, and ease of use.