Not all machine learning models are black boxes. Some, like Decision Trees, are highly interpretable and work much like we do when making decisions. They are simple, powerful, and form the basis for more complex algorithms like Random Forests.
A Decision Tree is a supervised learning algorithm that looks like a flowchart. It’s used for both classification and regression tasks. The tree is built by splitting the data into smaller and smaller subsets based on its features, creating a tree structure with decision nodes and leaf nodes.
- Decision Nodes: These are the points where the data is split based on a certain feature (e.g., “Is the email longer than 500 words?”).
- Leaf Nodes: These are the terminal nodes that represent the final outcome or decision (e.g., “Spam” or “Not Spam”).
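To make the flowchart idea concrete, here is a tiny hand-written sketch of the spam example; the features and thresholds are hypothetical, chosen only to show how decision nodes and leaf nodes fit together:

def classify_email(word_count, num_links):
    # Hypothetical features and thresholds, for illustration only
    if word_count > 500:            # decision node
        if num_links > 10:          # decision node
            return "Spam"           # leaf node
        return "Not Spam"           # leaf node
    return "Not Spam"               # leaf node

print(classify_email(word_count=800, num_links=15))  # -> Spam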
The algorithm chooses the best feature to split the data at each step, typically by measuring how much a split increases the “purity” of the resulting nodes (e.g., using metrics like Gini impurity or information gain).
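As a minimal sketch of the Gini criterion (written directly from its definition, not using scikit-learn), the impurity of a node is 1 minus the sum of squared class proportions, so a perfectly pure node scores 0:

import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # 0.0   (pure node)
print(gini_impurity([0, 0, 1, 1]))  # 0.5   (maximally mixed for two classes)
print(gini_impurity([0, 0, 0, 1]))  # 0.375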
The Overfitting Problem
Decision Trees are easy to understand and visualize. However, a single decision tree is prone to overfitting. This means it can learn the training data too well, capturing all its noise and peculiarities. A tree that is too deep and complex will perform brilliantly on the data it was trained on, but it will fail to generalize to new, unseen data.
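Here is a short illustrative sketch of that gap, using a synthetic dataset (so the exact numbers will vary): an unconstrained tree typically scores close to 100% on its training data but noticeably lower on held-out data.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# With no depth limit, the tree keeps splitting until the training data is memorized
deep_tree = DecisionTreeClassifier(random_state=0)
deep_tree.fit(X_train, y_train)

print(f"Train accuracy: {deep_tree.score(X_train, y_train):.2f}")  # typically 1.00
print(f"Test accuracy:  {deep_tree.score(X_test, y_test):.2f}")    # noticeably lower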
The Solution: Random Forests
To overcome the overfitting problem, we can use an ensemble method called a Random Forest. As the name suggests, a Random Forest is made up of a large number of individual decision trees that operate as a committee.
Here’s how it works:
- Bagging (Bootstrap Aggregating): The Random Forest algorithm builds multiple decision trees by training each one on a different random subset of the training data (a “bootstrap” sample).
- Feature Randomness: When splitting a node, each tree only considers a random subset of the available features. This ensures that the trees in the forest are different from one another.
To make a prediction, the Random Forest gets a prediction from every tree in the forest. It then makes its final decision by taking the majority vote (for classification) or the average (for regression) of all the individual tree predictions.
This process of combining many different, decorrelated trees reduces the overall variance and makes the model much more robust and accurate than any single decision tree.
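For intuition, here is a minimal hand-rolled sketch of that procedure built from scikit-learn's DecisionTreeClassifier; the real RandomForestClassifier used in the next section does all of this (and more) internally:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []

for _ in range(n_trees):
    # Bagging: draw a bootstrap sample (rows sampled with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Feature randomness: restrict each split to a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote: every tree predicts, and the most common label wins
all_preds = np.array([t.predict(X_test) for t in trees])    # shape (n_trees, n_samples)
majority_vote = (all_preds.mean(axis=0) > 0.5).astype(int)  # labels here are 0/1
print(f"Hand-rolled ensemble accuracy: {(majority_vote == y_test).mean():.4f}")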
Decision Trees and Random Forests in Scikit-Learn
Let’s see how to use scikit-learn to train both a single Decision Tree and a Random Forest for a classification task.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# --- 1. Generate Synthetic Data ---
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, n_redundant=0,
                           random_state=42)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# --- 2. Train a Single Decision Tree ---
# We limit the depth to prevent severe overfitting
tree_clf = DecisionTreeClassifier(max_depth=10, random_state=42)
tree_clf.fit(X_train, y_train)
tree_preds = tree_clf.predict(X_test)
tree_accuracy = accuracy_score(y_test, tree_preds)
print(f"Single Decision Tree Accuracy: {tree_accuracy:.4f}")
# --- 3. Train a Random Forest ---
# n_estimators is the number of trees in the forest
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
forest_clf.fit(X_train, y_train)
forest_preds = forest_clf.predict(X_test)
forest_accuracy = accuracy_score(y_test, forest_preds)
print(f"Random Forest Accuracy: {forest_accuracy:.4f}")
# --- 4. Feature Importance ---
# Random forests can also tell us which features were most important
importances = forest_clf.feature_importances_
print("\\nTop 5 Most Important Features (according to Random Forest):")
for i in importances.argsort()[-5:]:
print(f" - Feature {i}")What the Code Does
- Data Setup: We create a synthetic dataset for a binary classification problem and split it into training and testing sets.
- Decision Tree: We train a single DecisionTreeClassifier. We set max_depth=10 to stop it from growing too deep and overfitting too much. We then evaluate its accuracy on the unseen test data.
- Random Forest: We train a RandomForestClassifier with 100 trees (n_estimators=100). We then evaluate its accuracy. You’ll typically see that the Random Forest outperforms the single Decision Tree.
- Feature Importance: A great feature of tree-based models is that they can calculate the importance of each feature in making predictions. We can inspect the feature_importances_ attribute of our trained forest to see which features it relied on most.
Conclusion
Decision Trees provide a transparent and intuitive way to model decisions, but their tendency to overfit can be a significant drawback. Random Forests solve this problem by averaging the predictions of many diverse trees, creating a robust and high-performing model that is one of the most widely used algorithms in machine learning today. They offer a great balance of performance, interpretability, and ease of use.