Unlike supervised learning, where models learn from labeled data, and unsupervised learning, where models find patterns in unlabeled data, Reinforcement Learning (RL) is about learning what to do—how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
Reinforcement Learning is inspired by behaviorist psychology: an agent learns to behave in an environment by performing actions and observing the results. This is how humans and animals often learn: through trial and error.
The Core Components of Reinforcement Learning
Every RL problem involves a few key elements (a short code sketch of these roles follows the list):
- Agent: The learner or decision-maker. This is the algorithm we are training.
- Environment: The world through which the agent moves. The agent’s actions can change the state of the environment.
- State (S): A complete description of the state of the environment. It’s a snapshot of the current situation.
- Action (A): A choice made by the agent from a set of possible actions.
- Reward (R): A feedback signal from the environment. The agent’s sole objective is to maximize the total reward it receives over time.
- Policy (π): The agent’s strategy or “brain.” It’s a function that maps a given state to an action. The goal of RL is to find the optimal policy that maximizes the cumulative reward.
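To make these roles concrete, here is a minimal sketch of how they typically map to code. The GridEnvironment class, its reset and step methods, and random_policy are illustrative names chosen for this sketch, not a standard API; the agent is whatever code calls them and, eventually, updates the policy.

import random

class GridEnvironment:
    """Environment: a 1D line of cells; the goal is the last cell."""

    def __init__(self, num_states=5):
        self.num_states = num_states
        self.state = 0                      # State: the agent's current cell index

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action: 0 moves left, 1 moves right
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.num_states - 1, self.state + move))
        done = self.state == self.num_states - 1
        reward = 1 if done else 0           # Reward: 1 for reaching the goal, 0 otherwise
        return self.state, reward, done

def random_policy(state):
    """Policy: a mapping from state to action (here, simply uniform random)."""
    return random.choice([0, 1])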
The Learning Loop
The interaction between the agent and the environment follows a simple but powerful loop:
- The agent observes the current state of the environment.
- Based on this state, the agent chooses an action according to its current policy.
- The environment transitions to a new state as a result of the action.
- The environment gives the agent a reward (which can be positive, negative, or zero).
- The agent uses this state-action-reward information to update its policy, so that next time, it is more likely to choose actions that lead to higher rewards.
This loop continues until the agent has learned an optimal policy.
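The same loop can be written directly in code. This sketch reuses the illustrative GridEnvironment and random_policy from the previous snippet; step 5 is left as a comment because the update rule depends on the algorithm (Q-Learning, described next, is one choice).

env = GridEnvironment()

for episode in range(3):
    state = env.reset()                              # 1. Observe the current state
    done = False
    while not done:
        action = random_policy(state)                # 2. Choose an action from the current policy
        new_state, reward, done = env.step(action)   # 3-4. Environment transitions and returns a reward
        # 5. A learning agent would update its policy here using (state, action, reward, new_state)
        state = new_state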
Q-Learning: A Simple RL Algorithm
One of the most fundamental RL algorithms is Q-Learning. It’s a model-free, off-policy algorithm that learns the value of taking a particular action in a particular state. It does this by learning a Q-function, which estimates the total discounted future reward we can expect if we take a certain action a from a certain state s and then follow the optimal policy thereafter.
The Q-value is updated with the following rule, derived from the Bellman optimality equation:
\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]
Where:
- \( \alpha \) is the learning rate.
- \( \gamma \) is the discount factor, which determines the importance of future rewards.
- \( r \) is the immediate reward.
- \( s' \) is the new state.
- \( \max_{a'} Q(s', a') \) is the maximum Q-value for the next state over all possible actions \( a' \).
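Translated into code, this update is only a few lines. The sketch below is generic and assumes a NumPy array q_table indexed by (state, action); the function name q_update and its default arguments are illustrative, and the full worked example in the next section inlines the same arithmetic rather than calling a helper.

import numpy as np

def q_update(q_table, state, action, reward, new_state, alpha=0.1, gamma=0.9):
    """Apply one Q-Learning update to q_table[state, action] in place."""
    td_target = reward + gamma * np.max(q_table[new_state, :])  # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_table[state, action]               # how far off the current estimate is
    q_table[state, action] += alpha * td_error                  # nudge the estimate toward the target
    return q_table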
A Conceptual Q-Learning Example in Python
Let’s illustrate Q-Learning with a very simple text-based game. Imagine a 1D world (a line of 5 cells), where the agent starts at position 0 and wants to reach the goal at position 4.
import numpy as np

# --- Environment Setup ---
# 1D world: [S, _, _, _, G] where S is start, G is goal
num_states = 5
goal_state = 4
q_table = np.zeros((num_states, 2))  # 2 actions: 0 (left), 1 (right)

# --- Hyperparameters ---
learning_rate = 0.1
discount_factor = 0.9
num_episodes = 1000
exploration_rate = 1.0   # Start with 100% exploration
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

# --- Training Loop ---
for episode in range(num_episodes):
    state = 0      # Start at the beginning
    done = False

    while not done:
        # --- Exploration-Exploitation Trade-off ---
        if np.random.uniform(0, 1) < exploration_rate:
            action = np.random.choice([0, 1])       # Explore: choose a random action
        else:
            action = np.argmax(q_table[state, :])   # Exploit: choose the best known action

        # --- Take action and observe new state and reward ---
        if action == 1:  # Move right
            new_state = state + 1
        else:            # Move left
            new_state = state - 1

        # Keep agent within bounds
        new_state = np.clip(new_state, 0, num_states - 1)

        # Define reward
        if new_state == goal_state:
            reward = 1
            done = True
        else:
            reward = 0

        # --- Update Q-table using the Bellman update ---
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_factor * np.max(q_table[new_state, :]))

        state = new_state

    # Decay exploration rate
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

print("Final Q-table:")
print(q_table)

This simple code shows how an agent can learn a “Q-table” that tells it the expected future reward for moving left or right from any given position. After training, the agent will know that moving right is always the best policy for reaching the goal.
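As a quick sanity check (an illustrative addition, not part of the original listing), the greedy policy can be read straight out of the learned table:

greedy_actions = np.argmax(q_table, axis=1)         # best known action per state: 0 = left, 1 = right
print("Greedy action per state:", greedy_actions)   # expect 1 ("move right") for every non-goal state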
Conclusion
Reinforcement Learning is a powerful framework for solving problems that involve sequential decision-making. It’s the technology behind AI that can play complex games like Go (AlphaGo), control robotic arms, and optimize resource management in complex systems. While Q-Learning is a basic example, modern RL uses deep neural networks to approximate the Q-function or the policy itself (Deep Reinforcement Learning), enabling agents to learn in environments with an enormous number of states, like the screen pixels of an Atari game.