An Introduction to Reinforcement Learning: Learning by Doing

Unlike supervised learning, where models learn from labeled data, and unsupervised learning, where models find patterns in unlabeled data, Reinforcement Learning (RL) is about learning what to do—how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.

Reinforcement Learning is a paradigm of learning inspired by behaviorist psychology. It’s about an agent learning to behave in an environment by performing actions and seeing the results. This is how humans and animals often learn: through trial and error.

The Core Components of Reinforcement Learning

Every RL problem involves a few key elements:

  1. Agent: The learner or decision-maker. This is the algorithm we are training.
  2. Environment: The world through which the agent moves. The agent’s actions can change the state of the environment.
  3. State (S): A complete description of the state of the environment. It’s a snapshot of the current situation.
  4. Action (A): A choice made by the agent from a set of possible actions.
  5. Reward (R): A feedback signal from the environment. The agent’s sole objective is to maximize the total reward it receives over time.
  6. Policy (π): The agent’s strategy or “brain.” It’s a function that maps a given state to an action. The goal of RL is to find the optimal policy that maximizes the cumulative reward.
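
To make these pieces concrete, here is a minimal Python sketch of how they might look in code. The names (LineWorld, random_policy) are illustrative rather than taken from any library, and the environment is a tiny 1D corridor similar to the worked example later in this post.

import random

class LineWorld:
    """Environment: a row of 5 cells; the goal is the rightmost cell."""
    def __init__(self, num_states=5):
        self.num_states = num_states
        self.state = 0  # State (S): the cell the agent currently occupies

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action (A): 0 moves left, 1 moves right (clamped to the grid)
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.num_states - 1, self.state + move))
        # Reward (R): 1 for reaching the goal, 0 otherwise
        done = self.state == self.num_states - 1
        reward = 1 if done else 0
        return self.state, reward, done

def random_policy(state):
    # Policy (π): maps a state to an action; here just a random placeholder
    return random.choice([0, 1])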

The Learning Loop

The interaction between the agent and the environment follows a simple but powerful loop:

  1. The agent observes the current state of the environment.
  2. Based on this state, the agent chooses an action according to its current policy.
  3. The environment transitions to a new state as a result of the action.
  4. The environment gives the agent a reward (which can be positive, negative, or zero).
  5. The agent uses this state-action-reward information to update its policy, so that next time, it is more likely to choose actions that lead to higher rewards.

This loop continues until the agent has learned an optimal policy.
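
In code, that loop is only a few lines. The sketch below reuses the LineWorld environment and random_policy from the snippet above; a learning agent would replace the comment in step 5 with an actual update, as the Q-Learning section shows next.

env = LineWorld()
state = env.reset()                             # 1. observe the current state
done = False
while not done:
    action = random_policy(state)               # 2. choose an action from the policy
    new_state, reward, done = env.step(action)  # 3-4. environment returns new state and reward
    # 5. update the policy / value estimates here (see Q-Learning below)
    state = new_state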

Q-Learning: A Simple RL Algorithm

One of the most fundamental RL algorithms is Q-Learning. It’s a model-free, off-policy algorithm that learns the value of taking a particular action in a particular state. It does this by learning a Q-function, Q(s, a), which estimates the cumulative future reward we can expect if we take action a in state s and then follow the optimal policy thereafter.

The Q-function is updated with the following rule, which is based on the Bellman optimality equation:

[ Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma \max_{a'} Q(s', a') - Q(s, a) ] ]

Where:

  • ( \alpha ) is the learning rate.
  • ( \gamma ) is the discount factor, which determines the importance of future rewards.
  • ( r ) is the immediate reward.
  • ( s' ) is the new state.
  • ( \max_{a'} Q(s', a') ) is the maximum Q-value for the next state ( s' ) over all possible actions ( a' ).
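
Translated into code, the whole update is a couple of lines. The helper below is a sketch (not from any library) whose arguments mirror the symbols in the equation; q_table is assumed to be a NumPy array with one row per state and one column per action.

import numpy as np

def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * np.max(q_table[s_next])  # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_table[s, a]             # how far off the current estimate is
    q_table[s, a] += alpha * td_error                # nudge Q(s, a) toward the target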

A Conceptual Q-Learning Example in Python

Let’s illustrate Q-Learning with a very simple text-based game. Imagine a 1D world (a line of 5 cells), where the agent starts at position 0 and wants to reach the goal at position 4.

import numpy as np

# --- Environment Setup ---
# 1D world: [S, _, _, _, G] where S is start, G is goal
num_states = 5
goal_state = 4
q_table = np.zeros((num_states, 2)) # 2 actions: 0 (left), 1 (right)

# --- Hyperparameters ---
learning_rate = 0.1
discount_factor = 0.9
num_episodes = 1000
exploration_rate = 1.0 # Start with 100% exploration
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

# --- Training Loop ---
for episode in range(num_episodes):
    state = 0 # Start at the beginning
    done = False

    while not done:
        # --- Exploration-Exploitation Trade-off ---
        if np.random.uniform(0, 1) < exploration_rate:
            action = np.random.choice([0, 1]) # Explore: choose a random action
        else:
            action = np.argmax(q_table[state, :]) # Exploit: choose the best known action

        # --- Take action and observe new state and reward ---
        if action == 1: # Move Right
            new_state = state + 1
        else: # Move Left
            new_state = state - 1

        # Keep agent within bounds
        new_state = np.clip(new_state, 0, num_states - 1)

        # Define reward
        if new_state == goal_state:
            reward = 1
            done = True
        else:
            reward = 0

        # --- Update Q-table using Bellman equation ---
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
            learning_rate * (reward + discount_factor * np.max(q_table[new_state, :]))

        state = new_state

    # Decay exploration rate
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

print("Final Q-table:")
print(q_table)

This simple code shows how an agent can learn a “Q-table” that tells it the expected future reward for moving left or right from any given position. After training, the Q-values for moving right are higher than those for moving left in every non-goal state, so the learned policy is simply to always move right toward the goal.
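
Because q_table is a NumPy array, the learned policy can be read off directly: for each state, take the action with the highest Q-value. Appending a couple of lines like these to the script above makes that explicit.

greedy_policy = np.argmax(q_table, axis=1)  # expected to be 1 (right) for every non-goal state
print("Greedy action per state:", greedy_policy)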

Conclusion

Reinforcement Learning is a powerful framework for solving problems that involve sequential decision-making. It’s the technology behind AI that can play complex games like Go (AlphaGo), control robotic arms, and optimize resource management in complex systems. While Q-Learning is a basic example, modern RL uses deep neural networks to approximate the Q-function or the policy itself (Deep Reinforcement Learning), enabling agents to learn in environments with an enormous number of states, like the screen pixels of an Atari game.
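
As a taste of that idea, here is a minimal sketch of replacing the Q-table with a small network that maps a state vector to one Q-value per action. It uses PyTorch purely as an illustration, and the layer sizes are arbitrary assumptions rather than anything tied to the example above.

import torch
import torch.nn as nn

# A tiny Q-network: instead of indexing a table by state, a network maps a
# state vector to an estimated Q-value for each action.
q_network = nn.Sequential(
    nn.Linear(4, 64),   # 4-dimensional state observation in (illustrative size)
    nn.ReLU(),
    nn.Linear(64, 2),   # one Q-value per action out
)

state = torch.randn(1, 4)        # a dummy state observation
q_values = q_network(state)      # estimated Q(s, a) for every action
action = q_values.argmax(dim=1)  # greedy action selection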
