Long Short-Term Memory (LSTM): Overcoming RNNs' Limitations

As we discussed in the post on Recurrent Neural Networks (RNNs), simple RNNs struggle to remember information over long sequences. The main culprit is the vanishing gradient problem: as gradients are propagated back through many time steps, they shrink toward zero, so the network effectively stops learning from early inputs. How can a model translate a long paragraph if it forgets the beginning by the time it reaches the end? The solution lies in a more sophisticated architecture: Long Short-Term Memory (LSTM).

LSTMs are a special kind of RNN, introduced by Hochreiter & Schmidhuber in 1997. They are explicitly designed to avoid the long-term dependency problem. Their genius lies in a unique structure called the cell state and a series of gates that regulate the flow of information.

The Core Components of an LSTM

An LSTM cell contains three critical gates that work together to protect and control the cell state:

  1. Forget Gate: This gate decides what information should be thrown away from the cell state. It looks at the previous hidden state and the current input and outputs a number between 0 and 1 for each number in the cell state. A 1 represents “completely keep this,” while a 0 represents “completely get rid of this.”
  2. Input Gate: This gate decides which new information we’re going to store in the cell state. It has two parts: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of new candidate values.
  3. Output Gate: This gate decides what we are going to output. The output will be based on our cell state, but will be a filtered version. A sigmoid layer decides which parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate.

This gating mechanism allows LSTMs to selectively remember or forget information, enabling them to learn dependencies over hundreds of time steps, something a simple RNN typically cannot manage.
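
To make the gate descriptions concrete, here is a minimal NumPy sketch of a single LSTM time step. The function name lstm_step and the way the weights are passed in (one matrix and bias per gate) are illustrative choices for this post, not part of any library API.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b each hold four entries, one per
    computation: forget gate (f), input gate (i), candidate values (g),
    and output gate (o)."""
    Wf, Wi, Wg, Wo = W
    Uf, Ui, Ug, Uo = U
    bf, bi, bg, bo = b

    # Forget gate: a 0..1 value per cell-state entry ("keep" vs. "get rid of")
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)

    # Input gate: which entries to update, plus the tanh candidate values
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)
    g_t = np.tanh(Wg @ x_t + Ug @ h_prev + bg)

    # New cell state: keep part of the old state, add the scaled candidates
    c_t = f_t * c_prev + i_t * g_t

    # Output gate: filter the squashed cell state into the new hidden state
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Tiny example: a 2-unit cell processing a 3-dimensional input
rng = np.random.default_rng(0)
W = [rng.standard_normal((2, 3)) for _ in range(4)]
U = [rng.standard_normal((2, 2)) for _ in range(4)]
b = [np.zeros(2) for _ in range(4)]
h, c = lstm_step(rng.standard_normal(3), np.zeros(2), np.zeros(2), W, U, b)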

LSTM Implementation in Keras

Building an LSTM in Keras is very similar to building a SimpleRNN. You can simply swap the SimpleRNN layer for an LSTM layer.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

def build_lstm_model(vocab_size, embedding_dim, lstm_units, batch_size):
    """
    Builds a simple LSTM model.
    """
    model = Sequential([
        # Embedding Layer
        Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),

        # LSTM Layer
        # The LSTM layer replaces the SimpleRNN layer.
        # It has the same stateful and return_sequences parameters.
        LSTM(lstm_units,
             return_sequences=True,
             stateful=True,
             recurrent_initializer='glorot_uniform'),

        # Output Dense Layer
        Dense(vocab_size)
    ])
    return model

# Define some example parameters
VOCAB_SIZE = 10000
EMBEDDING_DIM = 256
LSTM_UNITS = 1024
BATCH_SIZE = 64

# Build the model
lstm_model = build_lstm_model(
    vocab_size=VOCAB_SIZE,
    embedding_dim=EMBEDDING_DIM,
    lstm_units=LSTM_UNITS,
    batch_size=BATCH_SIZE)

# Display the model's architecture
lstm_model.summary()

What’s Changed?

The only significant change from the previous RNN example is replacing SimpleRNN with LSTM. The LSTM layer in Keras encapsulates all the complex gate logic (forget, input, output), making it easy to use. The recurrent_initializer is often specified to ensure good weight initialization, which is important for training deep networks.
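
To see how the model above would actually be trained, here is a minimal sketch of compiling and fitting it on dummy data. The random sequences and the length of 100 are placeholders used only to illustrate shapes; in a real setup the targets would be the inputs shifted by one time step.

import numpy as np

# The final Dense layer has no activation, so the model outputs raw logits;
# a from_logits=True loss accounts for that.
lstm_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Dummy integer-encoded data: BATCH_SIZE sequences of length 100.
inputs = np.random.randint(0, VOCAB_SIZE, size=(BATCH_SIZE, 100))
targets = np.random.randint(0, VOCAB_SIZE, size=(BATCH_SIZE, 100))

# stateful=True requires a fixed batch size and no shuffling between batches.
lstm_model.fit(inputs, targets, batch_size=BATCH_SIZE, shuffle=False, epochs=1)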

Applications and Impact

The ability to capture long-range dependencies made LSTMs the state-of-the-art for many NLP tasks for years:

  • Machine Translation: Google Translate used LSTMs for a significant period.
  • Text Generation: LSTMs can write plausible text by predicting the next word in a sequence.
  • Speech Recognition: They are a key component in systems that transcribe speech to text.

Another popular variant is the Gated Recurrent Unit (GRU), introduced by Cho et al. in 2014. It combines the forget and input gates into a single “update gate” and has a simpler architecture overall, making it computationally cheaper than an LSTM while often delivering comparable performance.
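
As a rough sketch of how little changes in practice, here is the same stack with Keras's GRU layer swapped in, reusing the parameters defined earlier (gru_model is just an illustrative name):

from tensorflow.keras.layers import GRU

gru_model = Sequential([
    Embedding(VOCAB_SIZE, EMBEDDING_DIM, batch_input_shape=[BATCH_SIZE, None]),

    # GRU accepts the same return_sequences / stateful arguments as LSTM,
    # but internally uses an update gate and a reset gate instead of three gates.
    GRU(LSTM_UNITS,
        return_sequences=True,
        stateful=True,
        recurrent_initializer='glorot_uniform'),

    Dense(VOCAB_SIZE)
])

gru_model.summary()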

Conclusion

LSTMs and their variants like GRUs were a massive step forward for sequence modeling. They solved the critical problem of short-term memory that plagued simple RNNs and paved the way for many of the sophisticated language technologies we use today. While the Transformer architecture has since become dominant in many areas of NLP, understanding LSTMs is still fundamental to appreciating the evolution of deep learning for sequential data.
