NLP's quantum leap: from N-grams to GPT-4
David Wood & Marcel Marais
May 24, 2023
In recent years, Natural Language Processing (NLP) has witnessed remarkable progress, fuelled by breakthroughs in language modelling. These developments have revolutionised the way machines ingest and interact with human language, leading to unprecedented advancements in various natural language tasks such as question answering.
Language Models (LMs) are sophisticated systems designed to understand and generate text that closely resembles human language. While Language Models have been present in NLP for quite some time, research in the past five years, most notably the introduction of the transformer, has pushed the efficiency and scale of these models to new heights. This leap enables the training of models with hundreds of billions of parameters (learnable knobs / settings that a neural network adjusts to fit the training data) on massive datasets, giving rise to the concept of Large Language Models (LLMs). These LLMs, with their extensive language knowledge, have altered the landscape of NLP, unlocking new possibilities and pushing the boundaries of what can be achieved in language understanding and generation tasks.
What Preceded LLMs? Tracing the Evolution of Language Models
N-grams & Hidden Markov Models
Historically, NLP relied on statistical methods that involved the analysis of text corpora and the application of explicitly defined probabilistic models. One of the foundational techniques, n-grams, refers to contiguous sequences of 'n' items from a given text or speech. For instance, in the sentence "I love dogs", the bigrams (2-grams) are "I love" and "love dogs". These n-grams are used to predict the next item in a sequence, based on the occurrence of the previous 'n-1' items.
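To make the idea concrete, here is a minimal sketch (in Python, with a toy corpus we made up for illustration) of a bigram model that counts word pairs and predicts the most likely next word:

```python
# A minimal sketch of a bigram model: count word pairs in a toy corpus, then
# predict the most likely next word given the previous one.
from collections import Counter, defaultdict

corpus = "i love dogs . i love dogs . i love cats .".split()

bigram_counts = defaultdict(Counter)
for previous, current in zip(corpus, corpus[1:]):
    bigram_counts[previous][current] += 1

def predict_next(word):
    # P(next | word) is estimated from relative counts of the observed bigrams.
    counts = bigram_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("love"))  # -> 'dogs' ("love dogs" occurs twice, "love cats" once)
```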
Another pivotal method, the Hidden Markov Model (HMM), is a statistical model that assumes an underlying process is a Markov process with unobserved states. HMMs are especially useful in part-of-speech tagging, where each word in a sentence is assigned a tag based on its likelihood and the tags of preceding words. These models depended on counting occurrences of words and phrases in large datasets to determine their probabilities. However, a major limitation of these models was that they could only account for a fixed-size vocabulary. This meant that handling out-of-vocabulary words or phrases, which hadn't been seen in the training data, was a significant challenge.
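As an illustration of how such tagging works, the following is a small, self-contained sketch of Viterbi decoding over an HMM; the transition and emission probabilities are hand-specified for a toy example rather than learned from a corpus:

```python
# A minimal sketch of HMM part-of-speech tagging with the Viterbi algorithm.
# The tiny, hand-specified probabilities below are illustrative, not learned.
import math

states = ["DET", "NOUN", "VERB"]

trans = {  # P(tag_t | tag_{t-1}), with "<s>" marking the start of a sentence
    "<s>":  {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
    "DET":  {"DET": 0.01, "NOUN": 0.9, "VERB": 0.09},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit = {  # P(word | tag); unseen words get a tiny probability
    "DET":  {"the": 0.7, "a": 0.3},
    "NOUN": {"dog": 0.5, "walk": 0.2, "park": 0.3},
    "VERB": {"walk": 0.6, "walks": 0.4},
}

def viterbi(words):
    # best[t][s] = (log-prob of best path ending in state s at step t, backpointer)
    best = [{s: (math.log(trans["<s>"][s]) + math.log(emit[s].get(words[0], 1e-9)), None)
             for s in states}]
    for t in range(1, len(words)):
        col = {}
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p][0] + math.log(trans[p][s])) for p in states),
                key=lambda x: x[1])
            col[s] = (score + math.log(emit[s].get(words[t], 1e-9)), prev)
        best.append(col)
    # Backtrack the highest-scoring tag sequence
    tags = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(words) - 1, 0, -1):
        tags.append(best[t][tags[-1]][1])
    return list(reversed(tags))

print(viterbi(["the", "dog", "walks"]))  # -> ['DET', 'NOUN', 'VERB']
```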
Feature-based models
The next wave of NLP was characterised by feature-based models, where text was converted into a set of features (like word frequencies, presence/absence of certain words, etc.), which were then fed into machine learning algorithms. Support Vector Machines (SVMs) and Decision Trees were popular choices for tasks like text classification and sentiment analysis. These methods often required careful feature engineering, as the quality of the features would directly impact the model's performance. Domain knowledge was critical in designing these features, making NLP a highly specialised field.
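A minimal sketch of this style of pipeline, assuming scikit-learn is available, might combine TF-IDF features with a linear SVM for a toy sentiment task:

```python
# A minimal sketch of feature-based text classification: TF-IDF features
# fed into a linear SVM, built with scikit-learn (assumed installed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy sentiment data; a real system would use thousands of labelled examples.
texts = ["I loved this film", "great acting and plot",
         "terrible, a waste of time", "I hated every minute"]
labels = ["pos", "pos", "neg", "neg"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)

print(model.predict(["what a great film"]))  # -> ['pos']
```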
However, the limitations of these traditional methods became apparent as the volume and complexity of data increased. They struggled with understanding context, handling idiomatic expressions, and managing the vast nuances of human language. Moreover, they couldn't scale effectively with the explosion of data on the internet, which consisted of diverse languages, slang, and evolving linguistic structures.
Enter RNNs & LSTM networks
A significant milestone in NLP came with the popularisation of deep learning and neural networks. Specifically, Recurrent Neural Networks (RNNs) and their variant, Long Short-Term Memory (LSTM) networks, brought the concept of sequential processing to deep learning. These models could capture dependencies within the text by utilising recurrent connections - connections that allow information to be passed from one step, or time point, to the next - enabling more context-aware language understanding.
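The sketch below, assuming PyTorch is available, shows the shape of a small LSTM language model that predicts the next token at every step; the vocabulary size and dimensions are arbitrary:

```python
# A minimal sketch of an LSTM language model in PyTorch: the network reads
# tokens one step at a time and predicts the next token at each position.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor
        hidden_states, _ = self.lstm(self.embed(token_ids))
        return self.head(hidden_states)  # next-token logits at every step

model = LSTMLanguageModel(vocab_size=10_000)
tokens = torch.randint(0, 10_000, (2, 16))           # a dummy batch of token ids
loss = nn.functional.cross_entropy(
    model(tokens)[:, :-1].reshape(-1, 10_000),        # predictions for steps 1..T-1
    tokens[:, 1:].reshape(-1))                        # targets are the next tokens
loss.backward()
```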
While RNNs and LSTMs offered notable improvements and achieved state-of-the-art performance on many tasks for several years, they suffered from certain limitations. The issue of vanishing gradients (hindered information flow in a neural network) across long sequences posed a significant challenge. More importantly, the computational demands of training these models were often prohibitively high, impeding their scalability.
In general, these methods often struggled to capture the complex and nuanced nature of human language, which limited their effectiveness in more advanced applications such as generating fluent language.
What Enabled the Development of LLMs?
The LLMs of today, however, have done away with recurrence and instead favour the attention mechanism. In this section, we explore the key breakthroughs that have propelled the generative language modelling boom and enabled the open-source community to utilise these massive models with modest resources.
Transformers
The introduction of the transformer, as proposed by Vaswani et al., has dramatically shifted the focus of NLP researchers. Transformers rely on the concept of self-attention, which enables the model to selectively focus on different parts of the input sequence during processing. Attention allows for better capturing of dependencies and relationships within the text, resulting in more accurate language generation.
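As a rough illustration of the mechanism (not the full multi-head version used in practice), here is scaled dot-product self-attention written out in plain NumPy:

```python
# A minimal sketch of scaled dot-product self-attention, the core transformer
# operation, with randomly initialised projection matrices for illustration.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token representations; W_*: projection matrices."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V                            # each output mixes all positions

rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(5, d_model))                 # 5 tokens, processed in parallel
W = [rng.normal(size=(d_model, d_model)) for _ in range(3)]
print(self_attention(X, *W).shape)                # -> (5, 8)
```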
One of the key advantages of the transformer is its ability to process the entire input sequence in parallel, as opposed to sequential models such as RNNs. This is a significant departure, as it sidesteps the vanishing gradient problem and allows for extremely efficient training and inference.
Transformers form the foundational building block of many LLMs and have been used in various architectures, with differing training objectives. We will discuss these differences as they relate to popular models such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers) in the next section.
Reinforcement Learning from Human Feedback (RLHF)
Once an LLM has completed its initial, expensive training (usually called pretraining), it is not yet optimally suited to help with any specific task; in other words, the model has a very advanced understanding of language but does not “know” that it needs to be helpful in any particular way. A popular technique used to resolve this is to incorporate a secondary training phase involving human feedback and Reinforcement Learning (a type of machine learning where an agent learns to make decisions by interacting with an environment). This phase is often referred to as “alignment”. With this technique we bias the model’s generation towards human preferences; as an obvious example, when we ask a question we typically want it answered. RLHF is one of the driving factors behind the impressive performance of the famous ChatGPT and its predecessor, InstructGPT.
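RLHF involves several stages; the sketch below illustrates just one of them, training a reward model on human preference pairs with a pairwise ranking loss. The embeddings and the reward model here are simplified stand-ins for illustration, assuming PyTorch is available:

```python
# A minimal sketch of one RLHF component: a reward model trained on human
# preference pairs with a pairwise ranking loss. In practice the reward model
# is built on top of the LLM itself; here it is a single linear layer.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)  # maps a response embedding to a scalar reward

    def forward(self, response_embedding):
        return self.score(response_embedding).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Dummy embeddings for a batch of (preferred, rejected) response pairs,
# as labelled by human annotators.
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)

# The loss pushes the reward of the human-preferred response above the other one.
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
```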
Parameter-Efficient LLM Fine-tuning with Low-Rank Adaptation (LoRA)
Fine-tuning is another widely adopted technique for optimising the performance of LLMs in preparation for downstream tasks. Fine-tuning is a process in which a pre-trained model is further adjusted or tweaked to perform well on a specific task or dataset; this typically requires far less data and compute than the initial training. However, given the scale of LLMs (billions of parameters), even limited fine-tuning can be computationally expensive and time consuming. Luckily, recent research has introduced techniques like Low-Rank Adaptation (LoRA), which aim to make fine-tuning more efficient. LoRA uses low-rank matrix factorisation to approximate the updates applied to a language model’s large parameter matrices (exploiting the fact that these models have more parameters than necessary, or are “overparameterised”), enabling faster and cheaper adaptation to specific tasks. This approach has the potential to make fine-tuning more accessible and practical for a wide range of applications.
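A minimal sketch of the LoRA idea, assuming PyTorch and with illustrative dimensions, is to freeze a pretrained linear layer and learn only a low-rank update on top of it:

```python
# A minimal sketch of LoRA: freeze a pretrained weight matrix W and learn only
# a low-rank update B @ A on top of it. Dimensions and rank are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)        # stands in for pretrained weights
        self.base.weight.requires_grad_(False)        # frozen during fine-tuning
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # low-rank factors
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Effective weight is W + scale * (B @ A); only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(1024, 1024, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16384 trainable parameters vs ~1M frozen ones in the base layer
```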
Advancements in Hardware
Hardware developments have played a crucial role in pushing the boundaries of large-scale language models. Companies like NVIDIA have introduced specialised graphics processing units (GPUs) such as the A100 and H100. These GPUs contain “tensor cores”, which enable mixed-precision computing: running matrix operations in lower-precision formats such as float16, which makes the enormous computations involved in training tractable. In general, “AI native” hardware is designed to handle the intense computational requirements of training and deploying deep learning models.
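As a rough illustration of the kind of workload these chips accelerate, here is a single mixed-precision training step in PyTorch (assuming a CUDA GPU is available):

```python
# A minimal sketch of mixed-precision training in PyTorch: matrix multiplies run
# in float16 where safe, while a gradient scaler keeps optimisation numerically stable.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

inputs = torch.randn(32, 1024, device="cuda")
targets = torch.randn(32, 1024, device="cuda")

with torch.cuda.amp.autocast():                      # ops run in float16 where safe
    loss = torch.nn.functional.mse_loss(model(inputs), targets)

scaler.scale(loss).backward()                        # scale the loss to avoid underflow
scaler.step(optimizer)
scaler.update()
```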
Historical Foundation Models
In the rapidly changing world of LLMs it is almost impossible to make an up-to-date statement about the latest and greatest models. However, there are two notable families of transformer-based models that have made significant contributions to the field and have gained widespread attention. These are: BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). We aim to provide a brief explanation of these models and highlight the innovations they introduced.
BERT (Bidirectional Encoder Representations from Transformers): BERT, introduced by researchers at Google AI, proposed an innovative training objective. Unlike previous models that focused on left-to-right or right-to-left contexts, BERT introduced a bidirectional training approach. During pretraining, BERT is trained to predict missing words within a sentence by considering both the preceding and following words. This bidirectional objective allows BERT to capture a more comprehensive understanding of the contextual relationships between words. It also introduced the concept of pretraining followed by fine-tuning for various downstream tasks. BERT achieved state-of-the-art results on multiple NLP benchmarks, showcasing its prowess in tasks such as question answering, text classification, and named entity recognition. By capturing bidirectional context during pretraining, BERT excelled in understanding the nuances and subtleties of language, leading to significant improvements in language understanding tasks.
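To see the masked-word objective in action, here is a quick sketch using the Hugging Face transformers library (assuming it is installed) and the publicly available bert-base-uncased checkpoint:

```python
# A minimal sketch of BERT's masked-word objective in action, via the Hugging
# Face transformers library and the bert-base-uncased checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses both the left and right context to fill in the [MASK] token.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```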
GPT (Generative Pre-trained Transformer): GPT, developed by OpenAI, impressed NLP researchers with its ability to generate coherent and contextually relevant text. GPT employed a generative approach where the model is trained to predict the next word in a sentence given the previous words (this is what is meant by autoregressive). By training on a vast corpus of text data, GPT acquired a deep understanding of language patterns, allowing it to generate human-like text that captivated the NLP community. With subsequent iterations like GPT-2, GPT-3 and now GPT-4, these models achieved astonishing results, demonstrating capabilities in areas such as text completion, language translation, and even creative writing.
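A similar sketch of autoregressive generation, using the small, openly available GPT-2 checkpoint rather than the much larger GPT-3/4 models:

```python
# A minimal sketch of autoregressive text generation with GPT-2 via the
# transformers library; each new token is sampled conditioned on all previous tokens.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Large language models are", max_new_tokens=30, do_sample=True)
print(result[0]["generated_text"])
```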
Today, even the largest iteration of BERT is considered a small model. However, we believe it is still important to mention because it remains widely used, mostly in non-generative contexts.
Concluding remarks
As we've transitioned from simple statistical methods to state-of-the-art deep learning models, the once clear boundaries of what machines could understand and generate have been significantly expanded, bringing us closer to Artificial General Intelligence. Yet, as with all fields, NLP will continue to evolve. While we've witnessed great leaps in recent years, the ever-changing nature of language ensures that there will always be new challenges to conquer.