The transformer architecture has taken the natural language processing (NLP) industry by storm. It is one of the most important ideas to emerge in NLP in the last decade. Transformers gave a colossal boost to language models, making it possible to use them for advanced tasks such as writing essays, summarizing texts, and writing code. All popular modern language models, like Google's BERT or OpenAI's GPT, have transformers at their core.
In this article, I want to explain how transformers work, discuss the benefits and limitations of this architecture, and answer the question, "Is transformer architecture the future of language models?"
What is a transformer architecture?
A transformer is a type of neural network architecture that is primarily used to understand texts. Neural networks come in many architectures, including recurrent neural networks (RNNs), which were the standard choice for language tasks for many years. An RNN builds representations of each word in a sentence sequentially, i.e., one word at a time. The downside of models built on the RNN architecture is that they have short-term memory: they are really bad at analyzing long blocks of text. By the time the model reaches the end of a long block of text, it has largely forgotten what happened at the beginning of the block. RNNs are also hard to train because they do not parallelize well (they process words sequentially, which limits opportunities to scale model performance).
A paper called "Attention Is All You Need," published in 2017, introduced the concept of transformers. The word 'transformer' suggests something that transforms from one state to another. The original transformer was designed for translating text from one language to another. This architecture allows the creation of language models that can handle many scenarios, such as writing essays, summarizing large texts, and even generating code.
How transformers work
This article is not intended to be a technical specification of the transformer architecture, so we won't dive into many technical details. Instead, I want to focus on the basic concept of the transformer architecture as well as three specific properties of transformers that make them so powerful. If you're interested in learning more about the technical aspects of transformers, check out the article "Transformers are Graph Neural Networks" or watch the video "Transformers Neural Network: A step by step explanation."
Transformers employ an encoder/decoder architecture, much like an RNN. The encoder works on the input sequence, and the decoder operates on the target sequence. The encoder generates encodings that define which parts of the input sequence are relevant to each other. The decoder uses these encodings to generate the output sequence. A transformer takes the sequence you provide as input, breaks it down into tokens, and tries to predict the next word in the output sequence.
Nodes and data vectors
Similar to encoder/decoder mechanics, nodes and data vectors are things that are used in many networks. At a high level, all neural network architectures build representations of input data as vectors. The transformer model consists of the nodes that store vectors. Vectors encode useful semantic information about the data, and this information is used for practical tasks such as translating text from one language to another. The model works because nodes communicate with each other when they search for specific things via broadcasting. Nodes also update each other on the results of an interaction.
Now let's talk about things that are specific to transformers. Positional encoding, attention, and self-attention are the three main things that set the transformer architecture apart from others.
Positional encoding is the idea that instead of looking at the words in a sentence sequentially, you take each word, give it a number according to its position in the sentence, and then provide both to the system. For example, "Tourists love the weather in Hawaii" becomes "Tourists (1)," "love (2)," "the (3)," "weather (4)," "in (5)," and "Hawaii (6)." The information about word order is stored in the data itself, and the model learns how to interpret word order from that data.
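The numbering idea above can be sketched in a few lines of Python. The first function simply pairs each word with its position, as in the Hawaii example. The second shows the sinusoidal scheme described in the original transformer paper, which turns a position into a small vector instead of a single number. The function names and the tiny vector size `d_model=4` are illustrative assumptions, not part of any library.

```python
import math

def number_words(sentence):
    """Pair each word with its 1-based position in the sentence."""
    return [(word, pos) for pos, word in enumerate(sentence.split(), start=1)]

def sinusoidal_encoding(position, d_model=4):
    """Sinusoidal positional encoding for one 0-based position.

    Even indices use sine and odd indices use cosine; each sine/cosine
    pair shares a frequency, and frequencies shrink across the vector.
    """
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

pairs = number_words("Tourists love the weather in Hawaii")
print(pairs)
for word, pos in pairs:
    # Positions are 1-based in the text, 0-based in the encoding.
    print(word, [round(x, 3) for x in sinusoidal_encoding(pos - 1)])
```

In a real model these position vectors are added to the word vectors, so every word carries its position with it no matter in what order the words are processed.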
What makes transformers different from RNNs is that they do not necessarily process data in order. Instead, transformers rely on the attention mechanism to 'understand' the context: they figure out how important individual words in the sentence are and what part of the input the model should focus on. As a result, rather than starting the translation with the first word in the sequence (an RNN would start with the word "Tourists"), transformers aim to understand the meaning of each word in the sequence. They try to match words in the output and input to figure out the right position.
The benefit of the attention mechanism is that it doesn't suffer from the problem of short-term memory, and it helps transformers overcome the problem that RNN architecture has.
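As a rough illustration of how attention decides what to focus on, here is a minimal sketch of scaled dot-product attention, the formulation used in the original transformer paper, written in plain Python for a single query. The vectors are made-up two-dimensional values chosen for readability; real models learn vectors with hundreds of dimensions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # One similarity score per input word.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is a weighted average of the value vectors.
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical vectors for three input words.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

Note that the score between the query and a word does not depend on how far apart they are in the sequence, which is why attention does not share the RNN's short-term memory problem.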
But how do transformers understand the right position? The model learns the relationships between words in the input and output during training. It is trained on a massive corpus of text so that it picks up the grammatical specifics of the language. When doing operations like translation from one language to another, the language model aligns the words in the two languages that it works with.
The same word in different sentences can have different meanings. Self-attention is a mechanism that helps solve this problem. As the name suggests, self-attention is attention with respect to oneself. Self-attention aims to understand the underlying meaning of the words in sentences. This method captures contextual relationships between individual words in the sentence; it focuses on understanding the context in which the word is used. For example, "We are on the ship" and "Ship this package today" both have the word 'ship' in them, but the context in which words are used is different. The language model analyzes the context, and, in the first case, it understands that the "ship" is a vessel, while in the second case, it means action.
The model learns this distinction by training on a large corpus of data. The neural network learns to build better contextual understanding by receiving feedback, usually via error/loss functions.
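The "ship" example can be made concrete with a toy sketch of self-attention, in which every word in a sentence attends to every word in the same sentence. This is a simplification under stated assumptions: the two-dimensional word vectors are invented by hand (the first number loosely stands for "place or object", the second for "action"), and the learned query, key, and value projections that a real transformer applies are left out.

```python
import math

# Hypothetical hand-made embeddings; a real model learns these.
EMBEDDINGS = {
    "we": [0.2, 0.1], "are": [0.1, 0.2], "on": [1.0, 0.0],
    "the": [0.1, 0.1], "ship": [0.5, 0.5], "this": [0.1, 0.1],
    "package": [0.0, 1.0], "today": [0.1, 0.3],
}

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(words):
    """Return one context-aware vector per word in the sentence."""
    vectors = [EMBEDDINGS[w] for w in words]
    scale = math.sqrt(len(vectors[0]))
    contextual = []
    for query in vectors:
        # Each word scores every word in its own sentence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        contextual.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(query))]
        )
    return contextual

sentence_a = ["we", "are", "on", "the", "ship"]
sentence_b = ["ship", "this", "package", "today"]

ship_a = self_attention(sentence_a)[sentence_a.index("ship")]
ship_b = self_attention(sentence_b)[sentence_b.index("ship")]
print("ship as a vessel:", [round(x, 3) for x in ship_a])
print("ship as an action:", [round(x, 3) for x in ship_b])
```

The word "ship" starts from the same vector in both sentences, but its context-aware vector leans toward the first dimension next to "on" and toward the second next to "package". That shift is the mechanism that lets a model tell the two meanings apart.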
What made transformers so good?
Andrej Karpathy perfectly summarized the advantages of transformers:
A transformer is a general-purpose computer that is also trainable. General purpose means that you can train it on arbitrary problems. You can feed transformers any kind of content (text, images, speech), and it will process it.
Easy to train
Language models that have transformers at their core can be trained much more easily than models based on RNNs. Language models like GPT are called large language models (LLMs) because they are trained on vast datasets. For example, GPT-3 was trained largely on text from the public web; the raw web crawl it drew on amounted to roughly 45 terabytes of compressed text before filtering.
Efficient computing and good scalability
The cost of computing is a significant factor that impacts AI models. Transformers run very efficiently on current hardware because they are designed around parallel operations rather than sequential ones. The attention mechanism is computed in parallel for each word in the sentence, and transformers can also process multiple sequences in parallel. This ability to parallelize efficiently gives transformers a massive advantage over other models.
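The difference between sequential and parallel computation can be shown with a small sketch. The toy RNN below has to walk through the input in order because each state depends on the previous one, while the toy attention function can be evaluated for every position independently. Both functions, their scalar inputs, and the `weight` parameter are simplified assumptions made for illustration. The thread pool only demonstrates that the positions are independent; real implementations get their speed from batched matrix operations on GPUs, not from Python threads.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def rnn_states(inputs, weight=0.5):
    """Toy recurrent update: each state needs the previous state first."""
    state, states = 0.0, []
    for x in inputs:  # this loop cannot be reordered or split up
        state = math.tanh(weight * state + x)
        states.append(state)
    return states

def attend(position, inputs):
    """Toy attention for one position, using scalars as query, key, and value."""
    scores = [inputs[position] * x for x in inputs]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))

inputs = [0.1, 0.4, -0.2, 0.7]

# One position at a time...
sequential = [attend(i, inputs) for i in range(len(inputs))]

# ...or all positions handed out to workers at once.
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(lambda i: attend(i, inputs), range(len(inputs))))

print(rnn_states(inputs))
print(sequential == parallel)
```

Because no attention output depends on another, the work can be spread across as many processing units as the hardware offers, which is what makes large-scale training practical.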
Is Transformer architecture the future of language models?
Despite all the advantages that large language models have, they are criticized by some experts in the field. A paper called "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?", released in 2021, describes significant downsides of large language models such as BERT, GPT-2, and GPT-3.
Here are three downsides that might affect the evolution of language models and have a direct or indirect impact on the transformer architecture:
High resource consumption
The progress that language models have made in the last few years was achieved largely by increasing their size. Models like GPT were trained on large amounts of data. It has been shown that transformer models continue to benefit from larger quantities of data: training on a large dataset and then fine-tuning for specific tasks leads to better performance. So model creators achieve high performance by increasing the size of the training dataset.
But environmental impact scales with model size. Large language models consume far more resources than their predecessors, which leads to rising environmental and financial costs. According to an estimate cited in the paper, training a single BERT base model on GPUs requires as much energy as a trans-American flight. We will most likely see a trend toward reducing the size of models using various optimization techniques.
Risks of training models on open datasets
GPT was trained on large, uncurated datasets from the web. The size of the data available on the web has enabled deep learning models to achieve high accuracy on specific NLP tasks. But this type of training naturally encodes and reinforces biases that negatively affect the system's output. LLMs can reproduce and amplify the biases in the data they are trained on, and they can produce racist, sexist, and extremist responses.
This happens because the large size of a dataset doesn't guarantee diversity. Even though GPT models use filtered datasets, they still suffer from problems. For instance, part of the data used for training the GPT-2 model was sourced from links posted on Reddit. Pew Internet Research's 2016 survey reveals that almost 70% of Reddit users in the US are men and more than 60% are between the ages of 18 and 29. No wonder GPT-2 inherited some of the stereotypes and biases common to this segment of users.
Most likely, we will see more effort into curating and documenting datasets. Model creators will likely need to spend more time assembling datasets suited for the tasks rather than feeding the model massive amounts of data from public Internet sources. For example, a mechanism for measuring the toxicity of text generated by LLMs might be used to minimize the risk of creating biases.
A language model is not general intelligence
Language models built on top of the transformer architecture do not have access to the meaning of the information they analyze. LLMs do natural language processing (NLP), but they do not perform natural language understanding (NLU). As a result, any claims that GPT or BERT is true AI are false. The paper mentioned above is called "On the Dangers of Stochastic Parrots" because it argues that current language models don't understand the underlying meaning of the content they analyze. But LLMs are good at mimicking this understanding by manipulating linguistic form, which can convince users that the model knows what it's talking about. And this might be dangerous, especially when users rely on the model when they make critical decisions.
Sam Altman, CEO of OpenAI, the company behind GPT, is open about the risk of using large language models in daily work.
Transformers are taking over AI, but despite all the benefits that transformers bring to the world of NLP, the future of transformers is still uncertain. The transformer architecture has been resilient since 2017. Small changes were introduced in the architecture of models like BERT and GPT to make them more efficient, but progress in NLP was mainly achieved by scaling the size of training sets and the number of parameters that models have. The architecture itself was kept largely unchanged.
Model creators didn't change the architecture for a reason: transformers are remarkably stable. But it feels like we'll soon reach the moment when we witness the limits of the transformer architecture. Rumors say that OpenAI might try to optimize properties in the architecture of GPT-4 to improve the model's efficiency without increasing the size of the training set and the number of parameters. Or maybe we will soon find a better architecture than transformers.