What is a Transformer?
June 28, 2023
The paper “Attention Is All You Need” kept coming up in my circles as something to read, and I finally found the time to look at it. It introduces an architecture that has been a game-changer in the field, particularly when it comes to natural language processing. For those in the know, it’s called a Transformer. It’s old news at this point, but for the uninitiated, what exactly is a Transformer, and how does it work? Let’s break it down.
Transformers: Pay attention! #
Picture this: you’re trying to translate a sentence from one language to another. The traditional approach would be to read the sentence from start to finish, remember everything, and then write out the translation. But this can be quite the challenge, especially for long sentences where you have to remember a ton of information.
Enter the Transformer. Instead of reading the sentence from start to finish, the Transformer takes a look at all the words at once. It pays “attention” to different words depending on what it’s currently trying to translate. For instance, if it’s translating the word “it” in the sentence “The cat chased its tail because it was bored”, the Transformer knows to pay more attention to “The cat” and “was bored” because those parts of the sentence help determine what “it” refers to.
The secret sauce here is something called “attention mechanisms”. These mechanisms allow the model to focus on different parts of the input sentence and decide which parts are important for the current word it’s trying to translate.
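To make that a bit more concrete, here’s a minimal sketch of the scaled dot-product attention described in the paper, written in plain NumPy. The tiny random “sentence” and the dimensions below are made-up toy inputs, not anything from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """The core attention step from "Attention Is All You Need":
    score every query against every key, turn the scores into weights,
    and return a weighted mix of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # how relevant is each word to each other word
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for the softmax below
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row now sums to 1
    return weights @ V, weights

# A toy "sentence" of 5 words, each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 4))

# Self-attention: the words act as queries, keys, and values all at once.
context, weights = scaled_dot_product_attention(words, words, words)
print(weights.round(2))  # row i = how much word i attends to every word in the sentence
```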
Two Parts: The Encoder & Decoder #
The Transformer is made up of two main parts: the encoder and the decoder. The encoder reads the input sentence and creates a sort of map that the decoder uses to generate the translation. Both the encoder and the decoder use attention mechanisms to figure out which words to focus on at each step in the process.
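For a sense of how the two halves fit together, here’s a tiny sketch using PyTorch’s built-in nn.Transformer module. The sizes and the random tensors are purely illustrative stand-ins for real word embeddings.

```python
import torch
import torch.nn as nn

# Encoder + decoder in one module. d_model is the size of each word vector;
# nhead is the number of attention heads. The values here are arbitrary toys.
model = nn.Transformer(d_model=32, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

# A fake 10-word source sentence and a 7-word partial translation,
# already turned into vectors (shape: sequence length, batch, d_model).
src = torch.rand(10, 1, 32)
tgt = torch.rand(7, 1, 32)

# The encoder reads `src` and produces the "map" (one context vector per word);
# the decoder attends over that map while producing each output token.
out = model(src, tgt)
print(out.shape)  # torch.Size([7, 1, 32]) -- one vector per target position
```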
One of the coolest things about the Transformer is that it can look at all the words in the sentence at the same time, which makes it more efficient and faster at translating sentences than older models that read sentences from start to finish.
But Doesn’t That Use Up More Resources? #
You’d think that looking at the entire sentence at once would use up more computational resources, but the Transformer is designed to be efficient. The key is in the “attention” mechanism. Instead of processing the words one by one, the Transformer calculates a score for every word that represents how much “attention” should be paid to it when translating a particular word. Each score reflects the relationship between the word being translated and one other word in the sentence.
These scores are then used to create a weighted combination of all the words in the sentence, which forms a kind of context for the word being translated. This context is used by the Transformer to generate the translation.
Because all these scores can be calculated in parallel, the Transformer is able to process the entire sentence at once. On hardware built for parallel work, like GPUs, this makes the Transformer much faster at translating sentences than older models that have to step through a sentence word by word.
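As a rough illustration of what “calculated in parallel” means here, the scores for every pair of words come out of a single matrix multiplication rather than a word-by-word loop (again just toy NumPy with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
sentence = rng.normal(size=(8, 16))  # 8 words, 16-dimensional vectors

# One pair at a time: the naive, word-by-word way to fill in the score table.
scores_loop = np.zeros((8, 8))
for i in range(8):
    for j in range(8):
        scores_loop[i, j] = sentence[i] @ sentence[j]

# All at once: the same table from a single matrix multiply, which is the
# kind of operation GPUs are extremely good at running in parallel.
scores_matmul = sentence @ sentence.T

print(np.allclose(scores_loop, scores_matmul))  # True -- same scores, one shot
```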
One caveat worth noting: because the attention scores form a table with an entry for every pair of words, the memory the Transformer needs grows with the square of the sentence length. Older models like recurrent neural networks take the opposite trade-off: they read the sentence word by word and squeeze everything they’ve seen so far into a single fixed-size vector (a list of numbers), which keeps memory low but makes it easy to lose track of details in long sentences. The Transformer spends more memory so it can keep a vector for every word and let attention reach back to any of them directly.
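A quick back-of-the-envelope check on that memory cost, with toy numbers and assuming one 32-bit float per attention score:

```python
# Size of the attention score table for a single attention head,
# at 4 bytes per score. The table has one entry per pair of words.
for n_words in (10, 100, 1_000, 10_000):
    table_bytes = n_words * n_words * 4
    print(f"{n_words:>6} words -> {table_bytes / 1e6:>8.2f} MB of scores")
```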
Wrapping Up #
What’s struck me most is the uncanny resemblance between working with these models and the inner workings of a Transformer itself. Prompt engineering, in many ways, feels like operating a Transformer on a macro level. It’s as if I’m tweaking the attention mechanism of the model with each prompt, guiding it to focus on the elements that will yield the most meaningful output. It’s a fascinating parallel that’s given me a better model for thinking about how to better “grab the attention” of LLMs.