What Are Transformers In Machine Learning

What are they. Why do we need them. Where are they used.

Apr 05, 2022


Machine Learning (ML) is progressing at a rapid pace. Many ML models are being built to address a variety of use cases. When it comes to Natural Language Processing (NLP), you need models that can understand sequential data. Sentences are sequences of words. If you build a model that can understand sentences well, you can build many applications -- search engines, Q&A platforms, language translators, speech recognizers.

A transformer is one such model. And transformers have completely taken the NLP world by storm. What's so special about them? Why do we need them when we already have neural networks that can model sequential data?

Why do we need sequence-to-sequence learning models?

When it comes to building models for sequential data, you need models that account for its sequential nature. Or else it will be like trying to fit a square peg into a round hole. In this situation, we need sequence-to-sequence learning models.

Their job is to understand a sequence and convert that into another sequence based on the task at hand.

Language translation is a good example of one such task: the model has to understand a sequence of words in one language and convert it into a sequence of words in another language. That's why models such as Recurrent Neural Networks (RNNs) were built. Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) were later built to improve upon plain RNNs.

What's the problem?

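The basic issue with these models is that they struggle to remember long-term information. The problem gets bigger as the sequence gets longer. Let's take the example of sentences. These models have a mechanism for information to propagate down the sequence of words.

But in practice, the likelihood that we retain the information reduces as we get further away from a word.

This is closely tied to the vanishing gradient problem. During training, the error signal has to travel backward through every step of the sequence, and it shrinks a little at each step. By the time it reaches the early words, there is almost nothing left to learn from. Due to this issue, the model has a hard time learning longer sequences. LSTMs and GRUs reduce the problem, but they don't eliminate it.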
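To see how quickly that signal can fade, here's a minimal sketch in Python. It assumes a toy RNN with a single scalar hidden state, h_t = tanh(w * h_{t-1} + x). The function name and the values w = 0.9 and x = 0.5 are made up for illustration. A real RNN uses weight matrices instead of scalars, but the same chain-rule product shows up.

```python
import math


def gradient_through_time(num_steps, w=0.9, x=0.5):
    """Return d h_T / d h_0 for the toy scalar RNN h_t = tanh(w * h_{t-1} + x).

    w and x are illustrative values, not taken from any real model.
    """
    h = 0.0
    grad = 1.0
    for _ in range(num_steps):
        h = math.tanh(w * h + x)
        # Chain rule for one step: d h_t / d h_{t-1} = w * (1 - h_t^2).
        # With |w| < 1 this factor is always below 1, so the product shrinks.
        grad *= w * (1.0 - h * h)
    return grad


for steps in (1, 5, 20, 50):
    print(f"{steps:>2} steps back: gradient = {gradient_through_time(steps):.3e}")
```

Every extra step multiplies the gradient by another number smaller than 1, so the influence of an early word on the final error decays exponentially with distance.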

Another issue with RNNs is that they process the data sequentially. In the case of language translation, an RNN reads one word at a time as the input and produces one word at a time as the output. Each step has to wait for the previous one to finish, so the computation is inefficient and training takes a long time.

This is where transformers come into the picture.

What exactly is a Transformer?

A transformer is a neural network for sequence learning. It was introduced in a seminal 2017 paper by Google Brain titled Attention Is All You Need. Wait a minute, I thought love is all we need. Because The Beatles sang "All You Need Is Love" back in 1967. But that's a topic for another day.

The original transformer architecture consists of a stack of encoders and a stack of decoders. It uses a mechanism called attention along with positional encoding and layer normalization to deliver amazing results.

Transformers are usually trained in two stages: unsupervised pretraining followed by supervised fine-tuning. This combination is sometimes described as semi-supervised learning. Pretraining is typically done on a huge dataset of unlabeled text, and it gets a model most of the way there for many NLP tasks. To build a model for a specific task, we start with this pretrained model and fine-tune it on training data specific to that task. Let's look at a few key characteristics below.

Encoder/decoder: Transformers use an encoder/decoder framework to construct the neural network. The encoder converts the input sequence into a sequence of vectors, one per input word, where each vector captures that word in its context. These vectors are fed to the decoder, which produces the output sequence one word at a time.

Attention: Transformers use the concept of attention to understand the relationship of a word with all the other words. CNNs use convolution and RNNs use recurrence. Well, transformers use self-attention. The idea here is to use a function that can help us learn the context. It means that we can use other words in the sequence to get a better understanding of the word in question. To learn the context, the function that we use is the scaled dot product: we score every pair of words with a dot product, scale the scores down, and turn them into weights that say how much each word should attend to every other word. To make it more robust, the architecture uses multi-head attention, which runs several attention functions in parallel so that each head can focus on a different kind of relationship. The decoder also uses masked multi-head attention. This mechanism masks the next words in the sentence so that the model can't look at future words while it's predicting the current one.
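Here's a minimal sketch of scaled dot-product self-attention in plain Python. To keep it short, it assumes the queries, keys, and values are just the word vectors themselves. A real transformer first passes them through learned projection matrices and runs several heads side by side. The function names and the toy vectors are my own choices for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are non-negative and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors.

    With causal=True, position i may only attend to positions 0..i,
    which is the masking used in the decoder.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future words
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        mixed = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-dimensional "word vectors" (hypothetical values).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words, words, words, causal=True)
for row in weights:
    print([round(p, 3) for p in row])
```

With the mask switched on, the first word can only attend to itself, the second word to the first two, and so on. Set causal=False and every word attends to every other word, which is what the encoder does.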

Positional encoding: Transformers don't process words one after another. They look at a sequence as a whole. But how do we retain the information about the order of the words? That's where positional encoding comes in. It adds a position-dependent signal to each word's representation, which lets the model know where each word sits in the sentence and how far apart two words are.
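One way to build that signal is the fixed sinusoidal scheme from the original paper, where even dimensions use a sine and odd dimensions use a cosine at gradually decreasing frequencies. Here's a rough sketch. The function name, the tiny sizes, and the example embedding are illustrative assumptions, and many later models learn their position vectors instead of computing them like this.

```python
import math


def positional_encoding(num_positions, d_model):
    """Build a table with one d_model-sized position vector per position.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    table = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (dim // 2)) / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=6)

# The position vector is simply added to the word embedding.
embedding = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # hypothetical word embedding
word_at_position_2 = [e + p for e, p in zip(embedding, pe[2])]
print([round(v, 3) for v in word_at_position_2])
```

The same word gets a different final vector at each position, which is how the model can tell "dog bites man" apart from "man bites dog" even though it sees all the words at once.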

Why is it better?

Other sequence learning models such as RNNs take one input at a time, so training them is a slow process. Each step depends on the result of the previous step, so they can't spread the work for a single sequence across parallel hardware.

Given the architecture of transformers, they can take advantage of parallel computing.

They don't process the data sequentially, so the computations for all the words in a sequence can be performed in parallel during training. This allows them to take advantage of high-speed GPUs and train models much faster than other sequence learning models.
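The difference is easy to see in a toy example. In the sketch below, the recurrent loop has to run in order because each state needs the previous one, while each attention output reads only the raw inputs and can be computed in any order. Both functions work on plain numbers instead of vectors, and the names and values are made up for illustration.

```python
import math


def rnn_states(inputs, w=0.5, u=1.0):
    """Toy scalar RNN: each hidden state depends on the one before it."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w * h + u * x)  # cannot start until the previous h exists
        states.append(h)
    return states


def attend(position, inputs):
    """Toy scalar self-attention output for a single position.

    It reads only the raw inputs, never another position's output.
    """
    scores = [inputs[position] * x for x in inputs]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum((e / total) * x for e, x in zip(exps, inputs))


inputs = [0.2, -1.0, 0.7, 1.5]  # hypothetical one-number "word vectors"

# Attention: compute the positions front to back, then back to front.
in_order = [attend(i, inputs) for i in range(len(inputs))]
backwards = {i: attend(i, inputs) for i in reversed(range(len(inputs)))}
reordered = [backwards[i] for i in range(len(inputs))]

print(in_order == reordered)
print(rnn_states(inputs))
```

Because the order doesn't matter for attention, a GPU can compute every position at the same time as one big matrix operation. The RNN loop has no such shortcut.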

How is it used in the real world?

Transformers have been amazingly successful in the world of NLP. They have been used for a wide variety of applications such as:

Language translation
Answering questions
Search engines
Sentiment analysis
Document summarization
Text generation
Next sentence prediction
Biological sequence analysis
Video analysis

Where to go from here?

There are many pretrained transformers available such as GPT-2, GPT-3, BERT, XLNet, and RoBERTa. Many companies have built transformers and released them for public use. You can find them through frameworks and platforms such as TensorFlow, PyTorch, Hugging Face, and OpenAI. The goal is to drive widespread adoption of transformers for NLP applications. You can save a lot of training time using these pretrained models. And it will save you money too since you don't have to rent high-end GPUs to train a model from scratch. Start with these pretrained models and fine-tune them using your own training data to achieve the results you want.



Infinite Curiosity Newsletter
