Transformers, BERT, and GPT (Chapter 3)

I’ve recently been reading a book called Transformer, BERT, and GPT: Including ChatGPT and Prompt Engineering by Oswald Campesato (2024).

The book is divided into 10 chapters. Here is a summary of the third chapter (Transformer Architecture).

Sequence-to-sequence (seq2seq) models and the “encoder-decoder” design are closely related. A seq2seq model consists of two multilayer LSTMs (long short-term memory networks): the first maps the input to a fixed-size vector, and the second LSTM decodes that vector to produce a sequence in the target language. Google Translate switched to this approach in 2016.
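
To make this concrete, here is a minimal PyTorch sketch of an LSTM-based encoder-decoder. The class name, dimensions, and vocabulary sizes are my own illustrative choices, not taken from the book.

```python
# Minimal sketch of an LSTM-based encoder-decoder (seq2seq) model.
# Names and dimensions are illustrative only.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        # Encoder: compresses the source sentence into a fixed-size state.
        self.encoder = nn.LSTM(emb, hidden, num_layers=2, batch_first=True)
        # Decoder: generates the target sequence from that state.
        self.decoder = nn.LSTM(emb, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, (h, c) = self.encoder(self.src_emb(src_ids))    # fixed-size context
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)                           # logits per target position

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.randint(0, 8000, (4, 12)), torch.randint(0, 8000, (4, 10)))
print(logits.shape)  # torch.Size([4, 10, 8000])
```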

Types of seq2seq models include RNN-based (recurrent neural network) and LSTM-based models. Prediction involves forecasting the next value in a real-valued sequence or outputting a class label. Input-output configurations include one-to-one, many-to-one, and many-to-many.

Inputs to RNNs must all have the same length: shorter sequences are padded to a specified length and longer ones are truncated. Encoder-decoder models, by contrast, generate an encoding before passing it to a decoder for processing. Both encoder-decoder RNN and LSTM models are composed of two models: the first encodes the sequence into a fixed-length vector, and the second decodes that vector and outputs the predicted sequence.
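
A small illustrative helper (my own, not from the book) shows what padding and truncation to a fixed length looks like in practice:

```python
# Pad or truncate token-id sequences to a fixed length so they can be batched.
def pad_or_truncate(seq, max_len, pad_id=0):
    if len(seq) >= max_len:
        return seq[:max_len]                       # truncate longer sequences
    return seq + [pad_id] * (max_len - len(seq))   # pad shorter ones

print(pad_or_truncate([5, 9, 2], 5))           # [5, 9, 2, 0, 0]
print(pad_or_truncate([5, 9, 2, 7, 1, 4], 5))  # [5, 9, 2, 7, 1]
```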

The LSTM model consists of multiple LSTM cells, where the output of one LSTM cell is treated as a context vector that becomes the input of the next LSTM cell. The outputs of the intermediate cells are discarded, and only the output of the final cell is passed on. The final cell contains the cumulative information accrued from each of the preceding cells. Unlike transformers, LSTM processing is sequential.
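
The following short PyTorch snippet (illustrative, not from the book) shows this point: the per-step outputs exist but are set aside, and only the final hidden state is handed on as the context.

```python
# An LSTM returns per-step outputs, but in a plain seq2seq setup only the
# final hidden/cell state is passed to the decoder.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
x = torch.randn(1, 10, 32)               # one sequence of 10 steps
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)  # torch.Size([1, 10, 64]) -> intermediate outputs, discarded
print(h_n.shape)      # torch.Size([1, 1, 64])  -> final state, passed on as context
```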

Autoregressive models are decoder-only models that use previous predictions to generate a new prediction. They can only see the tokens that precede the current token and are blocked from seeing subsequent tokens, which forces them to predict the next token without knowing it. Typical transformer decoders are autoregressive at inference time (generating one token at a time) and non-autoregressive at training time (with teacher forcing, all positions are computed in parallel).
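
The “blocked from seeing subsequent tokens” part is implemented with a causal (look-ahead) mask. Here is an illustrative sketch of such a mask; the size is arbitrary.

```python
# Causal mask: position i may attend only to positions <= i, which is what
# makes decoder-only models autoregressive.
import torch

seq_len = 5
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(mask)
# Attention scores at masked (False) positions are set to -inf before the
# softmax, so each token is predicted without seeing the tokens that follow it.
```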

Autoencoding models are better suited for understanding tasks than autoregressive models. Autoencoders learn a representation (encoding) of a set of data for the purpose of dimensionality reduction, so the input and output layers have the same size. Autoencoding language models corrupt the input tokens and attempt to reconstruct the original sentence (masking). An autoencoder is a simple neural network in which the input and output layers are the same; the purpose of the hidden layer is dimensionality reduction (the contents of the hidden layer are retained for downstream tasks, while the input/output layers are discarded). Autoencoders are primarily used for unsupervised learning tasks (they are not trained on pre-classified input). The encoder compresses the input into a latent space, and the decoder reconstructs the input data from this representation. The goal is to minimize the difference between the reconstruction and the original input.
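
A minimal autoencoder sketch (my own illustration, with arbitrary layer sizes) shows the encode-compress-reconstruct loop and the reconstruction loss being minimized:

```python
# Minimal autoencoder: the hidden (latent) layer is smaller than the input,
# so the network learns a compressed representation.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features=20, n_latent=4):
        super().__init__()
        self.encoder = nn.Linear(n_features, n_latent)   # compress to latent space
        self.decoder = nn.Linear(n_latent, n_features)   # reconstruct the input

    def forward(self, x):
        return self.decoder(torch.relu(self.encoder(x)))

model = AutoEncoder()
x = torch.randn(8, 20)
loss = nn.MSELoss()(model(x), x)   # minimize reconstruction error vs. the input
print(loss.item())
```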

The autoencoding transformer is a neural network that combines the unsupervised learning capabilities of autoencoders with the powerful sequence-processing abilities of transformers. It encodes the input into latent representations using transformers and decodes them back. BERT (Bidirectional Encoder Representations from Transformers) is a kind of autoencoding transformer.
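
The “corrupt and reconstruct” idea behind BERT’s training can be sketched very simply; the token list and masking rate here are illustrative, not the book’s.

```python
# BERT-style masking (masked language modeling): corrupt some input tokens with
# a [MASK] placeholder and train the model to recover the originals.
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked = [t if random.random() > 0.15 else "[MASK]" for t in tokens]
print(masked)   # e.g. ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
# The training objective is to predict the original token at each masked position.
```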

The transformer architecture was released in 2017. It builds on the encoder-decoder seq2seq idea and facilitated the development of LLMs such as BERT and GPT. Earlier encoder-decoder architectures were based on LSTMs, but the seminal paper “Attention Is All You Need” fundamentally changed the architecture to be based on the self-attention mechanism. Attention appears in three places: self-attention in the encoder, masked self-attention in the decoder, and attention between the encoder and decoder. BERT uses only the encoder, GPT uses only the decoder, and the T5 model uses both.

The attention mechanism is a sequence-to-sequence operation that takes a series of vectors as input and produces a different sequence of vectors (e.g., each output is a weighted average of all the input vectors). The encoder contains six blocks, each with two sublayers (a multi-head self-attention layer and a feed-forward layer). The decoder contains six blocks, each with three sublayers: two are the counterparts of the encoder’s sublayers, and the third is a multi-head attention layer that attends over the output of the encoder.
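
Here is an illustrative sketch of scaled dot-product attention, the “weighted average” operation described above; the shapes are arbitrary.

```python
# Scaled dot-product attention: each output vector is a weighted average of the
# value vectors, with weights derived from query-key similarity.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # similarity of queries and keys
    weights = F.softmax(scores, dim=-1)             # attention weights sum to 1
    return weights @ v                              # weighted average of values

x = torch.randn(1, 6, 64)    # 6 input vectors of dimension 64
out = attention(x, x, x)     # self-attention: q, k, v all come from the input
print(out.shape)             # torch.Size([1, 6, 64])
```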

The two main advantages of the transformer architecture are lower computational complexity and higher connectivity, which is good for sequences with long-range dependencies. Unlike LSTM-based models, where only the final hidden state is passed on, the transformer decoder can attend to the encoder’s output at every position.

The input to a transformer consists of a sequence of tokens whose maximum length, called the context size, is typically 512 or 1024. Inputs longer than the maximum length are truncated, which can lead to a loss of accuracy.
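
As a sketch of how this plays out in practice, here is truncation with the Hugging Face transformers library (my choice of toolkit; the book does not prescribe one):

```python
# Inputs beyond the model's context size are truncated; shorter ones are padded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    "A very long document ...",
    truncation=True,        # cut off tokens past max_length
    max_length=512,         # BERT's context size
    padding="max_length",   # pad shorter inputs up to max_length
)
print(len(encoded["input_ids"]))   # 512
```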

In summary, transformers rely on an attention mechanism, model training can be parallelized, no CNNs/RNNs/LSTMs are required, and there is usually a maximum input length of 512 or 1024 tokens.