I’ve recently been reading a book called Transformer, BERT, and GPT: Including ChatGPT and Prompt Engineering by Oswald Campesato (2024).
The book is divided into 10 chapters. Here is a summary of the first chapter (Introduction).
Generative AI refers to a class of artificial intelligence models designed to generate new data samples that are similar in nature to the input data; the emphasis is on creating content rather than analyzing or classifying it (as in traditional machine learning).
- Its key characteristics are data generation (creating new data that is not in the original training data but resembles it), synthesis (blending various inputs together to produce a new kind of output), and learning distributions (learning the probability distribution of the training data in order to produce new samples from that distribution).
- One popular technique is Generative Adversarial Networks (GANs), which consist of a generator (which produces fake data) and a discriminator (which distinguishes fake data from real data). GANs are a form of unsupervised learning (as opposed to supervised learning), meaning they do not need labeled training data (see the GAN sketch after this list).
- Conversational AI has the goal of facilitating human-like interactions between machines and humans (and requires data from human-human or human-bot conversations); generative AI creates new content similar in structure and style to its training set (whose data can be text, images, or music). OpenAI’s DALL-E is an example of generative AI for image synthesis: it learns the probability distribution of its training data.
- ChatGPT is another example of generative AI (in my opinion it is also conversational AI) because it generates text given user prompts, samples from the probability distribution of its training data, and uses unsupervised learning, although it can also be fine-tuned.
- Other notable companies are DeepMind (whose AlphaGo famously beat Go master Lee Sedol), Hugging Face (known for its open-source libraries), and Anthropic (founded by former OpenAI employees).
- Large Language Models (LLMs), so named for their size (at least 10 billion parameters, trained on datasets at a cost of millions of dollars), are based on the transformer architecture, whose self-attention mechanism allows the model to weigh the importance of different input tokens.
- Training dataset size is more important than model size: for example, the Chinchilla LLM from DeepMind has 70 billion parameters but outperforms Megatron-Turing, which has 530 billion parameters, because Chinchilla’s training dataset was five times larger.
- The author claims that LLMs are not able to understand language as humans do, but they can mimic the intelligent choices that humans make.
- Transformer architectures, like other deep learning models, rely on a loss function: a differentiable function that measures the error in the model’s predictions. During each backward pass (backward error propagation), the gradient of the loss (its partial derivatives with respect to the model parameters) is computed and used to update the weights so that the model’s accuracy improves; the same process is used when fine-tuning an LLM (see the gradient-descent sketch after this list).
- AI drift occurs when an LLM responds in unexpected but consistent ways; hallucinations are AI drift that is random rather than consistent. Model drift occurs when the production data becomes qualitatively different from the training data.
- Attention (or self-attention) is the mechanism in the transformer architecture by which contextual word embeddings are determined for the words in a corpus: it represents each word in the context of its sentence, which matters especially when the same word (such as “bank”) can have different meanings in different contexts. It comes from the seminal 2017 Google paper “Attention Is All You Need”.
- Neural networks consist of an input layer (a vector v1), one or more hidden layers, and an output layer. For any pair of adjacent layers, the weights on the edges between them are represented by a matrix W. Forward propagation starts with v1 and multiplies it by W to get v2; an activation function is applied to v2 to get v3, and the multiplication and activation steps are repeated for each subsequent layer (see the forward-pass sketch after this list).
- To calculate attention, first tokenize an input sentence and generate an embedding vector for each token. Combine these vectors into a matrix X, and multiply X by learned weight matrices to create a query matrix Q, a key matrix K, and a value matrix V. Apply a scaling factor and then the softmax function, which converts the scores into probability distributions measuring how strongly each pair of words is related. For example, for “I love Chicago pizza” the row [0.88, 0.06, 0.03, 0.03] means that “I” has a correlation of 0.88 with itself but only 0.06 with “love”. The attention matrix Z is the product of this softmax matrix with the value matrix V; the whole procedure is known as scaled dot-product attention (the dot products of queries and keys are scaled and softmaxed, and each row of the resulting matrix is multiplied with matrix V). The result is a weighted version of matrix V (see the attention sketch after this list).
- You can compute attention more than once in parallel, using multiple attention matrices; this is called multi-head attention (MHA). Attention can also be combined with convolutional neural networks (CNNs).
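To make the GAN bullet concrete, here is a minimal training-loop sketch in PyTorch. The toy “real” data (a shifted Gaussian blob), the tiny network sizes, the optimizers, and the number of steps are all my own illustrative assumptions rather than code from the book; the sketch only shows the generator/discriminator interplay described above.

```python
import torch
import torch.nn as nn

# Generator: turns random noise into fake 2-D samples.
generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
# Discriminator: outputs the probability that a sample is real.
discriminator = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_data = torch.randn(32, 2) + 3.0   # stand-in for "real" samples (no labels needed)

for step in range(200):
    noise = torch.randn(32, 8)
    fake_data = generator(noise)

    # Discriminator step: learn to output 1 for real samples and 0 for fake ones.
    d_loss = (bce(discriminator(real_data), torch.ones(32, 1))
              + bce(discriminator(fake_data.detach()), torch.zeros(32, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the discriminator label fake samples as real.
    g_loss = bce(discriminator(fake_data), torch.ones(32, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```

The adversarial objective is the key idea: the discriminator learns to separate real from fake while the generator learns to fool it, and no labels beyond real/fake are needed.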
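Here is the gradient-descent sketch mentioned in the loss-function bullet: a tiny linear model trained with a mean-squared-error loss, where the gradient of the loss with respect to the weights is computed and then used to update those weights. The model, the random data, and the learning rate are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))    # 8 samples, 3 features
y = rng.normal(size=(8, 1))    # targets
w = np.zeros((3, 1))           # model parameters (weights)
lr = 0.1                       # learning rate

for step in range(100):
    y_hat = X @ w                          # forward pass: predictions
    loss = np.mean((y_hat - y) ** 2)       # differentiable loss function (MSE)
    grad = 2 * X.T @ (y_hat - y) / len(X)  # dLoss/dw: partial derivatives w.r.t. the weights
    w -= lr * grad                         # update the weights to reduce the error

print(f"final loss: {loss:.4f}")
```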
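Next, the forward-pass sketch referenced in the neural-network bullet: multiply the input vector by a weight matrix, apply an activation function, and repeat for the next layer. The layer sizes, the random weights, and the choice of ReLU are my own illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)   # a common activation function

rng = np.random.default_rng(1)
v1 = rng.normal(size=(4,))    # input layer (vector v1)
W1 = rng.normal(size=(4, 5))  # weights between the input and hidden layers
W2 = rng.normal(size=(5, 2))  # weights between the hidden and output layers

v2 = v1 @ W1                  # multiply the input by weight matrix W1
v3 = relu(v2)                 # apply the activation function
output = relu(v3 @ W2)        # repeat the multiplication and activation steps

print(output.shape)           # (2,): two output units
```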
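Finally, the attention sketch: scaled dot-product attention over the four tokens of “I love Chicago pizza”. The random embeddings and the W_Q, W_K, W_V projection matrices are placeholders (in a real transformer they are learned), so the printed weights will not match the example numbers in the bullet above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d_model = d_k = 8
X = rng.normal(size=(4, d_model))      # one embedding vector per token, stacked into matrix X

W_Q = rng.normal(size=(d_model, d_k))  # learned projection matrices (random stand-ins here)
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # query, key, and value matrices

scores = Q @ K.T / np.sqrt(d_k)        # dot products of every token pair, scaled by sqrt(d_k)
weights = softmax(scores, axis=-1)     # each row becomes a probability distribution
Z = weights @ V                        # attention matrix Z: a weighted mix of the rows of V

print(weights.round(2))                # row 0 shows how "I" attends to each of the four tokens
print(Z.shape)                         # (4, 8)
```

Multi-head attention simply repeats this computation with several independent sets of W_Q, W_K, and W_V and concatenates the resulting Z matrices.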