statslink

linking statistics to all

Transformers, BERT, and GPT (Chapter 2)

I’ve recently been reading a book called Transformer, BERT, and GPT: Including ChatGPT and Prompt Engineering by Oswald Campesato (2024).

The book is divided into 10 chapters. Here is a summary of the second chapter (Tokenization).

  • Tokenization is the splitting of a group of words into numbers (a vector) used as input to a neural network (NN). Think of it as pre-processing the words so that you can use them in a NN.
    • Hugging Face defines pre-tokenization as removing spaces from words, lowercasing words, and generally making sure the words look clean and consistent; think of it as pre-processing for tokenization. For example, some LLMs are uncased (all text is lowercased), whereas cased models preserve the original capitalization.
    • Tokenizing words is a non-trivial task, which has to take into account capitalization, word delimiters, spelling variants, and spelling mistakes.
    • Word-based tokenizers are the most limited and match whole words, so for example ‘sing’ is treated as a different word from ‘singing’, and compound words such as ‘bookkeeping’ are not recognized as their parts. To deal with multiple languages, an encoding system such as UTF-8 can handle diacritical marks in languages such as French and punctuation used only in Chinese and Japanese.
    • Character-based tokenizers split a word into individual characters, which generates a large number of tokens. Spelling mistakes are more difficult to fix with this method.
    • Subword tokenization is based on heuristics and keeps meaningful pieces of words together; for example, “quickly” is decomposed into “quick” and “ly”.
      • Byte-pair encoding (BPE) is an example of subword tokenization. The goal is to find and merge the most frequently occurring pairs of tokens. For example, the word “lowers” starts as the characters “l-o-w-e-r-s”: the pair “lo” is merged first because it occurs most frequently, then “low”, and then “r” and “s” are merged into “rs”, leaving “low-e-rs”. Subword tokenization is used in both BERT (via the related WordPiece algorithm) and GPT (via BPE); a toy version of the merge loop is sketched in the first code example after this list.
    • Hugging Face provides AutoTokenizer, which you can import with from transformers import AutoTokenizer; the second example after this list shows it in use.
  • Parameters in neural networks are the edges that connect neurons in two different layers. The parameters are assigned numeric weights that are updated during the training (or fine-tuning) step of an LLM or neural network. There are also hyperparameters, which are set prior to training. Finally, there are adjustable inference parameters such as max tokens, token length, stop tokens, sampling top K or top P, and temperature; the last example after this list shows several of these in a sample generation call.
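Below is a minimal sketch in Python of the BPE merge loop described above. The tiny corpus and its word frequencies are invented purely for illustration; real tokenizers learn their merges from a large training corpus.

from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the winning pair with its merged symbol.
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Words start out as space-separated characters; the frequencies are made up.
vocab = {"l o w": 5, "l o w e r s": 2, "l o g": 3}

for step in range(2):
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best} -> {list(vocab)}")

After two merges the symbol “low” emerges from “l”, “o”, and “w”, which mirrors the “lo” then “low” progression in the book’s example.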
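Here is a small example of AutoTokenizer in action, assuming the transformers library is installed and the bert-base-uncased checkpoint can be downloaded (the sample sentence is my own):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization splits words into subwords quickly."
tokens = tokenizer.tokenize(text)   # subword strings, roughly ['token', '##ization', ...]
ids = tokenizer(text)["input_ids"]  # the integer IDs the model actually consumes

print(tokens)
print(ids)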
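Finally, a sketch of the adjustable inference parameters from the last bullet, assuming transformers and PyTorch are installed and using the gpt2 checkpoint as a stand-in model:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Tokenization is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,  # cap on how many new tokens are generated
    do_sample=True,     # sample instead of always taking the most likely token
    top_k=50,           # sample only from the 50 most likely next tokens
    top_p=0.95,         # ...or from the smallest set covering 95% of the probability
    temperature=0.8,    # below 1 sharpens the distribution, above 1 flattens it
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))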