
Visualizing Transformers and Attention#
This is a summary note from Grant Sanderson’s talk at TNG Big Tech 2024. My earlier article on transformers can be found here.
Transformers and Their Flexibility#
- Origin: Introduced in 2017 in the “Attention is All You Need” paper, originally for machine translation.
- Applications Beyond Translation: Used in transcription (e.g., Whisper), text-to-speech, and even image classification.
- Chatbot Models: The talk focuses on models trained to predict the next token in a sequence, generating text iteratively one token at a time.
Next Token Prediction and Creativity#
- Prediction Process: Predicts probabilities for possible next tokens, selects one, and repeats the process.
- Temperature Control: Adjusting randomness in token selection affects creativity vs. predictability in outputs.
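The prediction and temperature steps above can be sketched in a few lines. This is a minimal illustration, not any particular model's sampler: the function names and the four example logits are made up, and it assumes the common convention of dividing the logits by the temperature before applying softmax.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores (logits) into probabilities, scaled by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=1.0, rng=random):
    """Pick one token index at random according to the softmax probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
cold = softmax_with_temperature(logits, temperature=0.5)  # sharper distribution
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter distribution
```

A low temperature concentrates probability on the top-scoring token (more predictable output), while a high temperature spreads it out (more varied output). Generation would call `sample_next_token` in a loop, appending each chosen token to the context.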
Tokens and Tokenization#
- What are Tokens? Subdivisions of input data (words, subwords, punctuation, or image patches).
- Why Not Characters? Using characters increases context size and computational complexity; tokens balance meaning and computational efficiency.
- Byte Pair Encoding (BPE): A common method for tokenization.
Embedding Tokens into Vectors#
- Embedding: Tokens are mapped to high-dimensional vectors representing their meaning.
- Contextual Meaning: Vectors evolve through the network to capture context, disambiguate meaning, and encode relationships.
The Attention Mechanism#
- Purpose: Enables tokens to “attend” to others, updating their vectors based on relevance.
- Key Components:
  - Query Matrix: Encodes what a token is “looking for.”
  - Key Matrix: Encodes how a token responds to queries.
  - Value Matrix: Encodes information passed between tokens.
- Calculations:
  - Dot Product: Measures alignment between keys and queries.
  - Softmax: Converts dot products into normalized weights for updates.
- Masked Attention: Ensures causality by blocking future tokens from influencing past ones.
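The calculations above can be sketched for a single attention head. This is a simplified illustration under stated assumptions: the query, key, and value vectors are passed in directly (in a real model they come from multiplying each token's embedding by the query, key, and value matrices), the scores are divided by the square root of the key dimension as in the original paper, and masking is done by skipping future positions, which gives the same result as setting their scores to negative infinity before the softmax. The function name and toy inputs are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head attention where position i may only look at positions <= i."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Dot product of this query with every key that is not in the future.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        # Future positions get weight zero.
        weights = softmax(scores) + [0.0] * (len(keys) - i - 1)
        # Weighted sum of value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs: three tokens, two-dimensional vectors.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = causal_attention(queries, keys, values)
```

Each row of `weights` sums to one, and the first token can only attend to itself, so its output is just its own value vector.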
Multi-Headed Attention#
- Parallel Heads: Multiple attention heads allow different types of relationships (e.g., grammar, semantic context) to be processed simultaneously.
- Efficiency on GPUs: Designed to maximize parallelization for faster computation.
Multi-Layer Perceptrons (MLPs)#
- Role in Transformers:
  - Add capacity for general knowledge and non-contextual reasoning.
  - Store facts learned during training, e.g., associations like “Michael Jordan plays basketball.”
- Parameters: MLPs hold the majority of the model’s parameters.
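A rough parameter count shows where the "majority" claim comes from. The sketch below assumes a standard block layout that the note does not spell out: four square attention projections (query, key, value, output) and an MLP whose hidden layer is four times the model width. Biases, layer norms, and the embedding matrices are ignored, and the function name is hypothetical.

```python
def block_parameter_counts(d_model, mlp_expansion=4):
    """Count weight-matrix parameters in one transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model   # query, key, value and output projections
    d_hidden = mlp_expansion * d_model  # assumed width of the MLP hidden layer
    mlp = 2 * d_model * d_hidden        # up-projection and down-projection
    return attention, mlp

# 12288 is the model width of the largest GPT-3 model.
attention, mlp = block_parameter_counts(d_model=12288)
mlp_share = mlp / (attention + mlp)
```

Under these assumptions the MLP holds two thirds of each block's weights, regardless of the model width.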
Training Transformers#
- Learning Framework:
  - Models are trained on vast datasets using next-token prediction, requiring no manual labels.
  - Cost Function: Scores each prediction by the negative log probability assigned to the correct next token, guiding parameter updates.
- Optimization: Gradient descent navigates a high-dimensional cost surface to minimize error.
- Pretraining: Allows large-scale unsupervised learning before fine-tuning with human feedback.
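The cost function described above can be written out directly. This is a minimal sketch with made-up probabilities: `predicted_probs` stands in for a model's output distribution at each position, and `target_ids` for the tokens that actually came next in the training text.

```python
import math

def next_token_loss(predicted_probs, target_ids):
    """Average negative log probability assigned to the correct next tokens."""
    losses = [-math.log(probs[target])
              for probs, target in zip(predicted_probs, target_ids)]
    return sum(losses) / len(losses)

# Hypothetical model outputs over a 3-token vocabulary at two positions.
predicted_probs = [[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]]
confident_right = next_token_loss(predicted_probs, [0, 2])
confident_wrong = next_token_loss(predicted_probs, [2, 0])
```

The loss is small when the model puts high probability on the correct token and large when it does not. Because the targets are simply the next tokens of the text itself, no manual labels are needed.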
Embedding Space and High Dimensions#
- Semantic Clusters: Similar words cluster together; directions in the space encode relationships (e.g., gender: King - Male + Female = Queen).
- High Dimensionality: Embedding spaces have thousands of dimensions, enabling distinct representations of complex concepts.
- Scaling Efficiency: The number of “almost orthogonal” directions available for encoding meanings grows exponentially with the dimension of the space.
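A small experiment gives some intuition for the "almost orthogonal" point. The sketch below does not prove the exponential claim; it only shows that randomly chosen directions tend to be much closer to perpendicular in high dimensions than in low ones. The function names, the number of pairs, and the seed are arbitrary choices for illustration.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a vector of Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_cosine(dim, pairs=200, seed=0):
    """Average |cosine similarity| between pairs of random unit vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = random_unit_vector(dim, rng)
        b = random_unit_vector(dim, rng)
        total += abs(sum(x * y for x, y in zip(a, b)))
    return total / pairs

low_dim = mean_abs_cosine(dim=3)
high_dim = mean_abs_cosine(dim=1000)
```

A cosine near zero means two directions are nearly perpendicular. In three dimensions the average is far from zero, while in a thousand dimensions it is close to it, which is what leaves room for many distinct, barely interfering directions.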
Practical Applications#
- Language Models: Effective for chatbots, summarization, and more due to their generality and parallel processing.
- Multimodal Models: Transformers can integrate text, images, and sound by treating all as tokens in a unified framework.
Challenges and Limitations#
- Context Size Limitations: The cost of attention grows quadratically with context size, so large contexts require optimization.
- Inference Redundancy: Token-by-token generation can involve redundant computations; caching mitigates this at inference time.
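The redundancy that caching removes can be illustrated with a simple count. This is a deliberately simplified model, not a real inference engine: it counts key/value projections for one head in one layer, and assumes that without a cache every token in the context is projected again at every generation step. The function name is hypothetical.

```python
def key_value_projections(num_tokens, use_cache):
    """Count how many per-token key/value projections are computed while generating."""
    computed = 0
    cached = 0  # number of positions whose keys/values are already stored
    for step in range(1, num_tokens + 1):
        if use_cache:
            computed += step - cached  # only the new token needs projecting
            cached = step
        else:
            computed += step           # project every token in the context again
    return computed

without_cache = key_value_projections(1000, use_cache=False)  # 1000 * 1001 / 2 = 500500
with_cache = key_value_projections(1000, use_cache=True)      # 1000
```

Without a cache the work grows quadratically with the number of generated tokens; with one it grows linearly, at the price of storing the keys and values for the whole context.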
Engineering and Design#
- Hardware Optimization: Transformers are designed to exploit GPUs’ parallelism for efficient matrix multiplication.
- Residual Connections: Baked into the architecture to enhance stability and ease of training.
The Power of Scale#
- Scaling Laws: Larger models and more data improve performance, often qualitatively.
- Self-Supervised Pretraining: Enables training on vast unlabeled datasets before fine-tuning.
BPE (Byte Pair Encoding)#
BPE is a widely used tokenization method in natural language processing (NLP) and machine learning. It is designed to balance between breaking text into characters and full words by representing text as a sequence of subword units. This approach helps models handle rare and unseen words effectively while keeping the vocabulary size manageable.
How BPE Works:#
- Start with Characters: Initially, every character in the text is treated as a separate token.
- Merge Frequent Pairs: BPE repeatedly identifies the most frequent pair of adjacent tokens in the training corpus and merges them into a single token. This process is applied iteratively.
  - For example:
    - Input: low, lower, lowest
    - Output Vocabulary: {low_, e, r, s, t}
- Build Vocabulary: The merging process stops after a predefined number of merges, resulting in a vocabulary of subwords, characters, and some common full words.
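The three steps above can be sketched as a toy training loop. This is an illustrative implementation, not a production tokenizer: the function name is hypothetical, the `_` end-of-word marker is an assumed convention, and ties between equally frequent pairs are broken by whichever pair was seen first. It is not meant to reproduce the exact vocabulary listed in the example above.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (a toy training corpus)."""
    # Each word starts as a tuple of single characters plus an end-of-word marker.
    corpus = Counter(tuple(word) + ("_",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen wins
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one token.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["low", "lower", "lowest"], num_merges=2)
```

After two merges on this corpus, `l` and `o` have been fused into `lo`, and then `lo` and `w` into `low`, so all three words share the subword `low`. Raising `num_merges` grows the vocabulary toward longer subwords and full words.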
 

