Overview

This course covers tokenization techniques used in modern AI models, including rule-based methods, subword tokenization (BPE, WordPiece, SentencePiece), and vocabulary optimizations. Learners will implement these methods and understand their impact on NLP model performance.

Syllabus
- Unit 1: Introduction to Tokenization (Rule-Based Tokenization)
  - Tokenize Text with NLTK
  - Sentence Tokenization with NLTK
  - Extract Monetary Values with Regex
  - Tokenization Showdown with NLTK and spaCy
- Unit 2: Byte-Pair Encoding (BPE) – Subword Tokenization
  - Exploring Pre-trained Tokenizers with GPT-2
  - Using Pre-trained Tokenizers with RoBERTa
  - Comparing Tokenization with GPT-2 and RoBERTa
- Unit 3: Comparing BPE, WordPiece, and SentencePiece in NLP
  - WordPiece Tokenization Challenge
  - Tokenization Techniques in Action
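As a taste of the rule-based techniques in Unit 1, here is a minimal sketch of regex-based monetary extraction using only Python's standard library; the pattern and function name are illustrative, not the course's solution code:

```python
import re

def extract_monetary_values(text):
    """Find dollar amounts such as $25, $1,200, or $1,299.99."""
    # \$          literal dollar sign
    # \d{1,3}     1-3 leading digits
    # (?:,\d{3})* optional thousands groups
    # (?:\.\d{2})? optional cents
    pattern = r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"
    return re.findall(pattern, text)

sample = "The laptop costs $1,299.99, shipping adds $25, and tax is $104.00."
print(extract_monetary_values(sample))  # ['$1,299.99', '$25', '$104.00']
```

Rule-based tokenizers like this are fast and predictable, but each new text pattern (other currencies, negative amounts) needs another hand-written rule, which motivates the learned subword methods in later units.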
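The GPT-2 and RoBERTa tokenizers explored in Unit 2 are both built on byte-pair encoding. A minimal sketch of how BPE learns merges from a toy corpus (word frequencies and the `</w>` end-of-word marker are illustrative, in the style of the original BPE algorithm):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Fuse every standalone occurrence of `pair` into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    replacement = "".join(pair)
    return {pattern.sub(replacement, word): freq for word, freq in vocab.items()}

# Toy corpus: each word split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(3):  # learn three merge rules
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Real tokenizers learn tens of thousands of such merges; applying them in order is what lets GPT-2 and RoBERTa split any unseen word into known subword pieces.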
  - Tokenization Techniques for Special Texts
- Unit 4: Tokenization and Out-of-Vocabulary (OOV) Handling in NLP
  - Tokenization Showdown: BERT vs GPT-2
  - Multilingual Tokenization Challenge
  - Multilingual Tokenization and OOV Reduction
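Where BPE applies learned merges, WordPiece (compared against it in Unit 3) segments each word by greedy longest-match-first lookup against its vocabulary, marking word-internal pieces with `##`. A small sketch with a hand-picked toy vocabulary (the words and vocabulary are illustrative):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword segmentation, WordPiece-style."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate  # longest vocabulary match wins
                break
            end -= 1
        if piece is None:
            return [unk]  # no subword in the vocabulary covers this span
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "##ord", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```

SentencePiece differs mainly in operating on raw text (spaces included) rather than on pre-split words, which is what makes it language-agnostic.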
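One OOV-reduction idea Unit 4 builds toward: instead of mapping unseen words to a lossy `[UNK]` token, fall back to the word's UTF-8 bytes, so every string, in any language, remains representable. A minimal sketch of the byte-fallback idea (the vocabulary and token format are illustrative):

```python
def byte_fallback_tokenize(word, vocab):
    """Return the word if known; otherwise fall back to its UTF-8 bytes.

    Because every possible byte has a token, nothing is ever
    out-of-vocabulary - the idea behind byte-level BPE tokenizers.
    """
    if word in vocab:
        return [word]
    return [f"<0x{b:02X}>" for b in word.encode("utf-8")]

vocab = {"hello", "world"}
print(byte_fallback_tokenize("hello", vocab))  # ['hello']
print(byte_fallback_tokenize("héllo", vocab))  # six byte tokens, no [UNK]
```

The trade-off is sequence length: an unknown multilingual word becomes several byte tokens instead of one, which is why learned multilingual subword vocabularies still matter.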