Hugging Face Course - Fast Tokenizers and Token Classification Pipelines - Chapter 6

This tutorial covers fast tokenizers: what makes them fast, what extra features they offer, and how they are used in both PyTorch and TensorFlow. It walks through the internals of the token classification and question answering pipelines, showing how each one processes text and generates predictions. It then turns to custom tokenizers, covering normalization, pre-tokenization, and three tokenization algorithms: Byte Pair Encoding (BPE), WordPiece, and Unigram. Along the way you train a new tokenizer and get hands-on practice building one from scratch.

Syllabus

Why are fast tokenizers called fast?
Fast tokenizer superpowers
Inside the Token classification pipeline (PyTorch)
Inside the Token classification pipeline (TensorFlow)
Inside the Question answering pipeline (PyTorch)
Inside the Question answering pipeline (TensorFlow)
Training a new tokenizer
What is normalization?
What is pre-tokenization?
Byte Pair Encoding Tokenization
WordPiece Tokenization
Unigram Tokenization
Building a new tokenizer