Overview

This course covers tokenization techniques used in modern AI models, including rule-based methods, subword tokenization (BPE, WordPiece, SentencePiece), and vocabulary optimizations. Learners will implement these methods and understand their impact on NLP model performance.

Syllabus
- Unit 1: Introduction to Tokenization (Rule-Based Tokenization)
  - Tokenize Text with NLTK
  - Sentence Tokenization with NLTK
  - Extract Monetary Values with Regex
  - Tokenization Showdown with NLTK and spaCy
- Unit 2: Byte-Pair Encoding (BPE) – Subword Tokenization
  - Exploring Pre-trained Tokenizers with GPT-2
  - Using Pre-trained Tokenizers with RoBERTa
  - Comparing Tokenization with GPT-2 and RoBERTa
- Unit 3: Comparing BPE, WordPiece, and SentencePiece in NLP
  - WordPiece Tokenization Challenge
  - Tokenization Techniques in Action
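As a taste of the rule-based techniques in Unit 1, here is a minimal sketch of regex-based monetary extraction using only Python's standard library; the pattern and function name are illustrative, not the course's solution code:

```python
import re

def extract_monetary_values(text):
    """Find dollar amounts such as $25, $1,200, or $1,299.99."""
    # \$          literal dollar sign
    # \d{1,3}     1-3 leading digits
    # (?:,\d{3})* optional thousands groups
    # (?:\.\d{2})? optional cents
    pattern = r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"
    return re.findall(pattern, text)

sample = "The laptop costs $1,299.99, shipping adds $25, and tax is $104.00."
print(extract_monetary_values(sample))  # ['$1,299.99', '$25', '$104.00']
```

Rule-based tokenizers like this are fast and predictable, but each new text pattern (other currencies, negative amounts) needs another hand-written rule, which motivates the learned subword methods in later units.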
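The GPT-2 and RoBERTa tokenizers explored in Unit 2 are both built on byte-pair encoding. A minimal sketch of how BPE learns merges from a toy corpus (word frequencies and the `</w>` end-of-word marker are illustrative, in the style of the original BPE algorithm):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Fuse every standalone occurrence of `pair` into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    replacement = "".join(pair)
    return {pattern.sub(replacement, word): freq for word, freq in vocab.items()}

# Toy corpus: each word split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(3):  # learn three merge rules
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Real tokenizers learn tens of thousands of such merges; applying them in order is what lets GPT-2 and RoBERTa split any unseen word into known subword pieces.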
  - Tokenization Techniques for Special Texts
- Unit 4: Tokenization and Out-of-Vocabulary (OOV) Handling in NLP
  - Tokenization Showdown: BERT vs GPT-2
  - Multilingual Tokenization Challenge
  - Multilingual Tokenization and OOV Reduction
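Where BPE applies learned merges, WordPiece (compared against it in Unit 3) segments each word by greedy longest-match-first lookup against its vocabulary, marking word-internal pieces with `##`. A small sketch with a hand-picked toy vocabulary (the words and vocabulary are illustrative):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword segmentation, WordPiece-style."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate  # longest vocabulary match wins
                break
            end -= 1
        if piece is None:
            return [unk]  # no subword in the vocabulary covers this span
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "##ord", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```

SentencePiece differs mainly in operating on raw text (spaces included) rather than on pre-split words, which is what makes it language-agnostic.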
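One OOV-reduction idea Unit 4 builds toward: instead of mapping unseen words to a lossy `[UNK]` token, fall back to the word's UTF-8 bytes, so every string, in any language, remains representable. A minimal sketch of the byte-fallback idea (the vocabulary and token format are illustrative):

```python
def byte_fallback_tokenize(word, vocab):
    """Return the word if known; otherwise fall back to its UTF-8 bytes.

    Because every possible byte has a token, nothing is ever
    out-of-vocabulary - the idea behind byte-level BPE tokenizers.
    """
    if word in vocab:
        return [word]
    return [f"<0x{b:02X}>" for b in word.encode("utf-8")]

vocab = {"hello", "world"}
print(byte_fallback_tokenize("hello", vocab))  # ['hello']
print(byte_fallback_tokenize("héllo", vocab))  # six byte tokens, no [UNK]
```

The trade-off is sequence length: an unknown multilingual word becomes several byte tokens instead of one, which is why learned multilingual subword vocabularies still matter.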