CodeSignal

Modern Tokenization Techniques for AI & LLMs

via CodeSignal

Overview

This course covers tokenization techniques used in modern AI models, including rule-based methods, subword tokenization (BPE, WordPiece, SentencePiece), and vocabulary optimizations. Learners will implement these methods and understand their impact on NLP model performance.
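To give a flavor of the subword methods the overview names, here is a minimal, self-contained sketch of the core BPE training loop: repeatedly count adjacent symbol pairs across a toy corpus and merge the most frequent pair. The corpus, the `</w>` end-of-word marker, and the number of merge steps are illustrative assumptions, not material from the course itself.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters plus an end-of-word marker) -> frequency.
words = {
    tuple("low") + ("</w>",): 5,
    tuple("lower") + ("</w>",): 2,
    tuple("lowest") + ("</w>",): 3,
}

merges = []
for _ in range(3):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # learned merge rules, most frequent first
```

On this corpus the first two merges are `('l', 'o')` and `('lo', 'w')`, so the shared stem "low" quickly becomes a single token — the intuition behind why BPE vocabularies compress common substrings.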

Syllabus

  • Unit 1: Introduction to Tokenization (Rule-Based Tokenization)
    • Tokenize Text with NLTK
    • Sentence Tokenization with NLTK
    • Extract Monetary Values with Regex
    • Tokenization Showdown with NLTK and spaCy
  • Unit 2: Byte-Pair Encoding (BPE) – Subword Tokenization
    • Exploring Pre-trained Tokenizers with GPT-2
    • Using Pre-trained Tokenizers with RoBERTa
    • Comparing Tokenization with GPT-2 and RoBERTa
  • Unit 3: Comparing BPE, WordPiece, and SentencePiece in NLP
    • WordPiece Tokenization Challenge
    • Tokenization Techniques in Action
    • Tokenization Techniques for Special Texts
  • Unit 4: Tokenization and Out-of-Vocabulary (OOV) Handling in NLP
    • Tokenization Showdown: BERT vs GPT-2
    • Multilingual Tokenization Challenge
    • Multilingual Tokenization and OOV Reduction
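Unit 1's "Extract Monetary Values with Regex" lesson can be sketched with Python's standard `re` module. The pattern below is a hypothetical illustration of the idea, not the course's actual solution: it matches a dollar sign, 1–3 leading digits, optional comma-separated thousands groups, and an optional two-digit decimal part.

```python
import re

# Illustrative pattern: matches amounts like "$1,299.99", "$25", or "$7.50".
MONEY_RE = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

text = "The laptop costs $1,299.99, while the mouse is $25 and shipping adds $7.50."
print(MONEY_RE.findall(text))  # → ['$1,299.99', '$25', '$7.50']
```

Note that the thousands groups are anchored to the comma, so a trailing sentence comma (as after `$1,299.99,`) is not swallowed into the match.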
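For Unit 3's WordPiece material, the core segmentation idea is greedy longest-match-first: repeatedly take the longest prefix of the remaining word that is in the vocabulary, prefixing non-initial pieces with `##`. The tiny vocabulary and the function name below are assumptions for illustration; real WordPiece vocabularies are learned from data.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword segmentation (WordPiece-style)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate from the right
        if match is None:
            return [unk]  # no piece fits: the whole word is out-of-vocabulary
        tokens.append(match)
        start = end
    return tokens

vocab = {"token", "##iz", "##ation", "un", "##able"}
print(wordpiece_tokenize("tokenization", vocab))  # → ['token', '##iz', '##ation']
print(wordpiece_tokenize("xyz", vocab))           # → ['[UNK]']
```

The all-or-nothing `[UNK]` fallback is exactly the OOV behavior Unit 4 contrasts with BPE-style tokenizers, which can always fall back to smaller units.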
