Building Transformer Tokenizers - Dhivehi NLP #1

Learn how to build a WordPiece tokenizer for Dhivehi, a low-resource language with a complex writing system. Explore the challenges of applying NLP to Dhivehi and follow a step-by-step demonstration of building a custom tokenizer. Discover the key components of tokenizer design: normalization, pre-tokenization, post-tokenization (post-processing), and decoding. Implement and train the tokenizer, save it for use with Transformers, and test it on sample text. By the end of this tutorial, you should understand how to develop a tokenizer for a low-resource language and be able to apply the same techniques to other languages with limited data.

Syllabus

Intro
Dhivehi Project
Hurdles for Low Resource Domains
Dhivehi Dataset
Download Dhivehi Corpus
Tokenizer Components
Normalizer Component
Pre-tokenization Component
Post-tokenization Component
Decoder Component
Tokenizer Implementation
Tokenizer Training
Post-processing Implementation
Decoder Implementation
Saving for Transformers
Tokenizer Test and Usage
Download Dhivehi Models
First Steps
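
The four components named in the syllabus (normalizer, pre-tokenizer, post-processor, decoder) wrap around the WordPiece model itself. The following is a minimal pure-Python sketch of how those stages could fit together; it is not the code from the video and not any library's API. All function names and the vocabulary are hypothetical, and the vocabulary is a hand-written toy in Latin script, where a real Dhivehi vocabulary would be learned from the corpus during training. NFKC plus lowercasing is only an illustrative normalization choice.

```python
import unicodedata

# Toy vocabulary (hypothetical); a real one is learned from the corpus.
VOCAB = {"[UNK]", "[CLS]", "[SEP]", "token", "##izer", "##s", "low", "##er"}

def normalize(text):
    """Normalizer: Unicode NFKC plus lowercasing (illustrative choice)."""
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text):
    """Pre-tokenizer: split the normalized text on whitespace."""
    return text.split()

def wordpiece(word, vocab=VOCAB):
    """Model: greedy longest-match-first; non-initial pieces carry '##'."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def post_process(pieces):
    """Post-processor: wrap the sequence in BERT-style special tokens."""
    return ["[CLS]"] + pieces + ["[SEP]"]

def encode(text):
    """Run normalizer, pre-tokenizer, model, and post-processor in order."""
    pieces = []
    for word in pre_tokenize(normalize(text)):
        pieces.extend(wordpiece(word))
    return post_process(pieces)

def decode(tokens):
    """Decoder: drop special tokens and glue '##' pieces back onto words."""
    words = []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]"):
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

print(encode("Tokenizers lower"))
print(decode(encode("Tokenizers lower")))
```

With this toy vocabulary, "Tokenizers" splits into `token`, `##izer`, `##s`, and decoding restores the normalized (lowercased) text rather than the original casing. Normalization is lossy by design, which is one reason the choice of normalizer matters for a script like Dhivehi's.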