Overview
Learn how to build an effective WordPiece tokenizer for Dhivehi, a low-resource language with a complex writing system. Explore the challenges of applying NLP to Dhivehi and follow a step-by-step demonstration of creating a custom tokenizer. Discover the key components of tokenizer design: normalization, pre-tokenization, post-tokenization, and decoding. Implement and train the tokenizer, test its functionality, and gain insights into working with low-resource languages in NLP. By the end of this tutorial, you'll understand how to develop tokenizers for unique linguistic contexts and be able to apply these techniques to other low-resource languages.
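To make the four components named above concrete, here is a minimal pure-Python sketch of how they fit around WordPiece's greedy longest-match splitting. It is not the course's implementation: the toy vocabulary, the English sample words, the lowercasing step, and every function name are assumptions chosen for illustration, and a real tokenizer would learn its vocabulary from a Dhivehi corpus.

```python
import unicodedata

# Toy vocabulary (hypothetical); a trained tokenizer learns this from a corpus.
VOCAB = {"[UNK]", "[CLS]", "[SEP]", "token", "##izer", "##s", "low", "resource"}
SPECIAL = {"[CLS]", "[SEP]"}


def normalize(text):
    # Normalization: unify Unicode forms (NFKC) and lowercase.
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text):
    # Pre-tokenization: split into words on whitespace.
    return text.split()


def wordpiece(word, vocab=VOCAB):
    # Greedy longest-match-first: repeatedly take the longest vocabulary
    # entry that matches at the current position. Pieces that continue a
    # word carry the "##" prefix.
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word becomes unknown.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def post_process(pieces):
    # Post-tokenization: wrap the sequence in special tokens.
    return ["[CLS]"] + pieces + ["[SEP]"]


def encode(text):
    pieces = []
    for word in pre_tokenize(normalize(text)):
        pieces.extend(wordpiece(word))
    return post_process(pieces)


def decode(pieces):
    # Decoding: drop special tokens and glue "##" pieces onto the previous word.
    words = []
    for piece in pieces:
        if piece in SPECIAL:
            continue
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


tokens = encode("Low resource Tokenizers")
print(tokens)
print(decode(tokens))
```

The sketch keeps each stage as a separate function so that any one of them can be swapped out, for example replacing the whitespace split with a rule suited to a different script, without touching the others.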
Syllabus
Intro
Dhivehi Project
Hurdles for Low Resource Domains
Dhivehi Dataset
Download Dhivehi Corpus
Tokenizer Components
Normalizer Component
Pre-tokenization Component
Post-tokenization Component
Decoder Component
Tokenizer Implementation
Tokenizer Training
Post-processing Implementation
Decoder Implementation
Saving for Transformers
Tokenizer Test and Usage
Download Dhivehi Models
First Steps
Taught by
James Briggs