Google DeepMind: 02 Represent Your Language Data

Google via Google Skills

Overview

In this Google DeepMind course, you will learn how to prepare text data for language models to process. You will investigate the tools and techniques used to prepare, structure, and represent text data for language models, with a focus on tokenization and embeddings. You will be encouraged to think critically about the decisions behind data preparation and about which biases in the data may be introduced into models. You will analyze trade-offs, learn how to work with vectors and matrices, and explore how meaning is represented in language models. Finally, you will practice designing a dataset ethically using the Data Cards process, ensuring transparency, accountability, and respect for community values in AI development.

Syllabus

  • Introduction to text data
    • Teaching a machine the soul of your language
    • A world of text: Types and sources
    • Exploring raw data
    • Learning objectives
    • How to get the most out of this course
  • Preprocessing
    • Lab: Preprocess Data
    • Harnessing the potential of low-resource languages
    • Data resources
    • Who owns the data?
    • Knowledge check 1
  • Tokenization
    • What is tokenization?
    • Lab: Tokenize Texts into Characters and Words
    • Lab: Tokenize Texts into Subword Tokens
    • Subword tokenization
    • Lab: Implement a BPE Tokenizer
    • Whose voice is missing?
    • Knowledge check 2
  • Embeddings
    • What are embeddings?
    • Design your own embeddings
    • Desired properties of embeddings
    • Lab: Experiment with Embeddings
    • Lab: Train an SLM with Your BPE Tokenizer
    • Knowledge check 3
  • Challenge
    • Why document data?
    • Build a dataset ethically with a Data Card
    • Knowledge check 4
  • Continue your journey
    • Summary
    • Looking forward
    • Additional resources and further reading
    • Glossary
    • Feedback
  • Your Next Steps
    • Claim credential
