Inside OpenCoder - Data Processing and Training Pipeline for Code LLMs

Oxen via YouTube

Overview

Learn how OpenCoder, a code-focused large language model, was developed and trained in this technical presentation. The talk covers data preparation, including preprocessing and deduplication, and shows how deduplication affected model performance. It then walks through the training methodology: pre-training data selection, the RefineCode dataset, and the two-stage instruct tuning process. Along the way it explains how data transformation, filtering, and sampling were used to build the code generation model, and it closes with the evaluation methods and directions for future work. Supplementary materials, including the research paper, training datasets, and documentation, are available through the provided links, and a Discord community hosts ongoing discussion and updates.

Syllabus

Intro
OpenCoder
OpenCoder Goals
Pre-Training Data
RefineCode
Raw Code for Pre-Training
Data Preprocessing
Data Deduplication
How Data Deduplication Improved OpenCoder
Data Transformation
Data Filtering
Sampling
Code-Related Data
Post Training
The Two Stages of Instruct Tuning
Evaluation
Conclusion & Future Work

Taught by

Oxen
