Inside OpenCoder - Data Processing and Training Pipeline for Code LLMs

Learn how OpenCoder, a code-focused large language model, was developed and trained in this technical presentation. Explore data preparation in detail, including preprocessing, deduplication strategies, and the effect of deduplication on model performance. Follow the training methodology, from pre-training data selection and the RefineCode dataset to the two-stage instruct tuning process. See how data transformation, filtering, and sampling were used to build the training data for a code generation model. Review the evaluation methods and the directions planned for future work. Links to supplementary materials, including the research paper, training datasets, and documentation, are provided, along with a Discord community for ongoing discussion and updates.

Syllabus

Intro
OpenCoder
OpenCoder Goals
Pre-Training Data
RefineCode
Raw Code for Pre-Training
Data Preprocessing
Data Deduplication
How Data Deduplication Improved OpenCoder
Data Transformation
Data Filtering
Sampling
Code-Related Data
Post-Training
The Two Stages of Instruct Tuning
Evaluation
Conclusion & Future Work