A Unified Solution for Data Management and Model Training With Apache Iceberg and Mosaic Streaming
Databricks via YouTube
Overview
Explore ByteDance's innovative approach to solving data management and model training challenges through Magnus (enhanced Apache Iceberg) and Byted Streaming (customized Mosaic Streaming) in this 32-minute conference talk. Learn how ByteDance leveraged Iceberg's branch/tag functionality to efficiently manage massive datasets and checkpoints, while implementing enhanced metadata and a custom C++ data reader to achieve optimal sharding, shuffling, and data loading performance. Discover the flexible table migration capabilities, detailed metrics, and built-in full-text indexes on Iceberg tables that ensure training reliability. Understand how the team addressed scalability and performance issues with ultra-large datasets by customizing Mosaic Streaming to resolve challenges including slow startup times, high resource consumption, and limited data source compatibility. Gain insights into the technical enhancements made to both Magnus and Byted Streaming, and see demonstrations of how these solutions enable efficient and robust distributed training at scale.
Syllabus
A Unified Solution for Data Management and Model Training With Apache Iceberg and Mosaic Streaming
Taught by
Databricks