A Unified Solution for Data Management and Model Training With Apache Iceberg and Mosaic Streaming
Databricks via YouTube
Overview
Explore ByteDance's innovative approach to solving data management and model training challenges through Magnus (enhanced Apache Iceberg) and Byted Streaming (customized Mosaic Streaming) in this 32-minute conference talk. Learn how ByteDance leveraged Iceberg's branch/tag functionality to efficiently manage massive datasets and checkpoints, while implementing enhanced metadata and a custom C++ data reader to achieve optimal sharding, shuffling, and data loading performance. Discover the flexible table migration capabilities, detailed metrics, and built-in full-text indexes on Iceberg tables that ensure training reliability. Understand how the team addressed scalability and performance issues with ultra-large datasets by customizing Mosaic Streaming to resolve challenges including slow startup times, high resource consumption, and limited data source compatibility. Gain insights into the technical enhancements made to both Magnus and Byted Streaming, and see demonstrations of how these solutions enable efficient and robust distributed training at scale.
Syllabus
A Unified Solution for Data Management and Model Training With Apache Iceberg and Mosaic Streaming
Taught by
Databricks