Building Distributed Modern Data Lakehouse From Scratch with Apache Iceberg - An End to End Project
CodeWithYu via YouTube
Overview
Learn to build a distributed data lakehouse from the ground up using Apache Iceberg, Trino, Airflow, DBT, MinIO, and Project Nessie in this hands-on tutorial. Start with the high-level architecture of a modern data lakehouse and the fundamentals of Apache Iceberg before moving on to the implementation. Set up a distributed Trino cluster with a coordinator-worker architecture, orchestrate data pipelines with Apache Airflow, and build data transformations with DBT following the Medallion Architecture pattern. Integrate MinIO for object storage, manage versioned metadata through Project Nessie, and tune Trino query performance for production environments. Along the way, the project covers distributed systems concepts, pipeline orchestration techniques, and lakehouse best practices. Queries are executed and analyzed in DataGrip, with practical attention to query optimization and performance tuning for large-scale data processing.
Syllabus
0:00 Introduction
1:02 High Level System Architecture Walkthrough
12:00 What is Apache Iceberg?
24:00 Setting up Distributed Data Lakehouse from Scratch
40:15 Apache Airflow DAG Pipeline
1:46:10 DBT Project Setup with Medallion Architecture
1:30:55 Trino Cluster Optimisation
1:32:37 Trino Query Engine with DataGrip
1:41:17 Results and Discussion
1:55:00 Outro
Taught by
CodeWithYu