Building Distributed Modern Data Lakehouse From Scratch with Apache Iceberg - An End to End Project

Learn to build a distributed data lakehouse from the ground up using Apache Iceberg, Trino, Airflow, DBT, MinIO, and Project Nessie in this hands-on tutorial. Start with the high-level architecture of a modern data lakehouse and the fundamentals of Apache Iceberg before moving on to the practical implementation. Set up a distributed Trino cluster with a coordinator-worker (master-worker) architecture, orchestrate data pipelines with Apache Airflow, and build data transformations with DBT following the Medallion Architecture pattern. Use MinIO for object storage, manage versioned metadata through Project Nessie, and optimize Trino query performance for production environments. Along the way, explore distributed systems concepts, pipeline orchestration techniques, and lakehouse best practices through a real-world implementation. Run and analyze queries in DataGrip while picking up practical query optimization and performance tuning strategies for large-scale data processing systems.

Syllabus

0:00 Introduction
1:02 High Level System Architecture Walkthrough
12:00 What is Apache Iceberg?
24:00 Setting up Distributed Data Lakehouse from Scratch
40:15 Apache Airflow DAG Pipeline
1:46:10 DBT Project Setup with Medallion Architecture
1:30:55 Trino Cluster Optimisation
1:32:37 Trino Query Engine with DataGrip
1:41:17 Results and Discussion
1:55:00 Outro