
Apache Iceberg: From Zero to Production Data Lakehouse

Snowflake via Coursera

Overview

This course is designed for data engineers, analytics engineers, data platform engineers, and data architects who work with data lakes and want to modernize their data infrastructure. It is also valuable for software engineers transitioning into data roles and for technical leads evaluating Apache Iceberg for their data platforms.

By the end of this course, you will be able to:

  • Build and configure an Apache Iceberg lakehouse using catalogs, object storage, and query engines such as Spark and Trino
  • Design optimal table structures using hidden partitioning, sort orders, and column metrics to maximize query performance
  • Migrate existing data from Hive tables, Parquet files, CSV, and databases into Iceberg using snapshot, migrate, and reserialization approaches
  • Implement production workflows using Write-Audit-Publish for validation, branching for testing, and rollback for recovery
  • Evolve table schemas and partition specifications without downtime or rewriting data
  • Execute maintenance operations including data file compaction, metadata compaction, and snapshot expiration
  • Configure write strategies (merge-on-read vs. copy-on-write) and distribution modes for different workload requirements
  • Manage concurrent operations and avoid conflicts in multi-writer scenarios

To be successful in this course, you should have:

  • Working knowledge of SQL and relational database concepts (tables, schemas, queries)
  • A basic understanding of data engineering concepts, including ETL/ELT, data warehouses, and data lakes
  • Familiarity with command-line interfaces and Docker for running the course environment
  • Comfort reading and understanding code examples in Python/PySpark (code is provided; you don't need to write it from scratch)
  • Experience with Apache Spark or distributed computing, which is helpful but not required; core concepts are explained throughout the course

Apache Iceberg, Iceberg, Apache, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Syllabus

  • Apache Iceberg Fundamentals
    • Learn what Apache Iceberg is and how its metadata architecture enables powerful query optimizations. Build your first Iceberg lakehouse environment and explore how hidden partitioning and column metrics work together to skip unnecessary data during queries. Work with real NYC Taxi data to compare different partitioning strategies and measure their performance impact.
  • Taking Advantage of Apache Iceberg Tables
    • Move existing data into Iceberg using migration strategies for Parquet, Hive, CSV, and database sources. Master Git-like features including Write-Audit-Publish for validation, branching for safe experimentation, and tagging for marking milestones. Learn how to evolve both table schemas and partition specifications without downtime or rewriting data.
  • Operating and Optimizing Apache Iceberg
    • Optimize write performance and manage production Iceberg tables at scale. Understand streaming versus batch ingestion patterns, merge-on-read versus copy-on-write strategies, and how to handle concurrent operations safely. Execute essential maintenance operations including compaction and snapshot expiration to keep tables performant as they grow.
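To make the first module's ideas concrete, here is a minimal plain-Python sketch of hidden partitioning and column-metric file skipping. The `days` transform mirrors Iceberg's partition transform conceptually, but the `DataFile` class and function names are hypothetical illustrations, not Iceberg APIs, and no Spark environment is assumed.

```python
from dataclasses import dataclass
from datetime import date, datetime

def days_transform(ts: datetime) -> int:
    """Iceberg's `days` partition transform: days since the Unix epoch.
    The engine derives this value from the timestamp column itself, so
    queries never reference it directly (hence "hidden" partitioning)."""
    return (ts.date() - date(1970, 1, 1)).days

@dataclass
class DataFile:
    """Hypothetical stand-in for a data file's per-column min/max metrics."""
    path: str
    min_fare: float
    max_fare: float

def files_to_scan(files, fare_lower_bound):
    """Skip any file whose max fare is below the predicate's lower bound,
    analogous to the metric-based pruning Iceberg performs per file."""
    return [f.path for f in files if f.max_fare >= fare_lower_bound]
```

For example, with two files whose fare ranges are 0.0–9.5 and 10.0–80.0, a query filtering on `fare >= 20` would scan only the second file; the first is skipped without reading any data.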
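The Write-Audit-Publish pattern from the second module can be sketched with a toy branch model: stage writes on an audit branch, validate them, and only then make them visible on main. In real Iceberg this uses branch refs and snapshot management; the `ToyTable` class and its methods below are hypothetical, for illustration only.

```python
class ToyTable:
    """Toy model of Write-Audit-Publish using a dict of named branches."""

    def __init__(self, rows):
        # "main" is the branch downstream readers see.
        self.branches = {"main": list(rows)}

    def write(self, branch, new_rows):
        # WRITE: stage new data on an audit branch; main is untouched.
        self.branches[branch] = self.branches["main"] + list(new_rows)

    def audit(self, branch, check):
        # AUDIT: run a validation check against the staged branch only.
        return all(check(row) for row in self.branches[branch])

    def publish(self, branch):
        # PUBLISH: fast-forward main to the audited branch's state.
        self.branches["main"] = self.branches[branch]
```

The key property is that a failed audit leaves main exactly as it was: bad data is staged and discarded without readers ever seeing it.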
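The merge-on-read versus copy-on-write trade-off from the third module can also be shown in miniature, with plain Python lists standing in for data and delete files. This models only the cost trade-off (heavier writes vs. heavier reads); it is not Iceberg code.

```python
def delete_copy_on_write(data_file, predicate):
    """CoW: rewrite the entire data file without the matching rows.
    Expensive at write time, but readers just scan the new file."""
    return [row for row in data_file if not predicate(row)]

def delete_merge_on_read(delete_file, positions):
    """MoR: append deleted row positions to a small delete file.
    Cheap at write time; nothing in the data file is rewritten."""
    return delete_file + list(positions)

def read_merge_on_read(data_file, delete_file):
    """MoR readers pay the cost: merge data and delete files on every
    scan, filtering out rows whose positions appear in delete files."""
    deleted = set(delete_file)
    return [row for pos, row in enumerate(data_file) if pos not in deleted]
```

This is also why compaction matters: periodically rewriting merge-on-read tables folds accumulated delete files back into data files, restoring copy-on-write read performance.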

Taught by

Snowflake Northstar

Reviews

