Overview
Learn how to manage raw and semi-structured data at scale using AWS data lakes and lakehouses. Ingest data into S3, register schemas with Glue Data Catalog, and query data flexibly with Athena. Process large datasets using Spark to transform, clean, and aggregate data for analytics. Implement lakehouse tables with Iceberg to combine the flexibility of data lakes with the structure of data warehouses. Support schema evolution and ensure data remains queryable as requirements change.
Syllabus
- Introduction to Data Lakes and Lakehouses
  - Differentiate between the data warehouse, data lake, and data lakehouse paradigms for structured and unstructured data, and learn medallion architecture principles
- Building Data Lakes on AWS
  - Set up Amazon S3 as a data lake foundation, ingest PostgreSQL change data capture (CDC) via AWS DMS, catalog with Glue, query with Athena, and configure Lake Formation governance
- Processing Data in the Lake with Spark
  - Transform bronze (raw) data to silver (clean/enriched) data with PySpark, aggregate to gold KPIs (business metrics), and optimize Spark queries with partitioning and caching
- Implementing Lakehouse Tables
  - Create Iceberg tables and S3 Tables with AWS Glue, implement ACID MERGE for CDC upserts, and query historical snapshots (time travel)
- Exastore Data Lakehouse on AWS
  - Implement a full lakehouse solution with CDC pipeline ingestion, batch data ingestion, Spark ETL processing, medallion architecture, and lakehouse tables
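To give a rough feel for what "ACID MERGE for CDC upserts" and "time travel" mean in the syllabus above, here is a minimal plain-Python sketch. It is not course material and does not use Iceberg, Spark, or any AWS API. The function `apply_cdc`, the `op` and `id` field names, and the sample rows are all hypothetical, chosen only for illustration. The `I`/`U`/`D` operation codes mirror a common convention for CDC output.

```python
from copy import deepcopy


def apply_cdc(target, changes):
    """Return a new table state after applying CDC rows in order.

    target:  dict mapping primary key -> row dict (hypothetical table)
    changes: list of dicts with an "op" code ("I", "U", or "D"),
             an "id" primary key, and any other row fields
    """
    merged = deepcopy(target)  # never mutate the previous snapshot
    for change in changes:
        key = change["id"]
        if change["op"] == "D":
            # Matched row with a delete op: remove it if present.
            merged.pop(key, None)
        else:
            # Insert or update: matched rows are overwritten,
            # unmatched rows are added (upsert semantics).
            merged[key] = {k: v for k, v in change.items() if k != "op"}
    return merged


# Each entry in `snapshots` is an immutable view of the table at one point in time.
snapshots = [
    {1: {"id": 1, "status": "new"}, 2: {"id": 2, "status": "new"}},
]

batch = [
    {"op": "U", "id": 1, "status": "shipped"},
    {"op": "D", "id": 2},
    {"op": "I", "id": 3, "status": "new"},
]

snapshots.append(apply_cdc(snapshots[-1], batch))

current = snapshots[-1]   # latest table state
previous = snapshots[0]   # "time travel" to the earlier snapshot
```

Because every merge produces a new snapshot instead of editing the old one, the earlier state stays queryable. That is the idea behind querying historical snapshots, shown here only in toy form.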
Taught by
Sean Murdock - Instructor