Data Lakes and Lakehouses on AWS

Overview

Learn how to manage raw and semi-structured data at scale using AWS data lakes and lakehouses. Ingest data into S3, register schemas with Glue Data Catalog, and query data flexibly with Athena. Process large datasets using Spark to transform, clean, and aggregate data for analytics. Implement lakehouse tables with Iceberg to combine the flexibility of data lakes with the structure of data warehouses. Support schema evolution and ensure data remains queryable as requirements change.

Syllabus

Introduction to Data Lakes and Lakehouses

Differentiate between the data warehouse, data lake, and data lakehouse paradigms for structured and unstructured data, and learn medallion architecture principles

Building Data Lakes on AWS

Set up Amazon S3 as the data lake foundation, ingest PostgreSQL change data capture (CDC) via AWS DMS, catalog the data with Glue, query it with Athena, and configure Lake Formation governance

Processing Data in the Lake with Spark

Transform bronze (raw) data to silver (clean/enriched) data with PySpark, then aggregate to gold KPIs (business metrics). Optimize your Spark queries with partitioning and caching.
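To give a feel for the bronze, silver, and gold stages, here is a minimal sketch in plain Python rather than PySpark, so it runs without a cluster. The record layout (`order_id`, `country`, `amount`) and the function names `to_silver` and `to_gold` are hypothetical and are not taken from the course material.

```python
from collections import defaultdict

# Hypothetical bronze records: raw order events as they might land in the lake.
bronze = [
    {"order_id": "1", "country": " us ", "amount": "19.99"},
    {"order_id": "2", "country": "DE", "amount": "5.00"},
    {"order_id": "2", "country": "DE", "amount": "5.00"},  # duplicate row
    {"order_id": "3", "country": "US", "amount": None},     # missing amount
    {"order_id": "4", "country": "us", "amount": "10.01"},
]


def to_silver(rows):
    """Clean raw rows: drop incomplete rows, deduplicate by order_id, fix types."""
    seen = set()
    silver = []
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].strip().upper(),
            # Store money as integer cents to avoid float drift when summing.
            "amount_cents": round(float(row["amount"]) * 100),
        })
    return silver


def to_gold(rows):
    """Aggregate cleaned rows into a KPI: revenue in cents per country."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["country"]] += row["amount_cents"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In PySpark, the same steps would typically map to DataFrame operations such as `dropna`, `dropDuplicates`, `withColumn`, and `groupBy(...).agg(...)`.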

Implementing Lakehouse Tables

Create Iceberg tables and S3 Tables with AWS Glue, implement ACID MERGE for CDC upserts, and query historical snapshots (time travel)
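The idea behind CDC upserts and time travel can be sketched in plain Python. This is only an illustration of the semantics, not the Iceberg or AWS Glue API: the change-record shape, the op codes `I`, `U`, and `D`, and the `merge_cdc` helper are all assumptions made for the example.

```python
def merge_cdc(table, changes):
    """Return a new table state after applying CDC changes keyed by 'id'.

    Assumed op codes: 'I' insert, 'U' update, 'D' delete.
    Inserts and updates are both treated as upserts.
    """
    merged = dict(table)  # copy, so earlier snapshots are left untouched
    for change in changes:
        key = change["id"]
        if change["op"] == "D":
            merged.pop(key, None)        # delete if present
        else:
            merged[key] = change["row"]  # insert or overwrite
    return merged


# Each committed batch produces a new snapshot; old ones remain readable.
snapshots = [{}]

batch1 = [
    {"op": "I", "id": 1, "row": {"name": "Ada"}},
    {"op": "I", "id": 2, "row": {"name": "Bob"}},
]
batch2 = [
    {"op": "U", "id": 1, "row": {"name": "Ada L."}},
    {"op": "D", "id": 2},
]

for batch in (batch1, batch2):
    snapshots.append(merge_cdc(snapshots[-1], batch))

current = snapshots[-1]   # latest table state
as_of_first = snapshots[1]  # "time travel" to the state after batch1
print(current, as_of_first)
```

A lakehouse table format achieves a similar effect at scale by committing each merge atomically as a new snapshot, which is what makes querying earlier table states possible.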

Exastore Data Lakehouse on AWS

Implement a full lakehouse solution with CDC pipeline ingestion, batch data ingestion, Spark ETL processing, medallion architecture, and lakehouse tables