Master Apache Airflow to build, schedule, and monitor data pipelines on AWS. Start with Airflow fundamentals—DAGs, tasks, XCom, Jinja templating, branching, and runtime configuration—then apply production patterns like single-responsibility design, data intervals, asset-driven scheduling, and dynamic task mapping. Build complete ETL and ELT pipelines that move data through S3 into Amazon Redshift using SQL operators, template inheritance, and data constraint checks. Then construct a modern data lakehouse using S3, Glue, Iceberg, and Athena, automating ingestion and promotion through bronze, silver, and gold layers while handling schema evolution. Deploy pipelines to Amazon MWAA and apply monitoring and observability best practices for production environments.
Syllabus
- Introduction to Data Pipelines and Airflow
  - Discover how Apache Airflow orchestrates data pipelines as code. Author DAGs and tasks, pass data with XCom, configure runtime parameters, apply Jinja templating, and build branching dependencies.
- Data Lineage and Orchestration
  - Orchestrate production pipelines with schedules, data intervals, and catchup. Apply single-responsibility design, debug with flat-file snapshots, trigger DAGs on asset events, and map tasks dynamically.
- Orchestrating Warehouse Workflows with Amazon Redshift
  - Build end-to-end ETL and ELT pipelines that move data through S3 into Redshift. Use SQL operators, Jinja template inheritance, and data constraint checks, then deploy to production with Amazon MWAA.
- Orchestrating Lakehouse Workflows with AWS Glue and Athena
  - Build a lakehouse on AWS with S3, Glue, Iceberg, and Athena. Automate ingestion, handle schema evolution with crawlers, and promote data through bronze, silver, and gold layers.
- AWS Data Lakehouse Pipeline for Sparkify
  - Design an event-driven lakehouse with Airflow, S3, Glue, Iceberg, and Athena. Build three asset-triggered DAGs that ingest, transform, and promote data through raw, transaction, and analytics layers.
Taught by
Sean Murdock