Learn to design, build, and automate data systems on AWS. Start by modeling data across relational, document, and graph databases (PostgreSQL, MongoDB, and Neo4j) and understanding the tradeoffs of each paradigm. Then build cloud data warehouses in Amazon Redshift: design dimensional schemas, write ETL pipelines that extract from diverse sources, tune query performance, and validate data quality. Explore modern lakehouse architecture with S3, Glue, Iceberg, and Athena, processing data through bronze, silver, and gold layers using Apache Spark. Finally, orchestrate production pipelines with Apache Airflow: scheduling workflows, managing data lineage, and deploying to Amazon MWAA. By the end of this program, you'll be ready to engineer end-to-end data platforms that scale.
Syllabus
- Data Modeling with AWS
- Master the fundamentals of data modeling across relational, document, and graph databases. Learn how databases store, query, and enforce structure, and understand the design considerations for OLTP vs. OLAP workloads. Design normalized schemas in PostgreSQL to prevent insert, update, and delete anomalies, then model flexible document collections in MongoDB using embedding and referencing strategies. Build graph models in Neo4j with Cypher to represent and traverse connected data. Compare ACID guarantees across all three paradigms, explore managed cloud services on AWS, and apply your skills by designing a complete multi-database backend for a growing e-commerce company.
- Data Warehouses on AWS
- Build cloud-based data warehouses that power analytical workloads. Learn dimensional modeling techniques—including star and snowflake schemas, fact grain, and surrogate keys—to structure data for efficient OLAP queries. Use Python and SQL to build ETL pipelines that extract from diverse source systems like PostgreSQL, Cassandra, and Neo4j, clean and conform data across sources, and load it into Amazon Redshift. Optimize table performance with distribution styles, sort keys, and compression to speed up queries at scale. Create materialized views that pre-compute common aggregations so analysts get fast answers without recalculating. Validate data quality to ensure your warehouse is accurate, complete, and production-ready.
- Data Lakes and Lakehouses on AWS
- Learn how to manage raw and semi-structured data at scale using AWS data lakes and lakehouses. Ingest data into S3, register schemas with Glue Data Catalog, and query data flexibly with Athena. Process large datasets using Spark to transform, clean, and aggregate data for analytics. Implement lakehouse tables with Iceberg to combine the flexibility of data lakes with the structure of data warehouses. Support schema evolution and ensure data remains queryable as requirements change.
- AWS Data Pipelines and Orchestration with Airflow
- Master Apache Airflow to build, schedule, and monitor data pipelines on AWS. Start with Airflow fundamentals—DAGs, tasks, XCom, Jinja templating, branching, and runtime configuration—then apply production patterns like single-responsibility design, data intervals, asset-driven scheduling, and dynamic task mapping. Build complete ETL and ELT pipelines that move data through S3 into Amazon Redshift using SQL operators, template inheritance, and data constraint checks. Then construct a modern data lakehouse using S3, Glue, Iceberg, and Athena, automating ingestion and promotion through bronze, silver, and gold layers while handling schema evolution. Deploy pipelines to Amazon MWAA and apply monitoring and observability best practices for production environments.
Taught by
Koosha Totonchi, Chester Ismay, Eduardo Mota, and Jo-L Collins