Overview
Build cloud-based data warehouses that power analytical workloads. Learn dimensional modeling techniques, including star and snowflake schemas, fact grain, and surrogate keys, to structure data for efficient OLAP queries. Use Python and SQL to build ETL pipelines that extract from diverse source systems like PostgreSQL, Cassandra, and Neo4j, clean and conform data across sources, and load it into Amazon Redshift. Optimize table performance with distribution styles, sort keys, and compression to speed up queries at scale. Create materialized views that pre-compute common aggregations so analysts get fast answers without recalculating. Validate data quality to ensure your warehouse is accurate, complete, and production-ready.
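To make the dimensional modeling terms above concrete, here is a minimal sketch of a star schema with a surrogate key. It is not taken from the course materials: the table and column names (`dim_customer`, `fact_sales`, `customer_key`) are hypothetical, and SQLite stands in for Redshift so the example runs locally. The Redshift-style DDL string is included for reference only and is never executed.

```python
import sqlite3

# Reference only (not executed): how the fact table might be declared in
# Redshift, with a distribution key, a sort key, and column compression.
# All names here are hypothetical.
REDSHIFT_FACT_DDL = """
CREATE TABLE fact_sales (
    customer_key INTEGER NOT NULL ENCODE az64,
    order_date   DATE NOT NULL,
    amount       DECIMAL(12,2) ENCODE az64
)
DISTSTYLE KEY
DISTKEY (customer_key)
SORTKEY (order_date);
"""


def build_star_schema(conn):
    """Create one dimension and one fact table (SQLite stand-in)."""
    conn.executescript("""
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY AUTOINCREMENT, -- surrogate key
            customer_id  TEXT NOT NULL UNIQUE,              -- natural key from the source
            region       TEXT NOT NULL
        );
        -- Fact grain in this sketch: one row per customer order.
        CREATE TABLE fact_sales (
            customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
            order_date   TEXT NOT NULL,
            amount       REAL NOT NULL
        );
    """)


def load_sample(conn):
    """Load dimensions first, then look up surrogate keys while loading facts."""
    customers = [("C-100", "EU"), ("C-200", "US")]
    conn.executemany(
        "INSERT INTO dim_customer (customer_id, region) VALUES (?, ?)", customers
    )
    orders = [
        ("C-100", "2024-01-05", 20.0),
        ("C-100", "2024-01-06", 30.0),
        ("C-200", "2024-01-05", 50.0),
    ]
    for customer_id, order_date, amount in orders:
        (key,) = conn.execute(
            "SELECT customer_key FROM dim_customer WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO fact_sales VALUES (?, ?, ?)", (key, order_date, amount)
        )


def revenue_by_region(conn):
    """A typical OLAP-style query: aggregate facts, grouped by a dimension attribute."""
    rows = conn.execute("""
        SELECT d.region, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_customer AS d ON d.customer_key = f.customer_key
        GROUP BY d.region
        ORDER BY d.region
    """).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
load_sample(conn)
print(revenue_by_region(conn))
```

The fact table stores only the surrogate `customer_key`, so a change to a customer's source identifier would not require rewriting fact rows.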
Syllabus
- Introduction to Data Warehousing
  - Explore how data warehouses unify scattered operational systems into a single source of truth. Compare OLTP and OLAP workloads, and cover dimensional modeling basics, including star and snowflake schemas.
- Dimensional Modeling for Analytics
  - Define fact grain, build fact and dimension tables with surrogate keys, and write Redshift DDL that specifies distribution styles, sort keys, and compression for performance.
- Extracting and Transforming Source Data
  - Build ETL pipelines in Python that pull data from PostgreSQL, Cassandra, and Neo4j, then clean and conform the data and derive dimensions to produce a consistent, warehouse-ready dataset.
- Loading, Optimization, and Validation in Redshift
  - Stage data through S3, load it with the COPY command, optimize tables for parallel query performance, create materialized views, and run data quality validation checks.
- Build a Multi-Source E-commerce Analytics Warehouse in Redshift
  - Design a star schema and an end-to-end ETL pipeline that integrates e-commerce data from three source systems into an optimized, validated Redshift warehouse.
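As an illustration of the data quality validation step named in the syllabus, the following sketch runs three common checks against a small warehouse: null foreign keys, fact rows that point at no dimension row, and duplicate natural keys. It is an assumption-laden example, not the course's own checks: the function `run_quality_checks`, the table names, and the sample rows are hypothetical, and SQLite again stands in for Redshift.

```python
import sqlite3


def run_quality_checks(conn):
    """Return {check name: number of offending rows}; 0 means the check passed."""
    checks = {
        # Completeness: every fact row should carry a dimension key.
        "null_customer_key": """
            SELECT COUNT(*) FROM fact_sales WHERE customer_key IS NULL
        """,
        # Referential integrity: every non-null key should match a dimension row.
        "orphaned_fact_rows": """
            SELECT COUNT(*)
            FROM fact_sales AS f
            LEFT JOIN dim_customer AS d ON d.customer_key = f.customer_key
            WHERE f.customer_key IS NOT NULL AND d.customer_key IS NULL
        """,
        # Uniqueness: each natural key should appear once in the dimension.
        "duplicate_natural_keys": """
            SELECT COUNT(*) FROM (
                SELECT customer_id
                FROM dim_customer
                GROUP BY customer_id
                HAVING COUNT(*) > 1
            )
        """,
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


# Hypothetical sample data with one deliberate defect per check.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER, customer_id TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'C-100'), (2, 'C-200'), (3, 'C-200');
    INSERT INTO fact_sales VALUES (1, 20.0), (NULL, 5.0), (99, 7.5);
""")

results = run_quality_checks(conn)
failed = sorted(name for name, count in results.items() if count > 0)
print(results)
print("failed checks:", failed)
```

In a real pipeline, a non-empty `failed` list would typically stop the load or raise an alert before analysts query the affected tables.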
Taught by
Matt Swaffer and Valerie Scarlata