Build ETL pipelines with AWS Glue and PySpark, convert raw JSON to Parquet, and run fast analytics with Amazon Athena. Learn to manage data across raw, processed, and curated zones and automate workflows to deliver business-ready insights from your data lake.
Syllabus
- Unit 1: Preparing Data for AWS
  - Complete the S3 Folder Structure
  - Upload JSON Data to S3
- Unit 2: Understanding the ETL Script
  - Initialize Your First ETL Script
  - Extract Data from S3 Storage
  - Handle Missing Values in Data
  - Complete Your ETL Data Pipeline
  - Deploy ETL Script to Cloud Storage
- Unit 3: Creating Glue ETL Jobs
  - Complete Your First Glue Job
  - Start Your First Job Run
  - Monitor Your Glue Job Runs
  - Check and Debug Your Glue Job Logs
  - Verify Your Parquet Output Files
- Unit 4: Cataloging Data with Glue Crawler
  - Configure Your First Glue Crawler
  - Start Your Glue Crawler
  - Monitor Crawler Until Completion
- Unit 5: Querying Data with Athena
  - Fix Your First Athena Query
  - Filter Overdue Books with SQL
  - Find Unique Patrons with DISTINCT
  - Handle Athena Query Failures Gracefully
  - Analyze Top Revenue Generating Genres
- Unit 6: Aggregating Data with Glue
  - Adding Multiple Aggregation Functions
  - Changing Grouping Strategy for Business Insights
  - Debugging Broken Aggregation Functions