Overview
This program equips you with the open-source tools and architectural thinking used by professional data engineers to build scalable, reliable data systems from the ground up. You will work hands-on with Apache Spark for distributed data processing, dbt for modular SQL-based transformation, and Apache Airflow for workflow orchestration — the same stack powering data infrastructure at leading technology and data-driven organizations worldwide.
Across the courses, you will gain practical expertise in designing dimensional data models, implementing incremental load strategies, optimizing Spark job performance, enforcing data quality with automated testing frameworks, and deploying pipelines through CI/CD workflows. You will also develop foundational skills in cloud storage provisioning, containerization with Docker, and version control best practices that mirror real production environments.
By the end of this program, you will be able to design and deploy end-to-end data pipelines that ingest from diverse sources, transform data through well-tested models, and deliver analytics-ready datasets to downstream consumers — demonstrating job-ready engineering skills valued across analytics engineering, data platform, and data infrastructure roles.
Syllabus
- Course 1: Building Automated Data Pipelines with Spark, dbt, and Airflow
- Course 2: Optimizing Spark and Cloud Data Storage for Analytics
- Course 3: Data Modeling & Warehousing Fundamentals in Data Engineering
- Course 4: DevOps and CI/CD for Data Engineering Performance
- Course 5: Data Quality and Debugging for Reliable Pipelines
- Course 6: Career Development for Open Source Data Engineering
Courses
- Course 1: Building Automated Data Pipelines with Spark, dbt, and Airflow
You'll master the art of building production-ready data pipelines that automatically process millions of records. In this hands-on course, you'll design end-to-end workflows that integrate diverse data sources—from databases and APIs to real-time streams—using industry-standard tools like Apache Spark, dbt, and Apache Airflow. You'll learn to create robust data models that preserve historical changes, implement performance optimizations that reduce processing time by 30% or more, and build automated workflows with intelligent retry logic and monitoring alerts. By the end, you'll have created a complete data pipeline system that demonstrates the technical skills data engineering teams need most. You'll know how to unify fragmented data sources, apply advanced transformation techniques, and ensure your pipelines run reliably at scale. This practical experience directly translates to the challenges you'll face as a data engineer, data analyst, or anyone working with large-scale data systems in modern organizations.
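Orchestrators like Airflow attach retry logic and alerting to each task declaratively. As a rough, tool-agnostic sketch of the idea (plain Python, hypothetical task name, not the Airflow API), the retry-with-backoff behavior a workflow engine applies per task looks like:

```python
import time

def run_with_retries(task, retries=3, backoff_seconds=1.0):
    """Run a task callable, retrying with exponential backoff on failure.

    Mirrors the per-task retry behavior an orchestrator applies;
    plain Python here, for illustration only.
    """
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the failure for alerting
            time.sleep(backoff_seconds * 2 ** (attempt - 1))

# Hypothetical flaky extract step: fails twice, then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return ["record-1", "record-2"]

result = run_with_retries(flaky_extract, backoff_seconds=0.01)
print(result)  # → ['record-1', 'record-2'] after two retried failures
```

In a real DAG, the orchestrator also records each attempt and can fire a monitoring alert when retries are exhausted.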
- Course 6: Career Development for Open Source Data Engineering
You'll finish this course with a job-ready portfolio, a clear professional positioning strategy, and a concrete 30-day action plan to launch your data engineering career. You'll know how to present your pipeline-building skills in ways that resonate with hiring managers—and how to stand out in a competitive entry-level market. What makes this course unique is its focus on demonstrable capability over credentials. Rather than reviewing technical concepts, you'll learn how to translate your hands-on experience with Airflow, dbt, and Spark into a compelling resume, an optimized LinkedIn profile, and a GitHub portfolio that proves you can build production-style systems. You'll also practice real interview scenarios, develop structured responses to technical and behavioral questions, and build the communication skills that turn interviews into offers. Whether you're entering data engineering for the first time or transitioning from a related technical role, this course gives you the strategy and tools to connect your skills to market needs—confidently and effectively.
- Course 3: Data Modeling & Warehousing Fundamentals in Data Engineering
You'll learn the essential skills needed to design, build, and maintain robust data warehouses that power business intelligence and analytics. Through hands-on practice, you'll learn to create star schema data models that enable self-service reporting, apply database normalization techniques while preserving query performance, and use advanced SQL window functions for complex analytical calculations. You'll also gain expertise in configuring database replication for high availability and implementing incremental loading strategies to efficiently update large datasets. This comprehensive course integrates fundamental data engineering concepts with practical implementation skills, preparing you to build scalable data infrastructure that supports enterprise analytics. By combining data modeling theory with real-world database administration techniques, you'll develop the versatile skill set that data engineering professionals need to create reliable, performant data systems that drive business insights and decision-making across organizations.
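To give a flavor of the analytical SQL covered here, the following minimal sketch runs a window function against an illustrative fact table using Python's built-in sqlite3 module (table and column names are invented for the example; assumes a SQLite build with window-function support, available since 3.25):

```python
import sqlite3

# In-memory database with a small illustrative fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (region TEXT, sale_date TEXT, amount REAL);
    INSERT INTO fact_sales VALUES
        ('east', '2024-01-01', 100.0),
        ('east', '2024-01-02', 150.0),
        ('west', '2024-01-01', 200.0),
        ('west', '2024-01-02', 50.0);
""")

# Running total per region, ordered by date: a common analytical
# calculation expressed with a window function.
rows = conn.execute("""
    SELECT region, sale_date, amount,
           SUM(amount) OVER (
               PARTITION BY region ORDER BY sale_date
           ) AS running_total
    FROM fact_sales
    ORDER BY region, sale_date
""").fetchall()

for row in rows:
    print(row)
# → ('east', '2024-01-01', 100.0, 100.0)
#   ('east', '2024-01-02', 150.0, 250.0)
#   ('west', '2024-01-01', 200.0, 200.0)
#   ('west', '2024-01-02', 50.0, 250.0)
```

The same `PARTITION BY ... ORDER BY` pattern carries over to warehouse engines, where the fact table would typically join to dimension tables in a star schema.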
- Course 5: Data Quality and Debugging for Reliable Pipelines
You'll build the diagnostic and preventive skills that keep data pipelines trustworthy and production-ready. In this course, you'll learn to define automated data quality tests, trace anomalies back to their source, and apply advanced Python debugging techniques to resolve complex pipeline failures — three capabilities that employers consistently seek in data engineering roles. What sets this course apart is its end-to-end, practical focus: you won't just learn what data quality means — you'll write YAML test suites, navigate monitoring dashboards, analyze stack traces, and step through live code with debugging tools. Each skill builds toward a complete picture of pipeline reliability, from prevention to detection to resolution. By the end, you'll be equipped to catch data issues before they reach downstream consumers, communicate root causes clearly, and ship more dependable data products.
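In dbt, quality tests such as not-null and unique constraints are declared in YAML; the checks they compile down to can be sketched in plain Python (illustrative record and column names, not the dbt implementation):

```python
def check_not_null(rows, column):
    """Return indices of rows where `column` is missing (a not-null test)."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]

def check_unique(rows, column):
    """Return values of `column` that appear more than once (a uniqueness test)."""
    seen, dupes = set(), set()
    for row in rows:
        value = row.get(column)
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)

# Illustrative records containing one null and one duplicate key.
records = [
    {"order_id": 1, "customer": "a"},
    {"order_id": 2, "customer": None},
    {"order_id": 1, "customer": "c"},
]

print(check_not_null(records, "customer"))  # → [1]
print(check_unique(records, "order_id"))    # → [1]
```

Running such checks before data reaches downstream consumers is the "prevention" end of the prevention-detection-resolution arc this course covers.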
- Course 4: DevOps and CI/CD for Data Engineering Performance
You'll build the skills to manage, automate, and optimize production-grade data systems using industry-standard DevOps practices. By completing this course, you'll be able to resolve complex version control conflicts, design branching strategies for collaborative development, containerize data environments with Docker, automate infrastructure configuration with Ansible, deploy data pipelines through CI/CD workflows, and optimize query performance to maintain service levels. This course is unique because it bridges the gap between software engineering and data engineering — giving you hands-on experience with the exact tools and workflows used in real production environments. Rather than covering concepts in isolation, you'll integrate version control, containerization, automation, and performance tuning into a cohesive DevOps skillset that employers actively seek. Whether you're moving into a data engineering role or strengthening your current practice, you'll finish with portfolio-ready work that demonstrates job-ready capability.
- Course 2: Optimizing Spark and Cloud Data Storage for Analytics
You will master advanced performance optimization techniques for large-scale data processing using Apache Spark and cloud storage technologies. In this hands-on course, you'll learn to diagnose and resolve performance bottlenecks that plague distributed data systems, implement strategic partitioning and caching strategies that can improve job performance by 30% or more, and design secure, cost-effective cloud data infrastructure. You will gain expertise in transactional data lake technologies like Delta Lake, evaluate storage formats to optimize analytical workloads, and provision enterprise-grade cloud infrastructure with proper security controls. Through practical exercises, you'll analyze Spark execution plans, implement data versioning and ACID transactions, and benchmark different storage formats to make informed architectural decisions. By the end, you will have the skills to optimize data pipelines at scale, reduce cloud storage costs through intelligent format selection, and build robust data infrastructure that meets enterprise security requirements. This expertise directly addresses the performance challenges faced by data engineers working with petabyte-scale datasets in production environments.
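One reason partitioning speeds up analytical queries is partition pruning: the engine skips whole chunks of data that cannot match a filter. A minimal plain-Python sketch of the idea (an invented date-partitioned layout, not the Spark API):

```python
# Data laid out by partition key, as a date-partitioned table would be
# on disk (e.g. .../sale_date=2024-01-01/part-0.parquet).
partitions = {
    "2024-01-01": [("east", 100.0), ("west", 200.0)],
    "2024-01-02": [("east", 150.0)],
    "2024-01-03": [("west", 50.0)],
}

def scan(partitions, date_filter=None):
    """Read rows, skipping whole partitions that cannot match the filter."""
    partitions_read = 0
    rows = []
    for key, part in partitions.items():
        if date_filter is not None and key != date_filter:
            continue  # partition pruned: its files are never opened
        partitions_read += 1
        rows.extend(part)
    return rows, partitions_read

rows, partitions_read = scan(partitions, date_filter="2024-01-02")
print(rows, partitions_read)  # → [('east', 150.0)] 1
```

In Spark this pruning shows up in the physical plan as a reduced set of scanned files, which is exactly the kind of evidence the course's execution-plan analysis exercises teach you to read.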
Taught by
Professionals from the Industry