
Coursera

Data Engineering with Scala and Spark

Packt via Coursera

Overview

This course is designed to equip data engineers with the skills to build scalable and efficient data pipelines using Scala and Spark. Data engineers will learn best practices for development, testing, and deployment in cloud environments, with a focus on optimizing performance and ensuring data quality. The course provides the tools needed to transform raw data into actionable insights, making it highly relevant in today's data-driven world.

Throughout the course, learners will improve their data engineering skills by mastering techniques for building both streaming and batch data pipelines. The content emphasizes practical outcomes such as performance tuning and data profiling. With hands-on examples and step-by-step guidance, learners will gain a solid understanding of real-time and batch processing pipelines. What makes this course unique is its combination of foundational theory and real-world applications. By the end, you will be able to use Scala and Spark to process large datasets and optimize pipelines in cloud environments effectively.

This course is ideal for data engineers with some experience in data processing. While it assumes familiarity with data engineering concepts and cloud technologies, anyone eager to improve their skills in Scala and Spark will benefit from the practical, step-by-step approach.

Syllabus

  • Scala Essentials for Data Engineers
    • In this section, we explore functional programming, higher-order functions, polymorphic functions, and pattern matching in Scala for data engineering applications.
  • Environment Setup
    • In this section, we explore cloud-based and local environments for data engineering pipelines, focusing on setup processes, trade-offs, and practical applications.
  • An Introduction to Apache Spark and Its APIs: DataFrame, Dataset, and Spark SQL
    • In this section, we explore Apache Spark's APIs, focusing on DataFrame and Dataset for distributed data processing.
  • Working with Databases
    • In this section, we explore using Spark JDBC API for database access, designing database interfaces, and performing operations with configuration loading.
  • Object Stores and Data Lakes
    • In this section, we explore object stores, data lakes, and lakehouses, focusing on their roles in managing large-scale data workflows efficiently.
  • Understanding Data Transformation
    • In this section, we explore Spark transformations, aggregations, joins, and window functions to enhance data processing for BI and analytics. Key concepts include efficient data manipulation and pipeline development.
  • Data Profiling and Data Quality
    • In this section, we explore Deequ for implementing data quality checks, analyzing completeness and accuracy, and defining constraints to ensure reliable data pipelines.
  • Test-Driven Development, Code Health, and Maintainability
    • In this section, we explore test-driven development, static code analysis, and linting to improve code quality, maintainability, and consistency in data engineering projects.
  • CI/CD with GitHub
    • In this section, we explore CI/CD practices with GitHub to automate Scala data pipeline workflows, focusing on GitHub Actions, version control, and reliable deployment processes.
  • Data Pipeline Orchestration
    • In this section, we explore data pipeline orchestration using tools like Airflow, Argo, Databricks, and Azure Data Factory. We focus on workflow design, task management, and real-world implementation strategies.
  • Performance Tuning
    • In this section, we analyze Spark UI metrics to identify performance issues, optimize data shuffling, and right-size compute resources for efficient data processing.
  • Building Batch Pipelines Using Spark and Scala
    • In this section, we explore building batch pipelines using Spark and Scala, focusing on medallion architecture, data ingestion, transformation, and orchestration for scalable data processing.
  • Building Streaming Pipelines Using Spark and Scala
    • In this section, we explore building real-time data pipelines using Spark, Scala, and Kafka for IoT applications. Key concepts include data ingestion, transformation, and serving layer design.
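To give a flavor of the Scala essentials covered in the first section, here is a minimal sketch of higher-order functions and pattern matching applied to a data-cleaning step. The record format and helper names are illustrative, not taken from the course materials:

```scala
// A higher-order function: takes a transformation and applies it to each record.
def cleanRecords(records: List[String])(transform: String => String): List[String] =
  records.map(transform)

// Pattern matching over an algebraic data type to classify raw field values.
sealed trait FieldValue
case class NumericField(value: Double) extends FieldValue
case class TextField(value: String) extends FieldValue
case object EmptyField extends FieldValue

def parseField(raw: String): FieldValue = raw.trim match {
  case ""                                        => EmptyField
  case s if s.forall(c => c.isDigit || c == '.') => NumericField(s.toDouble)
  case s                                         => TextField(s)
}

val cleaned = cleanRecords(List("  42 ", "hello", " "))(_.trim)
// cleaned == List("42", "hello", "")
val parsed = cleaned.map(parseField)
// parsed == List(NumericField(42.0), TextField("hello"), EmptyField)
```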
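The Spark APIs section contrasts the untyped DataFrame API, the typed Dataset API, and Spark SQL. A minimal local-mode sketch of the three (schema and data are placeholders, not course examples):

```scala
import org.apache.spark.sql.SparkSession

object SparkApiSketch {
  // A case class gives the Dataset API a compile-time-checked schema.
  case class Trip(city: String, distanceKm: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("api-sketch")
      .master("local[*]") // local mode for experimentation
      .getOrCreate()
    import spark.implicits._

    // Untyped DataFrame API: columns referenced by name.
    val df = Seq(("Oslo", 12.5), ("Bergen", 3.2)).toDF("city", "distanceKm")
    df.filter($"distanceKm" > 5.0).show()

    // Typed Dataset API: field access is checked at compile time.
    val ds = df.as[Trip]
    ds.filter(_.distanceKm > 5.0).show()

    // Spark SQL over the same data via a temporary view.
    df.createOrReplaceTempView("trips")
    spark.sql("SELECT city FROM trips WHERE distanceKm > 5.0").show()

    spark.stop()
  }
}
```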
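For the databases section, a hedged sketch of reading a table through Spark's JDBC API. The URL, table name, and credentials are placeholders; in the course these would come from loaded configuration:

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcSketch {
  def readOrders(spark: SparkSession): DataFrame = {
    val props = new Properties()
    props.setProperty("user", "etl_user") // placeholder credential
    props.setProperty("password", sys.env.getOrElse("DB_PASSWORD", ""))
    props.setProperty("driver", "org.postgresql.Driver")

    // Reads the whole table as a DataFrame; partitioning options can be
    // added for parallel reads of large tables.
    spark.read.jdbc(
      url = "jdbc:postgresql://localhost:5432/sales", // placeholder URL
      table = "public.orders",
      properties = props
    )
  }
}
```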
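The transformation section covers aggregations, joins, and window functions. A small illustrative sketch combining the three (column names and values are invented for the example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object TransformSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("transforms").getOrCreate()
    import spark.implicits._

    val sales = Seq(
      ("2024-01-01", "books", 100.0),
      ("2024-01-02", "books", 150.0),
      ("2024-01-01", "music", 80.0)
    ).toDF("day", "category", "amount")

    // Aggregation: total sales per category.
    val totals = sales.groupBy("category").agg(sum("amount").as("total"))

    // Window function: running total within each category, ordered by day.
    val byCategory = Window.partitionBy("category").orderBy("day")
    val running = sales.withColumn("runningTotal", sum("amount").over(byCategory))

    // Join the per-category totals back onto the detail rows.
    running.join(totals, Seq("category")).show()

    spark.stop()
  }
}
```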
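The data-quality section uses Deequ. A minimal sketch of the kind of completeness and uniqueness constraints it supports (dataset and thresholds are illustrative):

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

object QualitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("deequ").getOrCreate()
    import spark.implicits._

    val users = Seq((1, "a@example.com"), (2, "b@example.com"), (3, null))
      .toDF("id", "email")

    // Declarative constraints in the spirit of the course's completeness/accuracy checks.
    val result = VerificationSuite()
      .onData(users)
      .addCheck(
        Check(CheckLevel.Error, "basic checks")
          .isComplete("id")                   // no nulls in id
          .isUnique("id")                     // id is a key
          .hasCompleteness("email", _ >= 0.9) // at least 90% of emails present
      )
      .run()

    if (result.status != CheckStatus.Success)
      println("Data quality checks failed")

    spark.stop()
  }
}
```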
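For the batch-pipelines section, a hedged sketch of a medallion-style flow: raw data lands in a bronze layer, is cleaned into silver, and aggregated into gold. Paths and schemas are placeholders; in practice the layers would live in an object store or lakehouse:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._

object BatchPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("batch").getOrCreate()

    // Bronze: raw ingestion, stored as-is.
    val bronze = spark.read.option("header", "true").csv("data/raw/orders.csv")
    bronze.write.mode(SaveMode.Overwrite).parquet("data/bronze/orders")

    // Silver: cleaned and typed.
    val silver = spark.read.parquet("data/bronze/orders")
      .filter(col("order_id").isNotNull)
      .withColumn("amount", col("amount").cast("double"))
    silver.write.mode(SaveMode.Overwrite).parquet("data/silver/orders")

    // Gold: aggregated, ready for analytics.
    silver.groupBy("customer_id")
      .agg(sum("amount").as("lifetime_value"))
      .write.mode(SaveMode.Overwrite).parquet("data/gold/customer_value")

    spark.stop()
  }
}
```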
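Finally, the streaming section combines Spark Structured Streaming with Kafka. A minimal ingest-transform-serve sketch; topic name, broker address, and the "sensorId,temperature" payload format are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("iot-stream").getOrCreate()
    import spark.implicits._

    // Ingest: read sensor events from a Kafka topic.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "iot-sensors")
      .load()

    // Transform: Kafka values are bytes; cast to string and split the payload.
    val parsed = raw.selectExpr("CAST(value AS STRING) AS payload")
      .select(
        split($"payload", ",").getItem(0).as("sensorId"),
        split($"payload", ",").getItem(1).cast("double").as("temperature")
      )

    // Serve: running average per sensor, written to the console sink for demonstration.
    val query = parsed.groupBy($"sensorId")
      .agg(avg($"temperature").as("avgTemp"))
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```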

Taught by

Packt - Course Instructors
