
Coursera

Data Engineering with Scala and Spark

Packt via Coursera

Overview

This course is designed to equip data engineers with the skills to build scalable and efficient data pipelines using Scala and Spark. Data engineers will learn best practices for development, testing, and deployment in cloud environments, with a focus on optimizing performance and ensuring data quality. The course provides the tools needed to transform raw data into actionable insights, making it highly relevant in today's data-driven world.

Throughout the course, learners will improve their data engineering skills by mastering techniques for building both streaming and batch data pipelines. The content emphasizes practical outcomes such as performance tuning and data profiling. With hands-on examples and step-by-step guidance, learners will gain a solid understanding of real-time and batch processing pipelines. What makes this course unique is its combination of foundational theory and real-world applications. By the end, you will be able to use Scala and Spark to process large datasets and optimize pipelines in cloud environments effectively.

This course is ideal for data engineers with some experience in data processing. While it assumes familiarity with data engineering concepts and cloud technologies, anyone eager to improve their skills in Scala and Spark will benefit from the practical, step-by-step approach.

Syllabus

  • Scala Essentials for Data Engineers
    • In this section, we explore functional programming, higher-order functions, polymorphic functions, and pattern matching in Scala for data engineering applications.
  • Environment Setup
    • In this section, we explore cloud-based and local environments for data engineering pipelines, focusing on setup processes, trade-offs, and practical applications.
  • An Introduction to Apache Spark and Its APIs: DataFrame, Dataset, and Spark SQL
    • In this section, we explore Apache Spark's APIs, focusing on DataFrame and Dataset for distributed data processing.
  • Working with Databases
    • In this section, we explore using Spark JDBC API for database access, designing database interfaces, and performing operations with configuration loading.
  • Object Stores and Data Lakes
    • In this section, we explore object stores, data lakes, and lakehouses, focusing on their roles in managing large-scale data workflows efficiently.
  • Understanding Data Transformation
    • In this section, we explore Spark transformations, aggregations, joins, and window functions to enhance data processing for BI and analytics. Key concepts include efficient data manipulation and pipeline development.
  • Data Profiling and Data Quality
    • In this section, we explore Deequ for implementing data quality checks, analyzing completeness and accuracy, and defining constraints to ensure reliable data pipelines.
  • Test-Driven Development, Code Health, and Maintainability
    • In this section, we explore test-driven development, static code analysis, and linting to improve code quality, maintainability, and consistency in data engineering projects.
  • CI/CD with GitHub
    • In this section, we explore CI/CD practices with GitHub to automate Scala data pipeline workflows, focusing on GitHub Actions, version control, and reliable deployment processes.
  • Data Pipeline Orchestration
    • In this section, we explore data pipeline orchestration using tools like Airflow, Argo, Databricks, and Azure Data Factory. We focus on workflow design, task management, and real-world implementation strategies.
  • Performance Tuning
    • In this section, we analyze Spark UI metrics to identify performance issues, optimize data shuffling, and right-size compute resources for efficient data processing.
  • Building Batch Pipelines Using Spark and Scala
    • In this section, we explore building batch pipelines using Spark and Scala, focusing on medallion architecture, data ingestion, transformation, and orchestration for scalable data processing.
  • Building Streaming Pipelines Using Spark and Scala
    • In this section, we explore building real-time data pipelines using Spark, Scala, and Kafka for IoT applications. Key concepts include data ingestion, transformation, and serving layer design.
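To give a flavor of the Scala essentials covered in the first section, here is a minimal sketch of higher-order functions and pattern matching applied to a data-cleaning step. The record format and helper names are illustrative, not taken from the course materials:

```scala
// A higher-order function: takes a transformation and applies it to each record.
def cleanRecords(records: List[String])(transform: String => String): List[String] =
  records.map(transform)

// Pattern matching over an algebraic data type to classify raw field values.
sealed trait FieldValue
case class NumericField(value: Double) extends FieldValue
case class TextField(value: String) extends FieldValue
case object EmptyField extends FieldValue

def parseField(raw: String): FieldValue = raw.trim match {
  case ""                                        => EmptyField
  case s if s.forall(c => c.isDigit || c == '.') => NumericField(s.toDouble)
  case s                                         => TextField(s)
}

val cleaned = cleanRecords(List("  42 ", "hello", " "))(_.trim)
// cleaned == List("42", "hello", "")
val parsed = cleaned.map(parseField)
// parsed == List(NumericField(42.0), TextField("hello"), EmptyField)
```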
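The Spark APIs section contrasts the untyped DataFrame API, the typed Dataset API, and Spark SQL. A minimal local-mode sketch of the three (schema and data are placeholders, not course examples):

```scala
import org.apache.spark.sql.SparkSession

object SparkApiSketch {
  // A case class gives the Dataset API a compile-time-checked schema.
  case class Trip(city: String, distanceKm: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("api-sketch")
      .master("local[*]") // local mode for experimentation
      .getOrCreate()
    import spark.implicits._

    // Untyped DataFrame API: columns referenced by name.
    val df = Seq(("Oslo", 12.5), ("Bergen", 3.2)).toDF("city", "distanceKm")
    df.filter($"distanceKm" > 5.0).show()

    // Typed Dataset API: field access is checked at compile time.
    val ds = df.as[Trip]
    ds.filter(_.distanceKm > 5.0).show()

    // Spark SQL over the same data via a temporary view.
    df.createOrReplaceTempView("trips")
    spark.sql("SELECT city FROM trips WHERE distanceKm > 5.0").show()

    spark.stop()
  }
}
```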
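For the databases section, a hedged sketch of reading a table through Spark's JDBC API. The URL, table name, and credentials are placeholders; in the course these would come from loaded configuration:

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

object JdbcSketch {
  def readOrders(spark: SparkSession): DataFrame = {
    val props = new Properties()
    props.setProperty("user", "etl_user") // placeholder credential
    props.setProperty("password", sys.env.getOrElse("DB_PASSWORD", ""))
    props.setProperty("driver", "org.postgresql.Driver")

    // Reads the whole table as a DataFrame; partitioning options can be
    // added for parallel reads of large tables.
    spark.read.jdbc(
      url = "jdbc:postgresql://localhost:5432/sales", // placeholder URL
      table = "public.orders",
      properties = props
    )
  }
}
```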
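The transformation section covers aggregations, joins, and window functions. A small illustrative sketch combining the three (column names and values are invented for the example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object TransformSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("transforms").getOrCreate()
    import spark.implicits._

    val sales = Seq(
      ("2024-01-01", "books", 100.0),
      ("2024-01-02", "books", 150.0),
      ("2024-01-01", "music", 80.0)
    ).toDF("day", "category", "amount")

    // Aggregation: total sales per category.
    val totals = sales.groupBy("category").agg(sum("amount").as("total"))

    // Window function: running total within each category, ordered by day.
    val byCategory = Window.partitionBy("category").orderBy("day")
    val running = sales.withColumn("runningTotal", sum("amount").over(byCategory))

    // Join the per-category totals back onto the detail rows.
    running.join(totals, Seq("category")).show()

    spark.stop()
  }
}
```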
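The data-quality section uses Deequ. A minimal sketch of the kind of completeness and uniqueness constraints it supports (dataset and thresholds are illustrative):

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

object QualitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("deequ").getOrCreate()
    import spark.implicits._

    val users = Seq((1, "a@example.com"), (2, "b@example.com"), (3, null))
      .toDF("id", "email")

    // Declarative constraints in the spirit of the course's completeness/accuracy checks.
    val result = VerificationSuite()
      .onData(users)
      .addCheck(
        Check(CheckLevel.Error, "basic checks")
          .isComplete("id")                   // no nulls in id
          .isUnique("id")                     // id is a key
          .hasCompleteness("email", _ >= 0.9) // at least 90% of emails present
      )
      .run()

    if (result.status != CheckStatus.Success)
      println("Data quality checks failed")

    spark.stop()
  }
}
```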
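For the batch-pipelines section, a hedged sketch of a medallion-style flow: raw data lands in a bronze layer, is cleaned into silver, and aggregated into gold. Paths and schemas are placeholders; in practice the layers would live in an object store or lakehouse:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._

object BatchPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("batch").getOrCreate()

    // Bronze: raw ingestion, stored as-is.
    val bronze = spark.read.option("header", "true").csv("data/raw/orders.csv")
    bronze.write.mode(SaveMode.Overwrite).parquet("data/bronze/orders")

    // Silver: cleaned and typed.
    val silver = spark.read.parquet("data/bronze/orders")
      .filter(col("order_id").isNotNull)
      .withColumn("amount", col("amount").cast("double"))
    silver.write.mode(SaveMode.Overwrite).parquet("data/silver/orders")

    // Gold: aggregated, ready for analytics.
    silver.groupBy("customer_id")
      .agg(sum("amount").as("lifetime_value"))
      .write.mode(SaveMode.Overwrite).parquet("data/gold/customer_value")

    spark.stop()
  }
}
```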
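Finally, the streaming section combines Spark Structured Streaming with Kafka. A minimal ingest-transform-serve sketch; topic name, broker address, and the "sensorId,temperature" payload format are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("iot-stream").getOrCreate()
    import spark.implicits._

    // Ingest: read sensor events from a Kafka topic.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "iot-sensors")
      .load()

    // Transform: Kafka values are bytes; cast to string and split the payload.
    val parsed = raw.selectExpr("CAST(value AS STRING) AS payload")
      .select(
        split($"payload", ",").getItem(0).as("sensorId"),
        split($"payload", ",").getItem(1).cast("double").as("temperature")
      )

    // Serve: running average per sensor, written to the console sink for demonstration.
    val query = parsed.groupBy($"sensorId")
      .agg(avg($"temperature").as("avgTemp"))
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```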

Taught by

Packt - Course Instructors
