Data Engineering with Databricks Cookbook

Packt via Coursera

Details

Provider: Coursera
Pricing: Paid Course
Languages: English
Certificate: Certificate Available
Effort: 11 weeks, 1 hour a week
Sessions: Self-Paced
Level: Intermediate
Subtitles: English

Overview

This course offers a hands-on approach to mastering data engineering using Apache Spark, Delta Lake, and Databricks. By combining these technologies, you will learn how to build robust, scalable data pipelines and implement effective data management strategies in real-world applications. With a focus on performance optimization, data orchestration, and modern data engineering practices, this course provides essential skills for professionals working in the data engineering space.

You’ll start by exploring data ingestion techniques using Apache Spark, followed by methods for transforming and managing data within a data lakehouse. Each section builds on the last, providing learners with actionable insights that can be directly applied to their workflows. The course also covers DataOps and DevOps practices to help you streamline and automate your data processes.

What sets this course apart is its emphasis on practical, real-world applications. You’ll work through concrete examples and recipes for managing data, from ingestion to transformation, ensuring that you can tackle data engineering challenges with confidence.

Ideal for data engineers, data scientists, and IT professionals with a background in SQL and Python, this course will help you enhance your skills in data pipeline orchestration and optimization.

Syllabus

Data Ingestion and Data Extraction with Apache Spark

This module introduces practical techniques for ingesting and extracting data from various formats such as CSV, JSON, and XML using Apache Spark. Learners will explore common challenges, data transformation functions, and methods for handling nested and complex data structures. By the end, participants will be equipped to efficiently process and manipulate diverse data sources in Spark.

Data Transformation and Data Manipulation with Apache Spark

This module introduces learners to essential data manipulation techniques using Apache Spark and PySpark, including filtering, joining, aggregating, and handling null values in large datasets. Learners will explore both standard and advanced operations such as approximate aggregations and nested window functions to efficiently process and analyze data. By the end, participants will be equipped to transform and manage data at scale using Spark's distributed computing capabilities.

Data Management with Delta Lake

This module introduces the core concepts and practical skills needed to manage data using Delta Lake, an open-source storage layer for lakehouse architectures. Learners will explore reading and merging data, implementing change data capture, optimizing tables, and leveraging versioning and time travel features to ensure data integrity and performance. Hands-on exercises will reinforce best practices for handling big data workloads with Delta Lake in Python.
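To give a rough feel for what merging change data and time travel mean, here is a minimal plain-Python sketch. It is not taken from the course and does not use the Delta Lake API; the `VersionedTable` class, its methods, and the event shape (`op`, `row`) are hypothetical names chosen for illustration only.

```python
class VersionedTable:
    """Toy in-memory table that keeps every committed snapshot (illustrative only)."""

    def __init__(self, rows, key="id"):
        self.key = key
        # Version 0 is the initial snapshot, indexed by primary key.
        self._versions = [{row[key]: dict(row) for row in rows}]

    def merge(self, changes):
        """Apply ordered change events and commit the result as a new version."""
        current = {k: dict(v) for k, v in self._versions[-1].items()}
        for change in changes:
            row_key = change["row"][self.key]
            if change["op"] == "delete":
                current.pop(row_key, None)
            elif change["op"] == "upsert":
                # Update matching rows, insert rows that do not exist yet.
                current[row_key] = {**current.get(row_key, {}), **change["row"]}
            else:
                raise ValueError(f"unknown op: {change['op']}")
        self._versions.append(current)
        return len(self._versions) - 1  # number of the version just committed

    def read(self, version=None):
        """Read the latest snapshot, or an earlier one ("time travel")."""
        snapshot = self._versions[-1 if version is None else version]
        return [dict(snapshot[k]) for k in sorted(snapshot)]


table = VersionedTable([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
new_version = table.merge([
    {"op": "upsert", "row": {"id": 2, "name": "B"}},
    {"op": "upsert", "row": {"id": 3, "name": "c"}},
    {"op": "delete", "row": {"id": 1}},
])
```

Reading `table.read(version=0)` still returns the original two rows, while `table.read()` reflects the merged state; a real lakehouse table achieves this with a transaction log rather than full in-memory copies.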

Ingesting Streaming Data

This module introduces the fundamentals of processing real-time data streams using Apache Spark Structured Streaming. Learners will explore how to ingest data from sources like Apache Kafka, apply transformations and filters, configure checkpoints and triggers, and perform windowed aggregations for robust stream processing applications.
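As an informal illustration of a windowed aggregation, the plain-Python sketch below counts events per key in fixed-size (tumbling) windows. It is an assumption-laden toy, not Spark Structured Streaming code: the function name and the `(timestamp_seconds, key)` event shape are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window, identified by its start.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "a"), (5, "a"), (10, "a"), (12, "b"), (19, "a")]
window_counts = tumbling_window_counts(events, window_seconds=10)
```

A streaming engine computes the same grouping incrementally and must also decide how long to wait for late events, which this batch-style sketch leaves out.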

Processing Streaming Data

This module explores real-time data processing using Apache Spark Structured Streaming and Delta Lake. Learners will discover techniques for idempotent stream writing, merging change data capture events, joining streaming and static datasets, and monitoring streaming queries. Practical recipes and examples will help you build robust, scalable streaming data pipelines.
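One common way to think about idempotent stream writing is to record which micro-batches have already been committed, so a retried batch is not written twice. The plain-Python sketch below shows that idea only; `IdempotentSink` and `write_batch` are hypothetical names, and this is not how the course or Delta Lake necessarily implements it.

```python
class IdempotentSink:
    """Toy sink that ignores batches it has already committed (illustrative only)."""

    def __init__(self):
        self.rows = []
        self._committed_batches = set()

    def write_batch(self, batch_id, rows):
        """Append rows once per batch_id; return False if the batch is a replay."""
        if batch_id in self._committed_batches:
            return False  # replayed after a retry or restart, so skip it
        self.rows.extend(rows)
        self._committed_batches.add(batch_id)
        return True


sink = IdempotentSink()
first = sink.write_batch(0, ["r1", "r2"])
replay = sink.write_batch(0, ["r1", "r2"])  # same batch delivered again
second = sink.write_batch(1, ["r3"])
```

In a real pipeline the row write and the batch-id record have to be committed atomically; the sketch sidesteps that by living in a single process.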

Performance Tuning with Apache Spark

This module explores advanced techniques for optimizing Apache Spark applications, focusing on improving performance and resource efficiency. Learners will discover strategies such as minimizing data shuffling, handling data skew, leveraging broadcast variables, and optimizing partitioning and join operations. Practical guidance on caching and persistence will also be provided to help accelerate data processing workflows.
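Handling data skew is often explained through key salting: a hot key is split across several sub-keys, aggregated partially, and then recombined. The sketch below shows the idea in plain Python under stated assumptions; the function name, the `num_salts` parameter, and the fixed seed are all hypothetical and not drawn from the course material.

```python
import random
from collections import defaultdict


def salted_sum(records, num_salts, seed=0):
    """Sum values per key in two stages, spreading each key over num_salts sub-keys."""
    rng = random.Random(seed)

    # Stage 1: aggregate on (key, salt), so a hot key is split into several groups.
    partial = defaultdict(int)
    for key, value in records:
        partial[(key, rng.randrange(num_salts))] += value

    # Stage 2: drop the salt and combine the much smaller partial results.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals), len(partial)


records = [("hot", 1)] * 1000 + [("cold", 5)]
totals, partial_groups = salted_sum(records, num_salts=4)
```

The final totals match an unsalted aggregation; the benefit in a distributed engine is that the first stage can be spread across more tasks.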

Performance Tuning in Delta Lake

This module explores advanced techniques to enhance query performance in Delta Lake, including data partitioning, Z-ordering, data skipping, and compression strategies. Learners will gain practical skills to optimize storage and reduce I/O costs for large-scale data processing.
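Data skipping can be pictured as keeping minimum and maximum column values per file and reading only files whose range can contain matching rows. The following plain-Python sketch assumes a hypothetical `file_stats` mapping of file name to `(min, max)`; it illustrates the principle and is not the Delta Lake implementation.

```python
def files_to_scan(file_stats, lower, upper):
    """Return files whose [min, max] range overlaps the query range [lower, upper]."""
    return [
        name
        for name, (min_value, max_value) in file_stats.items()
        if max_value >= lower and min_value <= upper
    ]


file_stats = {"f1": (0, 9), "f2": (10, 19), "f3": (20, 29)}
selected = files_to_scan(file_stats, lower=12, upper=21)
```

The tighter each file's value range is, the more files a query can skip, which is the motivation for clustering related values together on disk.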

Orchestration and Scheduling Data Pipeline with Databricks Workflows

This module introduces learners to automating and managing data pipelines using Databricks Workflows. You will explore how to configure, monitor, and parameterize workflows, implement conditional branching, and trigger jobs based on external events such as file arrivals. By the end, you'll be equipped to orchestrate robust data processing tasks on the Databricks platform.

Building Data Pipelines with Delta Live Tables

This module guides learners through building robust data pipelines using Delta Live Tables on Databricks. You will explore techniques for ingesting and transforming streaming data, enforcing data quality, quarantining invalid records, monitoring pipeline health, deploying with asset bundles, and implementing change data capture (CDC). By the end, you'll be equipped to create scalable, reliable pipelines for real-time analytics.

Data Governance with Unity Catalog

This module introduces the core features of Databricks Unity Catalog for managing data governance in a lakehouse environment. Learners will explore catalog creation, fine-grained access controls, metadata management, data lineage, and system table querying to ensure secure and compliant data operations. Practical exercises demonstrate how to implement row filters, column masks, and leverage the Unity Catalog UI for effective data stewardship.

Implementing DataOps and DevOps on Databricks

This module explores practical strategies for implementing DataOps and DevOps workflows on the Databricks platform. Learners will discover how to automate tasks using the Databricks CLI, streamline development with the VSCode extension, manage infrastructure with Databricks Asset Bundles, and integrate CI/CD pipelines using GitHub Actions. By the end, participants will be equipped to enhance data and software development efficiency through automation and best practices.