- Learn how to design and implement data modeling strategies in Azure Databricks with Unity Catalog, including ingestion patterns, table formats, partitioning, slowly changing dimensions, and clustering strategies.
By the end of this module, you'll be able to:
- Design data ingestion logic and configure data source connections
- Select the appropriate data ingestion tool for your scenario
- Choose between Delta Lake, Apache Iceberg, and other table formats
- Design and implement effective data partitioning schemes
- Select and implement slowly changing dimension types
- Design and implement temporal tables for change tracking and auditing
- Choose appropriate data granularity for fact and dimension tables
- Design and implement clustering strategies for query optimization
- Evaluate when to use managed versus external tables
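Among the objectives above, Slowly Changing Dimension Type 2 is the one whose mechanics are easiest to get wrong. On Databricks you would typically implement it with a Delta Lake `MERGE INTO`; the sketch below illustrates the same close-and-insert semantics in plain Python so the logic is visible. All names here (`scd2_upsert`, the `valid_from`/`valid_to`/`is_current` columns) are illustrative, not part of any Databricks API.

```python
from datetime import date

def scd2_upsert(dim_rows, incoming, key, tracked, today):
    """SCD Type 2 semantics: when a tracked attribute changes,
    close the current row and append a new current row."""
    out = list(dim_rows)
    current = {r[key]: r for r in out if r["is_current"]}
    for new in incoming:
        cur = current.get(new[key])
        if cur is None:
            # Brand-new key: insert as the current version.
            out.append({**new, "valid_from": today, "valid_to": None, "is_current": True})
        elif any(cur[c] != new[c] for c in tracked):
            # Tracked attribute changed: expire the old version, add a new one.
            cur["valid_to"] = today
            cur["is_current"] = False
            out.append({**new, "valid_from": today, "valid_to": None, "is_current": True})
    return out

# Demo: a customer moves city, and a new customer arrives.
dim = [{"id": 1, "city": "Oslo", "valid_from": date(2024, 1, 1),
        "valid_to": None, "is_current": True}]
result = scd2_upsert(
    dim,
    [{"id": 1, "city": "Bergen"}, {"id": 2, "city": "Tromso"}],
    key="id", tracked=["city"], today=date(2025, 1, 1),
)
```

The same pattern maps onto a Delta `MERGE` with a `WHEN MATCHED AND <attribute changed>` clause that expires the old row, followed by an insert of the new version.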
- Learn how to ingest data from diverse sources into Unity Catalog tables in Azure Databricks using managed connectors, notebooks, SQL commands, streaming, and declarative pipelines.
By the end of this module, you'll be able to:
- Configure Lakeflow Connect to ingest data from external sources using managed connectors
- Ingest batch and streaming data using notebooks with DataFrames and Structured Streaming
- Use SQL commands like COPY INTO and CREATE TABLE AS SELECT for file-based ingestion
- Process change data capture feeds with the AUTO CDC API
- Configure Spark Structured Streaming for real-time data ingestion from Kafka and Event Hubs
- Set up Auto Loader to automatically detect and process new files with schema evolution
- Orchestrate data ingestion workflows using Lakeflow Spark Declarative Pipelines
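Several of the tools above (Auto Loader, Structured Streaming, `COPY INTO`) share one core idea: a checkpoint records which inputs have already been ingested, so reruns pick up only new files. The following plain-Python sketch mimics that bookkeeping; the function name `incremental_ingest` and the JSON checkpoint format are assumptions for illustration, not the actual Auto Loader checkpoint mechanism.

```python
import json
import os
import tempfile

def incremental_ingest(source_dir, checkpoint_file, process):
    """Process only files not recorded in the checkpoint, mimicking the
    incremental, pick-up-where-you-left-off behavior of Auto Loader."""
    seen = set()
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            seen = set(json.load(f))
    new_files = sorted(f for f in os.listdir(source_dir) if f not in seen)
    for name in new_files:
        process(os.path.join(source_dir, name))  # e.g. parse and append to a table
    # Persist the updated file list so the next run skips these files.
    with open(checkpoint_file, "w") as f:
        json.dump(sorted(seen | set(new_files)), f)
    return new_files

# Demo: two runs over the same landing directory.
landing = tempfile.mkdtemp()
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
for name in ("orders_1.json", "orders_2.json"):
    open(os.path.join(landing, name), "w").close()

first_run = incremental_ingest(landing, checkpoint, lambda path: None)
second_run = incremental_ingest(landing, checkpoint, lambda path: None)
```

In Auto Loader this state lives in the stream's checkpoint location rather than a hand-rolled JSON file, which is what makes `cloudFiles` ingestion idempotent across restarts.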
- Learn how to cleanse, transform, and load data into Unity Catalog tables in Azure Databricks by profiling data, handling duplicates and nulls, applying transformations, and using various loading strategies.
By the end of this module, you'll be able to:
- Profile data using SQL commands and data profiling features to assess data quality
- Choose appropriate column data types to optimize storage and ensure data integrity
- Identify and resolve duplicate, missing, and null values in datasets
- Apply filtering, grouping, and aggregation operations to transform data
- Combine datasets using joins and set operators like UNION, INTERSECT, and EXCEPT
- Reshape data using denormalization, pivot, and unpivot techniques
- Load transformed data into Unity Catalog tables using INSERT, MERGE, and overwrite operations
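Two of the cleansing steps above, deduplication and null handling, are usually expressed in SQL (`ROW_NUMBER()` window functions and `COALESCE`). The sketch below restates both in plain Python to make the row-level logic explicit; the helper names `deduplicate_latest` and `fill_nulls` are hypothetical, chosen for this example.

```python
def deduplicate_latest(rows, key, order_col):
    """Keep one row per business key -- the one with the largest order_col --
    analogous to ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) = 1."""
    best = {}
    for r in rows:
        k = r[key]
        if k not in best or r[order_col] > best[k][order_col]:
            best[k] = r
    return sorted(best.values(), key=lambda r: r[key])

def fill_nulls(rows, defaults):
    """Replace missing or None values with per-column defaults (cf. COALESCE)."""
    return [{**defaults, **{k: v for k, v in r.items() if v is not None}}
            for r in rows]

# Demo: duplicate key 1 (keep the later ts) and a null amount on key 2.
rows = [
    {"id": 1, "ts": 1, "amount": 10},
    {"id": 1, "ts": 2, "amount": 15},
    {"id": 2, "ts": 1, "amount": None},
]
deduped = deduplicate_latest(rows, key="id", order_col="ts")
cleaned = fill_nulls(deduped, {"amount": 0})
```

Once cleansed, rows like these would be loaded with `INSERT INTO`, `MERGE INTO`, or `INSERT OVERWRITE` depending on whether you are appending, upserting, or replacing.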
- Learn how to implement and manage data quality constraints in Azure Databricks using Unity Catalog, including validation checks, schema enforcement, and pipeline expectations.
By the end of this module, you'll be able to:
- Implement validation checks for nullability, cardinality, and range constraints
- Implement data type checks using schema enforcement and explicit casting
- Enforce schema and manage schema drift using Auto Loader and Delta Lake
- Manage data quality using pipeline expectations in Lakeflow Spark Declarative Pipelines
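Pipeline expectations in Lakeflow Spark Declarative Pipelines evaluate a boolean constraint per row and then retain, drop, or quarantine the failures. A minimal plain-Python sketch of that retain-or-quarantine split is below; the function `validate_rows` and the two constraints are assumptions for illustration, not the expectations API itself.

```python
def validate_rows(rows, rules):
    """Split rows into passing and quarantined sets, mirroring pipeline
    expectations that keep good records and set aside violations."""
    passed, quarantined = [], []
    for row in rows:
        failures = [name for name, check in rules.items() if not check(row)]
        if failures:
            quarantined.append({"row": row, "violations": failures})
        else:
            passed.append(row)
    return passed, quarantined

# Hypothetical constraints: a non-null key plus a range check,
# the kinds of nullability and range validations listed above.
rules = {
    "id_not_null": lambda r: r.get("id") is not None,
    "amount_in_range": lambda r: r.get("amount") is not None
                                 and 0 <= r["amount"] <= 10_000,
}
good, bad = validate_rows(
    [{"id": 1, "amount": 250},
     {"id": None, "amount": 250},
     {"id": 2, "amount": -5}],
    rules,
)
```

In a declarative pipeline the same checks would be written as `@dlt.expect_or_drop`-style expectations or Delta `CHECK` constraints, with the violation counts surfaced in pipeline metrics instead of a returned list.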
Syllabus
- Design and implement data modeling with Azure Databricks
- Introduction
- Design ingestion logic and data source configuration
- Choose a data ingestion tool
- Choose a data table format
- Design and implement a data partitioning scheme
- Choose a slowly changing dimension (SCD) type
- Implement a slowly changing dimension (SCD) type 2
- Design and implement a temporal (history) table to record changes over time
- Choose granularity on a column or table based on requirements
- Choose managed vs external tables
- Design and implement a clustering strategy
- Exercise - Design and Implement Data Modeling with Azure Databricks
- Knowledge check
- Summary
- Ingest data into Unity Catalog
- Introduction
- Ingest data with Lakeflow Connect
- Ingest data with notebooks
- Ingest data with SQL methods
- Ingest data with CDC feed
- Ingest data with Spark Structured Streaming
- Ingest data with Auto Loader
- Ingest data with Lakeflow Spark Declarative Pipelines
- Exercise - Ingest Data into Unity Catalog
- Module assessment
- Summary
- Cleanse, transform, and load data into Unity Catalog
- Introduction
- Profile data
- Choose column data types
- Resolve duplicates and nulls
- Transform data with filters and aggregations
- Transform data with joins and set operators
- Transform data with denormalization and pivots
- Load data with merge, insert, and append
- Exercise - Cleanse, Transform, and Load Data into Unity Catalog
- Module assessment
- Summary
- Implement and manage data quality constraints with Azure Databricks
- Introduction
- Implement validation checks
- Implement data type checks
- Detect and manage schema drift
- Manage data quality with pipeline expectations
- Exercise - Implement and Manage Data Quality Constraints with Azure Databricks
- Module assessment
- Summary