

Azure DataBricks - Data Engineering With Real Time Project

via Udemy

Overview

Real Time Project on Retail Data - PySpark, SQL, Delta/Delta Live Tables, Unity Catalogue, AutoLoader & Performance Tuning

What you'll learn:
  • Medallion Architecture, Dimensional Data Modelling Design, Delta Lakehouse Design, Spark Core Architecture, Unity Catalogue Setup, Spark Cluster Setup
  • PySpark Dataframe Reader, Writer, Transformation Functions, Action Functions, DateTime Functions, Aggregation Functions, Dataframe Joins, Complex Data
  • Spark SQL External Tables, Managed Tables, Delta Lake Tables, Create Table As Select (CTAS), Temp Views, Table Joins, Data Transformation Functions
  • Four Reusable Ingestion Pipelines to Ingest Source Data from Web (HTTP) Service, Database Tables, API Source Systems, Incremental Loading & Job Scheduling
  • Seven Data Transformation Pipelines to Process Source Data in Silver & Gold Layers and Build a Reporting Database and Datalake with Change Data Capture
  • Spark Streaming Reader & Writer Configuration to Process Real-Time Streaming Data, checkpointLocation Setup for Automated Incremental Loading of Streaming Data
  • Delta Live Tables - Materialised Views, Streaming Tables Setup, Delta Live Table Pipeline Configuration, Data Quality Checks, Auto Loader and APPLY CHANGES
  • Monitoring and Logging Setup to Monitor Production Job Runs, Set Up Alerts for Job Failure and Extended Logging of Job Runs and Service Metrics
  • Security Settings in Azure Using Microsoft Entra ID, IAM Role-Based Access Control (RBAC) and Databricks Workspace Admin Settings
  • Configure a GitHub Repository and Git Repos Folders in the Databricks Workspace, Ways of Working with Git Branches, Merging Code & Pull Requests
  • Set Up a Production Environment and a CI/CD Pipeline to Automate Code Deployment Using GitHub Actions
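
As a quick illustration of the PySpark reader, writer, and transformation APIs listed above, here is a minimal sketch. The paths, column names, and table names are illustrative assumptions, not taken from the course, and `spark` is the session Databricks preconfigures in every notebook:

```python
from pyspark.sql import functions as F

# Reader: load raw retail orders from the bronze layer
# (path and schema are assumptions for illustration).
orders = (spark.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/mnt/bronze/retail/orders/"))

# Transformations: DateTime and aggregation functions.
daily_revenue = (orders
    .withColumn("order_date", F.to_date("order_timestamp"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_revenue"),
         F.count("*").alias("order_count")))

# Writer: persist the result as a Delta table in the silver layer.
(daily_revenue.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("silver.daily_revenue"))
```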

By completing this course, you will be equipped with the following Data Engineer roles & responsibilities from the real-time project:

• Designing and Configuring Unity Catalogue for Better Access Control & Connecting to External Data Stores

• Designing and Developing Databricks (PySpark) Notebooks to Ingest the data from Web (HTTP) Services

• Designing and Developing Databricks (PySpark) Notebooks to Ingest the data from SQL Databases
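
A hedged sketch of what such a JDBC ingestion notebook might look like. The server, database, table, secret scope, and key names below are placeholder assumptions, not the course's actual configuration:

```python
# Placeholder connection details - replace with your own.
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=retail"

# Credentials are pulled from a secret scope rather than hard-coded.
customers = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.customers")
    .option("user", dbutils.secrets.get(scope="kv-scope", key="sql-user"))
    .option("password", dbutils.secrets.get(scope="kv-scope", key="sql-password"))
    .load())

# Land the extract as a Delta table in the bronze layer.
customers.write.format("delta").mode("append").saveAsTable("bronze.customers")
```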

• Designing and Developing Databricks (PySpark) Notebooks to Ingest the data from API Source Systems

• Designing and Developing Spark SQL External and Managed Tables
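
The difference between the two table types can be sketched with Spark SQL (the storage account, schema, and table names are assumptions for illustration):

```python
# Managed table: Databricks manages both the metadata and the data files.
spark.sql("""
  CREATE TABLE IF NOT EXISTS silver.products (
    product_id INT, product_name STRING, price DECIMAL(10,2)
  ) USING DELTA
""")

# External table: metadata lives in the metastore, but the data stays
# at a storage path you control; dropping the table keeps the files.
spark.sql("""
  CREATE TABLE IF NOT EXISTS bronze.products_ext
  USING DELTA
  LOCATION 'abfss://bronze@mystorageacct.dfs.core.windows.net/products/'
""")
```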

• Developing Databricks Spark SQL Reusable Notebooks to Create and Populate Delta Lake Tables

• Developing Databricks SQL Code to Populate Reporting Dimension Tables

• Developing Databricks SQL Code to Populate Reporting SCD Type 2 Dimension Tables
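
One common way to implement an SCD Type 2 load is a Delta `MERGE` that expires the current row, followed by an insert of the new version. The sketch below assumes illustrative table and column names (a customer dimension tracking address changes), not the course's actual schema:

```python
# Step 1: expire the current row when a tracked attribute changed.
spark.sql("""
  MERGE INTO gold.dim_customer AS tgt
  USING staging.customer_updates AS src
  ON tgt.customer_id = src.customer_id AND tgt.is_current = true
  WHEN MATCHED AND tgt.address <> src.address THEN
    UPDATE SET is_current = false, end_date = current_date()
""")

# Step 2: insert the new version (and brand-new customers) as current.
spark.sql("""
  INSERT INTO gold.dim_customer
  SELECT src.customer_id, src.address, current_date() AS start_date,
         NULL AS end_date, true AS is_current
  FROM staging.customer_updates src
  LEFT JOIN gold.dim_customer tgt
    ON tgt.customer_id = src.customer_id AND tgt.is_current = true
  WHERE tgt.customer_id IS NULL
""")
```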

• Developing Databricks SQL Code to Populate Reporting Fact Tables

• Designing and Developing Databricks (PySpark) Notebooks to Process and Flatten Semi-Structured JSON Data using the EXPLODE function
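
A minimal flattening sketch using `explode`: each order row is assumed to hold an array of line items (the path and field names are illustrative, not from the course):

```python
from pyspark.sql import functions as F

# Semi-structured source: one row per order, with a line_items array.
orders = spark.read.json("/mnt/silver/retail/orders_json/")

# explode() produces one output row per array element, which
# flattens the nested structure into a tabular layout.
order_lines = (orders
    .select("order_id", F.explode("line_items").alias("item"))
    .select("order_id",
            F.col("item.product_id").alias("product_id"),
            F.col("item.quantity").alias("quantity")))
```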

• Designing and Developing Databricks (PySpark) Notebooks to Integrate (JOIN) Data and Load into the Datalake Gold Layer

• Designing and Developing Databricks (PySpark) Notebooks to Process Semi-Structured JSON Data in the DataLake Silver Layer

• Designing and Developing Databricks (SQL) Notebooks to Integrate Data and Load into the Datalake Gold Layer

• Developing Databricks Jobs for Scheduling the Data Ingestion and Transformation Notebooks

• Designing and Configuring Delta Live Tables in all layers for seamless Data Integration
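
A minimal Delta Live Tables sketch in the spirit of this setup. Note that the `dlt` module resolves only inside a DLT pipeline run, and the paths and table names are assumptions:

```python
import dlt  # available only inside a Delta Live Tables pipeline
from pyspark.sql import functions as F

# Streaming bronze table fed by Auto Loader (landing path is an assumption).
@dlt.table(name="orders_bronze")
def orders_bronze():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/orders/"))

# Silver table with a data-quality expectation: rows failing
# the condition are dropped rather than propagated downstream.
@dlt.table(name="orders_silver")
@dlt.expect_or_drop("valid_order", "order_id IS NOT NULL")
def orders_silver():
    return dlt.read_stream("orders_bronze").withColumn(
        "ingested_at", F.current_timestamp())
```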

• Setting up Azure Monitor and Log Analytics for Automated Monitoring of Job Runs and Storing Extended Log Details

• Setting up Azure Key Vault and Configuring Key Vault-Backed Secret Scopes in the Databricks Workspace
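
Once a Key Vault-backed secret scope exists (it is created through the Databricks UI or CLI, not in the notebook), reading a secret is a one-liner. The scope, key, and storage account names here are assumptions:

```python
# Fetch a secret from the Key Vault-backed scope; the value is
# redacted if you try to print it in a notebook.
storage_key = dbutils.secrets.get(scope="kv-backed-scope",
                                  key="storage-account-key")

# Example use: authenticate Spark against an ADLS Gen2 account.
spark.conf.set(
    "fs.azure.account.key.mystorageacct.dfs.core.windows.net",
    storage_key)
```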

• Configuring a GitHub Repository and Creating Git Repo Folders in the Databricks Workspace

• Designing and Configuring CI/CD Pipelines to Release the Code into Multiple Environments

• Identifying Performance Bottlenecks and Performing Performance Tuning using ZORDER BY, BROADCAST JOIN, ADAPTIVE QUERY EXECUTION, DATA SALTING and LIQUID CLUSTERING
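
Of these techniques, data salting is the easiest to sketch outside Spark: a random suffix is appended to a skewed join key so rows with the same hot key spread across partitions, while the small side of the join is duplicated once per salt value so every salted key still finds a match. (In PySpark the same idea is typically applied with `F.concat` and `F.floor(F.rand() * n)`.) A minimal plain-Python illustration, with hypothetical helper names:

```python
import random

def salt_key(key: str, num_salts: int = 8) -> str:
    """Append a random salt suffix so rows sharing a hot key are
    spread across num_salts partitions instead of landing in one."""
    return f"{key}_{random.randrange(num_salts)}"

def explode_dim_key(key: str, num_salts: int = 8) -> list[str]:
    """Duplicate the small (dimension) side once per salt value so
    every salted fact-side key still joins successfully."""
    return [f"{key}_{i}" for i in range(num_salts)]
```

The fact side calls `salt_key` once per row; the dimension side is exploded with `explode_dim_key` before the join, trading a small data-size increase for even partition sizes.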


Syllabus

  • Introduction
  • Azure Portal Overview & Create Azure Resources
  • PySpark Introduction
  • SparkSQL Introduction
  • Unity Catalogue Configuration
  • Ingest Source Data From Web (HTTP) Service Into Bronze Layer Using PySpark
  • Ingest Source Data From Database Tables Using PySpark
  • Silver Layer Transformation - Parquet Files & Delta Table Config Using Spark SQL
  • Dimensional Data Modelling (Star Schema) - Reporting Database Design
  • Reporting Dimension (SCD Types 1 & 2) And Fact Tables Load Using Spark SQL
  • Spark Structured Streaming - Real Time Data Processing
  • Delta Live Tables Introduction
  • Datalake Bronze Layer Load - Ingest Geo-Location API Source Data
  • DataLake Silver Layer Transformations - Transform Geo Location API Source Data
  • Datalake Bronze Layer Load - Ingest Weather-Data API Source
  • DataLake Silver Layer Transformations - Transform Weather Data (ASSIGNMENT)
  • DataLake Gold Layer Load - Publish Price Prediction AI Source Data (ASSIGNMENT)
  • Monitoring And Logging - Azure Monitor, Log Analytics & Job Notifications
  • Security Settings - Azure IAM (RBAC) Access Control & Databricks Workspace Admin
  • Git Repository Integration For Databricks Workspace
  • CI/CD (Continuous Integration / Continuous Deployment) Pipeline
  • Performance Tuning

Taught by

Ragunathan Ramanujam

Reviews

4.7 rating at Udemy based on 671 ratings

