What you'll learn:
- How to use Databricks to build and run data engineering workflows
- The principles of the Lakehouse architecture with Delta Lake
- How to process data with Spark SQL and PySpark
- Best practices for Databricks compute, jobs, and orchestration
- How to apply governance with Unity Catalog and manage secure access
- How to build streaming pipelines with Structured Streaming and Lakeflow
- How to apply concepts to real-world projects and scenarios using modular code and version control
This course was completely refreshed in September 2025 with 17 hours of brand-new content.
I’m Malvik Vaghadia, a Data Engineer and Architect with nearly 15 years of professional experience. I’ve worked on multiple large-scale lakehouse implementations and consulted for enterprise clients. As an instructor, I’ve taught 200,000+ students worldwide and hold a 4.6+ instructor rating, and this course has become one of Udemy’s best-sellers in the Databricks category.
Why Learn Databricks
Databricks is recognised as a Leader in Gartner Magic Quadrant evaluations for data and AI platforms. It has become the go-to lakehouse platform for modern data engineering, enabling organisations to build, orchestrate, and optimise pipelines at scale. By mastering Databricks, you’ll be learning one of the most in-demand skills in today’s data landscape.
Course Delivery Style
This course is designed with the right balance of theory, hands-on coding, and practical projects. Every concept is explained clearly, then demonstrated live in Databricks, and reinforced with a multi-phase, end-to-end project that you’ll build step by step. You’ll also get all course notebooks as downloadable materials, containing the full code, step-by-step documentation, and extra resources so you can follow along easily.
Curriculum Highlights:
Four-Part Course Project: An end-to-end NYC Taxi project and further pipeline builds, developed in stages as your knowledge grows.
Foundations: What data engineering is, why Databricks, the Spark architecture, PySpark, and the Lakehouse.
Azure setup: Account creation, resources, role-based access control, naming conventions, and cost management.
Databricks setup: Creating and configuring a workspace, navigating the UI, and handling personal email restrictions.
Databricks notebooks and workspace: Markdown, comments, organising objects, mixing languages, and notebook tips.
Databricks compute: Clusters, DBU pricing, runtimes, serverless vs all-purpose compute, instance pools, and SQL warehouses.
Spark SQL (Python): Writing Spark SQL code using both SQL syntax and DataFrame APIs, reading/writing different file formats, defining schemas, and managing tables and views (example sketch below).
PySpark Transformations: Column operations, functions, filtering, sorting, joining, aggregations, pivots, and conditional logic.
Medallion architecture: Bronze, Silver, and Gold layers explained and implemented.
Delta Lake: Transaction log, schema enforcement and evolution, time travel, and DML operations such as MERGE, UPDATE, and DELETE (example sketch below).
Workflows and jobs: Passing parameters, handling failures, concurrency, conditional tasks, and monitoring.
Git & local development: VS Code setup, linking with GitHub, repos, and workflow best practices.
Functions and modularization: Creating and importing Python modules, UDFs, and project structuring (example sketch below).
Unity Catalog & governance: Metastores, securable objects, workspace roles, external locations, and permissions.
Streaming & Lakeflow pipelines: Structured Streaming concepts, Auto Loader, watermarking, triggers, and the new Lakeflow (DLT) pipeline model (example sketch below).
Performance: Lazy evaluation, explain plans, caching, shuffles, broadcast joins, partitioning, Z-ORDER, and Liquid Clustering (example sketch below).
Automation & CI/CD: Programmatic interaction with Databricks, CLI demo, and high-level CI/CD overview.
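To give you a flavour of the hands-on coding, here are a few short sketches of the kind of code covered above. These are illustrative only: all table names, column names, and paths are placeholder assumptions, not the course dataset. First, from the Spark SQL section, the same aggregation written once in SQL syntax and once with the DataFrame API (in Databricks notebooks, the `spark` session is provided for you):

```python
from pyspark.sql import functions as F

# SQL syntax: query a registered table directly
sql_result = spark.sql("""
    SELECT vendor_id, COUNT(*) AS trip_count
    FROM trips
    GROUP BY vendor_id
""")

# DataFrame API: the equivalent aggregation expressed in Python
df_result = (
    spark.table("trips")
         .groupBy("vendor_id")
         .agg(F.count("*").alias("trip_count"))
)
```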
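From the Delta Lake section, a minimal sketch of a MERGE upsert plus a time-travel read; the table name `silver.trips`, the join key, and the incoming `updates_df` DataFrame are assumptions for illustration:

```python
from delta.tables import DeltaTable

# updates_df is an assumed DataFrame of incoming records to upsert
target = DeltaTable.forName(spark, "silver.trips")

(target.alias("t")
       .merge(updates_df.alias("u"), "t.trip_id = u.trip_id")
       .whenMatchedUpdateAll()      # update rows whose trip_id already exists
       .whenNotMatchedInsertAll()   # insert brand-new rows
       .execute())

# Time travel: read the table as it was at an earlier version
previous = spark.read.option("versionAsOf", 0).table("silver.trips")
```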
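From the functions and modularization section, a small Python UDF applied to a DataFrame column; the function name, column names, and threshold are made up for the example:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def fare_band(fare: float) -> str:
    # Bucket a fare amount into a simple label
    if fare is None:
        return "unknown"
    return "high" if fare > 30 else "standard"

df = spark.table("silver.trips").withColumn("fare_band", fare_band("fare_amount"))
```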
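From the streaming section, an Auto Loader stream that incrementally ingests new JSON files into a bronze Delta table, with the paths, schema location, and table name all placeholders:

```python
# Auto Loader: incremental file ingestion via the cloudFiles source
stream = (
    spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/tmp/checkpoints/bronze_schema")
         .load("/tmp/landing/trips/")
)

(stream.writeStream
       .option("checkpointLocation", "/tmp/checkpoints/bronze_trips")
       .trigger(availableNow=True)   # process all available files, then stop
       .toTable("bronze.trips"))
```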
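And from the performance section, explicitly broadcasting a small lookup table to avoid shuffling the large side of a join, then inspecting the physical plan (table names again illustrative):

```python
from pyspark.sql import functions as F

facts = spark.table("silver.trips")   # large fact table
zones = spark.table("dim_zones")      # small dimension/lookup table

# Broadcasting the small side avoids shuffling the large table
joined = facts.join(F.broadcast(zones), on="zone_id", how="left")

joined.explain()  # look for BroadcastHashJoin in the physical plan
```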
By the end of the course, you’ll have both the knowledge and confidence to design, build, and optimise production-grade data pipelines on Databricks.