Overview
This five-course specialization takes you from lakehouse fundamentals to production-grade AI systems on the Databricks platform. You begin by building data pipelines with Apache Spark and Delta Lake, learning medallion architecture (bronze, silver, gold) and Unity Catalog governance. You then advance to Delta Live Tables for declarative ETL with built-in data quality expectations, streaming ingestion with Auto Loader, and Change Data Capture with APPLY CHANGES. The specialization progresses into machine learning engineering with MLflow tracking and the Databricks Model Registry, generative AI with LLM fine-tuning and RAG pipelines using Vector Search, and concludes with production governance — model serving, A/B testing, monitoring, and CI/CD for ML systems. Every course includes hands-on labs on the Databricks platform using real-world datasets and production patterns.
Syllabus
- Course 1: Databricks Lakehouse Fundamentals
- Course 2: Data Engineering with Delta Lake on Databricks
- Course 3: Machine Learning with Databricks and MLflow
- Course 4: Generative AI and LLMs on Databricks
- Course 5: Production Governance and MLOps on Databricks
Courses
- Data Engineering with Delta Lake on Databricks
Build production-ready data pipelines using Delta Live Tables and the Medallion Architecture on Databricks. This hands-on course teaches you to design, implement, and monitor ETL workflows that transform raw data into reliable, business-ready datasets through a structured bronze-silver-gold layering pattern. This course is primarily aimed at first- and second-year undergraduates interested in engineering or science, along with professionals with an interest in programming. You will start by mastering DLT fundamentals — declarative pipeline syntax in both SQL and Python, streaming ingestion with Auto Loader, and schema evolution strategies. Next, you will implement each Medallion Architecture layer: bronze for raw ingestion with lineage tracking, silver for data cleaning with expectations-based quality gates, and gold for business aggregations optimized with Z-ordering and partitioning. The course culminates in a capstone project where you build a complete inventory management system using Change Data Capture with `apply_changes()`, multi-source ingestion, and end-to-end pipeline orchestration. Every concept is reinforced through labs on Databricks Community Edition — no paid account required. Whether you are transitioning from batch ETL to streaming or building your first lakehouse pipeline, this course gives you the practical skills employers demand in modern data engineering roles.
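Outside Databricks, the core idea of an expectations-based quality gate can be sketched in plain Python. The field names and predicates below are hypothetical; in a real DLT pipeline these would be `@dlt.expect_or_drop` expectations attached to streaming tables.

```python
# Minimal pure-Python sketch of the medallion quality-gate idea.
# Field names and predicates are invented for illustration.

bronze = [  # raw ingested rows, warts and all
    {"sku": "A1", "qty": 5, "price": 9.99},
    {"sku": None, "qty": 3, "price": 4.50},   # fails "sku is not null"
    {"sku": "B2", "qty": -2, "price": 7.25},  # fails "qty >= 0"
    {"sku": "A1", "qty": 1, "price": 9.99},
]

# Silver: apply expectations, dropping rows that violate them.
expectations = [
    lambda r: r["sku"] is not None,
    lambda r: r["qty"] >= 0,
]
silver = [r for r in bronze if all(check(r) for check in expectations)]

# Gold: business-level aggregation (total quantity per SKU).
gold = {}
for r in silver:
    gold[r["sku"]] = gold.get(r["sku"], 0) + r["qty"]

print(gold)  # {'A1': 6}
```

In DLT proper, the same gates are declared, not hand-coded: the framework records how many rows each expectation dropped, which is what makes the silver layer auditable.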
- Databricks Lakehouse Fundamentals
Learn to build data pipelines on the Databricks Lakehouse Platform — from architecture concepts to hands-on Spark and Delta Lake. This beginner course starts with why the lakehouse pattern replaced separate data warehouses and data lakes, then moves directly into the Databricks workspace where you'll configure compute, write PySpark and SQL queries, and manage data with Unity Catalog's three-level namespace. Week by week, you'll progress from navigating the platform to transforming DataFrames with select, filter, groupBy, and joins, then to creating Delta Lake tables with ACID transactions, schema enforcement, and time travel. You'll perform real DML operations — INSERT, UPDATE, DELETE, and MERGE — and learn to schedule production pipelines using Databricks Jobs with DAG-based orchestration. The course runs entirely on Databricks Free Edition, so there's no cloud billing. Six hands-on labs reinforce each module: explore the workspace, write notebook-based transformations, build Delta tables, and wire up an automated workflow. By the end, you'll have built a complete data engineering pipeline from raw ingestion through Delta Lake to scheduled production jobs.
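The MERGE operation mentioned above is the workhorse of Delta Lake upserts. As a rough sketch of its semantics, without a Spark cluster and with invented table contents: match target rows to source rows on a key, update on match, insert when not matched.

```python
# Pure-Python sketch of Delta Lake MERGE (upsert) semantics.
# On Databricks this whole loop is a single statement:
#   MERGE INTO target USING source ON target.id = source.id
#   WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *

target = {  # existing Delta table, keyed by id
    1: {"id": 1, "name": "Ada", "city": "London"},
    2: {"id": 2, "name": "Grace", "city": "Arlington"},
}
source = [  # incoming batch of changes
    {"id": 2, "name": "Grace", "city": "New York"},  # update
    {"id": 3, "name": "Alan", "city": "Wilmslow"},   # insert
]

for row in source:
    if row["id"] in target:        # WHEN MATCHED THEN UPDATE
        target[row["id"]].update(row)
    else:                          # WHEN NOT MATCHED THEN INSERT
        target[row["id"]] = dict(row)

print(sorted(target))  # [1, 2, 3]
```

Unlike this toy loop, a real MERGE runs as one ACID transaction, so readers never observe a half-applied batch; that transactional guarantee is the point of doing DML on Delta tables.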
- Generative AI and LLMs on Databricks
Build production GenAI systems on Databricks by mastering prompt engineering, RAG pipelines, model governance, and code intelligence. You will apply chain-of-thought, ReAct, and few-shot prompting patterns to decompose complex tasks, then construct retrieval-augmented generation pipelines that fuse vector search with BM25 using Reciprocal Rank Fusion. The course progresses from foundational techniques through production deployment across four weeks. Week one covers tokenization mechanics, sampling parameters, system prompts, and the Databricks Playground. Week two builds RAG systems using embeddings, MLflow experiment tracking, Feature Store, and PMAT code intelligence with TDG scoring and PageRank on call graphs. Week three addresses the fine-tuning vs. RAG decision matrix, cryptographic model signing with SHA-256 chain-of-trust verification, AI Gateway configuration, model registry governance via Unity Catalog, and Databricks compute infrastructure. Week four integrates all concepts into a capstone project: a quality-aware code retrieval pipeline using trueno-rag and pmat. You will evaluate RAG quality using faithfulness-relevance diagnostic quadrants and six standard retrieval metrics: MRR, NDCG, recall, precision, hit rate, and MAP.
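Reciprocal Rank Fusion, the technique this course uses to combine vector-search and BM25 rankings, fits in a few lines. The document ids below are illustrative, and k=60 is the commonly used default constant; production code would operate on real retriever output.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF).
# score(d) = sum over ranked lists of 1 / (k + rank of d in that list)

def rrf(rankings, k=60):
    """Fuse several ranked lists into one, highest fused score first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # dense / vector search order
bm25_hits = ["doc_b", "doc_d", "doc_a"]     # sparse / keyword order

print(rrf([vector_hits, bm25_hits]))  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF only uses ranks, never raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.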
- Machine Learning with Databricks and MLflow
This course teaches you to build, track, and deploy machine learning models on the Databricks platform using MLflow. You start with the reproducibility crisis in ML — understanding why untracked experiments, scattered notebooks, and missing version control create production failures — and learn how MLflow solves these problems with structured experiment tracking, model versioning, and artifact management. You then explore MLflow's architecture in depth: the Tracking layer for logging parameters, metrics, and artifacts; the Model Registry for governance and stage gates; and the Projects layer for reproducible environments. The course covers Feature Store architecture for eliminating training/serving skew, where features are computed once and served two ways — batch for training and real-time for inference. You progress through the ML algorithm spectrum from manual implementations to AutoML, learning when to choose transparency over automation for regulated industries. The second module focuses on production deployment: the MLOps maturity staircase (L0 through L3), inference patterns for batch and real-time serving, and the infrastructure decisions that separate prototype ML from production ML. Hands-on labs on Databricks reinforce every concept.
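The experiment-tracking idea behind MLflow, stripped to plain Python: every run records its parameters and metrics, so model selection becomes a query instead of archaeology. The run data below is invented for illustration; with MLflow you would call `mlflow.log_param` and `mlflow.log_metric` inside `mlflow.start_run()` and query runs through the tracking API.

```python
# What a tracking server stores, in miniature: one record per run,
# with the parameters that produced it and the metrics it achieved.
# Run ids, parameters, and metric values here are hypothetical.

runs = [
    {"run_id": "r1", "params": {"max_depth": 3}, "metrics": {"rmse": 4.2}},
    {"run_id": "r2", "params": {"max_depth": 6}, "metrics": {"rmse": 3.1}},
    {"run_id": "r3", "params": {"max_depth": 9}, "metrics": {"rmse": 3.5}},
]

# Model selection: pick the run with the lowest validation error.
best = min(runs, key=lambda r: r["metrics"]["rmse"])
print(best["run_id"], best["params"])  # the run you would promote to the registry
```

Without this record, the "reproducibility crisis" the course opens with is exactly what happens: the winning notebook exists, but nobody can say which parameters produced it.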
- Production Governance and MLOps on Databricks
This intermediate course provides a practical, hands-on exploration of Databricks Governance, focusing on the essential tools and workflows for managing and securing your data lakehouse. You will learn to navigate and control access to your data assets using Unity Catalog, the foundation of Databricks governance. The course covers the core hierarchy of metastores, catalogs, schemas, and tables, and teaches you how to manage them programmatically using the Databricks Python SDK, CLI, and VS Code extension. Beyond foundational access control, you will master the skills to implement modern CI/CD and MLOps practices directly within the Databricks environment. You'll learn to integrate Databricks Repos with GitHub, automate notebook testing and deployment with GitHub Actions, and understand the architectural considerations for managing machine learning models in production. Finally, you will explore how to ensure ongoing data reliability by setting up and understanding Lakehouse Monitoring for data quality and freshness. This course is unique because it moves beyond theory, demonstrating how to apply these governance concepts with the actual tools and code used by data professionals. By the end, you'll be equipped to build, deploy, and monitor secure and reliable data pipelines and AI applications on the Databricks platform.
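Unity Catalog addresses every table through the three-level name the course keeps returning to: `catalog.schema.table`. A small helper makes the hierarchy concrete; the example names are hypothetical, and in real code the Databricks Python SDK works with such full names directly (for instance via `WorkspaceClient`).

```python
# Split and validate a Unity Catalog three-level table name.
# Example names are invented; any real catalog/schema/table would do.

def parse_table_name(full_name: str) -> dict:
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    catalog, schema, table = parts
    return {"catalog": catalog, "schema": schema, "table": table}

print(parse_table_name("prod.sales.orders"))
# {'catalog': 'prod', 'schema': 'sales', 'table': 'orders'}
```

Grants follow the same hierarchy: a privilege on a catalog cascades to its schemas and tables, which is why getting the namespace model right is the first step of Unity Catalog governance.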
Taught by
Alfredo Deza and Noah Gift