
The Data Engineering Bootcamp: Zero to Mastery

via Zero To Mastery

Overview

Learn Data Engineering end-to-end. Build real-time pipelines with Apache Kafka & Flink, data lakes on AWS, machine learning workflows with Spark, and integrate LLMs into production-ready systems. Designed to launch your career as a future-ready Data Engineer.
  • Learn the skills and real-world tools used by Data Engineers and join the top 10% in your field
  • Build stream-processing pipelines with Apache Kafka and Apache Flink
  • Create scalable, cloud-based data lakes on AWS using S3, EMR, and Athena
  • Develop distributed processing jobs with Apache Spark and orchestrate workflows with Apache Airflow
  • Future-proof your skills by learning to integrate AI & machine learning, including Spark ML and LLMs
  • Build real-world, production-ready projects and pipelines using popular open source software

Syllabus

  •   Introduction
    • The Data Engineering Bootcamp: Zero to Mastery
    • Exercise: Meet Your Classmates and Instructor
    • Course Resources
    • Understanding Your Video Player
    • Set Your Learning Streak Goal
  •   Section 00 - Introduction to Data Engineering
    • Introduction
    • Storing Data
    • Processing Data
    • Data Sources
    • Orchestration
    • Stream Processing
    • AI and ML with Data Engineering
    • Serving Data
    • Cloud and Data Engineering
    • Source Code for This Bootcamp
    • Prerequisites
    • What’s Next?
    • Installing Software for the Course
    • [Optional] Using Windows
  •   Section 01 - Data Engineering Fundamentals: Python, SQL + more
    • Introduction
    • Quick Note On This Section
    • Jupyter Notebooks
    • Python - Lists
    • Python - Tuples
    • Python - Dictionaries
    • Python - Sets
    • Python - Range
    • Python - Comprehensions
    • Python - String Formatting
    • Python - Functions
    • Python - Decorators
    • Python - Exceptions
    • Python - Classes - Part 1
    • Python - Classes - Part 2
    • Python - Iterators
    • CLI - Basic Commands
    • CLI - Combining Commands
    • CLI - Environment Variables
    • Virtual Environments - What Is a Virtualenv?
    • SQL - Introduction
    • SQL - Environment Set Up
    • SQL - Fetching Data
    • SQL - Grouping Rows
    • SQL - Joining Data
    • SQL - Creating Data
  •   Section 02 - Big Data Processing with Apache Spark: Process & Analyze Real-World Airbnb Data
    • Introduction
    • Apache Spark
    • How Spark Works
    • Spark Application
    • DataFrames
    • Installing Spark
    • Installing Spark on Linux
    • Inside Airbnb Data
    • Writing Your First Spark Job
    • Lazy Processing
    • [Note] Minor correction
    • [Exercise] Basic Functions
    • [Exercise] Basic Functions - Solution
    • Aggregating Data
    • Joining Data
    • Aggregations and Joins with Spark
    • Complex Data Types
    • [Exercise] Aggregate Functions
    • [Exercise] Aggregate Functions - Solution
    • User Defined Functions
    • Data Shuffle
    • Data Accumulators
    • Optimizing Spark Jobs
    • Submitting Spark Jobs
    • Other Spark APIs
    • Spark SQL
    • [Exercise] Advanced Spark
    • [Exercise] Advanced Spark - Solution
    • Summary
    • Let's Have Some Fun (+ More Resources)
  •   Section 03 - Creating a Data Lake with AWS
    • Introduction
    • What Is a Data Lake?
    • Amazon Web Services (AWS)
    • Simple Storage Service (S3)
    • Setting Up an AWS Account
    • Data Partitioning
    • Using S3
    • EMR Serverless
    • IAM Roles
    • Running a Spark Job
    • Parquet Data Format
    • Implementing a Data Catalog
    • Data Catalog Demo
    • Querying a Data Lake
    • Summary
    • Course Check-In
  •   Section 04 - Implementing Data Pipelines with Apache Airflow
    • Introduction
    • What Is Apache Airflow?
    • Airflow’s Architecture
    • Installing Airflow
    • Defining an Airflow DAG
    • Error Handling
    • Idempotent Tasks
    • Creating a DAG - Part 1
    • Creating a DAG - Part 2
    • Handling Failed Tasks
    • [Exercise] Data Validation
    • [Exercise] Data Validation - Solution
    • Spark with Airflow
    • Using Spark with Airflow - Part 1
    • Using Spark with Airflow - Part 2
    • Sensors In Airflow
    • Using File Sensors
    • Data Ingestion
    • Reading Data From Postgres - Part 1
    • Reading Data from Postgres - Part 2
    • [Exercise] Average Customer Review
    • [Exercise] Average Customer Review - Solution
    • Advanced DAGs
    • Summary
    • Unlimited Updates
  •   Section 05 - Machine Learning with Spark ML: Create a Data Pipeline, Train a Model + more
    • Introduction
    • What Is Machine Learning?
    • Regression Algorithms
    • Building a Regression Model
    • Training a Model
    • Model Evaluation
    • Testing a Regression Model
    • Model Lifecycle
    • Feature Engineering
    • Improving a Regression Model
    • Machine Learning Pipelines
    • Creating a Pipeline
    • [Exercise] House Price Estimation
    • [Exercise] House Price Estimation - Solution
    • [Exercise] Imposter Syndrome
    • Classification
    • Classifiers Evaluation
    • Training a Classifier
    • Hyperparameters
    • Optimizing a Model
    • [Exercise] Loan Approval
    • [Exercise] Loan Approval - Solution
    • Deep Learning
    • Summary
    • Implement a New Life System
  •   Section 06 - Using AI with Data Engineering: LLMs, HuggingFace + more
    • Introduction
    • Natural Language Processing (NLP) before LLMs
    • Transformers
    • Types of LLMs
    • Hugging Face
    • Databricks Set Up
    • Using an LLM
    • Structured Output
    • Producing JSON Output
    • LLMs With Apache Spark
    • Summary
  •   Section 07 - Real-Time Data Processing ("Stream Processing") with Apache Kafka
    • Introduction
    • What Is Apache Kafka?
    • Partitioning Data
    • Kafka API
    • Kafka Architecture
    • Set Up Kafka
    • Writing to Kafka
    • Reading from Kafka
    • Data Durability
    • Kafka vs Queues
    • [Exercise] Processing Records
    • [Exercise] Processing Records - Solution
    • Delivery Semantics
    • Kafka Transactions
    • Log Compaction
    • Kafka Connect
    • Using Kafka Connect
    • Outbox Pattern
    • Schema Registry
    • Using Schema Registry
    • Tiered Storage
    • [Exercise] Track Order Status Changes
    • [Exercise] Track Order Status Changes - Solution
    • Summary
  •   Section 08 - Stream Processing with Apache Flink
    • Introduction
    • What Is Apache Flink?
    • Flink Applications
    • Multiple Streams
    • Installing Apache Flink
    • Processing Individual Records
    • [Exercise] Stream Processing
    • [Exercise] Stream Processing - Solution
    • Time Windows
    • Keyed Windows
    • Using Time Windows
    • Watermarks
    • Advanced Window Operations
    • Stateful Stream Processing
    • Using Local State
    • [Exercise] Anomaly Detection
    • [Exercise] Anomaly Detection - Solution
    • Joining Streams
    • Summary
  •   Where To Go From Here?
    • Thank You!
    • Review This Course!
    • Become An Alumni
    • Learning Guideline
    • ZTM Events Every Month
    • LinkedIn Endorsements

Taught by

Ivan Mushketyk

