Overview
In this course, you’ll learn how to use Spark to work with big data and build machine learning models at scale, including how to wrangle and model massive datasets with PySpark, the Python library for interacting with Spark. In the first lesson, you will learn about big data and how Spark fits into the big data ecosystem. In lesson two, you will practice processing and cleaning datasets to get comfortable with Spark’s SQL and DataFrame APIs. In the third lesson, you will debug and optimize your Spark code when running on a cluster. In lesson four, you will use Spark’s Machine Learning Library to train machine learning models at scale.
Syllabus
- Introduction to the Course
- In this lesson, you will learn what this course covers and who you will be learning from - let's get started!
- The Power of Spark
- In this lesson, you will learn about the problems that Apache Spark is designed to solve. You'll also learn about the greater Big Data ecosystem and how Spark fits into it.
- Data Wrangling with Spark
- In this lesson, we'll dive into how to use Spark for cleaning and aggregating data.
- Setting up Spark Clusters with AWS
- In this lesson, you will learn to run Spark on a distributed cluster in AWS, using both the AWS console and the AWS CLI.
- Debugging and Optimization
- In this lesson, you will learn best practices for debugging and optimizing your Spark applications.
- Machine Learning with Spark
- In this lesson, we'll explore Spark's machine learning capabilities and build ML models and pipelines.
Taught by
David Drummond and Judit Lantos