Advance your data skills by mastering Apache Spark. Using PySpark, the Spark Python API, you will leverage parallel computation on large datasets and get ready for high-performance machine learning. From cleaning data to creating features and implementing machine learning models, you'll execute end-to-end workflows with Spark. The track ends with building a recommendation engine using the popular MovieLens dataset and the Million Song Dataset.
Overview
Syllabus
- Introduction to PySpark
- Master PySpark to handle big data with ease: learn to process, query, and optimize massive datasets for powerful analytics!
- Big Data Fundamentals with PySpark
- Learn the fundamentals of working with big data in PySpark.
- Cleaning Data with PySpark
- Learn how to clean data with Apache Spark in Python.
- Feature Engineering with PySpark
- Learn the gritty details that data scientists spend 70-80% of their time on: data wrangling and feature engineering.
- Machine Learning with PySpark
- Learn how to make predictions from data with Apache Spark, using decision trees, logistic regression, linear regression, ensembles, and pipelines.
- Building Recommendation Engines with PySpark
- Learn tools and techniques to leverage your own big data to facilitate positive experiences for your users.
- Building a Demand Forecasting Model
Taught by
Nick Solomon, Lore Dirick, John Hogue, Shantanu Trivedi, Upendra Kumar Devisetty, Andrew Collier, and Mike Metzger