
Big Data Processing with Hadoop and Spark

University of Pittsburgh via Coursera

Overview

Master the tools and techniques that power large-scale data processing and analytics. This course introduces the principles and frameworks of Big Data Processing with Hadoop and Spark, enabling learners to manage, process, and analyze massive datasets efficiently. You’ll start by understanding the Hadoop ecosystem, including HDFS and MapReduce, and how distributed storage and computation work together to handle data at scale.

Then, you’ll explore Apache Spark, a powerful framework for fast, in-memory data processing and real-time analytics. Through guided exercises and case studies, you’ll learn how to build scalable data pipelines, optimize performance, and apply transformations for business insights. By the end of this course, you’ll be equipped to handle complex data workloads using industry-standard big data tools.

Ideal for aspiring data engineers, analysts, and developers, this course bridges data management and cloud computing, preparing you to design, implement, and manage big data solutions that drive intelligent decision-making in modern organizations.

Syllabus

  • Hadoop
    • This module guides you through the core components of the Hadoop ecosystem, starting with its architecture and distributed file system. You’ll explore how Hadoop processes data, gain insight into its broader ecosystem, and apply your knowledge in hands-on activities using both Docker and a Linux virtual machine.
  • Programming Models
    • This module introduces you to key programming models for distributed data processing, with a focus on MapReduce and its practical applications. You'll explore core concepts and terminology, work through guided Python code walkthroughs implementing word count and server log analysis tasks, and gain hands-on experience writing data transformation scripts in Apache Pig, culminating in an assignment that applies these skills to web log analysis.
  • Apache Spark
    • This module introduces you to Apache Spark, covering its core concepts, architecture, and machine learning capabilities through MLlib. You’ll learn how to set up Spark using Docker and a Linux virtual machine, explore how PySpark operates within the Spark framework, and compare Spark MLlib with scikit-learn through hands-on code walkthroughs. By the end of the module, you'll apply what you've learned in graded activities and an assignment focused on building a predictive model with PySpark and MLlib.
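To give a flavor of the MapReduce model covered in the Programming Models module, here is a minimal pure-Python sketch of the classic word-count pattern: a map phase emits (word, 1) pairs, a shuffle/sort phase groups pairs by key, and a reduce phase sums each group. This runs in a single process for illustration only; a real Hadoop job would execute the mapper and reducer as separate tasks distributed across the cluster.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce step: sum the counts collected for a single word."""
    return (word, sum(counts))

def word_count(lines):
    """Run map, shuffle/sort, and reduce over an iterable of text lines."""
    # Map phase: apply the mapper to every input line
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort phase: order pairs by key so equal words are adjacent,
    # mimicking what Hadoop does between the map and reduce stages
    pairs.sort(key=itemgetter(0))
    # Reduce phase: collapse each group of pairs into a single count
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

print(word_count(["big data big ideas", "data pipelines"]))
# → {'big': 2, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

The same mapper and reducer functions could be adapted to Hadoop Streaming, where each phase reads from stdin and writes tab-separated key-value pairs to stdout.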
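Spark's central idea, which the Apache Spark module builds on, is that transformations such as map and filter are lazy: they only describe a pipeline, and no work happens until an action like collect forces evaluation. The toy class below (TinyRDD is a hypothetical name, not part of PySpark's API) sketches that behavior in plain Python using generators.

```python
class TinyRDD:
    """A toy stand-in for a Spark RDD: transformations are lazy,
    and only an action triggers computation."""

    def __init__(self, data):
        self._data = data  # any iterable; nothing is evaluated yet

    def map(self, fn):
        # Transformation: returns a new TinyRDD backed by a lazy generator
        return TinyRDD(fn(x) for x in self._data)

    def filter(self, pred):
        # Transformation: also lazy, chains onto the existing pipeline
        return TinyRDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces the whole pipeline to run and returns the results
        return list(self._data)

result = TinyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # → [0, 4, 16, 36, 64]
```

In real PySpark the same pipeline would be written against a SparkContext (e.g. `sc.parallelize(range(10)).map(...).filter(...).collect()`), with the added benefits of partitioning, fault tolerance, and cluster-wide in-memory caching that this single-process sketch omits.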

Taught by

Dmitriy Babichenko

