
Building Robust Streaming Data Pipelines with Apache Spark

Linux Foundation via YouTube

Overview

Explore the challenges and solutions for building robust streaming data pipelines with Apache Spark in this 42-minute conference talk by Zak Hassan from Red Hat. Learn how to integrate Apache Kafka, Apache Spark, and Apache Camel to create a continuous data pipeline for Spark applications, addressing issues like dirty data in ETL processes. Discover techniques for extracting, transforming, and loading data from various systems into Apache Kafka, and leverage Spark's built-in Kafka connector. Gain insights into running these technologies inside Docker and benefit from lessons learned in real-world implementations. The talk covers data preparation, various data types and formats, and includes demonstrations comparing Hive and Spark, as well as practical examples using HDFS and Python code.

Syllabus

Introduction
Data Preparation
Data Types
Camel
Data Formats
Demo
Hive vs Spark
Demo Time
Demo Starts
Logs
HDFS
Python
Code
Recap
Office Hours

Taught by

Linux Foundation
