Real Time Spark Project for Beginners: Hadoop, Spark, Docker

via Udemy

Overview

Building Real Time Data Pipeline Using Apache Kafka, Apache Spark, Hadoop, PostgreSQL, Django and Flexmonster on Docker

What you'll learn:
  • Complete Development of Real Time Streaming Data Pipeline using Hadoop and Spark Cluster on Docker
  • Setting up Single Node Hadoop and Spark Cluster on Docker
  • Features of Spark Structured Streaming using Spark with Scala
  • Features of Spark Structured Streaming using Spark with Python (PySpark)
  • How to use PostgreSQL with Spark Structured Streaming
  • Basic understanding of Apache Kafka
  • How to build Data Visualisation using Django Web Framework and Flexmonster
  • Fundamentals of Docker and Containerization
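
As a rough illustration of how PostgreSQL can be used with Spark Structured Streaming (one of the items in the list above), the following PySpark sketch reads events from a Kafka topic and appends per-batch counts to a PostgreSQL table. It is not the course's code. The broker address, topic name, JDBC URL, table name, and event fields are all hypothetical placeholders, and the sketch assumes the Spark Kafka connector and the PostgreSQL JDBC driver are available to Spark. Structured Streaming has no built-in JDBC sink, so the sketch uses `foreachBatch` to write each micro-batch with the ordinary batch JDBC writer.

```python
from typing import Dict

# Hypothetical connection settings; none of these values come from the course.
KAFKA_BOOTSTRAP = "localhost:9092"
KAFKA_TOPIC = "server-status"
JDBC_URL = "jdbc:postgresql://localhost:5432/monitoring"
JDBC_TABLE = "server_status_counts"


def jdbc_properties(user: str, password: str) -> Dict[str, str]:
    """Build the connection properties passed to DataFrameWriter.jdbc."""
    return {"user": user, "password": password, "driver": "org.postgresql.Driver"}


def run_pipeline(user: str, password: str) -> None:
    """Start a streaming query: Kafka -> Spark Structured Streaming -> PostgreSQL."""
    # Imported here so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("server-status-sketch").getOrCreate()

    # Assumed shape of one JSON event; the real event format is not given in the listing.
    schema = StructType([
        StructField("server_id", StringType()),
        StructField("status", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Kafka delivers the payload as bytes in the "value" column; parse it as JSON.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP)
        .option("subscribe", KAFKA_TOPIC)
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    def write_batch(batch_df, batch_id):
        # Each micro-batch is a regular DataFrame, so the batch JDBC writer applies.
        counts = batch_df.groupBy("server_id", "status").count()
        counts.write.jdbc(
            JDBC_URL, JDBC_TABLE, mode="append",
            properties=jdbc_properties(user, password),
        )

    query = events.writeStream.foreachBatch(write_batch).start()
    query.awaitTermination()
```

Calling `run_pipeline("spark_user", "secret")` (placeholder credentials) would start the query and block until it is stopped. Because each micro-batch is appended independently, the target table accumulates one set of count rows per batch; a dashboard query would need to sum them.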

  • In many data centers, different types of servers generate large amounts of data in real time. Each event in this case is the status of a server in the data center.

  • This data needs to be processed in real time to generate insights for the people who monitor the servers and the data center. They have to track server status regularly and resolve issues as they occur to keep the servers stable.

  • Since the data is huge and arrives in real time, we need to choose the right architecture, with scalable storage and computation frameworks/technologies.

  • Hence we want to build a real-time data pipeline using Apache Kafka, Apache Spark, Hadoop, PostgreSQL, Django and Flexmonster on Docker to generate insights from this data.

  • The Spark project/data pipeline is built using Apache Spark with Scala and PySpark on an Apache Hadoop cluster running on Docker.

  • Data Visualization is built using Django Web Framework and Flexmonster.

  • Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.

  • Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system, written in Java and Scala, developed by the Apache Software Foundation. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

  • Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

  • A NoSQL (originally referring to "non-SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
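
The MapReduce programming model mentioned in the Hadoop description above can be sketched in plain Python. This is only a single-process illustration of the map, shuffle, and reduce phases, not Hadoop code and not taken from the course. The sample events and their field names are hypothetical, chosen to resemble the server status events the course describes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]  # (server_id, status)

# Hypothetical server status events; the real event format is not given in the listing.
EVENTS = [
    {"server_id": "srv-1", "status": "OK"},
    {"server_id": "srv-2", "status": "ERROR"},
    {"server_id": "srv-1", "status": "OK"},
    {"server_id": "srv-2", "status": "OK"},
    {"server_id": "srv-1", "status": "ERROR"},
]


def map_phase(event: Dict[str, str]) -> Iterator[Tuple[Key, int]]:
    """Map: emit one (key, 1) pair per event."""
    yield (event["server_id"], event["status"]), 1


def shuffle(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[Key, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: Key, values: List[int]) -> Tuple[Key, int]:
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def status_counts(events: Iterable[Dict[str, str]]) -> Dict[Key, int]:
    """Run map, shuffle, and reduce in sequence over the events."""
    pairs = (pair for event in events for pair in map_phase(event))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `status_counts(EVENTS)` returns a dictionary keyed by `(server_id, status)`. In a real Hadoop cluster the map and reduce functions would run in parallel on different machines, with the framework handling the shuffle between them.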

Syllabus

  • Introduction
  • Environment Setup
  • Development | Project Code Walk-through
  • Complete Project Demo
  • Docker Beginners Guide

Taught by

PARI MARGU

Reviews

3.9 rating at Udemy based on 96 ratings
