Apache Spark for Java Developers

via Udemy

Overview

Start processing Big Data using RDDs, DataFrames, SparkSQL and Machine Learning, plus real-time streaming with Kafka!

What you'll learn:
  • Use functional-style Java to define complex data processing jobs
  • Learn the differences between the RDD and DataFrame APIs
  • Use SQL-style syntax to produce reports against Big Data sets
  • Use Machine Learning Algorithms with Big Data and SparkML
  • Connect Spark to Apache Kafka to process Streams of Big Data
  • See how Structured Streaming can be used to build pipelines with Kafka

Get started with the amazing Apache Spark parallel computing framework - this course is designed especially for Java Developers.

If you're new to Data Science and want to find out how massive datasets are processed in parallel, then the Java API for Spark is a great way to get started, fast.

All of the fundamentals you need to understand the main operations you can perform in Spark Core, SparkSQL and DataFrames are covered in detail, with easy-to-follow examples. You'll be able to follow along with all of the examples, and run them on your own local development computer.

Included with the course is a module covering SparkML, an exciting addition to Spark that allows you to apply Machine Learning models to your Big Data! No mathematical experience is necessary!

And finally, there's a full 3-hour module covering Spark Streaming, where you will get hands-on experience of integrating Spark with Apache Kafka to handle real-time big data streams. We use both the DStream and the Structured Streaming APIs.


Optionally, if you have an AWS account, you'll see how to deploy your work to a live EMR (Elastic MapReduce) hardware cluster. If you're not familiar with AWS, you can skip this video, but it's still worthwhile to watch it without following along with the coding.

You'll go deep into the internals of Spark and find out how it optimizes your execution plans. We'll compare the performance of RDDs and SparkSQL, and you'll learn about the major performance pitfalls; avoiding them could save a lot of money on live projects.

Throughout the course, you'll be getting some great practice with Java Lambdas - a great way to learn functional-style Java if you're new to it.
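
To give a flavour of the functional style mentioned above, here is a minimal sketch of a word count written with Java lambdas. It is not taken from the course and does not use Spark: it relies only on the standard java.util.stream API, whose flatMap and filter steps are merely similar in spirit to operations of the same name listed in the syllabus. The class name LambdaWordCount, the method countWords and the sample sentences are hypothetical, chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration: plain Java streams, not the Spark API.
class LambdaWordCount {

    // Splits each line into lower-cased words, keeps words longer than
    // three characters, and counts how often each remaining word occurs.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // one line becomes many words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                // drop short words such as "big" or "in"
                .filter(word -> word.length() > 3)
                // group identical words and count them, sorted by word
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Spark processes big data in parallel",
                "Java lambdas keep Spark jobs concise");
        // Print each word with its count
        countWords(lines).forEach((word, count) -> System.out.println(word + " = " + count));
    }
}
```

Each step is a small lambda passed to a library method, so the job reads as a pipeline of transformations rather than as nested loops.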



Syllabus

  • Introduction
  • Getting Started
  • Reduces on RDDs
  • Mapping and Outputting
  • Tuples
  • PairRDDs
  • FlatMaps and Filters
  • Reading from Disk
  • Keyword Ranking Practical
  • Sorts and Coalesce
  • Deploying to AWS EMR (Optional)
  • Joins
  • Big Data Big Exercise
  • RDD Performance
  • Module 2 - Chapter 1 SparkSQL Introduction
  • SparkSQL Getting Started
  • Datasets
  • The Full SQL Syntax
  • In Memory Data
  • Groupings and Aggregations
  • Date Formatting
  • Multiple Groupings
  • Ordering
  • DataFrames API
  • Pivot Tables
  • More Aggregations
  • Practical Exercise
  • User Defined Functions
  • SparkSQL Performance
  • HashAggregation
  • SparkSQL Performance vs RDDs
  • Module 3 - SparkML for Machine Learning
  • Linear Regression Models
  • Training Data
  • Model Fitting Parameters
  • Feature Selection
  • Non-Numeric Data
  • Pipelines
  • Case Study
  • Logistic Regression
  • Decision Trees
  • K Means Clustering
  • Recommender Systems
  • Module 4 - Spark Streaming and Structured Streaming with Kafka
  • Streaming Chapter 2 - Streaming with Apache Kafka
  • Streaming Chapter 3 - Structured Streaming

Taught by

Richard Chesterwood, Matt Greencroft and Virtual Pair Programmers

Reviews

4.6 rating at Udemy based on 3749 ratings
