Apache Spark: Why and How | Online

Data

Online Campus

Past Locations for this Class

Seattle

About this class

Spark is emerging as the platform of choice for processing large data sets. We will look at the Spark architecture and how it addresses big-data problems, and we will work through several hands-on examples that use Spark for analysis and machine-learning tasks.

In this class you will learn the basics of the Spark and MapReduce frameworks and the primary interfaces for working with Spark. Through an extended example, we will see how Spark executes common data processing tasks, where performance bottlenecks can arise, and how to avoid them. We will cover both the fundamental data structure (the RDD) and the more modern abstraction (the DataFrame). Finally, the example will finish with a simple predictive model.

Takeaways

Understand the Spark architecture and system, including why performance problems occur and what to do about them
Be able to read data into a Spark engine and do simple data preparation and analysis tasks directly in Spark using both RDDs and DataFrames
Run a basic machine-learning model in Spark
Connect Spark and R to make analysis tasks at scale more convenient

Prereqs & Preparation

You should be comfortable with basic data manipulation concepts like joins, groups, and filters
We will be doing examples in Python and Pandas, so familiarity with them will be useful but is not required
I will make the code and data sets for the example available, so if you want to work along on your laptop, bring it with Spark already installed.
