Apache Spark: Why and How

Washington, D.C. campuses

GA D.C., 1776
1133 15th Street NW, 8th Floor
Washington D.C. 20005

About this class

Spark is emerging as the platform of choice for processing large data sets. We will look at the Spark architecture and how it addresses big-data problems, and we will work through several hands-on examples that use Spark for analysis and machine-learning tasks.

In this class you will learn the basics of the Spark and MapReduce frameworks and the primary interfaces for working with Spark. Through an extended example, we will see how Spark executes common data processing tasks, where performance bottlenecks can arise, and how to avoid them. We will cover both the fundamental data structure (the RDD) and the more modern abstraction (the DataFrame). Finally, the example will conclude with a simple predictive modeling exercise.
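
As a rough illustration of the MapReduce pattern mentioned above, here is a minimal pure-Python sketch of a word count. It is not Spark code and is not taken from the class materials; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample data are hypothetical, chosen only to show the three stages that a framework like Spark distributes across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group values by key. On a cluster, this step moves data between machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three stages: map, then shuffle, then reduce."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, for illustration only.
sample_lines = ["spark makes big data simple", "big data needs big tools"]
counts = word_count(sample_lines)
print(counts)
```

In this single-process sketch the shuffle is just a dictionary update, but in a distributed setting the equivalent step involves sending records over the network, which is often where the performance bottlenecks discussed in the class come from.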

Takeaways

  • Understand the Spark architecture and system, including why performance problems occur and what to do about them

  • Be able to read data into a Spark engine and do simple data preparation and analysis tasks directly in Spark using both RDDs and DataFrames

  • Run a basic machine-learning model in Spark

  • Connect Spark and R to make analysis tasks at scale more convenient

Prereqs & Preparation

  • You should be comfortable with basic data manipulation concepts like joins, groups, and filters

  • We will work through examples in Python and Pandas, so familiarity with the language will be useful but is not required

  • I will make the code and data sets for the example available, so if you want to follow along on your laptop, bring it with Spark already installed
