Spark is emerging as the platform of choice for processing large data sets. We will look at the Spark architecture and how it addresses big-data problems, and work through several hands-on examples using Spark for analysis and learning tasks.
In this class you will learn the basics of the Spark and MapReduce frameworks and the primary interfaces for working with Spark. Through an extended example, we will see how Spark executes common data processing tasks, where performance bottlenecks can arise, and how to avoid them. We will cover both the fundamental data structure (the RDD) and the more modern abstraction (the DataFrame). Finally, the example will walk through a simple predictive-modeling task.
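To fix ideas before the extended example, the MapReduce model underlying Spark can be sketched in plain Python: a map phase turns each input record into key-value pairs, and a reduce phase combines the values for each key. This is a conceptual sketch only (the corpus is made up); Spark distributes the same two phases across a cluster.

```python
from collections import Counter

# Hypothetical mini-corpus; in Spark these lines would be partitioned across workers.
lines = ["big data", "big spark", "spark spark"]

# Map phase: each line yields (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle + reduce phase: sum the counts for each word.
counts = Counter()
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 1, 'spark': 3}
```

In Spark the same computation is a `flatMap` followed by a `reduceByKey`, with the shuffle handled by the engine.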
Understand the Spark architecture, including why performance problems occur and what to do about them
Be able to read data into Spark and perform simple data preparation and analysis tasks directly in Spark, using both RDDs and DataFrames
Run a basic machine-learning model in Spark
Connect Spark and R to make analysis tasks at scale more convenient
You should be comfortable with basic data manipulation concepts like joins, groups, and filters
We will be doing examples in Python and Pandas, so familiarity with them will be useful but not required
I will make the code and data sets for the examples available, so if you want to work in parallel on your laptop, bring it with Spark installed ahead of time.