
Harnessing the Power of Data for Disaster Relief



Data is the engine driving today’s digital world. From major companies to government agencies to nonprofits, business leaders are hunting for talent that can help them collect, sort, and analyze vast amounts of data — including geodata — to tackle the world’s biggest challenges.

In the case of emergency management, disaster preparedness, response, and recovery, this means using data to expertly identify, manage, and mitigate the risks of destructive hurricanes, intense droughts, raging wildfires, and other severe weather and climate events. And the pressure to make smarter data-driven investments in disaster response planning and education isn’t going away anytime soon — since 1980, the U.S. has suffered 246 weather and climate disasters with losses exceeding $1 billion each, according to the National Centers for Environmental Information.

Employing creative approaches to tackle these pressing issues is a big reason why New Light Technologies (NLT), a leading company in the geospatial data science space, joined forces with General Assembly’s (GA) Data Science Immersive (DSI) course, a hands-on intensive program that fosters job-ready data scientists. Matt Brems, Global Lead Data Science Instructor at GA, and Ran Goldblatt, Chief Scientist and Senior Consultant at NLT, recognized a unique opportunity to test-drive a collaboration between DSI students and NLT’s consulting work for the Federal Emergency Management Agency (FEMA) and the World Bank.

The goal for DSI students: build data solutions that address real-world emergency preparedness and disaster response problems using leading data science tools and programming languages that drive visual, statistical, and data analyses. The partnership has so far produced three successful cohorts with nearly 60 groups of students across campuses in Atlanta, Austin, Boston, Chicago, Denver, New York City, San Francisco, Los Angeles, Seattle, and Washington, D.C., who learn and work together through GA’s Connected Classroom experience.

Taking on Big Problems With Smart Data


DSI students present at NLT’s Washington, D.C. office.

“GA is a pioneering institution for data science, so many of its goals coincide with ours. It’s what also made this partnership a unique fit. When real-world problems are brought to an educational setting with students who are energized and eager to solve concrete problems, smart ideas emerge,” says Goldblatt.

Over the past decade, NLT has supported the ongoing operation, management, and modernization of information systems infrastructure for FEMA, providing the agency with support for disaster response planning and decision-making. The World Bank, another NLT client, faces similar obstacles in its efforts to provide funding for emergency prevention and preparedness.

These large-scale issues served as the basis for the problem statements NLT presented to DSI students, who were challenged to use their newfound skills — from developing data algorithms and analytical workflows to employing visualization and reporting tools — to deliver meaningful, real-time insights that FEMA, the World Bank, and similar organizations could deploy to help communities impacted by disasters. Working in groups, students dived into problems that focused on a wide range of scenarios, including:

  • Using tools such as Google Street View to retrieve pre-disaster photos of structures, allowing emergency responders to easily compare pre- and post-disaster views of damaged properties.
  • Optimizing evacuation routes for search and rescue missions using real-time traffic information.
  • Creating damage estimates by pulling property values from real estate websites like Zillow.
  • Extracting drone data to estimate the quality of building rooftops in Saint Lucia.

“It’s clear these students are really dedicated and eager to leverage what they learned to create solutions that can help people. With DSI, they don’t just walk away with an academic paper or fancy presentation. They’re able to demonstrate they’ve developed an application that, with additional development, could possibly become operational,” says Goldblatt.

Students who participated in the engagements received the opportunity to present their work — using their knowledge in artificial intelligence and machine learning to solve important, tangible problems — to an audience that included high-ranking officials from FEMA, the World Bank, and the United States Agency for International Development (USAID). The students’ projects, which are open source, are also publicly available to organizations looking to adapt, scale, and implement these applications for geospatial and disaster response operations.

“In the span of nine weeks, our students grew from learning basic Python to being able to address specific problems in the realm of emergency preparedness and disaster response,” says Brems. “Their ability to apply what they learned so quickly speaks to how well-qualified GA students and graduates are.”

Here’s a closer look at some of those projects, the lessons learned, and students’ reflections on how GA’s collaboration with NLT impacted their DSI experience.

Leveraging Social Media to Map Disasters


The NLT engagements feature student work that uses social media to identify “hot spots” for disaster relief.

During disasters, one of the biggest challenges for disaster relief organizations is not only mapping and alerting users about the severity of disasters but also pinpointing hot spots where people require assistance. While responders employ satellite and aerial imagery, ground surveys, and other hazard data to assess and identify affected areas, communities on the ground often turn to social media platforms to broadcast distress calls and share status updates.

Cameron Bronstein, a former botany and ecology major from New York, worked with group members to build a model that analyzes and classifies social media posts to determine where people need assistance during and after natural disasters. The group collected tweets related to Hurricane Harvey of 2017 and Hurricane Michael of 2018, which inflicted billions of dollars of damage in the Caribbean and Southern U.S., as test cases for their proof-of-concept model.

“Since our group lacked premium access to social media APIs, we sourced previously collected and labeled text-based data,” says Bronstein. “This involved analyzing and classifying several years of text data — including data sets that contained tweets, as well as transcribed phone calls and voice messages from disaster relief organizations.”
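
The pipeline Bronstein describes can be approximated in a few lines of scikit-learn. Below is a minimal sketch (not the students’ actual code), assuming a hypothetical labeled_posts.csv with a text column and a binary needs_help label:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical labeled data: one social media post per row, with a binary "needs_help" label.
posts = pd.read_csv("labeled_posts.csv")
X_train, X_test, y_train, y_test = train_test_split(
    posts["text"], posts["needs_help"], test_size=0.2, random_state=42)

# Vectorize the text with TF-IDF, then classify with logistic regression.
model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))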

Reflecting on what he enjoyed most while working on the NLT engagement, Bronstein states, “Though this project was ambitious and open to interpretation, overall, it was a good experience and introduction to the type of consulting work I could end up doing in the future.”

Quantifying the Economic Impact of Natural Disasters


Students use interactive data visualization tools to compile and display their findings.

Prior to enrolling in General Assembly’s DSI course in Washington, D.C., Ashley White learned early in her career as a management consultant how to use data to analyze and assess difficult client problems. “What was central to all of my experiences was utilizing the power of data to make informed strategic decisions,” states White.

It was White’s interest in using data for social impact that led her to enroll in DSI where she could be exposed to real-world applications of data science principles and best practices. Her DSI group’s task: developing a model for quantifying the economic impact of natural disasters on the labor market. The group selected Houston, Texas as its test case for defining and identifying reliable data sources to measure the economic impact of natural disasters such as Hurricane Harvey.

As they tackled their problem statement, the group focused on NLT’s intended goal, while effectively breaking their workflow into smaller, more manageable pieces. “As we worked through the data, we discovered it was hard to identify meaningful long-term trends. As scholarly research shows, most cities are pretty resilient post-disaster, and the labor market bounces back quickly as the city recovers,” says White.

The team compiled their results using the analytics and data visualization tool Tableau, incorporating compelling visuals and story taglines into a streamlined, dynamic interface. For version control, White and her group used GitHub to manage and store their findings, and share recommendations on how NLT could use the group’s methodology to scale their analysis for other geographic locations. In addition to the group’s key findings on employment fluctuations post-disaster, the team concluded that while natural disasters are growing in severity, aggregate trends around unemployment and similar data are becoming less predictable.

Cultivating Data Science Talent in Future Engagements

Due to the success of the partnership’s three engagements, GA and NLT have taken steps to formalize future iterations of their collaboration with each new DSI cohort. Mutually beneficial partnerships with leading organizations such as NLT present a unique opportunity to uncover innovative approaches for managing and understanding the numerous ways data science can support technological systems and platforms. The collaboration has also granted aspiring data scientists real-world experience and visibility with key decision-makers who are at the forefront of emergency and disaster management.

“This is only the beginning of a more comprehensive collaboration with General Assembly,” states Goldblatt. “By leveraging GA’s innovative data science curriculum and developing training programs for capacity building that can be adopted by NLT clients, we hope to provide students with essential skills that prepare them for the emerging, yet competitive, geospatial data job market. Moreover, students get the opportunity to better understand how theory, data, and algorithms translate to actual tools, as well as create solutions that can potentially save lives.”

***

New Light Technologies, Inc. (NLT) provides comprehensive information technology solutions for clients in government, commercial, and non-profit sectors. NLT specializes in DevOps enterprise-scale systems integration, development, management, and staffing and offers a unique range of capabilities from Infrastructure Modernization and Cloud Computing to Big Data Analytics, Geospatial Information Systems, and the Development of Software and Web-based Visualization Platforms.

In today’s rapidly evolving technological world, successfully developing and deploying digital geospatial software technologies and integrating disparate data across large complex enterprises with diverse user requirements is a challenge. Our innovative solutions for real-time integrated analytics lead the way in developing highly scalable virtualized geospatial microservices solutions. Visit our website to find out more and contact us at https://NewLightTechnologies.com.

Designing a Dashboard in Tableau for Business Intelligence


Tableau is a data visualization platform that focuses on business intelligence. It has become very popular in recent years because of its flexibility and beautiful visualizations. Clients love the way Tableau presents data and how easy it makes performing analyses. It is one of my favorite analytical tools to work with.

A simple way to define a Tableau dashboard is as an at-a-glance view of a company’s key performance indicators, or KPIs. There are different kinds of dashboards available — it all depends on the business questions being asked and the end user. Is this for an operational team (like one at a distribution center) that needs to see the number of orders per hour and whether sales goals are being met? Or is this for a CEO who would like to measure the productivity of different departments and products against forecast? The first case requires the data to be updated every 10 minutes, almost in real time. The second doesn’t require the same cadence; updating once a day is enough to track company performance.

Over the past few years, I’ve built many dashboards for different types of users, including department heads, business analysts, and directors, and helped many mid-level managers with data analysis. Here are some best practices for creating Tableau dashboards I’ve learned throughout my career.

First Things First: Why Use Data Visualization?

Visualizations are among the most effective ways to analyze data from any business process (sales, returns, purchase orders, warehouse operation, customer shopping behavior, etc.).

Below we have a grid report and bar chart that contain the same information. Which is easier to interpret?

Grid report vs. bar chart.

That’s right — it’s quicker to identify the category with the lowest sales, Tops, using the chart.

Many companies used to use grid reports to operate and make decisions, and many departments still do today, especially in retail. I once went to a trading meeting on a Monday morning where team members printed pages of Excel reports with rows and rows of sales and stock data by product and took them to a meeting room with a ruler and a highlighter to analyze sales trends. Some of these reports took at least two hours to prepare and required combining data from different data sources with VLOOKUPs — a function that allows users to search through columns in Excel. After the meeting, they threw the papers away (what a waste of paper and ink!) and then the following Monday it all started again.

Wouldn’t it be better to have a reporting tool in which the company’s KPIs were updated on a daily basis and presented in an interactive dashboard that could be viewed on tablets/laptops and digitally sliced and diced? That’s where tools like Tableau dashboards come in. You can drill down into details and answer questions raised in the meeting in real time — something you couldn’t do with paper copies.

How to Design a Dashboard in Tableau

Step 1: Identify who will use the dashboard and with what frequency.

Tableau dashboards can be used for many different purposes and therefore will be designed differently for each circumstance. This means that, before you can begin designing a dashboard, you need to know who is going to use it and how often.

Step 2: Define your topic.

The stakeholder (i.e., director, sales manager, CEO, business analyst, buyer) should be able to tell you what kind of business questions need to be answered and the decisions that will be made based on the dashboard.

Here, I am going to use data from a fictional retail company to report on monthly sales.

The commercial director would like to know 1) the countries to which the company’s products have been shipped, 2) which categories are performing well, and 3) sales by product. The option of browsing products is a plus, so the dashboard should include as much detail as possible.

Step 3: Make sure you have all of the necessary data available to answer the questions specified.

Clarify how often you will get the data, the format in which you will receive the data (inside a database or in loose files), the cleanliness of the data, and if there are any data quality issues. You need to evaluate all of this before you promise a delivery date.

Step 4: Create your dashboard.

When it comes to dashboard design, it’s best practice to present data from top to bottom, with the story flowing from left to right like a comic book: start at the top left and finish at the bottom right.

Let’s start by adding the data set to Tableau. For this demo, the data is contained in an Excel file generated by software I developed myself. It’s all dummy data.

To connect to an Excel file from Tableau, select “Excel” from the Connect menu. The tables are on separate Excel sheets, so we’re going to use Tableau to join them, as shown in the image below. Once the tables are joined, go to the bottom and select Sheet 1 to create your first visualization.

Joining Excel sheets in Tableau.

We have two columns in the Order Details table: Quantity and Unit Price. The sales amount is Quantity x Unit Price, so we’re going to create the new metric, “Sales Amount”. Right-click on the measures and select Create > Calculated Field.
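
In Tableau’s calculation editor, the formula is simply [Quantity] * [Unit Price]. For readers who like to sanity-check the numbers outside Tableau, here is a minimal pandas sketch of the same derivation (the file, sheet, and column names are placeholders, not the actual demo data):

import pandas as pd

# Placeholder file and sheet names; the real demo data lives in a private Excel file.
order_details = pd.read_excel("orders.xlsx", sheet_name="Order Details")
order_details["Sales Amount"] = order_details["Quantity"] * order_details["Unit Price"]
print(order_details[["Quantity", "Unit Price", "Sales Amount"]].head())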

Creating a Map in Tableau

We can use maps to visualize data with a geographical component and compare values across geographical regions. To answer our first question — to which countries have the company’s products been shipped? — we’ll create a map view of sales by country.

1. Add Ship Country to the rows and Sales Amount to the columns.

2. Change the view to a map.

Visualizing data across geographical regions.

3. Add Sales Amount to the color pane. Darker colors mean higher sales amounts aggregated by country.

4. You can choose to make the size of the bubbles proportional to the Sales Amount. To do this, drag the Sales Amount measure to the Size area.

5. Finally, rename the sheet “Sales by Country”.
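
Under the hood, this map is simply an aggregation of Sales Amount by Ship Country. A rough pandas equivalent of that aggregation, again with placeholder file, sheet, and column names, looks like this:

import pandas as pd

# Placeholder names; mirrors the join of the Orders and Order Details sheets done in Tableau.
orders = pd.read_excel("orders.xlsx", sheet_name="Orders")
details = pd.read_excel("orders.xlsx", sheet_name="Order Details")
merged = details.merge(orders, on="Order ID")
merged["Sales Amount"] = merged["Quantity"] * merged["Unit Price"]

# Total sales by country: the numbers behind the map's color and bubble size.
print(merged.groupby("Ship Country")["Sales Amount"].sum().sort_values(ascending=False))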

Creating a Bar Chart in Tableau

Now, let’s visualize the second request, “Which categories are performing well?” We’ll need to create a second sheet. The best way to analyze this data is with bar charts, as they are well suited to comparing data across categories. Pie charts work in a similar way, but in this case we have too many categories (more than four), so a pie chart wouldn’t be effective.

1. To create a bar chart, add Category Name to the rows and Sales Amount to the columns.

2. Change the visualization to a bar chart.

3. Switch columns and rows, sort in descending order, and show the values so users can see the exact value that each bar represents.

4. Drag the category name to “Color”.

5. Now, rename the sheet to “Sales by Category”.

Our Sales by Category breakdown.

Assembling a Dashboard in Tableau

Finally, the commercial director would like to see the details of the products sold by each category.

Our last page will be the product detail page. Add Product Name and Image to the rows and Sales Amount to the columns. Rename the sheet as “Products”.

We are now ready to create our first dashboard! Rearrange the chart on the dashboard so that it appears similar to the example below. To display the images, drag the Web Page object next to the Products grid.

Assembling our dashboard.

Additional Actions in Tableau

Now, we’re going to add some actions on the dashboard such that, when we click on a country, we’ll see both the categories of products and a list of individual products sold.

1. Go to Dashboard > Actions.

2. Add Action > Filter.

3. Our “Sales by Country” chart is going to filter Sales by Category and Products.

4. Add a second action. Sales by Category will filter Products.

5. Add a third action, this time selecting URL.

6. Select Products, <Image> on URL, and click on the Test Link to test the image’s URL.

What we have now is an interactive dashboard with a worldwide sales view. To analyze a specific country, we click on the corresponding bubble on the map and Sales by Category will be filtered to what was sold in that country.

When we select a category, we can see the list of products sold for that category. And, when we hover on a product, we can see an image of it.

In just a few steps, we have created a simple dashboard from which any head of department would benefit.

The final product.

Dashboards in Tableau at General Assembly

In GA’s Data Analytics course, students get hands-on training with the versatile Tableau platform. They create dashboards to solve real-world problems in a 1-week accelerated format or a 10-week part-time format — on campus and online. You can also get a taste in our interactive classes and workshops.


Meet Our Expert

Samanta Dal Pont is a business intelligence and data analytics expert in retail, eCommerce, and online media. With an educational background in software engineering and statistics, she is passionate about transforming businesses to make the most of their data. Responsible for analytics, reporting, and visualization in a global organization, Samanta has been an instructor for Data Analytics courses and SQL bootcamps at General Assembly London since 2016.

Samanta Dal Pont, Data Analytics Instructor, General Assembly London

Excel: Building the Foundation for Understanding Data Analytics


If learning data analytics is like trying to ride a bike, then learning Excel is like having a good set of training wheels. Although some people may want to jump right ahead without them, they’ll end up with fewer bruises and a smoother journey if they begin practicing with them on. Indeed, Excel provides an excellent foundation for understanding data analytics.

What exactly is data analytics? It’s more than just simply “crunching numbers,” for one. Data analytics is the art of analyzing and communicating insights from data in order to influence decision-making.

In the age of increasingly sophisticated analytical tools like Python and R, some seasoned analytics professionals may scoff at Excel, which Microsoft first released in 1987, as mere spreadsheet software. Yet most people only scratch the surface of this ubiquitous program’s power as a stepping stone into analytics.

Using Excel for Data Analysis: Management, Cleaning, Aggregation, and More

I refer to Excel as the gateway into analytics. Once you’ve learned the platform inside and out, throughout your data analytics journey you’ll continually say to yourself, “I used to do this in Excel. How do I do it in X or Y?” In today’s digital age, it may seem like there are new analytical tools and software packages coming out every day. As a result, many roles in data analytics today require an understanding of how to leverage and continuously learn multiple tools and packages across various platforms. Thankfully, learning Excel and its fundamentals will provide a strong bedrock of knowledge that you’ll find yourself frequently referring back to when learning newer, more sophisticated programs.

Excel is a robust tool that provides foundational knowledge for performing tasks such as:

  • Database management. Understanding the architecture of any data set is one of the first steps of the data analytics workflow. In Excel, each worksheet can be thought of as a table in a database. Each row in a worksheet can then be considered a record, while each column can be considered an attribute. As you continue to work with multiple worksheets and tables in Excel, you’ll learn that functions such as “VLOOKUP” and “INDEX/MATCH” are similar to the “JOIN” clauses seen in SQL (sketched in code after this list).
  • Data cleaning. Cleaning data is often one of the most crucial and time-intensive components of the data analytics workflow. Excel can be used to clean a data set using various string functions such as “TRIM”, “MID”, or “SUBSTITUTE”. Many of these functions cut across various programs and will look familiar when you learn similar functions in SQL and Tableau.
  • Data aggregation. Once the data’s been cleaned, you’ll need to summarize and compile it. Excel’s aggregation functions such as “COUNT”, “SUM”, “MIN”, or “MAX” can be used to summarize the data. Furthermore, Excel’s Pivot Tables can be leveraged to aggregate and filter data quickly and efficiently. As you continue to manipulate and aggregate data, you’ll begin to understand the underlying SQL queries behind each Pivot Table.
  • Statistics. Descriptive statistics and inferential statistics can be applied through Excel’s functions and add-ons to better understand our data. Descriptive statistics such as the “AVERAGE”, “MEDIAN”, or “STDEV” functions tell us about the central tendency and variability of our data. Additionally, inferential statistics such as correlation and regression can help to identify meaningful patterns in the data which can be further analyzed to make predictions and forecasts.
  • Dashboarding and visualization. One of the final steps of the data analytics workflow involves telling a story with your data. The combination of Excel’s Pivot Tables, Pivot Charts, and slicers offers the underlying tools and flexibility to construct dynamic dashboards with visualizations that convey your story to your audience. As you build dashboards in Excel, you’ll begin to see that Pivot Table fields are the common denominator in almost any visualization software and are no different from the “shelves” used in Tableau to create visualizations.
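
The bullets above map cleanly onto other tools. As a rough illustration, here is a minimal pandas sketch, with invented file and column names, showing a VLOOKUP-style lookup as a merge, a TRIM-style cleanup, and a Pivot-Table-style aggregation:

import pandas as pd

# Invented data: orders.csv has OrderID, CustomerID, Amount; customers.csv has CustomerID, Region.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# VLOOKUP / INDEX-MATCH: look up each order's region by CustomerID.
orders = orders.merge(customers, on="CustomerID", how="left")

# TRIM: strip stray whitespace from a text column.
orders["Region"] = orders["Region"].str.strip()

# Pivot Table: total and average order amount by region.
print(orders.pivot_table(index="Region", values="Amount", aggfunc=["sum", "mean"]))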

If you want to jump into Excel but don’t have a data set to work with, why not analyze your own personal data? You could leverage Excel to keep track of your monthly budget and create a dashboard to see what your spending trends look like over time. Or if you have a fitness tracker, you could export the data from the device and create a dashboard to show your progress over time and identify any trends or areas for improvement. The best way to jump into Excel is to use data that’s personal and relevant — so your own health or finances can be a great start.

Excel at General Assembly

In GA’s part-time Data Analytics course and online Data Analysis course, Excel is the starting point for leveraging other analytical tools such as SQL and Tableau. Throughout the course, you’ll continually have “data déjà vu” as you tell yourself, “Oh, this looks familiar.” Students will come to understand why Excel is considered a jack-of-all-trades: it provides a great foundation in database management, statistics, and dashboard creation. However, as the saying goes, “A jack-of-all-trades is a master of none.” Students will also recognize the limitations of Excel and the point at which tools like SQL and Tableau offer greater functionality.

At GA, we use Excel to clean and analyze data from sources like the U.S. Census and Airbnb to formulate data-driven business decisions. During final capstone projects, students are encouraged to use data from their own line of work to leverage the skills they’ve learned. We partner with students to ensure that they are able to connect the dots along the way and “excel” in their data analytics journey.

Having a foundation in Excel will also benefit students in GA’s full-time Data Science Immersive program as they learn to leverage Python, machine learning, visualizations, and beyond, and those in our part-time Data Science course, who learn skills like statistics, data modeling, and natural language processing. GA also offers day-long Excel bootcamps across our campuses, during which students learn how to simplify complex tasks including math functions, data organization, formatting, and more.


Meet Our Expert

Mathu A. Kumarasamy is a self-proclaimed analytics evangelist and aspiring data scientist. A believer in the saying that “data is the new oil,” Mathu leverages analytics to find, extract, refine, and distribute data in order to help clients make confident, evidence-based decisions. He is especially passionate about leveraging data analytics, technology, and insights from the field of behavioral economics to help establish a culture of evidence-based, value-driven health care in the United States. Mathu enjoys converting others into analytics geeks while teaching General Assembly’s part-time Data Analytics course in Atlanta.

Mathu A. Kumarasamy, Data Analytics Instructor, GA Atlanta

SQL: Using Data Science to Boost Business and Increase Efficiency


In today’s digital age, we’re constantly bombarded with information about new apps, transformative technologies, and the latest and greatest artificial intelligence system. While these technologies may serve very different purposes in our life, all of them share one thing in common: They rely on data. More specifically, they all use databases to capture, store, retrieve, and aggregate data. This begs the question: How do we actually interact with databases to accomplish all of this? The answer: We use Structured Query Language, or SQL (pronounced “sequel” or “ess-que-el”).

Put simply, SQL is the language of data — it’s a programming language that enables us to efficiently create, alter, request, and aggregate data from those mysterious things called databases. It gives us the ability to make connections between different pieces of information, even when we’re dealing with huge data sets. Modern applications are able to use SQL to deliver really valuable pieces of information that would otherwise be difficult for humans to keep track of independently. In fact, pretty much every app that stores any sort of information uses a database. This ubiquity means that developers use SQL to log, record, alter, and present data within the application, while analysts use SQL to interrogate that same data set in order to find deeper insights.

Finding SQL in Everyday Life

Think about the last time you looked up the name of a movie on IMDB. I’ll bet you quickly noticed an actress on the cast list and thought something like, “I didn’t realize she was in that,” then clicked a link to read her bio. As you were navigating through that app, SQL was responsible for returning the information you “requested” each time you clicked a link. This sort of capability is something we’ve come to take for granted these days.

Let’s look at another example that truly is cutting-edge, this time at the intersection of local government and small business. Many metropolitan cities are supporting open data initiatives in which public data is made easily accessible through access to the databases that store this information. As an example, let’s look at Los Angeles building permit data, business listings, and census data.

Imagine you work at a real estate investment firm and are trying to find the next up-and-coming neighborhood. You could use SQL to combine the permit, business, and census data in order to identify areas that are undergoing a lot of construction, have high populations, and contain a relatively low number of businesses. This might be a great opportunity to purchase property in a soon-to-be thriving neighborhood! For the first time in history, it’s easy for a small business to leverage quantitative data from the government in order to make a highly informed business decision.
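
To make that concrete, here is a hedged sketch of the kind of query involved, using Python’s built-in sqlite3 module. The database, table, and column names are invented for illustration; the real Los Angeles open data sets are organized differently:

import sqlite3

# Invented database combining building permits, business listings, and census data by tract.
conn = sqlite3.connect("la_open_data.db")
query = """
SELECT c.tract_id,
       c.population,
       COUNT(DISTINCT p.permit_id)   AS new_construction_permits,
       COUNT(DISTINCT b.business_id) AS active_businesses
FROM census_tracts AS c
LEFT JOIN permits    AS p ON p.tract_id = c.tract_id
LEFT JOIN businesses AS b ON b.tract_id = c.tract_id
GROUP BY c.tract_id, c.population
ORDER BY new_construction_permits DESC, active_businesses ASC
LIMIT 10;
"""
for row in conn.execute(query):
    print(row)  # candidate up-and-coming tracts: lots of permits, high population, few businesses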

Leveraging SQL to Boost Your Business and Career

There are many ways to harness SQL’s power to supercharge your business and career, in marketing and sales roles, and beyond. Here are just a few:

  • Increase sales: A sales manager could use SQL to compare the performance of various lead-generation programs and double down on those that are working.
  • Track ads: A marketing manager responsible for understanding the efficacy of an ad campaign could use SQL to compare the increase in sales before and after running the ad.
  • Streamline processes: A business manager could use SQL to compare the resources used by various departments in order to determine which are operating efficiently.

SQL at General Assembly

At General Assembly, we know businesses are striving to transform their data from raw facts into actionable insights. The primary goal of our data analytics curriculum, from workshops to full-time courses, is to empower people to access this data in order to answer their own business questions in ways that were never possible before.

To accomplish this, we give students the opportunity to use SQL to explore real-world data such as Firefox usage statistics, Iowa liquor sales, or Zillow’s real estate prices. Our full-time Data Science Immersive and part-time Data Analytics courses help students build the analytical skills needed to turn the results of those queries into clear and effective business recommendations. On a more introductory level, after just a couple of hours in one of our SQL workshops, students are able to query multiple data sets with millions of rows.


Meet Our Expert

Michael Larner is a passionate leader in the analytics space who specializes in using techniques like predictive modeling and machine learning to deliver data-driven impact. A Los Angeles native, he has spent the last decade consulting with hundreds of clients, including 50-plus Fortune 500 companies, to answer some of their most challenging business questions. Additionally, Michael empowers others to become successful analysts by leading trainings and workshops for corporate clients and universities, including General Assembly’s part-time Data Analytics course and SQL/Excel workshops in Los Angeles.

“In today’s fast-paced, technology-driven world, data has never been more accessible. That makes it the perfect time — and incredibly important — to be a great data analyst.”

– Michael Larner, Data Analytics Instructor, General Assembly Los Angeles

Using Apache Spark For High Speed, Large Scale Data Processing


Apache Spark is an open-source framework used for large-scale data processing. The framework is made up of many components, including four programming APIs and four major libraries. Since Spark’s release in 2014, it has become one of Apache’s fastest growing and most widely used projects of all time.

Spark uses an in-memory processing paradigm to speed up computation and run programs 10 to 100 times faster than other big data technologies like Hadoop MapReduce. According to the 2016 Apache Spark Survey, more than 900 companies, including IBM, Google, Netflix, Amazon, Microsoft, Intel, and Yahoo, use Spark in production for data processing and querying.

Apache Spark is important to the big data field because it represents the next generation of big data processing engines and is a natural successor to MapReduce. One of Spark’s advantages is that its use of four programming APIs — Scala, Python, R, and Java 8 — allows the user flexibility to work in the language of their choice. This makes the tool much more accessible to a wide range of programmers with different capabilities. Spark also has great flexibility in its ability to read all types of data from various locations such as the Hadoop Distributed File System (HDFS), Amazon’s web-based Simple Storage Service (S3), or even the local filesystem.

Production-Ready and Scalable

Spark’s greatest advantage is that it maximizes the capabilities of data science’s most expensive resource: the data scientist. Computers and programs have become so fast that we are no longer limited by what they can do as much as we are limited by human productivity. By providing a flexible language platform and concise syntax, Spark lets data scientists write more programs, iterate through their programs, and have them run much quicker. The code is production-ready and scalable, so there’s no need to hand off code requirements to a development team for changes.

It takes only a few minutes to write a word-count program in Spark, but it would take much longer to write the same program in Java. Because the Spark code is so much shorter, there’s less of a need to debug or use version control tools.

Spark’s concise syntax can best be illustrated with the following examples. The Spark code is only four lines compared with almost 58 for Java.

Java vs. Spark
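
For readers who want to see what that brevity looks like in practice, here is a minimal PySpark word-count sketch in the spirit of the comparison above (input.txt stands in for any local text file; this is not the exact code shown in the image):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read a text file, split each line into words, and count occurrences of each word.
counts = (spark.sparkContext.textFile("input.txt")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("word_counts")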

Faster Processing

Spark utilizes in-memory processing to speed up applications. Older big data frameworks, such as Hadoop, use many intermediate disk reads and writes to accomplish the same task. For small jobs on several gigabytes of data, this difference is not as pronounced, but for machine learning applications and more complex tasks such as natural language processing, the difference can be tremendous. Logistic regression, a technique taught in all of General Assembly’s full- and part-time data science courses, can be sped up over 100x.

Spark has four key libraries that also make it much more accessible and provide a wider set of tools for people to use. Spark SQL is ideal for leveraging SQL skills or working with data frames; Spark Streaming has functions for data processing, useful if you need to process data in near real time; and GraphX has prewritten algorithms that are useful if you have graph data or need to do graph processing. The library most useful to students in our Data Science Immersive, though, is the Spark MLlib machine learning library, which has prewritten distributed machine learning algorithms for use on data frames.
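
As a small taste of MLlib, here is a hedged sketch of the distributed logistic regression mentioned above, assuming a hypothetical CSV of labeled examples with two numeric feature columns:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Hypothetical data: numeric columns feature_1 and feature_2, plus a binary "label" column.
df = spark.read.csv("labeled_examples.csv", header=True, inferSchema=True)

# MLlib models expect the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label").fit(assembler.transform(df))
print(model.coefficients)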

Spark at General Assembly

At GA, we teach both the concepts and the tools of data science. Because hiring managers from marketing, technology, and biotech companies, as well as guest speakers like company founders and entrepreneurs, regularly talk about using Spark, we’ve incorporated it into the curriculum to ensure students are fluent in the field’s most relevant skills. I teach Spark as part of our Data Science Immersive (DSI) course in Boston, and I previously taught two Spark courses for Cloudera and IBM. Spark is a great tool to teach because the general curriculum focuses mostly on Python, and Spark has a Python API/library called PySpark.

When we teach Spark in DSI, we cover resilient distributed data sets, directed acyclic graphs, closures, lazy execution, and reading JavaScript Object Notation (JSON), a common big data file format.


Meet Our Expert

Joseph Kambourakis has over 10 years of teaching experience and over five years of experience teaching data science and analytics. He has taught in more than a dozen countries and has been featured in Japanese and Saudi Arabian press. He holds a bachelor’s degree in electrical and computer engineering from Worcester Polytechnic Institute and an MBA with a focus in analytics from Bentley University. He is a passionate Arsenal FC supporter and competitive Magic: The Gathering player. He currently lives with his wife and daughter in Needham, Massachusetts.

“GA students come to class motivated to learn. Throughout the Data Science Immersive course, I keep them on their path by being patient and setting up ideas in a simple way, then letting them learn from hands-on lab work.”

Joseph Kambourakis, Data Science Instructor, General Assembly Boston

Machine Learning for Data-Driven Predictions and Problem Solving


Ever wonder how apps, websites, and machines seem to be able to predict the future? Like how Amazon knows what your next purchase may be, or how self-driving cars can safely navigate a complex road situation?

The answer lies in machine learning.

Machine learning is a branch of artificial intelligence (AI) that concentrates on building systems that can learn from and make decisions based on data. Instead of explicitly programming the machine to solve the problem, we show it how the problem was solved in the past, and the machine learns from those examples the key steps required to do the same task on its own.

Think about how Netflix makes movie recommendations. The recommendation engine peeks at the movies you’ve viewed/rated in the past. It then starts to learn the factors that influence your movie preferences and stores them in a database. It could be as simple as noting that you prefer to watch “comedy movies released after 2005 featuring Adam Sandler.” It then starts recommending similar movies that you haven’t watched — all without writing any explicit rules!

This is the power of machine learning.

Machine learning is revolutionizing every industry by bringing greater value to companies’ years of saved data. Leveraging machine learning enables organizations to make more precise decisions instead of following intuition. Companies have begun to embrace the power of machine learning and revise their strategies in order to remain more competitive.

Data Scientists: The Forces Behind Machine Learning

Machine learning is typically practiced by data scientists, who help organizations discover hidden value from their data — thereby enabling them to make smarter business decisions. For instance, insurers use machine learning to make accurate predictions on fraudulent claims, rather than relying on traditional analysis or human judgement. This has a significant impact that can result in lower costs and higher revenue for businesses. Data scientists work with various stakeholders in a company, like business users or product owners, to discover problems and gather data that will be used to solve them.

Data scientists collect, process, clean up, and verify the integrity of data. They apply their engineering, modeling, and statistical skills to build end-to-end machine learning systems. They constantly monitor the performance of those systems and make improvements wherever possible. Often, they need to communicate with non-technical audiences — including stakeholders across the company — in a compelling way to highlight the business impact and opportunity. At the end of the day, those stakeholders have to act on and possibly make far-reaching decisions based on the data scientist’s findings.

Above all, data scientists need to be creative and avid problem-solvers. Possessing this combination of skills makes them a rare breed — so it’s no wonder they’re highly sought after by companies across many industries, such as health care, retail, manufacturing, and technology.

Supervised Learning

Machine learning algorithms fall into two categories: supervised and unsupervised learning. Supervised learning tries to predict a future value by relying on training from past data. For instance, Netflix’s movie-recommendation engine is most likely supervised: it uses a user’s past movie ratings as training data for the model and then predicts that user’s rating for unseen movies. Supervised learning enjoys more commercial success than unsupervised learning; popular use cases include fraud detection, image recognition, credit scoring, product recommendation, and malfunction prediction.
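
A minimal supervised-learning sketch in scikit-learn might look like the following: a toy rating predictor trained on a hypothetical table of past ratings, not Netflix’s actual system:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical training data: numeric features describing each movie plus the rating the user gave it.
ratings = pd.read_csv("past_ratings.csv")
X = ratings[["release_year", "runtime_minutes", "comedy_score"]]
y = ratings["user_rating"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)
print("R^2 on held-out ratings:", model.score(X_test, y_test))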

Unsupervised Learning

Unsupervised learning is not about prediction but rather about uncovering hidden structures from the data. It’s helpful in identifying segments or groups, especially when there is no prior information available about those segments. These algorithms are commonly used in market segmentation. They enable marketers to identify target segments in order to maximize revenue, create anomaly detection systems to identify suspicious user behavior, and more.

For instance, Netflix may know how many customers it has, but wants to understand what kind of groupings they fall into in order to offer services targeted to them. The streaming service may have 50 or more different customer types, aka segments, but its data scientists don’t know yet.

If the company knows that most of its customers are in the “families with children” segment, it can invest in building specific programs to meet customer needs. But without that information, Netflix’s data scientists can’t build a supervised machine learning system. So, they build an unsupervised machine learning algorithm instead, which identifies and extracts various customer segments within the data and allows them to identify groups such as “families with children” or “working professionals.”
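
A minimal unsupervised-learning sketch of that segmentation idea, again with invented column names standing in for per-customer viewing behavior:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented data: one row per customer, with simple viewing-behavior features.
customers = pd.read_csv("viewing_behavior.csv")
features = customers[["hours_watched_per_week", "kids_titles_pct", "late_night_pct"]]

# Scale the features, then look for five customer segments.
segments = KMeans(n_clusters=5, random_state=42, n_init=10).fit_predict(StandardScaler().fit_transform(features))
customers["segment"] = segments
print(customers.groupby("segment").mean())  # inspect each segment's average behavior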

Machine Learning at General Assembly

At General Assembly, our Data Science Immersive program trains students in machine learning, programming, data visualization, and other skills needed to become a job-ready data scientist. Students learn the hands-on languages and techniques, like SQL, Python, and UNIX, that are needed to gather and organize data, build predictive models, create data visualizations, and tackle real-world projects. In class, students work on data science labs, compete on the data science platform Kaggle, and complete a capstone project to showcase their data science skills. They also gain access to career coaching, job-readiness training, and networking opportunities.

If you’re looking to learn during evenings and weekends, you can explore our part-time Data Science course, or visit one of GA’s worldwide campuses for a short-form event or workshop led by local professionals in the field.


Meet Our Expert

Kirubakumaresh Rajendran is an experienced data scientist who’s passionate about applying machine learning and statistical modeling techniques to the domain of business problems. He has worked with IBM and Morgan Stanley to build data-driven products that leverage machine learning techniques. He is a co-instructor for the Data Science Immersive course at GA’s Sydney campus, and enjoys teaching, mentoring, and guiding aspiring data scientists.

“Machines are helping humans build self-driving cars, cancer detection, and more, making it the right time to roll up your sleeves, get into the world of machine learning, and teach machines to make the world a better place.”

– Kirubakumaresh Rajendran, Data Science Immersive Instructor, GA Sydney

Python: The Programming Language Everyone Needs to Learn


What’s one thing that Bill Gates, Mark Zuckerberg, Sheryl Sandberg, will.i.am, Chris Bosh, Karlie Kloss, and I, a data science instructor at General Assembly, all have in common? We all think you should learn how to code.

There are countless reasons to learn how to code, even if you don’t want to become a full-time programmer:

  • Programming teaches you amazing problem-solving skills.
  • You’ll be better able to collaborate with engineers and developers if you can “speak their language.”
  • It enables you to help build the technologies of the future, including web applications, machine learning models, chatbots, and anything else you can imagine.

To most people, learning to program — or even choosing what language to learn — seems daunting. I’ll make it simple: Python is an excellent place to start.

Python is an immensely popular programming language commonly used by data analysts, data scientists, and software engineers. In addition to being one of the most popular — it’s used by companies like Google, SpaceX, and Instagram for a huge variety of tasks, including cleaning data, building AI models, building web apps, and more — Python stands out for being very simple to read and write, while offering extreme flexibility and having an active community.

Here’s a cool example of just how simple Python is. The following code tells the computer to print the words “Hello World”:

In Python:

print("Hello World")

Yup, that’s really all it takes! For context, let’s compare that to another popular programming language, Java, which has a steeper learning curve (though is still a highly desirable skill set in the job market).

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}

Clearly, Python requires much less code.

Experiencing Python in Everyday Life

Let’s talk about some of the ways in which Python is used today, including automating a process, building the functionality of an application, or delving into machine learning.

Here are some fascinating examples of how Python is shaping the world we live in:

  • Hollywood special effects: Remember that summer blockbuster with the huge explosions? A lot of companies, including Lucasfilm’s Industrial Light & Magic (ILM), use Python to help program those awesome special effects. By using Python, companies like ILM have been able to develop standard toolkits that they can reuse across productions, while still retaining the flexibility to build custom effects in less time than ever before.
  • File-sharing applications: When Dropbox was created in 2007, it used Python to build the desktop applications and server infrastructure responsible for actually sharing the files. More than a decade later, Python still powers those desktop applications: a single codebase that Dropbox wrote once and runs on both Macs and PCs.
  • Web applications: Python is used to run various parts of some of today’s most popular websites, including Pinterest, Instagram, Spotify, and YouTube. In fact, Pinterest has used Python in some form since it was founded (e.g., to power its web app, build and maintain data pipelines, and perform analyses).
  • Artificial intelligence: Python is especially popular in the artificial intelligence community, again for its ease of use and flexibility. For example, in just a few hours, a business could build a basic chatbot that answers some of the most common questions from its customers. To do this, programmers could use Python to scrape the contents of all of the email exchanges with the company’s customers, identify common themes in these exchanges with visualizations, and then build a predictive model that can be used by the chatbot application to give appropriate responses.

Python at General Assembly

General Assembly focuses on building practical experience when learning new technical skills. We want students to walk away from our data science courses and bootcamps equipped to tackle the challenges they’re facing in their own lives and careers.


Many of our courses are designed to teach folks with limited exposure to Python to use it to answer real business questions. Dive into fundamental concepts and techniques, and build your own custom web or data application in our part-time Python Programming course. Or learn to leverage the language as part of our full-time Data Science Immersive program, part-time Data Science course, or a one-day Python bootcamp. Projects students have tackled include visualizing SAT scores from across the country, scraping data from public websites, identifying causes of airplane delays, and predicting Netflix ratings based on viewer sentiment and information from IMDB.


Meet Our Expert

Michael Larner is a passionate leader in the analytics space who specializes in using techniques like predictive modeling and machine learning to deliver data-driven impact. A Los Angeles native, he has spent the last decade consulting with hundreds of clients, including 50-plus Fortune 500 companies, to answer some of their most challenging business questions. Additionally, Michael empowers others to become successful analysts by leading trainings and workshops for corporate clients and universities, including General Assembly’s part-time Data Analytics course and SQL/Excel workshops in Los Angeles.

“GA provides an amazing community of colleagues, peers, and fellow learners that serve as a wonderful resource as you continue to build your career. GA exposes students to real-world analyses to gain practical experience.”

Michael Larner, Data Analytics Instructor, General Assembly Los Angeles