data Tag Archives - General Assembly Blog

Harnessing the Power of Data for Disaster Relief



Data is the engine driving today’s digital world. From major companies to government agencies to nonprofits, business leaders are hunting for talent that can help them collect, sort, and analyze vast amounts of data — including geodata — to tackle the world’s biggest challenges.

In the case of emergency management, disaster preparedness, response, and recovery, this means using data to expertly identify, manage, and mitigate the risks of destructive hurricanes, intense droughts, raging wildfires, and other severe weather and climate events. And the pressure to make smarter, data-driven investments in disaster response planning and education isn’t going away anytime soon: since 1980, the U.S. has suffered 246 weather and climate disasters whose losses each topped $1 billion, according to the National Centers for Environmental Information.

Employing creative approaches to these pressing issues is a big reason why New Light Technologies (NLT), a leading company in the geospatial data science space, joined forces with General Assembly’s (GA) Data Science Immersive (DSI) course, a hands-on intensive program that fosters job-ready data scientists. Matt Brems, Global Lead Data Science Instructor at GA, and Ran Goldblatt, Chief Scientist and Senior Consultant at NLT, recognized a unique opportunity to test-drive a collaboration between DSI students and NLT’s consulting work for the Federal Emergency Management Agency (FEMA) and the World Bank.

The goal for DSI students: build data solutions that address real-world emergency preparedness and disaster response problems using leading data science tools and programming languages that drive visual, statistical, and data analyses. The partnership has so far produced three successful cohorts with nearly 60 groups of students across campuses in Atlanta, Austin, Boston, Chicago, Denver, New York City, San Francisco, Los Angeles, Seattle, and Washington, D.C., who learn and work together through GA’s Connected Classroom experience.

Taking on Big Problems With Smart Data

DSI students present at NLT’s Washington, D.C. office.

“GA is a pioneering institution for data science, so many of its goals coincide with ours. It’s what also made this partnership a unique fit. When real-world problems are brought to an educational setting with students who are energized and eager to solve concrete problems, smart ideas emerge,” says Goldblatt.

Over the past decade, NLT has supported the ongoing operation, management, and modernization of information systems infrastructure for FEMA, providing the agency with support for disaster response planning and decision-making. The World Bank, another NLT client, faces similar obstacles in its efforts to provide funding for emergency prevention and preparedness.

These large-scale issues served as the basis for the problem statements NLT presented to DSI students, who were challenged to use their newfound skills — from developing data algorithms and analytical workflows to employing visualization and reporting tools — to deliver meaningful, real-time insights that FEMA, the World Bank, and similar organizations could deploy to help communities impacted by disasters. Working in groups, students dived into problems that focused on a wide range of scenarios, including:

  • Using tools such as Google Street View to retrieve pre-disaster photos of structures, allowing emergency responders to easily compare pre- and post-disaster aerial views of damaged properties.
  • Optimizing evacuation routes for search and rescue missions using real-time traffic information.
  • Creating damage estimates by pulling property values from real estate websites like Zillow.
  • Extracting drone data to estimate the quality of building rooftops in Saint Lucia.

“It’s clear these students are really dedicated and eager to leverage what they learned to create solutions that can help people. With DSI, they don’t just walk away with an academic paper or fancy presentation. They’re able to demonstrate they’ve developed an application that, with additional development, could possibly become operational,” says Goldblatt.

Students who participated in the engagements received the opportunity to present their work — using their knowledge in artificial intelligence and machine learning to solve important, tangible problems — to an audience that included high-ranking officials from FEMA, the World Bank, and the United States Agency for International Development (USAID). The students’ projects, which are open source, are also publicly available to organizations looking to adapt, scale, and implement these applications for geospatial and disaster response operations.

“In the span of nine weeks, our students grew from learning basic Python to being able to address specific problems in the realm of emergency preparedness and disaster response,” says Brems. “Their ability to apply what they learned so quickly speaks to how well-qualified GA students and graduates are.”

Here’s a closer look at some of those projects, the lessons learned, and students’ reflections on how GA’s collaboration with NLT impacted their DSI experience.

Leveraging Social Media to Map Disasters

The NLT engagements feature student work that uses social media to identify “hot spots” for disaster relief.

During disasters, one of the biggest challenges for disaster relief organizations is not only mapping and alerting users about the severity of disasters but also pinpointing hot spots where people require assistance. While responders employ satellite and aerial imagery, ground surveys, and other hazard data to assess and identify affected areas, communities on the ground often turn to social media platforms to broadcast distress calls and share status updates.

Cameron Bronstein, a former botany and ecology major from New York, worked with group members to build a model that analyzes and classifies social media posts to determine where people need assistance during and after natural disasters. The group collected tweets related to Hurricane Harvey of 2017 and Hurricane Michael of 2018, which inflicted billions of dollars of damage in the Caribbean and Southern U.S., as test cases for their proof-of-concept model.

“Since our group lacked premium access to social media APIs, we sourced previously collected and labeled text-based data,” says Bronstein. “This involved analyzing and classifying several years of text language — including data sets that contained tweets, and transcribed phone calls and voice messages from disaster relief organizations.”
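The students’ actual code isn’t shown here, but as a rough sketch of the kind of workflow Bronstein describes (training a classifier on previously collected, labeled, disaster-related text), a minimal Python example using scikit-learn might look like the following; the file name and column names are hypothetical:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Hypothetical file of previously collected, labeled disaster-related posts:
    # one column of raw text, one binary label for "needs assistance."
    df = pd.read_csv("labeled_tweets.csv")

    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["needs_assistance"], test_size=0.2, random_state=42
    )

    # Turn raw text into TF-IDF features, then fit a simple classifier.
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)

    print("Held-out accuracy:", model.score(X_test, y_test))

A model along these lines could then score new posts as they arrive and flag likely calls for assistance for responders to review.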

Reflecting on what he enjoyed most about the NLT engagement, Bronstein says, “Though this project was ambitious and open to interpretation, overall, it was a good experience and introduction to the type of consulting work I could end up doing in the future.”

Quantifying the Economic Impact of Natural Disasters

Students use interactive data visualization tools to compile and display their findings.

Prior to enrolling in General Assembly’s DSI course in Washington, D.C., Ashley White learned early in her career as a management consultant how to use data to analyze and assess difficult client problems. “What was central to all of my experiences was utilizing the power of data to make informed strategic decisions,” states White.

It was White’s interest in using data for social impact that led her to enroll in DSI, where she could be exposed to real-world applications of data science principles and best practices. Her DSI group’s task: developing a model for quantifying the economic impact of natural disasters on the labor market. The group selected Houston, Texas, as its test case for defining and identifying reliable data sources to measure the economic impact of natural disasters such as Hurricane Harvey.

As they tackled their problem statement, the group focused on NLT’s intended goal, while effectively breaking their workflow into smaller, more manageable pieces. “As we worked through the data, we discovered it was hard to identify meaningful long-term trends. As scholarly research shows, most cities are pretty resilient post-disaster, and the labor market bounces back quickly as the city recovers,” says White.

The team compiled their results using the analytics and data visualization tool Tableau, incorporating compelling visuals and story taglines into a streamlined, dynamic interface. For version control, White and her group used GitHub to manage and store their findings, and share recommendations on how NLT could use the group’s methodology to scale their analysis for other geographic locations. In addition to the group’s key findings on employment fluctuations post-disaster, the team concluded that while natural disasters are growing in severity, aggregate trends around unemployment and similar data are becoming less predictable.

Cultivating Data Science Talent in Future Engagements

Due to the success of the partnership’s three engagements, GA and NLT have taken steps to formalize future iterations of their collaboration with each new DSI cohort. Mutually beneficial partnerships with leading organizations such as NLT present a unique opportunity to uncover innovative approaches for managing and understanding the many ways data science can support technological systems and platforms. The partnership has also given aspiring data scientists real-world experience and visibility with key decision-makers who are at the forefront of emergency and disaster management.

“This is only the beginning of a more comprehensive collaboration with General Assembly,” states Goldblatt. “By leveraging GA’s innovative data science curriculum and developing training programs for capacity building that can be adopted by NLT clients, we hope to provide students with essential skills that prepare them for the emerging, yet competitive, geospatial data job market. Moreover, students get the opportunity to better understand how theory, data, and algorithms translate to actual tools, as well as create solutions that can potentially save lives.”

***

New Light Technologies, Inc. (NLT) provides comprehensive information technology solutions for clients in government, commercial, and non-profit sectors. NLT specializes in DevOps enterprise-scale systems integration, development, management, and staffing and offers a unique range of capabilities from Infrastructure Modernization and Cloud Computing to Big Data Analytics, Geospatial Information Systems, and the Development of Software and Web-based Visualization Platforms.

In today’s rapidly evolving technological world, successfully developing and deploying digital geospatial software technologies and integrating disparate data across large complex enterprises with diverse user requirements is a challenge. Our innovative solutions for real-time integrated analytics lead the way in developing highly scalable virtualized geospatial microservices solutions. Visit our website to find out more and contact us at https://NewLightTechnologies.com.

Using D3 Visualization for Data-Driven Interactive Motion Graphics

By Matt Huntington

In the midst of the 2012 U.S. presidential election, The New York Times published a series of online articles that contained beautiful, interactive, data-driven graphics to illustrate changes in voter behavior over time and the candidates’ paths to winning the election. Created using a JavaScript library called D3 (for Data-Driven Documents), these data visualizations caused a lot of excitement among developers.

Until that point, these kinds of fast, interactive motion graphics based on large data sets hadn’t really been seen by the general public on websites. Developers recognized that with D3, they could easily create beautiful, data-driven graphics, in this case in collaboration with data scientists, journalists, and graphic designers.

But D3 was more than just another JavaScript library. It was the final link in a series of technological advancements that led to these kinds of graphic possibilities:

1. Large amounts of data became ubiquitous, and the speed at which it is transmitted and processed increased exponentially.

Since the mid 2000s, there has been a surge in the amount of data available to developers, the speed at which it can be processed, and the ways in which we conceptualize how to display it. At the heart of this explosion are advances in storage technology that have made it incredibly easy, cheap, and fast to store massive amounts of information. For example, in 2007 a 1GB hard drive cost about $60. Ten years later, the same amount can get you 1TB of storage. That’s 1,000 times more data for the same cost. Computer and internet connection speeds have dramatically increased as well.

As a result, the fields of artificial intelligence, big data, and data science have been able to mature in ways that were previously not possible. All of this means that the manner in which we think about, analyze, and visually present large data sets has drastically changed in a relatively short amount of time. In addition, people on their home computers now have the ability to quickly download large data sets — processors are 60 times faster than 10 years ago — and perform calculations to display them in interesting ways.

2. All major web browsers began to support Scalable Vector Graphics (SVG), a technology that draws images from mathematical equations.

Scalable Vector Graphics are the second piece of the puzzle because they allow for the creation of images from mathematical equations and code. A normal image, like a .jpg or a .gif, is made up of a series of colored dots, or pixels, like this:

[Image: a grid of colored pixels]

These are great for creating photos, but it’s almost impossible to link an image like this to data in such a way that the code is used to generate the image. Contrast that with the SVG for a circle and its code:

[Image: SVG code and the circle it draws]

Note the cx="50" cy="50" r="40" code. This is what defines the circle’s position and size. Since it is code, data can be used to change these values. After all the major browsers allowed SVG code to be embedded within web pages, developers could use JavaScript to manipulate the images. The difficulty, though, was in converting data values to SVG code. This is the key role that D3 plays.

3. The D3 JavaScript library was created to tie together data and SVG.

At its core, D3 maps data values to visual values — but as simple as that sounds, it is incredibly powerful. Using D3, a developer no longer needs to worry about the math involved in converting, for example, the number of votes into the height of a rectangle in a bar graph. They simply hand the data to D3, which analyzes it to figure out the minimum and maximum values, and tell D3 what the minimum and maximum heights of the bars should be. D3 then generates all the bars in the graph by itself, turning a lot of work involving tricky math into a few simple steps.
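D3 itself is a JavaScript library, but the core idea it implements, a scale that maps a data domain onto a visual range, can be sketched in a few lines of any language. Here is a rough illustration of that mapping written in Python (the vote counts are made up for the example):

    def linear_scale(domain, output_range):
        """Return a function mapping values in `domain` onto `output_range` linearly,
        similar in spirit to what a D3 scale does for SVG attributes."""
        (d0, d1), (r0, r1) = domain, output_range
        return lambda value: r0 + (value - d0) * (r1 - r0) / (d1 - d0)

    votes = [1200, 5400, 3300, 8800]  # made-up data values
    bar_height = linear_scale((min(votes), max(votes)), (0, 300))  # heights in pixels

    for v in votes:
        print(f"{v:>5} votes -> bar height {bar_height(v):.0f}px")

In D3, this same idea is wrapped in convenient scale objects, and the resulting values are written directly into SVG attributes like the circle code above.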

D3 has much more advanced functionality, too. It can generate common elements of graphs, such as axes and pie-chart segments. It can animate between different visual values, so, for example, complex line graphs smoothly morph as the data changes. People are just beginning to scratch the surface of D3’s capabilities, and it’s an exciting time to get involved with it.

D3.js at General Assembly

D3 is an incredibly approachable library once the fundamentals are in place. Because of this, in General Assembly’s full-time Web Development Immersive (WDI) program, on campus and remotely, we often reserve it as an optional topic at the end of the course. WDI focuses on the fundamentals of programming, from front-end essentials like JavaScript through back-end skills like Ruby on Rails and APIs. Once students have a thorough understanding of these competencies, learning D3 comes easily.

GA also offers occasional short-form workshops on D3 and other data visualization techniques, so developers and data scientists can begin to discover how to leverage programming to create data-driven stories.


Meet Our Expert

Matt Huntington has worked as a developer for over 15 years and has a full understanding of all aspects of development (server side, client side, and mobile). Matt graduated magna cum laude from Vassar College with a degree in computer science. He teaches the full-time Web Development Immersive Remote course at General Assembly, and has worked for clients including Nike, IBM, Pfizer, MTV, Chanel, Verizon, Goldman Sachs, AARP, and BAM.

Matt Huntington, Web Development Immersive Remote Instructor

Designing a Dashboard in Tableau for Business Intelligence

By Samanta Dal Pont

Tableau is a data visualization platform that focuses on business intelligence. It has become very popular in recent years because of its flexibility and beautiful visualizations. Clients love the way Tableau presents data and how easy it makes performing analyses. It is one of my favorite analytical tools to work with.

A simple way to define a Tableau dashboard is as an at-a-glance view of a company’s key performance indicators, or KPIs. There are different kinds of dashboards available — it all depends on the business questions being asked and the end user. Is this for an operational team (like one at a distribution center) that needs to see the number of orders by hour and whether sales goals are being achieved? Or is this for a CEO who would like to measure the productivity of different departments and products against forecast? The first case will require the data to be updated every 10 minutes, almost in real time. The second doesn’t require the same cadence; once a day will be enough to track company performance.

Over the past few years, I’ve built many dashboards for different types of users, including department heads, business analysts, and directors, and helped many mid-level managers with data analysis. Here are some best practices for creating Tableau dashboards I’ve learned throughout my career.

First Things First: Why Use a Data Visualization?

Visualizations are among the most effective ways to analyze data from any business process (sales, returns, purchase orders, warehouse operation, customer shopping behavior, etc.).

Below we have a grid report and bar chart that contain the same information. Which is easier to interpret?

Grid report vs. bar chart.

That’s right — it’s quicker to identify the category with the lowest sales, Tops, using the chart.

Many companies used to use grid reports to operate and make decisions, and many departments still do today, especially in retail. I once went to a trading meeting on a Monday morning where team members printed pages of Excel reports with rows and rows of sales and stock data by product and took them to a meeting room with a ruler and a highlighter to analyze sales trends. Some of these reports took at least two hours to prepare and required combining data from different data sources with VLOOKUPs — a function that allows users to search through columns in Excel. After the meeting, they threw the papers away (what a waste of paper and ink!) and then the following Monday it all started again.

Wouldn’t it be better to have a reporting tool in which the company’s KPIs were updated on a daily basis and presented in an interactive dashboard that could be viewed on tablets/laptops and digitally sliced and diced? That’s where tools like Tableau dashboards come in. You can drill down into details and answer questions raised in the meeting in real time — something you couldn’t do with paper copies.

How to Design a Dashboard in Tableau

Step 1: Identify who will use the dashboard and with what frequency.

Tableau dashboards can be used for many different purposes and therefore will be designed differently for each circumstance. This means that, before you can begin designing a dashboard, you need to know who is going to use it and how often.

Step 2: Define your topic.

The stakeholder (i.e., director, sales manager, CEO, business analyst, buyer) should be able to tell you what kind of business questions need to be answered and the decisions that will be made based on the dashboard.

Here, I am going to use data from a fictional retail company to report on monthly sales.

The commercial director would like to know 1) the countries to which the company’s products have been shipped, 2) which categories are performing well, and 3) sales by product. The option of browsing products is a plus, so the dashboard should include as much detail as possible.

Step 3: Make sure you have all of the necessary data available to answer the questions specified.

Clarify how often you will get the data, the format in which you will receive it (inside a database or in loose files), how clean it is, and whether there are any data quality issues. You need to evaluate all of this before you promise a delivery date.

Step 4: Create your dashboard.

When it comes to dashboard design, it’s best practice to present data from top to bottom. The story should go from left to right, like a comic book, where you start at the top left and finish at the bottom right.

Let’s start by adding the data set to Tableau. For this demo, the data is contained in an Excel file generated by software I developed myself. It’s all dummy data.

To connect to an Excel file from Tableau, select “Excel” from the Connect menu. The tables are on separate Excel sheets, so we’re going to use Tableau to join them, as shown in the image below. Once the tables are joined, go to the bottom and select Sheet 1 to create your first visualization.

Joining Excel sheets in Tableau.

We have two columns in the Order Details table: Quantity and Unit Price. The sales amount is Quantity x Unit Price, so we’re going to create a new metric, “Sales Amount”. Right-click on the measures and select Create > Calculated Field; the calculation itself is simply [Quantity] * [Unit Price], using those field names.

Creating a Map in Tableau

We can use maps to visualize data with a geographical component and compare values across geographical regions. To answer our first question — “To which countries have the company’s products been shipped?” — we’ll create a map view of sales by country.

1. Add Ship Country to the rows and Sales Amount to the columns.

2. Change the view to a map.

Map
Visualizing data across geographical regions.

3. Add Sales Amount to the color pane. Darker colors mean higher sales amounts aggregated by country.

4. You can choose to make the size of the bubbles proportional to the Sales Amount. To do this, drag the Sales Amount measure to the Size area.

5. Finally, rename the sheet “Sales by Country”.

Creating a Bar Chart in Tableau

Now, let’s visualize the second request, “Which categories are performing well?” We’ll need to create a second sheet. The best way to analyze this data is with bar charts, as they are ideal for comparing data across categories. Pie charts work in a similar way, but in this case we have too many categories (more than four), so they wouldn’t be effective.

1. To create a bar chart, add Category Name to the rows and Sales Amount to the columns.

2. Change the visualization to a bar chart.

3. Switch the columns and rows, sort in descending order, and show the values so users can see the exact value each bar represents.

4. Drag the category name to “Color”.

5. Now, rename the sheet to “Sales by Category”.

Our Sales by Category breakdown.

Assembling a Dashboard in Tableau

Finally, the commercial director would like to see the details of the products sold by each category.

Our last sheet will be the product detail view. Add Product Name and Image to the rows and Sales Amount to the columns. Rename the sheet “Products”.

We are now ready to create our first dashboard! Rearrange the charts on the dashboard so that it looks similar to the example below. To display the images, drag the Web Page object next to the Products grid.

Assembling our dashboard.

Additional Actions in Tableau

Now, we’re going to add some actions on the dashboard such that, when we click on a country, we’ll see both the categories of products and a list of individual products sold.

1. Go to Dashboard > Actions.

2. Add Action > Filter.

3. Our “Sales by Country” chart is going to filter Sales by Category and Products.

4. Add a second action. Sales by Category will filter Products.

5. Add a third action, this time selecting URL.

6. Select Products as the source sheet, choose <Image> as the URL, and click Test Link to verify that the image’s URL works.

What we have now is an interactive dashboard with a worldwide sales view. To analyze a specific country, we click on the corresponding bubble on the map and Sales by Category will be filtered to what was sold in that country.

When we select a category, we can see the list of products sold for that category. And, when we hover on a product, we can see an image of it.

In just a few steps, we have created a simple dashboard from which any head of department would benefit.

The final product.

Dashboards in Tableau at General Assembly

In GA’s Data Analytics course, students get hands-on training with the versatile Tableau platform. Create dashboards to solve real-world problems in 1-week accelerated or 10-week part-time course formats — on campus and online. You can also get a taste in our interactive classes and workshops.


Meet Our Expert

Samanta Dal Pont is a business intelligence and data analytics expert in retail, eCommerce, and online media. With an educational background in software engineering and statistics, her great passion is transforming businesses to make the most of their data. Responsible for analytics, reporting, and visualization in a global organization, Samanta has been an instructor for Data Analytics courses and SQL bootcamps at General Assembly London since 2016.

Samanta Dal Pont, Data Analytics Instructor, General Assembly London

Data Storytelling: 3 Objectives to Accomplish With Visualizations

By Alissa Livingston

Storytelling is as old as humanity itself. The words, the cadence, the visuals — whether seen with our eyes or in our minds — quickly capture us, engaging every area of our brains as we listen intently, anticipate logically, and become entangled emotionally. From the podcasts we listen to during our morning commute, to office gossip, to Thursday night Must See TV, we are awash in stories. And we love them so much that we heap money and accolades upon our best storytellers, from singers, authors, and movie directors to the social media stars sharing the visual story of their lives. So, it should be no surprise that we desire a good story with our data as well.

What Is Data Storytelling?

Data is a snapshot of measurable information that details what has happened at some point in time. Examples of data in a business context may include measurable events, such as amount of sales achieved, the number of social media impressions captured, or the duration of rental bike rides during weekdays and weekends. Data storytelling allows you, the business professional, to explain why these events have occurred, what may happen next, and what business decisions can be made with this newly acquired knowledge. Effective data storytelling, in the form of presentations that combine context, details, and visual illustrations, accomplishes three main objectives:

  • Builds a credible narrative around an analysis of data.
  • Connects a series of insights with a smooth flow of information.
  • Concludes the narrative with a compelling call to action.

Data Storytelling in Business: Inside Luxury Retail

Being skilled at data storytelling is critical in every business, and I’ve seen it hard at work specifically in the world of luxury retail, the industry I’ve been lucky enough to analyze and explore. Data is created with every physical interaction in a store and every click on a screen, and decisions must keep pace with this constant stream of information. On top of that, almost every retail initiative requires collaboration between many people and teams, including finance, merchandising, marketing, stores, visual, and logistics, to name a few. By presenting a clear flow of analysis that all teams and stakeholders can follow, with an actionable conclusion, I’m able to motivate my team and drive results.

By completing an analysis of style selling by price point, I may discover that my customers buy more handbags that cost $500 than handbags that cost $1,000. This might be the opposite of what my buyers and sales associates expected. Based on this price-sensitivity insight, my new strategy might be to increase the number of handbags priced below $500 and decrease the number of handbags priced above $1,000. To get my team to support this change, I frame my data story around what it means for them: more impactful visual displays for the buyers and more satisfying store experiences for the sales associates. By presenting the data with my audience in mind, I can communicate effectively with all team members, especially those who may not be motivated by numbers or percentages but instead understand a story about meeting customer needs and delivering high-quality service.

Data storytelling is utilized across many industries and topics, borrowing from many story arcs that we already know and love. In the TED Talk “The Math Behind Basketball’s Wildest Moves,” Rajiv Maheswaran employs the excitement and pace of a basketball game highlight reel to tell the data story of movement in our everyday lives. He grabs us from the start with high-tech visuals that turn pro-basketball plays into “moving” data “dots.” From these data points, he decodes the patterns found in game-winning plays with the excitement of a sportscaster.
By the end of the 12-minute presentation, he has not only convinced us of the importance of tracking movement to create more basketball wins, but that our movements — as regular people at work, at home, and beyond — can generate insights that create more wins for us in the game of life.

Data Storytelling at General Assembly

In General Assembly’s data-focused courses, students practice converting analysis results into compelling stories that drive business solutions using real-world data sets. In our part-time Data Analytics course, for example, students analyze open data from companies like Mozilla and Airbnb and use one of several storyboard frameworks to guide the arc of their data story. Students can also dive into the essentials of data storytelling in a self-paced, mentor-guided Data Analysis course, as well as part- and full-time Data Science programs.

GA’s project-based approach provides three key benefits:

  • Each presentation gives students an opportunity to test different storytelling frameworks, helping them learn what works best for different data situations, as well as what fits their personal style.
  • Hearing other students present from the same data set allows students to see how different approaches lead to different insights and different levels of effectiveness.
  • Receiving immediate presentation feedback from their instructor, instructional associate, and peers allows students to greatly improve their presentation skills within a short amount of time.

Meet Our Expert

Alissa Livingston is a financial planning manager at Saks Fifth Avenue. When she’s not traveling to Paris or Milan to negotiate with the luxury brands we all covet, she’s training for half marathons, rain or shine. Alissa received a bachelor’s degree in mechanical engineering from Northwestern University in Chicago and an MBA from Columbia Business School in the City of New York. She’s an instructor for GA’s part-time Data Analytics course and weekend SQL Bootcamp.

Alissa Livingston, Data Analytics Instructor, General Assembly New York

Databases: The Fundamentals of Organizing and Linking Data

By Gus Lopez

All around the globe, people are constantly tweeting, Googling, booking airline tickets, and banking online, among hundreds of other everyday internet activities. Each of these actions creates pieces of data — and all of these have to live somewhere. That’s where databases — put simply, a collection of data — come in.

Let’s look at LinkedIn as an example of how databases are used. When you first sign up for an account, you create a username and password. These are typically stored in some sort of database — usually one that’s encrypted in order to protect users’ privacy.

Once you’ve created an account, you can start updating your profile, sharing links to articles, and commenting on connections’ posts.

Here’s what happens when you interact with LinkedIn.

These links and comments eventually end up in a database. The main idea is that anyone with the proper permissions can then manage (search, see, comment, share, or like) these elements. All of these actions are usually performed by a piece of software that manages the database.

How Does a Database Work?

Let’s start by focusing on the first part of the word “database”: data. “Data” refers to some unstructured collection of known information.

For example, take a LinkedIn user named Joe whose email address is joe@someemail.com. Right now, we know two things about him: his name and his email address. These are two pieces of data.

Next, we need to organize related pieces of data. This is usually done through a structured format, such as a table. A table is composed of columns (also known as fields) and rows (also referred to as records).

Below, we see that our Joe data are now organized in a table called “Person”. Here, we have a record of Joe’s information: His name is in one field, his email is in another field, and we assign Joe a number (in a third field) for easy reference.

Person
person_number | first_name | email
1             | Joe        | joe@someemail.com

As you might expect, in any database there can be many tables — one per related data collection. Simplifying our LinkedIn example, we might have a “Person” table, an “Education” table, and a “Comment” table as we collect more data points about a user and their activities.

Now, these tables can (optionally) be linked together to form some sort of relationship between them. For example, Joe may have listed the schools he attended, which could be represented by a relationship between the “Person” and “Education” tables. Thanks to this relationship, we know which schools in the “Education” table are Joe’s.

Usually, this step is when pieces of structured and related data are translated into information.
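As a concrete sketch of these ideas (tables, records, and a relationship between them), here is the LinkedIn-style example expressed in SQL, run through Python’s built-in sqlite3 module; the school names are invented for the example:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = conn.cursor()

    # One table per related collection of data.
    cur.execute("CREATE TABLE person (person_number INTEGER PRIMARY KEY, first_name TEXT, email TEXT)")
    cur.execute("CREATE TABLE education (id INTEGER PRIMARY KEY, person_number INTEGER, school TEXT)")

    # Joe's record, plus two schools linked to him through person_number.
    cur.execute("INSERT INTO person VALUES (1, 'Joe', 'joe@someemail.com')")
    cur.executemany("INSERT INTO education (person_number, school) VALUES (?, ?)",
                    [(1, 'State University'), (1, 'Community College')])

    # The relationship lets us ask which schools in the education table are Joe's.
    cur.execute("""
        SELECT p.first_name, e.school
        FROM person AS p
        JOIN education AS e ON e.person_number = p.person_number
    """)
    print(cur.fetchall())  # [('Joe', 'State University'), ('Joe', 'Community College')]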

Any organization can have multiple databases — one for sales information, one for payroll information, and so on. To maintain these, they often turn to a type of software known as a database management system, or DBMS. There are many types of DBMS to choose from, including Oracle, Microsoft SQL Server, MySQL, and Postgres.

The database itself is housed in a piece of hardware — a physical machine that either resides on a company’s premises or is rented offsite through providers like Amazon Web Services, Google Cloud Platform, or Microsoft Azure.

Last but not least, the data contained in the database needs to be accessible through some sort of admin tool or programming language. Analysts typically use a set of digital tools — including Microsoft Excel, IBM Cognos Analytics, pgAdmin, the R language, and Tableau — to examine this data for patterns and trends.

Data analysts can then use these patterns and trends to make informed decisions.

For example, if you’re a data analyst at a large company, you may be tasked with helping management determine a price for a new product. One approach you could take is looking at how much the product costs to produce — how much of people’s time and effort, as well as machinery, is needed to make and maintain the product. Let’s say you do that by analyzing the data sets of payroll and procurement and come up with a cost of $30. Then you’ll look at how much customers are willing to pay, and perhaps another data set can inform you that similar companies have charged up to $50 for a similar product.

But you can also see that the price might have a seasonal trend, meaning people buy more of this product in, say, December than during the rest of the year. A data analyst could use any of the above-mentioned data tools to visualize these three data sets — production cost, competitors’ prices, and seasonal purchasing trends — and recommend that the best price for the new product is $40.

When and Why Are Databases Used?

There are different types of databases and management solutions for different types of problems. Here are just a few reasons why you may need a database and what solutions you may choose in each situation:

  • Storing, processing, and searching large amounts of information: If you’re working for a company like Facebook that manages half a million comments every minute, a database could be used as a place to source reporting/analytics or run machine learning algorithms. The solution may be some sort of distributed data storage and processing framework, like Apache Hadoop or Spark.
  • Building a mobile app: If you’re creating an app, you’ll want to choose a database that is small and efficient for mobile devices, like SQLite or Couchbase.
  • Working at a startup or medium-sized business: If you’re on a tight budget or want a database that’s widely documented and used, then look to open-source database management systems like MySQL, PostgreSQL, MongoDB, or MariaDB.

Databases at General Assembly

At General Assembly, we empower students with the data tools and techniques needed to use data to answer business questions. In our full- and part-time data courses, students use databases to perform data analysis using real-world data.

In our part-time on-campus Data Analytics course or online Data Analysis program, students learn the fundamentals of data analysis and leverage data tools like Excel, PostgreSQL, pgAdmin, SQL, and Tableau. In our part-time Data Science course, students discover different types of databases, learn how to pull data from them, and more. In the career-changing Data Science Immersive, they gather, store, and organize data with SQL, Git, and UNIX while learning the skills to launch a career in data.


Meet Our Expert

Gus Lopez is a tech lead with more than 20 years of experience delivering IT projects around the world, including for many Fortune 500 companies. He teaches the part-time Data Analytics course at General Assembly’s Melbourne campus. His achievements include consistently delivering technically challenging back-end, front-end, and data science projects, as well as managing multidisciplinary teams. Gus has a master’s degree in computer science from Rice University, an MBA from Melbourne Business School, and a Ph.D. with summa cum laude distinction in data science from Universidad Central de Venezuela. He is passionate about analyzing and searching for insights from data to improve processes and create competitive advantage for organizations.

Gus Lopez, Data Analytics Instructor, General Assembly Melbourne

Excel: Building the Foundation for Understanding Data Analytics

By Mathu A. Kumarasamy

If learning data analytics is like trying to ride a bike, then learning Excel is like having a good set of training wheels. Although some people may want to jump right ahead without them, they’ll end up with fewer bruises and a smoother journey if they begin practicing with them on. Indeed, Excel provides an excellent foundation for understanding data analytics.

What exactly is data analytics? It’s more than just simply “crunching numbers,” for one. Data analytics is the art of analyzing and communicating insights from data in order to influence decision-making.

In the age of increasingly sophisticated analytical tools like Python and R, some seasoned analytics professionals may scoff at Excel, first released by Microsoft in 1987, as nothing more than simple spreadsheet software. But most people only touch the tip of the iceberg when it comes to fully leveraging this ubiquitous program’s power as a stepping stone into analytics.

Using Excel for Data Analysis: Management, Cleaning, Aggregation, and More

I refer to Excel as the gateway into analytics. Once you’ve learned the platform inside and out, throughout your data analytics journey you’ll continually say to yourself, “I used to do this in Excel. How do I do it in X or Y?” In today’s digital age, it may seem like there are new analytical tools and software packages coming out every day. As a result, many roles in data analytics today require an understanding of how to leverage and continuously learn multiple tools and packages across various platforms. Thankfully, learning Excel and its fundamentals will provide a strong bedrock of knowledge that you’ll find yourself frequently referring back to when learning newer, more sophisticated programs.

Excel is a robust tool that provides foundational knowledge for performing tasks such as:

  • Database management. Understanding the architecture of any data set is one of the first steps of the data analytics workflow. In Excel, each worksheet can be thought of as a table in a database. Each row in a worksheet can then be considered a record, while each column can be considered an attribute. As you continue to work with multiple worksheets and tables in Excel, you’ll learn that functions such as “VLOOKUP” and “INDEX MATCH” are similar to the “JOIN” clauses seen in SQL.
  • Data cleaning. Cleaning data is often one of the most crucial and time-intensive components of the data analytics workflow. Excel can be used to clean a data set using various string functions such as “TRIM”, “MID”, or “SUBSTITUTE”. Many of these functions cut across various programs and will look familiar when you learn similar functions in SQL and Tableau.
  • Data aggregation. Once the data’s been cleaned, you’ll need to summarize and compile it. Excel’s aggregation functions such as “COUNT”, “SUM”, “MIN”, or “MAX” can be used to summarize the data. Furthermore, Excel’s Pivot Tables can be leveraged to aggregate and filter data quickly and efficiently. As you continue to manipulate and aggregate data, you’ll begin to understand the underlying SQL queries behind each Pivot Table.
  • Statistics. Descriptive statistics and inferential statistics can be applied through Excel’s functions and add-ons to better understand our data. Descriptive statistics such as the “AVERAGE”, “MEDIAN”, or “STDEV” functions tell us about the central tendency and variability of our data. Additionally, inferential statistics such as correlation and regression can help to identify meaningful patterns in the data which can be further analyzed to make predictions and forecasts.
  • Dashboarding and visualization. One of the final steps of the data analytics workflow involves telling a story with your data. The combination of Excel’s Pivot Tables, Pivot Charts, and slicers offers the underlying tools and flexibility to construct dynamic dashboards with visualizations that convey your story to your audience. As you build dashboards in Excel, you’ll begin to see that the Pivot Table fields in Excel are the common denominator in almost any visualization software and are no different than the “Shelves” used in Tableau to create visualizations.

If you want to jump into Excel but don’t have a data set to work with, why not analyze your own personal data? You could leverage Excel to keep track of your monthly budget and create a dashboard to see what your spending trends look like over time. Or if you have a fitness tracker, you could export the data from the device and create a dashboard to show your progress over time and identify any trends or areas for improvement. The best way to jump into Excel is to use data that’s personal and relevant — so your own health or finances can be a great start.

Excel at General Assembly

In GA’s part-time Data Analytics course and online Data Analysis course, Excel is the starting point for leveraging other analytical tools such as SQL and Tableau. Throughout the course, you’ll continually have “data déjà vu” as you tell yourself, “Oh, this looks familiar.” Students come to understand why Excel is considered a jack-of-all-trades: it provides a great foundation in database management, statistics, and dashboard creation. However, as the saying goes, “A jack-of-all-trades is a master of none.” As such, students will also recognize the limitations of Excel and the point at which tools like SQL and Tableau offer greater functionality.

At GA, we use Excel to clean and analyze data from sources like the U.S. Census and Airbnb to formulate data-driven business decisions. During final capstone projects, students are encouraged to use data from their own line of work to leverage the skills they’ve learned. We partner with students to ensure that they are able to connect the dots along the way and “excel” in their data analytics journey.

Having a foundation in Excel will also benefit students in GA’s full-time Data Science Immersive program as they learn to leverage Python, machine learning, visualizations, and beyond, and those in our part-time Data Science course, who learn skills like statistics, data modeling, and natural language processing. GA also offers day-long Excel bootcamps across our campuses, during which students learn how to simplify complex tasks including math functions, data organization, formatting, and more.


Meet Our Expert

Mathu A. Kumarasamy is a self-proclaimed analytics evangelist and aspiring data scientist. A believer in the saying that “data is the new oil,” Mathu leverages analytics to find, extract, refine, and distribute data in order to help clients make confident, evidence-based decisions. He is especially passionate about leveraging data analytics, technology, and insights from the field of behavioral economics to help establish a culture of evidence-based, value-driven health care in the United States. Mathu enjoys converting others into analytics geeks while teaching General Assembly’s part-time Data Analytics course in Atlanta.

Mathu A. Kumarasamy, Data Analytics Instructor, GA Atlanta

How Ridgeline Plots Visualize Data and Present Actionable Decisions

By Josh Yazman

Organizations conduct survey research for any number of reasons: to decide which products to devote resources to, determine customer satisfaction, figure out who our next president will be, or determine which Game of Thrones characters are most attractive. But almost all surveys are conducted with samples of the target population and therefore are subject to sampling error.

Decision-makers need to understand this error to make the most of survey results, so it’s important for data scientists and analysts to communicate confidence intervals when visualizing estimated results. Confidence intervals are the range of values you could reasonably expect to see in your target population based on the results measured in your sample.

But traditional visuals (error bars) can lead to misperceptions, too. In situations where confidence intervals overlap by only a small amount, we know there is really only a small chance that the two values are equal — but overlapping error bars on a chart still signal danger. Ridgeline plots, which are essentially a series of density plots (or smoothed-out histograms), can help balance the need to communicate risk without overemphasizing error in situations where error bars only slightly overlap. Instead of showing an error bar, which has the same visual weight along its entire length, a ridgeline plot gets fatter to represent more likely values and thinner to represent less likely values. This way, a small amount of overlap doesn’t signal a lack of statistical significance quite as loudly.

Calculating Confidence Intervals: Planning a Class

Consider, for example, an education startup that conducted a survey of 500 people on its email list to determine which of three classes respondents might want to enroll in. (For demonstration purposes, we’re assuming this is a random sample that’s representative of the target audience.) The options are Hackysack Maintenance, Underwater Basketweaving, and Finger Painting. Results are reported below:

Class                    | Result (%)
Hackysack Maintenance    | 24
Underwater Basketweaving | 44
Finger Painting          | 32

We could produce a bar plot of this result that makes Underwater Basketweaving appear to be the clear-cut winner.

[Image: bar plot of the survey results]

But since this data comes from a representative sample, there is some margin of error for each of these point estimates. This post won’t go into calculating these confidence intervals except to say that we used the normal approximation method to calculate binomial confidence intervals for each of the three survey results at a 99.7% confidence level. Now our results look more like this:

Class                    | Result (%) | Lower Conf. Int. (%) | Upper Conf. Int. (%)
Hackysack Maintenance    | 24         | 18                   | 30
Underwater Basketweaving | 44         | 37                   | 51
Finger Painting          | 32         | 26                   | 38
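For reference, here is a small Python sketch of that normal-approximation calculation; run on the survey results above, it reproduces the intervals in the table to the nearest percentage point (z is set to 3.0 to approximate the 99.7% confidence level):

    from math import sqrt

    n = 500   # survey sample size
    z = 3.0   # z-score for a ~99.7% confidence level
    results = {"Hackysack Maintenance": 0.24,
               "Underwater Basketweaving": 0.44,
               "Finger Painting": 0.32}

    for course, p in results.items():
        margin = z * sqrt(p * (1 - p) / n)  # normal approximation to the binomial
        print(f"{course}: {p:.0%} ({p - margin:.0%} to {p + margin:.0%})")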

One common way to present these confidence intervals is by adding error bars to the plot. When we add these error bars, our plot looks like this:

[Image: bar plot with overlapping error bars]

Unfortunately, our error bars are now overlapping between Finger Painting and Underwater Basketweaving. This means there is some chance that the two courses are equally desirable — or that Finger Painting is actually the most desirable course of all! Decision-makers no longer have a clear-cut investment since the top two responses could be tied.

However, those error bars barely overlap. There’s a strong probability that Underwater Basketweaving really is the winner. The problem with this method of plotting error bars is that the visual treats every part of our confidence interval as equally likely, rather than as the bell curve it should look like.

Enter the ridgeline plot.

What Is a Ridgeline Plot?

Ridgeline plots essentially stack density plots for multiple categorical variables on top of one another. Claus Wilke created ridgeline plots — originally named joy plots — in the summer of 2017, and the visual has rapidly gained popularity among users of the R programming language. They’ve been used to show the changing polarization of political parties, salary distributions, and patterns of breaking news.

By using a ridgeline plot rather than a bar plot, we can present our confidence intervals as the bell curves they are, rather than a flat line. Instead of a bar that implies a clear winner and some error bars that contradict that narrative, the ridgeline plot demonstrates that, indeed, the bulk of possible values for each class are basically different from one another. In the process, the ridgeline plot downplays the small amount of overlap between Finger Painting and Underwater Basketweaving.

[Image: ridgeline plot of the three classes]

By plotting only the confidence intervals in the form of individual density plots, the ridgeline plot demonstrates the small amount of risk that students really prefer a class on Finger Painting without overemphasizing the magnitude of that risk. Our education startup can invest in curriculum development and promotion of the Underwater Basketweaving class with a strong degree of confidence that most of its potential students would be most interested in such a class.
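For readers who want to reproduce something like this, here is a minimal matplotlib sketch of a ridgeline chart built from the three sampling distributions above. It approximates each confidence distribution as a normal curve using the survey’s sample size and percentages, rather than using the R-based tools mentioned earlier:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    n = 500
    results = {"Hackysack Maintenance": 0.24,
               "Finger Painting": 0.32,
               "Underwater Basketweaving": 0.44}

    x = np.linspace(0.1, 0.6, 400)
    fig, ax = plt.subplots()

    # Stack one density curve per class, offset vertically like a ridgeline.
    for offset, (course, p) in enumerate(results.items()):
        se = np.sqrt(p * (1 - p) / n)            # standard error of the proportion
        density = norm.pdf(x, loc=p, scale=se)
        density = 0.9 * density / density.max()  # scale each ridge to fit its row
        ax.fill_between(x, offset, offset + density, alpha=0.7)
        ax.text(x[0], offset + 0.1, course)

    ax.set_xlabel("Share of respondents preferring each class")
    ax.set_yticks([])
    plt.show()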

Ridgeline Plots at General Assembly

In General Assembly’s full-time, career-changing Data Science Immersive program and part-time Data Science course, students learn about sampling, calculating confidence intervals, and using data visualizations to help make actionable decisions with data. Students can also learn about the programming language R and other key data skills through expert-led workshops and exclusive industry events across GA’s campuses.


Meet Our Expert

Josh Yazman is a General Assembly Data Analytics alum and a data analyst with expertise in media analytics, survey research, and civic engagement. He now teaches GA’s part-time Data Analytics course in Washington, D.C. Josh spent five years working in Virginia for political candidates at all levels of government, from Blacksburg town council to president. Today, he is a data analyst with a national current-affairs magazine in Washington, D.C., a student at Northwestern University pursuing a master’s degree in predictive analytics, and the advocacy chair for the National Capital Area chapter of the Pancreatic Cancer Action Network. He occasionally writes about political and sports data on Medium and tweets at @jyazman2012.

“Data science as a field is in demand today — but the decision-making and problem-solving skills you’ll learn from studying it are broadly applicable and valuable in any field or industry.”

– Josh Yazman, Data Analytics Instructor, General Assembly Washington, D.C.

How Predictive Modeling Forecasts the Future and Influences Change

By Amer Tadmori

You know the scenario: You get to work in the morning and quickly check your personal email. Over on the side, you notice that your spam folder has a couple of items in it, so you look inside. You’re amazed — although some of them look like genuine emails, they’re not; these cleverly disguised ads are all correctly labeled as spam. What you’re seeing is natural language processing (NLP) in action. In this instance, the email service provider is using what’s known as predictive analytics to assess language data and determine which combinations of words are likely spam, filtering your email accordingly.

With the volume of data being created, collected, and stored increasing by the hour, the days of making decisions based solely on intuition are numbered. Companies collect data on their customers, nonprofits collect data on their donors, apps collect data on their users, all with the goal of finding opportunities to improve their products and services. More and more, decision-making is becoming data driven. People use information to understand what’s happening in the world around them and try to predict what will happen in the future. For this, we turn to predictive analytics.

Predictive analytics is the practice of using current information to forecast what will happen next. This area of study covers a broad range of concepts and skills — oftentimes involving modeling techniques — that help turn data into insights and insights into action. These ideas are already in practice in industries like eCommerce, direct marketing, cybersecurity, financial services, and more. It’s likely that you’ve come across implementations of predictive analytics and modeling in your daily life without even realizing it.

Predictive Modeling in the Real World

Returning to our example, say that an email in your inbox reminds you that you wanted to buy a new whisk to make scrambled eggs this weekend. When you head to Amazon.com to make a purchase, you see some recommendations for items you might like on the home page. This component is what’s known in the data science world as a recommender system.

What Amazon’s recommender system thinks your kitchen is missing.

To develop this, Amazon uses its vast data sets that detail what people are buying. Then, a machine learning engineer may use Python or R to pass this data through a k-means clustering algorithm. This organizes items into groups that are purchased together and allows Amazon to compare the results with what you’ve already bought to come up with recommendations. With this implementation, Amazon is looking at a combination of what you and others have purchased and/or viewed (current information) and using predictive modeling to anticipate what else you might like based on that data. This is a tremendously powerful tool! It helps users find what they want faster and discover new ideas, while also boosting Amazon’s sales by shortening the path to purchase.
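Amazon’s production systems are far more sophisticated (and not public), but the basic idea described above, clustering items by who buys them and then recommending from the same cluster, can be sketched in a few lines of Python with scikit-learn. The tiny purchase matrix here is entirely made up:

    import numpy as np
    from sklearn.cluster import KMeans

    # Rows = items, columns = shoppers; 1 means that shopper bought that item.
    items = ["whisk", "mixing bowl", "spatula", "phone case", "screen protector"]
    purchases = np.array([
        [1, 1, 0, 0, 0],   # whisk
        [1, 1, 1, 0, 0],   # mixing bowl
        [0, 1, 1, 0, 0],   # spatula
        [0, 0, 0, 1, 1],   # phone case
        [0, 0, 0, 1, 1],   # screen protector
    ])

    # Group items that tend to be bought by the same shoppers.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(purchases)

    # Recommend other items from the cluster containing something you already bought.
    bought = "whisk"
    cluster = labels[items.index(bought)]
    print([item for item, lab in zip(items, labels) if lab == cluster and item != bought])
    # -> ['mixing bowl', 'spatula']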

Say that, around lunch time, you decide to order pizza delivery — 20 minutes later, there it is. Wow! How did it get there so fast? Using another predictive analysis technique called clustering, the restaurant has analyzed where its orders are coming from and grouped them accordingly. For this project, a data analyst might have run a SQL query to find out which deliveries would take the longest. The analyst might then use a nearest neighbors algorithm in Python to find the optimal groupings and recommend placements for new restaurant locations at cross streets to minimize the distance to the orders.

Clustering for optimal pizza delivery.

Here, predictive modeling not only saves the company money on driving time and gas but also cuts down the time between the customer and a hot pizza.
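A minimal sketch of that nearest neighbors step might look like the following, which assigns each hypothetical order to its closest restaurant location. The coordinates are made up, and straight-line distance in degrees is only a rough proxy; a real analysis would use road networks or at least geodesic distances.

```python
# A minimal sketch: assign each delivery order to the nearest existing restaurant.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical (latitude, longitude) pairs for orders and current restaurant locations.
orders = np.array([[34.05, -118.24], [34.10, -118.30], [33.99, -118.46], [34.07, -118.26]])
stores = np.array([[34.06, -118.25], [34.00, -118.40]])

nn = NearestNeighbors(n_neighbors=1).fit(stores)
distances, indices = nn.kneighbors(orders)  # closest store and distance for each order

for order, dist, idx in zip(orders, distances.ravel(), indices.ravel()):
    print(f"Order at {order} -> store {idx}, distance {dist:.3f} degrees")
```

Orders that sit far from every existing store would then point to where a new location could shorten delivery times.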

Predictive Modeling at General Assembly

Regardless of the industry, there’s growing opportunity to leverage predictive modeling to solve problems of all sizes. This is rapidly becoming a must-have skill, which is why we teach these techniques and more in our part-time and full-time data science courses at General Assembly. Starting with simple analyses like linear regression and classification, students use tools like Python and SQL to work with real-world data, building the necessary skills to move on to more involved analyses like time series, clustering, and recommender systems. This gives them the toolbox they need to make data-driven decisions that influence change in the business, government, and nonprofit sectors — and beyond.


Meet Our Expert

Amer Tadmori is a senior statistician at Wiland, where he uses data science to provide business intelligence and data-driven marketing solutions to clients. His passion for turning complex topics into easy-to-understand concepts is what led him to begin teaching. At GA’s Denver campus, Amer leads courses in SQL, data analytics, data visualization, and storytelling with data. He holds a bachelor’s degree in economics from Colgate University and a master’s degree in applied statistics from Colorado State University. In his free time, Amer loves hiking his way through the national parks and snowboarding down Colorado’s local hills.

“Now’s a great time to learn data analysis techniques. There’s an abundance of resources available to learn these skills, and an even greater abundance of places to use them.”

– Amer Tadmori, Data Analytics Instructor, General Assembly Denver

SQL: Using Data Science to Boost Business and Increase Efficiency

By

In today’s digital age, we’re constantly bombarded with information about new apps, transformative technologies, and the latest and greatest artificial intelligence systems. While these technologies may serve very different purposes in our lives, they all share one thing in common: They rely on data. More specifically, they all use databases to capture, store, retrieve, and aggregate data. This raises the question: How do we actually interact with databases to accomplish all of this? The answer: We use Structured Query Language, or SQL (pronounced “sequel” or “ess-que-el”).

Put simply, SQL is the language of data — it’s a programming language that enables us to efficiently create, alter, request, and aggregate data from those mysterious things called databases. It gives us the ability to make connections between different pieces of information, even when we’re dealing with huge data sets. Modern applications are able to use SQL to deliver really valuable pieces of information that would otherwise be difficult for humans to keep track of independently. In fact, pretty much every app that stores any sort of information uses a database. This ubiquity means that developers use SQL to log, record, alter, and present data within the application, while analysts use SQL to interrogate that same data set in order to find deeper insights.

Finding SQL in Everyday Life

Think about the last time you looked up the name of a movie on IMDB. I’ll bet you quickly noticed an actress on the cast list and thought something like, “I didn’t realize she was in that,” then clicked a link to read her bio. As you were navigating through that app, SQL was responsible for returning the information you “requested” each time you clicked a link. This sort of capability is something we’ve come to take for granted these days.

Let’s look at another example that truly is cutting-edge, this time at the intersection of local government and small business. Many metropolitan cities are supporting open data initiatives, making public data easily accessible by opening up the databases that store it. Consider, for example, Los Angeles building permit data, business listings, and census data.

Imagine you work at a real estate investment firm and are trying to find the next up-and-coming neighborhood. You could use SQL to combine the permit, business, and census data in order to identify areas that are undergoing a lot of construction, have high populations, and contain a relatively low number of businesses. This might be a great opportunity to purchase property in a soon-to-be thriving neighborhood! For the first time in history, it’s easy for a small business to leverage quantitative data from the government in order to make a highly informed business decision.
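As a hedged sketch of what such a query could look like, the example below builds three heavily simplified tables in an in-memory SQLite database and joins them by ZIP code. The table names, columns, and sample rows are hypothetical; the real Los Angeles open-data schemas are much richer.

```python
# A minimal sketch: join simplified permit, business, and census tables by ZIP code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE permits  (zip TEXT, permit_id INTEGER);
CREATE TABLE business (zip TEXT, business_id INTEGER);
CREATE TABLE census   (zip TEXT, population INTEGER);
INSERT INTO permits  VALUES ('90012', 1), ('90012', 2), ('90066', 3);
INSERT INTO business VALUES ('90012', 10), ('90066', 20), ('90066', 21);
INSERT INTO census   VALUES ('90012', 40000), ('90066', 55000);
""")

# Surface ZIP codes with lots of construction and people but relatively few businesses.
query = """
SELECT c.zip,
       COUNT(DISTINCT p.permit_id)   AS permits,
       COUNT(DISTINCT b.business_id) AS businesses,
       c.population
FROM census c
LEFT JOIN permits  p ON p.zip = c.zip
LEFT JOIN business b ON b.zip = c.zip
GROUP BY c.zip, c.population
ORDER BY permits DESC, businesses ASC, c.population DESC;
"""
for row in conn.execute(query):
    print(row)
```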

Leveraging SQL to Boost Your Business and Career

There are many ways to harness SQL’s power to supercharge your business and career, in marketing and sales roles, and beyond. Here are just a few:

  • Increase sales: A sales manager could use SQL to compare the performance of various lead-generation programs and double down on those that are working.
  • Track ads: A marketing manager responsible for understanding the efficacy of an ad campaign could use SQL to compare the increase in sales before and after running the ad (see the sketch after this list).
  • Streamline processes: A business manager could use SQL to compare the resources used by various departments in order to determine which are operating efficiently.
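For the ad-tracking case referenced above, a minimal sketch might total sales before and after a campaign launch date. The table, dates, and amounts below are made up purely for illustration.

```python
# A minimal sketch: compare total sales before and after a hypothetical ad launch date.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (order_date TEXT, amount REAL);
INSERT INTO sales VALUES
  ('2019-05-20', 120.0), ('2019-05-28', 80.0),
  ('2019-06-03', 150.0), ('2019-06-10', 200.0);
""")

query = """
SELECT CASE WHEN order_date < '2019-06-01' THEN 'before' ELSE 'after' END AS period,
       SUM(amount) AS total_sales
FROM sales
GROUP BY period;
"""
for period, total in conn.execute(query):
    print(period, total)
```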

SQL at General Assembly

At General Assembly, we know businesses are striving to transform their data from raw facts into actionable insights. The primary goal of our data analytics curriculum, from workshops to full-time courses, is to empower people to access this data in order to answer their own business questions in ways that were never possible before.

To accomplish this, we give students the opportunity to use SQL to explore real-world data such as Firefox usage statistics, Iowa liquor sales, or Zillow’s real estate prices. Our full-time Data Science Immersive and part-time Data Analytics courses help students build the analytical skills needed to turn the results of those queries into clear and effective business recommendations. On a more introductory level, after just a couple of hours in one of our SQL workshops, students are able to query multiple data sets with millions of rows.


Meet Our Expert

Michael Larner is a passionate leader in the analytics space who specializes in using techniques like predictive modeling and machine learning to deliver data-driven impact. A Los Angeles native, he has spent the last decade consulting with hundreds of clients, including 50-plus Fortune 500 companies, to answer some of their most challenging business questions. Additionally, Michael empowers others to become successful analysts by leading trainings and workshops for corporate clients and universities, including General Assembly’s part-time Data Analytics course and SQL/Excel workshops in Los Angeles.

“In today’s fast-paced, technology-driven world, data has never been more accessible. That makes it the perfect time — and incredibly important — to be a great data analyst.”

– Michael Larner, Data Analytics Instructor, General Assembly Los Angeles

Using Apache Spark for High-Speed, Large-Scale Data Processing

By

Apache Spark is an open-source framework used for large-scale data processing. The framework is made up of many components, including four programming APIs and four major libraries. Since Spark’s release in 2014, it has become one of Apache’s fastest growing and most widely used projects of all time.

Spark uses an in-memory processing paradigm to speed up computation and run programs 10 to 100 times faster than other big data technologies like Hadoop MapReduce. According to the 2016 Apache Spark Survey, more than 900 companies, including IBM, Google, Netflix, Amazon, Microsoft, Intel, and Yahoo, use Spark in production for data processing and querying.

Apache Spark is important to the big data field because it represents the next generation of big data processing engines and is a natural successor to MapReduce. One of Spark’s advantages is that its use of four programming APIs — Scala, Python, R, and Java 8 — allows the user flexibility to work in the language of their choice, making the tool much more accessible to a wide range of programmers with different capabilities. Spark can also read many types of data from various locations, such as the Hadoop Distributed File System (HDFS), Amazon’s web-based Simple Storage Service (S3), or even the local filesystem.

Production-Ready and Scalable

Spark’s greatest advantage is that it maximizes the capabilities of data science’s most expensive resource: the data scientist. Computers and programs have become so fast that we are no longer limited by what they can do as much as we are limited by human productivity. Because Spark provides a flexible language platform with concise syntax, data scientists can write more programs, iterate on them, and run them much more quickly. The code is production-ready and scalable, so there’s no need to hand off code requirements to a development team for changes.

It takes only a few minutes to write a word-count program in Spark, but it would take much longer to write the same program in Java. Because the Spark code is so much shorter, there’s less of a need to debug or use version control tools.

Spark’s concise syntax can best be illustrated with the following examples. The Spark code is only four lines compared with almost 58 for Java.

Java vs. Spark
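To give a flavor of that brevity, here is a minimal PySpark word-count sketch. The file name input.txt is a placeholder, and this is an illustrative version rather than the exact comparison pictured above; the core transformation really is only a handful of lines.

```python
# A minimal PySpark word count, assuming a local text file named "input.txt".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word_count").getOrCreate()

lines = spark.sparkContext.textFile("input.txt")    # RDD of lines
counts = (lines.flatMap(lambda line: line.split())  # split lines into words
               .map(lambda word: (word, 1))         # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))    # sum the counts per word

print(counts.take(10))  # show a few (word, count) pairs
```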

Faster Processing

Spark utilizes in-memory processing to speed up applications. Older big data frameworks, such as Hadoop, use many intermediate disk reads and writes to accomplish the same task. For small jobs on several gigabytes of data, this difference is not as pronounced, but for machine learning applications and more complex tasks such as natural language processing, the difference can be tremendous. Logistic regression, a technique taught in all of General Assembly’s full- and part-time data science courses, can be sped up over 100x.

Spark has four key libraries that also make it much more accessible and provide a wider set of tools for people to use. Spark SQL is ideal for leveraging SQL skills or working with data frames; Spark Streaming has functions for data processing, useful if you need to process data in near real time; and GraphX has pre-written algorithms that are useful if you have graph data or need to do graph processing. The library most useful to students in our Data Science Immersive, though, is the Spark MLlib machine learning library, which has pre-written distributed machine learning algorithms for use on data frames.
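As a minimal sketch of what MLlib looks like in practice, the example below fits a logistic regression on a tiny hand-made DataFrame. The columns and values are hypothetical and meant only to show the DataFrame-based API, not a realistic workload.

```python
# A minimal Spark MLlib sketch: logistic regression on a tiny, made-up DataFrame.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib_logreg").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (8.0, 9.0, 1.0), (9.0, 8.0, 1.0)],
    ["x1", "x2", "label"],
)

# Assemble the feature columns into a single vector column, then fit the model.
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)

print(model.coefficients, model.intercept)
```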

Spark at General Assembly

At GA, we teach both the concepts and the tools of data science. Because hiring managers from marketing, technology, and biotech companies, as well as guest speakers like company founders and entrepreneurs, regularly talk about using Spark, we’ve incorporated it into the curriculum to ensure students are fluent in the field’s most relevant skills. I teach Spark as part of our Data Science Immersive (DSI) course in Boston, and I previously taught two Spark courses for Cloudera and IBM. Spark is a great tool to teach because the general curriculum focuses mostly on Python, and Spark has a Python API/library called PySpark.

When we teach Spark in DSI, we cover resilient distributed datasets (RDDs), directed acyclic graphs, closures, lazy execution, and reading JavaScript Object Notation (JSON), a common big data file format.


Meet Our Expert

Joseph Kambourakis has over 10 years of teaching experience and over five years of experience teaching data science and analytics. He has taught in more than a dozen countries and has been featured in Japanese and Saudi Arabian press. He holds a bachelor’s degree in electrical and computer engineering from Worcester Polytechnic Institute and an MBA with a focus in analytics from Bentley University. He is a passionate Arsenal FC supporter and competitive Magic: The Gathering player. He currently lives with his wife and daughter in Needham, Massachusetts.

“GA students come to class motivated to learn. Throughout the Data Science Immersive course, I keep them on their path by being patient and setting up ideas in a simple way, then letting them learn from hands-on lab work.”

– Joseph Kambourakis, Data Science Instructor, General Assembly Boston