data science Tag Archives - General Assembly Blog

Why You Should Consider a Career in Data Analytics

The world’s data reached an all-time high in 2021. 79 zettabytes of data – which is enough storage for 30 billion 4K movies – was generated last year alone.

This is a good thing – right? More data means more innovation, which means more advancements for society.

Beginner’s Python Cheat Sheet

Do you want to be a data scientist? Data science and machine learning are rapidly becoming vital disciplines for all types of businesses. The ability to extract insight and meaning from a large pile of data is a skill set worth its weight in gold. Due to its versatility and ease of use, Python has become the programming language of choice for data scientists.

In this Python crash course, we will walk you through a couple of examples using two of the most-used data types: the list and the Pandas DataFrame. The list is self-explanatory; it’s a one-dimensional, ordered collection of values. A Pandas DataFrame is just like a tabular spreadsheet: it has data laid out in columns and rows.

Let’s take a look at a few neat things we can do with lists and DataFrames in Python!

Lists

Creating Lists

Let’s start this Python tutorial by creating lists. Create an empty list and use a for loop to append new values. What you need to do is:

my_list = []
for x in range(1, 11):
    my_list.append(x + 2)

We can also do this in one step using list comprehension:

my_list = [x + 2 for x in range(1,11)]

Creating Lists with Conditionals

As above, we will create a list, but now we will only add 2 to the value if it is even.

# add two, but only if x is even
my_list = []
for x in range(1, 11):
    if x % 2 == 0:
        my_list.append(x + 2)
    else:
        my_list.append(x)

Using a list comp:

my_list = [x + 2 if x % 2 == 0 else x
           for x in range(1, 11)]
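
As a quick check, assuming Python 3, the loop and the list comprehension can be compared directly. The names loop_list and comp_list are introduced here only for illustration.

```python
# Build the list with an explicit loop.
loop_list = []
for x in range(1, 11):
    if x % 2 == 0:
        loop_list.append(x + 2)  # even values get 2 added
    else:
        loop_list.append(x)      # odd values are kept as-is

# Build the same list with a conditional list comprehension.
comp_list = [x + 2 if x % 2 == 0 else x for x in range(1, 11)]

print(comp_list)               # [1, 4, 3, 6, 5, 8, 7, 10, 9, 12]
print(loop_list == comp_list)  # True
```

Both versions produce the same values; the comprehension is simply more compact.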

Selecting Elements and Basic Stats

Select elements by index.

#get the first/last element
first_ele = my_list[0]
last_ele = my_list[-1]

Some basic stats on lists:

#get max/min/mean value
biggest_val = max(my_list)
smallest_val = min(my_list)
avg_val = sum(my_list) / len(my_list)

DataFrames

Reading in Data to a DataFrame

We first need to import the pandas module.

import pandas as pd

Then we can read in data from csv or xlsx files:

df = pd.read_csv('path/to/csv_file.csv',  # placeholder path
                 sep=',',
                 nrows=10)
xlsx = pd.ExcelFile('path/to/excel_file.xlsx')

Slicing DataFrames

We can slice our DataFrame using conditionals.

df_filter = df[df['population'] > 1000000]
df_france = df[df['country'] == 'France']

Sorting values by a column:

df.sort_values(by='population',
               ascending=False)

Filling Missing Values

Let’s fill in any missing values with that column’s average value.

df['population'] = df['population'].fillna(
    value=df['population'].mean()
)

Applying Functions to Columns

Apply a custom function to every value in one of the DataFrame’s columns.

def fix_zipcode(x):
    '''
    make sure that zipcodes all have leading zeros
    '''
    return str(x).zfill(5)

df['clean_zip'] = df['zip code'].apply(fix_zipcode)
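
To see what the function does without loading a DataFrame, here is a small sketch that applies it to a few made-up values. The sample ZIP codes and the names raw_zips and clean_zips are illustrative assumptions, not part of any dataset.

```python
def fix_zipcode(x):
    """Make sure that zipcodes all have leading zeros."""
    return str(x).zfill(5)

# Hypothetical raw values: ZIP codes stored as integers lose leading zeros.
raw_zips = [2134, 501, 90210, "7030"]
clean_zips = [fix_zipcode(z) for z in raw_zips]
print(clean_zips)  # ['02134', '00501', '90210', '07030']
```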

Ready to take on the world of machine learning and data science? Now that you know what you can do with lists and DataFrames in Python, check out our other Python beginner tutorials and learn about other important concepts of the Python programming language.

What is Data Science?

It’s been anointed “the sexiest job of the 21st century”, companies are rushing to invest billions of dollars into it, and it’s going to change the world — but what do people mean when they mention “data science”? There’s been a lot of hype about data science and deservedly so, but the excitement has helped obfuscate the fundamental identity of the field. Anyone looking to involve themselves in data science needs to understand what it actually is and is not.

In this article, we’ll lay out a deep definition of the field, complete descriptions of the data science workflow, and data science tasks used in the real world. We hope that any would-be entrants into this line of work will come away from this article with a nuanced understanding of data science that can help them decide whether to enter, and how to navigate, this exciting line of work.

So What Actually is Data Science?

A quick definition of data science might be articulated as an interdisciplinary field that primarily uses statistics and computer programming to derive insights from, and base decisions on, a collection of information represented as numerical figures. The “science” part in data science is quite apt because data science very much follows a scientific process that involves formulating a hypothesis and using a specific toolset to confirm or dispel that hypothesis. At the end of the day, data science is about turning a problem into a question and a question into an answer and/or solution.

Tackling the meaning of data science also means interrogating the meaning of data. Data can be easily described as “information encoded as numbers” but that doesn’t tell us why it’s important. The value of data stems from the notion that data is a tangible manifestation of the intangible. Data provides solid support to aid our interpretations of the world. For example, a weather app can tell you it’s cold outside, but telling you that the temperature is 38 degrees Fahrenheit provides you with a stronger and more specific understanding of the weather.

Data comes in two forms: qualitative and quantitative.

Qualitative data is categorical data that does not naturally come in the form of numbers, such as demographic labels that you can select on a census form to indicate gender, state, and ethnicity.

Quantitative data is numerical data that can be processed through mathematical functions; for example stock prices, sports stats, and biometric information.

Quantitative data can be subdivided into smaller categories such as ordinal, discrete, and continuous.

Ordinal: A sort of qualitative and quantitative hybrid variable in which the values have a hierarchical ranking. Any sort of star rating system of reviews is a perfect example of this; we know that a four-star review is greater than a three-star review, but can’t say for sure that a four-star review is twice as good as a two-star review.

Discrete: These are countable values that often appear in the form of integers. Examples include the number of franchises owned by a company and the number of votes cast in an election. It’s important to remember that counts like these take only whole-number values and can never be negative.

Continuous: Unlike discrete variables, continuous variables can appear in decimal form and can take any value within a range. Things like company profit, temperature, and weight can all be described as continuous.
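
One way to see why these distinctions matter is to look at which summaries make sense for each kind of variable. The sketch below uses only Python’s standard library on made-up values; every variable name and number in it is an assumption for illustration.

```python
from collections import Counter
from statistics import mean, median

# Qualitative (categorical): labels can be counted, but not averaged.
states = ["NY", "CA", "NY", "TX", "NY"]
most_common_state = Counter(states).most_common(1)[0][0]

# Ordinal: star ratings can be ranked, so the median is meaningful,
# but the gaps between ratings are not guaranteed to be equal.
star_ratings = [5, 3, 4, 2, 4]
median_rating = median(star_ratings)

# Discrete: whole-number counts that can be summed.
franchises_owned = [12, 7, 30]
total_franchises = sum(franchises_owned)

# Continuous: any value within a range, so the mean is meaningful.
temperatures_f = [38.0, 41.5, 36.2]
avg_temperature = mean(temperatures_f)

print(most_common_state, median_rating, total_franchises)
print(round(avg_temperature, 2))
```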

What Does Data Science Look Like?

Now that we’ve established a base understanding of data science, it’s time to delve into what data science actually looks like. To answer this question, we need to go over the data science workflow, which encapsulates what a data science project looks like from start to finish. We’ll touch on typical questions at the heart of data science projects and then examine an example data science workflow to see how data science was used to achieve success.

The Data Science Checklist

A good data science project is one that satisfies the following criteria:

Specificity: Derive a hypothesis and/or question that’s specific and to the point. Having a vague approach can often lead to a waste of time with no end product.

Attainability: Can your questions be answered? Do you have access to the required data? It’s easy to come up with an interesting question but if it can’t be answered then it has no value. The same goes for data, which is only useful if you can get your hands on it.

Measurability: Can what you’re applying data science to be quantified? Can the problem you’re addressing be represented in numerical form? Are there quantifiable benchmarks for success?

As previously mentioned, a core aspect of data science is the process of deriving a question, especially one that is specific and achievable. Typical data science questions ask things like “Does X predict Y?” and “What are the distinct groups in our data?” To get a sense of data science questions, let’s take a look at some business-world-appropriate ones:

• What is the likelihood that a customer will buy this product?
• Did we observe an increase in sales after implementing a new policy?
• Is this a good or bad review?
• How much demand will there be for my service tomorrow?
• Is this the cheapest way to deliver our goods?
• Is there a better way to segment our marketing strategies?
• What groups of products are customers purchasing together?
• Can we automate this simple yes/no decision?

All eight of these questions are excellent examples of how businesses use data science to advance themselves. Each question addresses a problem or issue in a way that can be answered using data science.

The Data Science Workflow

Once we’ve established our hypothesis and questions, we can move on to what I like to call the data science workflow, a step-by-step description of a typical data science project process.

After asking a question, the next steps are:

1. Get and Understand the Data. We obviously need to acquire data for our project, but sometimes that can be more difficult than expected if you need to scrape for it or if privacy issues are involved. Make sure you understand how the data was sampled and the population it represents. This will be crucial in the interpretation of your results.
2. Data Cleaning and Exploration. The dirty secret of data science is that data is often quite dirty, so you can expect to do significant cleaning, which often involves constructing your variables in a way that makes your project doable. Get to know your data through exploratory data analysis. Establish a base understanding of the patterns in your dataset through charts and graphs.
3. Modeling. This represents the main course of the data science process; it’s where you get to use the fancy, powerful tools. In this part, you build a model that can help you answer a question such as “Can we predict future sales of a product from our dataset?”
4. Presentation. Now it’s time to present the results of your findings. Did you confirm or dispel your hypothesis? What are the answers to the questions you started off with? How do your results advance our understanding of the issue at hand? Articulate your project in a clear and concise manner that makes it digestible for your audience, which could be another team in your company or your company’s executives.

Data Science Workflow Example: Predicting Neonatal Infection

Now let’s parse out an example of how data science can have a meaningful real-world impact, taken from the book Big Data: A Revolution That Will Transform How We Live, Work, and Think.

We start with a problem: Children born prematurely are at high risk of developing infections, many of which are not detected until after a child is sick.

Then we turn that problem into a question: Can we detect patterns in the data that accurately predict infection before it occurs?

Next, we gather relevant data: variables such as heart rate, respiration rate, blood pressure, and more.

Then we decide on the appropriate tool: a machine learning model that uses past data to predict future outcomes.

Finally, what impact do our methods have? The model is able to predict the onset of infection before symptoms appear, thus allowing doctors to administer treatment earlier in the infection process and increasing the chances of survival for patients.

This is a fantastic example of data science in action because every step in the process has a clear and easily understandable function towards a beneficial outcome.

Data scientists are basically Swiss Army knives, in that they possess a wide range of abilities — it’s why they’re so valuable. Let’s go over the specific tasks that data scientists typically perform on the job.

Data acquisition: For data scientists, this usually involves querying databases set up by their companies to provide easy access to reams of data. Data scientists frequently write SQL queries to retrieve data. Outside of querying databases, data scientists can use APIs or web scraping to acquire data.

Data cleaning: We touched on this before, but it can’t be emphasized enough that data cleaning will take up the vast majority of your time. Cleaning often means dealing with null values, dropping irrelevant variables, and feature engineering, which means transforming data so that it can be processed by a model.

Data visualization: Crafting and presenting visually appealing and understandable charts is a hugely valuable skill. Visualization has an uncanny ability to communicate important bits of information from a mass of data. Good data scientists will use data visualization to help themselves and their audiences better understand what’s going on.

Statistical analysis: Statistical tests are used to confirm and/or dispel a data scientist’s hypothesis. Tests such as the t-test or the chi-square test are used to evaluate the existence of certain relationships. A/B testing is a popular use case of statistical analysis; if a team wants to know which of two website designs leads to more clicks, then an A/B test is the right solution.

Machine learning: This is where data scientists use models that make predictions based on past observations. If a bank wants to know which customers are likely to pay back loans, then they can use a machine learning model trained on past loans to answer that question.

Computer science: Data scientists need adequate computer programming skills because many of the tasks they undertake involve writing code. In addition, some data science roles require data scientists to function as software engineers because data scientists have to implement their methodologies into their company’s backend servers.

Communication: You can be a math and computer whiz, but if you can’t explain your work to a novice audience, your talents might as well be useless. A great data scientist can distill digestible insights from complex analyses for a non-technical audience, translating how a p-value or correlation score is relevant to a part of the company’s business. If your company is going to make a potentially costly or lucrative decision based on your data science work, then it’s incumbent on you to make sure they understand your process and results as much as possible.
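
As a rough illustration of the A/B testing scenario mentioned under statistical analysis, the sketch below runs a two-proportion z-test using only the standard library. The function name, the visitor counts, and the click counts are all invented, and a z-test is just one reasonable choice here, not necessarily what any particular team would use.

```python
from math import erf, sqrt

def two_proportion_z_test(clicks_a, visitors_a, clicks_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in click rates."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    # Pooled click rate under the null hypothesis of no difference.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF computed from the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical results for website design A vs. design B.
z, p_value = two_proportion_z_test(clicks_a=200, visitors_a=2000,
                                   clicks_b=260, visitors_b=2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference in click rates is unlikely to be due to chance alone, which is the kind of result a data scientist would then have to explain to a non-technical audience.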

Conclusion

We hope this article helped to demystify this exciting and increasingly important line of work. Anyone who’s curious about data science, whether a college student or an executive thinking about hiring a data science team, should understand what this field is about and what it can and cannot do.

Data at Work: 3 Real-World Problems Solved by Data Science

At first glance, data science seems to be just another business buzzword — something abstract and ill-defined. While data can, in fact, be both of these things, it’s anything but a buzzword. Data science and its applications have been steadily changing the way we do business and live our day-to-day lives — and considering that 90% of all of the world’s data has been created in the past few years, there’s a lot of growth ahead of this exciting field.

While traditional statistics and data analysis have always focused on using data to explain and predict, data science takes this further. It uses data to learn, constructing algorithms and programs that collect from various sources and apply hybrids of mathematical and computer science methods to derive deeper actionable insights. Whereas traditional analysis uses structured data sets, data science dares to ask further questions, looking at unstructured “big data” derived from millions of sources and nontraditional mediums such as text, video, and images. This allows companies to make better decisions based on their customer data.

So how is this all manifesting in the market? Here, we look at three real-world examples of how data science drives business innovation across various industries and solves complex problems.

What Is Python?: An Introduction

Python is one of the most popular and user-friendly programming languages out there. As a developer who’s learned a number of programming languages, I count Python among my favorites due to its simplicity and power. Whether I’m rapidly prototyping a new idea or developing a robust piece of software to run in production, Python is usually my language of choice.

The Python programming language is ideal for folks first learning to program. It abstracts away many of the more complicated elements of computer programming that can trip up beginners, and this simplicity gets you up-and-running much more quickly!

For instance, the classic “Hello world” program (it just prints out the words “Hello World!”) looks like this in C:

However, to understand everything that’s going on, you need to understand what #include means (am I excluding anyone?), how to declare a function, why there’s an “f” appended to the word “print,” etc., etc.

In Python, the same program looks like this:
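
A minimal version, assuming Python 3 (the variable name message is used here only for illustration):

```python
# Store the greeting, then print it to the screen.
message = "Hello World!"
print(message)
```

There is nothing to include and no function to declare before the program can run.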

Not only is this an easier starting point, but as the complexity of your Python programming grows, this simplicity will make sure you’re spending more time writing awesome code and less time tracking down bugs!

Since Python is popular and open-source, there’s a thriving community of Python application developers online with extensive forums and documentation for whenever you need help. No matter what your issue is, the answer is usually only a quick Google search away.

If you’re new to programming or just looking to add another language to your arsenal, I would highly encourage you to join our community.

What is Python?

Named after the classic British comedy troupe Monty Python, Python is a general-purpose, interpreted, object-oriented, high-level programming language with dynamic semantics. That’s a bit of a mouthful, so let’s break it down.

General-Purpose

Python is a general-purpose language which means it can be used for a wide variety of development tasks. Unlike a domain-specific language that can only be used for specific types of applications (think JavaScript and HTML/CSS for web development), a general-purpose language like Python can be used for:

Web applications: Popular web frameworks like Django and Flask are written in Python.

Desktop applications: The Dropbox client is written in Python.

Scientific and numeric computing: Python is the top choice for data science and machine learning.

Cybersecurity: Python is excellent for data analysis, writing system scripts that interact with an operating system, and communicating over network sockets.

Interpreted

Python is an interpreted language, meaning Python program code must be run using the Python interpreter.

Traditional programming languages like C/C++ are compiled, meaning that before a program can be run, the human-readable code is passed into a compiler (a special program) to generate machine code, a series of bytes providing specific instructions to specific types of processors. However, Python is different. Since it’s an interpreted programming language, your human-readable code is passed to an interpreter that translates and executes it at run time. (The standard interpreter, CPython, first converts the code to an intermediate bytecode and then runs that, rather than producing native machine code ahead of time.)

In other words, instead of having to go through the sometimes complicated and lengthy process of compiling your code before running it, you just point the Python interpreter at your code, and you’re off!

Part of what makes an interpreted language great is how portable it is. Compiled languages must be compiled for the specific type of computer they’re run on (think your phone vs. your laptop). For Python, as long as you’ve installed the interpreter for your computer, the exact same code will run almost anywhere!

Object-Oriented

Python is an Object-Oriented Programming (OOP) language, which means that its elements are organized into things called objects. Objects are very useful for software architecture and often make it simpler to write large, complicated applications.
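
As a small sketch of what that means in practice, built-in values and your own classes are all objects that bundle data with behavior. The Playlist class, its methods, and the song names below are hypothetical and made up for this example.

```python
# Built-in values are objects too: they have a type and methods.
print(isinstance(3, object))   # True
print("python".upper())        # PYTHON

class Playlist:
    """A hypothetical class that groups data with the code that uses it."""

    def __init__(self, name):
        self.name = name
        self.songs = []

    def add(self, song):
        self.songs.append(song)

    def __len__(self):
        return len(self.songs)

road_trip = Playlist("Road Trip")
road_trip.add("Song A")
road_trip.add("Song B")
print(len(road_trip))          # 2
```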

High-Level

Python is a high-level language which really just means that it’s simpler and more intuitive for a human to use. Low-level languages such as C/C++ require a much more detailed understanding of how a computer works. With a high-level language, many of these details are abstracted away to make your life easier.

For instance, say you have a list of three numbers (1, 2, and 3) and you want to append the number 4 to that list. In C, you have to worry about how the computer uses memory, understands different types of variables (e.g., an integer vs. a string), and keeps track of what you’re doing.

Implementing this in C code is rather complicated:

However, implementing this in Python code is much simpler:
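
A short sketch, assuming Python 3. The list is named numbers here rather than list, because naming a variable list would shadow Python’s built-in type:

```python
# Start with a list of three numbers and append a fourth.
numbers = [1, 2, 3]
numbers.append(4)
print(numbers)  # [1, 2, 3, 4]
```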

Since a list in Python is an object, you don’t need to specifically define what the data structure looks like or explain to the computer what it means to append the number 4. You just say “list.append(4)”, and you’re good.

Under the hood, the computer is still doing all of those complicated things, but as a developer, you don’t have to worry about them! Not only does that make your code easier to read, understand, and debug, but it means you can develop more complicated programs much faster.

Dynamic Semantics

Python uses dynamic semantics, meaning that its variables are dynamic objects. Essentially, it’s just another aspect of Python being a high-level language.

In the list example above, a low-level language like C requires you to statically define the type of a variable. So if you defined an integer x, set x = 3, and then set x = “pants”, the computer will get very confused. However, if you use Python to set x = 3, Python knows x is an integer. If you then set x = “pants”, Python knows that x is now a string.

In other words, Python lets you assign variables in a way that makes more sense to you than it does to the computer. It’s just another way that Python programming is intuitive.

It also gives you the ability to do something like creating a list where different elements have different types like the list [1, 2, “three”, “four”]. Defining that in a language like C would be a nightmare, but in Python, that’s all there is to it.
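
The reassignment and the mixed-type list described above can be sketched in a few lines, assuming Python 3. The name mixed is introduced here only for illustration.

```python
x = 3
print(type(x).__name__)   # int

x = "pants"               # the same name now refers to a string
print(type(x).__name__)   # str

# One list holding values of different types.
mixed = [1, 2, "three", "four"]
print([type(item).__name__ for item in mixed])  # ['int', 'int', 'str', 'str']
```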

It’s Popular. Like, Super Popular.

Being so powerful, flexible, and user-friendly, the Python language has become incredibly popular. Python’s popularity is important for a few reasons.

Python Programming is in Demand

If you’re looking for a new skill to help you land your next job, learning Python is a great move. Because of its versatility, Python is used by many top tech companies. Netflix, Uber, Pinterest, Instagram, and Spotify all build their applications using Python. It’s also a favorite programming language of folks in data science and machine learning, so if you’re interested in going into those fields, learning Python is a good first step. With all of the folks using Python, it’s a programming language that will still be just as relevant years from now.

Dedicated Community

Python developers have tons of support online. It’s open-source with extensive documentation, and there are tons of articles and forum posts dedicated to it. As a professional Python developer, I rely on this community every day to get my code up and running as quickly and easily as possible.

There are also numerous Python libraries readily available online! If you ever need more functionality, someone on the internet has likely already written a library to do just that. All you have to do is download it, write the line “import <library>”, and off you go. Part of Python’s popularity in data science and machine learning is the widespread use of its libraries such as NumPy, Pandas, SciPy, and TensorFlow.

Conclusion

Python is a great way to start programming and a great tool for experienced developers. It’s powerful, user-friendly, and enables you to spend more time writing badass code and less time debugging it. With all of the libraries available, it will do almost anything you want it to.

The final answer to the question “What is Python?” Awesome. Python is awesome.

What It’s Really Like to Change Your Career Online

Going to work used to mean physically traveling to a workplace. Whether by foot, public transit, or car — a job was a specific location to which you commuted. But with the advent of the gig economy and advances in technology, telecommuting has become more and more prevalent. In fact, according to a 2018 study, approximately 70% of workers worldwide spend at least one day a week working from home.

So, why should education be any different? Learning from the comfort of home saves you the time and money you would’ve spent commuting, allows you to spend more time with loved ones, and encourages a much more comfortable, casual work environment.

That’s why we’re now offering all of our career-changing Immersives online. We’ve transformed more than 11,000 careers, so whether you’re interested in becoming a software engineer, data scientist, or UX designer, you can trust our proven curriculum, elite instructors, and dedicated career coaches to set you up for professional success.

We sat down with three experts on GA’s Immersive Remote programs to better understand how they work — and more importantly — how they compare to the on-campus experience.

Breaking Barriers

GA Education Product Manager Lee Almegard explained the reasoning behind the move: “At GA, the ability to pay tuition, commute to class, or coordinate childcare shouldn’t be a barrier to launching a new career,” she said. “Our new 100% remote Immersive programs are designed to ease these barriers.”

Obviously, saving yourself a trip to campus is appealing on many levels, but some interested students expressed concern that they wouldn’t receive enough personalized attention studying online as opposed to IRL. Instructor Matt Huntington reassures them, saying “Our lectures are highly interactive, and there is ample time to ask questions — not only of the teacher but also of other students.”

Staying Focused

It’s not always easy to stay focused in a traditional classroom, but when your fellow students have been replaced by a curious toddler or Netflix is only a click away, distraction is a real concern.

GA graduate Alex Merced shared these worries when he began his Software Engineering Immersive Remote program, but they quickly disappeared. “The clever use of Slack and Zoom really made the class engaging. It leverages the best features of both platforms, such as polls, private channels, and breakout rooms,” he said. “This kept the class kinetic, social, and engaging, versus traditional online training that usually consists of fairly non-interactive lectures over PowerPoint.”

If you’re concerned about staying focused, you can use these simple, impactful tips to stay motivated and on track to meet your goals:

• Plan ahead. Conquer homework by blocking off time on your calendar each week during the hours in which you focus best.
• Limit distractions. Find a quiet place to study, put your device on “Do Not Disturb” mode, or find a productivity app like Freedom to block time-consuming sites when studying or working independently.
• Listen to music. You might find that music helps you concentrate on homework. Some of our favorite Spotify playlists to listen to are Deep Focus, Cinematic Chillout, and Dreamy Vibes.
• Take breaks. Go for a short walk at lunch and change up the scenery, or grab a latte to power through an assignment.
• Ask for help. We’re here for you! Our instructional team is available for guidance, feedback, technical assistance, and more during frequent one-on-one check-ins and office hours.

Most importantly, listen to yourself. Everyone learns differently, so take stock of what works best for you. Find the strategies that fit your learning style, and you’ll be well on your way to new skills and new heights.

Getting Connected and Getting Hired

Another key component of learning is the camaraderie that comes from meeting and studying with like-minded students. How does that translate to a virtual classroom?

GA Career Coach Ruby Sycamore-Smith explains that both students and faculty can have meaningful, productive relationships without ever meeting in person. “We’re a lot more intentional online,” she says. “You’re not able to just bump into each other in the corridor as you would on campus, but that means you’re able to be a lot more purposeful with your time when you do connect, way beyond a simple smile and a wave.” Merced agrees. “Breakout sessions allowed me to assist and be assisted by my classmates, with whom I’ve forged valuable relationships. Now I have friends all over the world.” And as Huntington pointed out, “There is no back of the classroom when you’re online.” When you learn remotely, every seat is right next to all of your peers.

“When we piloted the Software Engineering Remote bootcamp, we took extra care to make sure that our virtual classrooms felt exactly like the on-campus ones, with group labs and even special projects to ensure students are constantly working with each other,” Huntington explained. “A lot of our students form after-hours homework groups, and nighttime TAs create study hall video conferences so everyone can see and talk to each other.”

And with students from all over the country, you’re going to connect with people you never would’ve met within the confines of a classroom. These peers could even be the very contacts who help you get hired.

Because GA recruits industry professionals who are also gifted instructors to lead courses, students learn how to translate their knowledge into the in-demand skill sets that employers need. Sycamore-Smith explains that the involvement of GA’s career coaches doesn’t end after graduation; they’re invested in their students’ long-term success.

She says, “Career preparation sessions are very discussion-based and collaborative, as all of our students have varied backgrounds. Some are recent college graduates, others may have had successful careers and experienced a number of job hunts previously. Everyone has unique ideas and insights to share, so we use these sessions to really connect and learn from one another.”

Merced is enthusiastic about his GA experience and quickly landed a great job as a developer. “Finding work was probably the area I was most insecure about going into the class,” he confessed. “But the prep sessions really made the execution and expectations of a job search much clearer and I was able to land firmly on my feet.”

Conclusion? Make Yourself at Home

After years of teaching in front of a brick-and-mortar classroom, Huntington was a little wary about his move to digital instructor, but his misgivings quickly gave way.

“I was surprised to feel just as close to my virtual students as I did to my on-campus students,” he said. “Closing down our virtual classrooms and saying goodbye on the last day of class is so much more heart-wrenching online than it ever was for me when I taught on campus.”

Huntington’s advice to a student wondering if online learning is right for them: “Go for it! It’s just like in person, but there’s no commute and it’s socially acceptable to wear pajamas!”

SQL: Using Data to Boost Business and Increase Efficiency

In today’s digital age, we’re constantly bombarded with information about new apps, transformative technologies, and the latest and greatest artificial intelligence system. While these technologies may serve very different purposes in our life, all of them share one thing in common: They rely on data. More specifically, they all use databases to capture, store, retrieve, and aggregate data. This raises the question: How do we actually interact with databases to accomplish all of this? The answer: We use Structured Query Language, or SQL (pronounced “sequel” or “ess-que-el”).

Put simply, SQL is the language of data — it’s a programming language that enables us to efficiently create, alter, request, and aggregate data from those mysterious things called databases. It gives us the ability to make connections between different pieces of information, even when we’re dealing with huge data sets. Modern applications are able to use SQL to deliver really valuable pieces of information that would otherwise be difficult for humans to keep track of independently. In fact, pretty much every app that stores any sort of information uses a database. This ubiquity means that developers use SQL to log, record, alter, and present data within the application, while analysts use SQL to interrogate that same data set in order to find deeper insights.

Finding SQL in Everyday Life

Think about the last time you looked up the name of a movie on IMDb. I’ll bet you quickly noticed an actress on the cast list and thought something like, “I didn’t realize she was in that,” then clicked a link to read her bio. As you were navigating through that app, SQL was responsible for returning the information you “requested” each time you clicked a link. This sort of capability is something we’ve come to take for granted these days.

Let’s look at another example that truly is cutting-edge, this time at the intersection of local government and small business. Many cities are supporting open data initiatives, in which public data is made easily accessible by opening up the databases that store it. As an example, let’s look at Los Angeles building permit data, business listings, and census data.

Imagine you work at a real estate investment firm and are trying to find the next up-and-coming neighborhood. You could use SQL to combine the permit, business, and census data in order to identify areas that are undergoing a lot of construction, have high populations, and contain a relatively low number of businesses. This might be a great opportunity to purchase property in a soon-to-be thriving neighborhood! For the first time in history, it’s easy for a small business to leverage quantitative data from the government in order to make a highly informed business decision.
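As a rough sketch of what that kind of query could look like, the Python snippet below loads three tiny tables into an in-memory SQLite database and joins them. The table names, column names, neighborhoods, and thresholds are all hypothetical stand-ins for illustration; they are not the actual schema of any Los Angeles open data set.

```python
import sqlite3

# Hypothetical schema and toy rows; real open-data tables will look different.
permit_db = sqlite3.connect(":memory:")
permit_db.executescript("""
CREATE TABLE permits    (neighborhood TEXT, permit_id INTEGER);
CREATE TABLE businesses (neighborhood TEXT, business_name TEXT);
CREATE TABLE census     (neighborhood TEXT, population INTEGER);

INSERT INTO permits VALUES ('Northside', 1), ('Northside', 2), ('Northside', 3),
                           ('Harbor', 4);
INSERT INTO businesses VALUES ('Northside', 'Corner Cafe'),
                              ('Harbor', 'Dockside Deli'),
                              ('Harbor', 'Pier Books'),
                              ('Harbor', 'Bay Bikes');
INSERT INTO census VALUES ('Northside', 42000), ('Harbor', 15000);
""")

# Count permits and businesses per neighborhood first, then join both counts
# to the census table and keep areas with lots of construction, a large
# population, and few businesses. The cutoffs are arbitrary example values.
candidate_query = """
SELECT c.neighborhood, c.population, p.permit_count, b.business_count
FROM census AS c
JOIN (SELECT neighborhood, COUNT(*) AS permit_count
      FROM permits GROUP BY neighborhood) AS p
  ON p.neighborhood = c.neighborhood
JOIN (SELECT neighborhood, COUNT(*) AS business_count
      FROM businesses GROUP BY neighborhood) AS b
  ON b.neighborhood = c.neighborhood
WHERE c.population > 20000
  AND p.permit_count >= 3
  AND b.business_count <= 2
ORDER BY p.permit_count DESC;
"""

candidates = permit_db.execute(candidate_query).fetchall()
print(candidates)
```

Counting each table on its own before joining keeps the permit and business rows from multiplying each other, which is an easy mistake to make when combining several data sets by neighborhood.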

There are many ways to harness SQL’s power to supercharge your business and career, in marketing and sales roles, and beyond. Here are just a few:

• Increase sales: A sales manager could use SQL to compare the performance of various lead-generation programs and double down on those that are working.
• Track ads: A marketing manager responsible for understanding the efficacy of an ad campaign could use SQL to compare the increase in sales before and after running the ad.
• Streamline processes: A business manager could use SQL to compare the resources used by various departments in order to determine which are operating efficiently.
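To make the first of these concrete, here is a minimal sketch of how a sales manager might compare lead-generation programs. It uses Python’s built-in SQLite module, and the `leads` table, its columns, and the program names are invented for the example.

```python
import sqlite3

# Invented example data: one row per lead, with 1 meaning the lead became a sale.
sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE leads (program TEXT, converted INTEGER)")
sales_db.executemany(
    "INSERT INTO leads VALUES (?, ?)",
    [
        ("webinar", 1), ("webinar", 1), ("webinar", 0), ("webinar", 1),
        ("trade_show", 0), ("trade_show", 1), ("trade_show", 0), ("trade_show", 0),
    ],
)

# Group the leads by program and compute a conversion rate for each one.
conversion_rates = sales_db.execute("""
    SELECT program,
           COUNT(*)       AS lead_count,
           SUM(converted) AS sales,
           ROUND(100.0 * SUM(converted) / COUNT(*), 1) AS conversion_pct
    FROM leads
    GROUP BY program
    ORDER BY conversion_pct DESC
""").fetchall()

for row in conversion_rates:
    print(row)
```

The same GROUP BY pattern carries over to the other two bullets: swap the grouping column for an ad campaign period or a department, and swap the aggregate for sales or resource usage.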

SQL at General Assembly

At General Assembly, we know businesses are striving to transform their data from raw facts into actionable insights. The primary goal of our data analytics curriculum, from workshops to full-time courses, is to empower people to access this data in order to answer their own business questions in ways that were never possible before.

To accomplish this, we give students the opportunity to use SQL to explore real-world data such as Firefox usage statistics, Iowa liquor sales, or Zillow’s real estate prices. Our full-time Data Science Immersive and part-time Data Analytics courses help students build the analytical skills needed to turn the results of those queries into clear and effective business recommendations. On a more introductory level, after just a couple of hours in one of our SQL workshops, students are able to query multiple data sets with millions of rows.

Meet Our Expert

Michael Larner is a passionate leader in the analytics space who specializes in using techniques like predictive modeling and machine learning to deliver data-driven impact. A Los Angeles native, he has spent the last decade consulting with hundreds of clients, including 50-plus Fortune 500 companies, to answer some of their most challenging business questions. Additionally, Michael empowers others to become successful analysts by leading trainings and workshops for corporate clients and universities, including General Assembly’s part-time Data Analytics course and SQL/Excel workshops in Los Angeles.

“In today’s fast-paced, technology-driven world, data has never been more accessible. That makes it the perfect time — and incredibly important — to be a great data analyst.”

– Michael Larner, Data Analytics Instructor, General Assembly Los Angeles

Harnessing the Power of Data for Disaster Relief

Data is the engine driving today’s digital world. From major companies to government agencies to nonprofits, business leaders are hunting for talent that can help them collect, sort, and analyze vast amounts of data — including geodata — to tackle the world’s biggest challenges.

In the case of emergency management, disaster preparedness, response, and recovery, this means using data to expertly identify, manage, and mitigate the risks of destructive hurricanes, intense droughts, raging wildfires, and other severe weather and climate events. And the pressure to make smarter data-driven investments in disaster response planning and education isn’t going away anytime soon: since 1980, the U.S. has suffered 246 weather and climate disasters that each topped $1 billion in losses, according to the National Centers for Environmental Information.

Employing creative approaches for tackling these pressing issues is a big reason why New Light Technologies (NLT), a leading company in the geospatial data science space, joined forces with General Assembly’s (GA) Data Science Immersive (DSI) course, a hands-on intensive program that fosters job-ready data scientists. Global Lead Data Science Instructor at GA, Matt Brems, and Chief Scientist and Senior Consultant at NLT, Ran Goldblatt, recognized a unique opportunity to test drive collaboration between DSI students and NLT’s consulting work for the Federal Emergency Management Agency (FEMA) and the World Bank.

The goal for DSI students: build data solutions that address real-world emergency preparedness and disaster response problems using leading data science tools and programming languages that drive visual, statistical, and data analyses. The partnership has so far produced three successful cohorts with nearly 60 groups of students across campuses in Atlanta, Austin, Boston, Chicago, Denver, New York City, San Francisco, Los Angeles, Seattle, and Washington, D.C., who learn and work together through GA’s Connected Classroom experience.

Taking on Big Problems With Smart Data

DSI students present at NLT’s Washington, D.C. office.

“GA is a pioneering institution for data science, so many of its goals coincide with ours. It’s what also made this partnership a unique fit. When real-world problems are brought to an educational setting with students who are energized and eager to solve concrete problems, smart ideas emerge,” says Goldblatt.

Over the past decade, NLT has supported the ongoing operation, management, and modernization of information systems infrastructure for FEMA, providing the agency with support for disaster response planning and decision-making. The World Bank, another NLT client, faces similar obstacles in its efforts to provide funding for emergency prevention and preparedness.

These large-scale issues served as the basis for the problem statements NLT presented to DSI students, who were challenged to use their newfound skills — from developing data algorithms and analytical workflows to employing visualization and reporting tools — to deliver meaningful, real-time insights that FEMA, the World Bank, and similar organizations could deploy to help communities impacted by disasters. Working in groups, students dived into problems that focused on a wide range of scenarios, including:

• Using tools such as Google Street View to retrieve pre-disaster photos of structures, allowing emergency responders to easily compare pre- and post-disaster aerial views of damaged properties.
• Optimizing evacuation routes for search and rescue missions using real-time traffic information.
• Creating damage estimates by pulling property values from real estate websites like Zillow.
• Extracting drone data to estimate the quality of building rooftops in Saint Lucia.

“It’s clear these students are really dedicated and eager to leverage what they learned to create solutions that can help people. With DSI, they don’t just walk away with an academic paper or fancy presentation. They’re able to demonstrate they’ve developed an application that, with additional development, could possibly become operational,” says Goldblatt.

Students who participated in the engagements received the opportunity to present their work — using their knowledge in artificial intelligence and machine learning to solve important, tangible problems — to an audience that included high-ranking officials from FEMA, the World Bank, and the United States Agency for International Development (USAID). The students’ projects, which are open source, are also publicly available to organizations looking to adapt, scale, and implement these applications for geospatial and disaster response operations.

“In the span of nine weeks, our students grew from learning basic Python to being able to address specific problems in the realm of emergency preparedness and disaster response,” says Brems. “Their ability to apply what they learned so quickly speaks to how well-qualified GA students and graduates are.”

Here’s a closer look at some of those projects, the lessons learned, and students’ reflections on how GA’s collaboration with NLT impacted their DSI experience.

Leveraging Social Media to Map Disasters

The NLT engagements feature student work that uses social media to identify “hot spots” for disaster relief.

During disasters, one of the biggest challenges for disaster relief organizations is not only mapping and alerting users about the severity of disasters but also pinpointing hot spots where people require assistance. While responders employ satellite and aerial imagery, ground surveys, and other hazard data to assess and identify affected areas, communities on the ground often turn to social media platforms to broadcast distress calls and share status updates.

Cameron Bronstein, a former botany and ecology major from New York, worked with group members to build a model that analyzes and classifies social media posts to determine where people need assistance during and after natural disasters. The group collected tweets related to Hurricane Harvey of 2017 and Hurricane Michael of 2018, which inflicted billions of dollars of damage in the Caribbean and Southern U.S., as test cases for their proof-of-concept model.

“Since our group lacked premium access to social media APIs, we sourced previously collected and labeled text-based data,” says Bronstein. “This involved analyzing and classifying several years of text language — including data sets that contained tweets, and transcribed phone calls and voice messages from disaster relief organizations.”
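For readers curious what classifying posts can look like in code, here is a deliberately tiny sketch of a naive Bayes text classifier in plain Python. It is not the group’s model: the training posts are hand-written placeholders rather than real tweets, and the labels `needs_help` and `other` are assumed for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, hand-written training posts; not real tweets or the group's data.
training_posts = [
    ("trapped on roof need rescue", "needs_help"),
    ("water rising please send boat", "needs_help"),
    ("need rescue family trapped", "needs_help"),
    ("stay safe everyone thoughts and prayers", "other"),
    ("watching the storm coverage on tv", "other"),
    ("donated to the relief fund today", "other"),
]

def train_naive_bayes(posts):
    """Count posts per label and words per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in posts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary

def classify_post(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_posts = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_posts)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not rule out a label.
            numerator = word_counts[label][word] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

post_model = train_naive_bayes(training_posts)
print(classify_post("family trapped need boat", post_model))
print(classify_post("watching tv coverage", post_model))
```

A real system would need far more labeled data, plus location information to turn flagged posts into hot spots on a map.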

Reflecting on what he enjoyed most while working on the NLT engagement, Bronstein states, “Though this project was ambitious and open to interpretation, overall, it was a good experience and introduction to the type of consulting work I could end up doing in the future.”

Quantifying the Economic Impact of Natural Disasters

Students use interactive data visualization tools to compile and display their findings.

Prior to enrolling in General Assembly’s DSI course in Washington, D.C., Ashley White learned early in her career as a management consultant how to use data to analyze and assess difficult client problems. “What was central to all of my experiences was utilizing the power of data to make informed strategic decisions,” states White.

It was White’s interest in using data for social impact that led her to enroll in DSI, where she could be exposed to real-world applications of data science principles and best practices. Her DSI group’s task: developing a model for quantifying the economic impact of natural disasters on the labor market. The group selected Houston, Texas, as its test case for defining and identifying reliable data sources to measure the economic impact of natural disasters such as Hurricane Harvey.

As they tackled their problem statement, the group focused on NLT’s intended goal, while effectively breaking their workflow into smaller, more manageable pieces. “As we worked through the data, we discovered it was hard to identify meaningful long-term trends. As scholarly research shows, most cities are pretty resilient post-disaster, and the labor market bounces back quickly as the city recovers,” says White.

The team compiled their results using the analytics and data visualization tool Tableau, incorporating compelling visuals and story taglines into a streamlined, dynamic interface. For version control, White and her group used GitHub to manage and store their findings, and share recommendations on how NLT could use the group’s methodology to scale their analysis for other geographic locations. In addition to the group’s key findings on employment fluctuations post-disaster, the team concluded that while natural disasters are growing in severity, aggregate trends around unemployment and similar data are becoming less predictable.

Cultivating Data Science Talent in Future Engagements

Due to the success of the partnership’s three engagements, GA and NLT have taken steps to formalize future iterations of their collaboration with each new DSI cohort. Additionally, mutually beneficial partnerships with leading organizations such as NLT present a unique opportunity to uncover innovative approaches for managing and understanding the numerous ways data science can support technological systems and platforms. The partnership has also given aspiring data scientists real-world experience and visibility with key decision-makers who are at the forefront of emergency and disaster management.

“This is only the beginning of a more comprehensive collaboration with General Assembly,” states Goldblatt. “By leveraging GA’s innovative data science curriculum and developing training programs for capacity building that can be adopted by NLT clients, we hope to provide students with essential skills that prepare them for the emerging, yet competitive, geospatial data job market. Moreover, students get the opportunity to better understand how theory, data, and algorithms translate to actual tools, as well as create solutions that can potentially save lives.”

***

New Light Technologies, Inc. (NLT) provides comprehensive information technology solutions for clients in government, commercial, and non-profit sectors. NLT specializes in DevOps enterprise-scale systems integration, development, management, and staffing and offers a unique range of capabilities from Infrastructure Modernization and Cloud Computing to Big Data Analytics, Geospatial Information Systems, and the Development of Software and Web-based Visualization Platforms.

In today’s rapidly evolving technological world, successfully developing and deploying digital geospatial software technologies and integrating disparate data across large complex enterprises with diverse user requirements is a challenge. Our innovative solutions for real-time integrated analytics lead the way in developing highly scalable virtualized geospatial microservices solutions. Visit our website to find out more and contact us at https://NewLightTechnologies.com.

The Study of Data Science Lags in Gender and Racial Representation

In the past few years, much attention has been drawn to the dearth of women and people of color in tech-related fields. A recent article in Forbes noted, “Women hold only about 26% of data jobs in the United States. There are a few reasons for the gender gap: a lack of STEM education for women early on in life, lack of mentorship for women in data science, and human resources rules and regulations not catching up to gender balance policies, to name a few.” Federal civil rights data further demonstrate that “black and Latino high school students are being shortchanged in their access to high-level math and science courses that could prepare them for college” and for careers in fields like data science.

As an education company offering tech-oriented courses at 20 campuses across the world, General Assembly is in a unique position to analyze the current crop of students looking to change the dynamics of the workplace.

Looking at GA data for our part-time programs (which typically reach students who already have jobs and are looking to expand their skill set as they pursue a promotion or a career shift), here’s what we found: While great strides have been made in fields like web development and user experience (UX) design, data science — a relatively newer concentration — still has a ways to go in terms of gender and racial equality.

Using Apache Spark For High Speed, Large Scale Data Processing

Apache Spark is an open-source framework used for large-scale data processing. The framework is made up of many components, including four programming APIs and four major libraries. Since Spark’s release in 2014, it has become one of Apache’s fastest-growing and most widely used projects of all time.

Spark uses an in-memory processing paradigm to speed up computation and run programs 10 to 100 times faster than other big data technologies like Hadoop MapReduce. According to the 2016 Apache Spark Survey, more than 900 companies, including IBM, Google, Netflix, Amazon, Microsoft, Intel, and Yahoo, use Spark in production for data processing and querying.

Apache Spark is important to the big data field because it represents the next generation of big data processing engines and is a natural successor to MapReduce. One of Spark’s advantages is that its four programming APIs (Scala, Python, R, and Java 8) give users the flexibility to work in the language of their choice. This makes the tool much more accessible to a wide range of programmers with different capabilities. Spark is also flexible in its ability to read all types of data from various locations such as the Hadoop Distributed File System (HDFS), Amazon’s web-based Simple Storage Service (S3), or even the local filesystem.

Spark’s greatest advantage is that it maximizes the capabilities of data science’s most expensive resource: the data scientist. Computers and programs have become so fast that we are no longer limited by what they can do as much as we are limited by human productivity. Because Spark provides a flexible language platform and concise syntax, the data scientist can write more programs, iterate on them faster, and have them run much more quickly. The code is production-ready and scalable, so there’s no need to hand off code requirements to a development team for changes.

It takes only a few minutes to write a word-count program in Spark, but would take much longer to write the same program in Java. Because the Spark code is so much shorter, there’s less of a need to debug or use version control tools.

Spark’s concise syntax can best be illustrated with the following examples. The Spark code is only four lines compared with almost 58 for Java.
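As a rough illustration of why the Spark version is so short, here is a word-count sketch in plain Python that mirrors the three steps a Spark word count typically chains together. The comments name the PySpark RDD methods (`flatMap`, `map`, `reduceByKey`) that play each role. The sample lines are made up, and the code runs locally on a list instead of on a cluster, so treat it as an analogy, not as Spark code.

```python
from collections import Counter
from itertools import chain

# Made-up sample input; in Spark this would come from something like sc.textFile(path).
lines = [
    "spark makes big data simple",
    "big data needs big tools",
]

# Step 1 (flatMap in PySpark): split every line into words and flatten the result.
words = chain.from_iterable(line.split() for line in lines)

# Step 2 (map in PySpark): pair each word with a count of 1.
pairs = ((word, 1) for word in words)

# Step 3 (reduceByKey in PySpark): add up the counts for each distinct word.
word_counts = Counter()
for word, count in pairs:
    word_counts[word] += count

print(word_counts.most_common(2))
```

In Spark, each of those steps is a single method call on a distributed data set, which is where the handful-of-lines figure comes from.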

Faster Processing

Spark utilizes in-memory processing to speed up applications. The older big data frameworks, such as Hadoop, use many intermediate disk reads and writes to accomplish the same task. For small jobs on several gigabytes of data, this difference is not as pronounced, but for machine learning applications and more complex tasks such as natural language processing, the difference can be tremendous. Logistic regression, a technique taught in all of General Assembly’s full- and part-time data science courses, can be sped up over 100x.
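To see why iterative algorithms benefit the most, consider this plain-Python sketch of logistic regression trained with gradient descent. Every iteration makes a full pass over the same data. A framework that goes back to disk between passes pays that cost on each iteration, while an in-memory engine can keep the data cached. The toy data, learning rate, and iteration count are assumptions chosen for illustration, and this is not how Spark’s MLlib implements the algorithm.

```python
import math

# Toy, made-up data: one feature x and a binary label y.
training_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, learning_rate=0.5, iterations=100):
    """Fit a weight and a bias with batch gradient descent."""
    weight, bias = 0.0, 0.0
    passes_over_data = 0
    for _ in range(iterations):
        grad_weight, grad_bias = 0.0, 0.0
        # One full scan of the data set per iteration.
        for x, y in data:
            error = sigmoid(weight * x + bias) - y
            grad_weight += error * x
            grad_bias += error
        weight -= learning_rate * grad_weight / len(data)
        bias -= learning_rate * grad_bias / len(data)
        passes_over_data += 1
    return weight, bias, passes_over_data

weight, bias, passes_over_data = train_logistic(training_data)
print(passes_over_data, sigmoid(weight * 2.0 + bias) > 0.5)
```

With 100 iterations the data is scanned 100 times, so anything that makes each scan cheaper is multiplied across the whole training run.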

Spark has four key libraries that also make it much more accessible and provide a wider set of tools for people to use. Spark SQL is ideal for leveraging SQL skills or working with data frames; Spark Streaming is useful if you need to process data in near real time; and GraphX has prewritten algorithms that are useful if you have graph data or need to do graph processing. The library most useful to students in our Data Science Immersive, though, is the Spark MLlib machine learning library, which has prewritten distributed machine learning algorithms for use on data frames.

Spark at General Assembly

At GA, we teach both the concepts and the tools of data science. Because hiring managers from marketing, technology, and biotech companies, as well as guest speakers like company founders and entrepreneurs, regularly talk about using Spark, we’ve incorporated it into the curriculum to ensure students are fluent in the field’s most relevant skills. I teach Spark as part of our Data Science Immersive (DSI) course in Boston, and I previously taught two Spark courses for Cloudera and IBM. Spark is a great tool to teach because the general curriculum focuses mostly on Python, and Spark has a Python API/library called PySpark.

When we teach Spark in DSI, we cover resilient distributed datasets (RDDs), directed acyclic graphs, closures, lazy execution, and reading JavaScript Object Notation (JSON), a common big data file format.
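Lazy execution is often the least intuitive of these ideas, so here is a small plain-Python analogy using generators. Building the pipeline only describes the work; nothing is parsed until a final step asks for results. The JSON records and field names are invented for the example, and the generators stand in for Spark transformations without reproducing Spark’s API.

```python
import json

# Invented input in JSON Lines style: one JSON object per line.
raw_lines = [
    '{"user": "ana", "clicks": 3}',
    '{"user": "ben", "clicks": 0}',
    '{"user": "cy", "clicks": 7}',
]

stats = {"parsed": 0}  # Tracks how many lines have actually been parsed.

def parse(lines):
    """Lazily parse each line; the body runs only when a result is requested."""
    for line in lines:
        stats["parsed"] += 1
        yield json.loads(line)

# "Transformations": chaining these steps does no work yet.
records = parse(raw_lines)
active = (record for record in records if record["clicks"] > 0)
names = (record["user"] for record in active)

parsed_before_action = stats["parsed"]  # Nothing has been consumed at this point.

# "Action": collecting the results pulls the data through every step.
result = list(names)
print(parsed_before_action, stats["parsed"], result)
```

Spark applies the same idea at cluster scale: it records the chain of transformations as a directed acyclic graph and only runs it when an action needs the output.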

Meet Our Expert

Joseph Kambourakis has over 10 years of teaching experience and over five years of experience teaching data science and analytics. He has taught in more than a dozen countries and has been featured in Japanese and Saudi Arabian press. He holds a bachelor’s degree in electrical and computer engineering from Worcester Polytechnic Institute and an MBA with a focus in analytics from Bentley University. He is a passionate Arsenal FC supporter and competitive Magic: The Gathering player. He currently lives with his wife and daughter in Needham, Massachusetts.

“GA students come to class motivated to learn. Throughout the Data Science Immersive course, I keep them on their path by being patient and setting up ideas in a simple way, then letting them learn from hands-on lab work.”

– Joseph Kambourakis, Data Science Instructor, General Assembly Boston