Data Category Archives - General Assembly Blog | Page 5

A Machine Learning Guide for Beginners


Ever wonder how apps, websites, and machines seem to be able to predict the future? Like how Amazon knows what your next purchase may be, or how self-driving cars can safely navigate a complex traffic situation?

The answer lies in machine learning.

Machine learning is a branch of artificial intelligence (AI) that often leverages Python to build systems that can learn from and make decisions based on data. Instead of explicitly programming the machine to solve the problem, we show it how it was solved in the past and the machine learns the key steps that are required to do the same task on its own.

Machine learning is revolutionizing every industry by bringing greater value to companies’ years of saved data. Leveraging machine learning enables organizations to make more precise decisions instead of following intuition.

There’s an explosive amount of innovation around machine learning within organizations, especially given that the technology is still in its early days. Many companies have invested heavily in building recommendation and personalization engines for their customers. But machine learning is also being applied to a huge variety of back-office use cases, like forecasting sales, identifying production bottlenecks, and building efficient traffic routing systems.

Machine learning algorithms fall into two categories: supervised and unsupervised learning.

Supervised Learning

Supervised learning tries to predict a future value by relying on training from past data. For instance, Netflix’s movie-recommendation engine is most likely supervised. It uses a user’s past movie ratings to train the model, then predicts what their rating would likely be for movies they haven’t seen and recommends the ones that score highly.

Supervised learning enjoys more commercial success than unsupervised learning. Some common use cases include fraud detection, image recognition, credit scoring, product recommendation, and malfunction prediction.
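The core idea — learn from labeled past data, then predict labels for unseen examples — can be made concrete with a tiny sketch. The snippet below uses a hand-rolled nearest-neighbor rule in plain Python; the data, the genres, and the labels are all invented for illustration, and a real system would use far more features and a proper library model.

```python
# Supervised learning in miniature: learn from labeled past data,
# then predict a label for an unseen example (toy data, illustration only).

def predict(train, query):
    """1-nearest-neighbor: return the label of the closest training point."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda pair: dist(pair[0], query))
    return label

# Past data: (hours of action, hours of comedy) -> did the viewer like "Movie X"?
history = [
    ((9.0, 1.0), "liked"),
    ((8.5, 0.5), "liked"),
    ((1.0, 9.0), "disliked"),
    ((0.5, 8.0), "disliked"),
]

# A new viewer who watches mostly action lands nearest the "liked" examples.
print(predict(history, (8.0, 2.0)))
```

The training pairs play the role of Netflix’s past ratings: the labels are known, and the model’s only job is to generalize them to new inputs.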

Unsupervised Learning

Unsupervised learning is about uncovering hidden structures within data sets. It’s helpful in identifying segments or groups, especially when there is no prior information available about them. These algorithms are commonly used in market segmentation. They enable marketers to identify target segments in order to maximize revenue, create anomaly detection systems to identify suspicious user behavior, and more.

For instance, Netflix may know how many customers it has, but wants to understand what kind of groupings they fall into in order to offer services targeted to them. The streaming service may have 50 or more different customer types, or segments, but its data team doesn’t know this yet. If the company knows that most of its customers are in the “families with children” segment, it can invest in building specific programs to meet those customers’ needs. But without that information, Netflix’s data experts can’t create a supervised machine learning system.

So, they build an unsupervised machine learning algorithm instead, which identifies and extracts various customer segments within the data and allows them to identify groups such as “families with children” or “working professionals.”
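A rough sketch of that idea, using a tiny hand-rolled k-means on made-up viewing data: the algorithm gets no labels at all, yet the groups it finds line up with segments a human can then name. The account figures and segment interpretations below are invented for illustration.

```python
# Unsupervised learning in miniature: group unlabeled accounts into
# segments by viewing habits (toy data; a hand-rolled k-means sketch).

def kmeans(points, k, iters=10):
    centroids = points[:k]                      # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign each point to its nearest centroid
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2 +
                                            (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        centroids = [                           # recompute each centroid as its cluster mean
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# (hours of kids' shows, hours of documentaries) per account -- no labels given
accounts = [(9, 1), (8, 2), (10, 0), (1, 9), (2, 8), (0, 10)]
segments = kmeans(accounts, k=2)
# One cluster looks like "families with children", the other like documentary
# watchers -- the names come from a human inspecting the discovered groups.
```

Note that the segment names never appear in the data; unsupervised learning surfaces the structure, and people interpret it.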

How Python, SQL, and Machine Learning Work Together

To understand how SQL, Python, and machine learning relate to one another, let’s think of them as a factory. As a concept, a factory can produce anything if it has the right tools. More often than not, the tools used in factories are pretty similar (e.g., hammers and screwdrivers).

What’s amazing is that there can be factories that use those same tools but produce completely different products (e.g., tables versus chairs). The difference between these factories is not the tools, but rather how the factory workers use their expertise to leverage these tools and produce a different result.

In this case, our goal would be to produce a machine learning model, and our tools would be SQL and Python. We can use SQL to extract data from a database and Python to shape the data and perform the analyses that ultimately produce a machine learning model. Your knowledge of machine learning will ultimately enable you to achieve your goal.

To round out the analogy, an app developer, with no understanding of machine learning, might choose to use SQL and Python to build a web app. Again, the tools are the same, but the practitioner uses their expertise to apply them in a different way.
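As a toy illustration of that division of labor — the table name, columns, and figures below are all invented — SQL extracts and aggregates rows from a database, and Python shapes the result into features a model could consume. Python’s built-in sqlite3 module stands in for a real data warehouse:

```python
import sqlite3

# Build a throwaway in-memory database standing in for the company warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("ana", 20.0), ("ana", 35.0), ("ben", 5.0)])

# Step 1: SQL extracts and aggregates the raw data...
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# Step 2: ...and Python reshapes it into features for downstream modeling.
features = {customer: total for customer, total in rows}
print(features)   # {'ana': 55.0, 'ben': 5.0}
```

The same two tools could just as easily feed a web app instead of a model — which is exactly the point of the factory analogy.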

Machine Learning at Work

A wide variety of roles can benefit from machine learning know-how. Here are just a few:

  • Data scientist or analyst: Data scientists or analysts use machine learning to answer specific business questions for key stakeholders. They might help their company’s user experience (UX) team determine which website features most heavily drive sales.
  • Machine learning engineer: A machine learning engineer is a software engineer specifically responsible for writing code that leverages machine learning models. For example, they might build a recommendation engine that suggests products to customers.
  • Research scientist: A machine learning research scientist develops new technologies like computer vision for self-driving cars or advancements in neural networks. Their findings enable data professionals to deliver new insights and capabilities.

Machine Learning in Everyday Life: Real-World Examples

While machine learning-powered innovations like voice-activated robots seem ultra-futuristic, the technology behind them is actually widely used today. Here are some great examples of how machine learning impacts your daily life:

  • Recommendation engines: Think about how Spotify makes music recommendations. The recommendation engine peeks at the songs and albums you’ve listened to in the past, as well as tracks listened to by users with similar tastes. It then starts to learn the factors that influence your music preferences and stores them in a database, recommending similar music that you haven’t listened to — all without writing any explicit rules!
  • Voice-recognition technology: We’ve seen the emergence of voice assistants like Amazon’s Alexa and Google’s Assistant. These interactive systems are based entirely on voice-recognition technology powered by machine learning models.
  • Risk mitigation and fraud prevention: Insurers and creditors use machine learning to make accurate predictions about fraudulent claims based on previous consumer behavior, rather than relying on traditional analysis or human judgment. They can also use these analyses to identify high-risk customers. Both of these analyses help companies process requests and claims more quickly and at a lower cost.
  • Photo identification via computer vision: Machine learning is common among photo-heavy services like Facebook and the home-improvement site Houzz. Each of these services uses computer vision — an aspect of machine learning — to automatically tag objects in photos without human intervention. For Facebook, these tend to be faces, whereas Houzz seeks to identify individual objects and link to a place where users can purchase them.
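The recommendation-engine idea from the first bullet can be sketched in a few lines of plain Python: score listeners by how much their history overlaps with yours, then suggest tracks the most similar listener has played that you haven’t. All names and the overlap-count similarity below are made up for illustration; production systems use far richer similarity metrics.

```python
# Toy collaborative filtering: recommend tracks the most similar
# listener has heard but you haven't (data and names invented).

history = {
    "you":   {"song_a", "song_b", "song_c"},
    "alice": {"song_a", "song_b", "song_d"},   # overlaps with you on 2 tracks
    "bob":   {"song_x", "song_y"},             # no overlap
}

def recommend(user, histories):
    mine = histories[user]
    others = {u: h for u, h in histories.items() if u != user}
    # Similarity = size of shared history (a crude stand-in for real metrics).
    nearest = max(others, key=lambda u: len(others[u] & mine))
    return sorted(others[nearest] - mine)

print(recommend("you", history))   # ['song_d']
```

No explicit rule ever says “fans of song_a also like song_d” — the suggestion falls out of the overlap in listening histories, which is the essence of how these engines learn preferences.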

Why You and Your Business Need to Understand Data Science

As the world becomes increasingly data-driven, learning to leverage key technologies like machine learning — along with the programming languages Python (which helps power machine learning algorithms) and SQL — will create endless possibilities for your career and your organization. There are many pathways into this growing field, as detailed by our Data Science Standards Board, and now’s a great time to dive in.

In our paper A Beginner’s Guide to SQL, Python, and Machine Learning, we break down these three data disciplines. These skills go beyond data to bring delight, efficiency, and innovation to countless industries. They empower people to drive businesses forward with a speed and precision previously unknown.

Individuals can use data know-how to improve their problem-solving skills, become more cross-functional, build innovative technology, and more. For companies, leveraging these technologies means smarter use of data. This can lead to greater efficiency, employees who are empowered to use data in innovative ways, and business decisions that drive revenue and success.

Download the paper to learn more.

Boost your business and career acumen with data.
Find out why machine learning, Python, and SQL are the top technologies to know.

The Study of Data Science Lags in Gender and Racial Representation



In the past few years, much attention has been drawn to the dearth of women and people of color in tech-related fields. A recent article in Forbes noted, “Women hold only about 26% of data jobs in the United States. There are a few reasons for the gender gap: a lack of STEM education for women early on in life, lack of mentorship for women in data science, and human resources rules and regulations not catching up to gender balance policies, to name a few.” Federal civil rights data further demonstrate that “black and Latino high school students are being shortchanged in their access to high-level math and science courses that could prepare them for college” and for careers in fields like data science.

As an education company offering tech-oriented courses at 20 campuses across the world, General Assembly is in a unique position to analyze the current crop of students looking to change the dynamics of the workplace.

Looking at GA data for our part-time programs (which typically reach students who already have jobs and are looking to expand their skill set as they pursue a promotion or a career shift), here’s what we found: While great strides have been made in fields like web development and user experience (UX) design, data science — a relatively newer concentration — still has a ways to go in terms of gender and racial equality.


Using Apache Spark For High Speed, Large Scale Data Processing


Apache Spark is an open-source framework used for large-scale data processing. The framework is made up of many components, including four programming APIs and four major libraries. Since Spark’s release in 2014, it has become one of Apache’s fastest growing and most widely used projects of all time.

Spark uses an in-memory processing paradigm to speed up computation and run programs 10 to 100 times faster than other big data technologies like Hadoop MapReduce. According to the 2016 Apache Spark Survey, more than 900 companies, including IBM, Google, Netflix, Amazon, Microsoft, Intel, and Yahoo, use Spark in production for data processing and querying.

Apache Spark is important to the big data field because it represents the next generation of big data processing engines and is a natural successor to MapReduce. One of Spark’s advantages is that its use of four programming APIs — Scala, Python, R, and Java 8 — allows the user flexibility to work in the language of their choice. This makes the tool much more accessible to a wide range of programmers with different capabilities. Spark also has great flexibility in its ability to read all types of data from various locations such as Hadoop Distributed File Storage (HDFS), Amazon’s web-based Simple Storage Service (S3), or even the local filesystem.

Production-Ready and Scalable

Spark’s greatest advantage is that it maximizes the capabilities of data science’s most expensive resource: the data scientist. Computers and programs have become so fast that we are no longer limited by what they can do as much as by human productivity. Because Spark provides a flexible language platform and concise syntax, data scientists can write more programs, iterate through them, and run them much more quickly. The code is production-ready and scalable, so there’s no need to hand off code requirements to a development team for changes.

It takes only a few minutes to write a word-count program in Spark, but much longer to write the same program in Java. Because the Spark code is so much shorter, there’s less need to debug or use version control tools.

Spark’s concise syntax can best be illustrated with the following examples. The Spark code is only four lines compared with almost 58 for Java.

Java vs. Spark
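To make the comparison concrete, here is a local sketch of the word-count logic in plain Python — no Spark cluster required. The commented chain at the top is the standard short PySpark version for reference; the executable lines underneath run the same flatMap → map → reduce steps locally, which is why the Spark code stays so compact compared with hand-written Java.

```python
from collections import Counter

# The PySpark version is essentially:
#   counts = (sc.textFile("input.txt")
#               .flatMap(lambda line: line.split())
#               .map(lambda word: (word, 1))
#               .reduceByKey(lambda a, b: a + b))
#
# The same steps, run locally in plain Python:

lines = ["to be or not to be"]
words = [w for line in lines for w in line.split()]   # flatMap: lines -> words
counts = Counter(words)                               # map + reduceByKey combined
print(counts["to"])   # 2
```

Spark distributes exactly these functional primitives across a cluster, so the program’s length barely grows as the data does.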

Faster Processing

Spark utilizes in-memory processing to speed up applications. Older big data frameworks, such as Hadoop, use many intermediate disk reads and writes to accomplish the same task. For small jobs on several gigabytes of data, this difference is not as pronounced, but for machine learning applications and more complex tasks such as natural language processing, the difference can be tremendous. Logistic regression, a technique taught in all of General Assembly’s full- and part-time data science courses, can be sped up over 100x.

Spark has four key libraries that also make it much more accessible and provide a wider set of tools for people to use. Spark SQL is ideal for leveraging SQL skills or working with data frames; Spark Streaming has functions for data processing, useful if you need to process data in near real time; and GraphX has prewritten algorithms that are useful if you have graph data or need to do graph processing. The library most useful to students in our Data Science Immersive, though, is the Spark MLlib machine learning library, which has prewritten distributed machine learning algorithms for use on data frames.

Spark at General Assembly

At GA, we teach both the concepts and the tools of data science. Because hiring managers from marketing, technology, and biotech companies, as well as guest speakers like company founders and entrepreneurs, regularly talk about using Spark, we’ve incorporated it into the curriculum to ensure students are fluent in the field’s most relevant skills. I teach Spark as part of our Data Science Immersive (DSI) course in Boston, and I previously taught two Spark courses for Cloudera and IBM. Spark is a great tool to teach because the general curriculum focuses mostly on Python, and Spark has a Python API/library called PySpark.

When we teach Spark in DSI, we cover resilient distributed data sets, directed acyclic graphs, closures, lazy execution, and reading JavaScript Object Notation (JSON), a common big data file format.

Meet Our Expert

Joseph Kambourakis has over 10 years of teaching experience and over five years of experience teaching data science and analytics. He has taught in more than a dozen countries and has been featured in Japanese and Saudi Arabian press. He holds a bachelor’s degree in electrical and computer engineering from Worcester Polytechnic Institute and an MBA with a focus in analytics from Bentley University. He is a passionate Arsenal FC supporter and competitive Magic: The Gathering player. He currently lives with his wife and daughter in Needham, Massachusetts.

“GA students come to class motivated to learn. Throughout the Data Science Immersive course, I keep them on their path by being patient and setting up ideas in a simple way, then letting them learn from hands-on lab work.”

Joseph Kambourakis, Data Science Instructor, General Assembly Boston

How Data Maps Reveal Inequality and Equity in Atlanta


Housing Map of Atlanta provided by Neighborhood Nexus.

Mapping the communities of tomorrow requires a hard look at the topographies of today. Mike Carnathan, project director at Neighborhood Nexus, synthesizes big data into visual stories that chart the social, political, and economic conditions across the city of Atlanta. Part data miner, part cultural cartographer, Carnathan creates demographic maps that local leaders, advocates, and everyday citizens use to help understand and change their lives.


Measuring What Matters: General Assembly’s First Student Outcomes Report



Since founding General Assembly in 2011, I’ve heard some incredible stories from our students and graduates. One of my favorites is about Jerome Hardaway. Jerome came to GA after five years in the United States Air Force. He dreamed of tackling persistent diversity gaps in the technology sector by breaking down barriers for other veterans and people of color.

In 2014, with the help of General Assembly’s Opportunity Fund scholarship, Jerome began one of our full-time Web Development Immersive courses. After graduation, he had the opportunity to pitch President Obama at the first-ever White House Demo Day and has launched a nonprofit in Nashville, Vets Who Code, which helps veterans navigate the transition to civilian life through technology skills training.

Exceptional stories like Jerome’s embody GA’s mission of “empowering people to pursue the work they love.” It’s a mission that motivates our instructional designers, faculty, mentors, and career coaches. It also inspired the development of an open source reporting framework which defined GA’s approach to measuring student outcomes and now, our first report with verified student outcomes metrics.


The Skills and Tools Every Data Scientist Must Master



Photo by WOC in Tech.

“Data scientist” is one of today’s hottest jobs.

In fact, Glassdoor calls it the best job of 2017, with a median base salary of $110,000. This fact shouldn’t be big news. In 2011, McKinsey predicted there would be a shortage of 1.5 million managers and analysts “with the know-how to use the analysis of big data to make effective decisions.” Today, there are more than 38,000 data scientist positions listed on Glassdoor.com.

It makes perfect sense that this job is both new and popular, since every move you make online creates data somewhere for something. Someone has to make sense of that data, discover the trends it contains, and determine whether it’s useful. That is the job of the data scientist. But how does the data scientist go about the job? Here are the three skills and three tools that every data scientist should master.


Announcing General Assembly’s New Data Science Immersive



Data science is “one of the hottest and best-paid professions in the U.S.” More than ever, companies need analytical minds who can compile data, analyze it, and drive everything from marketing forecasts to product launches with compelling predictions. Their work drives the core strategies of modern business — so much so that, by 2018, data-related job openings will total 1.5 million. That’s why we’ve worked hard to develop classes, workshops, and courses to confront the data science skills gap. The latest addition to our proud family of data education is the new Data Science Immersive program.

Launching for the first time in San Francisco and Washington, D.C. on April 11, this full-time Immersive program will equip you with the tools and techniques you need to become a data pro in just 12 weeks.


What It Means to Be Data Literate



The Data Journalism Handbook defines data literacy as “the ability to consume for knowledge, produce coherently and think critically about data.” It goes on to say that “data literacy includes statistical literacy but also understanding how to work with large data sets, how they were produced, how to connect various data sets and how to interpret them.”

At General Assembly, we’d like to imagine a world where you don’t need a Ph.D. in Statistics to have a data-informed conversation about your business, your health, or your life in general. Over the past year, we’ve embarked on the journey to build a more data literate world through education offerings that meet the diverse needs of our students.

In building these courses, we’ve sought advice from data scientists, analysts, and hiring managers to determine the critical skills you need to become data literate in today’s workforce. We discovered that it isn’t just a concrete list of skills, but a mindset geared towards data—a way of approaching problems beyond “gut instincts.” 

Here, we’ve proposed a few simple questions that will help you start to view the world through the lens of data.


3 Ways that Data Affects Mass Media



If you thought the introduction of the commercial Internet changed mass media, take a look at what’s in front of you today. Behind the sites of your favorite newspapers and blogs (yes, even this one), publishers are using data to create better audience experiences. For anyone who has ever considered working with data as part of their career, there are now more opportunities than ever to bring media and data together. Here are some of the most important technologies to have on your radar.


How Can UX Design Make Sense of Big Data?



Big data is just what it sounds like: data so big that it’s not easily processed through conventional methods. However, once this large data set is eventually distilled down, user experience can play a huge role in making sense of the reports and leading the charge for user-centered solutions.

User experience (UX) is the bridge between big data analytics and the end user. The richness of big data being collected by all types of companies has unleashed a treasure trove of information for user experience designers. UX designers can create more robust solutions for users by analyzing these enormous data sets.
