George McIntire, Author at General Assembly Blog

15 Data Science Projects to get you Started

When it comes to getting a job in data science, data scientists need to think like Creatives. Yes, that’s correct. Those looking to enter this field need to have a data science portfolio of previously completed data science projects, similar to those in Creative professions. What better way to prove to your future data science team that you’re capable of being a data scientist than proving you can do the work?

A common problem for data science entrants is that employers want candidates with experience, but how do you get experience without first being given the chance to gain it? If you’re looking to get that first foot in the door, it will behoove you to undertake a couple of data science projects to show future employers you’ve got what it takes to use big data to identify opportunities and succeed in the field.

The good news is that we live in a time of open and abundant data. Websites like Kaggle offer a treasure trove of free data on everything from crime statistics to Pokemon to Bitcoin and more. However, the wealth of easily accessible data can be overwhelming, which is why we’ve taken it upon ourselves to present 15 data science projects you can execute in Python to showcase and improve your skills in data analytics. Our data science project ideas cover various topics, from Spotify songs to fake news to fraud detection, and techniques such as clustering, regression, and natural language processing.

Before you dive in, be sure to adhere to these four guidelines no matter which data science project idea you choose:

1. Articulate the Problem and/or Scenario

It’s not enough to do a project where you use “X” to predict “Y”; you need to add some context to your work because data science does not occur in a vacuum. Tell us what you’re trying to solve and how data science can address that. Employers want to know if you can turn a problem into a question and a question into a solution. A good place to start is to depict a real-world scenario in which your data project would be useful.

2. Publish & Explain Your Work

Create a GitHub repository where you can upload your Jupyter Notebooks and data. Write a blog post in which you narrate your project from start to finish. Talk about the problem or question at the heart of the project, and explain your decision to clean the data in a certain way or why you decided to use a certain algorithm. Why all this? Potential employers need to understand your methodology.

3. Use Domain Expertise

If you’re trying to break into a specific field such as finance, health, or sports, use your knowledge of this area to enhance your project. This could mean deriving a useful question to a pressing problem or articulating a well-thought-out interpretation of your project’s results. For example, if you’re looking to become a data scientist in the finance sector, it would be worthwhile to show how your methods can generate a return on investment.

4. Be Creative & Different

Anyone can copy and paste code that trains a machine learning algorithm. If you want to stand out, review existing data science projects that use the same data and fill in the gaps left by them. If you’re working on a prediction project, try coming up with an unexpected variable that you think would be beneficial.

Data Science Projects

1. Titanic Data

Working on the Titanic dataset is a rite of passage in data science. It’s a useful dataset that beginners can work with to improve their feature engineering and classification skills. Try using a decision tree to visualize the relationships between the features and the probability of surviving the Titanic.

2. Spotify Data

Spotify has an amazing API that provides access to rich data on their entire catalog of songs. You can grab cool attributes such as a song’s acousticness, danceability, and energy. The great thing about this data source is that the project possibilities are almost endless. You can use these features to try to predict genre or popularity. One fun idea would be to better understand your own taste in music by training a machine learning classifier on two sets of songs: songs you like and songs you do not.

3. Personality Data Clustering

You’ve probably heard the phrase, “There are X types of people.” Well, now you can find out how many types there really are. Take this dataset of almost 20k responses to the Big Five Personality Test, throw it into a clustering algorithm such as KMeans, and sort the respondents into K groups. Once you decide on the optimal number of clusters, it’s incumbent on you to define each cluster. Come up with labels that add meaning to each group, and don’t be afraid to use plenty of charts and graphs to support your interpretation.
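
To make the clustering idea concrete, here is a minimal sketch of the KMeans loop in plain Python. The respondents and their two trait scores are made up for illustration, and the "first k points" starting rule is a simplification chosen to keep the example deterministic; a real project would more likely use a library implementation on the full dataset.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    # Simplifying assumption: start from the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical respondents: (extraversion, conscientiousness) on a 1-5 scale.
responses = [
    (4.5, 2.0), (1.5, 4.5), (4.8, 1.8), (1.2, 4.8), (4.2, 2.2), (1.8, 4.2),
]
labels, centroids = kmeans(responses, k=2)
print(labels)     # cluster index for each respondent
print(centroids)  # the "average person" in each cluster
```

The centroids are what you would inspect when naming each group: a cluster whose centroid is high on extraversion and low on conscientiousness tells a different story than the reverse.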

4. Fake News

If you are interested in natural language processing, building a classifier to differentiate between fake and real news is a great way to demonstrate that. Fake news is a problem that social media platforms have been struggling with for the past several years and a project that tackles this problem is a great way to show you care about solving real-world problems. Use your classifier to identify interesting insights about the patterns in fake versus real news; for example, tell us which words or phrases are most associated with fake news articles.
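
One simple way to surface the words most associated with fake news is to compare how often each word appears in the two classes. The sketch below does this with a smoothed log ratio of word frequencies. The headlines are invented for illustration, and this scoring rule is just one reasonable choice, not the only one.

```python
from collections import Counter
import math

# Hypothetical, hand-made headlines for illustration only.
fake = [
    "shocking secret cure doctors hate",
    "you won't believe this shocking miracle",
]
real = [
    "city council approves budget",
    "senate approves new budget bill",
]

def word_counts(docs):
    """Count lowercase, whitespace-separated words across all documents."""
    return Counter(word for doc in docs for word in doc.lower().split())

fake_counts, real_counts = word_counts(fake), word_counts(real)
vocab = set(fake_counts) | set(real_counts)
fake_total, real_total = sum(fake_counts.values()), sum(real_counts.values())

def log_ratio(word):
    """Positive scores lean fake, negative scores lean real."""
    # Add-one smoothing keeps words missing from one class from dividing by zero.
    p_fake = (fake_counts[word] + 1) / (fake_total + len(vocab))
    p_real = (real_counts[word] + 1) / (real_total + len(vocab))
    return math.log(p_fake / p_real)

ranked = sorted(vocab, key=log_ratio, reverse=True)
print(ranked[:3])   # words leaning most toward fake headlines
print(ranked[-3:])  # words leaning most toward real headlines
```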

5. COVID-19 Dataset

There probably isn’t a more relevant use of data science than a project analyzing COVID-19. This dataset provides a wealth of information related to the pandemic. It provides a great opportunity to show off your exploratory data analysis chops. Take a deep dive into this data, and through data visualization unearth patterns about the rate of COVID infection by county, state, and country.

6. Telco Customer Churn

If you’re looking for a straightforward project that is extremely applicable to the business world, then this one’s for you. Use this dataset to train a classifier that predicts customer churn. If you can show employers you know how to prevent customers from leaving their business, you’ll most definitely grab their attention. Pro tip: this is a great project to show your understanding of classification metrics besides accuracy, such as precision and recall.

7. Lending Club Loans

Like the Telco project, the Lending Club loan dataset is extremely relevant to the business world. Here you can train a classifier that predicts whether or not a Lending Club borrower will pay back a loan using a wealth of information such as credit score, loan amount, and loan purpose. There are a lot of variables at your disposal, so I’d recommend starting with a handful of features and working your way up from there. See how far you can get with just the basics.

Also, this is a fairly untidy dataset that will require extensive cleaning and feature engineering, which is a good thing because that is often the case with real-world data. Be sure to explain your methodology behind preparing your dataset for the machine learning algorithm — this informs the audience of your domain expertise.

8. Breast Cancer Detection

This dataset provides a simpler classification scenario in which you can use health-related variables to predict instances of breast cancer. If you’re looking to apply your data science skills to the medical field, this is certainly worth a shot.

9. Housing Regression

If classification isn’t your thing, then I’d recommend this ready-made regression project, in which you can predict home prices using variables like square footage, number of bedrooms, and year built. A project such as this can help you understand the factors driving home sales and let you get creative in your feature engineering. Try to involve outside data that can serve as proxies for quality of life, education, and other things that might influence home prices. And if you want to show off your scraping skills, you can always create your own dataset by scraping Zillow.

10. Seeds Clustering

The seeds dataset from UCI provides a simple opportunity to use clustering. Use the seven attributes to sort the 210 seeds into K number of groups. If you’re looking to go beyond KMeans, try using hierarchical clustering, which can be useful for this dataset because the low number of samples can be easily visualized with a dendrogram.

11. Credit Card Fraud Detection

Another project idea for those of you intent on using business world data is to train a classifier to predict instances of credit card fraud. The value of this project to you comes from the fact that it’s an imbalanced dataset, meaning that one class vastly outweighs the other (in this case, non-fraudulent transactions versus fraudulent). Because fraud is so rare, a model that labels every transaction as non-fraudulent can be 99% accurate while catching no fraud at all, which makes it essentially useless. It’s up to you to use metrics other than accuracy to demonstrate the success of your model.
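
Here is a small sketch of why accuracy misleads on imbalanced data. The labels are invented (10 frauds in 1,000 transactions), and the two "models" are hard-coded prediction lists rather than trained classifiers, so treat the numbers as an illustration only.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels where 1 = fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate, left alone
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical data: the first 10 transactions are fraud, the other 990 are not.
y_true = [1] * 10 + [0] * 990

# "Model" A never flags anything.
always_legit = [0] * 1000
# "Model" B catches 8 of the 10 frauds but raises 20 false alarms.
flagger = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

print(classification_metrics(y_true, always_legit))  # high accuracy, zero recall
print(classification_metrics(y_true, flagger))       # lower accuracy, recall of 0.8
```

Model B scores lower on accuracy than model A, yet it is the only one of the two that catches any fraud, which is exactly what precision and recall reveal.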

12. AutoMPG

This is a great beginner regression project in which you can use car features to predict their fuel efficiency. Given that this data is from the past, an interesting idea you can use is to see how well this model does on data from recent cars to show how car fuel efficiency has evolved over the years.
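
As a rough sketch of that idea, the snippet below fits a one-variable least-squares line to older cars and then checks how far off it is on newer ones. Every number here is made up for illustration (and the old cars are deliberately perfectly linear to keep the arithmetic easy to follow); the real dataset has more features and plenty of noise.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for a single feature: y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical older cars: weight in thousands of pounds, and miles per gallon.
old_weights = [2.0, 2.5, 3.0, 3.5, 4.0]
old_mpg = [34.0, 30.0, 26.0, 22.0, 18.0]
slope, intercept = fit_line(old_weights, old_mpg)

# Hypothetical recent cars of similar weight but better fuel efficiency.
recent = [(3.0, 33.0), (3.5, 30.0)]
gaps = [actual - (slope * w + intercept) for w, actual in recent]
mean_gap = mean(gaps)

print(slope, intercept)  # heavier cars get fewer miles per gallon
print(mean_gap)          # how much the old model underestimates recent cars
```

A consistently positive gap would suggest that cars of the same weight have become more efficient since the training data was collected.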

13. World Happiness

Using data science to unlock what’s behind happiness? Maybe you can with this dataset on world happiness rankings. You can go a number of ways with this project; you can use regression to predict happiness scores, cluster countries based on socio-economic characteristics, or visualize the change in happiness throughout the world from 2015 to 2019.

14. Political Identity

The Nationscape Data Set is an absolute goldmine of data on the demographics and political identities of Americans. If you’re a politics junkie, it’s sure to give you your fix. Their most recent round of data features over 300,000 instances of data collected from extensive surveys of Americans. If you’re interested in using demographic information to predict political ideology or party identification, this is the dataset for you. This is an especially great project to flex your domain expertise in study design, research, and conclusion. Political analysis is replete with shoddy interpretations that lack empirical data analysis, and you could use this dataset to either confirm or dispel them. But be warned that this data will require plenty of cleaning, which you’ll need to get used to, given that’s the majority of the job.

15. Box Office Prediction

If you’re a movie buff, then we’ve got you covered with the TMDB dataset. See if you can build a workable box office revenue prediction model trained on 5,000 movies’ worth of data. Does genre actually correlate with box office success? Can we use runtime and language to help explain the variation in revenue? Find out the answers to those questions and more with this project.

What is Data Science?

It’s been anointed “the sexiest job of the 21st century”, companies are rushing to invest billions of dollars into it, and it’s going to change the world — but what do people mean when they mention “data science”? There’s been a lot of hype about data science and deservedly so, but the excitement has helped obfuscate the fundamental identity of the field. Anyone looking to involve themselves in data science needs to understand what it actually is and is not.

In this article, we’ll lay out a deep definition of the field, a complete description of the data science workflow, and the data science tasks used in the real world. We hope that any would-be entrants into this line of work will come away from this article with a nuanced understanding of data science that can help them decide whether to enter, and how to navigate, this exciting field.

So What Actually is Data Science?

A quick definition of data science might be articulated as an interdisciplinary field that primarily uses statistics and computer programming to derive insights from, and base decisions on, a collection of information represented as numerical figures. The “science” part in data science is quite apt because data science very much follows a scientific process that involves formulating a hypothesis and using a specific toolset to confirm or dispel that hypothesis. At the end of the day, data science is about turning a problem into a question and a question into an answer and/or solution.

Tackling the meaning of data science also means interrogating the meaning of data. Data can be easily described as “information encoded as numbers” but that doesn’t tell us why it’s important. The value of data stems from the notion that data is a tangible manifestation of the intangible. Data provides solid support to aid our interpretations of the world. For example, a weather app can tell you it’s cold outside, but telling you that the temperature is 38 degrees Fahrenheit provides you with a stronger and more specific understanding of the weather.

Data comes in two forms: qualitative and quantitative.

Qualitative data is categorical data that does not naturally come in the form of numbers, such as demographic labels that you can select on a census form to indicate gender, state, and ethnicity.

Quantitative data is numerical data that can be processed through mathematical functions; for example stock prices, sports stats, and biometric information.

Quantitative data can be subdivided into smaller categories such as ordinal, discrete, and continuous.

Ordinal: A sort of qualitative and quantitative hybrid variable in which the values have a hierarchical ranking. Any sort of star rating system of reviews is a perfect example of this; we know that a four-star review is greater than a three-star review, but can’t say for sure that a four-star review is twice as good as a two-star review.

Discrete: These are countable values that often appear in the form of integers. Examples include the number of franchises owned by a company and the number of votes cast in an election. It’s important to remember that counts like these take only whole-number values and can never be negative.

Continuous: Unlike discrete variables, continuous variables can appear in decimal form and have an infinite range of possible values. Things like company profit, temperature, and weight can all be described as continuous.

What Does Data Science Look Like?

Now that we’ve established a base understanding of data science, it’s time to delve into what data science actually looks like. To answer this question, we need to go over the data science workflow, which encapsulates what a data science project looks like from start to finish. We’ll touch on typical questions at the heart of data science projects and then examine an example data science workflow to see how data science was used to achieve success.

The Data Science Checklist

A good data science project is one that satisfies the following criteria:

Specificity: Derive a hypothesis and/or question that’s specific and to the point. Having a vague approach can often lead to a waste of time with no end product.

Attainability: Can your questions be answered? Do you have access to the required data? It’s easy to come up with an interesting question but if it can’t be answered then it has no value. The same goes for data, which is only useful if you can get your hands on it.

Measurability: Can what you’re applying data science to be quantified? Can the problem you’re addressing be represented in numerical form? Are there quantifiable benchmarks for success? 

As previously mentioned, a core aspect of data science is the process of deriving a question, especially one that is specific and achievable. Typical data science questions ask things like “Does X predict Y?” and “What are the distinct groups in our data?” To get a sense of data science questions, let’s take a look at some business-world-appropriate ones:

  • What is the likelihood that a customer will buy this product?
  • Did we observe an increase in sales after implementing a new policy?
  • Is this a good or bad review?
  • How much demand will there be for my service tomorrow?
  • Is this the cheapest way to deliver our goods?
  • Is there a better way to segment our marketing strategies?
  • What groups of products are customers purchasing together?
  • Can we automate this simple yes/no decision?

All eight of these questions are excellent examples of how businesses use data science to advance themselves. Each question addresses a problem or issue in a way that can be answered using data science.

The Data Science Workflow

Once we’ve established our hypothesis and questions, we can move on to what I like to call the data science workflow, a step-by-step description of a typical data science project process.

After asking a question, the next steps are:

  1. Get and Understand the Data. We obviously need to acquire data for our project, but sometimes that can be more difficult than expected if you need to scrape for it or if privacy issues are involved. Make sure you understand how the data was sampled and the population it represents. This will be crucial in the interpretation of your results.
  2. Data Cleaning and Exploration. The dirty secret of data science is that data is often quite dirty, so you can expect to do significant cleaning, which often involves constructing your variables in a way that makes your project doable. Get to know your data through exploratory data analysis. Establish a base understanding of the patterns in your dataset through charts and graphs.
  3. Modeling. This represents the main course of the data science process; it’s where you get to use the fancy powerful tools. In this part, you build a model that can help you answer a question such as “Can we predict future sales of a product from our dataset?”
  4. Presentation. Now it’s time to present the results of your findings. Did you confirm or dispel your hypothesis? What are the answers to the questions you started off with? How do your results advance our understanding of the issue at hand? Articulate your project in a clear and concise manner that makes it digestible for your audience, which could be another team in your company or your company’s executives.

Data Science Workflow Example: Predicting Neonatal Infection

Now let’s parse out an example of how data science can have a meaningful real-world impact, taken from the book Big Data: A Revolution That Will Transform How We Live, Work, and Think.

We start with a problem: Children born prematurely are at high risk of developing infections, many of which are not detected until after a child is sick.

Then we turn that problem into a question: Can we detect patterns in the data that accurately predict infection before it occurs?

Next, we gather relevant data: variables such as heart rate, respiration rate, blood pressure, and more.

Then we decide on the appropriate tool: a machine learning model that uses past data to predict future outcomes.

Finally, what impact do our methods have? The model is able to predict the onset of infection before symptoms appear, thus allowing doctors to administer treatment earlier in the infection process and increasing the chances of survival for patients.

This is a fantastic example of data science in action because every step in the process has a clear and easily understandable function towards a beneficial outcome.

Data Science Tasks

Data scientists are basically Swiss Army knives, in that they possess a wide range of abilities — it’s why they’re so valuable. Let’s go over the specific tasks that data scientists typically perform on the job.

Data acquisition: For data scientists, this usually involves querying databases set up by their companies to provide easy access to reams of data. Data scientists frequently write SQL queries to retrieve data. Outside of querying databases, data scientists can use APIs or web scraping to acquire data.
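
For a feel of what querying a database looks like, here is a minimal sketch using Python’s built-in sqlite3 module. The in-memory database, the orders table, and its rows are all hypothetical stand-ins for a company database.

```python
import sqlite3

# An in-memory database stands in for a real company database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ben", 50.0), ("ana", 15.0)],
)

# A typical analyst query: total spend per customer, biggest spenders first.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

print(rows)
```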

Data cleaning: We touched on this before, but it can’t be emphasized enough that data cleaning will take up the vast majority of your time. Cleaning often means dealing with null values, dropping irrelevant variables, and feature engineering, which means transforming data so that it can be processed by a model.
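
The three cleaning steps named above can be sketched on a handful of records. The field names and values are invented, and filling missing values with the median is only one of several reasonable strategies.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "income": 52000, "signup": "2021-03-14", "notes": "n/a"},
    {"id": 2, "income": None, "signup": "2020-11-02", "notes": ""},
    {"id": 3, "income": 61000, "signup": "2022-07-30", "notes": "call"},
]

# Null values: fill missing incomes with the median of the observed ones.
observed = [r["income"] for r in raw if r["income"] is not None]
fill_value = median(observed)

cleaned = []
for r in raw:
    cleaned.append({
        "id": r["id"],
        "income": r["income"] if r["income"] is not None else fill_value,
        # Feature engineering: turn the date string into a numeric year.
        "signup_year": int(r["signup"][:4]),
        # Irrelevant variable: the free-text "notes" field is simply not copied.
    })

print(cleaned)
```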

Data visualization: Crafting and presenting visually appealing and understandable charts is a hugely valuable skill. Visualization has an uncanny ability to communicate important bits of information from a mass of data. Good data scientists will use data visualization to help themselves and their audiences better understand what’s going on.

Statistical analysis: Statistical tests are used to confirm and/or dispel a data scientist’s hypothesis. Tests such as the t-test or the chi-square test are used to evaluate the existence of certain relationships. A/B testing is a popular use case of statistical analysis; if a team wants to know which of two website designs leads to more clicks, then an A/B test is the right solution.
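
A website A/B test like the one described can be analyzed with a two-proportion z-test, sketched below using only the standard library. The click counts are hypothetical, and this particular test is one common choice that assumes reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference between two click-through rates."""
    rate_a, rate_b = clicks_a / n_a, clicks_b / n_b
    # Pooled rate under the null hypothesis that both designs perform the same.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical results: design A got 100 clicks from 1,000 visitors,
# design B got 130 clicks from 1,000 visitors.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(z, p_value)
```

With these made-up numbers the p-value falls below the conventional 0.05 threshold, so the team would conclude that the difference between the designs is unlikely to be due to chance alone.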

Machine learning: This is where data scientists use models that make predictions based on past observations. If a bank wants to know which customers are likely to pay back loans, then they can use a machine learning model trained on past loans to answer that question.

Computer science: Data scientists need adequate computer programming skills because many of the tasks they undertake involve writing code. In addition, some data science roles require data scientists to function as software engineers because data scientists have to implement their methodologies into their company’s backend servers.

Communication: You can be a math and computer whiz, but if you can’t explain your work to a novice audience, your talents might as well be useless. A great data scientist can distill digestible insights from complex analyses for a non-technical audience, translating how a p-value or correlation score is relevant to a part of the company’s business. If your company is going to make a potentially costly or lucrative decision based on your data science work, then it’s incumbent on you to make sure they understand your process and results as much as possible.

Conclusion

We hope this article helped to demystify this exciting and increasingly important line of work. Anyone who’s curious about data science, whether a college student or an executive thinking about hiring a data science team, should understand what this field is about and what it can and cannot do.
