When it comes to getting a job in data science, aspiring data scientists need to act like artists: anyone looking to enter this field needs a portfolio of completed data science projects. What better way to prove to your future data science team that you’re capable of being a data scientist than to show you can do the work?
A common problem for data science entrants is that employers want candidates with experience, but how do you get experience without already having it? If you’re looking to get that first foot in the door, it would behoove you to undertake a couple of data science projects to show future employers you’ve got what it takes to use big data to identify opportunities and succeed in the field.
The good news is that we live in a time of open and abundant data. Websites like Kaggle offer a treasure trove of free data on everything from crime statistics to Pokemon to Bitcoin and more. However, the wealth of easily accessible data can be overwhelming, which is why we’ve taken it upon ourselves to present 15 data science projects you can execute in Python to showcase and improve your skills. Our diverse collection of project ideas covers a variety of topics from Spotify songs to fake news to fraud detection and techniques such as clustering, regression, and natural language processing.
Before you dive in, be sure to adhere to these four guidelines no matter which data science projects you choose:
1. Articulate the Problem and/or Scenario
It’s not enough to do a project where you use “X” to predict “Y”; you need to add some context to your work because data science does not occur in a vacuum. Tell us what you’re trying to solve and how data science can address that. Employers want to know if you can turn a problem into a question and a question into a solution. A good place to start is to depict a real-world scenario in which your project would be useful.
2. Publish and Explain Your Work
Create a GitHub repository where you can upload your Jupyter Notebooks and data. Write a blog post in which you narrate your project from start to finish, talk about the problem or question at the heart of the project, explain your decision to clean the data in a certain way or why you decided to use a certain algorithm. Potential employers need to understand your methodology.
3. Use Domain Expertise
If you’re trying to break into a specific field such as finance, health, or sports, use your knowledge of this area to enhance your project. This could mean deriving a useful question to a pressing problem or articulating a well-thought-out interpretation of your project’s results. For example, if you’re looking to become a data scientist in the finance sector, then it would be worthwhile to show how your methods can generate a return on investment.
4. Be Creative and Different
Anyone can copy and paste code that trains a machine learning algorithm. If you want to stand out, review existing data science projects that use the same data and fill in the gaps left by them. If you’re working on a prediction project, try coming up with an unexpected variable that you think would be beneficial.
Data Science Projects
1. Titanic Data
Working on the Titanic dataset is a rite of passage in data science. It’s a useful dataset that beginners can work with to improve their feature engineering and classification skills. Try using a decision tree so you can visualize the relationships between the features and the probability of surviving the Titanic.
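As a rough sketch of the decision tree idea, the snippet below fits a shallow tree and prints its splits as readable rules. The eight rows are invented for illustration and are not real passenger records, and the column names (is_female, pclass, age, survived) are assumptions rather than the dataset's actual schema.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy rows for illustration only; these are NOT real passenger records.
toy = pd.DataFrame({
    "is_female": [0, 0, 0, 0, 1, 1, 1, 1],
    "pclass":    [3, 1, 2, 3, 1, 3, 2, 1],
    "age":       [22, 54, 30, 40, 38, 26, 35, 58],
    "survived":  [0, 0, 0, 0, 1, 1, 1, 1],
})

features = ["is_female", "pclass", "age"]
X, y = toy[features], toy["survived"]

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text renders the learned splits as indented if/else rules.
rules = export_text(tree, feature_names=features)
print(rules)
```

On real data the tree will be messier than this deliberately simplistic example, so limiting max_depth is one way to keep the visualization interpretable.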
2. Spotify Data
Spotify has an amazing API that provides access to rich data on their entire catalog of songs. You can grab cool attributes such as a song’s acousticness, danceability, and energy. The great thing about this data source is that the project possibilities are almost endless. You can use these features to try to predict genre or popularity. One fun idea would be to better understand your own taste in music by training a machine learning classifier on two sets of songs: songs you like and songs you do not.
3. Personality Data Clustering
You’ve probably heard the phrase, “There are X types of people.” Well, now you can find out how many types of people there really are, using this dataset of almost 20k responses to the Big Five Personality Test. Feed this data into a clustering algorithm such as KMeans and sort the responses into K groups. Once you decide on the optimal number of clusters, it’s incumbent on you to define each cluster. Come up with labels that add meaning to each group and don’t be afraid to use plenty of charts and graphs to support your interpretation.
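One common way to decide on the number of clusters is to compare silhouette scores across several values of K. The sketch below assumes synthetic data with five columns standing in for the five trait scores. The three planted groups are an artifact of the demo, not a claim about the real survey.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for survey scores: 5 columns, one per Big Five trait.
# The three planted groups are an assumption of this demo, not a finding.
centers = [[0, 0, 0, 0, 0], [10, 10, 0, 0, 10], [-10, 10, 10, -10, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit KMeans for each candidate K and record the silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the highest silhouette score is one reasonable choice.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Real survey responses rarely separate this cleanly, so treat the silhouette score as one input alongside how interpretable the resulting clusters are.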
4. Fake News
If you have an interest in natural language processing, building a classifier to differentiate between fake and real news is a great way to demonstrate that. Fake news is a problem that social media platforms have been struggling with for the past several years and a project that tackles this problem is a great way to show you care about solving real-world problems. Use your classifier to identify interesting insights about the patterns in fake versus real news; for example, tell us which words or phrases are most associated with fake news articles.
5. COVID-19 Dataset
There probably isn’t a more relevant use of data science than a project analyzing COVID-19. This dataset provides a wealth of information related to the pandemic and a great opportunity to show off your exploratory data analysis chops. Take a deep dive into this data and use data visualization to unearth patterns in the rate of Covid infection by county, state, and country.
6. Telco Customer Churn
If you’re looking for a straightforward project that is extremely applicable to the business world, then this one’s for you. Use this dataset to train a classifier that predicts customer churn. If you can show employers you know how to prevent customers from leaving their business, you’ll most definitely grab their attention. Pro tip: this is a great project to show your understanding of classification metrics besides accuracy, such as precision and recall.
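The difference between accuracy, precision, and recall is easiest to see on a small example. The labels below are hypothetical predictions for ten customers, made up to keep the arithmetic checkable by hand.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for ten customers: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are right
precision = precision_score(y_true, y_pred)  # of predicted churners, how many churned
recall = recall_score(y_true, y_pred)        # of actual churners, how many were caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.70 precision=0.67 recall=0.50
```

Here the model looks decent on accuracy but misses half of the customers who actually churned, which is the kind of gap worth calling out in a write-up.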
7. Lending Club Loans
Like the Telco project, the Lending Club loan dataset is extremely relevant to the business world. Here you can train a classifier that predicts whether or not a Lending Club borrower will pay back a loan using a wealth of information such as credit score, loan amount, and loan purpose. There are a lot of variables at your disposal, so I’d recommend starting with a handful of features and working your way up from there. See how far you can get with just the basics.
Also, this is a fairly untidy dataset that will require extensive cleaning and feature engineering, which is a good thing because that is often the case with real-world data. Be sure to explain your methodology behind preparing your dataset for the machine learning algorithm, since this shows the audience your domain expertise.
8. Breast Cancer Detection
This dataset provides a simpler classification scenario in which you can use health-related variables to predict instances of breast cancer. If you’re looking to apply your data science skills to the medical field, this is certainly worth a shot.
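A minimal sketch of this kind of classifier is shown below. It uses the breast cancer dataset bundled with scikit-learn, which is an assumption on my part and may not be the same file the project has in mind. The scaled logistic regression is just one reasonable baseline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, stratify=data.target, random_state=0
)

# Scale the features, then fit a simple linear classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
# In this bundled dataset, target 0 means malignant and 1 means benign.
malignant_recall = recall_score(y_test, model.predict(X_test), pos_label=0)
print(f"accuracy={accuracy:.3f} malignant recall={malignant_recall:.3f}")
```

In a medical setting, recall on the malignant class is usually worth reporting alongside accuracy, since a missed cancer is costlier than a false alarm.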
9. Housing Regression
If classification isn’t your thing, then might I recommend this ready-made regression project in which you can predict home prices using variables like square footage, number of bedrooms, and year built. A project such as this can help you understand the factors driving home sales and let you get creative in your feature engineering. Try to involve outside data that can serve as proxies for quality of life, education, and other things that might influence home prices. And if you want to show off your scraping skills, then you can always create your own dataset by scraping Zillow.
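The basic regression workflow might look like the sketch below. The data is synthetic and the pricing formula is invented purely to generate demo rows, so the numbers say nothing about real housing markets. The column choices mirror the variables mentioned above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
sqft = rng.uniform(600, 3500, n)
bedrooms = rng.integers(1, 6, n)
year_built = rng.integers(1950, 2021, n)

# The pricing formula below is invented purely to generate demo data.
price = (
    150 * sqft
    + 10_000 * bedrooms
    + 500 * (year_built - 1950)
    + rng.normal(0, 20_000, n)
)

X = np.column_stack([sqft, bedrooms, year_built])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))  # fit quality on held-out rows
coef_per_sqft = model.coef_[0]                # estimated price change per square foot
print(f"R^2={r2:.3f}, price per sqft={coef_per_sqft:.1f}")
```

Inspecting the coefficients is a simple first step toward explaining which factors drive the predictions before you move on to engineered or external features.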
10. Seeds Clustering
The seeds dataset from UCI provides a simple opportunity to use clustering. Use the seven attributes to sort the 210 seeds into K number of groups. If you’re looking to go beyond KMeans, try using hierarchical clustering, which can be useful for this dataset because the low number of samples can be easily visualized with a dendrogram.
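Here is a rough sketch of hierarchical clustering with SciPy. The data is a synthetic stand-in with the same shape as described above (210 rows, seven features), and the three planted groups are an assumption of the demo. Ward linkage is one common choice among several.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
# Synthetic stand-in: 3 planted groups of 70 rows, 7 features each (210 rows total).
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(70, 7)) for c in (0.0, 10.0, 20.0)])

# Build the merge tree bottom-up using Ward's minimum-variance criterion.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
labels = fcluster(Z, t=3, criterion="maxclust")
cluster_sizes = sorted(np.bincount(labels)[1:].tolist())

# no_plot=True returns the dendrogram layout without requiring matplotlib;
# drop it (with matplotlib installed) to draw the actual figure.
tree = dendrogram(Z, no_plot=True)
print(cluster_sizes, len(tree["leaves"]))
```

With only a couple of hundred samples the full dendrogram stays legible, which makes it a useful chart for justifying where you cut the tree.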
11. Credit Card Fraud Detection
Another project idea for those of you intent on using business world data is to train a classifier to predict instances of credit card fraud. The value of this project to you comes from the fact that it’s an imbalanced dataset, meaning that one class vastly outweighs the other (in this case, non-fraudulent transactions versus fraudulent). On data like this, a model can look highly accurate simply by labeling every transaction as non-fraudulent, so a 99% accuracy score on its own is essentially meaningless. It’s up to you to use non-accuracy metrics such as precision and recall to demonstrate the success of your model.
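The accuracy trap can be demonstrated with a baseline that never predicts fraud. The labels below are synthetic, and the 1% fraud rate is an assumption chosen to match the 99% figure above rather than the real dataset's class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 10 fraudulent transactions out of 1,000 (1 = fraud).
y = np.array([1] * 10 + [0] * 990)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# This baseline always predicts the majority class (non-fraud).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)  # 0.99
recall = recall_score(y, y_pred)      # 0.0, not a single fraud case is caught
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Any real model should be compared against a baseline like this one, so readers can see how much it improves on doing nothing.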
12. Car Fuel Efficiency
This is a great beginner regression project in which you can use car features to predict fuel efficiency. Given that this data is from the past, one interesting idea is to test how well the model performs on data from recent cars, as a way to show how fuel efficiency has evolved over the years.
13. World Happiness
Can data science unlock what’s behind happiness? Maybe you can find out with this dataset on world happiness rankings. You can go a number of ways with this project: use regression to predict happiness score, cluster countries based on socio-economic characteristics, or visualize the change in happiness throughout the world from 2015 to 2019.
14. Political Identity
The Nationscape Data Set is an absolute goldmine of data on the demographics and political identities of Americans. If you’re a politics junkie, it’s sure to give you your fix. Their most recent round of data features over 300,000 instances of data collected from extensive surveys of Americans. If you’re interested in using demographic information to predict political ideology or party identification, this is the dataset for you. This is an especially great project to flex your domain expertise in study design, research, and conclusion. Political analysis is replete with shoddy interpretations that lack empirical data analysis, and you could use this dataset to either confirm or dispel them. But be warned that this data will require plenty of cleaning, which is something you’ll need to get used to, given that’s the majority of the job.
15. Box Office Prediction
If you’re a movie buff, then we’ve got you covered with the TMDB dataset. See if you can build a workable box office revenue prediction model trained on 5,000 movies’ worth of data. Does genre actually correlate with box office success? Can we use runtime and language to help explain the variation in revenue? Find out the answers to those questions and more with this project.