It’s been anointed “the sexiest job of the 21st century”, companies are rushing to invest billions of dollars into it, and it’s going to change the world — but what do people mean when they mention “data science”? There’s been a lot of hype about data science and deservedly so, but the excitement has helped obfuscate the fundamental identity of the field. Anyone looking to involve themselves in data science needs to understand what it actually is and is not.
In this article, we’ll lay out a deep definition of the field, complete descriptions of the data science workflow, and data science tasks used in the real world. We hope that any would-be entrants into this line of work will come away reading this article with a nuanced understanding of data science that can help them decide to enter and navigate this exciting line of work.
So What Actually is Data Science?
A quick definition of data science might be articulated as an interdisciplinary field that primarily uses statistics and computer programming to derive insights from and base decisions from a collection of information represented as numerical figures. The “science” part in data science is quite apt because data science very much follows a scientific process that involves formulating a hypothesis and using a specific toolset to confirm or dispel that hypothesis. At the end of the day, data science is about turning a problem into a question and a question into an answer and/or solution.
Tackling the meaning of data science also means interrogating the meaning of data. Data can be easily described as “information encoded as numbers” but that doesn’t tell us why it’s important. The value of data stems from the notion that data is a tangible manifestation of the intangible. Data provides solid support to aid our interpretations of the world. For example, a weather app can tell you it’s cold outside but telling you that the temperature is 38 degrees fahrenheit provides you with a stronger and specific understanding of the weather.
Data comes in two forms: qualitative and quantitative.
Qualitative data is categorical data that does not naturally come in the form of numbers, such as demographic labels that you can select on a census form to indicate gender, state, and ethnicity.
Quantitative data is numerical data that can be processed through mathematical functions; for example stock prices, sports stats, and biometric information.
Quantitative can be subdivided into smaller categories such as ordinal, discrete, and continuous.
Ordinal: A sort of qualitative and quantitative hybrid variable in which the values have a hierarchical ranking. Any sort of star rating system of reviews is a perfect example of this; we know that a four-star review is greater than a three-star review, but can’t say for sure that a four- star review is twice as good as a two-star review.
Discrete: These are countable and finite values that often appear in the form of integers. Examples include number of franchises owned by a company and number of votes cast in an election. It’s important to remember discrete variables have a finite range of numbers and can never be negative.
Continuous: Unlike discrete variables, continuous can appear in decimal form and have an infinite range of possibilities. Things like company profit, temperature, and weight can all be described as continuous.
What Does Data Science Look Like?
Now that we’ve established a base understanding of data science, it’s time to delve into what data science actually looks like. To answer this question, we need to go over the data science workflow, which encapsulates what a data science project looks like from start to finish. We’ll touch on typical questions at the heart of data science projects and then examine an example data science workflow to see how data science was used to achieve success.
The Data Science Checklist
A good data science project is one that satisfies the following criteria:
Specificity: Derive a hypothesis and/or question that’s specific and to the point. Having a vague approach can often lead to a waste of time with no end product.
Attainability: Can your questions be answered? Do you have access to the required data? It’s easy to come up with an interesting question but if it can’t be answered then it has no value. The same goes for data, which is only useful if you can get your hands on it.
Measurability: Can what you’re applying data science to be quantified? Can the problem you’re addressing be represented in numerical form? Are there quantifiable benchmarks for success?
As previously mentioned, a core aspect of data science is the process of deriving a question, especially one that is specific and achievable. Typical data science questions ask things like, does X predict Y and what are the distinct groups in our data? To get a sense of data science questions, let’s take a look at some business-world-appropriate ones:
- What is the likelihood that a customer will buy this product?
- Did we observe an increase in sales after implementing a new policy?
- Is this a good or bad review?
- How much demand will there be for my service tomorrow?
- Is this the cheapest way to deliver our goods?
- Is there a better way to segment our marketing strategies?
- What groups of products are customers purchasing together?
- Can we automate this simple yes/no decision?
All eight of these questions are excellent examples of how businesses use data science to advance themselves. Each question addresses a problem or issue in a way that can be answered using data science.
The Data Science Workflow
Once we’ve established our hypothesis and questions, we can now move onto what I like to call the data science workflow, a step-by-step description of a typical data science project process.
After asking a question, the next steps are:
- Get and Understand the Data. We obviously need to acquire data for our project, but sometimes that can be more difficult than expected if you need to scrape for it or if privacy issues are involved. Make sure you understand how the data was sampled and the population it represents. This will be crucial in the interpretation of your results.
- Data Cleaning and Exploration. The dirty secret of data science is that data is often quite dirty so you can expect to do significant cleaning which often involves constructing your variables in a way that makes your project doable. Get to know your data through exploratory data analysis. Establish a base understanding of the patterns in your dataset through charts and graphs.
- Modeling. This represents the main course of the data science process; it’s where you get to use the fancy powerful tools. In this part, you build a model that can help you answer a question such as can we predict future sales of a product from your dataset.
- Presentation. Now it’s time to present the results of your findings. Did you confirm or dispel your hypothesis? What are the answers to the questions you started off with? How do your results advance our understanding of the issue at hand? Articulate your project in a clear and concise manner that makes it digestible for your audience, which could be another team in your company or your company’s executives.
Data Science Workflow Example: Predicting Neonatal Infection
Now let’s parse out an example of how data science can affect meaningful real-world impact, taken from the book Big Data: A Revolution That Will Transform How We Live, Work, and Think.
We start with a problem: Children born prematurely are at high risk of developing infections, many of which are not detected until after a child is sick.
Then we turn that problem into a question: Can we detect patterns in the data that accurately predict infection before it occurs?
Next, we gather relevant data: variables such as heart rate, respiration rate, blood pressure, and more.
Then we decide on the appropriate tool: a machine learning model that uses past data to predict future outcomes.
Finally, what impact do our methods have? The model is able to predict the onset of infection before symptoms appear, thus allowing doctors to administer treatment earlier in the infection process and increasing the chances of survival for patients.
This is a fantastic example of data science in action because every step in the process has a clear and easily understandable function towards a beneficial outcome.
Data Science Tasks
Data scientists are basically Swiss Army knives, in that they possess a wide range of abilities — it’s why they’re so valuable. Let’s go over the specific tasks that data scientists typically perform on the job.
Data acquisition: For data scientists, this usually involves querying databases set up by their companies to provide easy access to reams of data. Data scientists frequently write SQL queries to retrieve data. Outside of querying databases, data scientists can use APIs or web scraping to acquire data.
Data cleaning: We touched on this before, but it can’t be emphasized enough that data cleaning will take up the vast majority of your time. Cleaning oftens means dealing with null values, dropping irrelevant variables, and feature engineering which means transforming data in a way so that it can be processed by a model.
Data visualization: Crafting and presenting visually appealing and understandable charts is a hugely valuable skill. Visualization has an uncanny ability to communicate important bits of information from a mass of data. Good data scientists will use data visualization to help themselves and their audiences better understand what’s going on.
Statistical analysis: Statistical tests are used to confirm and/or dispel a data scientist’s hypothesis. A t-test or chi-square are used to evaluate the existence of certain relationships. A/B testing is a popular use case of statistical analysis; if a team wants to know which of two website designs leads to more clicks, then an A/B test is the right solution.
Machine learning: This is where data scientists use models that make predictions based on past observations. If a bank wants to know which customers are likely to pay back loans, then they can use a machine learning model trained on past loans to answer that question.
Computer science: Data scientists need adequate computer programming skills because many of the tasks they undertake involve writing code. In addition, some data science roles require data scientists to function as software engineers because data scientists have to implement their methodologies into their company’s backend servers.
Communication: You can be a math and computer whiz, but if you can’t explain your work to a novice audience, your talents might as well be useless. A great data scientist can distill digestible insights from complex analyses for a non-technical audience, translating how a p-value or correlation score is relevant to a part of the company’s business. If your company is going to make a potentially costly or lucrative decision based on your data science work, then it’s incumbent on you to make sure they understand your process and results as much as possible.
We hope this article helped to demystify this exciting and increasingly important line of work. It’s pertinent to anyone who’s curious about data science — whether it’s a college student or an executive thinking about hiring a data science team — that they understand what this field is about and what it can and cannot do.