What Is: Data Science



Name: Thomson Nguyen (@itsthomson)
Occupation: I’m the Lead Data Scientist at Causes. I’m also a Visiting Scholar at the Courant Institute of Mathematical Sciences at NYU.

1. In 140 characters or less, what is Data Science?

Data science is the art of using data to create products, obtain actionable insights, and communicate decisions to non-technical people.

2. What are some practical applications of data science?

Most recommendation engines you see on shopping sites use machine learning and data science. Any data visualization you thought was cool in the last year or so was probably the result of hacking together data. In short, we use data science to help us discover new bands, secure our mobile phones, and ultimately tell a clear story from large, unclear amounts of data.

3. What is the relation between data science and machine learning?

Data science is the process of turning data into awesome things — like a product recommender on Amazon, a list of high-value precincts to call for your presidential campaign, or a self-driving car that navigates any terrain in any situation. Most of the time (though not all the time) data scientists make their jobs a bit easier with “machine learning”, where they teach a computer to teach itself how to process data.

4. What are some of your favorite books, links, resources, for someone interested in getting started in data science?

  • Machine Learning for Hackers is a great book by Drew Conway and John Myles White — it’s a thorough introduction to making products people care about through machine learning.
  • Data Mining with R: Learning with Case Studies is a more in-depth treatment of data mining with specific case studies. It’s a good follow-up book for formalizing the process of taking data and turning them into analyses.
  • O’Reilly Radar has a couple of really good articles on data science. DJ Patil has two of my favorites: Building Data Science Teams and Data Jujitsu: The art of turning data into product. Mike Loukides has also written a concise introduction called What is data science?
  • Lastly, Kaggle is a great way to benchmark your current progress in hacking data sets and making models by competing against other data scientists and data practitioners.

5. Any advice for an aspiring data scientist?

As much as I hate to say it, when you’re starting out, focus less on the theory and mathematics in the models you use and more on the applications and products you can create. The sooner you can actually create something (not just a model, but a web product, a visualization, a paper, anything that extends your initial models), the sooner you’ll realize that data science is more than just machine learning. Skipping the theory is knowledge debt, though: definitely spend time learning the theory behind your models once you’re confident in turning models into products. Black-box machine learning is only a little bit better than throwing darts at your datasets.

6. What data science projects inspire you?

Ben Fry has a neat visualization of zip codes in the United States. And “Infinite Gangnam Style” is definitely my favorite data-driven music project, even if I’m not the biggest fan of the song. The person who made this used Echo Nest, which has an analyzer that takes songs and splits them up into individual beats (using signal processing and filtering techniques). Once you’ve figured out the key and the phrasing of each beat, you can stitch the beats together into a procedurally generated, infinitely looping K-pop song.
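To make the beat-stitching idea concrete, here is a minimal Python sketch. It is not the Echo Nest API and not the project's actual code. It assumes each beat has already been reduced to a small feature tuple (the values in DEMO_FEATURES are made up), and the names similar_beats, infinite_playlist, threshold, and jump_prob are hypothetical. The idea: play beats in order, but occasionally jump to the beat that follows a similar-sounding beat, so the song never has to end.

```python
import itertools
import math
import random


def similar_beats(features, threshold):
    """Map each beat index to the other beats whose features lie within threshold."""
    neighbors = {}
    for i, a in enumerate(features):
        neighbors[i] = [
            j
            for j, b in enumerate(features)
            if j != i and math.dist(a, b) <= threshold
        ]
    return neighbors


def infinite_playlist(features, threshold=0.1, jump_prob=0.3, seed=None):
    """Yield beat indices forever, sometimes jumping between similar beats."""
    rng = random.Random(seed)
    neighbors = similar_beats(features, threshold)
    i = 0
    while True:
        yield i
        # If a similar beat exists, sometimes continue from that beat instead.
        if neighbors[i] and rng.random() < jump_prob:
            i = rng.choice(neighbors[i])
        # Move on to the next beat, wrapping around at the end of the song.
        i = (i + 1) % len(features)


# Made-up features (assumption): beats 0 and 2 sound almost identical.
DEMO_FEATURES = [(0.0, 0.0), (1.0, 1.0), (0.05, 0.0), (2.0, 2.0)]

if __name__ == "__main__":
    first_beats = itertools.islice(infinite_playlist(DEMO_FEATURES, seed=1), 12)
    print(list(first_beats))
```

A real analyzer would supply much richer per-beat features (pitch, timbre, loudness), and the indices produced here would be mapped back to audio slices for playback.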

7. What are some misconceptions about data science?

I have a few:

  • Data scientists are rebranded academics from science and mathematics. Machine learning can be a very academic field, and as a result some data scientists spend 90% of their time on machine learning, and 10% on the other things I mentioned above. That’s not a bad thing; I just think the ideal data scientist is someone who knows enough of everything to be dangerous. This Venn diagram (courtesy of drewconway.com) is a good visualization of what the ideal data scientist is.
  • Data science is a reinvented term for data analysis. Ultimately I see the ideal function of a data scientist as a key member of your company’s product team — they know just enough to create data-driven products and analyses that enable decisions.
  • Data science is a big trend and you need to hire a scientist today! You don’t need a data scientist anytime soon for your company — it’s a difficult search, they’re expensive, and they require a lot of planning on your end for them to be valuable to your business. Make sure you’ve talked to someone before making the leap into hiring. That said, it’s still a really great market for data scientists.
