In my experience at both large tech companies and small startups, there is a tendency to assign data projects to rockstar developers and let them run with it. The results are often interesting from a data science standpoint, but nonfunctional from a product standpoint.
The problem with this approach is twofold. First, the way developers approach the data is fundamentally different from the way the data needs to integrate into the product. Second, datasets in the real world are messy—inaccurate, imprecise, and unstructured—and this can render them unusable in their initial format.
At VeryApt, every data project is allocated developer hours, analyst hours, and budget for external resources (such as tools and supporting datasets). With these constraints established, our analysts and developers work together to create clear goals for the product and its associated site integration. This allows the team to narrowly focus their efforts and determine whether the available data can produce usable results.
Collectively Defining Goals
One of our recent projects was to rank apartments by proximity to a grocery store. A software developer built a quick prototype that searched our places database for the nearest grocery store and scored the apartment based on walking distance. Not accurate, not precise, but it was a starting point and confirmed that our servers could handle the load. Concurrently, our data analyst tested models that would solve the problem, rather than provide an arbitrary ranking.
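A first pass like that prototype can be sketched in a few lines. This is a hypothetical, simplified version—straight-line distance stands in for walking distance, the linear score falloff and all field names are invented for illustration, and none of it is VeryApt's actual code:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_grocery_score(apartment, groceries, max_walk_m=1600):
    """Score an apartment 0-100 by distance to the closest grocery store.
    Straight-line distance stands in for walking distance in this sketch."""
    dists = [haversine_m(apartment["lat"], apartment["lon"], g["lat"], g["lon"])
             for g in groceries]
    nearest = min(dists)
    # Linear falloff: 100 at the door, 0 at max_walk_m or beyond.
    score = max(0.0, 100.0 * (1 - nearest / max_walk_m))
    return score, nearest
```

A prototype like this is cheap to build and run, which is exactly why it works as a load test and a starting point, even though it answers none of the harder modeling questions.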
Our developer was surprised by the myriad questions that our analyst asked while building models. Are you willing to travel further for high-end grocery stores? Will you stop at the corner store a block away to pick up milk rather than walk the five blocks to the supermarket? Does having a 24-hour drugstore nearby make grocery shopping easier for you?
All of this meant that we didn’t build just one model, we built six! We built one for each grocery category (supermarkets, drugstores, and convenience stores), one to balance the priority among the three, one to define data trustworthiness, and one to rate the quality of each store.
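As a rough illustration of how the balancing model might combine the others, here is a hypothetical sketch: per-category scores are blended with invented weights and discounted by a 0–1 trust factor. The weights, names, and structure are illustrative assumptions, not VeryApt's actual models:

```python
# Hypothetical category weights; the real balancing model is learned, not hand-set.
CATEGORY_WEIGHTS = {"supermarket": 0.6, "drugstore": 0.25, "convenience": 0.15}

def blended_grocery_score(category_scores, trust):
    """Combine per-category scores (0-100) into one score, then discount it
    by a 0-1 trust factor from the data-trustworthiness model."""
    base = sum(CATEGORY_WEIGHTS[c] * category_scores.get(c, 0.0)
               for c in CATEGORY_WEIGHTS)
    return base * trust
```

Splitting the problem this way keeps each model small and testable: a bad convenience-store score can be debugged without touching the supermarket model.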
By correctly identifying the scope early in the process, we allocated the additional resources needed for a project larger than originally specified, our developer scaled out the design to handle the processing needs of the additional models, and our analyst ensured the data was being applied properly.
Having the Correct Toolbox
Excel is a great product for analyzing data. Same goes for SQL queries. But what if you are analyzing unstructured data or data that does not work well in spreadsheets?
Before we start any data project, we identify what software is needed to allow our analysts to succeed. For our mapping projects, we build data analysis layers over OpenStreetMap. For our text-heavy analysis, we have a custom UI for cross-referencing and highlighting commonly used phrases from other portions of the dataset. Almost every internal tool we have creates Excel data dumps for further structured data inspection and metrics reporting. With the right tools in their toolbox, our data analysts are able to focus on answering hard questions, rather than asking the data to reveal itself.
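The Excel-dump step can be as simple as writing analysis output to CSV, which Excel opens directly. A minimal sketch—the function name and row shape here are invented for illustration:

```python
import csv

def dump_metrics(rows, path):
    """Write a list of dicts to a CSV file that opens cleanly in Excel.
    Columns are the union of all keys, sorted for a stable header order."""
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

The point is less the format than the habit: every tool emitting a spreadsheet-friendly dump means analysts can always fall back to the tooling they already know.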
Cleaning Data: A Shared Priority
In many software shops, data cleaning is exclusively an analyst job. At VeryApt, our developers and analysts work together, ensuring both parties are aware of the pain points in the datasets and are able to react to them quickly.
In our case, the analysts:
- Provide scope for and drive the cleaning process,
- Identify additional, easy-to-access data sources that our developers can integrate with our primary dataset in order to provide a higher level of confidence in our data,
- Determine where the most common error cases occur.
Meanwhile, the developers:
- Identify gaps and inconsistencies that exist in the datasets,
- Create automated rules to filter out common error cases at scale,
- Build out tools to support the analyst investigation.
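The automated filtering rules might look something like the sketch below: each rule is a label plus a predicate that flags a bad record. The specific rules and field names here are invented for illustration, not our production ruleset:

```python
import re

# Hypothetical cleaning rules; real rules come out of the analyst investigation.
RULES = [
    ("missing name", lambda r: not r.get("name", "").strip()),
    ("bad zip", lambda r: not re.fullmatch(r"\d{5}", r.get("zip", ""))),
    ("out-of-range lat", lambda r: not -90 <= r.get("lat", 999) <= 90),
]

def partition_records(records):
    """Split records into clean rows and flagged rows, keeping the labels
    of every rule that fired so analysts can see the common error cases."""
    clean, flagged = [], []
    for record in records:
        hits = [label for label, is_bad in RULES if is_bad(record)]
        if hits:
            flagged.append((record, hits))
        else:
            clean.append(record)
    return clean, flagged
```

Keeping the fired-rule labels attached to each flagged record is what lets the analysts "determine where the most common error cases occur" from the same pass that filters them.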
In addition, we utilize resources like Mechanical Turk to clean the portions of our data that cannot be handled efficiently via automation.
Solving Problems Together
We have tons of review content from apartment renters across the country. These reviews need to be summarized into what we call “review snippets,” which highlight key features of a given apartment. One apartment might be highlighted as having great views and amenities; another, spacious closets and friendly staff.
Our developers built a natural language processor to extract snippets, but quickly hit a wall due to the bin packing problem: if a review was long and highlighted multiple features, determining the best snippet required a tremendous amount of processing power. Short reviews were analyzed in a few seconds, whereas reviews longer than a five-sentence paragraph could take hours.
Enter our data analysts. Using modeling tools like decision trees and naïve Bayes classification, they generated thousands of rules for classifying phrases and key terms. Now, we run our rules engine to extract snippets, with the natural language processing as a fallback when no snippets can be found. The end result? Our entire corpus used to take weeks to process; now it takes under an hour! Moreover, the snippets are of higher quality, both in terms of accuracy and descriptiveness.
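The rules-engine-with-fallback flow can be sketched as follows. The phrase table here is a tiny invented stand-in for the thousands of generated rules, and the fallback is just a function parameter rather than our actual NLP pipeline:

```python
# Hypothetical phrase rules standing in for the generated classifier rules.
SNIPPET_RULES = {
    "great views": "views",
    "spacious closets": "closets",
    "friendly staff": "staff",
    "rooftop deck": "amenities",
}

def extract_snippets(review, nlp_fallback=None):
    """Fast path: match known phrases against the rules table.
    Slow path: hand off to the NLP pipeline only when no rule fires."""
    text = review.lower()
    found = [(phrase, tag) for phrase, tag in SNIPPET_RULES.items()
             if phrase in text]
    if found:
        return found
    return nlp_fallback(review) if nlp_fallback else []
```

The design choice worth noting is that the expensive path is never removed, only demoted: cheap rules handle the bulk of the corpus, and the NLP fallback still catches reviews the rules miss.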
As VeryApt has continued to grow in scope and content, we’re faced with increasingly challenging large-scale data projects. While our approach to tackling these projects is continually evolving, our framework has remained the same: (1) Set a realistic and narrow set of product goals, (2) gather or generate a reliable dataset, (3) facilitate teamwork between your analyst and developer teams to ensure your project is set up for success, and (4) revisit large data projects periodically as your team grows to see whether you have additional data or resources to improve on prior work.