The Double-Edged Sword of Data

One of my favorite parts of data science is using previously untapped data sources to inform decision making in ways that were not possible just a few years ago. However, a common pitfall of data-driven decision making is trying to replace intuition and common sense with scatter plots and p-values, which often leads to incorrect interpretations and suboptimal decisions.

For example, imagine you work for a nonprofit organization and are responsible for assigning street teams to specific New York City subway stops to canvass for your employer's upcoming gala. Being the in-the-know analyst you are, you are aware that the MTA makes its turnstile count data, which reports cumulative subway entrances and exits by station and time, publicly available.

"How nifty!" you think. "I just have to tally up the number of entrances and exits for each station over my analysis period, and send street teams to the most heavily trafficked stations. Problem solved." Then, to show off your matplotlib chops, you create a fancy bar graph like the one below, which shows average hourly traffic from 4/11/2015 through 5/30/2015. Based on your analysis, you recommend that your boss send canvassers to 42nd St-Grand Central Station and 34th Street-Herald Square, by far the busiest stations on the map.

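Because the turnstile data reports cumulative counters, a tally first has to turn consecutive readings into per-interval differences. Here is a minimal sketch of that step using only the Python standard library. The tuple layout, the sample rows, and the 10,000-per-interval cap are all assumptions made for illustration; they are not the MTA's actual schema or the analysis in the repo.

```python
from collections import defaultdict

# Hypothetical rows: (station, turnstile_id, timestamp, cumulative_entries, cumulative_exits).
# These values are made up for illustration; they are not real MTA readings.
SAMPLE_READINGS = [
    ("Station A", "t1", "2015-04-11 00:00", 1000, 500),
    ("Station A", "t1", "2015-04-11 04:00", 1040, 530),
    ("Station A", "t1", "2015-04-11 08:00", 1200, 600),
    ("Station B", "t2", "2015-04-11 00:00", 9000, 7000),
    ("Station B", "t2", "2015-04-11 04:00", 9010, 7005),
    ("Station B", "t2", "2015-04-11 08:00", 5, 2),  # simulated counter reset
]

# Assumed sanity cap on riders through one turnstile in one interval.
MAX_PLAUSIBLE_DELTA = 10_000


def station_traffic(readings, max_delta=MAX_PLAUSIBLE_DELTA):
    """Sum entries + exits per station from cumulative turnstile counters."""
    by_turnstile = defaultdict(list)
    for station, turnstile, timestamp, entries, exits in readings:
        by_turnstile[(station, turnstile)].append((timestamp, entries, exits))

    totals = defaultdict(int)
    for (station, _turnstile), rows in by_turnstile.items():
        rows.sort()  # "YYYY-MM-DD HH:MM" strings sort chronologically
        for (_, e0, x0), (_, e1, x1) in zip(rows, rows[1:]):
            d_entries, d_exits = e1 - e0, x1 - x0
            # Skip intervals where a counter went backwards or jumped implausibly.
            if 0 <= d_entries <= max_delta and 0 <= d_exits <= max_delta:
                totals[station] += d_entries + d_exits
    return dict(totals)


def busiest(readings, k=5):
    """Return the k stations with the most total traffic, busiest first."""
    totals = station_traffic(readings)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

On the made-up rows above, busiest(SAMPLE_READINGS) ranks Station A (300) ahead of Station B (15). The Station B reset interval is dropped rather than counted as negative traffic.
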
However, while Grand Central and Herald Square are certainly the most heavily trafficked stations overall, are they truly the optimal stations to send teams to? Will sending street teams to the busiest stations maximize your organization's exposure, or will your teams simply be overwhelmed by the throngs passing through those stations? Further, is reaching the greatest number of people the ultimate priority for your organization? Or is it more important to reach a specific type of person, such as a potential donor or someone more likely to be interested in your organization's work?

The map below shows the top 5 busiest stations overall (yellow), the top 5 most affluent stations, as measured by median income in the census tract where the station is located (red), and the top 5 busiest stations close to universities (blue) and tech hubs (green). (Our nonprofit organization focuses on women in the technology industry, so I'm assuming stations close to tech hubs and universities are more likely to attract potentially interested people.)

As shown, there is very little overlap between the busiest stations overall, the most affluent stations, and the busiest stations close to tech hubs and universities. No single station makes the "top five" in more than one category. Further, the wide geographic spread of the top five stations in each category indicates that the optimal street team placement will depend heavily on the organization's priorities. If attracting affluent potential donors is your organization's most important priority, sending teams to Midtown rather than to SoHo, Tribeca, and Greenwich Village is probably not the most effective strategy.
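
An overlap check like this takes only a few lines once each category's top five is in hand. Here is a rough sketch, where the station labels are placeholders rather than the stations on the map, and shared_stations is a hypothetical helper, not something from the repo:

```python
from itertools import combinations

# Placeholder station labels, NOT the real rankings shown on the map.
TOP_FIVE = {
    "busiest": {"S01", "S02", "S03", "S04", "S05"},
    "affluent": {"S06", "S07", "S08", "S09", "S10"},
    "universities": {"S11", "S12", "S13", "S14", "S15"},
    "tech_hubs": {"S16", "S17", "S18", "S19", "S20"},
}


def shared_stations(rankings):
    """Return {(category_a, category_b): stations that appear in both}."""
    overlaps = {}
    for (name_a, set_a), (name_b, set_b) in combinations(rankings.items(), 2):
        common = set(set_a) & set(set_b)
        if common:  # only record pairs that actually share a station
            overlaps[(name_a, name_b)] = common
    return overlaps
```

An empty result from shared_stations(TOP_FIVE) means no station ranks in more than one category, which is the situation described above.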

This example illustrates the double-edged sword of data. Data is a powerful tool for informing decision making, but can also lead to bad decisions when it is used as a substitute for careful thought, common sense, and a clear understanding of the problem you are trying to solve. In the brave new world of big data, it is as important as ever to think before you do.

Take a look at the GitHub repo here as well.