Traditionally, Vehicle Census data, collected periodically by seventeen states in the United States, is grouped by county and state to estimate vehicle registrations. The aggregated data, while useful for deriving trends, is of limited use to decision makers. Affordable data-sharing technologies have made it possible for states to anonymize and share their datasets. Although more data from different sources has become available, the resources to analyze it and use it to inform decisions have not kept pace. To overcome this limitation and take advantage of the latest analysis techniques, some states have opened their datasets to the developer community.

Massachusetts opened its Vehicle Census dataset to the developer community as part of its 37 Billion Miles Data Challenge (“the challenge”) to help inform decision making. The dataset contained anonymized information about the age, model, estimated mileage, fuel efficiency, and zip code of passenger and commercial vehicles registered in the state from 2008 to 2011.

Our team prepared a series of dynamic online maps to explain where driving places the biggest burden on family budgets. From a hot spot analysis of these maps, we deduced that the Fuel Burden (FB) Index (the percentage of household income spent on fuel) depends on location: households in some zip codes spend proportionally more of their income on fuel. We suggested that selectively targeting these zip codes, and smaller grids within them, will have the best chance of reducing total miles driven. A composite index (combining miles driven, carbon dioxide emissions, and the FB Index) applied at the grid level has the best chance of success. Households with a higher FB Index may be more amenable to incentives for using transit or carpooling, so investments in these modes may be desirable. Our suggestions won the “Dollars and Sense” and “People’s Choice” awards in the challenge.
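As a rough illustration of the two indices described above, the following Python sketch computes an FB Index per area and one possible composite index. It is a sketch under stated assumptions, not the method we used in the challenge: the field names, the fuel price, the CO2 factor per gallon, and the choice of min-max scaling with equal weights are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AreaStats:
    """Per-area inputs; all field names are hypothetical."""
    area_id: str             # zip code or grid cell identifier
    annual_miles: float      # average miles driven per household per year
    mpg: float               # average fuel efficiency, miles per gallon
    household_income: float  # annual household income, USD


def fuel_burden(area: AreaStats, price_per_gallon: float) -> float:
    """FB Index: percentage of household income spent on fuel."""
    gallons = area.annual_miles / area.mpg
    return 100.0 * gallons * price_per_gallon / area.household_income


def min_max(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; all zeros if every value is identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_index(
    areas: List[AreaStats],
    price_per_gallon: float = 3.50,    # assumed fuel price, USD
    kg_co2_per_gallon: float = 8.887,  # assumed gasoline emission factor
) -> Dict[str, float]:
    """Equal-weight average of scaled miles, CO2, and FB Index (assumed weighting)."""
    miles = min_max([a.annual_miles for a in areas])
    co2 = min_max([a.annual_miles / a.mpg * kg_co2_per_gallon for a in areas])
    fb = min_max([fuel_burden(a, price_per_gallon) for a in areas])
    return {
        a.area_id: (m + c + f) / 3.0
        for a, m, c, f in zip(areas, miles, co2, fb)
    }
```

With made-up inputs, an area where households drive 12,000 miles a year at 24 mpg on a 60,000 USD income would have an FB Index of roughly 2.9 percent at 3.50 USD per gallon. Areas scoring near 1 on the composite index would be the first candidates for targeting.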

In the presentation, we will discuss writing programs to glean information from large datasets when traditional software does not work, as we did for the challenge. We will also discuss our methodology and how novel methodologies can be devised to combine data from large, disparate datasets.
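One common way to handle a file too large for desktop tools is to stream it record by record and keep only running totals in memory. The Python sketch below illustrates that idea; the column names `zip_code` and `est_annual_miles` are hypothetical and do not reflect the actual layout of the Vehicle Census files, and this is not necessarily the program we wrote for the challenge.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Mapping, TextIO


def miles_by_zip(rows: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Sum estimated annual miles per zip code, one record at a time."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        try:
            zip_code = row["zip_code"]               # hypothetical column
            miles = float(row["est_annual_miles"])   # hypothetical column
        except (KeyError, TypeError, ValueError):
            # Skip records with missing or non-numeric values.
            continue
        totals[zip_code] += miles
    return dict(totals)


def miles_by_zip_from_file(handle: TextIO) -> Dict[str, float]:
    """Stream a CSV file; only the per-zip totals are held in memory."""
    return miles_by_zip(csv.DictReader(handle))
```

Because `csv.DictReader` yields one row at a time, memory use grows with the number of distinct zip codes rather than with the number of vehicle records.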