Transportation professionals must rely heavily on statistical evidence to make important planning and infrastructure decisions. The advent of traffic analysis zones and other spatial analysis units is partly an artifact of this reliance, and thousands of planning decisions have already been made based on their geographic convenience. However, these analysis zones and other aggregated geographic definitions can create statistical problems that many transportation professionals are unaware of or ignore.
A typical approach with polygon data of varying size or population density is to normalize the information so that more appropriate comparisons can be made. For example, a common strategy is to divide the value of interest by the polygon’s population, area, intersection count, or any other seemingly appropriate measure. The mistake is to presume that the resulting normalized values can then be compared without trouble. To make matters worse, everyday software tools make this (and only this) normalization procedure readily accessible. The normalized values are commonly ranked, and planning decisions are then made from the rankings. However, an important statistical property of this procedure is that low-density areas, such as the more rural census blocks or traffic analysis zones (TAZs), will be overrepresented in the tails of the calculated distribution. This is because we expect higher variance in places with less information, which characterizes low-density areas. Low-density zones are equivalent to small samples and, thus, the typical concerns with small samples apply once again.
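As a minimal illustration of the small-sample effect described above, the following Python sketch simulates zones that all share the same underlying event rate and differ only in population. Every name and number in it (zone counts, populations, the 5% rate, the tail size) is a hypothetical choice made for illustration, not a value taken from any dataset.

```python
import random


def simulate_zone_rates(n_small=200, n_large=200, pop_small=50,
                        pop_large=2000, true_rate=0.05, seed=0):
    """Return (population, observed_rate) pairs for simulated zones.

    Every zone has the SAME underlying rate; only population differs.
    All parameter values are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    zones = []
    for pop in [pop_small] * n_small + [pop_large] * n_large:
        # Each resident independently experiences the event with
        # probability true_rate (a binomial count per zone).
        events = sum(rng.random() < true_rate for _ in range(pop))
        zones.append((pop, events / pop))  # the usual normalization step
    return zones


def small_zone_share_in_tails(zones, pop_small, k=20):
    """Share of the k lowest and k highest ranked rates from small zones."""
    ranked = sorted(zones, key=lambda z: z[1])
    tails = ranked[:k] + ranked[-k:]
    return sum(1 for pop, _ in tails if pop == pop_small) / len(tails)


zones = simulate_zone_rates()
share = small_zone_share_in_tails(zones, pop_small=50)
print(f"Small zones are 50% of all zones but {share:.0%} of the ranked tails")
```

Because the true rate is identical everywhere, any concentration of small zones at the top and bottom of the ranking is produced by sampling variance alone, not by real differences between zones.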
We discuss why this simple algebraic step can produce erroneous patterns and lead to misguided planning decisions. Using practical examples, we offer several steps that address this problem directly and help return the common and important practice of polygon-based planning to solid statistical footing.
In addition to focusing on the statistical problem described above, we discuss ways to avoid a related visualization issue faced by those attempting to make decisions based on information aggregated to polygons: the common use of choropleth maps to visualize these data leads to disproportionate focus on the areas of the map with larger polygons, which in many cases are the same polygons that show the most color variation because of the higher-variance problem.