library(here)
source("scripts/setup.R")
library(magick)    # image reading, editing, and color quantization
library(pdftools)  # PDF-to-PNG conversion
library(grDevices) # base color utilities (col2rgb, rgb)
# city identifiers: file-name form and display form
CITY <- "Simi_Valley"
CITY2 <- "Simi Valley"
CONVERT_PDF <- FALSE  # set TRUE to re-convert the raw PDF to PNG
NUM_COLORS <- 30      # default number of color groups for quantization
DROP_COLORS <- c("TRANSPARENT", "#FFFFFFFF")  # colors excluded from the shares
# input and output paths
map_path <- paste0("raw/", CITY, "_ZN.pdf")
raw_img_path <- paste0("raw/", CITY, "_raw.pdf")
png_out_path <- paste0("raw/", CITY, ".png")
edited_png_path <- paste0("raw/", CITY, "_edited.png")
The Terner Center Land Use Survey (TCRLUS) asked respondents to estimate the share of land in their municipality that is zoned to permit multifamily housing. However, visually comparing some respondents’ published zoning maps with their answers suggested discrepancies between the actual land share and the share reported on the survey. To investigate, I applied computer image analysis to a few selected zoning maps to estimate the area zoned for multifamily housing on those maps, as a way of validating the survey responses. This document describes the image analysis methodology and presents illustrative findings for Simi Valley, CA, where the land share reported in the survey differs notably from the share indicated by the image analysis.
In TCRLUS, Simi Valley reported that 76 to 95 percent of its land allows the construction of multifamily units. Its publicly available zoning map is shown below. It is not entirely clear which zoning categories the survey respondents included in their estimate of the land share zoned for multifamily housing, but the beige of the low- and medium-density categories seems to dominate the map. However, in a static image where the different zones are interspersed with one another, it is hard to visually assess how much area any given zone occupies.
In computer graphics, colors are commonly represented as a combination of red, green, and blue in what is known as the RGB color space. Color in a digital image is in fact a combination of red, green, and blue lights placed so close together that the human eye blends them into the color we perceive. Computers define each color with a unique identifier of three numbers (or another code that can be converted into those three numbers), indicating the respective intensities of red, green, and blue light that create it.
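As a quick illustration using base R’s grDevices utilities (loaded above), one can convert back and forth between a hex color code and its three RGB intensities; the orange shade here is just an arbitrary example:
# a hex code encodes three intensities (0-255) in base 16
col2rgb("#FF8800")                    # red = 255, green = 136, blue = 0
rgb(255, 136, 0, maxColorValue = 255) # recovers the hex code "#FF8800"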
A digital image is a collection of tiny points, called pixels, each of which has an associated color code. Pixels are usually too small for humans to pick out individually; our eyes blur adjacent pixels together, producing the impression of solid color. What looks like a solid color may actually be a collection of very similarly-colored points. A higher pixel density can make an image seem more “realistic”, with sharper details and colors closer to what we would see in the real world, but such high-resolution images are more computationally expensive to store or transfer.
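This structure is easy to see with the magick package (loaded above): an image is just a grid of pixels, each carrying a color code. A minimal sketch using ImageMagick’s built-in demo image:
img <- image_read("logo:") # ImageMagick's built-in sample image
image_info(img)            # pixel width and height, colorspace, file size
px <- as.raster(img)       # a matrix of per-pixel color codes
dim(px)                    # number of pixel rows and columns
px[1, 1]                   # the hex color code of the top-left pixel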
To reduce this expense, computer scientists have developed methods that simplify the colors in a digital image while preserving as much of the original perceived color as possible. One such method is color quantization. The general idea is to use a clustering algorithm to divide the colors into groups of similar colors, and then to use each group’s “average” color in place of every group member’s original color.
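Continuing the sketch above, magick’s image_quantize() implements exactly this idea; quantizing the demo image to at most 8 colors collapses its palette accordingly:
img8 <- image_quantize(img, max = 8) # cluster the palette into at most 8 colors
length(table(as.raster(img8)))       # distinct colors remaining (at most 8)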
Since zoning maps are composed of only a handful of fairly distinct colors, I thought I could use color quantization to identify what share of the pixels in an image belong to each color. In R, the magick package makes this process remarkably simple: one can read in an image much as one would a dataset, then apply a pre-packaged function to partition the colors in the image into a specified number of groups. The result can be converted into a table containing the number of pixels associated with each color code, at which point computing the share of pixels associated with each code is straightforward.
Here is the R code to compute such shares:
#===============================================================================#
# QUANTIZE AND COMPUTE COLOR SHARES
#===============================================================================#
# read in the edited image - contains just the zoning area
map_png <- image_read(edited_png_path)
# quantize the map image into the specified number of colors, return a df w/pixel
# counts for each color
quantize_map <- function(png, max_color){
  img_q <- image_quantize(png, max = max_color) # cluster into at most max_color colors
  img_r <- as.raster(img_q)                     # matrix of per-pixel color codes
  img_tab <- table(img_r)                       # pixel counts per color
  img_df <- data.frame(Color = names(img_tab), Count = as.integer(img_tab))
  return(img_df)
}
# compute shares of pixels of each color, with and without specified colors
compute_color_shares <- function(df, excluded_colors = c("TRANSPARENT", "#FFFFFFFF")) {
  tot <- sum(df$Count)                                                     # all pixels
  tot_excluded <- sum(df$Count[!toupper(df$Color) %in% excluded_colors])   # pixels net of excluded colors
  df %<>% mutate(Color = toupper(Color),
                 color_share = Count / tot,
                 color_share2 = Count / tot_excluded,
                 color_hex = str_remove(Color, "FF$")) # strip the trailing alpha channel only
  return(df)
}
img_df <- quantize_map(map_png, NUM_COLORS)
img_df %<>% compute_color_shares(excluded_colors = DROP_COLORS)
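As a quick sanity check (a hypothetical inspection step, assuming the dplyr verbs loaded via setup.R), the most common colors should correspond to the zones that visually dominate the map:
# the five largest color shares, net of excluded colors
img_df %>% arrange(desc(color_share2)) %>% head(5)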
Though this approach is simple to implement, a few important methodological issues arise when applying it to zoning maps:
1. Zoning maps typically have a lot of explanatory text and annotations that are not part of the mapped area itself, and they often don’t use solid colors.
The color quantization algorithm counts the number of pixels of each color and does not distinguish between text and the map itself: black text is included in the count of black pixels, and white space in the count of white pixels. Consequently, I remove white, black, and empty (“transparent”) pixels from the color shares.
Similarly, zoning maps often use hatching or dot patterns to denote special areas. The algorithm cannot tell that these patterns cover a contiguous area and will treat their colors as distinct groups. As a result, this approach works best on solid-colored maps.
2. The “correct” number of color groups is not obvious.
The algorithm will split the colors in the image into as many groups as one specifies, but it is not clear how many “true” color groups a zoning map contains. The number of colors in the legend is a good starting point; however, since colored areas in digital images may be composed of similarly (but not identically) colored pixels, the legend count is only an approximation.
3. The algorithm will produce similar color groups but may not match the original colors exactly.
Again, this issue stems from the fact that what appears to the human eye as a solid color may be composed of pixels drawn from many closely related colors. The algorithm essentially “averages” the colors it assigns to a group, and this average likely won’t be identical to the original color specified by the map’s creator. In maps where similar shades denote related categories, the algorithm may therefore be less effective at distinguishing groups or matching colors, as the short sketch below illustrates.
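Here is a minimal synthetic demonstration of that averaging; the two green shades and swatch sizes are made up for illustration. Forcing two visually similar greens into a single quantization group yields a color that matches neither original exactly:
# two 10x10 swatches of slightly different greens, side by side
swatches <- image_append(c(image_blank(10, 10, "#00AA00"),
                           image_blank(10, 10, "#00B400")))
merged <- image_quantize(swatches, max = 1) # force both shades into one group
as.raster(merged)[1, 1]                     # the "averaged" color, matching neither input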
To analyze the Simi Valley map, I took a couple of steps to address the issues described above. First, I found a version of the zoning map that used solid colors and lacked annotation text, and then I used an online photo editing tool to crop the image so it only contained the area of Simi Valley itself. Here is the input image:
I tried a few different numbers of partitions; the results are plotted in the graphs below. In each plot, a bar shows the share of pixels in the image associated with one color, and the label beneath the bar is the RGB code defining that color.
It seems evident that 10 groups is too few, and there is still substantial change between 20 and 30, but the results for 30 and 40 are fairly similar. Notably, there is some stability in the color groups as well: notice that some of the more common color groups, like green, are labeled with the same color code across different partition sizes.
Regardless of the specific partition choice, the graphs indicate that Simi Valley’s self-reported multifamily land share of more than 75% does not align with the published map.
#===============================================================================#
# MAKE PLOTS
#===============================================================================#
# quantize the map into NUM_COLORS groups and return a df of color shares for plotting
make_one_partition <- function(map_png, NUM_COLORS, excluded_colors = DROP_COLORS){
  img_df <- quantize_map(map_png, NUM_COLORS) %>%
    compute_color_shares(excluded_colors = excluded_colors) %>%
    filter(!Color %in% excluded_colors) %>%
    arrange(color_share2)
  # translate each hex code into an "R{red}-{green}-{blue}" label for the x-axis
  color_df <- data.frame(t(col2rgb(img_df$Color))) %>%
    mutate(rgb_color = paste0("R", red, "-", green, "-", blue)) %>%
    bind_cols(select(img_df, Color))
  img_df %<>% left_join(select(color_df, Color, rgb_color), by = "Color") %>%
    mutate(partition_N = NUM_COLORS)
  return(img_df)
}
# try color partitions of size 10, 20, 30, and 40, bind into single data frame
partitions <- map(c(10, 20, 30, 40),
                  function(N) make_one_partition(map_png, N)) %>%
  bind_rows()
ggplot(partitions) +
  geom_col(aes(x = reorder(rgb_color, color_share2), y = color_share2,
               fill = color_hex)) +
  scale_fill_identity() + # color each bar with its own hex code
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(facets = vars(partition_N), scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = NA),
        panel.grid.major = element_line(color = "gray85", linetype = "dotted"),
        panel.grid.minor = element_line(color = "gray85", linetype = "dotted"),
        legend.position = "none",
        axis.ticks = element_blank()) +
  labs(title = paste0("Share of pixels of each color in ", CITY2, "\nexcluding white"),
       x = "Color", y = "Pixel Share")