Exploring the Oddities of Public School Districts

geospatial

education

Published

October 1, 2022

Several weeks ago, I came across the tweet below from Kyle Walker who shared an example of how to pull a large amount of geospatial data with a few lines of code using the tigris package that he developed. The school district data that he showed in his example stood out for the complex boundaries of some states (i.e., Nebraska) compared to bordering states. In this post, I’m going to go through a brief exploration of American public school district boundaries using the tigris and sf packages, along with a tidyverse-style workflow, and identify how some districts are drawn much more differently than others.

Whoa, Nebraska really stands out with all the school district exclaves. pic.twitter.com/3hX9BMYPf2

— Josh Fangmeier (@joshfangmeier) August 4, 2022

Load school district data

To retrieve the school district boundary data, we will use the tigris package, which downloads and loads TIGER/Line shapefiles from the US Census bureau. The school_districts function in tigris downloads school district data, which the Census bureau collects from state education officials. For our analysis, we will use data on unified and elementary districts as of 2021, which combined gives us very good coverage.

library(tidyverse)
library(sf)
library(tigris)
library(mapview)
library(rmapshaper)
library(nngeo)
library(gt)
library(gtExtras)
library(measurements)

Not all states have elementary districts, so we use the safely workflow from the purrr package to keep states that do.

sch_uni_sf <- 
  map_dfr(
    c(state.abb, "DC"),
    ~school_districts(
      state = .x,
      type = "unified",
      year = 2021,
      progress_bar = F
    )
  )

safe_school_districts <- safely(school_districts)

sch_ele_sf <- 
  map(
    c(state.abb, "DC"),
    ~safe_school_districts(
      state = .x,
      type = "elementary",
      year = 2021,
      progress_bar = F
    )
  ) %>% 
  map("result") %>% 
  compact() %>% 
  bind_rows()

fips <- 
  tigris::fips_codes %>% 
  select(
    STATEFP = state_code,
    STUSPS = state,
    STATE_NAME = state_name) %>% 
  distinct()

sch_sf <- 
  bind_rows(
    sch_uni_sf,
    sch_ele_sf
    ) %>% 
  inner_join(
    fips,
    by = "STATEFP"
  )

glimpse(sch_sf)
## Rows: 12,805
## Columns: 18
## $ STATEFP    <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",~
## $ UNSDLEA    <chr> "00001", "00003", "00005", "00006", "00007", "00008", "0001~
## $ GEOID      <chr> "0100001", "0100003", "0100005", "0100006", "0100007", "010~
## $ NAME       <chr> "Fort Rucker School District", "Maxwell AFB School District~
## $ LSAD       <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "00",~
## $ LOGRADE    <chr> "KG", "KG", "KG", "PK", "KG", "PK", "KG", "PK", "KG", "KG",~
## $ HIGRADE    <chr> "12", "12", "12", "12", "12", "12", "12", "12", "12", "12",~
## $ MTFCC      <chr> "G5420", "G5420", "G5420", "G5420", "G5420", "G5420", "G542~
## $ SDTYP      <chr> "B", "B", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ FUNCSTAT   <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", "E", "E", "E",~
## $ ALAND      <dbl> 232948114, 8697481, 69770754, 1265301295, 124178343, 786179~
## $ AWATER     <dbl> 2801707, 372428, 258708, 103562192, 2505910, 339435, 601265~
## $ INTPTLAT   <chr> "+31.4097368", "+32.3809438", "+34.2631303", "+34.3821918",~
## $ INTPTLON   <chr> "-085.7458071", "-086.3637490", "-086.2106600", "-086.30505~
## $ ELSDLEA    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ STUSPS     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ STATE_NAME <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ geometry   <MULTIPOLYGON [°]> MULTIPOLYGON (((-85.86572 3..., MULTIPOLYGON (~

We can see that resulting data frame includes a geometry column of the ‘MULTIPOLYGON’ class. This means that each row (school district) may contains more than one land parcel.

To preview the school district boundaries, we can create a simplified map with ms_simplify from the rmapshaper package and view it with the mapView function.

sch_simp <- 
  ms_simplify(
    sch_sf, 
    sys = T, 
    sys_mem = 4)

m <- mapview(sch_simp)
m

m@map %>% 
  leaflet::setView(-96, 39, zoom = 4)

Split districts into separate polygons

Since I’m interested in understanding the boundaries of school districts and how many non-contiguous parcels some districts contains, we will split the districts into separate polygons. We can use the ms_explode function from the rmapshaper package to convert the data frame from a ‘MULTIPOLYGON’ geometry type to ‘POLYGON’. We will also add a row identifier to help keep track of each polygon.

sch_poly_sf <- 
  ms_explode(
    sch_sf, 
    sys = T, 
    sys_mem = 4) %>% 
  group_by(NAME, STUSPS) %>% 
  mutate(ID = row_number()) %>% 
  ungroup()

glimpse(sch_poly_sf)
## Rows: 23,458
## Columns: 19
## $ STATEFP    <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",~
## $ UNSDLEA    <chr> "00001", "00003", "00005", "00005", "00005", "00005", "0000~
## $ GEOID      <chr> "0100001", "0100003", "0100005", "0100005", "0100005", "010~
## $ NAME       <chr> "Fort Rucker School District", "Maxwell AFB School District~
## $ LSAD       <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "00",~
## $ LOGRADE    <chr> "KG", "KG", "KG", "KG", "KG", "KG", "PK", "PK", "PK", "PK",~
## $ HIGRADE    <chr> "12", "12", "12", "12", "12", "12", "12", "12", "12", "12",~
## $ MTFCC      <chr> "G5420", "G5420", "G5420", "G5420", "G5420", "G5420", "G542~
## $ SDTYP      <chr> "B", "B", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ FUNCSTAT   <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", "E", "E", "E",~
## $ ALAND      <dbl> 232948114, 8697481, 69770754, 69770754, 69770754, 69770754,~
## $ AWATER     <dbl> 2801707, 372428, 258708, 258708, 258708, 258708, 103562192,~
## $ INTPTLAT   <chr> "+31.4097368", "+32.3809438", "+34.2631303", "+34.2631303",~
## $ INTPTLON   <chr> "-085.7458071", "-086.3637490", "-086.2106600", "-086.21066~
## $ ELSDLEA    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ STUSPS     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ STATE_NAME <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ geometry   <POLYGON [°]> POLYGON ((-85.78931 31.4909..., POLYGON ((-86.37655~
## $ ID         <int> 1, 1, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13~

We can see above that the geometry column is now a ‘POLYGON’ geometry type, and we can now quickly calculate the median number of non-contiguous land parcels in each district. We can see below that Hawaii has one district for the whole state with 17 parcels (islands separated by water in this case). However, Nebraska (a very landlocked state) has a median number of 16 parcels per district, much higher than neighboring South Dakota with 3 parcels per district.

sch_poly_count <- 
  sch_poly_sf %>% 
  st_drop_geometry() %>% 
  count(STUSPS, NAME)

sch_poly_count %>% 
  group_by(STUSPS) %>% 
  summarize(
    SCH_DISTRICTS = n(),
    MEDIAN_PARCELS = median(n)) %>% 
  arrange(desc(MEDIAN_PARCELS)) %>% 
  head(10)
## # A tibble: 10 x 3
##    STUSPS SCH_DISTRICTS MEDIAN_PARCELS
##    <chr>          <int>          <dbl>
##  1 HI                 1             17
##  2 NE               244             16
##  3 SD               149              3
##  4 AL               140              2
##  5 MN               331              2
##  6 AK                53              1
##  7 AR               233              1
##  8 AZ               202              1
##  9 CA               855              1
## 10 CO               178              1

Measure the ‘compactness’ of each district

Along with calculating the number of land parcels, we can also measure the degree of ‘compactness’ that a school district has. Compactness measures are often used to determine the amount of gerrymandering of political districts. One measure is the convex hull score, which is the ratio of the area of the district to the area of the smallest convex polygon that could envelop the district.

Here is a quick illustration of a convex hull drawn around a district (in red). Convex hull scores range from 0 to 1, with scores closer to 1 indicating more compactness.

ggplot() +
  geom_sf(data = sch_sf %>% slice(1), fill = "red") +
  geom_sf(data = sch_sf %>% slice(1) %>% st_convex_hull(), fill = NA, size = 1) +
  theme_void()

To calculate convex hull scores for all the districts, we can write a function that calculates the score for a district after the areas of the district and convex hull are calculated.

ch_compactness <- function(geo_column) {
  dist_area <- st_area(geo_column)
  ch_area <- st_area(st_convex_hull(geo_column))
  
  ch_ratio <- as.numeric(dist_area / ch_area)
  
  return(ch_ratio)
}

This ch_compactness function is then applied to all the districts using the map_dbl function from purrr, since we are expecting numeric values for our scores. We can see again that Hawaii, Nebraska, and South Dakota are the states with the least amount of district compactness, since their median scores are the lowest.

sch_compactness <- 
  sch_sf %>% 
  mutate(CH_COMPACT = map_dbl(geometry, ch_compactness))

sch_compactness %>% 
  st_drop_geometry() %>% 
  group_by(STUSPS) %>% 
  summarize(
    SCH_DISTRICTS = n(),
    MED_CH_COMPACT = median(CH_COMPACT)) %>% 
  arrange(MED_CH_COMPACT) %>% 
  head(10)
## # A tibble: 10 x 3
##    STUSPS SCH_DISTRICTS MED_CH_COMPACT
##    <chr>          <int>          <dbl>
##  1 HI                 1         0.0670
##  2 NE               244         0.662 
##  3 SD               149         0.754 
##  4 AL               140         0.754 
##  5 OK               510         0.779 
##  6 MN               331         0.779 
##  7 AR               235         0.779 
##  8 OR               197         0.781 
##  9 MO               516         0.784 
## 10 ND               174         0.784

Identify district enclaves

The final metric that we will explore is how many enclaves each school district has. Wikipedia has a helpful explainer of what an enclave is, but in short we are looking for parcels of a district or whole district that are completely surrounded by another school district. This could be explained by some land being carved out of a district for a nearby district or a city district that is surrounded by its county district neighbor. Here is a quick plot showing the enclave that is Tuscaloosa City, Alabama (surrounded by Tuscaloosa County).

ggplot() +
  geom_sf(
    data = sch_sf %>% filter(GEOID %in% c("0103390", "0103360")),
    aes(fill = NAME)) +
  scale_fill_viridis_d() +
  labs(fill = NULL) +
  theme_void() +
  theme(legend.position = 'bottom')

To identify enclaves, we process each of the polygons to remove any holes or empty spaces within their boundaries using the st_remove_holes function from the nngeo package.

sch_rem_holes_sf <- 
  sch_poly_sf %>% 
  select(
    ENCLAVED_BY_NAME = NAME, 
    ENCLAVED_BY_STUSPS = STUSPS
  ) %>% 
  st_remove_holes()

The next step is to join the original polygon data frame with the new ‘hole-free’ data frame to see which polygons fit within the boundaries of other polygons. To do this, we can use the st_join function from the sf package with the st_within join option that looks for matches where the polygon fits within the polygons of the joined data frame.

sch_within_sf <- 
  sch_poly_sf %>%
  st_join(
    sch_rem_holes_sf,
    join = st_within
  ) %>% 
  filter(
    NAME != ENCLAVED_BY_NAME,
    !is.na(ENCLAVED_BY_NAME)
  ) %>% 
  distinct(
    NAME,
    STUSPS,
    ID,
    .keep_all = T
  )

sch_within_sf %>% 
  st_drop_geometry() %>% 
  count(STUSPS, sort = T)
## # A tibble: 41 x 2
##    STUSPS     n
##    <chr>  <int>
##  1 NE      3227
##  2 AL      1743
##  3 SD       560
##  4 MI       500
##  5 WI       448
##  6 GA       435
##  7 MN       363
##  8 OK       162
##  9 ND       142
## 10 TN       138
## # ... with 31 more rows

We can see that Nebraska has the most district enclaves and nearly twice as many enclaves as second place Alabama. We can also plot the enclaves to see the state-by-state variation.

states_sf <- 
  states(cb = TRUE, progress_bar = F) %>% 
  filter(STUSPS %in% c(state.abb, "DC")) %>% 
  shift_geometry()

ggplot() +
  geom_sf(
    data = states_sf,
    fill = "grey10") +
  geom_sf(
    data = sch_within_sf %>% shift_geometry(), 
    fill = "orange",
    color = NA) + 
  labs(
    title = "School District Enclaves"
  ) +
  theme_void()

We can also zoom in on Nebraska and its bordering states to see the contrast more clearly.

states_of_interest <- c("NE", "KS", "CO", "WY", "SD", "IA", "MO")

ggplot() +
  geom_sf(
    data = states_sf %>% 
      filter(STUSPS %in% states_of_interest),
    fill = "grey10") +
  geom_sf(
    data = sch_within_sf %>% 
      shift_geometry() %>% 
      filter(STUSPS %in% states_of_interest), 
    fill = "orange",
    color = NA) + 
  labs(
    title = paste0(
      "School District Enclaves in ",
      glue::glue_collapse(states_of_interest, sep = ", ", last = ", and ")
    )
  ) +
  theme_void()

Wrapping up with summary table

Finally, we can identify the districts with the most enclaves and bring together the other metrics we calculated together into one table. First, we can create a new data frame that brings the enclave count, district area, enclave area, parcel count, and compactness scores together.

sch_enclaves <- 
  sch_within_sf %>% 
  st_drop_geometry() %>% 
  count(NAME, STUSPS, sort = T)

sch_enclaves_area <- 
  sch_within_sf %>% 
  group_by(NAME, STUSPS) %>% 
  summarize() %>% 
  ungroup() %>% 
  mutate(ENCLAVE_AREA = st_area(geometry)) %>% 
  st_drop_geometry()

sch_tbl_df <- 
  sch_sf %>% 
  inner_join(
    sch_enclaves %>% 
      rename(ENCLAVES = n),
    by = c("NAME", "STUSPS")
  ) %>% 
  inner_join(
    sch_compactness %>% 
      st_drop_geometry() %>% 
      select(
        NAME,
        STUSPS,
        CH_COMPACT
      ),
    by = c("NAME", "STUSPS")
  ) %>% 
  inner_join(
    sch_poly_count %>% 
      rename(PARTS = n),
    by = c("NAME", "STUSPS")
  ) %>% 
  inner_join(
    sch_enclaves_area,
    by = c("NAME", "STUSPS")
  ) %>% 
  mutate(TOTAL_AREA = st_area(geometry))

Then we can create a formatted table using the gt package and apply additional styling using the gtExtras package. This table includes the 20 districts with the most enclaves. We can embed plots of each district on the same row as other values, and we can also include bar charts to visualize the data even further.

sch_plot <- 
  sch_tbl_df %>% 
  arrange(desc(ENCLAVES)) %>% 
  head(n = 20) %>% 
  mutate(
    ENCLAVE_RATIO = as.numeric(ENCLAVE_AREA) / as.numeric(TOTAL_AREA),
    TOTAL_AREA_MI = conv_unit(as.numeric(TOTAL_AREA), from = "m2", to = "mi2"),
    PLOT = map(
      geometry,
      ~(ggplot(data = .x) +
        geom_sf(fill = "purple") +
        theme_void()) %>% 
        ggplot_image(height = px(125))
      )) %>% 
  st_drop_geometry() %>% 
  select(
    PLOT,
    NAME,
    STATE_NAME,
    TOTAL_AREA_MI,
    ENCLAVES,
    ENCLAVE_RATIO,
    CH_COMPACT
  )

sch_plot %>% 
  gt() %>% 
  cols_label(
    PLOT = "District Boundaries",
    NAME = "District Name",
    STATE_NAME = "State",
    TOTAL_AREA_MI = "Total Area (square miles)",
    ENCLAVES = "Number of Enclaves",
    ENCLAVE_RATIO = "Share of Area in Enclaves",
    CH_COMPACT = "District Compactness"
  ) %>%
  fmt_markdown(c(PLOT)) %>%
  fmt_number(
    c(ENCLAVES),
    decimals = 0
  ) %>% 
  fmt_number(
    c(TOTAL_AREA_MI),
    decimals = 1
  ) %>% 
  fmt_percent(
    c(ENCLAVE_RATIO)
  ) %>% 
  gt_theme_pff(
    divider = TOTAL_AREA_MI
  ) %>% 
  gt_plt_bar(
    column = CH_COMPACT,
    target = 1,
    width = 45,
    palette = c("purple", "black")
  ) %>% 
  tab_header(
    title = "School Districts with Most Enclaves",
    subtitle = "Area, enclave, and compactness stats"
  ) %>% 
  gt_highlight_cols(columns = ENCLAVES, fill = "#e4e8ec")

District Boundaries	District Name	State	Total Area (square miles)	Number of Enclaves	Share of Area in Enclaves
School Districts with Most Enclaves
Area, enclave, and compactness stats
	Jefferson County School District	Alabama	806.6	403	1.29%
	Sylacauga City School District	Alabama	20.4	169	100.00%
	Madison County School District	Alabama	610.7	157	2.53%
	North Huron School District	Michigan	182.7	124	0.80%
	Shelby County School District	Alabama	717.1	97	1.21%
	Lee County School District	Alabama	487.9	96	1.11%
	Humboldt Table Rock Steinauer Public Schools	Nebraska	384.0	93	5.10%
	Limestone County School District	Alabama	515.3	92	2.67%
	Gordon County School District	Georgia	342.7	91	0.52%
	Walker County School District	Alabama	775.8	84	0.19%
	Tuscaloosa County School District	Alabama	1,280.5	70	0.49%
	Norfolk Public Schools	Nebraska	144.9	62	7.40%
	East Butler Public Schools	Nebraska	289.8	53	3.25%
	Columbus City School District	Ohio	137.2	51	3.30%
	Hall County School District	Georgia	390.2	50	0.49%
	Langford School District 45-5	South Dakota	421.3	50	5.07%
	Bartow County School District	Georgia	440.8	49	0.17%
	Whitfield County School District	Georgia	269.9	49	0.21%
	Cozad Community Schools	Nebraska	240.6	47	4.81%
	Montgomery County School District	Alabama	763.4	46	0.06%