Working with ACS microdata in R

Using ipumsr, geocorr, and survey to analyze a small geography in IPUMS data.
R
Data Management
Author

Peter Amerkhanian

Published

December 29, 2024

Code
pacman::p_load(dplyr,
               ggplot2,
               # Statistics
               modelsummary,
               srvyr,
               survey,
               # Webscraping
               httr,
               rvest,
               readr,
               glue,
               # Census
               tidycensus,
               ipumsr,
               # XML Parsing
               xml2,
               purrr,
               stringr)

The following are my notes on how to use American Community Survey (ACS) microdata, leveraging the University of Minnesota’s Integrated Public Use Microdata Series (IPUMS).1 I cover:

  1. Retrieval: how to choose which ACS product is relevant, how to submit a request to IPUMS, and how to filter down to relevant levels of geography/granularity.
  2. Analysis: how to properly weight ACS data using sample and replication weights for accurate analysis and uncertainty estimation.

I’ll start with a motivating question: What was the median household income in Oakland, California in 2022?

Aggregate data with tidycensus

Answering that question is straightforward using the Census’ aggregate data – pre-calculated descriptive statistics for aggregate geographies. I typically retrieve aggregate data via the Census API.2 In R, the tidycensus package provides an easy-to-use wrapper for that API. Note that I set up an API key for the U.S. Census and I’m storing it in my .Renviron file as census_api_key="yourkeyhere".

census_api_key(Sys.getenv("census_api_key"))

I query B19013_001, the median household income variable, using the 2022 1-year American Community Survey sample and I filter down to Oakland’s GEOID, 0653000, which is a combination of the state code for California, 06, and the place code for Oakland, 53000.3 I’ll throw in the total population variable for good measure:

oakland_stats <- get_acs(
  geography = "place",
  variables = c(
    median_hh_income = "B19013_001",
    total_pop = "B17001_001"
  ),
  state = "CA",
  year = 2022,
  survey = "acs1"
)
oakland_stats <- oakland_stats %>% filter(GEOID == '0653000')
oakland_stats %>% select(c(variable, estimate, moe))
Table 1: Total Population and Median Household Income in Oakland, CA via Aggregate Data
variable estimate moe
total_pop 426323 811
median_hh_income 93146 6232

Done! We now know the population and median household income for Oakland in 2022, along with a margin of error. See (Walker 2023) for a comprehensive treatment of working with aggregate census data.
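One note on the moe column: ACS margins of error are published at a 90% confidence level, so dividing an MOE by 1.645 (the 90% critical value of the normal distribution) recovers the estimate's standard error. I'll use that conversion later when comparing methods; here's a minimal sketch:

# ACS margins of error are reported at the 90% confidence level,
# so moe / 1.645 recovers the standard error of each estimate
oakland_stats %>%
  mutate(se = moe / 1.645) %>%
  select(variable, estimate, moe, se)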

Microdata with IPUMS/ipumsr

What if I wanted to have access to the underlying data used to calculate the median? Maybe I want to try a different standard error specification for that statistic, or calculate other statistics, like the average household income. These tasks would all entail accessing census microdata – household and/or individual level census data.

One of the most popular sources for downloading census microdata is the University of Minnesota’s Integrated Public Use Microdata Series (IPUMS). The IPUMS team provides a centralized API for downloading census microdata, comprehensive documentation for working with the data, and harmonized variables across time (Walker 2023, chap. 9).

The easiest way to access IPUMS data in R is with the ipumsr package, which the IPUMS team maintains and which allows users to submit API requests to IPUMS directly from R (Greg Freedman Ellis, Derek Burk, and Finn Roberts 2024). To get set up, I registered for an IPUMS API key, stored the key in my .Renviron file, and will configure the key in ipumsr as follows:

set_ipums_api_key(Sys.getenv("ipums_api_key"))

The ipumsr website provides details on what survey products the project currently supports, as does the ipums_data_collections() function:

ipums_data_collections() %>%
  filter(api_support == TRUE) %>% 
  arrange(desc(collection_type))
Table 2: IPUMS API Collections
collection_name collection_type code_for_api api_support
IPUMS USA microdata usa TRUE
IPUMS CPS microdata cps TRUE
IPUMS International microdata ipumsi TRUE
IPUMS ATUS microdata atus TRUE
IPUMS AHTUS microdata ahtus TRUE
IPUMS MTUS microdata mtus TRUE
IPUMS NHIS microdata nhis TRUE
IPUMS MEPS microdata meps TRUE
IPUMS NHGIS aggregate data nhgis TRUE

I’ll look at IPUMS USA since my motivating question involves median household income for a year (2022), and IPUMS USA offers data from decennial censuses 1790-2010 and annual American Community Surveys (ACS) 2000-present (Ruggles et al. 2024). We can check out the newest products they have in the USA collection as follows:

get_sample_info(collection="usa") %>%
  arrange(desc(name)) %>%
  head(5)
Table 3: IPUMS USA Products
name description
us2023b 2023 PRCS
us2023a 2023 ACS
us2022d 2018-2022, PRCS 5-year
us2022c 2018-2022, ACS 5-year
us2022b 2022 PRCS

Note that “PRCS” refers to the Puerto Rico Community Survey (an ACS equivalent specifically tailored to Puerto Rico). We are principally interested in the ACS, which comes in either one-year (e.g. 2023 ACS) or five-year (e.g. 2018-2022, ACS 5-year) estimates. The differences between the two are described in detail in Census Data: An Overview in (Walker 2023). One differentiating point is that one-year estimates come from a smaller, but more contemporary, sample. In our case, we’ll use the one-year sample to get the most current picture of 2022 income dynamics.

Let’s return to the motivating question for this post: What was the median household income in Oakland, California in 2022?

To answer that we will:

  1. Get income data from the 2022 1-year ACS (Section 3).
  2. Filter our data down to households in the city of Oakland (Section 4 and Section 5).
  3. Calculate the median (Section 6 and Section 7).

Step 1: Retrieving data

For the first task, I’ll define several helper functions:

  • retrieve_sample() retrieves a list of variables from a given ACS sample from the IPUMS API. This is the only truly necessary code.
  • check_description() and ipums_data() are functions for checking whether the target extract already exists locally. Retrieving an extract can take some time, and I’d like to avoid re-downloading an extract I already have.

Each of these is defined in the following folded code block.

Code
retrieve_sample <- function(sample, variables, description) {
  extract <- define_extract_micro(
    description = description,
    collection = "usa",
    samples = c(sample),
    variables = variables
  )
  data_path <- extract %>%
    submit_extract() %>%
    wait_for_extract() %>%
    download_extract(download_dir = here::here("data"),
                     overwrite = TRUE)
  data <- read_ipums_micro(data_path)
  return(data)
}

check_description <- function(file, description) {
  xml_text <- readLines(file, warn = FALSE) %>% paste(collapse = "\n")
  user_description <- str_match(xml_text, "User-provided description:\\s*(.*)]]")[, 2]
  # Check if the extract's stored description matches the target
  # (guard against files with no description, where str_match returns NA)
  if (!is.na(user_description) &&
      str_detect(user_description, fixed(description, ignore_case = TRUE))) {
    return(file)
  } else {
    return(NULL)
  }
}

ipums_data <- function(sample, variables, description) {
  local_ipums_extracts <- list.files(
    path = here::here('data'),
    pattern = "\\.xml$",
    full.names = TRUE
  )
  matching_files <- compact(map(local_ipums_extracts, \(x) check_description(x, description)))
  # If there is a match, open the first matching file
  if (length(matching_files) > 0) {
    matched_file <- matching_files[[1]]
    message("Opening: ", matched_file)
    data <- read_ipums_micro(matched_file)
  } else {
    message("No matching dataset found.")
    data <- retrieve_sample(sample, variables, description)
  }
  return(data)
}

I’ll proceed to define a list of variables that I want, including HHINCOME (household income) and INCTOT (individual income). Some of these variables use census-specific language – e.g. PUMA, REPWT, REPWTP. I’ll cover exactly what each of these represents later in the post, but in the meantime I’ll also note that the IPUMS website has full reference materials for all available variables.

variables <- list(
  "PUMA",
  "AGE",
  "SEX",
  "EDUC",
  "HHINCOME",
  "INCTOT",
  "REPWT",
  "REPWTP",
  var_spec("STATEFIP",
           case_selections = "06")
)

Note the line, var_spec("STATEFIP", case_selections = "06"). This selects the variable STATEFIP, while also specifying that we want to restrict our request to data where STATEFIP=='06' (California). Using var_spec() is important, as accidentally/unnecessarily downloading an unfiltered, full sample of the entire U.S. is time-consuming.
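As an aside, case_selections accepts a vector, so a hypothetical request spanning multiple states would look something like the following sketch (not run here; “41” is Oregon’s FIPS code):

# Hypothetical: restrict the extract to California and Oregon
var_spec("STATEFIP", case_selections = c("06", "41"))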

Anyways, I can request these variables from the 2022 1-year ACS via my ipums_data() function.

data <- ipums_data("us2022a", variables, "Incomes by PUMA")

Here are the resulting data, the 2022 1-year ACS for California.

data %>% head()
Table 4: Unfiltered 2022 1-year ACS data for California
YEAR SERIAL HHWT STATEFIP PUMA GQ HHINCOME PERNUM PERWT SEX AGE EDUC INCTOT …
2022 74692 14 6 6509 4 9999999 1 14 2 56 6 14500 …
2022 74693 27 6 6501 3 9999999 1 27 1 52 0 0 …
2022 74694 70 6 8101 3 9999999 1 70 1 61 7 80 …
2022 74695 22 6 8303 4 9999999 1 22 2 26 7 9000 …
2022 74696 8 6 6712 3 9999999 1 8 2 38 6 48000 …
2022 74697 49 6 7301 4 9999999 1 49 1 23 6 24000 …
(Columns SAMPLE, CBSERIAL, REPWT, CLUSTER, STRATA, FAMUNIT, RELATE, RELATED, EDUCD, REPWTP, and the 80 REPWT and 80 REPWTP replicate-weight columns are elided here for readability.)

Step 2: Using Geocorr to identify small geographies

We now have microdata for all of California, but we need to filter down to just Oakland. Unfortunately, this isn’t as simple as just running filter(CITY == 'Oakland') – ACS microdata does not include a field for explicitly identifying cities (note that a city is typically referred to as a “place” in census data).

The smallest geographic area explicitly identified in the microdata is something called a public use microdata area (PUMA) (Pastoor 2024). PUMAs are non-overlapping geographies, each containing at least 100,000 people, that always aggregate to the state level (e.g. California can be constructed from a collection of PUMAs), but only sometimes aggregate to other small geographic areas, such as cities, metro areas, and counties (See Census Hierarchies in Walker 2023, chap. 1).

Code
library(tigris)
library(sf)
library(cowplot)
options(tigris_use_cache = TRUE)

bay_area_counties <- c(
  "Alameda", "Contra Costa", "Marin", "Napa",
  "San Francisco", "San Mateo", "Santa Clara", "Solano", "Sonoma"
)

bay_area <- counties(state = "CA", class = "sf", year=2022) %>%
  filter(NAME %in% bay_area_counties) %>%
  st_union()  # Merge counties into a single boundary

# Get all California Places & PUMAs
places <- places(state = "CA", class = "sf", year=2022)
pumas <- pumas(state = "CA", class = "sf",  year=2022)
# Filter only those in the Bay Area
places_bay <- places[st_intersects(places, bay_area, sparse = FALSE), ]
pumas_bay <- pumas[st_intersects(pumas, bay_area, sparse = FALSE), ]

pumas_bbox <- st_bbox(bay_area)

crop_theme <- theme(
    panel.spacing = unit(0, "lines"),      
    plot.margin = margin(0, 0, 0, 0, "cm")
)

map_places <- ggplot() +
  geom_sf(data = places_bay, fill = "lightblue", color = "darkgrey", size = 0.1) +
  geom_sf(data = bay_area, fill = NA, color = "red", size = 4) +  
  coord_sf(ylim = c(pumas_bbox["ymin"], pumas_bbox["ymax"]), xlim = c(pumas_bbox["xmin"], pumas_bbox["xmax"]), expand=FALSE) +  # Apply PUMAs y-limits
  crop_theme +
  theme_minimal() +
  ggtitle("Places")

map_pumas <- ggplot() +
  geom_sf(data = pumas_bay, fill = "lightblue", color = "darkgrey", size = 0.1) +
  geom_sf(data = bay_area, fill = NA, color = "red", size = 4) +  
  coord_sf(ylim = c(pumas_bbox["ymin"], pumas_bbox["ymax"]), xlim = c(pumas_bbox["xmin"], pumas_bbox["xmax"]), expand=FALSE) +  # Ensure y-axis is the same
  crop_theme +
  theme_minimal() +
  ggtitle("PUMAs")


plot_grid(map_places, map_pumas, nrow = 1, align = "h", axis="t", vjust = 0)
Places and PUMAs in the Bay Area (region border in red)

To find out if a city corresponds to a collection of PUMAs, and which PUMAs those are, we’ll use Geocorr (geographic correspondence engine), an application that generates correlation lists showing relationships between two or more geographic coverages in the United States (Mihalik, Rice, and Hesser 2022). Geocorr is a sponsored program of the Missouri State Library and is published by the University of Missouri Center for Health Policy.4

Geocorr 2022: Geographic Correspondence Engine

To use Geocorr, I’ll define a function, geocorr_2022(), that queries Geocorr 2022 and retrieves a .csv file establishing the relationships between two sets of geographies within a given state. See the following folded code chunk for the specifics:

Code
geocorr_2022 <- function(state, geo_1, geo_2, weight_var) {
  base_url <- "https://mcdc.missouri.edu"
  params <- glue(
    "cgi-bin/broker?_PROGRAM=apps.geocorr2022.sas&",
    "_SERVICE=MCDC_long&_debug=0&",
    "state={state}&g1_={geo_1}&g2_={geo_2}&wtvar={weight_var}&",
    "nozerob=1&fileout=1&filefmt=csv&lstfmt=txt&title=&",
    "counties=&metros=&places=&oropt=&latitude=&longitude=&",
    "distance=&kiloms=0&locname=")
  initial_url <- params %>% url_absolute(base = base_url)
  initial_response <- GET(initial_url)
  html_content <- content(initial_response, as = "text")
  parsed_html <- read_html(html_content)
  # Extract the one link
  csv_url <- parsed_html %>%
    html_node("a") %>%
    html_attr("href") %>%
    stringr::str_trim() %>%
    url_absolute(base = base_url)
  csv_data <- read_csv(csv_url)
  return(csv_data)
}

We’ll use that function to establish the relationships between California’s 2022 PUMAs and its “places,” using individual population as measured in the 2020 Decennial Census to weight the relationships.

csv_data <- geocorr_2022("Ca06", "puma22", "place", "pop20")

With that, we can see which PUMAs correspond to the City of Oakland.

csv_data %>%
  select(-c(state, stab, place, PUMA22name)) %>%
  filter(PlaceName == 'Oakland city, CA')
Table 5: PUMA to Place correspondence for Oakland, CA
puma22 PlaceName pop20 afact
00111 Oakland city, CA 106433 1
00112 Oakland city, CA 106896 1
00113 Oakland city, CA 125840 1
00114 Oakland city, CA 9 0
00123 Oakland city, CA 101468 1

The AFACT (allocation factor) column shows the proportion of the source area contained in the target area – here, the proportion of each PUMA’s population that belongs to Oakland. 100% of the populations in PUMAs 111, 112, 113, and 123 belong to Oakland, and 0% of the population in PUMA 114. Geocorr does report that 9 individuals from PUMA 114 live in Oakland, but given the AFACT of 0 and the fact that I know the PUMA overlays the neighboring city of Piedmont, I’m comfortable dropping that PUMA.5
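As a quick sanity check on the correspondence, we can multiply each PUMA’s population by its allocation factor and sum. The result should approximate Oakland’s total 2020 Decennial Census population:

# Population-weighted sum over Oakland's correspondence rows --
# this should roughly reproduce Oakland's 2020 Census population (~440,000)
csv_data %>%
  filter(PlaceName == 'Oakland city, CA') %>%
  summarise(oakland_pop_2020 = sum(pop20 * afact))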

Note that when you plot place-to-PUMA correspondence, you may find slight differences, because places can include unpopulated areas – such as bodies of water in the case of Oakland.

Code
# Filter PUMAs for Oakland
oakland_pumas <- pumas_bay %>% filter(PUMACE20 %in% c('00111', '00112', '00113', '00123'))
oakland_place <- places_bay %>% filter(NAME == 'Oakland')
# Get bounding box for Oakland PUMAs
oakland_bbox <- st_bbox(oakland_place)

crop_theme <- theme(
    panel.spacing = unit(0, "lines"),      
    plot.margin = margin(0, 0, 0, 0, "cm"),
    axis.title.y = element_blank(),
    axis.title.x = element_blank()
)
# Oakland Places Map
map_places <- ggplot() +
  geom_sf(data = oakland_place, fill = "lightblue", color = "darkgrey", size = 0.1) +
  geom_sf(data = oakland_place, fill = NA, color = "red", size = 4) +  
  coord_sf(ylim = c(oakland_bbox["ymin"], oakland_bbox["ymax"]), 
           xlim = c(oakland_bbox["xmin"], oakland_bbox["xmax"]), 
           expand = FALSE) +
  theme_minimal() +
  crop_theme +
  ggtitle("Oakland (Place)")


# Oakland PUMAs Map
map_pumas <- ggplot() +
  geom_sf(data = oakland_pumas, fill = "lightblue", color = "darkgrey", size = 0.1) +
  geom_sf_text(data = oakland_pumas, aes(label = PUMACE20)) +
  geom_sf(data = oakland_place, fill = NA, color = "red", size = 4) +
  #geom_sf(data = oakland_pumas, fill = NA, color = "red", size = 4) +  
  coord_sf(ylim = c(oakland_bbox["ymin"], oakland_bbox["ymax"]), 
           xlim = c(oakland_bbox["xmin"], oakland_bbox["xmax"]), 
           expand = FALSE) +
  theme_minimal() +
  crop_theme +
  ggtitle("Oakland (PUMAs)")

plot_grid(map_places, map_pumas, nrow = 1, align = "h", axis = "t", vjust = 0)
Place to PUMAs Correspondence in Oakland (place border in red)

Filtering to those PUMAs gets us the 2022 1-year ACS microdata for the City of Oakland (I use haven::zap_labels() just to remove some unnecessary formatting that comes with the data).

oakland_pumas <- c(111, 112, 113, 123)
oak <- data %>%
  filter(PUMA %in% oakland_pumas) %>% 
  haven::zap_labels()
oak %>% head()
Table 6: 2022 1-year ACS data for Oakland, CA
YEAR SERIAL HHWT STATEFIP PUMA GQ HHINCOME PERNUM PERWT SEX AGE EDUC INCTOT …
2022 74718 5 6 111 3 9999999 1 5 1 20 7 0 …
2022 74737 56 6 111 3 9999999 1 56 1 56 6 480 …
2022 74738 15 6 113 4 9999999 1 15 1 34 6 1200 …
2022 75005 38 6 113 4 9999999 1 38 2 40 2 41300 …
2022 75119 20 6 111 3 9999999 1 20 2 88 2 5800 …
2022 75131 11 6 123 3 9999999 1 11 2 86 2 0 …
(Columns SAMPLE, CBSERIAL, REPWT, CLUSTER, STRATA, FAMUNIT, RELATE, RELATED, EDUCD, REPWTP, and the 80 REPWT and 80 REPWTP replicate-weight columns are elided here for readability.)

Step 3: Units of Analysis

Each row in the ACS microdata is an individual, identified by a unique combination of SAMPLE, which defines the year when the individual was surveyed, SERIAL, a unique identifier for that individual’s household, and PERNUM, a unique identifier for the individual within their household (Ruggles et al. 2024). Thus, we can identify units as follows:

  • Households: The combination of SAMPLE and SERIAL provides a unique identifier for every household in the IPUMS
  • Individuals: The combination of SAMPLE, SERIAL, and PERNUM provides a unique identifier for every person in the IPUMS

We can group by these variable combinations and see how many individuals and households were surveyed across PUMAs in Oakland for the 2022 1-year ACS. We can see that each row in the data represents an individual (n_rows equals n_individuals) and, as we would expect, the number of households is much lower than the number of individuals.

oak %>% group_by(PUMA) %>% summarise(
  n_rows = n(),
  n_individuals = n_distinct(SAMPLE, SERIAL, PERNUM),
  n_households = n_distinct(SAMPLE, SERIAL)
  )
Table 7: Oakland Dataset Granularity by PUMA
PUMA n_rows n_individuals n_households
111 1083 1083 578
112 1152 1152 547
113 989 989 385
123 905 905 389

I’ll also randomly select a household in the data to see what such a unit looks like in practice.

Code
household_serials <- oak %>%
  group_by(SERIAL) %>%
  count() %>%
  filter(n > 1) %>%
  pull(SERIAL)
set.seed(2)
sample_household <- sample(household_serials, 1)
n <- oak %>% filter(SERIAL == sample_household) %>% dim() %>% .[1]
oak %>% filter(SERIAL == sample_household) %>% 
  select(c(SERIAL, PERNUM, AGE, SEX, HHINCOME, INCTOT))
Table 8: An example household with income data
SERIAL PERNUM AGE SEX HHINCOME INCTOT
211975 1 31 2 27300 9300
211975 2 13 1 27300 9999999
211975 3 3 2 27300 9999999
211975 4 27 1 27300 18000

Here we can see that this household, with SERIAL 211975, has 4 members, each with a unique PERNUM.

Let’s return to the motivating question – what was the median household income in Oakland, California in 2022? Note the income variables we observe for this family:

  1. INCTOT reports each respondent’s total pre-tax personal income or losses from all sources for the previous year. 9999999 is a code denoting that the value is missing, which makes sense given that the missing values above correspond to children in the household (See INCTOT in Ruggles et al. 2024).

  2. HHINCOME reports the total money income of all household members age 15+ during the previous year. The amount should equal the sum of all household members’ individual incomes, as recorded in the person-record variable INCTOT (See HHINCOME in Ruggles et al. 2024).

Given what we know about the unique identifier for households, and the HHINCOME variable, we can construct the appropriate dataset for answering our motivating question – every household in Oakland that had household income.

households_w_income <- oak %>%
  distinct(SAMPLE, SERIAL, .keep_all = TRUE) %>% 
  filter(HHINCOME != 9999999, HHINCOME > 0)

So can we proceed to simply calculate the median of the HHINCOME column? Not so fast – data in the ACS microdata are not what they seem. Before we do any analysis, we have to account for sample weights.

Step 4: Applying sample weights

Let’s return to our sample family from above, but also examine the variables PERWT and HHWT.

oak %>% filter(SERIAL == sample_household) %>% 
  select(c(AGE, SEX, HHINCOME, INCTOT, PERWT, HHWT))
Table 9: An example household with weight data
AGE SEX HHINCOME INCTOT PERWT HHWT
31 2 27300 9300 62 62
13 1 27300 9999999 60 62
3 2 27300 9999999 120 62
27 1 27300 18000 63 62

These are the two primary sample weights in ACS microdata, and they can be interpreted fairly directly.

  • PERWT gives the population represented by each individual in the sample; thus, in the first row of the sample household, the 31-year-old woman with an individual income of $9,300 represents 62 individuals in the PUMA.
  • HHWT gives the number of households in the general population represented by each household in the sample; thus, this household is representative of 62 households in that PUMA.

Any analysis or visualization of ACS microdata should be weighted by PERWT for work at the person-level, and HHWT for the household-level (See Sample Weights in Ruggles et al. 2024).
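A direct consequence of this interpretation is that the weights sum to population totals. Summing PERWT across all person-rows estimates the total population, while summing HHWT across one row per household estimates the total number of households – a minimal sketch:

# Summing person weights over person-rows estimates Oakland's population;
# summing household weights over one row per household estimates households
oak %>%
  summarise(est_population = sum(PERWT))
oak %>%
  distinct(SAMPLE, SERIAL, .keep_all = TRUE) %>%
  summarise(est_households = sum(HHWT))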

For visualizations, we pass the weights to the plotting API. For example, in ggplot2, many geoms accept a weight aesthetic.

ggplot(households_w_income, aes(x = HHINCOME, weight = HHWT)) +
  geom_histogram(
    binwidth = 50000,
    fill = "lightblue",
    alpha = 1,
    color = "grey"
  ) +
  theme_minimal() +
  scale_x_continuous(
    labels = scales::label_currency(scale_cut = scales::cut_short_scale()),
    breaks = seq(0, 1600000, 200000)
  ) +
  scale_y_continuous(labels=scales::label_comma()) +
  labs(x = "Household Income", y = "Number of Households")
Household Income Distribution in Oakland (with household weights)

For analysis, we use the srvyr package to define the survey weights before using them to calculate statistics. For example, here we’ll finally address the motivating question. The median household income in Oakland in 2022 as measured in the IPUMS microdata was as follows:

households_w_income %>%
  as_survey(weights=HHWT) %>%
  summarise(weighted_median = survey_median(HHINCOME)) %>% 
  select(weighted_median)
Table 10: 2022 Median Household Income in Oakland, CA
weighted_median
90000

Let’s do a quick comparison of our IPUMS results to the aggregate census data we retrieved in the first section. Here are our full IPUMS results for both median household income and population:

Code
median_table <- households_w_income %>%
  as_survey(weights=HHWT) %>% 
  summarise(weighted_median = survey_median(HHINCOME)) %>% 
  mutate(variable = "median_hh_income",
         ipums_estimate = weighted_median,
         se = weighted_median_se)

count_table <- oak %>%
  as_survey(weights=PERWT) %>% 
  survey_count() %>% 
  mutate(variable = "total_pop",
         ipums_estimate = n,
         se = n_se)

aggregate_data <- oakland_stats %>%
  select(c(variable, estimate)) %>%
  rename(ACS_aggregate_estimate = estimate)

bind_rows(count_table, median_table) %>% 
  select(c(variable, ipums_estimate)) %>% inner_join(aggregate_data, by='variable')
Table 11: IPUMS versus ACS aggregate results
variable ipums_estimate ACS_aggregate_estimate
total_pop 430052 426323
median_hh_income 90000 93146

These are clearly different. What gives? Unfortunately, summary statistics calculated using IPUMS data typically cannot match aggregate ACS figures!

One major reason for this gap is additional sampling error. Recall that the American Community Survey is itself a sample – a given one-year ACS is typically a 1% sample of the U.S. population, with associated sampling error. When the Census Bureau makes microdata available, it creates a sample of that sample – we do not get the full 1%. This second sampling process introduces further sampling error into the microdata that is not reflected in figures sourced from aggregate ACS data, which are calculated using the full ACS sample (See ACS in Ruggles et al. 2024).
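We can see the size of this sample-of-a-sample in our own extract: dividing the number of person-records by the population those records represent gives the implied sampling fraction, which should come out on the order of 1%:

# Implied sampling fraction for the Oakland extract:
# person-records divided by the population they represent (~1%)
nrow(oak) / sum(oak$PERWT)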

Step 5: Calculating standard errors

Now we have estimates derived from the ACS microdata, but how do we communicate uncertainty around those estimates? For the sake of this example, let’s explore the problem of standard errors via a slightly different motivating question – what was the average household income in Oakland in 2022?

I’m switching to the average because we are about to calculate some standard errors from scratch, and that’s much more clear-cut for linear statistics like the average than for non-linear statistics like the median. We will return to the median at the end of this section, but there we’ll rely on a package implementation rather than doing anything from scratch.

Anyways, let’s calculate the average household income in Oakland using the sample household weights. The formula for the weighted average is: \[ \bar{X}_w = \frac{\sum_{i=1}^{n}w_ix_i}{\sum_{i=1}^{n}w_i} \] We’ll put that into code and get the following estimate of the average household income in Oakland:

weighted_mean <- (
  sum(households_w_income$HHINCOME * households_w_income$HHWT)
  / sum(households_w_income$HHWT)
  )
weighted_mean
[1] 142876.5

Now, let’s get a standard error for that estimate. Correctly utilizing the sample weights in the variance calculation is a little tricky and there are a few different approaches to the problem. I’ll cover two techniques for variance estimation: the Taylor series method (also referred to as “linearized variance” or Taylor series linearization) and replication weights. Each of these are implemented in the survey package (Lumley 2004).

The Taylor series method

Most packages (e.g. SAS, Stata, and R’s survey package) default to using the Taylor series method for variance estimation of weighted statistics. This is because weighted statistics, like the average above, are not linear with respect to the weights, making analytical calculation of the variance difficult or undefined (Lohr 2021). The Taylor series method entails calculating the first-order linear approximation of the statistic (e.g. the linear approximation of the weighted mean formula above), then finding the variance of that approximation. The details of those calculations are beyond the scope of this post, but (Lohr 2021, chap. 9.1) covers the process in detail, and Wikipedia has a nice walkthrough as well. We will just work with the formula for linearized variance arrived at in the Wikipedia article:

\[ \hat{\text{Var}}(\bar{X}_w) = \frac{\sum w_i^2 (x_i - \bar{X}_w)^2}{\left(\sum w_i\right)^2} \] The standard error is just the square root of that variance.
\[ \hat{\text{SE}}(\bar{X}_w) = \sqrt{\hat{\text{Var}}(\bar{X}_w)} \] We’ll code that up in R:

numerator <- sum(households_w_income$HHWT^2 * (households_w_income$HHINCOME - weighted_mean)^2)
denominator <- sum(households_w_income$HHWT)^2
variance <- numerator / denominator
se <- sqrt(variance)
se
[1] 4170.701

That result is a usable estimate for the standard error of the average household income in Oakland. Of course in practice we would always use a package, in this case survey, for calculating that. Here I’ll calculate the average household income in Oakland using survey, and it will default to returning a standard error via the Taylor series method.

households_w_income %>%
  as_survey(weights=HHWT) %>%
  summarise(weighted_mean = survey_mean(HHINCOME))
Table 12: Taylor Series Method Standard Error
weighted_mean weighted_mean_se
142876.5 4171.909

This standard error is off from ours by about $1, which could be due to other corrections applied in the package’s variance calculation. We needn’t be too worried about this discrepancy, though. While the Taylor series method is popular in survey analysis, IPUMS recommends that we use the replication weights method when working with ACS microdata.

Replication weights

In theory, the standard error of an estimate measures the variation of a statistic across multiple samples from a given population. Our sample standard error above, calculated using just the one sample, is an estimate of that theoretical standard error.

Replicate methods operate under the assumption that one sample can be conceived of as a miniature version of the population. Instead of taking multiple samples from the population to construct a variance estimator, one may simply resample from the full, original sample to mimic the theoretical basis of standard errors (Lohr 2021, chap. 9.1).

In practice, we don’t explicitly do any resampling, instead relying on “replicate weights” that the Census Bureau pre-computes. Here we can see what replicate weights (in this case, household replicate weights) look like in our data. Each of REPWT1 through REPWT80 is a set of alternative household weights, slightly different from the “production weight,” HHWT.

households_w_income %>%
  mutate(` ` = "...") %>% 
  select(c(HHINCOME, HHWT, REPWT1, REPWT2, ` `, REPWT79, REPWT80)) %>% 
  head()
Table 13: ACS Replicate Household Weights
HHINCOME HHWT REPWT1 REPWT2 … REPWT79 REPWT80
129200 50 80 47 … 49 50
107000 73 20 25 … 72 74
138000 74 86 85 … 21 73
44800 41 12 43 … 39 44
380000 311 99 86 … 323 105
189150 65 22 20 … 108 122

We can use replicate weights in a variety of alternative variance estimators. In the IPUMS documentation on ACS replicate weights, they outline a variance estimation procedure that matches the “successive difference replication (SDR) variance” method. We obtain SDR variance by calculating our statistic of interest with the production weights (e.g. HHWT) as follows:
\[ \bar{X}_w = \frac{\sum_{i=1}^{n}\text{HHWT}_ix_i}{\sum_{i=1}^{n}\text{HHWT}_i} \]

Then, for each of the 80 household replicate weights (e.g. REPWT1), we calculate the weighted average income using the replicate weight
\[ \bar{X}_r = \frac{\sum_{i=1}^{n}\text{REPWT}_{ri}x_i}{\sum_{i=1}^{n}\text{REPWT}_{ri}} \] and sum the squared deviations between that “replicate-weighted” estimate, \(\bar{X}_r\), and the production weighted estimate. Specifically: \[ \begin{align*} \hat{\text{Var}}(\bar{X}_w) &= \frac{4}{80} \sum_{r=1}^{80} (\bar{X}_r - \bar{X}_w)^2 \end{align*} \]

Fay and Train (1995) initially proposed this variance estimator, and their paper gives the full derivation. The standard error is once again just the square root of that variance.

\[ \hat{\text{SE}}(\bar{X}_w) = \sqrt{\hat{\text{Var}}(\bar{X}_w)} \] In code, calculated via a for-loop, we’d get the following:

# Calculate X_r
X_r <- vector()
for (r in 1:80){
  X_r[r] <- households_w_income %>%
    as_survey(weights=glue("REPWT", r)) %>%
    summarise(weighted_mean = survey_mean(HHINCOME)) %>% 
    .$weighted_mean
}
# Calculate X
X <- households_w_income %>%
    as_survey(weights=HHWT) %>%
    summarise(weighted_mean = survey_mean(HHINCOME)) %>% 
    .$weighted_mean

# Sum over r
sqrt( (4/80) * sum( (X_r - X)^2 ) )
[1] 3479.567

To be clear, we don’t do that manually in practice – survey supports specifying survey designs with replicate weights. The following is an implementation of the same standard error calculation procedure. Note the identical output.

households_w_income %>% 
  as_survey_rep(
  weight = HHWT ,
  repweights = matches("REPWT[0-9]+"),
  type = "successive-difference",
  mse = TRUE) %>%
  summarise(weighted_mean = survey_mean(HHINCOME))
weighted_mean weighted_mean_se
142876.5 3479.567

There are a few different replicate methods for variance estimation, several of which are equivalent. For example, the successive-difference standard error above is equivalent to using Fay’s Balanced Repeated Replication method with the Fay coefficient (the rho argument below) set to \(\rho=.5\).

households_w_income %>% 
  as_survey_rep(
  weight = HHWT ,
  repweights = matches("REPWT[0-9]+"),
  type = "Fay",
  rho=.5,
  mse = TRUE) %>%
  summarise(weighted_mean = survey_mean(HHINCOME))
weighted_mean weighted_mean_se
142876.5 3479.567

IPUMS’ documentation recommends the following specification, where the replicate method is set to “ACS.” This has the same output as the previous two specifications, and is presumably implementing SDR variance.

households_w_income %>% 
as_survey_rep(
  weight = HHWT,
  repweights = matches("REPWT[0-9]+"),
  type = "ACS",
  mse = TRUE) %>%
  summarise(weighted_mean = survey_mean(HHINCOME))
weighted_mean weighted_mean_se
142876.5 3479.567

Now that we have the methods established, we’ll return to the median, rather than the mean, household income. Some of the above formulas would be different for the median, but the code and package implementations are the same. Here we will calculate the median household income in Oakland, and the standard error for that statistic using three methods:

  1. ACS aggregate data retrieval
  2. Estimation via IPUMS microdata with Taylor series variance/standard error
  3. Estimation via IPUMS microdata with replicate weight variance/standard error
Code
df1 <- oakland_stats %>% 
  mutate(`Standard Error` = moe/1.645, Method = 'Aggregate Data') %>%
  filter(variable == 'median_hh_income') %>% 
  select(c(Method, estimate, `Standard Error`)) %>% 
  rename(`Weighted Median HH Income` = estimate)
df2 <- households_w_income %>% 
as_survey(
  weights = HHWT) %>%
  summarise(`Weighted Median HH Income` = survey_median(HHINCOME)) %>% 
  rename(`Standard Error` = `Weighted Median HH Income_se`) %>% 
  mutate(Method = 'Microdata w/ Taylor Series Method') %>% 
  select(c(Method, `Weighted Median HH Income`, `Standard Error`))
df3 <- households_w_income %>% 
as_survey_rep(
  weight = HHWT,
  repweights = matches("REPWT[0-9]+"),
  type = "ACS",
  mse = TRUE) %>%
  summarise(`Weighted Median HH Income` = survey_median(HHINCOME)) %>% 
  rename(`Standard Error` = `Weighted Median HH Income_se`) %>% 
  mutate(Method = 'Microdata w/ Replicate Weights') %>% 
  select(c(Method, `Weighted Median HH Income`, `Standard Error`))

combined_df <- bind_rows(df1, df2, df3)
combined_df
Method Weighted Median HH Income Standard Error
Aggregate Data 93146 3788.450
Microdata w/ Taylor Series Method 90000 4410.252
Microdata w/ Replicate Weights 90000 4270.391

In IPUMS testing of ACS/PRCS data, replicate weights usually increase standard errors, though in our case the replicate weights produced a slightly smaller standard error than the Taylor series method.

A Production Workflow

Putting all of this together, here is production code that I would use to build a dataset for analyzing household-level variables in Oakland, leaning on the IPUMS and Geocorr helper functions I defined above, and specifying replicate weights and “ACS” variance.

# Get the IPUMS data for California
data <- ipums_data("us2022a", variables, "Incomes by PUMA")
# Find the geographic correspondence for places -> pumas
csv_data <- geocorr_2022("Ca06", "puma22", "place", "pop20")
# Isolate PUMAs that correspond to Oakland and have Oakland population
oak_pumas <- csv_data %>%
  select(-c(state, stab, place, PUMA22name)) %>%
  filter(PlaceName == 'Oakland city, CA', afact > 0) %>% 
  pull(puma22) %>% 
  as.integer()
# Filter the IPUMS data to those pumas
oak_hh_w_inc <- data %>%
  filter(PUMA %in% oak_pumas) %>% 
  distinct(SAMPLE, SERIAL, .keep_all = TRUE) %>% 
  filter(HHINCOME != 9999999, HHINCOME > 0) %>% 
  haven::zap_labels()
# Convert the IPUMS data to a survey dataset
oak_hh_inc_svy <- oak_hh_w_inc %>% 
  as_survey_rep(
    weight = HHWT,
    repweights = matches("REPWT[0-9]+"),
    type = "ACS",
    mse = TRUE)

And here is the median household income.

oak_hh_inc_svy %>%
  summarise(median_hh_inc = survey_median(HHINCOME))
median_hh_inc median_hh_inc_se
90000 4270.391

References

Fay, Robert E, and George F Train. 1995. “Aspects of Survey and Model-Based Postcensal Estimation of Income and Poverty Characteristics for States and Counties.” In, 154–59. Alexandria, VA.
Greg Freedman Ellis, Derek Burk, and Finn Roberts. 2024. “Ipumsr: An R Interface for Downloading, Reading, and Handling IPUMS Data.” https://tech.popdata.org/ipumsr/.
Lohr, Sharon L. 2021. Sampling: Design and Analysis. 3rd edition. Boca Raton: Chapman; Hall/CRC.
Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9 (April): 1–19. https://doi.org/10.18637/jss.v009.i08.
Mihalik, Cory, Glenn Rice, and Matt Hesser. 2022. “The Missouri Census Data Center (MCDC) 2022 Geographic Correspondence Engine (GEOCORR).” https://mcdc.missouri.edu/applications/geocorr.html.
Pastoor, Isabel. 2024. “How Can I Pull Data at the Zip Code or City Level?” IPUMS Forum. https://forum.ipums.org/t/how-can-i-pull-data-at-the-zip-code-or-city-level/5650/2.
Ruggles, Steven, Sarah Flood, Matthew Sobek, Daniel Backman, Annie Chen, Grace Cooper, Stephanie Richards, Renae Rogers, and Megan Schouweiler. 2024. “IPUMS USA: Version 15.0.” Minneapolis, MN: IPUMS. https://doi.org/10.18128/D010.V15.0.
Walker, Kyle. 2023. Analyzing US Census Data. 1st edition. Boca Raton: Chapman; Hall/CRC.

Footnotes

  1. I want to note that several of the points I cover here are things I learned from some coworkers – Bert Wilden and Stephanie Peng.↩︎

  2. For an informal analysis, I might just use a web-based tool like Census Reporter to quickly look something up.↩︎

  3. See the full variable list for the 2022 1-year ACS for the available variables, and see the Census Place table for looking up GEOIDs.↩︎

  4. I’ll note that the combination of IPUMS and Geocorr is a fantastic public good, and it’s extremely generous of the public Universities of Minnesota and Missouri to publish these.↩︎

  5. Were the AFACT higher, e.g. 1%, I could randomly sample 1% of the individuals from that PUMA and include them in my Oakland sample.↩︎

Citation

BibTeX citation:
@online{amerkhanian2025,
  author = {Amerkhanian, Peter},
  title = {Working with {ACS} Microdata in {R}},
  date = {2025-02-20},
  url = {https://peter-amerkhanian.com/posts/ipums-wth-r/},
  langid = {en}
}
For attribution, please cite this work as:
Amerkhanian, Peter. 2025. “Working with ACS Microdata in R.” February 20, 2025. https://peter-amerkhanian.com/posts/ipums-wth-r/.