---
title: "Working with ACS microdata in R"
bibliography: "../../blog.bib"
author: "Peter Amerkhanian"
date: "2025-02-20"
description: "Using `ipumsr`, geocorr, and `survey` to analyze a small-geography in IPUMS data."
draft: false
image: thumbnail.png
engine: knitr
execute:
  cache: false
  freeze: auto
categories: ['R', 'Data Management']
format:
  html:
    df-print: kable
    toc: true
    toc-depth: 3
    code-fold: false
    code-tools: true
editor:
  markdown:
    wrap: 72
---
```{r}
#| code-fold: true
#| output: false
pacman::p_load(dplyr,
ggplot2,
# Statistics
modelsummary,
srvyr,
survey,
# Webscraping
httr,
rvest,
readr,
glue,
# Census
tidycensus,
ipumsr,
# XML Parsing
xml2,
purrr,
stringr)
```
The following are my notes on how to use American Community Survey (ACS)
microdata, leveraging the University of Minnesota's Integrated Public
Use Microdata Series (IPUMS).[^1] I cover:
[^1]: I want to note that several of the points I cover here are things
    I learned from coworkers [Bert Wilden](https://www.bwilden.com/) and
    Stephanie Peng.
1. **Retrieval**: how to choose which ACS product is relevant, how to
submit a request to IPUMS, and how to filter down to relevant levels
of geography/granularity.
2. **Analysis**: how to properly weight ACS data using sample and
replication weights for accurate analysis and uncertainty
estimation.
I'll start with a motivating question: **What was the median household
income in Oakland, California in 2022?**
## Aggregate data with `tidycensus`
Answering that question is straightforward using the Census' *aggregate*
data -- pre-calculated descriptive statistics for aggregate geographies.
I typically retrieve aggregate data via the [Census
API](https://www.census.gov/programs-surveys/acs/data/data-via-api.html).[^2]
In R, the `tidycensus` package provides an easy-to-use wrapper for that
API. Note that I set up an API key for the U.S. Census and I'm storing
it in my `.Renviron` file as `census_api_key="yourkeyhere"`.
[^2]: For an informal analysis, I might just use a web-based tool like
[Census Reporter](https://censusreporter.org/) to quickly look
something up.
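If you haven't stored a Census API key before, one convenient way to do so (a minimal sketch, assuming you have the `usethis` package installed) is to open `.Renviron` from R and add the line there:

```{r}
#| eval: false
# Open ~/.Renviron for editing (assumes the usethis package is installed),
# then add a line like census_api_key="yourkeyhere" and restart R
usethis::edit_r_environ()
```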
```{r}
#| output: false
#| warning: false
census_api_key(Sys.getenv("census_api_key"))
```
I query `B19013_001`, the median household income variable, using the
2022 1-year American Community Survey sample and I filter down to
Oakland's GEOID, `0653000`, which is a combination of the state code for
California, `06`, and the place code for Oakland, `53000`.[^3] I'll
throw in the total population variable for good measure:
[^3]: See the [full variable list for the 2022 1-year
ACS](https://api.census.gov/data/2022/acs/acs1/variables.html) for
the available variables, and see the [Census Place
table](https://www.census.gov/library/reference/code-lists/ansi.html#place)
for looking up GEOIDs.
```{r}
#| warning: false
#| label: tbl-agg-data
#| tbl-cap: "Total Population and Median Household Income in Oakand, CA via Aggregate Data"
oakland_stats <- get_acs(
geography = "place",
variables = c(
median_hh_income = "B19013_001",
total_pop = "B17001_001"
),
state = "CA",
year = 2022,
survey = "acs1"
)
oakland_stats <- oakland_stats %>% filter(GEOID == '0653000')
oakland_stats %>% select(c(variable, estimate, moe))
```
Done! We now know the population and median household income for Oakland
in 2022, along with a margin of error. See [@walker_analyzing_2023] for
a comprehensive treatment of working with aggregate census data.
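One small note on the `moe` column: ACS margins of error are published at the 90% confidence level, so if you need a standard error for an aggregate estimate, you can divide the margin of error by 1.645 (we'll use this conversion again at the end of the post):

```{r}
# Convert the 90% margin of error to a standard error
oakland_stats %>%
  mutate(se = moe / 1.645) %>%
  select(variable, estimate, moe, se)
```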
## Microdata with IPUMS/`ipumsr`
What if I wanted to have access to the underlying data used to calculate
the median? Maybe I want to try a different standard error specification
for that statistic, or calculate other statistics, like the average
household income. These tasks would all entail accessing census
*microdata* -- household and/or individual level census data.
One of the most popular sources for downloading census microdata is the
University of Minnesota's Integrated Public Use Microdata Series
(IPUMS). The IPUMS team provides a centralized API for downloading
census microdata, comprehensive documentation for working with the data,
and harmonized variables across time [@walker_analyzing_2023, chapter
9].
The easiest way to access IPUMS data in R is with the
[`ipumsr`](https://tech.popdata.org/ipumsr/) package, which the IPUMS
team maintains and which allows users to submit API requests to IPUMS
directly from R [@greg_freedman_ellis_ipumsr_2024]. To get set up, I
[registered for an IPUMS API
key](https://developer.ipums.org/docs/v2/get-started/), stored the key
in my `.Renviron` file, and will configure the key in `ipumsr` as
follows:
```{r}
#| output: false
set_ipums_api_key(Sys.getenv("ipums_api_key"))
```
The `ipumsr` website [provides
details](https://tech.popdata.org/ipumsr/articles/ipums.html#obtaining-data-via-the-ipums-api)
on what survey products the project currently supports, as does the
`ipums_data_collections()` function:
```{r}
#| label: tbl-ipums-available
#| tbl-cap: "IPUMS API Collections"
ipums_data_collections() %>%
filter(api_support == TRUE) %>%
arrange(desc(collection_type))
```
I'll look at IPUMS USA since my motivating question involves median
household income for a single year (2022). IPUMS USA offers microdata
from the decennial censuses (1790-2010) and the annual American
Community Survey (2000-present) [@ruggles_ipums_2024]. We can check out
the newest products they have in the USA collection as follows:
```{r}
#| label: tbl-acs-datasets
#| tbl-cap: "IPUMS USA Products"
get_sample_info(collection="usa") %>%
arrange(desc(name)) %>%
head(5)
```
Note that "PRCS" refers to the Puerto Rico Community Survey (an ACS
equivalent specifically tailored to Puerto Rico). We are principally
interested in the ACS, which comes in either one-year (e.g. 2023 ACS) or
five-year (e.g. 2018-2022, ACS 5-year) estimates. The differences
between these two estimates are described in detail in [Census Data: An
Overview](https://walker-data.com/census-r/the-united-states-census-and-the-r-programming-language.html#census-data-an-overview)
in [@walker_analyzing_2023]. One differentiating point is that one-year
estimates come from a smaller, but more contemporary sample. In our case
we'll use the one-year to get the best sense of the 2022 income
dynamics.
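As a quick check on what's available (assuming the sample descriptions include the survey years), we can filter the sample list down to entries that mention 2022; the sample code for the 2022 1-year ACS, `us2022a`, is what we'll use when requesting an extract below:

```{r}
# Samples whose descriptions mention 2022
get_sample_info(collection = "usa") %>%
  filter(str_detect(description, "2022"))
```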
Let's return to the motivating question for this post: **What was the
median household income in Oakland, California in 2022?**
To answer that we will:
1. Get income data from the 2022 1-year ACS (@sec-step-1).
2. Filter our data down to households in the city of Oakland (@sec-step-2
and @sec-step-3).
3. Calculate the median (@sec-step-4 and @sec-step-5).
## Step 1: Retrieving data {#sec-step-1}
For the first task, I'll define several helper functions:
- `retrieve_sample()` retrieves a list of variables from a given ACS
sample from the IPUMS API. This is the only truly necessary code.
- `check_description()` and `ipums_data()` are functions for checking
if the target extract already exists locally. Retrieving an extract
can take some time, and I'd like to avoid doing it repeatedly when
I've already downloaded the extract in the past.
Each of these are defined in the following folded code block.
```{r}
#| code-fold: true
retrieve_sample <- function(sample, variables, description) {
extract <- define_extract_micro(
description = description,
collection = "usa",
samples = c(sample),
variables = variables
)
data_path <- extract %>%
submit_extract() %>%
wait_for_extract() %>%
download_extract(download_dir = here::here("data"),
overwrite = TRUE)
data <- read_ipums_micro(data_path)
return(data)
}
check_description <- function(file, description) {
xml_text <- readLines(file, warn = FALSE) %>% paste(collapse = "\n")
user_description <- str_match(xml_text, "User-provided description:\\s*(.*)]]")[, 2]
# Check if it matches the target
if (str_detect(user_description, fixed(description, ignore_case = TRUE))) {
return(file)
} else {
return(NULL)
}
}
ipums_data <- function(sample, variables, description) {
local_ipums_extracts <- list.files(
path = here::here('data'),
pattern = "\\.xml$",
full.names = TRUE
)
matching_files <- compact(map(local_ipums_extracts, \(x) check_description(x, description)))
# If there is a match, open the first matching file
if (length(matching_files) > 0) {
matched_file <- matching_files[[1]]
message("Opening: ", matched_file)
data <- read_ipums_micro(matched_file)
} else {
message("No matching dataset found.")
data <- retrieve_sample(sample, variables, description)
  }
  return(data)
}
```
I'll proceed to define a list of variables that I want, including
`HHINCOME` (household income) and `INCTOT` (individual income). Some of
these variables use census-specific terminology -- namely `PUMA`,
`REPWT`, and `REPWTP`. I'll cover exactly what each of these represents
later in the post, but in the meantime I'll also note that the IPUMS
website has [full reference
materials](https://usa.ipums.org/usa-action/variables/group) for all
available variables.
```{r}
variables <- list(
"PUMA",
"AGE",
"SEX",
"EDUC",
"HHINCOME",
"INCTOT",
"REPWT",
"REPWTP",
var_spec("STATEFIP",
case_selections = "06")
)
```
Note the line, `var_spec("STATEFIP", case_selections = "06")`. This
selects the variable `STATEFIP`, while also specifying that we want to
restrict our request to data where `STATEFIP=='06'` (California). Using
[`var_spec()`](https://tech.popdata.org/ipumsr/reference/var_spec.html)
is important, as accidentally/unnecessarily downloading an unfiltered,
full sample of the entire U.S. is time-consuming.
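As an aside, `case_selections` accepts a vector of codes, so a hypothetical request restricted to more than one state could look like the following (illustrative only; we only need California here):

```{r}
#| eval: false
# Hypothetical: restrict an extract to California (06) and Oregon (41)
var_spec("STATEFIP", case_selections = c("06", "41"))
```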
Anyways, I can request these variables from the 2022 1-year ACS via my
`ipums_data()` function.
```{r}
#| output: false
data <- ipums_data("us2022a", variables, "Incomes by PUMA")
```
Here are the resulting data, the 2022 1-year ACS for California.
```{r}
#| label: tbl-raw
#| tbl-cap: "Unfiltered 2022 1-year ACS data for California"
data %>% head()
```
## Step 2: Using Geocorr to identify small geographies {#sec-step-2}
We now have microdata for all of California, but we need to filter down
to just Oakland. Unfortunately, this isn't as simple as just running
`filter(CITY == 'Oakland')` -- ACS microdata does not include a field
for explicitly identifying cities (note that a city is typically
referred to as a "place" in census data).
The smallest geographic area explicitly identified in the microdata is
something called a public use microdata area (PUMA) [@pastoor_how_2024].
PUMAs are unique geographies that always aggregate to the state level
(e.g. California can be constructed from a collection of PUMAs), but
only sometimes aggregate to other small geographic areas, such as city,
metro area, and county [See [Census
Hierarchies](https://walker-data.com/census-r/the-united-states-census-and-the-r-programming-language.html#census-hierarchies)
in @walker_analyzing_2023, chapter 1].
```{r fig.asp = .6}
#| code-fold: true
#| fig-cap: "Places and PUMAS in the Bay Area (region border in red)"
#| warning: false
library(tigris)
library(sf)
library(cowplot)
options(tigris_use_cache = TRUE)
bay_area_counties <- c(
"Alameda", "Contra Costa", "Marin", "Napa",
"San Francisco", "San Mateo", "Santa Clara", "Solano", "Sonoma"
)
bay_area <- counties(state = "CA", class = "sf", year=2022) %>%
filter(NAME %in% bay_area_counties) %>%
st_union() # Merge counties into a single boundary
# Get all California Places & PUMAs
places <- places(state = "CA", class = "sf", year=2022)
pumas <- pumas(state = "CA", class = "sf", year=2022)
# Filter only those in the Bay Area
places_bay <- places[st_intersects(places, bay_area, sparse = FALSE), ]
pumas_bay <- pumas[st_intersects(pumas, bay_area, sparse = FALSE), ]
pumas_bbox <- st_bbox(bay_area)
crop_theme <- theme(
panel.spacing = unit(0, "lines"),
plot.margin = margin(0, 0, 0, 0, "cm")
)
map_places <- ggplot() +
geom_sf(data = places_bay, fill = "lightblue", color = "darkgrey", size = 0.1) +
geom_sf(data = bay_area, fill = NA, color = "red", size = 4) +
coord_sf(ylim = c(pumas_bbox["ymin"], pumas_bbox["ymax"]), xlim = c(pumas_bbox["xmin"], pumas_bbox["xmax"]), expand=FALSE) + # Apply PUMAs y-limits
crop_theme +
theme_minimal() +
ggtitle("Places")
map_pumas <- ggplot() +
geom_sf(data = pumas_bay, fill = "lightblue", color = "darkgrey", size = 0.1) +
geom_sf(data = bay_area, fill = NA, color = "red", size = 4) +
coord_sf(ylim = c(pumas_bbox["ymin"], pumas_bbox["ymax"]), xlim = c(pumas_bbox["xmin"], pumas_bbox["xmax"]), expand=FALSE) + # Ensure y-axis is the same
crop_theme +
theme_minimal() +
ggtitle("PUMAs")
plot_grid(map_places, map_pumas, nrow = 1, align = "h", axis="t", vjust = 0)
```
To find out if a city corresponds to a collection of PUMAs and which
PUMAs those are, we'll use
[Geocorr](https://mcdc.missouri.edu/applications/geocorr2022.html)
(geographic correspondence engine), an application that generates
correlation lists showing relationships between two or more geographic
coverages in the United States [@mihalik_missouri_2022]. Geocorr is a
sponsored program of the Missouri State Library and is published by the
University of Missouri Center for Health Policy.[^4]
[^4]: I'll note that the combination of IPUMS and Geocorr is a
    fantastic public good, and it's extremely generous of the public
    universities of Minnesota and Missouri to publish these tools.
{width="80%"}
To use Geocorr, I'll define a function, `geocorr_2022()`, that queries
Geocorr 2022 and retrieves a .csv file establishing the relationships
between two sets of geographies within a given state. See the following
folded code chunk for the specifics:
```{r}
#| code-fold: true
geocorr_2022 <- function(state, geo_1, geo_2, weight_var) {
base_url <- "https://mcdc.missouri.edu"
params <- glue(
"cgi-bin/broker?_PROGRAM=apps.geocorr2022.sas&",
"_SERVICE=MCDC_long&_debug=0&",
"state={state}&g1_={geo_1}&g2_={geo_2}&wtvar={weight_var}&",
"nozerob=1&fileout=1&filefmt=csv&lstfmt=txt&title=&",
"counties=&metros=&places=&oropt=&latitude=&longitude=&",
"distance=&kiloms=0&locname=")
initial_url <- params %>% url_absolute(base = base_url)
initial_response <- GET(initial_url)
html_content <- content(initial_response, as = "text")
parsed_html <- read_html(html_content)
# Extract the one link
csv_url <- parsed_html %>%
html_node("a") %>%
html_attr("href") %>%
stringr::str_trim() %>%
url_absolute(base = base_url)
csv_data <- read_csv(csv_url)
return(csv_data)
}
```
We'll use that function to establish the relationships between
California's 2022 PUMAs and its "places," using individual population as
measured in the 2020 Decennial Census to weight the relationships.
```{r}
#| output: false
csv_data <- geocorr_2022("Ca06", "puma22", "place", "pop20")
```
With that, we can see which PUMAs correspond to the City of Oakland.
```{r}
#| label: tbl-correspondence
#| tbl-cap: "PUMA to Place correspondence for Oakand, CA"
csv_data %>%
select(-c(state, stab, place, PUMA22name)) %>%
filter(PlaceName == 'Oakland city, CA')
```
The `AFACT` (allocation factor) column shows the proportion of the
source area contained in the target area -- here, the proportion of each
PUMA's population that belongs to Oakland. 100% of the populations in
PUMAs 111, 112, 113, and 123 belong to Oakland, and effectively 0% of
PUMA 114's. Geocorr does report that 9 individuals from PUMA 114 live in
Oakland, but given the AFACT of 0 and the fact that I know the PUMA
overlays the neighboring city of Piedmont, I feel comfortable dropping
that PUMA.[^5]
Note that when you plot place-to-PUMA correspondence, you
may find slight differences due to the fact that places can include
unpopulated areas, such as bodies of water [in the case of
Oakland](https://www.google.com/maps/place/Oakland,+CA/@37.7831602,-122.3045548,11z/data=!4m6!3m5!1s0x80857d8b28aaed03:0x71b415d535759367!8m2!3d37.8043514!4d-122.2711639!16zL20vMGRjOTU?entry=ttu&g_ep=EgoyMDI1MDIxMi4wIKXMDSoJLDEwMjExNDU1SAFQAw%3D%3D).
[^5]: Were the AFACT higher, e.g. 1%, I could randomly sample 1% of the
individuals from that PUMA and include them in my Oakland sample.
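The footnote above describes sampling from a partially overlapping PUMA when its allocation factor is non-trivial. A rough, hypothetical sketch of that approach might look like the following (we skip it here, since PUMA 114's `AFACT` is effectively zero):

```{r}
#| eval: false
# Hypothetical: if PUMA 114 had, say, an AFACT of 0.01, sample 1% of its
# person records to append to the Oakland subset
set.seed(1)
partial_puma_sample <- data %>%
  filter(PUMA == 114) %>%
  slice_sample(prop = 0.01)
```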
```{r fig.asp = .58}
#| code-fold: true
#| warning: false
#| fig-cap: "Place to PUMAs Correspondence in Oakland (place border in red)"
# Filter PUMAs for Oakland
oakland_pumas <- pumas_bay %>% filter(PUMACE20 %in% c('00111', '00112', '00113', '00123'))
oakland_place <- places_bay %>% filter(NAME == 'Oakland')
# Get bounding box for Oakland PUMAs
oakland_bbox <- st_bbox(oakland_place)
crop_theme <- theme(
panel.spacing = unit(0, "lines"),
plot.margin = margin(0, 0, 0, 0, "cm"),
axis.title.y = element_blank(),
axis.title.x = element_blank()
)
# Oakland Places Map
map_places <- ggplot() +
geom_sf(data = oakland_place, fill = "lightblue", color = "darkgrey", size = 0.1) +
geom_sf(data = oakland_place, fill = NA, color = "red", size = 4) +
coord_sf(ylim = c(oakland_bbox["ymin"], oakland_bbox["ymax"]),
xlim = c(oakland_bbox["xmin"], oakland_bbox["xmax"]),
expand = FALSE) +
theme_minimal() +
crop_theme +
ggtitle("Oakland (Place)")
# Oakland PUMAs Map
map_pumas <- ggplot() +
geom_sf(data = oakland_pumas, fill = "lightblue", color = "darkgrey", size = 0.1) +
geom_sf_text(data = oakland_pumas, aes(label = PUMACE20)) +
geom_sf(data = oakland_place, fill = NA, color = "red", size = 4) +
#geom_sf(data = oakland_pumas, fill = NA, color = "red", size = 4) +
coord_sf(ylim = c(oakland_bbox["ymin"], oakland_bbox["ymax"]),
xlim = c(oakland_bbox["xmin"], oakland_bbox["xmax"]),
expand = FALSE) +
theme_minimal() +
crop_theme +
ggtitle("Oakland (PUMAs)")
plot_grid(map_places, map_pumas, nrow = 1, align = "h", axis = "t", vjust = 0)
```
Filtering to those PUMAs gets us the 2022 1-year ACS microdata for the
City of Oakland (I use `haven::zap_labels()` to strip the value labels
that IPUMS attaches to the data, which we don't need here).
```{r}
#| label: tbl-oakland-raw
#| tbl-cap: "2022 1-year ACS data for Oakland, CA"
oakland_pumas <- c(111, 112, 113, 123)
oak <- data %>%
filter(PUMA %in% oakland_pumas) %>%
haven::zap_labels()
oak %>% head()
```
## Step 3: Units of Analysis {#sec-step-3}
Each row in the ACS microdata is an individual, identified by a unique
combination of `SAMPLE`, which defines the year when the individual was
surveyed, `SERIAL`, a unique identifier for that individual's household,
and `PERNUM`, a unique identifier for the individual within their
household [@ruggles_ipums_2024]. Thus, we can identify units as follows:
- **Households**: The combination of `SAMPLE` and `SERIAL` provides a
    unique identifier for every household in the IPUMS data
- **Individuals**: The combination of `SAMPLE`, `SERIAL`, and `PERNUM`
    provides a unique identifier for every person in the IPUMS data
We can group by these variable combinations and see how many individuals
and households were surveyed across PUMAs in Oakland for the 2022 1-year
ACS. We can see that each row in the data represents an individual
(`n_rows` equals `n_individuals`) and, as we would expect, the number of
households is much lower than the number of individuals.
```{r}
#| label: tbl-granularity
#| tbl-cap: "Oakland Dataset Granularity by PUMA"
oak %>% group_by(PUMA) %>% summarise(
n_rows = n(),
n_individuals = n_distinct(SAMPLE, SERIAL, PERNUM),
n_households = n_distinct(SAMPLE, SERIAL)
)
```
I'll also randomly select a household in the data to see what such a
unit looks like in practice.
```{r}
#| label: tbl-example-hh
#| tbl-cap: "An example household with income data"
#| code-fold: true
household_serials <- oak %>%
group_by(SERIAL) %>%
count() %>%
filter(n > 1) %>%
pull(SERIAL)
set.seed(2)
sample_household <- sample(household_serials, 1)
n <- oak %>% filter(SERIAL == sample_household) %>% dim() %>% .[1]
oak %>% filter(SERIAL == sample_household) %>%
select(c(SERIAL, PERNUM, AGE, SEX, HHINCOME, INCTOT))
```
Here we can see that this household, with `SERIAL`
`{r} format(sample_household, scientific=FALSE)`, has `{r} n` members,
each with a unique `PERNUM`.
Let's return to the motivating question – **what was the median
household income in Oakland, California in 2022?** Note the income
variables we observe for this family:
1. `INCTOT` reports each respondent's total pre-tax personal income or
    losses from all sources for the previous year. `9999999` is a code
    denoting that the value is missing, which makes sense given that the
    missing values above correspond to children in the household [See
    [INCTOT](https://usa.ipums.org/usa-action/variables/INCTOT) in
    @ruggles_ipums_2024].
2. `HHINCOME` reports the total money income of all household members
    age 15+ during the previous year. The amount should equal the sum of
    all household members' individual incomes, as recorded in the
    person-record variable `INCTOT` [See
    [HHINCOME](https://usa.ipums.org/usa-action/variables/HHINCOME) in
    @ruggles_ipums_2024]. A quick check of that relationship follows
    this list.
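As a quick illustrative check on that second point, we can compare `HHINCOME` for our sample household against the sum of its members' non-missing `INCTOT` values:

```{r}
# HHINCOME should match the sum of household members' INCTOT values,
# once the 9999999 missing-value code is excluded
oak %>%
  filter(SERIAL == sample_household) %>%
  summarise(
    hh_income  = first(HHINCOME),
    sum_inctot = sum(INCTOT[INCTOT != 9999999])
  )
```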
Given what we know about the unique identifier for households, and the
`HHINCOME` variable, we can construct the appropriate dataset for
answering our motivating question – every household in Oakland that had
household income.
```{r}
households_w_income <- oak %>%
distinct(SAMPLE, SERIAL, .keep_all = TRUE) %>%
filter(HHINCOME != 9999999, HHINCOME > 0)
```
Can we now proceed to simply calculate the median of the `HHINCOME`
column? Not so fast... data in the ACS microdata are not what they seem.
Before we do any analysis, we have to account for **sample weights**.
## Step 4: Applying sample weights {#sec-step-4}
Let's return to our sample family from above, but also examine the
variables `PERWT` and `HHWT`.
```{r}
#| label: tbl-example-hh-weights
#| tbl-cap: "An example household with weight data"
oak %>% filter(SERIAL == sample_household) %>%
select(c(AGE, SEX, HHINCOME, INCTOT, PERWT, HHWT))
```
These are the two primary sample weights in ACS microdata, and they can
be interpreted fairly directly.
- `PERWT` gives the population represented by each individual in the sample; thus, in the first row of the sample household, the 31-year-old woman with an individual income of \$9,300 represents 62 individuals in the PUMA.
- `HHWT` gives the number of households in the general population represented by each household in the sample; thus, this household represents 62 households in that PUMA.
Any analysis or visualization of ACS microdata should be weighted by
`PERWT` for work at the person-level, and `HHWT` for the household-level [See
[Sample Weights](https://usa.ipums.org/usa/intro.shtml#weights) in
@ruggles_ipums_2024].
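As a gut check on that interpretation, the person weights should sum to (roughly) the city's total population, and the household weights, counted once per household, should sum to (roughly) its number of households. The exact totals won't match published figures, for reasons discussed below, but they should be in the right ballpark:

```{r}
# Person weights sum to an estimate of Oakland's population; household
# weights (counted once per household) sum to an estimate of its households
oak %>%
  summarise(
    est_population = sum(PERWT),
    est_households = sum(HHWT[PERNUM == 1])
  )
```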
For visualizations, we pass the weights to the plotting API. For example, in `ggplot` many chart types support a `weight` argument.
```{r fig.asp = .5}
#| fig-cap: "Household Income Distribution in Oakland (with household weights)"
ggplot(households_w_income, aes(x = HHINCOME, weight = HHWT)) +
geom_histogram(
binwidth = 50000,
fill = "lightblue",
alpha = 1,
color = "grey"
) +
theme_minimal() +
scale_x_continuous(
labels = scales::label_currency(scale_cut = scales::cut_short_scale()),
breaks = seq(0, 1600000, 200000)
) +
scale_y_continuous(labels=scales::label_comma()) +
labs(x = "Household Income", y = "Number of Households")
```
For analysis, we use the [`srvyr`](http://gdfe.co/srvyr/reference/index.html)
package to define the survey weights before using them to
calculate statistics. For example, here we'll finally address the motivating question. **The median
household income in Oakland in 2022** as measured in the IPUMS microdata
was as follows:
```{r}
#| label: tbl-answer
#| tbl-cap: "2022 Median Household Income in Oakland, CA"
households_w_income %>%
as_survey(weights=HHWT) %>%
summarise(weighted_median = survey_median(HHINCOME)) %>%
select(weighted_median)
```
Let's do a quick comparison of our IPUMS results to the aggregate census
data we retrieved in the first section. Here are our full IPUMS results
for both median household income and population:
```{r}
#| code-fold: true
#| label: tbl-ipums-res
#| tbl-cap: "IPUMS versus ACS aggregate results"
median_table <- households_w_income %>%
as_survey(weights=HHWT) %>%
summarise(weighted_median = survey_median(HHINCOME)) %>%
mutate(variable = "median_hh_income",
ipums_estimate = weighted_median,
se = weighted_median_se)
count_table <- oak %>%
as_survey(weights=PERWT) %>%
survey_count() %>%
mutate(variable = "total_pop",
ipums_estimate = n,
se = n_se)
aggregate_data <- oakland_stats %>%
select(c(variable, estimate)) %>%
rename(ACS_aggregate_estimate = estimate)
bind_rows(count_table, median_table) %>%
select(c(variable, ipums_estimate)) %>% inner_join(aggregate_data, by='variable')
```
These are clearly different. What gives? Unfortunately, **summary
statistics calculated using IPUMS data typically cannot match aggregate
ACS figures!**
One major reason for this gap is additional sampling error. Recall that
the American Community Survey is a sample. A given one-year ACS is
typically a 1% sample of the U.S. population, with associated sampling
error. When the census makes microdata available, they create a sample
of that sample -- we do not get the full 1%. This second sampling
process introduces further sampling error in the microdata that is not
reflected in figures sourced from aggregate ACS data, which are
calculated using the full ACS sample [[See
ACS](https://usa.ipums.org/usa/chapter2/chapter2.shtml#ACS) in
@ruggles_ipums_2024].
## Step 5: Calculating standard errors {#sec-step-5}
Now we have estimates derived from the ACS microdata, but how do we
communicate uncertainty around those estimates? For the sake of this example, let's explore the problem of standard
errors via a slightly different motivating question -- **what was the
average household income in Oakland in 2022**?
I'm switching to the
average because we are about to calculate some standard errors from
scratch, and that's much more clear-cut for linear statistics like the
average than for non-linear statistics like the median. We will return
to the median at the end of this section, relying on a package
implementation rather than doing anything from scratch.
Anyways, let's calculate the average household income in Oakland using
the sample household weights. The [formula for the weighted
average](https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Mathematical_definition)
is: $$
\bar{X}_w = \frac{\sum_{i=1}^{n}w_ix_i}{\sum_{i=1}^{n}w_i}
$$ We'll put that into code and get the following estimate of the
average household income in Oakland:
```{r}
weighted_mean <- (
sum(households_w_income$HHINCOME * households_w_income$HHWT)
/ sum(households_w_income$HHWT)
)
weighted_mean
```
Now, let's get a standard error for that estimate. Correctly utilizing
the sample weights in the variance calculation is [a little
tricky](https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Variance)
and there are a few different approaches to the problem. I'll cover two
techniques for variance estimation: the Taylor series method (also
referred to as "linearized variance" or Taylor series linearization) and
replication weights. Each of these are implemented in the `survey`
package [@lumley_analysis_2004].
### The Taylor series method
Most packages (e.g.
[SAS](https://documentation.sas.com/doc/en/statug/latest/statug_surveymeans_details06.htm#statug.surveymeans.variancedetails),
[Stata](https://www.stata.com/manuals/svyvarianceestimation.pdf), and
R's [`survey`
package](https://r-survey.r-forge.r-project.org/survey/#:~:text=Variances%20by%20Taylor%20linearization))
default to using the Taylor series method for variance estimation of
weighted statistics. This is because weighted statistics, like the
average above, are not linear in the weights, which makes an exact
analytical variance calculation difficult or unavailable
[@lohr_sampling_2021]. The
Taylor series method entails calculating the first-order linear
approximation of the statistic (e.g. the linear approximation of the
weighted mean formula above), then finding the variance of that
approximation. The details of those calculations are beyond the scope of
this post, but [@lohr_sampling_2021, chap 9.1] covers the process in
detail, and Wikipedia has a nice
[walkthrough](https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Variance_of_the_weighted_mean_(%CF%80-estimator_for_ratio-mean))
as well. We will just work with the formula for linearized variance that
is arrived at in the Wikipedia post:
$$
\hat{\text{Var}}(\bar{X}_w) = \frac{\sum w_i^2 (x_i - \bar{X}_w)^2}{\left(\sum w_i\right)^2}
$$ The standard error is just the square root of that variance.\
$$
\hat{\text{SE}}(\bar{X}_w) = \sqrt{\hat{\text{Var}}(\bar{X}_w)}
$$ We'll code that up in R:
```{r}
numerator <- sum(households_w_income$HHWT^2 * (households_w_income$HHINCOME - weighted_mean)^2)
denominator <- sum(households_w_income$HHWT)^2
variance <- numerator / denominator
se <- sqrt(variance)
se
```
That result is a usable estimate for the standard error of the average
household income in Oakland. Of course in practice we would always use a
package, in this case `survey`, for calculating that. Here I'll
calculate the average household income in Oakland using `survey`, and it
will default to returning a standard error via the Taylor series method.
```{r}
#| label: tbl-wrong-se
#| tbl-cap: "Taylor Series Method Standard Error"
households_w_income %>%
as_survey(weights=HHWT) %>%
summarise(weighted_mean = survey_mean(HHINCOME))
```
This standard error is off from ours by about \$1, which could be due to
other corrections applied in the package's variance calculation. We
needn't be too worried about this discrepancy, though. While the Taylor
series method is popular in survey analysis, [IPUMS
recommends](https://usa.ipums.org/usa/repwt.shtml) that we use the
replication weights method when working with ACS microdata.
### Replication weights
In theory, the [standard error](https://en.wikipedia.org/wiki/Standard_error) of an estimate measures the variation of a
statistic across multiple samples from a given population. Our sample
standard error above, calculated using just the one sample, is an
estimate of that theoretical standard error.
Replicate methods operate under the assumption that one sample can be
conceived of as a miniature version of the population. Instead of taking
multiple samples from the population to construct a variance estimator,
one may simply resample from the full, original sample to mimic the
theoretical basis of standard errors [@lohr_sampling_2021, chap 9.1].
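To make that resampling intuition concrete, here is a toy bootstrap of the weighted mean (purely illustrative -- this is not the ACS procedure, which instead uses pre-computed replicate weights, described next):

```{r}
# Toy bootstrap (illustrative only): resample households with replacement
# and recompute the weighted mean each time
set.seed(1)
boot_means <- replicate(200, {
  resampled <- households_w_income %>%
    slice_sample(prop = 1, replace = TRUE)
  sum(resampled$HHINCOME * resampled$HHWT) / sum(resampled$HHWT)
})
sd(boot_means)
```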
In practice, we don't explicitly do any resampling, instead relying on
"replicate weights" that the Census Bureau pre-computes. Here we can see
what replicate weights (in this case, household replicate weights) look
like in our data. Each of `REPWT1` through `REPWT80` is a set of
alternative household weights, slightly different from the "production"
weight, `HHWT`.
```{r}
#| label: tbl-rep-weights
#| tbl-cap: "ACS Replicate Household Weights"
households_w_income %>%
mutate(` ` = "...") %>%
select(c(HHINCOME, HHWT, REPWT1, REPWT2, ` `, REPWT79, REPWT80)) %>%
head()
```
Replicate weights can be used with a variety of variance
estimators. In the IPUMS documentation on ACS replicate weights, they
outline a [variance estimation
procedure](https://usa.ipums.org/usa/repwt.shtml#:~:text=can%20change%20substantially.-,HOW%20DO%20I%20OBTAIN%20REPLICATE%20STANDARD%20ERRORS%20FROM%20IPUMS%2DUSA%20DATA%3F,-There%20are%203)
that matches the "successive difference replication (SDR) variance"
method. We obtain SDR variance by calculating our statistic
of interest with the production weights (e.g. `HHWT`) as follows:\
$$
\bar{X}_w = \frac{\sum_{i=1}^{n}\text{HHWT}_ix_i}{\sum_{i=1}^{n}\text{HHWT}_i}
$$
Then, for each of the 80 household replicate weights (e.g. `REPWT1`), we
calculate the weighted average income using the replicate weight\
$$
\bar{X}_r = \frac{\sum_{i=1}^{n}\text{REPWT}_{ri}x_i}{\sum_{i=1}^{n}\text{REPWT}_{ri}}
$$ and sum the squared deviations between that "replicate-weighted"
estimate, $\bar{X}_r$, and the production weighted estimate.
Specifically: $$
\begin{align*}
\hat{\text{Var}}(\bar{X}_w) &= \frac{4}{80} \sum_{r=1}^{80} (\bar{X}_r - \bar{X}_w)^2
\end{align*}
$$
@fay_aspects_1995 initially proposed this variance estimator, and [their
paper](https://www.census.gov/content/dam/Census/library/working-papers/1995/demo/faytrain95.pdf)
gives the full derivation. The standard error is once again just
the square root of that variance.
$$
\hat{\text{SE}}(\bar{X}_w) = \sqrt{\hat{\text{Var}}(\bar{X}_w)}
$$ In code, calculated via a for-loop, we'd get the following:
```{r}
# Calculate X_r
X_r <- vector()
for (r in 1:80){
X_r[r] <- households_w_income %>%
as_survey(weights=glue("REPWT", r)) %>%
summarise(weighted_mean = survey_mean(HHINCOME)) %>%
.$weighted_mean
}
# Calculate X
X <- households_w_income %>%
as_survey(weights=HHWT) %>%
summarise(weighted_mean = survey_mean(HHINCOME)) %>%
.$weighted_mean
# Sum over r
sqrt( (4/80) * sum( (X_r - X)^2 ) )
```
To be clear, we don't do that manually in practice -- `survey` supports
specifying survey designs with replicate weights. The following is an
implementation of the same standard error calculation procedure. Note
the identical output.
```{r}
households_w_income %>%
as_survey_rep(
weight = HHWT ,
repweights = matches("REPWT[0-9]+"),
type = "successive-difference",
mse = TRUE) %>%
summarise(weighted_mean = survey_mean(HHINCOME))
```
There are a few different replicate methods for variance estimation,
several of which are equivalent. For example, the successive-difference
standard error above is equivalent to using [Fay's Balanced Repeated
Replication
method](https://documentation.sas.com/doc/en/statug/15.2/statug_surveyphreg_details29.htm)
with the Fay coefficient set to $\rho = 0.5$.
```{r}
households_w_income %>%
as_survey_rep(
weight = HHWT ,
repweights = matches("REPWT[0-9]+"),
type = "Fay",
rho=.5,
mse = TRUE) %>%
summarise(weighted_mean = survey_mean(HHINCOME))
```
IPUMS' documentation [recommends](https://usa.ipums.org/usa/repwt.shtml)
the following specification, where the replicate method is set to "ACS."
This has the same output as the previous two specifications, and is
presumably implementing SDR variance.
```{r}
households_w_income %>%
as_survey_rep(
weight = HHWT,
repweights = matches("REPWT[0-9]+"),
type = "ACS",
mse = TRUE) %>%
summarise(weighted_mean = survey_mean(HHINCOME))
```
Now that we have the methods established, we'll return to the median,
rather than the mean, household income. Some of the above formulas would
be different for the median, but the code and package implementations
are the same. Here we will calculate the median household income in
Oakland, and the standard error for that statistic using three methods:
1. ACS aggregate data retrieval
2. Estimation via IPUMS microdata with Taylor series variance/standard
error
3. Estimation via IPUMS microdata with replicate weight
variance/standard error
```{r}
#| code-fold: true
df1 <- oakland_stats %>%
mutate(`Standard Error` = moe/1.645, Method = 'Aggregate Data') %>%
filter(variable == 'median_hh_income') %>%
select(c(Method, estimate, `Standard Error`)) %>%
rename(`Weighted Median HH Income` = estimate)
df2 <- households_w_income %>%
as_survey(
weights = HHWT) %>%
summarise(`Weighted Median HH Income` = survey_median(HHINCOME)) %>%
rename(`Standard Error` = `Weighted Median HH Income_se`) %>%
mutate(Method = 'Microdata w/ Taylor Series Method') %>%
select(c(Method, `Weighted Median HH Income`, `Standard Error`))
df3 <- households_w_income %>%
as_survey_rep(
weight = HHWT,
repweights = matches("REPWT[0-9]+"),
type = "ACS",
mse = TRUE) %>%
summarise(`Weighted Median HH Income` = survey_median(HHINCOME)) %>%
rename(`Standard Error` = `Weighted Median HH Income_se`) %>%
mutate(Method = 'Microdata w/ Replicate Weights') %>%
select(c(Method, `Weighted Median HH Income`, `Standard Error`))
combined_df <- bind_rows(df1, df2, df3)
combined_df
```
In IPUMS testing of ACS/PRCS data, replicate weights usually increase
standard errors, though in our case the replicate weights produced a
slightly smaller standard error than the Taylor series method.
## A Production Workflow
Putting all of this post together, here is production code that I would
use to build a dataset for analyzing household-level variables in
Oakland, leaning on the IPUMS and geocorr helper functions I defined
above, and specifying replicate weights and "ACS" variance.
```{r}
#| output: false
# Get the IPUMS data for California
data <- ipums_data("us2022a", variables, "Incomes by PUMA")
# Find the geographic correspondence for places -> pumas
csv_data <- geocorr_2022("Ca06", "puma22", "place", "pop20")
# Isolate PUMAs that correspond to Oakland and have Oakland population
oak_pumas <- csv_data %>%
select(-c(state, stab, place, PUMA22name)) %>%
filter(PlaceName == 'Oakland city, CA', afact > 0) %>%
pull(puma22) %>%
as.integer()
# Filter the IPUMS data to those pumas
oak_hh_w_inc <- data %>%
filter(PUMA %in% oak_pumas) %>%
distinct(SAMPLE, SERIAL, .keep_all = TRUE) %>%
filter(HHINCOME != 9999999, HHINCOME > 0) %>%
haven::zap_labels()
# Convert the IPUMS data to a survey dataset with replicate weights
oak_hh_inc_svy <- oak_hh_w_inc %>%
as_survey_rep(
weight = HHWT,
repweights = matches("REPWT[0-9]+"),
type = "ACS",
mse = TRUE)
```
And here is the median household income.
```{r}
oak_hh_inc_svy %>%
summarise(median_hh_inc = survey_median(HHINCOME))
```
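If you'd rather report a margin of error than a standard error, `survey_median()` can also return a confidence interval directly; here's one reasonable choice, a 90% interval to match the convention used for published ACS margins of error:

```{r}
# Report a 90% confidence interval alongside the median
oak_hh_inc_svy %>%
  summarise(median_hh_inc = survey_median(HHINCOME, vartype = "ci", level = 0.90))
```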
```{r}
#| echo: false
library(reticulate)
use_condaenv('base')
```
```{python}
#| output: false
#| echo: false
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(r['households_w_income'])
fig, ax = plt.subplots(figsize=(3, 3))
sns.kdeplot(data=df, x='HHINCOME', weights="HHWT", ax=ax, fill=True, alpha=.5, log_scale=True, legend=False, cmap="coolwarm")
ax.set_axis_off()
ax.set_xlim(8e2, 10e6)
#ax.set_xlim(-1, 1e6)
fig.tight_layout()
fig.savefig("thumbnail.png", dpi=300)
plt.show();
```