Voter Registration Rates: Identifying Where People Are Not Registered and Why


November 25, 2018


This project was generated using Rmarkdown. Many chunks of code were omitted. Full code:

While much research compares voter turnout to population counts, far less compares voter registration to population counts. Registration rates could be valuable for campaigns deciding where resources would be most effective. In areas with relatively low registration rates, campaigns could focus on turning apathetic citizens into new voters. In areas with higher registration rates, they could focus on energizing their existing voter base. Either way, registration rates are a useful source of information for political campaigns and movements.

Analysis of voter registration rates may also show trends in types of areas that tend to have lower or higher registration rates based on demographics (e.g. gender and race) and party affiliation. Perhaps certain demographics or types of communities tend to feel left out or apathetic. Understanding those patterns can be very useful for politicians seeking to expand their base to untapped pools of potential voters. These trends would also be useful for making generalizations about areas not covered in this project.

Data sources

There are two main types of data used in this project: voter registration data and census population data. Because North Carolina’s voting data is free and easily accessible, this project only looks at North Carolina’s registration rates. It combines North Carolina’s voter registration files with U.S. Census Bureau ACS data for each county. I began by loading data from the Census Bureau’s American Community Survey (ACS), which estimates a wide range of information based on a 5-year period ending in 2016.

library(dplyr)  # provides select(), left_join(), mutate(), and %>%

# Read in data files
pop <- read.csv("acs.csv")  # population counts
dem <- read.csv("counties3.csv")  # demographic info (age, race, gender)
incomes <- read.csv("income.csv")  # incomes (mean, median)
edu <- read.csv("edu.csv")  # education levels
housing <- read.csv("housing.csv")  # owner/renter rates

# Select variables
pop <- select(pop, GEO.display.label, HC01_VC87, HC01_VC108)
dem <- select(dem, GEO.display.label, HC01_VC108, HC01_VC23, HC03_VC49, HC03_VC50, HC03_VC51, HC03_VC56, HC03_VC64, HC03_VC69, HC03_VC70, HC03_VC88, HC03_VC93, HC03_VC109, HC03_VC110)
incomes <- select(incomes, GEO.display.label, HC01_EST_VC13, HC01_EST_VC15)
edu <- select(edu, GEO.display.label, HD01_VD01, HD01_VD03, HD01_VD04, HD01_VD05, HD01_VD06, HD01_VD07)
housing <- select(housing, GEO.display.label, HD01_VD01, HD01_VD02, HD01_VD03)

# Rename variables
names(pop) <- c("county", "pop_total", "vep_est")
names(dem) <- c("county", "pop_18up", "median_age","white", "black", "indian", "asian", "hw_pi", "other", "two", "hispanic", "not_hispanic", "male", "female")
names(incomes) <- c("county", "median_income", "mean_income")
names(edu) <- c("county", "total_edu", "no_hs", "hs", "some_col", "col", "grad")
names(housing) <- c("county", "total_housing", "owners", "renters")

There are now five dataframes, each of which needs to be edited for usability. I created a function to standardize their formats and then combined them.

# Function to clean up dataframes
data_clean <- function(df, delete.row1=TRUE) {
  if(delete.row1) df <- df[-1,]  # delete first row
  row.names(df) <- 1:nrow(df)  # renumber rows
  df$county <- as.character(df$county)  # reclassify county
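  df$county <- rename(df)  # standardize county names (rename() here is presumably a helper from the omitted code, not dplyr::rename)
  for (i in 2:ncol(df)) df[,i] <- as.numeric(as.character(df[,i]))  # reclassify numbers
  df  # return the cleaned dataframe
}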

# Apply function
pop <- data_clean(pop);  dem <- data_clean(dem);  incomes <- data_clean(incomes);  edu <- data_clean(edu);  housing <- data_clean(housing)

# Combine dataframes
census <- pop %>%
  left_join(dem, by="county") %>%
  left_join(incomes, by="county") %>%
  left_join(housing, by="county") %>%
  left_join(edu, by="county")

This census dataset now includes 2016 estimates on population and demographic data for all 100 counties in North Carolina.
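One detail of data_clean is worth a small illustration: the as.numeric(as.character(...)) idiom. In R versions before 4.0, read.csv converted text columns to factors by default, and calling as.numeric directly on a factor returns its internal level codes rather than the numbers it displays. The sketch below uses made-up values, not the ACS data.

```r
# Toy factor column, as read.csv would have produced before R 4.0 (values are made up)
x <- factor(c("1200", "85", "1200"))

codes  <- as.numeric(x)                # internal level codes, not the displayed numbers
values <- as.numeric(as.character(x))  # convert to text first, then to numbers

print(codes)
print(values)
```

Converting through as.character first is what lets the cleaned columns be used in arithmetic later on.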

Calculating voting-eligible populations

When I first compared voting-eligible population counts from the Census Bureau’s ACS data with voter registration counts, many counties had registration rates above 100%. This was likely because the population data I was using was a few years old, while the voter files were current. Examining county populations over time revealed that many counties with rates over 100% indeed had rising populations. The first step, then, was to compile accurate estimates of the 2018 voting-eligible populations for each county.

The Census Bureau does not yet have 2018 population estimates. It does have yearly population estimates for 2010-2017 for each county, which can be modeled to forecast 2018 populations, but it only has yearly data on citizenship for 38 of North Carolina’s 100 counties. To ensure my estimates are as accurate as possible, I used the following procedure:

  1. For each of the 38 counties with complete data, calculate the yearly proportion of the population that meets citizenship and age requirements to vote for 2010-2017.
  2. Forecast the 2018 proportions of eligible voters for these counties.
  3. Model these proportions on each county’s 2016 ACS estimates and demographic compositions, then estimate the proportions for the other 62 counties.
  4. Forecast 2018 total populations for each county.
  5. Multiply these by the fitted proportions to estimate the 2018 voting-eligible populations for all 100 counties.
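
In symbols, the procedure above amounts to the following. The notation is my own shorthand, not the Census Bureau’s: for county \(c\) and year \(t\), let \(N_{c,t}\) be the total population, \(A_{c,t}\) the population aged 18 and over, and \(F_{c,t}\) the number of non-citizens (presumably those aged 18 and over, so that the subtraction is consistent). Then \[p_{c,t} = \frac{A_{c,t} - F_{c,t}}{N_{c,t}}, \qquad \widehat{\mathrm{VEP}}_{c,2018} = \hat{N}_{c,2018} \times \hat{p}_{c,2018}\] where hats denote forecast or fitted values: \(\hat{p}_{c,2018}\) comes from Steps 2 and 3, and \(\hat{N}_{c,2018}\) comes from Step 4.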

Figure 1. Density plots showing the distribution of county populations for the sample of 38 counties vs. all 100 counties.

Step 3 assumes that the 38 counties with complete data are representative of all 100 counties. I could not find any explanation from the Census Bureau as to why these counties were selected, but Figure 1 shows that the distribution of their total populations is similar to that of all 100 counties. Smaller counties are underrepresented in the sample, but the overall shape of the distribution is similar enough to proceed with this step.

Step 1: Proportions of citizens from the sample

I obtained 8 different datasets from the Census Bureau’s ACS yearly data, one for each year. Each dataset includes population figures by county for the 38 sample counties. To start off, I loaded these files as a list of dataframes. I then created some functions to clean up the dataframes.

The datasets from the Census Bureau broke up the populations by gender, age, and citizenship, so to calculate voting-eligible population, I had to follow the formula displayed in the code below.

for (i in 1:8) {
  # Calculate voting-eligible population for each year
  census_list[[i]] <- mutate(census_list[[i]], pop_18up = male_18up + female_18up,
                             pop_noncitizen = male_noncitizen + female_noncitizen,
                             vep = pop_18up - pop_noncitizen,
                             prop = vep / total_pop) %>% 
                      select(county, prop)
  # Rename columns to include years in preparation for combining dataframes
  names(census_list[[i]])[2] <- paste("prop", 2009+i, sep="_")
}

# Reduce into one dataframe (reduce() comes from the purrr package)
proportions <- census_list %>% reduce(left_join, by = "county")

There were a few NA values, which I filled with the average of the previous two values in the row. The dataframe proportions is now a clean dataset with the proportion of the population that is eligible to vote by county and year. Here’s a snippet of the data:

Table 1. A subset of proportions (values shown as percentages).

county 2010 2011 2012 2013 2014 2015 2016 2017
Alamance 70.5 72.1 71.5 71.9 71.3 72.5 71.9 71.6
Brunswick 79.2 79.9 79.8 79.5 79.7 82.3 83.1 82.5
Buncombe 76.1 77.1 77.3 76.6 77.6 77.1 78.4 78.3
Burke 74.6 74.7 75.0 76.6 77.5 78.5 78.6 77.9
Cabarrus 68.2 67.8 68.8 68.6 68.8 69.2 69.4 69.9
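
The NA-filling rule mentioned above (replace a missing value with the average of the previous two values in its row) could be sketched as follows. The helper name fill_na_prev2 and the toy row are hypothetical, since the original cleaning code was omitted; this is one possible reading of the rule, not the author’s implementation.

```r
# Hypothetical helper: replace each NA with the mean of the two values just before it.
# NAs in the first two positions are left alone because they have no two predecessors.
fill_na_prev2 <- function(x) {
  for (j in seq_along(x)) {
    if (is.na(x[j]) && j > 2) {
      x[j] <- mean(x[(j - 2):(j - 1)])
    }
  }
  x
}

# Toy row of yearly percentages (made-up values, not from the ACS data)
toy_row <- c(70.0, 72.0, NA, 71.5)
filled <- fill_na_prev2(toy_row)
print(filled)
```

Under the same assumptions, the helper could be applied to every row of the year columns with something like proportions[, -1] <- t(apply(as.matrix(proportions[, -1]), 1, fill_na_prev2)).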

Step 2: Forecasting 2018 proportions

To predict the 2018 proportions, a simple second-order random walk time series model was fit for each county, with the following structure: \[p_t = 2p_{t-1} - p_{t-2} + w_t\] where \(p_t\) is the 2018 estimate based on the 2017 proportion \(p_{t-1}\) and the 2016 proportion \(p_{t-2}\), accounting for some random white-noise error \(w_t\). This model was chosen because it is one of the simplest time series processes, which makes it possible to quickly compute predictions for all 38 counties without going through tedious diagnostics and transformations to meet model assumptions for each county.

Figure 2. Example of a time series forecast for the 2018 citizen population proportion.
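
As a worked illustration of the forecast rule: because the white-noise term has mean zero, the one-step point forecast of a second-order random walk is simply the last value plus the most recent year-over-year change. The sketch below applies this by hand to the Alamance County row of Table 1. The function name forecast_rw2 is hypothetical, and the calculation only mirrors the point forecast; it does not reproduce the prediction intervals a fitted model would give.

```r
# One-step-ahead point forecast for a second-order random walk:
# p_t = 2 * p_{t-1} - p_{t-2}, since the white-noise term has expectation zero.
forecast_rw2 <- function(x) {
  n <- length(x)
  stopifnot(n >= 2)
  2 * x[n] - x[n - 1]
}

# Alamance County percentages for 2010-2017, copied from Table 1
alamance <- c(70.5, 72.1, 71.5, 71.9, 71.3, 72.5, 71.9, 71.6)
alamance_2018 <- forecast_rw2(alamance)
print(alamance_2018)  # 2 * 71.6 - 71.9, about 71.3
```

Only the last two observations affect the point forecast, which is why the model is quick to apply but also sensitive to noise in the most recent years.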

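# Time series forecasts (sarima.for() comes from the astsa package)
# Note: this indexing assumes prop_2018 already exists as the last column (e.g., filled with NA), so it is excluded from the series
for (i in 1:nrow(proportions)) {
  series <- ts(as.numeric(proportions[i,2:(ncol(proportions)-1)]))
  # n.ahead = 1, p = 0, d = 2, q = 0: one-step forecast from the second-order random walk
  proportions$prop_2018[i] <- sarima.for(series, 1, 0, 2, 0)$pred[1]
}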

The proportions dataset now has 2018 estimates for the voting-eligible population proportions for the 38 sample counties.

Step 3: Modeling proportions

Next, I modeled the proportions for each county. The main predictor was the 2016 ACS proportion estimate, which should be fairly accurate. I could have bypassed several of these steps and used the ACS proportion estimates directly, assuming they are relatively constant over time, but I wanted to be more precise. I fit a regression model on these proportion estimates along with other variables I thought might help explain voting-eligible proportions: race/ethnicity (white, black, hispanic), gender (female), age (median_age), income (median_income, mean_income), and education levels (no_hs, hs, any_college, grad).

After trying different models, the one below ended up being the best fit:

Figure 3. Histogram of voting-eligible population proportions.

# Multiple regression model (rows with a missing prop_2018 are dropped from the fit)
model <- lm(prop_2018 ~ vep_est_prop + pop_total + any_college, data = census)

# Keep the 38 calculated values and fill in fitted values for the other 62 counties
missing <- which(is.na(census$prop_2018))
census$prop_2018[missing] <- predict(model, census[missing,])

The formula from this model is as follows: \[\hat{p} = 0.040 + 0.986x_1 - (1.76 \times 10^{-8}) x_2 - 0.0255x_3\] where \(x_1\) is the 2016 voting-eligible population proportion estimate, \(x_2\) is the total population, and \(x_3\) is the proportion of each county’s population that has attended any amount of college.

Step 4: Forecasting 2018 populations

Figure 4. Example of a time series forecast for the 2018 population of Yancey County.

Recall that while the Census Bureau has yearly data on voting-eligible populations for only 38 counties, it has total populations for all 100. I was able to forecast 2018 populations for all 100 counties using an approach very similar to Step 2. The model is again a second-order random walk time series: \[x_t = 2x_{t-1} - x_{t-2} + w_t\] where \(x_t\) is the 2018 population estimate based on the 2017 population \(x_{t-1}\) and the 2016 population \(x_{t-2}\), accounting for some random white-noise error \(w_t\).

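# Time series forecasts
# Note: this indexing assumes pop2018_est already exists as the last column (e.g., filled with NA), so it is excluded from the series
for (i in 1:nrow(census_time)) {
  series <- ts(as.numeric(census_time[i,2:(ncol(census_time)-1)]))  # yearly populations for county i
  census_time$pop2018_est[i] <- sarima.for(series, 1, 0, 2, 0)$pred[1]  # one-step forecast for 2018
}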

Below is a plot of 10 randomly selected counties with their 2018 population projections.

Figure 5. Population change over the last 8 years for 10 random counties, with 2018 projections.


Step 5: Calculating voting-eligible populations

Now that we have the 2018 total populations and voting-eligible proportions for each county, multiplying the two together gives the voting-eligible populations.

# Add 2018 population projections into main dataframe
census <- left_join(census, select(census_time, county, pop2018_est), by="county")

# Calculate voting-eligible population
census <- mutate(census, vep = pop2018_est * prop_2018)
census$vep <- round(census$vep, 0)

The only component of voting-eligible populations not yet accounted for is the population of prisoners and people on probation or parole, who are not allowed to vote in North Carolina, according to the National Conference of State Legislatures. The only data available reports statewide figures; I could not find any estimates for these populations by county. (North Carolina does publish the capacity and county of each of its prisons, but prisons are not necessarily at full capacity, so using capacity could significantly throw off voting-eligible populations for small counties with prisons. Hence prison capacity was used later on in a regression model as a potential indicator variable for registration rates rather than as a component of the population calculation.) It would not be fair to assume that the proportion of the population that falls into this category is consistent across all counties; counties with prisons and high poverty rates would likely have higher numbers. Because of this, we must proceed with caution and understand that voter registration rates may be slightly understated in some areas, since the denominators assume that 100% of the adult citizens in each county are eligible to vote.

Accuracy verification

To make sure these estimates are adequate, I compared the sum of voting-eligible populations across all counties to statewide estimates from the US Elections Project and the Census Bureau. (The Census estimate is for 2017, not 2018, so it is likely less accurate.)

Table 2. Comparison of North Carolina voting-eligible population (VEP) estimates from different sources.

Source | VEP estimate | Difference from my estimate | Percent difference
My estimate | 7,604,905 | 0 | 0%
US Elections Project | 7,577,011 | 27,894 | 0.37%
US Census Bureau | 7,509,879 | 95,026 | 1.25%

My statewide total is within 0.4% of the US Elections Project estimate, which suggests that the county-level voting-eligible population estimates are adequate, at least in aggregate.
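
The differences in Table 2 can be re-derived with a few lines of arithmetic. The sketch below hard-codes the three statewide totals from the table and measures each difference relative to my estimate, which is the denominator that reproduces the table’s percentages; the variable names are only illustrative.

```r
# Statewide VEP estimates copied from Table 2
vep <- c(mine = 7604905, elections_project = 7577011, census_bureau = 7509879)

# Absolute difference from my estimate, and that difference as a percent of my estimate
difference <- vep["mine"] - vep
percent <- round(100 * difference / vep["mine"], 2)

print(data.frame(source = names(vep), vep = unname(vep),
                 difference = unname(difference), percent = unname(percent)))
```

A check like this only validates the statewide sum; errors in individual counties could still cancel each other out.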

Voter registration counts

Compared to the voting-eligible population, it was much simpler to get the registered voter population in each county. North Carolina’s voter registration file includes, among many other variables, which county each voter lives in. The only difficulty I faced getting voter registration counts was having the patience to load a file with millions of records.

# Read in TXT file
ncfile <- read.delim("ncvoter_Statewide.txt")

# Select variables and active voters
voters <- ncfile %>%
  filter(voter_status_desc=="ACTIVE") %>%
  select(county_desc, party_cd) %>%
  group_by(county_desc) %>%
  summarise(voters = n(),
            dem = length(which(party_cd == "DEM")),
            rep = length(which(party_cd == "REP")),
            lib = length(which(party_cd == "LIB")),
            cst = length(which(party_cd == "CST")),
            gre = length(which(party_cd == "GRE")),
            una = length(which(party_cd == "UNA")))

# Create frequency table for counties
names(voters)[1] <- "county"
voters$county <- as.character(voters$county)

# Reformat county names (simpleCap() is presumably a capitalization helper from the omitted code)
for(i in 1:nrow(voters)) voters$county[i] <- simpleCap(voters$county[i])

This new table, voters, lists the number of active registered voters and their party affiliations in each county.

Voter registration rates

We now have the active registered voter populations and eligible voter populations, so dividing those gets us the registration rates for each county.

# Join dataframes
rates <- select(voters, county, voters) %>%
  left_join(select(census, county, vep), by="county") %>%
  mutate(reg_rate = voters/vep)

# Round rates
rates$reg_rate <- round(rates$reg_rate, 3) * 100

The rates range from 55.9% (Onslow County) to 89.1% (Dare County). Let’s look at this on a map:

Figure 6. Map of North Carolina counties shaded by voter registration rates. This map was created using code adapted from Reproducible Research Course:

Figure 7. Histogram of voter registration rates.