BW #34: House of Representatives (solution)
There was big news in the US House of Representatives this week, as members of the Republican party ousted the speaker. This week, we look at some recent election data about the US House.
This week, the US House of Representatives ousted its speaker, Kevin McCarthy, from office. What will happen next isn’t quite obvious, and probably isn’t a good omen for the Republican party, for the House, for the US government, and quite possibly even for the world.
But hey, we’ll leave politics to other people. Here at Bamboo Weekly, we’re all about the data! And this week, we looked at election data about the House of Representatives.
Data and seven questions
This week's data comes from the MIT Election Lab (https://electionlab.mit.edu/), run by Professor Charles Stewart III. The data itself is downloadable from the Harvard Dataverse:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IG0UN2
We’re looking at the data set listed as 1976-2022-house.tab. You can download it in a variety of formats; I chose comma-separated values (CSV), but other options are available.
You’ll also want to look at the codebook (aka “data dictionary”), downloadable from the same page in Markdown format.
I had seven questions and tasks for you about this week’s data. Without further ado, let’s take a look at them:
Load the data into a Pandas data frame. We only need the columns `year`, `state`, `state_po`, `district`, `stage`, `candidate`, `party`, `candidatevotes`, and `totalvotes`.
The first thing we’ll need to do is load up Pandas:
import pandas as pd
After having done that, we’ll want to load the CSV file into a data frame. I asked you to load only a handful of the columns, so we’ll use the “usecols” keyword argument to read_csv:
filename = '/Users/reuven/Downloads/1976-2022-house.csv'
df = pd.read_csv(filename,
                 usecols=['year', 'state', 'state_po',
                          'district', 'stage', 'candidate',
                          'party', 'candidatevotes', 'totalvotes'])
I forgot to mention in my original question that we only want to keep the general-election results around, dropping any from the primaries. We can see how many results there are for each with value_counts:
df['stage'].value_counts()
The result I get:
stage
GEN    32392
PRI       60
Name: count, dtype: int64
So most of the results are indeed from a general election. But let’s get rid of those that have to do with primaries. We can do that by creating a boolean series, comparing the “stage” column with the string “GEN”, and then feeding that boolean series to “loc”. Only those rows containing GEN for “stage” will be returned, and we can assign the result back to df:
df = df.loc[df['stage'] == 'GEN']
We can combine all of the above into a single chained query as follows:
df = (
    pd.read_csv(filename,
                usecols=['year', 'state', 'state_po',
                         'district', 'stage', 'candidate',
                         'party', 'candidatevotes', 'totalvotes'])
    .loc[lambda df_: df_['stage'] == 'GEN']
    .drop('stage', axis='columns')
)
You might notice some subtle differences between what I did above, and what I did earlier.
First, my call to “loc” no longer puts a boolean series inside of the square brackets. Rather, I pass a lambda (i.e., an anonymous function) that takes a single argument, which we call “df_” to indicate that it’s a temporary, local data frame — in this case, the data frame that we got back from read_csv.
We then run our comparison between the “stage” column on that temporary data frame and “GEN”, keeping only those for which it returns True.
Once we’ve used the “stage” column, we really don’t need it any more. So I invoke “drop” on the data frame, removing the “stage” column. We need to indicate that we’re working with columns, rather than rows, via the “axis” keyword argument.
I end up with a data frame of 32,392 rows and eight columns. I’ll add that it is a bit silly to have both the “state” column (with the state name) and the “state_po” column (with the state’s two-letter abbreviation). If I were doing this for myself, I’d probably just use “state_po”, since it uses less space and memory. But since not everyone reading BW is intimately familiar with all 50 US state names, let alone their two-letter abbreviations, I decided to stick with the longer values.
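If you’re curious whether the two columns really are redundant, here’s a quick check of my own (not part of the original questions): each state name should map to exactly one abbreviation.

# For each state name, count how many distinct abbreviations appear.
# If "state" and "state_po" are truly redundant, the maximum should be 1.
df.groupby('state')['state_po'].nunique().max()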
I’ll add, by the way, that the total memory used by this data frame on my system is 9.2 MB. Both “state” and “state_po” contain repeated strings, which are perfect candidates for Pandas categories — basically, the equivalent of enums. I can convert them both:
df['state'] = df['state'].astype('category')
df['state_po'] = df['state_po'].astype('category')
After executing the above, the memory footprint of the data frame drops to 5.4 MB, or about 60% of the original size. That’s right: with only two calls to “astype” and less than one second of execution time, we’ve cut our memory usage by about 40 percent. That’s a pretty big win, I’d say.
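By the way, if you want to verify numbers like these on your own system, here’s a minimal sketch; the exact figures will vary with your machine and Pandas version.

# Add up the per-column memory usage. Passing deep=True asks Pandas to
# measure the actual Python strings, not just the pointers to them.
total_bytes = df.memory_usage(deep=True).sum()
print(f'{total_bytes / 1_000_000:.1f} MB')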
How many districts are there per state in 2022? What 10 states have the most districts?
The whole idea of the House of Representatives is that there should be one representative for every X people. (The number represented by X has changed over the years. Someone recently suggested that it be changed to 44 billion, but I don’t think that’ll work out very well.)
From this data, can we find out how many districts are in each state?
First, we need to find all of the rows from the year 2022. We have a “year” column, and could theoretically create a boolean series that compares the year with 2022. Then we could pass that boolean series to “loc”, and get back only those matching rows.
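That boolean-mask approach would look something like this:

# Keep only the rows from the 2022 election
df.loc[df['year'] == 2022]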
However, this is the sort of case that I think calls for the use of “set_index”: I can set the index to be the year, and then retrieve those rows that are from the year I want. The effect is the same, but it feels a bit easier and clearer to me. Because set_index, like so many methods in Pandas, returns a new data frame (but doesn’t affect the original one), we can use it in a chained set of methods:
(
    df
    .set_index('year')
    .loc[2022]
)
With this in place, we can ask the question: How many districts are there per state? Once we have a “per state” question, that means we’ll almost certainly need to use “groupby”. Grouping means that we want to run an aggregation method for each distinct value in a categorical column. We can thus say:
(
    df
    .set_index('year')
    .loc[2022]
    .groupby('state')['district']
)
In other words: For each distinct state, we want to get the value of “district”. However, if we do this, we’ll get a warning from Pandas, telling us that a “groupby” on a categorical dtype currently gives us a result for every element in the category, whether or not it was actually observed in the data. In a future version of Pandas, the default value of the “observed” keyword argument will change to True, such that only observed category values are shown. We can silence this warning by explicitly passing observed=False. (In our case, because we created the categories from actual data, there isn’t any such thing as a non-observed category member, so it doesn’t really matter which value we choose.)
We can thus say:
(
    df
    .set_index('year')
    .loc[2022]
    .groupby('state', observed=False)['district']
)
Now we need to apply an aggregation method. Which one will we use? I originally thought that I would use “count”, since I want to know how many there are. Then I can use “sort_values” to list the states, and how many congressional districts they each have. And then I can use “head” to get the top 10:
(
    df
    .set_index('year')
    .loc[2022]
    .groupby('state', observed=False)['district']
    .count()
    .sort_values(ascending=False)
    .head(10)
)
Doing this gives me some weird results, though:
state
NEW YORK         158
CALIFORNIA       104
TEXAS             93
FLORIDA           72
NEW JERSEY        53
MICHIGAN          51
ILLINOIS          46
TENNESSEE         37
MASSACHUSETTS     36
VIRGINIA          35
Name: district, dtype: int64
I’m no expert, but I’m pretty sure that New York does not have more representatives in Congress than California. And if we add the numbers from New York, California, Texas, and Florida together, we get 427, which is just under the 435 total members of the House. So … what’s going on here?
We basically counted how many candidates there were running for Congress in each state, not how many districts there were! That’s because this data set lists all candidates, not only the winners.
We’ll need another way to find this, and I decided the easiest would be to use the “max” aggregate method. After all, if the districts are numbered, and if this data set lists them, then we should be fine:
(
    df
    .set_index('year')
    .loc[2022]
    .groupby('state', observed=False)['district']
    .max()
    .sort_values(ascending=False)
    .head(10)
)
Sure enough, we get:
state
CALIFORNIA        52.0
TEXAS             38.0
FLORIDA           28.0
NEW YORK          26.0
PENNSYLVANIA      17.0
ILLINOIS          17.0
OHIO              15.0
GEORGIA           14.0
NORTH CAROLINA    14.0
MICHIGAN          13.0
Name: district, dtype: float64
This makes much more sense! We see that California has the most representatives, followed by Texas, Florida, and New York. What about the bottom-most states? We can get those with:
(
    df
    .set_index('year')
    .loc[2022]
    .groupby('state', observed=False)['district']
    .max()
    .sort_values(ascending=False)
    .tail(10)
)
And the results:
state
MAINE                   2.0
IDAHO                   2.0
HAWAII                  2.0
SOUTH DAKOTA            0.0
NORTH DAKOTA            0.0
ALASKA                  0.0
VERMONT                 0.0
DELAWARE                0.0
WYOMING                 0.0
DISTRICT OF COLUMBIA    NaN
Name: district, dtype: float64
Wondering how a state can have 0 representatives? That’s not what this means; rather, each of these states contains only a single, at-large congressional district. Such a district doesn’t get a number, so the data set records it as 0. Even worse off is Washington, DC, which has no voting representative in Congress at all. (I know, it seems weird to me, too.)
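One final thought, offered as a sketch of my own rather than as part of the original solution: if we wanted each at-large state to report 1 district rather than 0, we could count the distinct district values with “nunique”, instead of taking the maximum:

(
    df
    .set_index('year')
    .loc[2022]
    .groupby('state', observed=False)['district']
    .nunique()   # distinct values, so an at-large district (0) counts as 1
    .sort_values(ascending=False)
)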