BW #76: Aging legislators (solution)

This week, I thought it would be interesting to find out just how old American politicians are. President Joe Biden (81) announced that he won't run for re-election, but Donald Trump (78) isn't a youngster, either. Nancy Pelosi (84) seems to have been central to coordinating Biden's decision to pull out of the race.

We do see some inching toward younger talent and candidates: Vice president Kamala Harris isn't even 60 years old yet, which makes her seem quite young compared with all of them — until you compare her with Republican vice presidential candidate JD Vance (39), who was clearly selected in part because of his youth.

Just how old are American politicians? I certainly have an image of them as very old, especially in the Senate, but maybe that's unfair or wrong? The only way to find out is to examine the data, and that's what we do here at Bamboo Weekly.

Data and eight questions

I looked around for a data source that could tell me the ages of US politicians. I was delighted to find a project that has been going on for some time, a GitHub repository with more than 100 collaborators. The project is here:

https://github.com/unitedstates/congress-legislators

The good news is that they have a lot of data, and seem to update it on a very regular basis. (For example, they already included the death of Sheila Jackson Lee, who passed away on July 19th.) The data comes from a wide variety of sources and formats, and the source data thus includes a number of unique identifiers, each from a distinct system.

We'll look at two of the files, legislators-historical.yaml and legislators-current.yaml, both (as the suffix indicates) in YAML format. The first file contains all legislators who previously served, and the second contains all those who are currently serving.
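Each file contains a list of records, one per legislator. As a rough, hand-written illustration (the field names here are only indicative and far from complete; see the repository's documentation for the actual schema), a record looks something like this:

- id:
    bioguide: A000001          # one of several unique identifiers
  name:
    first: Jane
    last: Doe
  bio:
    birthday: '1950-01-01'
  terms:
    - type: rep                # one entry per term served
      start: '1995-01-03'
      end: '1997-01-03'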

While the age of legislators is the topical focus for this week's questions, I found that reading the data from YAML into a useful Pandas data frame was a fairly big task. I thus broke it up (to some degree) into multiple questions.

As usual, a link to the Jupyter notebook I used to ask and answer these questions is at the bottom of this post.

Read the two YAML files (legislators-historical and legislators-current) into a single data frame.

Before doing anything else, I loaded Pandas. But because we're also going to be working with YAML, I installed the pyyaml package from PyPI, which we import under the name yaml:

import pandas as pd
from pandas import DataFrame
import yaml

We have two YAML files that we need to read into Pandas data frames. Then we'll need to combine them.

I hadn't worked with YAML in some time, and was under the (mistaken) assumption that Pandas would have a read_yaml method, much as it has read_csv and read_json. But it turns out that no such method exists. We'll thus need to use pyyaml to read the YAML files into Python dictionaries, and then use those dicts to build a data frame. A bit roundabout, but we can do it.

I also assumed that reading a YAML file with pyyaml would be similar to reading a JSON or pickle file, with a load method. That turns out to be true, but it's not enough to run yaml.load on a file. You also need to provide a value for the Loader keyword argument; fortunately, yaml.Loader works fine for that:

historical_filename = 'legislators-historical.yaml'
historical_data = yaml.load(open(historical_filename), Loader=yaml.Loader)

But wait: It turns out that YAML has some security issues, namely that a malicious file can force your computer to execute arbitrary code upon loading the file. So it's far better to use safe_load, which has the added benefit of not requiring another argument:

historical_data = yaml.safe_load(open(historical_filename))

That defines historical_data as a list of dicts. We can turn that into a data frame by invoking DataFrame on the data:

historical_df = DataFrame(historical_data)

To load the current data, I did roughly the same thing:

current_filename = 'legislators-current.yaml'

current_data = yaml.safe_load(open(current_filename))
current_df = DataFrame(current_data)

I now had two data frames with the same column names, and wanted to stack them on top of one another. To do that, I used pd.concat, which takes a list of data frames as an argument:

df = pd.concat([historical_df, current_df])

I've used pd.concat many times, but this was the first time I got an error saying that the data frames couldn't be combined, because the index was duplicated and led to conflicts. That's fine; I added the ignore_index keyword argument, which solved the problem:

df = pd.concat([historical_df, current_df], ignore_index=True)

At this point, we have a data frame with 12,684 rows and 7 columns. However, three of those columns contain Python dicts, and one contains a list of dicts in each cell. We'll work to clean that up and make a better data frame in the next few questions.
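Before moving on, we can peek at the first row to get a feel for that nesting. (This is just a quick sketch; the exact contents depend on the data, but the id, bio, and name cells each contain a dict, and the terms cell contains a list of dicts.)

first_row = df.iloc[0]

print(type(first_row['id']))      # <class 'dict'>
print(type(first_row['bio']))     # <class 'dict'>
print(type(first_row['name']))    # <class 'dict'>
print(type(first_row['terms']))   # <class 'list'>, one dict per term served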

Three columns (id, bio, and name) contain Python dicts. Expand each dict to be new columns in the row, and then remove the original columns. Then take the "terms" column, which contains a list of dicts, and expand it such that you have multiple rows per legislator, one term per row. (The rest of the data for the legislator will be duplicated.) Finally, set the "bioguide" column to be the index.

In some ways, a dict is like a row in a data frame: The keys are the column names, and the values are the cell values. So it seems reasonable to assume that we can somehow expand a series of dicts into a data frame.
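As a toy sketch with made-up data, handing a plain Python list of dicts to the DataFrame constructor already does this kind of expansion, although that doesn't work directly on a column of dicts inside an existing data frame:

people = [{'first': 'A', 'last': 'B'},
          {'first': 'C', 'last': 'D'}]

# Each dict becomes a row; each key becomes a column
DataFrame(people)
#   first last
# 0     A    B
# 1     C    D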

One not-very-obvious way to do this is with the pd.json_normalize function. It does exactly what I described, namely takes a series of dicts and returns a data frame:

pd.json_normalize(df['id'])

The above returns a new data frame with 17 columns (one for each dict key) and 12,684 rows, one for each row in the original data frame. We can do this for each of the three columns that contain dicts, but then how do we merge the results back into the original data frame?

We can use pd.concat, but this time, we'll pass axis='columns', so that the concatenation happens side-to-side, rather than top-to-bottom:

(pd.concat([df,
            pd.json_normalize(df['id']),
            pd.json_normalize(df['bio']),
            pd.json_normalize(df['name'])],
           axis='columns')
)

I also asked you to remove the original columns. We can do that with drop, specifying that we want to drop columns, rather than (its default of) rows:

df = (pd.concat([df, 
                 pd.json_normalize(df['id']),
                 pd.json_normalize(df['bio']),
                 pd.json_normalize(df['name'])],
                 axis='columns')
      .drop(['id', 'bio', 'name'], axis='columns')
)

Next, I asked you to use the "bioguide" column as the index. We can do that with set_index:

(pd.concat([df,
            pd.json_normalize(df['id']),
            pd.json_normalize(df['bio']),
            pd.json_normalize(df['name'])],
           axis='columns')
 .drop(['id', 'bio', 'name'], axis='columns')
 .set_index('bioguide')
)

At this point, we've expanded all of the dicts into columns. But there is still the terms column, with each cell containing a list of dicts. Each dict in that list represents a single term in which the legislator served. So if someone served two terms, there will be a two-element list of dicts, and if they served 10 terms, there will be a 10-element list of dicts.

There isn't an obvious way to handle data like this. But Pandas does provide an explode method, which takes a series containing lists and turns each list element into its own row. So in such a series, a two-element list would be turned into two rows, and a 10-element list would be turned into 10 rows. The index would be preserved, though, with its values repeated across the new rows.
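Here's a minimal sketch with a made-up series, just to show the behavior (notice how the index values repeat):

s = pd.Series({'x': [1, 2],
               'y': [10, 20, 30]})

s.explode()
# x     1
# x     2
# y    10
# y    20
# y    30
# dtype: object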

We can also run explode on a data frame, specifying which column contains a list. When you do this, each of the rows is duplicated, once for each list element. The list elements themselves are spread across the rows. This will basically solve our problem, although it will mean that there will be a separate row not for each legislator, but rather for each term that they served. Note that the index will continue to contain the unique "bioguide" ID code for each legislator:

df = (pd.concat([df, 
                 pd.json_normalize(df['id']),
                 pd.json_normalize(df['bio']),
                 pd.json_normalize(df['name'])],
                 axis='columns')
      .drop(['id', 'bio', 'name'], axis='columns')
      .set_index('bioguide')
      .explode('terms')
     )

The resulting data frame has 45,035 rows and 28 columns. We definitely won't need all of these columns, but I decided against trimming them down for now.
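Since the index (the bioguide ID) now appears once per term served, one quick sanity check, a sketch rather than part of this week's questions, is to count how many rows each legislator contributes:

# Rows (i.e., terms served) per bioguide ID; the longest-serving
# legislators should appear at the top
df.index.value_counts().head()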