BW #90: Voter participation (solution)

US elections are less than one week from now, and one of the biggest questions is just how many people will show up to vote. The margins in the 2020 election were razor thin in several swing states, and it's looking like that might happen once again. As a result, the "ground game," in which political parties talk to potential voters and try to persuade them to vote, is likely to play a large role.

Even in a tight, important presidential race, with the parties trying their best to get people to vote, a huge number of Americans don't bother to do so. That led me to wonder how many Americans do vote, how these numbers have changed over the years, and how this compares with voting numbers in other countries.

Data and six questions

To answer these questions, I took data from the International Institute for Democracy and Electoral Assistance, which calls itself "International IDEA" (https://www.idea.int/). International IDEA tries to help countries become more democratic – including by monitoring and encouraging women in leadership. We have looked at their data before, back in BW 69 (https://www.bambooweekly.com/bw-69-election-participation/).

International IDEA's "voter turnout database" is at

https://www.idea.int/data-tools/data/voter-turnout-database

Download the data by clicking on the "export data" button next to the headline on the page. This should download an Excel file onto your computer.

To see the data dictionary for this data set, click on the "glossary" link on the right side of the page, under the headline "background information."

Here are my solutions and explanations for this week; as always, a link to download my Jupyter notebook is at the bottom of this post.

Turn the first ("All") tab of the spreadsheet into a Pandas data frame. The Year column should be a datetime value. (Note: The dashes in the dates will make this tricky!) Use the "Country", "Election Type", and "Year" values for a three-level multi-index.

Before doing anything else, let's load Pandas:

import pandas as pd

Now that we've done that, I'll load the file into a data frame. To create a data frame from an Excel file, we can use read_excel, indicating that we just want the "All" tab:


filename = 'idea_export_voter_turnout_database_region.xlsx'
df = pd.read_excel(filename,
                   sheet_name='All')

This is good, except that I also asked you to set it up with a multi-index containing "Country", "Election Type", and "Year". If this were a CSV file, we could name the columns we want in the index; with Excel, we need to pass a list of integers indicating which columns we want:


filename = 'idea_export_voter_turnout_database_region.xlsx'
df = pd.read_excel(filename,
                   sheet_name='All',
                   index_col=[0, 3, 4])
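
As an aside, here's a sketch of what the CSV equivalent might look like, assuming a hypothetical voter_turnout.csv file with the same columns – read_csv, unlike read_excel, lets us name the index columns directly:

# Hypothetical CSV version of the same data set; read_csv accepts
# column names (not just integer positions) in index_col
df = pd.read_csv('voter_turnout.csv',    # assumed filename
                 index_col=['Country', 'Election Type', 'Year'])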

Our call to read_excel works, but there's still one thing missing: The Year column (now part of our multi-index) still has a dtype of object, meaning that it contains Python objects – in this case, strings. We can tell read_excel to treat one or more columns as datetime values by passing their names, in a list, to the parse_dates keyword argument:


filename = 'idea_export_voter_turnout_database_region.xlsx'
df = pd.read_excel(filename,
                   sheet_name='All',
                   parse_dates=['Year'],
                   index_col=[0, 3, 4])

Now, I said that it would be tricky to deal with the date... but that turns out not to be true! The above code works just fine, turning the Year column into a datetime dtype.

Why did I think it would be tricky? Because when I first downloaded the file and imported it into Pandas, the - (minus sign) character didn't appear to be separating the date components. Rather, it seemed to be an en-dash, a different character that looks extremely similar but is distinct. To handle that in my original solution, I passed the date_format keyword argument. Or at least, that's what I thought: When I double-checked my work while writing these solutions, I found that the dates used regular ol' minus signs, and that they were parsed just fine without any extra hints.
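
For the record, here's a sketch of the workaround I thought I had needed, assuming the dates really had used en-dashes (e.g., 2020–11–03). The date_format keyword argument (added to read_excel in Pandas 2.0) takes a format string, which here would contain literal en-dash characters between the components:

filename = 'idea_export_voter_turnout_database_region.xlsx'

# Sketch, assuming en-dash separators in the Year column: spell out
# the date format, using literal en-dashes (U+2013) between the parts
df = pd.read_excel(filename,
                   sheet_name='All',
                   parse_dates=['Year'],
                   date_format='%Y–%m–%d',   # en-dashes, not minus signs
                   index_col=[0, 3, 4])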

One of the nice things about Jupyter is that it allows us to repeat our steps. I run the notebook for each BW edition several times, restarting the kernel each time, to make sure that I'm not accidentally using variables or functions that I defined while performing experiments.

The result of this call to pd.read_excel is a data frame with 3,717 rows and 10 columns, as well as a three-part multi-index on the rows.
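
We can confirm this with a quick check; the values shown in the comments come from my download, and might differ slightly in yours:

# Confirm the frame's dimensions and the structure of the row index
print(df.shape)          # (3717, 10)
print(df.index.nlevels)  # 3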

Turn columns whose values contain percent signs (%) and commas into float dtypes. Why can we not use an int dtype for all of these?

Next, I asked you to turn all of the columns whose values contain % or , characters into columns with a float64 dtype. There are a few ways to do this, but my preference is to create a Python list of strings (i.e., the names of the columns) and iterate over them.

Within each column, we use the str.replace method with a regular expression to search for either % or ,, replacing the found character with an empty string. Once we've performed that replacement, we can invoke astype to get back a new series, which we assign back to the column it came from:


for column_name in ['Voter Turnout', 'VAP Turnout', 
                    'Invalid votes', 'Total vote',
                    'Registration', 'Voting age population', 
                    'Population']:
    print(f'Processing {column_name}')
    df[column_name] = (
        df[column_name]
        .str.replace(r'[%,]', '', regex=True)
        .astype(float)
    )

Pandas allows you to access columns either using dot notation (i.e., treating the column name as an attribute on the data frame's variable) or square brackets (i.e., putting the column name in a string in the brackets). I've never been a big fan of dot notation, but in this case, we have no choice but to use square brackets: Column names that contain spaces and other punctuation aren't legal Python identifiers, which means that we cannot access them with dot notation.
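
To make that concrete, here's a minimal illustration – the bracket syntax works with any column name, while the dot-notation version isn't even valid Python:

# Square brackets handle any column name, including ones with spaces
df['Voter Turnout']

# Dot notation fails before Pandas even gets involved, because
# "Voter Turnout" isn't a legal Python identifier:
# df.Voter Turnout    # SyntaxError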

I can then run df.dtypes and get a full report demonstrating that we successfully changed the columns' dtypes:

ISO2                      object
ISO3                      object
Voter Turnout            float64
Total vote               float64
Registration             float64
VAP Turnout              float64
Voting age population    float64
Population               float64
Invalid votes            float64
Compulsory voting         object
dtype: object

Why did we convert all of the columns to floats, when some of them could have contained integer values? Because each of these columns contains some NaN values, and NaN is a float that cannot be stored in a column with an int dtype. Indeed, running df.isna().sum() tells us how many NaN values are in each column:

ISO2                      13
ISO3                      19
Voter Turnout            183
Total vote                81
Registration             168
VAP Turnout              133
Voting age population     65
Population                53
Invalid votes            851
Compulsory voting         13
dtype: int64

Unless we want to use Pandas extension types or PyArrow, both of which support nullable integers, we therefore need to turn our NaN-containing columns into floats.
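
For completeness, here's a sketch of the extension-type alternative, using the nullable 'Int64' dtype (note the capital "I", which distinguishes it from NumPy's int64):

# With the plain int dtype, the NaN values make the conversion fail:
# df['Total vote'].astype(int)    # raises IntCastingNaNError

# Pandas' nullable Int64 extension dtype stores missing values as pd.NA
total_votes = df['Total vote'].astype('Int64')
print(total_votes.dtype)    # Int64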