BW #90: Voter participation (solution)
US elections are less than one week from now, and one of the biggest questions is just how many people will show up to vote. The margins in the 2020 election were razor-thin in several swing states, and it's looking like that might happen once again. As a result, the "ground game," in which political parties talk to potential voters and try to persuade them to vote, is likely to play a large role.
Even in a tight, important presidential race, with the parties trying their best to get people to vote, a huge number of Americans don't bother to do so. That led me to wonder how many Americans do vote, how these numbers have changed over the years, and how this compares with voting numbers in other countries.
Data and six questions
To answer these questions, I took data from the International Institute for Democracy and Electoral Assistance, which calls itself "International IDEA" (https://www.idea.int/). International IDEA tries to help countries become more democratic – including by monitoring and encouraging women in leadership. We have looked at their data before, back in BW 69 (https://www.bambooweekly.com/bw-69-election-participation/).
International IDEA's "voter turnout database" is at
https://www.idea.int/data-tools/data/voter-turnout-database
Download the data by clicking on the "export data" button next to the headline on the page. This should download an Excel file onto your computer.
To see the data dictionary for this data set, click on the "glossary" link on the right side of the page, under the headline "background information."
Here are my solutions and explanations for this week; as always, a link to download my Jupyter notebook is at the bottom of this post.
Turn the first ("All") tab of the spreadsheet into a Pandas data frame. The Year column should be a datetime value. (Note: The dashes in the dates will make this tricky!) Use the "Country", "Election Type", and "Year" values for a three-level multi-index.
Before doing anything else, let's load Pandas:
import pandas as pd
Now that we've done that, I'll load the file into a data frame. To create a data frame from an Excel file, we can use read_excel, indicating that we just want the "All" tab:
filename = 'idea_export_voter_turnout_database_region.xlsx'
df = pd.read_excel(filename,
                   sheet_name='All')
This is good, except that I also asked you to set it up with a multi-index containing "Country", "Election Type", and "Year". If this were a CSV file, we could name the columns we want in the index; with Excel, we need to pass a list of integers indicating which columns we want:
filename = 'idea_export_voter_turnout_database_region.xlsx'
df = pd.read_excel(filename,
                   sheet_name='All',
                   index_col=[0, 3, 4])
This works, but there's still one thing missing: The Year column (now part of our multi-index) still has a dtype of object, meaning that it refers to a Python object – in this case, a string. We can tell read_excel to treat one or more columns as datetime values by passing the column name to parse_dates:
filename = 'idea_export_voter_turnout_database_region.xlsx'
df = pd.read_excel(filename,
                   sheet_name='All',
                   parse_dates=['Year'],
                   index_col=[0, 3, 4])
Now, I said that it would be tricky to deal with the date... but that turns out not to be true! The above code works just fine, turning the Year column into a datetime dtype.
Why did I think it would be tricky? Because when I downloaded the file and imported it into Pandas, the - (minus sign) character wasn't separating the date components. Rather, it was an en dash, a different character that looks extremely similar, but is distinct. To make that work in my original solution, I had passed the date_format keyword argument. Or at least, that's what I thought: When I double-checked my work while writing these solutions, I found that the dates used regular ol' minus signs, and that they were parsed just fine without any extra hints.
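Had the dates really used en dashes, a minimal sketch of what such a call might have looked like is below. The exact format string is an assumption (I don't know the component order the file would have used); the point is only that date_format lets you spell out the separator explicitly:

filename = 'idea_export_voter_turnout_database_region.xlsx'

# Hypothetical call, assuming Year values like '2020–11–03' with en dashes;
# the component order in the format string is a guess, not taken from the file.
df = pd.read_excel(filename,
                   sheet_name='All',
                   parse_dates=['Year'],
                   date_format='%Y–%m–%d',
                   index_col=[0, 3, 4])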
One of the nice things about Jupyter is that it allows us to repeat our steps. I run the notebook for each BW edition several times, restarting the kernel each time, to make sure that I'm not accidentally using variables or functions that I defined while performing experiments.
The result of this call to pd.read_excel is a data frame with 3,717 rows and 10 columns, as well as a three-part multi-index on the rows.
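If you want to confirm those numbers yourself, a quick check using standard DataFrame attributes looks like this:

print(df.shape)          # (3717, 10) -- rows and columns
print(df.index.nlevels)  # 3 -- Country, Election Type, and Year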
Turn columns whose values contain percent signs (%) and commas into float dtypes. Why can we not use an int dtype for all of these?
Next, I asked you to turn all of the columns with % or , in the numbers into columns with a float64 dtype. There are a few ways to do this, but my preference is to create a Python list of strings (i.e., the names of the columns) and iterate over them.

Within each column, we use the str.replace method with a regular expression to search for % and , characters, replacing each one with an empty string. Once we've performed that replacement, we can invoke astype to get back a new series, which we assign back to the column it came from:
for column_name in ['Voter Turnout', 'VAP Turnout',
                    'Invalid votes', 'Total vote',
                    'Registration', 'Voting age population',
                    'Population']:
    print(f'Processing {column_name}')
    df[column_name] = (
        df[column_name]
        .str.replace(r'[%,]', '', regex=True)
        .astype(float)
    )
Pandas allows you to access columns either using dot notation (i.e., treating the column name as an attribute of the data frame) or square brackets (i.e., putting the column name, as a string, inside the brackets). I've never been a big fan of dot notation, but in this case, we don't have a choice: Column names that contain spaces and other punctuation aren't legal Python identifiers, which means that we cannot refer to them with dot notation.
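For example (illustrative only, using column names from this data set):

# "Population" is a valid Python identifier, so dot notation happens to work:
df.Population

# "Voter Turnout" contains a space, so square brackets are the only option:
df['Voter Turnout']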
I can then run df.dtypes and get a full report demonstrating that we successfully changed the columns' dtypes:
ISO2 object
ISO3 object
Voter Turnout float64
Total vote float64
Registration float64
VAP Turnout float64
Voting age population float64
Population float64
Invalid votes float64
Compulsory voting object
dtype: object
Why did we convert all of the columns to floats, when some of them could have been integers? Because each column contains some NaN values, and NaN cannot be coerced into an integer format. Indeed, running df.isna().sum() tells us how many NaN values are in each column:
ISO2 13
ISO3 19
Voter Turnout 183
Total vote 81
Registration 168
VAP Turnout 133
Voting age population 65
Population 53
Invalid votes 851
Compulsory voting 13
dtype: int64
Unless we want to use Pandas extension types or PyArrow, both of which support nullable integers, we'll thus need to turn our NaN-containing columns into floats.
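If you did want nullable integers, a minimal sketch (assuming a reasonably recent version of Pandas, and a column whose values are all whole numbers) might look like this:

# Pandas' nullable integer extension dtype (note the capital "I") can hold NA:
df['Total vote'] = df['Total vote'].astype('Int64')

# With PyArrow installed, an Arrow-backed dtype is a similar option:
# df['Total vote'] = df['Total vote'].astype('int64[pyarrow]')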