BW #27: Young voters (solution)
Young Americans don't typically vote. This week, we look at the Harvard Youth Poll, including whether they think voting matters, how afraid they have been feeling, and where they get their news.
This week, we looked at young people in the US, and some of the results from the latest Harvard Youth Poll (https://iop.harvard.edu/youth-poll/45th-edition-spring-2023). The poll asked so many interesting questions that it was hard to choose just a few topics — but I hope that it gave you a taste for this data, and that you even started to look at some of the other topics it surveyed.
Data and questions
The Harvard Youth Poll’s data is reported in a single Excel document. This document was messier than most I’ve seen, because it puts the answers to dozens of questions on a single sheet, which made reading it… challenging. But you wouldn’t be reading Bamboo Weekly if you weren’t up for a challenge, right?
I posed eight questions and tasks. Let’s get to them:
Download the Excel file.
This was, by far, the easiest task of the week! I first started up Pandas:
import pandas as pd
Then I downloaded the Excel file from the Harvard Youth Poll’s site, and put its path into a variable for use below:
filename = '/content/files/sites/default/files/2023-05/harvard-20iop-20youth-20poll-20spring-202023-20crosstabs.xlsx'
Now I was ready to start working with the data itself.
Grab the data from the "Likely voter 2024" question (starting on line 168 of the Excel file) into a data frame. Set the answers ("Definitely will be voting", etc.) to be the index. Remove the "All" column and the rows for "weighted N" and "unweighted N".
The Excel document provided by HYP has two sheets. You can think of each sheet as a separate document, sort of like a zipfile with numerous CSV files inside of it. In this case, the first sheet was a list of the questions asked in the poll, and the second sheet contained the results. The questions on the first sheet were all hyperlinks to the second sheet, which was a nice touch, I’ll admit.
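To see what sheets a workbook contains before reading any data, pd.ExcelFile is handy. Here’s a self-contained sketch that builds a tiny two-sheet stand-in file (the HYP workbook isn’t bundled here, and the sheet names below are invented for illustration) and then lists its sheets:

```python
import pandas as pd

# Build a tiny two-sheet Excel file as a stand-in for the HYP workbook.
# (Sheet names and contents are invented; writing .xlsx needs openpyxl.)
with pd.ExcelWriter('demo.xlsx') as writer:
    pd.DataFrame({'question': ['Likely voter 2024']}).to_excel(
        writer, sheet_name='Questions', index=False)
    pd.DataFrame({'All': [0.5]}).to_excel(
        writer, sheet_name='Results', index=False)

# pd.ExcelFile exposes the workbook's sheet names without loading the data:
xlsx = pd.ExcelFile('demo.xlsx')
print(xlsx.sheet_names)
```

Once you know which sheet you want, you can pass its position (or its name) as sheet_name to read_excel.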
Let’s start with reading the second sheet into a data frame, using read_excel:
likely_voter_2024_df = pd.read_excel(filename,
sheet_name=1)
The above tells Pandas to read the entire second sheet (because the first has an index of 0) into a data frame. But it turns out that we don’t really want the entire sheet. That’ll give us the data, but not in any form that we can or will use: because the sheet stacks dozens of sub-tables on top of one another, each column mixes headers and values, and ends up with a dtype of “object”. Since we need each column’s values to share a dtype, and since the headers of every sub-table are identical anyway, we want to snip out just the limited number of rows containing the answers to our one question.
This means that we’ll need to tell Pandas:
- Ignore all of the rows up to where the desired snippet begins
- Ignore the rows after where that snippet ends
We can accomplish this via a combination of the “skiprows” and “nrows” keyword arguments. The first tells Pandas how many rows to skip before starting to read from the file, and the second tells Pandas how many rows to read once it has begun. We can thus say:
likely_voter_2024_df = pd.read_excel(filename,
sheet_name=1,
skiprows=166,
nrows=6)
This works! Moreover, I asked for just 6 rows, which cuts off the bottom two rows from that analysis, namely the weighted and unweighted numbers.
But wait: I also asked you to treat the first column as the index. To do that, I’ll need to use the index_col keyword argument:
likely_voter_2024_df = pd.read_excel(filename,
sheet_name=1,
skiprows=166,
index_col=0,
nrows=6)
This is all good, but what about the column labels? The way that this data is structured, the columns will be a multi-index, with main headers and subheaders. Normally, we can pass an integer value to the “header” keyword argument, to indicate which row should be treated as a header. To get a multi-index, we simply pass a list of integers:
likely_voter_2024_df = pd.read_excel(filename,
sheet_name=1,
header=[0,1],
skiprows=166,
index_col=0,
nrows=6)
We have now loaded our data, have the answers in the index, and a multi-index of columns. This is all great, except for one thing, namely the “All” column at the start of the data frame. And yes, we could just live with it there, but I thought it might be nice to get rid of it.
Normally, we could use the “drop” method to get rid of a column. But doing that with a multi-index is annoying, because the “All” column only has its name at the multi-index’s lower level. So how exactly can we drop that column?
Well, it turns out that “drop” lets us specify the level at which we want to match the name. So by saying “All” (the value in the lower level) and then passing level=1 along with axis="columns", we’ll be all set:
likely_voter_2024_df = pd.read_excel(filename,
sheet_name=1,
header=[0,1],
skiprows=166,
index_col=0,
nrows=6).drop('All', level=1, axis='columns')
Sure enough, this creates a data frame with the rows and columns we want, ready for analysis.
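Since not everyone has the poll’s spreadsheet handy, here’s a tiny self-contained illustration, with invented numbers, of how drop behaves with level=1 on a column multi-index:

```python
import pandas as pd

# A toy two-level column multi-index, mimicking the shape of the poll's
# spreadsheet; the numbers here are invented for illustration.
columns = pd.MultiIndex.from_tuples([
    ('All', 'All'),
    ('Race/Ethnicity Category', 'White'),
    ('Race/Ethnicity Category', 'Black'),
])
df = pd.DataFrame([[0.5, 0.4, 0.6]],
                  index=['Definitely will be voting'],
                  columns=columns)

# drop matches 'All' against level 1 (the lower level) of the columns,
# removing the ('All', 'All') column and keeping the rest:
trimmed = df.drop('All', level=1, axis='columns')
print(trimmed.columns.get_level_values(1).tolist())
```

Without level=1, drop would look for “All” at the top level of the column multi-index, which is exactly the lookup that fails here.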
If you opened the file in Excel, by the way, you might have noticed that the numbers were all percentages. Why aren’t they displayed with percent signs in Pandas? And for that matter, shouldn’t they be strings, if they contain percent signs?
The answer is that Excel distinguishes between its data and how it’s displayed. (We can do that in Pandas to some degree, also, I’ll admit.) So when you see something with a % after it in Excel, it’s still a float, just being shown as a percentage. And when the data is read into Pandas, it’s kept as a float — and then displayed as one. So you don’t need to turn it from a string into a float, or the like.
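If you do want percent-sign formatting in Pandas, here’s a small sketch (using two of the values from the poll) showing one way to produce a display-only version while leaving the floats untouched:

```python
import pandas as pd

# Two of the poll's values, stored as plain floats:
df = pd.DataFrame({'White': [0.073537], 'Black': [0.153782]})

# The dtype is float64, just as when the values are read from Excel:
print(df.dtypes)

# Build a display-only copy: format each float as a percentage string.
# '{:.1%}' multiplies by 100, rounds to one decimal, and adds a % sign.
# (Keep the float version around for any actual arithmetic.)
display_df = df.apply(lambda col: col.map('{:.1%}'.format))
print(display_df)
```

This mirrors Excel’s own separation between data and display: the numbers stay floats, and only the rendered form changes.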
What race/ethnicity has the highest percentage of people saying they'll definitely be voting? Which has the highest percentage saying they definitely *won't* be voting?
Now that we’ve created a data frame based on the information in the Excel spreadsheet, let’s do a bit of analysis: Among young Americans, which race/ethnicity has the highest percentage of people saying they’ll definitely be voting?
Our data frame’s index has a row labeled “Definitely will be voting”. So we can use loc to retrieve that row:
likely_voter_2024_df.loc['Definitely will be voting']
However, we aren’t interested in all of the columns; we just want those under race/ethnicity.
Here, we can use the two-argument version of loc, where the first argument is the row selector, and the second argument is the column selector.
But wait — if I ask for the column “Race/Ethnicity Category,” that’s the top level of a two-level multi-index. I’ll get each of the columns under it, for the row “Definitely will be voting.”
Of course, that’s precisely what we want. Here’s my query:
likely_voter_2024_df.loc[
'Definitely will be voting',
'Race/Ethnicity Category']
I asked for a single row and a single column, but the column was the top level of a multi-index. As a result, I get back not a single value, but a series: all of the values in the row I requested, one per column in the second level of the multi-index.
How, though, can I then find out which group has the highest percentage? I could find the max value, and then find the index associated with that max value. Or I can just use the “idxmax” method, which does that for me, returning the index of the maximum value:
likely_voter_2024_df.loc['Definitely will be voting', 'Race/Ethnicity Category'].idxmax()
The result that I get back is “White”.
Which race/ethnicity has the highest percentage saying that they definitely won’t be voting? I’ll retrieve a different row, and again ask for the idxmax:
likely_voter_2024_df.loc["Definitely won’t be voting", 'Race/Ethnicity Category'].idxmax()
Here’s the series for that row, so that you can check the result:
Likely voter 2024
White 0.073537
Black 0.153782
Hispanic 0.146848
Mixed 0.065671
AAPI 0.044558
Name: Definitely won’t be voting, dtype: float64
The result here is “Black.”