BW #69: Election participation (solution)
[Another late issue! Lots of training and meetings about training today. Also, this issue took a lot of time to write and analyze. But if you’ve always been curious about DuckDB, how its queries stack up to Pandas, and where it’s faster and slower — I think that the wait was worthwhile, with the longest analysis and description section I’ve ever written, at 4,500 words. Thanks, as always, for your support!]
The last week has been full of election news, from Donald Trump’s new status as a convicted felon, to the upcoming elections in the United Kingdom, to the election of a new president of Mexico, to weaker support in India for Narendra Modi in what will likely be his third term as prime minister.
Rather than look at one particular country’s election data, I thought it might be appropriate to analyze data across a large number of countries. Moreover, I decided that this would be a good time to look into DuckDB, an in-process analytical database that has been growing steadily in popularity over the last few years. It was mentioned on a number of occasions at PyCon US, and I thought that this might give us a good opportunity to look at DuckDB, comparing its syntax, capabilities, and speed with what we have in Pandas.
Traditional relational databases divide tasks between a server (where the data is stored, and where queries are executed) and a client (where the queries are formulated). In the simplest case, you have two processes, one server and one client. However, you can have multiple clients submitting queries to a single server. And it’s now common to have multiple servers working together. You can thus have a large number of processes, often located on different computers, communicating with one another.
By contrast, DuckDB doesn’t require a separate process at all. It’s an in-process database, meaning that it is loaded as a library into whatever program wants to use it, without any external connections. Granted, the host process will then grow in size to accommodate DuckDB’s functionality, as well as the data that you’re analyzing. But you won’t have the overhead of transferring data between processes.
Moreover, DuckDB works with a variety of in-memory data structures, including Pandas data frames. You can thus load a CSV or Excel file into DuckDB — but you can also perform SQL “select” queries directly from a data frame.
A quick tutorial on DuckDB is here: https://duckdb.org/2021/05/14/sql-on-pandas.html. If you have any experience with SQLite or PostgreSQL, you’ll likely feel at home with DuckDB’s functionality and syntax. And if you haven’t ever used SQL before, then DuckDB isn’t a bad way to start.
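Just to give a taste of what that looks like, here is a minimal sketch (with a tiny, made-up data frame; we’ll work with real data below):

import duckdb
import pandas as pd

# a toy data frame, just to show that DuckDB can query it directly
toy_df = pd.DataFrame({'country': ['A', 'B', 'C'],
                       'turnout': [71.2, 88.5, 64.0]})

# DuckDB finds the "toy_df" variable in the current scope and treats it as a table
duckdb.query('SELECT country, turnout FROM toy_df WHERE turnout > 70').to_df()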
Data and five questions
This week's data comes from International IDEA's voter turnout database, whose home page is here:
https://www.idea.int/data-tools/data/voter-turnout-database
Go to that page, and click on the "Export data" button. This will download an Excel file. We're interested in the first ("All") sheet in the document.
I gave you five questions and tasks this week, many of which asked you to perform the same task twice — once with Pandas, and a second time with DuckDB. The point was both to compare the syntax and to get a sense of whether DuckDB would offer any performance help. To some degree, I knew going into these tests that they would be unfair, because the data set is relatively small; I will soon be doing some tests on a much larger data set, to see how much that affects the results.
Here are my questions for this week. As always, a link to my Jupyter notebook is at the end of the newsletter.
Read the Excel file into Pandas as a data frame. However, turn the "Year" column into a datetime column, the numeric columns into float (removing "%" and "," characters), and the "Election type" and "Compulsory voting" columns into categories. How much memory do we save in making such changes?
Let’s start off by loading Pandas:
import pandas as pd
Next, because it’s an Excel file, I’ll use “read_excel” to load it into Pandas:
filename = 'idea_export_voter_turnout_database_region.xlsx'
df = pd.read_excel(filename, sheet_name='All')
Note that I pass the “sheet_name” keyword argument, allowing me to indicate which sheet (or multiple sheets, if I want) I would like to get back as a data frame. It might seem like I’m somehow asking for all of the sheets in the Excel file to be returned as a data frame, but actually, the first sheet has a name of “All”. We thus get back a data frame containing 3,649 rows and 13 columns.
How much memory does this take up? We can find out by invoking “memory_usage” on our data frame, which returns the number of bytes used by each column. We can then use “sum” to find the total amount of memory used by the data frame.
Note that we need to pass the “deep=True” keyword argument in order to ensure that Python-object columns are calculated correctly:
df.memory_usage(deep=True).sum()
The result? 2,622,423 bytes, or about 2.5 MB.
I’ve found that data read from a CSV file usually needs some help and hints to determine the dtype of at least a few columns. Excel files, by contrast, support a rich set of data types, most of which map quite naturally to the dtypes in Pandas. Let’s check what dtypes we got, by looking at the data frame’s “dtypes” attribute:
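df.dtypes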
Country object
ISO2 object
ISO3 object
Election Type object
Year object
Voter Turnout object
Total vote object
Registration object
VAP Turnout object
Voting age population object
Population object
Invalid votes object
Compulsory voting object
dtype: object
Ah, I see — they are all of “object” dtypes, meaning that they were all imported as strings. That’s unusual, and I’m guessing that this is because the data was originally in a textual format (e.g., CSV), and was imported into Excel without doing any cleaning or setting of types.
It’s tempting to say that we can resolve this situation by just passing a few arguments to “read_excel”, telling it to parse the “Year” column as a date, and many of the rest as floats. But that’s sadly impossible, because so many of the number-wannabe columns contain percent signs and commas, which aren’t legal in numeric values.
So we’ll have to go through a bunch of columns, removing non-digit characters, and then turning them into float columns. (Why float? Because they might well contain NaN values. Although as we saw last week, we could use nullable types…) That’s annoying, but it works.
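(Had I wanted to go the nullable route, the conversion might have looked something like this sketch for one column, using Pandas’ nullable “Float64” dtype:)

df['Voter Turnout'] = (df['Voter Turnout']
                       .str.replace(r'[,%]', '', regex=True)
                       .astype(float)        # regular float64 first
                       .astype('Float64'))   # then Pandas' nullable float dtype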
I’ll thus go back to loading the Excel file, but with an important addition to our call to “read_excel”:
df = pd.read_excel(filename,
                    sheet_name='All',
                    parse_dates=['Year'])
Next, I’ll define the list of columns to convert, and iterate over them, invoking “str.replace” on each to remove commas and percent signs, followed by “astype('float')”, then assigning the result back to the original column:
convert_to_float = ['Voter Turnout', 'Total vote', 'Registration',
                    'VAP Turnout', 'Voting age population',
                    'Population', 'Invalid votes']

for one_colname in convert_to_float:
    df[one_colname] = (df
                       [one_colname]
                       .str.replace(r'[,%]', '', regex=True)
                       .astype('float')
                      )
Finally, I took two columns that contained (repeated) text, and made each a bit more efficient by turning them into Pandas categories.
convert_to_category = ['Election Type', 'Compulsory voting']
for one_colname in convert_to_category:
    df[one_colname] = df[one_colname].astype('category')
After all this, which made the data frame more robust, have we also managed to make it smaller? I again run
df.memory_usage(deep=True).sum()
The new size is 829,452 bytes, just under a third of the original size. Which means that getting our data into good dtypes not only ensures that we can perform the calculations we want and need, but also saves us a lot of memory.
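Just to spell out the arithmetic behind that claim (pasting the two totals in by hand):

before_bytes = 2_622_423   # all-object columns
after_bytes = 829_452      # after fixing the dtypes
print(f'{after_bytes / before_bytes:.1%}')   # prints 31.6%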
Here are the final dtypes:
Country object
ISO2 object
ISO3 object
Election Type category
Year datetime64[ns]
Voter Turnout float64
Total vote float64
Registration float64
VAP Turnout float64
Voting age population float64
Population float64
Invalid votes float64
Compulsory voting category
dtype: object
Using Pandas, which five countries with non-compulsory voting have the highest mean VAP turnout? Now use DuckDB to calculate the same thing. How long did each query take?
I’ve often heard people complain that too few people vote in elections. I was thus curious to know in which countries the greatest proportion of voters do actually vote. However, I had to remove countries in which there is mandatory voting, since that would clearly skew the numbers.
To calculate this in Pandas, we’ll first remove rows referring to countries in which there is compulsory voting. I did this using a combination of “loc” and “lambda”, looking for rows in which the “Compulsory voting” column had a “No” value:
(
    df
    .loc[lambda df_: df_['Compulsory voting'] == 'No']
)
Next, I performed a “groupby” operation, asking for the mean value of “VAP Turnout” for each unique value in the “Country” column:
(
    df
    .loc[lambda df_: df_['Compulsory voting'] == 'No']
    .groupby('Country')['VAP Turnout'].mean()
)
This produced a series of 185 elements, with country names as the index and the mean voting percentage for each country as the values. However, I was only interested in the five countries with the greatest percentage. One option would be to use “nlargest”, but I decided here to use a combination of “sort_values” and “head”. Note that I had to indicate that I wanted the sort to be in descending order:
(
    df
    .loc[lambda df_: df_['Compulsory voting'] == 'No']
    .groupby('Country')['VAP Turnout'].mean()
    .sort_values(ascending=False)
    .head(5)
)
The result was:
Country
Croatia 315.038800
North Macedonia, Republic of 293.369565
Somalia 129.466667
Cook Islands 100.640000
Viet Nam 96.898000
Name: VAP Turnout, dtype: float64
Um, wait a second. How can there be 315 percent voter turnout? We might be aiming for high turnout, but that seems a bit … high, no?
The FAQ at https://www.idea.int/data-tools/data/voter-turnout-database indicates that VAP (“voting age population”) figures, along with turnout and other numbers, are often estimates, and that they aren’t always updated in sync. That said, we can probably assume that voter turnout in Croatia and North Macedonia is quite high, even if not quite 315 percent.
What about the next three? Well, Somalia isn’t exactly a bastion of open democracy. The Cook Islands have a population of 15,000 people, so it seems likely to me that they have high turnout rates. And Vietnam … it’s a one-party state, so I’m not sure who people are voting for, but it’s good to know (I guess) that they’re voting in high percentages.
Bottom line, having a very high percentage of the public coming to vote doesn’t necessarily mean that you’re a model democracy.
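(By the way, the “nlargest” alternative that I mentioned above would have looked something like this, and should return the same five countries:)

(
    df
    .loc[lambda df_: df_['Compulsory voting'] == 'No']
    .groupby('Country')['VAP Turnout'].mean()
    .nlargest(5)
)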
How long did it take Pandas to perform this query? I used the “%%timeit” magic command in Jupyter, and found that it took, on average, 1.04 ms to execute my query.
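The timed cell looked more or less like this, with the magic command on its first line:

%%timeit
(
    df
    .loc[lambda df_: df_['Compulsory voting'] == 'No']
    .groupby('Country')['VAP Turnout'].mean()
    .sort_values(ascending=False)
    .head(5)
)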
What about DuckDB? How do we even use DuckDB, for that matter?
DuckDB is, as I wrote above, an in-process relational database that uses SQL. You use standard SQL queries to create tables, update them, and retrieve from them; by default, the tables all reside in memory. Moreover, it’s a columnar database, meaning that (like Pandas) it is structured primarily along columns, rather than along rows, as traditional transactional databases are. It’s not meant for high-speed transactions; rather, DuckDB is designed for high-performance, in-process analysis of data.
Because DuckDB is in the same process, it has access to all of the objects in memory. A nice side effect, particularly useful to us, is that it can thus query a Pandas data frame as if it were a database table. Pandas queries are often similar to SQL, and I’m sure that some database veterans have long wished that they could use SQL to query their data frames. Well, your wish has come true!
You’ll first have to install DuckDB on your computer. On my Mac, I use Homebrew for open-source installations. I thus added it with
brew install duckdb
However, I also needed to install the Python bindings for it. That was equally easy:
pip install duckdb
Inside of my Jupyter notebook, I then wrote:
import duckdb
Once that was done, I was able to query my data frame:
duckdb.query('''SELECT Country, mean("VAP Turnout") AS mvt
                FROM df
                WHERE "Compulsory voting" = 'No'
                GROUP BY Country
                ORDER BY mvt DESC
                LIMIT 5''')
The above SQL query basically does what my Pandas query did:
- I start with SELECT, indicating what columns and values I want to get back. I asked for the country name and the mean of the “VAP Turnout” column, aliased to the name “mvt”. The alias will make it easier to order by that column.
- I indicate that I want to retrieve from “df”. That’s right; I just name the variable in which my data frame is located, and DuckDB takes care of the rest. Wow!
- I add a condition, indicating that I only want rows without compulsory voting. Notice that SQL, unlike Python, distinguishes between single quotes and double quotes: Single quotes are for strings, whereas double quotes allow us to reference tables and columns whose names contain spaces.
- We then group by the country, indicating that we want a result for each unique value in the Country column.
- We then order the results by the “mvt” alias we previously created, in descending order.
- We only want the top 5 results.
Those results are:
┌──────────────────────────────┬────────────────────┐
│ Country │ mvt │
│ varchar │ double │
├──────────────────────────────┼────────────────────┤
│ Croatia │ 315.03880000000004 │
│ North Macedonia, Republic of │ 293.3695652173913 │
│ Somalia │ 129.46666666666667 │
│ Cook Islands │ 100.64 │
│ Viet Nam │ 96.898 │
└──────────────────────────────┴────────────────────┘
We basically got the same results, although with some slight differences in how the floating-point values are displayed. Notice that we get the results back as a DuckDB relation, printed in its own tabular format. Want to get back a Pandas data frame? No problem; just call “to_df”:
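In other words, the same query as before, with “.to_df()” tacked onto the end:

duckdb.query('''SELECT Country, mean("VAP Turnout") AS mvt
                FROM df
                WHERE "Compulsory voting" = 'No'
                GROUP BY Country
                ORDER BY mvt DESC
                LIMIT 5''').to_df()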
Country mvt
0 Croatia 315.038800
1 North Macedonia, Republic of 293.369565
2 Somalia 129.466667
3 Cook Islands 100.640000
4 Viet Nam 96.898000
And how long does DuckDB take to do this calculation? 2.22 ms on average, about twice as long as Pandas.
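As with the Pandas version, this measurement came from a “%%timeit” cell wrapped around the query, something like this sketch:

%%timeit
duckdb.query('''SELECT Country, mean("VAP Turnout") AS mvt
                FROM df
                WHERE "Compulsory voting" = 'No'
                GROUP BY Country
                ORDER BY mvt DESC
                LIMIT 5''')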