BW #102: WordPress (solution)
Get better at: Grouping, pivot tables, plotting, and joining
[Administrative note: This coming Sunday, I'll be holding office hours about Pandas for all paid subscribers. Ask me any questions you have about Pandas! The full invitation, including Zoom links, will be sent out on Friday morning.]
The open-source world has long been impressed by the wild success of WordPress for online publishing and e-commerce. (Heck, I use it myself, as part of my LernerPython.com course platform, and I'm currently redoing the platform to depend even more on WordPress.)
Lately, the WordPress community has been embroiled in controversy and infighting, with WordPress founder Matt Mullenweg accusing hosting company WPEngine of not supporting the open-source project sufficiently. Things have only gotten worse since the initial accusations several months ago, with a variety of lawsuits and restrictions on who can participate in the open-source WordPress community, among other things.
The Register summarized the latest on January 14th (https://www.theregister.com/2025/01/14/wordpress_leader_matthew_mullenweg_exiles/), and a new site collects articles about this drama (https://mullenweg.wtf/). It remains to be seen just what will happen, but given the huge number of businesses that depend on WordPress — as users, programmers, designers, hosts, and consultants — I expect that it'll work out.
If nothing else, this controversy demonstrates how important it is to set up good open-source governance for any significant project.
Data and six questions
This week's data is a bit different than our usual fare; we're examining CSV files that I created from the Git logs in the WordPress repository.
I created the first logfile with the following command:
❯ git log --pretty=format:'%H~~%ad~~%an~~%ae~~%s' --date=iso > gitlog_basic.csv
The above asks Git to produce a log using a format that I defined:
- The full 40-character commit ID, aka a SHA-1 hash
- The date of the commit
- The name of the committer
- The e-mail address of the committer
- A short version of the commit subject
Note that all of these are available using special % format codes, all documented in the manual for git log. I asked for the date to be formatted in ISO format, which I know that Pandas can handle directly. I then wrote this into a CSV file, gitlog_basic.csv.
But wait a second... what's with the ~~ characters? I used those to separate the fields on each line. I wanted to use commas, but there are commas in the Git subject lines, and getting the quoting to work was a bit more complex than I would have wanted. I tried my old standby, the tab character, but it turns out (!) that there are tabs in the subject lines, too! I decided to go with something that didn't actually show up in the Git log, namely ~~.
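A quick check in plain Python shows why the double tilde works where commas and tabs would fail. The log line here is made up, but it has the same shape as the real ones:

```python
# A made-up log line whose subject contains both a comma and a tab
line = 'abc123~~2024-03-01 10:00:00~~Jane Doe~~jane@example.com~~Fix kses,\tagain'

print(len(line.split(',')))    # 2 -- the comma in the subject breaks the split
print(len(line.split('\t')))   # 2 -- so does the tab

fields = line.split('~~')
print(len(fields))             # 5 -- commit, date, name, email, subject
print(fields[-1])              # the subject survives intact
```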
I also wanted the number of lines added and removed in each of the commits. You can get that from git log by asking for --numstat. With just that option, you then get the full log; we again need to use --format and specify %H to get only the commit hash and the added/removed lines.
But these come out on separate lines:
0f2334da8111913a2a78c5000f27f791bb405f0f
4 2 src/wp-includes/kses.php
72 0 tests/phpunit/tests/kses.php
eb50dd7cbf8bcbeda7521e1c152103d9d0c82009
4 1 src/wp-includes/class-wp-customize-widgets.php
61b7b9713ef6a71126c3739a929ce604762de1bf
8 11 src/wp-admin/includes/class-wp-theme-install-list-table.php
I wrote a Python program that went through each of these lines. Sadly, I found out after publishing yesterday's newsletter that my program had an off-by-one error, such that it was counting the lines that occurred before each commit ID, rather than after. Here is (I hope) a debugged version of the program:
#!/usr/bin/env python3

import sys
import re

output = {}

for one_line in sys.stdin:
    one_line = one_line.strip()

    # ignore blank lines
    if not one_line:
        continue

    # start of new commit
    elif re.search('^[a-f0-9]{40}$', one_line):
        current_commit_id = one_line
        output[current_commit_id] = {'added': 0, 'removed': 0}

    else:  # stats line
        added, removed, *rest = one_line.split()

        if added != '-':
            output[current_commit_id]['added'] += int(added)

        if removed != '-':
            output[current_commit_id]['removed'] += int(removed)

for key, value in output.items():
    print(f'{key}~~{value["added"]}~~{value["removed"]}')
I ran the following command in my Unix shell:
git log --format="%H" --numstat | python summarize-added-removed-lines.py > new-numstats.csv
I've attached the file that it created, along with the main CSV file containing the commit info for WordPress, to this post.
Now that we have these files, we can go through the questions and tasks that I gave to you. As always, you can download my Jupyter notebook from the bottom of this post.
Import the wordpress-gitlog.csv file into a data frame. Make sure the date column has a datetime type. From the email column, create two new columns, email_user and email_domain, from the parts before and after the @ sign in the e-mail address. From the subject column, create a category column containing the category from before the first : character.
Before doing anything else, we'll load up Pandas:
import pandas as pd
I asked you to load the CSV file into a Pandas data frame. To do that, we can use read_csv:
filename = 'data/wordpress-gitlog.csv'
df = (pd
      .read_csv(filename)
     )
I'm usually a fan of trying to read the data using the defaults. If it works, then great! And if not, then I can do a bit more thinking and checking.
In this case, I had actually created the CSV files, and as I indicated above, I even purposely used a double tilde as the separator. So I had to pass the sep keyword argument, giving it a value of '~~'.

The thing is, a two-character separator slows Pandas down a bit, because it requires the use of the Python parsing engine, rather than the more common, faster C-based parsing engine. And Pandas will happily remind you of this with a warning if you don't explicitly say engine='python'. So I added that, too.
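Here's a minimal sketch of the idea, reading the same ~~-separated, header-less format from an in-memory buffer with made-up rows rather than the real WordPress data:

```python
import io

import pandas as pd

# Two fabricated rows in the same ~~-separated, header-less format
buffer = io.StringIO(
    'abc123~~2024-03-01 10:00:00~~Jane Doe~~jane@example.com~~Fix: a bug\n'
    'def456~~2024-03-02 11:30:00~~Joe Coder~~joe@example.org~~Docs: typo, fixed\n'
)

# A two-character sep forces the Python engine; saying so explicitly
# avoids the ParserWarning
mini_df = pd.read_csv(buffer, sep='~~', engine='python',
                      names=['commit', 'date', 'name', 'email', 'subject'])

print(mini_df.shape)   # (2, 5)
```

Notice that the comma inside the second subject line causes no trouble at all, since only ~~ counts as a separator.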
The CSV file doesn't have any column names, so I passed the names keyword argument, which lets me set the names of the columns. This also tells read_csv not to treat the first row as headers.

Finally, I wanted to treat the date column as a date, so I passed parse_dates a one-element list containing the string 'date':
filename = 'data/wordpress-gitlog.csv'
df = (pd
      .read_csv(filename, sep='~~', engine='python',
                names=['commit', 'date', 'name', 'email', 'subject'],
                parse_dates=['date'])
     )
I also asked you to create new columns for the category, the e-mail username (before the @ sign), and the domain (after the @ sign). We will use the categorical information, but the questions that I had planned for working with e-mail addresses weren't as interesting as I had expected.

To create new columns, we can use the assign method, which takes keyword arguments. The key will be used as the name of a column, and the value can be anything – but here, we'll use a lambda expression, performing various actions on our columns.
Each of the three new columns we create is based on an existing string column. Moreover, each of them is based on the fact that we can break the existing string into pieces, and get one of those pieces. How can we do that in Pandas, though?
First, we can use str.split, which works similarly to the builtin Python method of the same name, giving us a series of lists of strings. But that's where things get interesting; given a series containing lists of strings, how can we get the item at index 0 or 1?

If this were a series of strings, then we could use str.get to grab an item via its index, much as we might do with [] in regular Python. But what do we use if we have a list?

The answer: We can use str.get on the lists, too! I'm not sure if this is on purpose, or is an oversight, or is the happy result of a dynamic language with multiple types that implement []. I used str.split on the string to get a list, and then str.get on the list to get a particular element:
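Here's the two-step trick on a toy series of made-up addresses, just to show the behavior in isolation:

```python
import pandas as pd

s = pd.Series(['jane@example.com', 'joe@example.org'])

parts = s.str.split('@')   # a series whose elements are Python lists

print(parts.str.get(0).tolist())    # ['jane', 'joe']
print(parts.str.get(-1).tolist())   # ['example.com', 'example.org']
```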
filename = 'data/wordpress-gitlog.csv'
df = (pd
      .read_csv(filename, sep='~~', engine='python',
                names=['commit', 'date', 'name', 'email', 'subject'],
                parse_dates=['date'])
      .assign(category = lambda df_: df_['subject'].str.split(':').str.get(0),
              email_user = lambda df_: df_['email'].str.split('@').str.get(0),
              email_domain = lambda df_: df_['email'].str.split('@').str.get(-1))
     )
I ended up with a data frame containing 50,596 rows and 8 columns. All of the columns have a dtype of object except for date, which has a dtype of datetime64[ns].
Read the wordpress-numstats.csv file into a data frame. The three columns are the commit ID (SHA-1), the number of lines added in that commit, and the number of lines removed. Join the latter two columns into the main data frame.
Next, I read the numstats file into a data frame. Once again, I used ~~ as my separator, meaning that I had to pass sep and also engine. And because the file doesn't have any headers, I also had to pass the names keyword argument:
(
    pd
    .read_csv('data/wordpress-numstats.csv', sep='~~', engine='python',
              names=['commit', 'added', 'removed'])
)
I then asked you to combine this data frame with the existing one. There are two ways to combine data frames in Pandas, with join and merge. Simply put, you use join when the two data frames share an index, and you use merge when the data frames have common non-index columns. (You can easily use set_index to move a column to the index, if you prefer to use join.)
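To see the difference on something small, here are two made-up frames (not the WordPress data) combined both ways:

```python
import pandas as pd

left = pd.DataFrame({'commit': ['abc123', 'def456'],
                     'subject': ['Fix: a bug', 'Docs: a typo']})
right = pd.DataFrame({'commit': ['abc123', 'def456'],
                      'added': [4, 72],
                      'removed': [2, 0]})

# merge matches on a shared, non-index column
merged = left.merge(right, on='commit')

# join matches on the index, so we move commit there on both sides first
joined = left.set_index('commit').join(right.set_index('commit'))

print(merged.shape)   # (2, 4) -- commit is still a regular column
print(joined.shape)   # (2, 3) -- commit is now the index
```

Either way, each commit's subject ends up alongside its added/removed counts; the only difference is where the matching key lives afterward.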
Here, I decided to use merge, doing so on the commit column in each of my two data frames. I purposely chose the name commit to be the same in both data frames; if they weren't the same, I could still specify them with left_on and right_on. I then took the resulting data frame and assigned it back to df:
df = (
    pd
    .read_csv('data/wordpress-numstats.csv', sep='~~', engine='python',
              names=['commit', 'added', 'removed'])
    .merge(df, on='commit')
)