BW #98: Retail sales (solution)

This week, we looked at retail sales in the United Kingdom, using a data set provided by the Office for National Statistics (ONS, https://www.ons.gov.uk). They track a wide variety of businesses, looking at how much they're selling, how that compares with the previous month, and how it compares with the same month in the previous year. Looking at this data can give us a sense not only of how well businesses are doing overall, but also of which sectors are doing better (and worse).

I thought it appropriate to look at such sales, given that we're approaching the end of the year, when retail stores are doing everything they can to encourage us to buy more.

Data and six questions

This week's data comes from the "Retail Sales Index time series" data set from the ONS, which captures recent retail sales. You can get the data from:

https://www.ons.gov.uk/businessindustryandtrade/retailindustry/datasets/retailsales/current

The most recently updated data files are available under the "Latest version" heading at the top of the page; I downloaded the Excel file via the link there, which is:

https://www.ons.gov.uk/file?uri=/businessindustryandtrade/retailindustry/datasets/retailsales/current/drsi.xlsx

Sadly, there doesn't seem to be a single data dictionary for this data set. The column names tend to be long and descriptive, though, so it's not as much of a problem as you might think. That said, the structure of the data set is a bit odd, as you'll soon see.

Below are my solutions to the six tasks and questions I posed yesterday. You can download my Jupyter notebook via the link at the bottom of this posting.

Read the data from the Excel spreadsheet into a Pandas data frame. Keep the (long!) titles as the column names. Keep only those rows where there is a year and a three-letter month name, such as "2024 NOV" (the most recent one), turn those into datetime values, and make them the index.

To start things off, we'll load Pandas:

import pandas as pd

With that in place, we can load the Excel file into Pandas using the read_excel function. If the Excel file contained multiple tabs, we could use the sheet_name keyword argument to specify which one we want to load. But there's only a single tab here, so we can invoke read_excel with only one keyword argument, index_col, which turns the Title column into the index:

filename = 'drsi.xlsx'

df = (
    pd
    .read_excel(filename, index_col='Title')
)

This works, in that we have a data frame. But we can't really use it, mostly because several rows contain explanatory string values, forcing each of the columns to have a dtype of object and to contain strings (rather than numbers). To get our data into numeric values, let's keep only those rows whose index matches the format I described: a four-digit year, whitespace, and a three-letter month name, such as "2024 NOV".

I like to do this with the filter method, which returns the rows (or columns) whose index matches the pattern that we provide. In this case, I passed the regex keyword argument, giving it a regular expression that describes what I wanted:

  • ^ to anchor the regular expression to the start of the string
  • \d{4} for 4 digits
  • \s+ for one or more whitespace characters
  • \w{3} for three "word" characters (letters, digits, or underscores)
  • $ to anchor the expression to the end of the string

In other words: We want only those rows whose index looks like a four-digit year followed by a three-letter month. Of course, this means that a row whose index is "1234 xyz" would still match. But... let's hope that this won't happen.

My query now looks like this:

df = (
    pd
    .read_excel(filename, index_col='Title')
    .filter(regex=r'^\d{4}\s+\w{3}$', axis='rows')  
)

However, there's still at least one problem, namely that the index contains strings, rather than datetime values. I can run pd.to_datetime on a series of strings and get back datetime values, but the format here isn't one that pd.to_datetime understands on its own. We can pass the format keyword argument, with a string ('%Y %b' in this case) describing the format of our input strings.

I have to be honest and say that I never remember these format codes; I used http://strfti.me/, which I absolutely love, for this purpose.
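For example, here's a quick standalone sanity check (separate from the main chain, just to show what the format string does) that parses one of these labels:

import pandas as pd

# '%Y %b' means: a four-digit year, a space, and an abbreviated
# month name. strptime-style month matching is case-insensitive,
# so "NOV" parses just as well as "Nov" would.
print(pd.to_datetime('2024 NOV', format='%Y %b'))
# 2024-11-01 00:00:00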

The thing is, I want to use method chaining. How can I integrate to_datetime into a method chain, when I want to assign it to the index? I would normally use the assign method for this, but that only works on regular columns, not on an index. Instead, we have to use set_axis, passing it the result from running pd.to_datetime on the contents of df.index.

But ... we encounter a problem there, in that we don't want to use df.index. Rather, we want to use the index from the data frame that we got back after the call to filter, which is different. We could assign the result of df.filter to a new variable, but I really want to use method chaining. How can I do that?

The answer is to use pipe, which lets us run any function we want on a data frame we got through chaining. And in this case, the function will be a lambda (anonymous function), which will receive the data frame output from filter as an argument. That data frame will be assigned to df_, and then we can run set_axis on it:

df = (
    pd
    .read_excel(filename, index_col='Title')
    .filter(regex=r'^\d{4}\s+\w{3}$', axis='rows')
    .pipe(lambda df_: df_.set_axis(pd.to_datetime(df_.index, format='%Y %b'), axis='rows'))
)

We're almost done! All that's left is to turn all of the values in our data frame into floats. We can always use astype on a series to get a new series of a new type; it turns out that data frames have a method of the same name. Assuming that we want all of our columns to have a dtype of float, we can say:

df = (
    pd
    .read_excel(filename, index_col='Title')
    .filter(regex=r'^\d{4}\s+\w{3}$', axis='rows')  
    .pipe(lambda df_: df_.set_axis(pd.to_datetime(df_.index, format='%Y %b'), axis='rows'))
    .astype(float)
)

After all this, I have a data frame with 443 rows and 621 columns. And yes, if that sounds like a lot of columns, it is! It's because this single spreadsheet is trying to do a lot of different things – I'd even say too many different things – and thus each set of columns is appropriate for a different set of rows.
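You can verify the dimensions yourself with the shape attribute, which returns a (rows, columns) tuple:

# With the file I downloaded, this displayed (443, 621); your
# numbers may differ if the ONS has updated the data since then.
print(df.shape)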

The dataset combines a variety of different types of information, each measured in a different way. Among them are (a) percentage change since the previous month, (b) percentage change since the same month one year ago, (c) relative to a base index, seasonally adjusted, (d) relative to a base index, not seasonally adjusted, (e) in absolute terms, typically reported in millions of pounds (£), seasonally adjusted, and (f) in absolute terms, typically reported in millions of pounds (£), not seasonally adjusted. How many columns are in each of these categories?

Next, I asked you to find a number of subsets of our columns. In every case, you'll want to use the df.filter method, which lets you choose a number of rows via the index or columns via their names. df.filter lets you select them in a few ways, but my favorite is to use regular expressions with the regex keyword argument. However, if we're looking for exact text and aren't using metacharacters from regular expressions, we can instead use the like keyword argument, which takes the string literally.
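Here's a minimal sketch contrasting the two, using a tiny made-up data frame (the column names below are invented for illustration; they merely mimic the style of the ONS titles):

import pandas as pd

# A made-up data frame whose column names mimic the ONS style
demo = pd.DataFrame([[1, 2, 3]],
                    columns=['Sales %change previous month',
                             'Sales NSA Index',
                             'Sales SA Index'])

# like= keeps columns whose names contain this exact substring
print(demo.filter(like='previous month', axis='columns').columns.tolist())

# regex= keeps columns whose names match a regular expression --
# here, those ending with the word "Index"
print(demo.filter(regex=r'Index$', axis='columns').columns.tolist())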

Note: If you hear the term "regular expressions" and immediately feel sick to your stomach, you're not alone – but I promise, they're not as difficult as you think. And it's totally worth learning them! Check out my e-mail course, https://RegexpCrashCourse.com, to learn them over a two-week period.

Let's go through each of these selections and how I'll use df.filter plus regular expressions to get what we want:

Percentage change since the previous month

Here, we can look for columns that mention "previous month" in them. After finding those, then we can run len on the columns attribute, which returns the column names:

len(df.filter(like='previous month', axis='columns').columns)

I found 10 such columns.
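By the way, here's a sketch of an equivalent approach (not what I used above) that skips filter entirely: the columns attribute supports the str accessor, so we can count matching names directly:

# str.contains returns one boolean per column name; summing the
# booleans counts how many names contain the substring
df.columns.str.contains('previous month').sum()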

Percentage change since the same month one year ago

Once again, we can look for exact text with like, and then run len on the columns:

len(df.filter(like='%change month a year ago', axis='columns').columns)

I found 12 such columns.

Relative to a base index, seasonally adjusted

Next, things get a bit trickier. I want to find seasonally adjusted indexes. From looking through the columns, it seems like those all have the text sa, followed (not always immediately) by the word index or Index. Sadly, we cannot pass flags (e.g., re.I) to df.filter, so I faked it a bit, looking for .*ndex; by omitting the first letter of "index", the pattern matches whether or not that letter is capitalized.

Note that .* means "any character except newline, zero or more times." So there might be lots of characters between sa and index, or there might be none. (None seems unlikely.)

We might have been able to get away with a regular expression of sa.*ndex, except for one thing – the data set also has many rows that are not seasonally adjusted. And they are marked with nsa. This means that we have to tell Pandas to find column names with sa but not with nsa. For this, I used a negative character class, [^n], which means "any character except for n." That seemed to work:

len(df.filter(regex='[^n]sa.*ndex', axis='columns').columns)

I found 75 columns matching this pattern... but to be honest, I didn't check them that carefully, so I might have some false positives and false negatives.
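If you want to spot-check the matches yourself, it's easy to eyeball a few of the matching column names (a quick sketch; what you see will depend on the version of the file you downloaded):

# Print the first few matching column names, so we can check
# whether the pattern caught what we intended
for colname in df.filter(regex='[^n]sa.*ndex', axis='columns').columns[:5]:
    print(colname)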

Relative to a base index, not seasonally adjusted

This time, we're looking for nsa, which doesn't require our negative character class. However, we still do need the .*, so we'll once again use a regular expression:

len(df.filter(regex='nsa.*ndex', axis='columns').columns)

I found 84 columns matching this pattern.

In absolute terms, typically reported in millions of pounds (£), seasonally adjusted

I found what I needed here by using a negative character class (this time, with capital letters), followed by zero or more characters (.*), followed by the £ symbol:

len(df.filter(regex='[^N]SA.*£', axis='columns').columns)

I found 8 columns matching this pattern.