BW #94: Strategic Wine Reserve (solution)
Earlier this week, I read an article in FT Alphaville (https://www.ft.com/content/6b8462a4-b7b9-4bc9-af60-4c24e4491791) about the UK's Government Hospitality Wine Cellar (GWC), which the article playfully calls the country's "strategic wine reserve."
The GWC is so good at identifying up-and-coming wine that it is self-funding: The GWC's advisory board sells their older, expensive wine in order to pay for new wine that they anticipate will rise in value over time. Some of the wine they buy is served at government events, but some is there as an investment. The GWC, which has existed since the end of World War I, ensures that people attending government functions will get tipsy on only the finest of vintages.
This week, we'll examine data that the reporters who worked on this story, Robin Wigglesworth (https://www.ft.com/robin-wigglesworth) and Louis Ashworth (https://www.ft.com/louis-ashworth), generously agreed to share with me (minus the proprietary bits). I'm grateful to them not only for writing this fun article, but also for their generosity in sharing this data.
Data and six questions
I posed six tasks and questions for you to answer this week. The learning goals include grouping, sorting data, and plotting with Pandas.
You won't have to work hard to find this week's data; I've enclosed it right here:
Here are my questions; my Jupyter notebook is enclosed at the bottom of this message.
Read the Excel data into a Pandas data frame. How many bottles of wine, according to this data set, does the UK have on hand? How many liters of wine does that work out to?
We start off by loading Pandas:
import pandas as pd
With that in place, we can then load the Excel spreadsheet into Pandas, using read_excel:
filename = 'Strategic Wine Reserve.xlsx'
df = pd.read_excel(filename)
We won't use all of the columns, but I didn't see a reason to remove any of them, since the resulting data frame is relatively small, with 517 rows and 27 columns.
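If you want to verify those numbers yourself, a quick sanity check (just a sketch; nothing later depends on it) is to look at the data frame's shape and columns attributes:
df.shape      # a (rows, columns) tuple -- should be (517, 27)
df.columns    # an Index containing the 27 column names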
I then asked you to find how many bottles of wine are in the GWC cellar. It might at first seem like we should invoke count on the TotalQuantity column:
df['TotalQuantity'].count()
But of course, that just gives us the number of rows in the data frame, and by extension the number of types of wine. If we want to know how many bottles there are, we'll invoke sum, which totals the integers in that column:
df['TotalQuantity'].sum()
The result is 32,259, which is ... quite a lot of wine.
But wait: How many total liters of wine is that? The bottles that I buy are typically 750 ml in volume, but we can check and see what volumes are in the GWC cellar by looking at the Size column:
df['Size'].value_counts()
I invoked value_counts, which tells us how many times each volume appears in the data frame, and got the following:
Size
750ml 493
1.5L 17
375ml 5
500ml 1
1.0L 1
Name: count, dtype: int64
Most of the bottles are the same 750 ml size that I buy, but we see that some are larger and some are smaller. In theory, we can just multiply the TotalQuantity column by the Size column, sum up the result, and we'll get the total volume of wine in the UK's basement.
But of course, it's not as simple as that: The volumes (as you can see) are actually strings, not integers or floats. If we're going to perform a mathematical operation on them, we'll first need to turn them into numbers.
I decided to use the replace method to accomplish this, using a Python dict to indicate what current values (keys) should be translated into (values). Pandas complained when I used floats as the replacement values in replace, so I did it in two steps, first replacing the strings with other strings, and then using astype to convert them into floats:
(
    df['Size']
    .replace({'750ml': '.750',
              '1.5L': '1.500',
              '375ml': '.375',
              '500ml': '.500',
              '1.0L': '1'})
    .astype(float)
)
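As an aside, one alternative sketch (not what I did above) that avoids the two-step replace-then-astype dance is the map method, passing a dict whose values are already floats. Note that map returns NaN for any size string not in the dict, which would also flag sizes we hadn't anticipated:
(
    df['Size']
    .map({'750ml': 0.750,     # map each size string directly to liters
          '1.5L': 1.500,
          '375ml': 0.375,
          '500ml': 0.500,
          '1.0L': 1.000})
)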
Now that we have a new column with a dtype of float, we can multiply it by the TotalQuantity column. Notice that I measured the bottle sizes in liters, such that 1.5 L became 1.500 liters.
We can normally multiply with the * operator, but we can't easily do that in the middle of a method chain. We'll instead use the mul method, which performs the same operation, but as a method. We can then invoke sum and get the total:
(
    df['Size']
    .replace({'750ml': '.750',
              '1.5L': '1.500',
              '375ml': '.375',
              '500ml': '.500',
              '1.0L': '1'})
    .astype(float)
    .mul(df['TotalQuantity'])
    .sum()
)
The total is 24,159.125 liters of wine.
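If you're worried that new bottle sizes might appear in future versions of the cellar data, another approach (again, just a sketch) is to parse the Size strings with a regular expression rather than listing each size explicitly, converting milliliters to liters along the way:
# pull the number and the unit out of strings like '750ml' and '1.5L'
parsed = df['Size'].str.extract(r'(?P<amount>[\d.]+)\s*(?P<unit>ml|L)')
amount = parsed['amount'].astype(float)

# keep liter values as-is, and divide milliliter values by 1,000
liters = amount.where(parsed['unit'] == 'L', amount / 1000)

(liters * df['TotalQuantity']).sum()    # should match the 24,159.125 liters above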
What is the mix of wine colors? Of countries of origin? Looking at the country with the greatest number of wines in the collection, what is the percentage breakdown of regions within that country?
What kinds of wine does the GWC contain? I decided to look at the colors, although you could probably learn as much (or more) from looking at the varieties, too. To find out the mix of colors, we can run value_counts on the Color column (which I definitely expected to be called Colour, but I digress):
df['Color'].value_counts()
I got the following:
Color
Red 327
White 159
Other 29
Rosé 2
Name: count, dtype: int64
We see that most of these wines are reds. The FT Alphaville article mentioned that for many occasions, the government will simply buy a lot of not-too-fancy white wines and then sell them. I'll also guess that red wine appreciates in value more over time. Regardless, reds clearly dominate the collection.
What if we want to make a pie chart from these numbers? (I had intended to ask you to do this, but forgot. Whoops!) We can invoke plot.pie to get such a plot. Note that we don't want to invoke plot.pie directly on the original column, but rather on the output from value_counts:
(
    df['Color']
    .value_counts()
    .plot.pie()
)
There isn't anything wrong, per se, with the result:
Maybe it's just me, but I find it weird to have the "red" portion of the pie colored in blue, the "white" portion in orange, and the "rosé" portion in red. We can specify the colors used by passing a list of strings to the colors keyword argument:
(
    df['Color']
    .value_counts()
    .plot.pie(colors=['red', 'white', 'gray', 'pink'])
)
We get the following pie chart as a result:
The good news is that it worked. The bad news is that the "white" section isn't really visible, because the background is white, too. We can fix this by passing the wedgeprops ("wedge properties") keyword argument to .plot.pie, giving it a dictionary that it should pass along to the underlying Matplotlib implementation. Here, I pass edgecolor and linewidth, ensuring that we get a thin, black line around each wedge of the pie:
(
    df['Color']
    .value_counts()
    .plot.pie(colors=['red', 'white', 'gray', 'pink'],
              wedgeprops={'edgecolor': 'black', 'linewidth': 0.5})
)
The result:
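If you also want each wedge labeled with its percentage, .plot.pie accepts an autopct format string, which it hands to the underlying Matplotlib call. Here's a small, optional addition to the chart above:
(
    df['Color']
    .value_counts()
    .plot.pie(colors=['red', 'white', 'gray', 'pink'],
              wedgeprops={'edgecolor': 'black', 'linewidth': 0.5},
              autopct='%.1f%%')     # write each wedge's percentage inside it
)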
What about the distribution of countries? We can again use value_counts:
df['Country'].value_counts()
We can see that a great deal of the wine is from France:
Country
France 360
United Kingdom 46
Portugal 37
Germany 15
Italy 13
Australia 11
South Africa 8
New Zealand 8
USA 7
Spain 4
Austria 2
Argentina 1
Japan 1
Canada 1
Russia 1
Chile 1
Brazil 1
Name: count, dtype: int64
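Before zooming in on France, I'll note that a lopsided distribution like this is often easier to take in as a horizontal bar chart than as a table; here's a quick sketch using the same value_counts output:
(
    df['Country']
    .value_counts()
    .sort_values()    # smallest counts at the bottom, so France ends up on top
    .plot.barh()      # one horizontal bar per country
)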
France is very clearly the source of most wines in the collection. But French wine is famously tied to a specific region, not to the overall country. What regions do those wines come from?
To find out, we can keep only the rows where the Country is 'France', and then use value_counts on the Region column:
(
    df
    .loc[lambda df_: df_['Country'] == 'France']
    ['Region']
    .value_counts(normalize=True)
)
I used loc and lambda here; when the function that we create with lambda is invoked, it returns a boolean series that loc uses to keep only a subset of the rows – those describing French wines. We can then ask for just the Region column, and then invoke value_counts on it. Here, I passed normalize=True to value_counts, which returns proportions rather than absolute counts:
Region
Bordeaux 0.538889
Burgundy 0.277778
Champagne 0.050000
Rhône 0.047222
Cognac 0.038889
Loire Valley 0.016667
Alsace 0.016667
France 0.008333
Southwest France 0.005556
Name: proportion, dtype: float64
We can thus see that of the French wines stored in the UK's cellar, more than half are from Bordeaux. Only 5 percent are from Champagne, so they might not be celebrating at these events as much as you thought.
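Since this week's learning goals included grouping, I'll mention (as a sketch of an alternative, not what I did above) that groupby can give you the same per-country regional breakdown for every country at once, from which we can then pull out France:
(
    df
    .groupby('Country')['Region']
    .value_counts(normalize=True)    # proportions within each country
    .loc['France']                   # keep only the French regions
)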
If you aren't comfortable hard-coding 'France' as the country, you can calculate the most-common country by invoking mode, which returns the most frequent value in the series:
(
    df
    .loc[lambda df_: df_['Country'] == df_['Country'].mode().iloc[0]]
    ['Region']
    .value_counts(normalize=True)
)
Notice that we can't just compare the result of mode with the Country column. That's because mode returns a series – normally containing one element, but if multiple values are tied for most frequent, they'll all appear. I use iloc here to retrieve just the first one.
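To see why that matters, here's a tiny, contrived example (nothing to do with the wine data) in which two values are tied for most frequent; mode returns both, and iloc retrieves the first:
s = pd.Series(['a', 'a', 'b', 'b', 'c'])
s.mode()            # a series containing both 'a' and 'b'
s.mode().iloc[0]    # 'a', the first of the tied values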