BW #83: Gasoline prices (solution)

This week, we looked at the price of gasoline in the United States – both because it has declined over the last year, and because it is often considered a major factor in presidential elections. The New York Times reported on the lower gas prices just yesterday (https://www.nytimes.com/2024/09/11/business/gas-prices.html?unlocked_article_code=1.J04.b2Z9.5bY7k7zsD7_f&smid=url-share), and we can expect that this topic will continue to command attention for the next two months.

This week, we'll take a look at some of the historical data regarding gasoline prices in the United States. We'll not only analyze the data using Pandas, but we'll produce some nice-looking output using Great Tables (https://pypi.org/project/great-tables/), whose project leads were recently interviewed on the Real Python podcast (https://realpython.com/podcasts/rpp/214/).

Great Tables assumes that you have already analyzed the data, and now want to present it on a Web site or in a publication. It makes it easy to spruce up your formatted output in a wide variety of ways, without changing the data itself. You can think of it as the tabular equivalent of Matplotlib or (even better) Seaborn. After using it to answer this week's questions, you'll have a good sense of how it works and how you can start to use it.
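If you haven't seen Great Tables before, here is a minimal sketch of the kind of code it expects, using a tiny made-up data frame rather than this week's actual data (the names and numbers are placeholders):

from great_tables import GT
import pandas as pd

# A tiny, invented data frame of average gasoline prices by region
prices_df = pd.DataFrame({'region': ['West Coast', 'Gulf Coast'],
                          'avg_price': [2.89, 2.21]})

# Wrap the data frame in a GT object, then layer on presentation
# choices; the underlying data isn't modified.
(
    GT(prices_df)
    .tab_header(title='Average gasoline prices',
                subtitle='Dollars per gallon')
    .fmt_currency(columns='avg_price', currency='USD')
)

As with Matplotlib and Seaborn, you build up the final result by chaining additional method calls onto the same object.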

Data and six questions

Our data comes from the US Energy Information Administration (https://www.eia.gov/), which tracks (among other things) the prices of different types of gasoline, in different areas of the country. You can download the data from their main report page, which is updated weekly:

https://www.eia.gov/petroleum/gasdiesel/

The data itself is in an Excel spreadsheet you can download by clicking on the "full history XLS" link next to the "regular gasoline prices" table. Or you can download it directly from here:

https://www.eia.gov/petroleum/gasdiesel/xls/pswrgvwall.xls

This week, I gave you six tasks and questions. As usual, a link to the Jupyter notebook I used to solve these problems is at the bottom of the post.

Create a data frame from the "Data 1" tab in the Excel file. The index should be the "Date" column. Shorten the other column names by removing the text that's common to them.

Let's start by loading Pandas:

import pandas as pd

With that in place, we can load the data from the Excel file using read_excel, which returns a data frame:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

In the simplest form, read_excel returns a single data frame from an Excel file. But in this case, the Excel file contains a number of tabs, each of which could be considered its own data structure. We only want one of these, the tab labeled "Data 1", with prices of regular gasoline. We can pass the sheet_name keyword argument with either a string value ("Data 1") or the integer index of the tab. I chose to use the index.

I also wanted the "Date" column to be turned into the data frame's index. In order to do that, we need to pass the index_col keyword argument, either naming or providing the numeric index of the column we want to use.

We can try this:

filename = 'pswrgvwall.xls'
df = pd.read_excel(filename,
                   sheet_name=1,
                   index_col='Date')

But if we do, the data will be all messed up. Or it won't work at all. That's because there are some explanatory notes and references in the first two rows of the file. We can tell Pandas to ignore those first two lines, and start parsing on the third line (i.e., index 2) by passing the header keyword argument, along with a value of 2:

df = pd.read_excel(filename,
                   sheet_name=1,
                   header=2,
                   index_col='Date')
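
If you find the tab's name more readable than its position, you can pass it instead of the integer; this should produce the same data frame:

df = pd.read_excel(filename,
                   sheet_name='Data 1',
                   header=2,
                   index_col='Date')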

The good news is that we now have a data frame whose columns all have a dtype of float64, containing the data from the Excel spreadsheet.

But wait a second: The index (Date) that we read in from the file contains datetime values. Don't we normally need to use the parse_dates keyword argument in order to turn strings into datetime values? Yes, we do. But this column in the Excel file actually contained datetime values. When read_excel read the values into a data frame, it kept the data type that Excel gave it, without any need to explicitly convert the values.

We end up with a data frame with 1,778 rows and 20 columns. Each row represents one week of gasoline pricing data, and each column represents a region of the United States (or an average of the entire country) from which the data was drawn.
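
We can double-check these claims directly; with the copy of the file that I downloaded, the output reflected the numbers above, but the row count will grow as the EIA publishes new weekly data:

print(df.shape)            # (1778, 20) with my copy of the data
print(type(df.index))      # DatetimeIndex
print(df.dtypes.unique())  # [dtype('float64')]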

This worked fine, but if I ask to see df.columns, I get the following:

Index(['Weekly U.S. Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly East Coast Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly New England (PADD 1A) Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Central Atlantic (PADD 1B) Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Lower Atlantic (PADD 1C) Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Midwest Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Gulf Coast Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Rocky Mountain Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly West Coast Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Colorado Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Florida Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly New York Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Minnesota Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Ohio Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Texas Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Washington Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Cleveland, OH Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Denver, CO Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Miami, FL Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)',
       'Weekly Seattle, WA Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)'],
      dtype='object')

While it's good to have column names that clearly indicate the data they contain, these seem rather long, especially since so much of their text is repeated. I asked you to cut out the repetitive text. How can we do that?

Remember that we can retrieve the index from a data frame with the index property. And we can assign back to that property, giving it a list or other iterable, assuming that it's the right length. The same is true for the columns property, which (somewhat confusingly) is also an "index" object. I thus decided to use a list comprehension, iterating over the column names, invoking plain ol' Python string methods (str.removeprefix and str.removesuffix) on each of the column names:

df.columns = [(one_column
               .removesuffix('Regular Conventional Retail Gasoline Prices  (Dollars per Gallon)')
               .removeprefix('Weekly')
               .strip())
              for one_column in df.columns]
df.head()

And yes, the lines are a bit long and out of control... but so are the column names!

Also notice that I used str.strip after removing those prefixes and suffixes, to ensure that there isn't extra leading or trailing whitespace.

After performing this surgery on the column names, we have the following:

Index(['U.S.', 'East Coast', 'New England (PADD 1A)',
       'Central Atlantic (PADD 1B)', 'Lower Atlantic (PADD 1C)', 'Midwest',
       'Gulf Coast', 'Rocky Mountain', 'West Coast', 'Colorado', 'Florida',
       'New York', 'Minnesota', 'Ohio', 'Texas', 'Washington', 'Cleveland, OH',
       'Denver, CO', 'Miami, FL', 'Seattle, WA'],
      dtype='object')

I find this easier to deal with and understand. We could have removed the parenthesized PADD text in a few of the columns, but I think that we did enough to make the column names easy to read and understand for now.
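
If we did want to go one step further, one option would be a regular-expression replacement on the columns, taking advantage of the fact that an index object supports the str accessor. This is just a sketch; I didn't apply it in my solution:

# Remove text such as ' (PADD 1A)' from the column names
df.columns = df.columns.str.replace(r'\s*\(PADD \w+\)', '', regex=True)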

In which region of the US have gas prices been, on average, the highest since 2000?

The United States is a big country, and it's well known that some areas are more expensive than others. I asked you to take the average price of gas for each region since 2000, and show which areas are the most expensive.

First, let's choose only those rows from 2000 and onward. Because our data frame has a datetime index, we can use some tricks to select certain rows. For example, if we use .loc with a string to select from a datetime index, we can retrieve all of the rows matching that datetime.

But if our string contains only part of a date – the year, or the year and the month – then .loc will match every row that agrees on those components, regardless of the finer-grained ones. So if you include the year and month, .loc will retrieve rows with that year and month, regardless of the day. And if you include only the year, .loc will retrieve rows with that year, regardless of the month and day.
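
For example, with our weekly data, df.loc['2024-06'] would retrieve every row from June 2024, and df.loc['2024'] would retrieve every row from 2024. (The exact rows you get back depend, of course, on when you downloaded the file.)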

Moreover, we can use an open-ended slice in .loc. So if I look for .loc['2000':], this will return all rows with a year of 2000 or later:

(
    df
    .loc['2000':]
)

We now have a shorter (but just-as-wide) data frame than before, limited to rows from 2000 and onward. We want to calculate the mean of each column; fortunately, invoking mean does this, returning a series of floats, one for each column. (The series index is the same as df.columns.)

(
    df
    .loc['2000':]
    .mean()
)

If we want to get the 10 most expensive locations, we have two basic options. One is to use sort_values to sort the series. If we pass the ascending=False keyword, then we can use head(10) to get the 10 most expensive locations:

(
    df
    .loc['2000':]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

There's nothing wrong with this, but I always wonder if it might be faster to run nlargest, which both sorts and returns a set number of records:

(
    df
    .loc['2000':]
    .mean()
    .nlargest(10)
)

The good news is that I got the same answer in both cases:

Seattle, WA                   3.206568
Washington                    3.151362
West Coast                    2.886687
Miami, FL                     2.866249
New York                      2.774159
Florida                       2.754303
Cleveland, OH                 2.747500
Ohio                          2.729495
Central Atlantic (PADD 1B)    2.719431
New England (PADD 1A)         2.666187
dtype: float64

However, the two approaches didn't take the same amount of time when I measured them with timeit. Using nlargest took 588 μs, whereas sort_values and head took only 371 μs, about 37 percent less time.
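
One way to run such a comparison in a Jupyter notebook is with the %timeit magic; your timings will almost certainly differ from mine:

%timeit df.loc['2000':].mean().nlargest(10)
%timeit df.loc['2000':].mean().sort_values(ascending=False).head(10)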

Regardless, we can see that Seattle has had, on average, much higher gas prices than other areas of the country.