13 min read · Tags: plotting datetime filtering csv regular-expressions correlations

BW #107: Consumer confidence (solution)

Get better at: CSV files, dates and times, filtering, regular expressions, correlations, and plotting


This week, we're looking at data from the University of Michigan's long-running survey of consumer confidence (http://www.sca.isr.umich.edu/). The survey asks a large number of questions about how Americans feel about the economy as it stands now, and also where it's going. And when they came out with their latest results last week, it made a lot of headlines (https://edition.cnn.com/2025/02/21/economy/consumer-sentiment-inflation-tariffs/index.html?cid=external-feeds_iluminar_msn), because it marked the largest drop in consumer confidence in quite some time.

This comes as no surprise: Americans are still concerned about inflation, with egg prices a consistent topic. But other economic trouble would seem to be coming as well, between corporate layoffs (and limited hiring), canceled government contracts, large-scale government layoffs, deportations of immigrants, and proposed tariffs on the biggest trading partners with the US.

Data and seven questions

The data comes from the "Surveys of Consumers" site at the University of Michigan. The university performs a number of different surveys, the best-known of which measures consumer sentiment via three measures: the ICS (Index of Consumer Sentiment), the ICC (Index of Current Economic Conditions), and the ICE (Index of Consumer Expectations).

The numbers are relative figures, so a measure of 80 is better than one of 70, but worse than one of 90. The idea is that we can take measurements at two points in time and compare the sentiments.

The data itself can be downloaded from the overall data page, at

https://data.sca.isr.umich.edu/data-archive/mine.php

We'll be looking at a few different parts of the data, so you'll want to download the "all tables" data in CSV format.

You can also get a data dictionary (the "Time series variable codebook"), explaining the column names and what they measure, from

https://data.sca.isr.umich.edu/technical-docs.php

I gave you seven tasks and questions. Learning goals include working with date-time data, regular expressions, filtering, resampling, window functions, and plotting. (All of these topics are tagged at the top of the page; click on a tag to see all BW newsletters that use the topic.)

If you're a paid subscriber, then you'll be able to download the data directly from a link at the bottom of this post, download my Jupyter notebook, and click on a single link that loads the notebook and the data into Google Colab, so you can get to work experimenting right away.

Here are my seven tasks and questions:

Read the CSV file into a data frame. Create a date column (with a datetime dtype) based on the Month and yyyy columns, and make it the index. Keep only the columns that start with ics, ice, icc, or px1.

First, we'll load up Pandas:

import pandas as pd

With that in place, we can try to load the CSV file using read_csv:

# filename should point to the downloaded "all tables" CSV file
df = (
    pd
    .read_csv(filename)
)

Already, we hit a snag: We don't get any errors, but if we invoke df.shape, we find out that there are 206 rows and just 1 column.

This is because the first line of the file contains only the text "All Tables". And because every line of a CSV file is supposed to contain the same number of columns, read_csv ends up treating the entire file as having a single column, which is quite obviously wrong.
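One quick way to see what's going on, assuming the downloaded file is available locally as filename, is to peek at its first few raw lines:

# Print the first few raw lines of the CSV file; the stray "All Tables"
# line appears before the real header row
with open(filename) as f:
    for _ in range(3):
        print(f.readline().rstrip())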

To skip the first line (i.e., index 0) and use the second line (i.e., index 1) as the actual list of headers with the correct column count, we can pass the header=1 keyword argument to read_csv:

df = (
    pd
    .read_csv(filename, header=1)
)

When we check df.shape, this gives us 205 rows and 360 columns. Yes, that is a very large number of columns, and we'll soon pare that down. But from a quick visual inspection, as well as this numeric result, we can see that we've read things in correctly.

We're dealing with time-related data, and I wanted to get a single date column with a datetime dtype that we could assign to the index. Fortunately, the data set comes with a four-digit year column (yyyy) and a 1- or 2-digit month column (Month). How can we turn these two values into a single datetime column?

First, we'll use assign to add a new column to our data frame. We could theoretically provide a column or a simple expression, but we need something a bit more complex here: we want to invoke pd.to_datetime on a string that we create from the year and month columns. We can thus pass a lambda as the value of assign's keyword argument; inside of the lambda, we invoke pd.to_datetime on the result of combining yyyy and Month, joined by a '-' character.

What about the day of the month? Fortunately, pd.to_datetime doesn't seem to care, and assigns the first day of the month. Once we've created the date column, we can then invoke set_index and turn it into the data frame's index. Since I'll be doing some time-related actions that can benefit from having a sorted index, I also invoke sort_index:

df = (
    pd
    .read_csv(filename, header=1)
    .assign(date = lambda df_: pd.to_datetime(df_['yyyy'].astype(str) + '-' + df_['Month'].astype(str)))
    .set_index('date')
    .sort_index()
)
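As a quick sanity check of that day-of-month behavior, we can call pd.to_datetime on a year-month string by itself (the specific value here is just an illustration):

# With no day in the string, to_datetime should default to the first of the month
pd.to_datetime('2024-7')    # expect Timestamp('2024-07-01 00:00:00')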

We now have a perfectly reasonable data frame, with all of the data from the CSV file and a datetime index. However, we aren't going to use all of those columns. Removing unneeded columns is an important way to save memory, and it also makes it easier to keep track of what's going on in your data frame, especially if it's fairly large.

I thus asked you to keep only those columns that start with one of several three-letter strings, for the measurements we want to have.

To do this, I like to use the filter method, which lets us choose columns (based on the column names) or rows (based on the index). We provide a pattern, often a regular expression, which describes the names that we want to keep.

Here, I wanted those columns that started with ics, icc, ice, or px1. I thus passed a regular expression via the regex keyword argument. I used ^ to indicate that the regular expression needed to be "anchored" to the start of the string. I then used | to activate "alternation," meaning that we want one of the substrings to be found immediately after the start of the string:

df = (
    pd
    .read_csv(filename, header=1)
    .assign(date = lambda df_: pd.to_datetime(df_['yyyy'].astype(str) + '-' + df_['Month'].astype(str)))
    .set_index('date')
    .sort_index()
    .filter(regex=r'^(ics|icc|ice|px1)')
)
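If you want to convince yourself of what that pattern does and doesn't match, a quick check with Python's re module can help. (The column names below are made up for illustration; filter uses re.search-style matching, which is why the ^ anchor matters.)

import re

pattern = r'^(ics|icc|ice|px1)'

# Only names that begin with one of the four prefixes should match;
# 'age_icc' contains icc, but not at the start, so it is dropped
for name in ['ics_all', 'px1_mean', 'income_all', 'age_icc']:
    print(name, bool(re.search(pattern, name)))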

The result? A data frame with 205 rows and 30 columns. Each row represents a single month of data as released by the University of Michigan survey.

What are the most recent month-to-month percentage changes in overall ("all") ICS, ICC, and ICE? What are the most recent year-to-year percentage changes in these three values?

First, we'll limit the data to the three columns containing the data we want:

(
    df
    [['ics_all', 'icc_all', 'ice_all']]
)

The diff method calculates how much each value in a data frame differs from the value above it. The first row contains NaN values, and each subsequent row contains positive or negative numbers.
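For example, applying diff to our three columns would show the absolute month-to-month differences:

(
    df
    [['ics_all', 'icc_all', 'ice_all']]
    .diff()
)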

However, I didn't ask you to find the absolute change over each month. Rather, I asked you to find the percentage change from month to month. To do that, we can instead use pct_change, which does the same thing as diff, but reports percentage changes:

(
    df
    [['ics_all', 'icc_all', 'ice_all']]
    .pct_change()
)

We're interested in the most recent set of values. Since we have sorted the data frame by date, this means that the final row will be the most recent one. We could use iloc[-1] to get the final values, with an index of -1 indicating that we want to count from the end.
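That approach would look something like this:

(
    df
    [['ics_all', 'icc_all', 'ice_all']]
    .pct_change()
    .iloc[-1]
)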

But iloc will return the values alone, and I want to see the date (i.e., the index) for the final row. I thus prefer to use tail, which returns both the index and the values:


(
    df
    [['ics_all', 'icc_all', 'ice_all']]
    .pct_change()
    .tail(1)
)

We can see that the ICS score has declined by 3 percent, the ICC score hasn't changed at all, and the ICE score has gone down by about 5 percent.

As for year-on-year numbers, we can do the same thing; we just need to pass the periods=12 keyword argument to pct_change, so that it compares each value with the one 12 rows before it, rather than the one immediately above it:

(
    df
    [['ics_all', 'icc_all', 'ice_all']]
    .pct_change(periods=12)
    .tail(1)
)

The numbers I got weren't great, but they weren't as bad as the news reported. Maybe I calculated something wrong?

I grabbed the final (actual) values:

(
    df
    [['ics_all', 'icc_all', 'ice_all']]
    .tail(1)
)

I got 71.7, 75.1, and 69.5, respectively.

I then looked at the home page (http://www.sca.isr.umich.edu/) for the Michigan study. And I found these values ... in the January data, not the February data. In other words, the data file that the survey staff publishes doesn't (yet) include the data that they were talking about with the press. I'm not sure if this is because they don't want it out right away, or (more likely) they're short staffed.

I decided to modify my data frame, adding this most recent data:

import numpy as np
df.loc['2025-02-01'] = np.nan
df.loc['2025-02-01', 
       ['ics_all', 'icc_all', 'ice_all']] = [64.7, 65.7, 64.0]

I first added a new row for the current month (February), assigning all of the values to be np.nan. It's important to use np.nan rather than 0, so that we don't accidentally include the 0 in our calculations. I then overwrote the values for ICS, ICC, and ICE with what I got from the Michigan survey's site.

Actually, this won't work. Or it might appear to work, but it won't – because I used a string for the index, and we need a datetime value. We can use pd.to_datetime to ensure that our index remains consistent, with a single dtype:

import numpy as np
df.loc[pd.to_datetime('2025-02-01')] = np.nan
df.loc[pd.to_datetime('2025-02-01'), 
       ['ics_all', 'icc_all', 'ice_all']] = [64.7, 65.7, 64.0]
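To confirm that the new row was added under the existing datetime index, rather than as a separate string-labeled entry, we can peek at the final rows of the frame:

# The last two rows should be January and February 2025
df[['ics_all', 'icc_all', 'ice_all']].tail(2)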

I then re-ran my month-to-month and year-to-year comparisons, and got precisely the same values as the survey site provided. Here is the month-over-month change for each of these measures:

             ics_all   icc_all   ice_all
date                                    
2025-02-01 -0.097629 -0.125166 -0.079137

And here is the year-over-year change:

             ics_all   icc_all   ice_all
date                                    
2025-02-01 -0.158648 -0.172544 -0.148936

Pardon my use of technical economics jargon here, but my only response to this is: Yikes!