
# BW #111: State taxes (solution)

Get better at: Excel files, cleaning data, regular expressions, multi-indexes, and plotting

This week, we looked at data about state income-tax rates in the United States. Not every state has an income tax, and those that do have a variety of different rates and brackets.

The Tax Foundation (https://taxfoundation.org/) – a think tank that monitors US tax policy, and which is generally against taxes – recently released a new report summarizing income-tax rates in each state. I'm not a tax expert, and I honestly don't find taxes that interesting... but when I saw the spreadsheet, I decided that it was so weird and messy that it was worth examining, if only as a way to practice cleaning data. And cleaning data, while not one of the most glamorous parts of data analysis, is crucially important.

## Data and five questions

This week's data is related to the Tax Foundation's recent report on local income taxes:

https://taxfoundation.org/research/all/state/local-income-taxes-2023/

You can download the Tax Foundation's spreadsheet from here:

https://taxfoundation.org/wp-content/uploads/2025/02/2025-State-Individual-Income-Tax-Rates-and-Brackets-2025.xlsx

The spreadsheet includes a huge number of footnotes; for the sake of simplifying our analysis, we are going to completely ignore these, something you shouldn't do when you're calculating your taxes.

Learning goals this week include working with Excel files, cleaning data, regular expressions, and handling multi-indexes.

If you're a paid subscriber, then you'll be able to download the data directly from a link at the bottom of this post, download my Jupyter notebook, and click on a single link that loads the notebook and the data into Google Colab, so you can get to work experimenting right away.

Here are the five questions for this week:

Read the Excel spreadsheet into a Pandas data frame. The index should contain state names (well, abbreviations) from the "State" column, without the footnotes. Remove the footnote lines at the end. You can remove the columns containing ">" symbols. You can treat both "none" and "n.a." as 0 values. Cells that contain text (i.e., not numbers) can be turned into 0.

Let's start by loading Pandas:

import pandas as pd

In theory, we can import the Excel file into a data frame with read_excel:


filename = 'data/bw-111.xlsx'

df = (pd
      .read_excel(filename)
      )

However, there are a few problems with this:

- The column names in the spreadsheet span two rows, so we need to tell read_excel to build a multi-index on the columns by passing header=[0, 1].
- The footnotes at the bottom of the spreadsheet get read in as if they were data rows, so we cut the import off at the last row of actual data with nrows=217.

Our query thus looks like this:

df = (pd
      .read_excel(filename,
                  header=[0, 1],
                  nrows=217)
     )

This reads the data in, but there's still a lot of work to do if we want to perform calculations with it. There are so many non-numeric values in the spreadsheet that Pandas won't allow us to convert the columns to floats (or even integers).
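To see the problem concretely, here's a minimal sketch: trying to convert the still-messy columns to floats at this stage raises a ValueError.

try:
    df.astype(float)            # fails on text cells, e.g., footnote references
except ValueError as e:
    print(e)                    # e.g., could not convert string to float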

For example, there are lots of references to footnotes inside of parentheses. I decided to use replace to indicate what text in our data frame should be replaced, and with what value(s). In most of these cases, I passed the regex=True keyword argument, allowing me to specify the search string with a regular expression. (Not familiar with regular expressions? Get my regular expression crash course, at RegexpCrashcourse.com.)
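As a quick illustration (with made-up values, not the real spreadsheet): without regex=True, replace matches whole values literally, whereas with it, the pattern is treated as a regular expression:

s = pd.Series(['none', 'n.a.', '5.0'])

s.replace(to_replace='none', value='0')                 # literal match: only 'none'
s.replace(to_replace=r'^n.*$', value='0', regex=True)   # regex match: 'none' and 'n.a.'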

The footnotes are all referenced inside of parentheses. I want to remove the opening (, any characters that might then appear, and then the closing ). I can do that with the r'\(.*\)' regexp, using \ to indicate that the parentheses should be taken literally, rather than used as special ("meta") characters in the regexp string, and writing it as a raw string (the r prefix) so that Python passes the backslashes through untouched. I'll replace those parenthesized expressions with the empty string, '':


df = (pd
      .read_excel(filename,
                  header=[0, 1],
                  nrows=217)
      .replace(to_replace=r'\(.*\)', value='', regex=True)
     )

I then ran a bunch of additional replacements:

- 'none' and 'n.a.' both became '0', per the instructions in the question.
- Cells containing nothing but whitespace became pd.NA.
- Leading and trailing whitespace was stripped from every remaining string.

This does a lot to clean up our file. Here's how things look so far:


df = (pd
      .read_excel(filename,
                  header=[0, 1],
                  nrows=217)
      .replace(to_replace=r'\(.*\)', value='', regex=True)
      .replace(to_replace='none', value='0')
      .replace(to_replace='n.a.', value='0')
      .replace(to_replace=r'^\s*$', value=pd.NA, regex=True)
      .replace(to_replace=r'\s+$', value='', regex=True)
      .replace(to_replace=r'^\s+', value='', regex=True)
     )

I should add that it would almost certainly be possible to combine these calls to replace, and that doing so would likely speed up the query. But I like the idea of building these queries line by line, and then being able to debug them when something goes wrong. I can assure you that many things went wrong when I was creating this series of method calls, and being able to comment lines out and backtrack was invaluable.
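For the curious, here's a sketch of how some of those calls might be combined (I haven't benchmarked it against the real file, and note that the order in which the patterns are applied matters):

combined = (df
            .replace(to_replace=r'\(.*\)', value='', regex=True)
            .replace(to_replace={'none': '0', 'n.a.': '0'})       # one call, via a dict
            .replace(to_replace=r'^\s*$', value=pd.NA, regex=True)
            .replace(to_replace=[r'^\s+', r'\s+$'], value='',     # one call, via a list
                     regex=True)
           )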

Now that we've done the majority of our cleaning, we can try to get our data frame into better shape. For example, we want state names to be our index. We could just assign the state-related column to the index. However, there's an issue: each state's name appears only on the first of the rows describing it. For example, there are three rows for Alabama, but "Ala." is written on only the first of them. There are 10 lines for California, but "Calif." is written on only the first.
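Here's a toy illustration (with made-up values) of how ffill solves this, copying each non-missing value down into the missing cells that follow it:

s = pd.Series(['Ala.', None, None, 'Ariz.'])

s.ffill()
# 0     Ala.
# 1     Ala.
# 2     Ala.
# 3    Ariz.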

We thus have several things to do:

- Create a new "state" column, forward-filling each state's name into the unnamed rows beneath it.
- Set that new column to be the data frame's index.

Here's how we can do it, with a combination of assign, lambda, and then set_index:


df = (pd
      .read_excel(filename,
                  header=[0, 1],
                  nrows=217)
      .replace(to_replace=r'\(.*\)', value='', regex=True)
      .replace(to_replace='none', value='0')
      .replace(to_replace='n.a.', value='0')
      .replace(to_replace=r'^\s*$', value=pd.NA, regex=True)
      .replace(to_replace=r'\s+$', value='', regex=True)
      .replace(to_replace=r'^\s+', value='', regex=True)
      .assign(state=lambda df_: df_[('Unnamed: 0_level_0', 'State')].ffill())
      .set_index('state')
     )

There are a few remaining things to clean up here. For example, we have the text, "7.0% on capital gains income only" in the rows for Washington. After a bit of thought and experimentation, I wrote a regexp that looks for text that contains non-digits and non-decimal points, and replaces the entire thing with an empty string.
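As a toy demonstration of that regexp (using made-up values): any string containing even one character that isn't a digit or a decimal point is wiped out entirely, while purely numeric strings survive:

s = pd.Series(['7.0% on capital gains income only', '0.07'])

s.replace(to_replace=r'^.*[^\d.].*$', value='', regex=True)
# 0            (now the empty string)
# 1    0.07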

I then used drop to remove some columns. Notice that because we have a multi-index on the columns, we need to use a tuple to describe each of the levels. We pass the columns keyword argument to drop, and then give it a list of tuples, i.e., a list of the multi-index names we want to get rid of:


df = (pd
      .read_excel(filename,
                  header=[0, 1],
                  nrows=217)
      .replace(to_replace=r'\(.*\)', value='', regex=True)
      .replace(to_replace='none', value='0')
      .replace(to_replace='n.a.', value='0')
      .replace(to_replace=r'^\s*$', value=pd.NA, regex=True)
      .replace(to_replace=r'\s+$', value='', regex=True)
      .replace(to_replace=r'^\s+', value='', regex=True)
      .assign(state=lambda df_: df_[('Unnamed: 0_level_0', 'State')].ffill())
      .set_index('state')
      .replace(to_replace=r'^.*[^\d.].*$', value='', regex=True)
      .drop(columns=[('Unnamed: 0_level_0', 'State'),
                     ('Single Filer', 'Rates.1'),
                     ('Married Filing Jointly', 'Rates.1')])
     )

Finally, we do a bit more replacing: any remaining non-digit characters are removed, and whatever empty strings are left become '0'.

Then we run astype(float) to convert all of the columns to float values, run ffill to copy any existing value into a subsequent row that contains NaN, and (finally) run drop_duplicates, to remove any duplicated lines:


df = (pd
      .read_excel(filename,
                  header=[0, 1],
                  nrows=217)
      .replace(to_replace=r'\(.*\)', value='', regex=True)
      .replace(to_replace='none', value='0')
      .replace(to_replace='n.a.', value='0')
      .replace(to_replace=r'^\s*$', value=pd.NA, regex=True)
      .replace(to_replace=r'\s+$', value='', regex=True)
      .replace(to_replace=r'^\s+', value='', regex=True)
      .assign(state=lambda df_: df_[('Unnamed: 0_level_0', 'State')].ffill())
      .set_index('state')
      .replace(to_replace=r'^.*[^\d.].*$', value='', regex=True)
      .drop(columns=[('Unnamed: 0_level_0', 'State'),
                     ('Single Filer', 'Rates.1'),
                     ('Married Filing Jointly', 'Rates.1')])
      .replace(to_replace=r'\D+', value='', regex=True)
      .replace(to_replace='', value='0')
      .astype(float)
      .ffill()
      .drop_duplicates()
     )

The result? A data frame with 159 rows and 9 columns, all of which have a float64 dtype. And the state names are in the index.
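A quick sanity check (a sketch, assuming the pipeline above ran without errors) confirms all of this:

print(df.shape)             # (159, 9)
print(df.dtypes.unique())   # [dtype('float64')]
print(df.index[:5])         # the first few state abbreviations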

Whew! This was a fair amount of cleaning... but I knew that it would be a bit rough.

Which state has the highest rate for single filers? Which states have no income tax at all?

Now that we have our data frame in place, we can start to answer some questions. If we want to get the rates for single filers, we can use [], passing it a tuple to represent the levels of the multi-index:


(
    df
    [('Single Filer', 'Rates')]
)

We want to get the maximum value, so we can call max:


(
    df
    [('Single Filer', 'Rates')]
    .max()
)

The problem is that we get the maximum value, but we don't know what index (i.e., what state) is associated with it. For that, we'll have to call idxmax. But then we'll get the state, not the rate.
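For example, calling idxmax on its own returns only the state abbreviation (which, as we'll see below, is 'Calif.'):

(
    df
    [('Single Filer', 'Rates')]
    .idxmax()
)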

If we want both, we should use agg, which lets us pass a list of strings representing the aggregation methods we want to use, as in:


(
    df
    [('Single Filer', 'Rates')]
    .agg(['max', 'idxmax'])
)

The result:

max        0.133
idxmax    Calif.
Name: (Single Filer, Rates), dtype: object

In other words, California's highest tax bracket (which only applies to single filers' income above $1m) is 13.3 percent.

I also asked which states have no income tax at all. We can find that by again retrieving just the single filer rates, and then using loc with lambda to keep only those where the value is 0:

(
    df
    [('Single Filer', 'Rates')]
    .loc[lambda s_: s_ == 0]    
)

We get seven states:

state
Alaska    0.0
Fla.      0.0
Nev.      0.0
S.C.      0.0
S.D.      0.0
Wash.     0.0
Wyo.      0.0
Name: (Single Filer, Rates), dtype: float64
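
If we want just the state abbreviations, rather than the whole series, we can grab the index from that result:

(
    df
    [('Single Filer', 'Rates')]
    .loc[lambda s_: s_ == 0]
    .index
    .tolist()
)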