BW #43: Financial protection (solution)
Americans who are frustrated with a financial product or service can complain to the CFPB. How many complaints do they get a year, and from where? This week, we analyze their complaints database.
[Administrative note: I’ll be holding office hours this coming Sunday. Look for e-mail in the coming day with Zoom details. I look forward to seeing you there!]
This week, we’re looking at the database of complaints sent to the Consumer Financial Protection Bureau (https://www.consumerfinance.gov/, https://en.wikipedia.org/wiki/Consumer_Financial_Protection_Bureau), a US government agency that tries to help consumers who have had problems with financial products and services. The database is large (about 3 GB) and describes a wide variety of complaints over more than a decade, giving us a chance to explore it and quite a bit of Pandas functionality. Let’s get to it!
Data and 8 questions
This week’s data set is the database of consumer complaints, available at the CFPB's site:
https://www.consumerfinance.gov/data-research/consumer-complaints/
You should download the entire database in CSV format from
https://files.consumerfinance.gov/ccdb/complaints.csv.zip
The zipfile is about 690 MB in size, and the unzipped file is about 3 GB in size, so make sure your computer has enough RAM to handle it!
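If you’d rather fetch and unpack the file from Python than from a browser, here’s one possible sketch. It uses only the standard library, and assumes you have enough disk space for both the zipfile and the extracted CSV:

from urllib.request import urlretrieve
import zipfile

# Download the zipfile (about 690 MB) into the current directory
urlretrieve('https://files.consumerfinance.gov/ccdb/complaints.csv.zip',
            'complaints.csv.zip')

# Extract complaints.csv (about 3 GB) alongside it
with zipfile.ZipFile('complaints.csv.zip') as zf:
    zf.extractall('.')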
Here are this week's eight tasks and questions:
Read the CFPB complaint data. Read all columns, and don't specify any dtypes. How long did the read take? How much memory does the resulting data frame take?
I started off by loading Pandas into memory, with the standard import and alias:
import pandas as pd
I also defined a “filename” variable with the name of the CSV file I want to load. I’ll be loading it a number of times, and decided that referring to a variable would be easier than repeating the string:
filename = 'complaints.csv'
I asked you to read the entire file (all columns) into a data frame. That’s easy enough to do with “read_csv”:
df = pd.read_csv(filename)
However, when I gave the above instruction, I got a warning from Pandas:
DtypeWarning: Columns (16) have mixed types. Specify dtype option on import or set low_memory=False.
What does this warning mean? Well, the data frame is large, with 4,367,328 rows and 18 columns. Instead of reading it all into memory, analyzing the contents of each column, and deciding what dtype to use, Pandas read it into memory in chunks. That saves memory, but it also means that the dtype it decides on from earlier chunks might not work with later chunks, leading to an incorrectly determined dtype.
Pandas suggests getting rid of this warning by either setting low_memory=False (thus letting it examine the entirety of a column’s data before setting the dtype) or by explicitly telling Pandas what dtypes to use.
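If you’re curious about what actually triggered the warning, one quick (and admittedly unscientific) check is to look at which Python types ended up in the flagged column — column 16, counting from zero — after the plain read above:

# Count the Python types in the column the warning pointed at;
# in a case like this, it's typically a mix of str and float (i.e., NaN)
print(df.iloc[:, 16].map(type).value_counts())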
I decided that it might be interesting to see just what kind of effect these suggestions might have.
For starters, I decided to set low_memory=False:
df = pd.read_csv(filename,
                 low_memory=False)
I wanted to know how long this took, and for such tasks, I’m a big fan of the magic “timeit” commands in Jupyter. %timeit runs a single line of code repeatedly, reporting the mean run time, and %%timeit does the same for an entire cell. Note that if you’re using %%timeit, it must be on the first line of the cell; Python comments (starting with #) and other code must come after the call to %%timeit.
%%timeit
df = pd.read_csv(filename,
                 low_memory=False)
The result? Loading the data on my system takes quite some time:
19.6 s ± 96.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In other words, it ran the code 7 times, and found that on average, it took 19.6 seconds to load.
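As an aside, waiting for seven runs of a 20-second load can get tedious. The %%timeit magic accepts flags controlling the number of loops and repetitions; for example, the following would run the load once per repetition, with three repetitions instead of the default seven:

%%timeit -n 1 -r 3
df = pd.read_csv(filename,
                 low_memory=False)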
How much memory does this data frame take up? An easy way to check is with the “memory_usage” method, which returns the memory usage for each column:
df.memory_usage()
I get the following results:
Index 132
Date received 34938624
Product 34938624
Sub-product 34938624
Issue 34938624
Sub-issue 34938624
Consumer complaint narrative 34938624
Company public response 34938624
Company 34938624
State 34938624
ZIP code 34938624
Tags 34938624
Consumer consent provided? 34938624
Submitted via 34938624
Date sent to company 34938624
Company response to consumer 34938624
Timely response? 34938624
Consumer disputed? 34938624
Complaint ID 34938624
dtype: int64
Does it seem a bit suspicious that each column takes up precisely the same amount of memory? It should. By default, Pandas only counts the memory that it manages directly: every cell in these text columns is just an 8-byte pointer to a Python string, and with 4,367,328 rows, that’s 4,367,328 × 8 = 34,938,624 bytes per column. The Python strings themselves aren’t included, so this isn’t a real measure of how much memory is being used. To get that, we need to pass deep=True, which takes much longer but gives an actual answer:
df.memory_usage(deep=True)
This time, we get:
Index 132
Date received 292610976
Product 462396306
Sub-product 317189659
Issue 422208545
Sub-issue 373518838
Consumer complaint narrative 1812265991
Company public response 380293226
Company 364080378
State 256506475
ZIP code 269867646
Tags 157331396
Consumer consent provided? 286252782
Submitted via 264345302
Date sent to company 292610976
Company response to consumer 355644139
Timely response? 261984666
Consumer disputed? 160648459
Complaint ID 34938624
dtype: int64
Since this is a series, we can total it up with “sum”:
usage = df.memory_usage(deep=True).sum()
print(f'{usage:,}')
Displaying usage in an f-string lets us put commas between the digits:
6,764,694,516
The total memory usage of this data frame is… 6.7 GB! Wow, that’s a lot of memory.
Now specify categorical and date dtypes. How long did the read take? How much memory does this take?
Let’s see if we can shrink that memory usage a bit. One of the easiest ways to reduce the size of a text-based column is with a “category.” Consider a column containing country names. The country names likely repeat many times, with each mention of a country taking additional memory. If we could replace each country name with an integer, and then store the mapping from integers to country names on the side, we could save a lot of memory.
That’s precisely how a category works — except that Pandas handles it all for us automatically. Moreover, the fact that it’s a categorical column is pretty much invisible to us; we can still use all of the string functionality that Pandas makes available.
If we already have a series (including a column) of strings, then we can turn it into a category by saying:
s.astype('category')
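To get a feel for how much a category can save, here’s a small, self-contained demonstration with made-up data — repeated country names, standing in for any low-cardinality text column:

import pandas as pd

# Three country names, each repeated a million times
s = pd.Series(['USA', 'Canada', 'Mexico'] * 1_000_000)

print(s.memory_usage(deep=True))                      # plain "object" strings
print(s.astype('category').memory_usage(deep=True))   # the same values as a category

c = s.astype('category')
print(c.cat.categories)      # the unique strings, stored only once
print(c.cat.codes.head())    # the small integer codes that stand in for them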
But in this case, we’ll go one better: We’ll tell Pandas that we want the text columns to be turned into categories as they’re loaded. This does mean that Pandas will need to spend some time analyzing the data in each column, finding the unique values, creating the category objects, and then rewriting the columns. So it might take a bit more time, but the memory savings could be substantial.
The way we do this is by passing the “dtype” keyword argument to “read_csv”. This argument takes a dict as a value; the dict’s keys are column names, and the values are the dtypes we want to assign to the columns. In this particular case, we’ll just want a ton of category columns:
df = pd.read_csv(filename,
                 low_memory=False,
                 dtype={'Product': 'category',
                        'Sub-product': 'category',
                        'Issue': 'category',
                        'Sub-issue': 'category',
                        'Company': 'category',
                        'State': 'category',
                        'Submitted via': 'category',
                        'Company response to consumer': 'category',
                        'ZIP code': 'category',
                        'Consumer consent provided?': 'category',
                        'Company public response': 'category',
                        'Timely response?': 'category',
                        'Consumer disputed?': 'category'})
But wait, we can do even better: There are two columns that contain date information, which we can have Pandas parse into datetime columns. We do this by passing the “parse_dates” keyword argument. Putting it all together, we have the following call:
%%timeit
df = pd.read_csv(filename,
                 low_memory=False,
                 parse_dates=['Date received', 'Date sent to company'],
                 dtype={'Product': 'category',
                        'Sub-product': 'category',
                        'Issue': 'category',
                        'Sub-issue': 'category',
                        'Company': 'category',
                        'State': 'category',
                        'Submitted via': 'category',
                        'Company response to consumer': 'category',
                        'ZIP code': 'category',
                        'Consumer consent provided?': 'category',
                        'Company public response': 'category',
                        'Timely response?': 'category',
                        'Consumer disputed?': 'category'})
Running the above, I got the following results from %%timeit:
20.4 s ± 254 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The extra analysis and building of categories did take some extra time. True, we needed the dates to be parsed, so that was useful. But the big question is whether we saved any memory. Let’s see:
usage = df.memory_usage(deep=True).sum()
print(f'{usage:,}')
The result?
2,161,433,792
That’s right — we have reduced our memory usage by about 70 percent, simply by asking Pandas to use dtypes and categories. Not a bad savings for several minutes of work.
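If you want to double-check that figure, the two totals we printed above are all we need:

# Roughly a 70 percent reduction (about 68%, to be precise)
before = 6_764_694_516
after = 2_161_433_792
print(f'{1 - after/before:.0%}')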