BW #99: Literacy and numeracy (solution)

[ Still didn't get a chance to read my 2024 year in review? Check it out at https://www.bambooweekly.com/my-2024-in-review/ ]

This week, we're looking at the "Survey of Adult Skills" run by the Organisation for Economic Co-operation and Development (OECD, https://oecd.org/), what the Economist calls a "club of mostly rich countries." The OECD published the survey's most recent results last month. The survey asked people to perform a variety of tasks relevant to adult life in the modern world – from reading graphs to interpreting rules to answering questions based on a thermometer.

I first read about this survey in the Economist's December 14th issue, in the article, "Off the books: Are adults forgetting how to read?" (https://www.economist.com/finance-and-economics/2024/12/10/are-adults-forgetting-how-to-read). As the headline indicates, the survey found that older adults might be worse at reading comprehension than young ones.

This week, we'll look at the data from this second round of the Survey of Adult Skills. (A first round was conducted 10 years ago, and we could theoretically use the data to see which countries are doing better and worse, but we won't be doing that.) Along the way, we'll get some insights into what kinds of skills people do well at, and how that breaks down along different ages and countries.

Data and six questions

The OECD survey's home page is at

https://www.oecd.org/en/publications/the-survey-of-adult-skills_f70238c7-en.html

The data describes the responses that nearly 160,000 people gave to the survey. The data frame will have one row per respondent, and one column per question asked, for a total of 2,483 (!) columns.

You can access and download the data by going to this page:

https://survey.oecd.org/index.php?r=survey/index&sid=424913&lang=en

You'll then have to fill out a short survey. After filling it out, you'll immediately have access to the files in CSV format (as well as SAS and SPSS formats). And yes, it's annoying to download the files in this way, but I believe that I can't link directly to the page from which the downloads take place, or distribute them elsewhere (e.g., as a single zipfile), which would definitely be easier for everyone.

The data dictionary, in the form of an Excel file, is here:

https://www.oecd.org/content/dam/oecd/en/about/programmes/edu/piaac/data-materials/cycle-2/piaac-cy2-international-codebook.xlsx

The learning goals for this week include working with multiple files, working with oddly defined CSV files, text manipulations, joining data together, and plotting.

Here are my six tasks and questions; the Jupyter notebook can be downloaded at the end of this message.

Download all of the available PIAAC CSV files into a directory (the Netherlands hasn't yet released its data), and read all of them into a single data frame, treating all five "unknown" values as NaN -- ., .d, .n, .r, and .v.

First of all, let's load Pandas into memory:

import pandas as pd

I then downloaded all of the files from the OECD data site. They all had the pattern prgXXXp2.csv, where XXX was the three-letter code for the country whose responses were in the file. The 2 refers to the fact that this is the second phase of the survey; as I mentioned in my introduction, there are also files for the first phase, which can be used to compare not just how countries stack up against one another, but also how they stack up against themselves, 10 years ago.

Pandas provides read_csv for reading a single CSV file into a data frame. But here, we had 30 files. How can I use read_csv to read them into one data frame?

My general approach is to read each file into its own data frame, creating a list of data frames, one per file. I can then use pd.concat to combine that list into a single data frame. I especially enjoy using list comprehensions to achieve this.
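
A minimal sketch of that comprehension-based approach, ignoring for the moment the file-specific options discussed below, might look like this:

import glob
import pandas as pd

# One data frame per matching file, collected via a list comprehension,
# then combined into a single data frame with pd.concat
all_dfs = [pd.read_csv(one_filename)
           for one_filename in glob.glob('prg*csv')]
df = pd.concat(all_dfs)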

However, I encountered a number of problems while reading in these CSV files, and as much as I enjoy list comprehensions, they got in the way of my debugging. I thus decided to use a traditional for loop, both so that I would know which file(s) were giving me trouble (via a print statement) and so that I could figure out what I needed to do to avoid those problems.

The CSV files needed special treatment in a few ways:

  • They used a semicolon (;) for separating fields, rather than commas or tabs. We can take care of that with the sep=';' keyword argument.
  • They have five (not four, as my question originally stated) different values to indicate that data is missing. That's nice, but Pandas only has one value, NaN, and we need to treat all of those as NaN values when reading them into memory. We can do that with the na_values keyword argument, passing a list of the strings that should be turned into NaN.
  • Some of the files are quite long. read_csv looks at the values in a column to determine what dtype to apply, but above a certain threshold, read_csv loads the file in chunks, examining each chunk. If two chunks tell it different things, you'll get a warning indicating that you should either specify the dtype explicitly or pass low_memory=False so that the data is all read at once. I did this, but then compared the timing with passing engine='pyarrow' to load the data using PyArrow's CSV loader. PyArrow was about 30 percent faster on my computer, so I went with it. (A rough timing sketch follows this list.)
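
If you want to try that comparison yourself, a rough sketch along these lines should do. The filename prgusap2.csv here is just a hypothetical example; any of the downloaded PIAAC files would work:

import time
import pandas as pd

filename = 'prgusap2.csv'    # hypothetical example; use any downloaded PIAAC file
na_values = ['.', '.d', '.n', '.r', '.v']

# Default C engine, reading the whole file at once
start = time.perf_counter()
pd.read_csv(filename, sep=';', na_values=na_values, low_memory=False)
print(f'C engine:       {time.perf_counter() - start:.2f} seconds')

# PyArrow's CSV loader (requires the pyarrow package to be installed)
start = time.perf_counter()
pd.read_csv(filename, sep=';', na_values=na_values, engine='pyarrow')
print(f'pyarrow engine: {time.perf_counter() - start:.2f} seconds')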

In the end, my code looked like this:

import glob

all_dfs = []

for index, one_filename in enumerate(glob.glob('prg*csv')):
    print(f'{index}: {one_filename}')
    df = pd.read_csv(one_filename, 
                     sep=';', 
                     engine='pyarrow', 
                     na_values=['.', '.d', '.n', '.r', '.v'])
    all_dfs.append(df)

df = pd.concat(all_dfs)

Notice that I used glob, one of my favorite Python modules, to get a list of filenames matching the pattern.

After running this code, df.shape tells me that the data frame contains 157,525 rows and 2,483 columns. So the data frame is both long and wide. I'll just add that the total memory usage for this data frame is 3.2 GB, while the total size of these CSV files is 1,440 MB, or about 1.4 GB. Once again, we see that the in-memory size of a data frame is often two to three times the size of the original CSV data. This is another place where PyArrow can make things faster and more efficient — but that's a topic for another day.
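
If you want to check those numbers yourself, one way is with the memory_usage method; passing deep=True tells Pandas to measure the actual string contents, not just the object pointers:

# Report the data frame's dimensions and approximate in-memory size
print(df.shape)
print(f'{df.memory_usage(deep=True).sum() / 1_000_000_000:.1f} GB')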

The CNTRYID column indicates the country in which the respondent lives, but as an integer. Retrieve the country names from row 3 of the data dictionary, and add a CNTRYNAME column to the data frame, so that we'll know each country's name, not just its number.

If we're going to understand how each country is doing, we'll need to map from the country numbers (as used in CNTRYID) to their names. The numbers-to-names translations are all in the data dictionary, which is itself in an Excel spreadsheet. Perhaps we can use that file to add a new CNTRYNAME column to our data frame, and thus avoid having to enter the data manually?

To do this, I first used read_excel to create a data frame from our Excel file. I passed the index_col keyword argument, so that the Variable column would be used as an index. This allows us to use loc to retrieve the row CNTRYID, and the column Value Scheme Detailed – a string containing the name-value pairs we'll need to perform the translation:

data_dictionary_filename = 'piaac-cy2-international-codebook.xlsx'

country_names_string = (
    pd
    .read_excel(data_dictionary_filename, index_col='Variable')
    .loc['CNTRYID', 'Value Scheme Detailed']
)

Note that country_names_string is (as the name indicates) a string, not a series or a data frame. That's because I asked for one row and one column; Pandas provided me with the string.

My next step is to create a dictionary from this string. If you were relieved to find that I didn't use a comprehension in the previous solution, you'll be disappointed to find that I'm using one here — a dict comprehension, returning a dictionary.

country_numbers_to_names_dict = {
    int(one_country.split(':')[0]) : one_country.split(':')[1].strip()
    for one_country in country_names_string.split(';')
}

This dict comprehension iterates over the list we get from splitting country_names_string on semicolons (;), which separate the key-value pairs from one another. We then split each key-value pair on the colon (:), grabbing the first part for the key and the second part for the value. I also use int to ensure that the key will be an integer, rather than a string containing digits. Note that int automatically ignores any leading or trailing whitespace — so I don't need to use str.strip on the key, but I do on the value, to avoid any extra spaces.
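
To make the parsing concrete, here's the same comprehension applied to a short, made-up string. The country numbers and names are illustrative, not copied from the codebook:

# A made-up string in the same shape as country_names_string
sample = '40: Austria; 124: Canada; 392: Japan'

sample_dict = {
    int(one_country.split(':')[0]): one_country.split(':')[1].strip()
    for one_country in sample.split(';')
}

print(sample_dict)    # {40: 'Austria', 124: 'Canada', 392: 'Japan'}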

I originally thought that it would make sense to use the map method to get a new series based on df['CNTRYID'], and then assign it to a new column in our data frame:

df['CNTRYNAME'] = df['CNTRYID'].map(country_numbers_to_names_dict)

I was rather surprised to get a warning from Pandas:

/var/folders/rr/0mnyyv811fs5vyp22gf4fxk00000gn/T/ipykernel_59872/4139732354.py:1: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

In other words, inserting a new column in this way doesn't create one new column all at once, but actually invokes an internal insert method a large number of times, which leads to poor performance and fragmented memory. I had no idea, and appreciate this warning!
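
One option, taken straight from the warning's own text, is to assign the column as before and then make a de-fragmented copy of the frame. A minimal sketch of that route, which I didn't end up taking, would be:

# Assign the new column, then copy the frame to de-fragment it,
# as the PerformanceWarning itself suggests
df['CNTRYNAME'] = df['CNTRYID'].map(country_numbers_to_names_dict)
df = df.copy()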

I decided instead to create a new series based on df['CNTRYID'], and then (as suggested by the warning, more or less) invoke pd.concat, specifying axis='columns' so that I add a new column, rather than a row:

df = pd.concat([df,
                pd.Series(df['CNTRYID'].map(country_numbers_to_names_dict),
                          name='CNTRYNAME')],
               axis='columns')

This worked well, and didn't create any warnings. Notice that I specified name='CNTRYNAME' when creating the series, so that the newly added column would be named.
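
As a quick, optional sanity check, we can confirm that every CNTRYID value was mapped to a name, and peek at how the respondents break down by country:

# Any NaN values here would mean a CNTRYID that wasn't in the data dictionary
print(df['CNTRYNAME'].isna().sum())

# Number of respondents per country, largest first
print(df['CNTRYNAME'].value_counts().head())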