BW #82: Broadband (solution)

BW #82: Broadband (solution)

This week, we looked at broadband Internet availability in countries around the world. The topic was inspired by a three-part series, "Breaking Ground," that Marketplace broadcast last week (https://www.marketplace.org/collection/breaking-ground/) about the obstacles that the United States now faces in expanding broadband access. I learned quite a bit from this series about how fiber is manufactured, and thought it might be interesting to see how the US compares with other countries.

Data and six questions

This week's data comes from the OECD's broadband statistics (https://www.oecd.org/en/topics/sub-issues/broadband-statistics.html). There are a number of different data sets that we could look at; I chose to focus on broadband penetration rates. The data set itself, in an Excel file, can be downloaded from here:

https://www.oecd.org/content/dam/oecd/en/topics/policy-sub-issues/broadband-statistics/data/1-5-fixed-and-mobile-broadband-penetration.xls

This Excel file contains two sheets, one for fixed broadband and one for mobile. We'll use both of them.

I gave you six tasks and questions to answer. As always, a link to the full Jupyter notebook that I used to perform the calculations is at the bottom of this post.

Create two data frames, one with fixed broadband info, and the other with mobile broadband info, taken from the two sheets of the Excel file. Ignore the row with overall OECD information. The country names should be the data frame's columns, and the quarters should be the rows. Combine the two data frames into a single, multi-indexed data frame.

First, let's load Pandas:

import pandas as pd

Next, we'll have to load the Excel file into a single data frame. We can use read_excel to get one or more data frames back from an Excel file. If we want more than one sheet, then we can pass the sheet_name keyword argument, indicating (with a list of integers or strings) which of the sheets we want to get back. They're returned as a dict in which the sheet names (or numeric indexes) are the keys and the data frames themselves are the values. We can thus do this:

filename = '1-5-fixed-and-mobile-broadband-penetration.xls'

all_dfs = pd.read_excel(filename, 
              sheet_name=[0,1]).values()

However, this code won't quite work. That's because each of the sheets contains non-data rows that mess up the reading of the data. We need to tell Pandas to ignore everything until the data starts – and more specifically, until the column names start – and then to ignore everything once the data ends.

To do this, we'll use the header keyword argument, telling it to start reading from line 3. And we'll pass nrows=38, so that we only read that many rows of data.

Next, since we want the country names to be the index, we can indicate index_col=0, to use them for the index. And finally, the .. strings should be treated as NaN values, so we pass na_values='..', adding it to the list of strings interpreted as NaN. (I didn't mention this explicitly, but it would have become obvious down the road.)

We thus end up with:

filename = '1-5-fixed-and-mobile-broadband-penetration.xls'

pd.read_excel(filename, 
              sheet_name=[0,1],
              header=3,
              index_col=0,
              nrows=38,
              na_values='..').values()

This is also a good start, and returns the two data frames. However, the data frames we get here are both inverted from what we want. They have dates on the columns and country names on the rows. We'll need to transpose each of the data frames. To do that, we can use a list comprehension on the data frames we get back from dict.values, invoking transpose (or its abbreviation, T):

filename = '1-5-fixed-and-mobile-broadband-penetration.xls'

[one_sheet.T for
 one_sheet in pd.read_excel(filename, 
                            sheet_name=[0,1],
                            header=3,
                            index_col=0,
                            nrows=38,
                            na_values='..').values()]      

This is great, except that we now have a list of two data frames, and we actually want them to be in a single data frame. Normally, we can combine data frames with pd.concat, and we can combine them horizontally with axis='columns'. However, we want a multi-index separating the fixed from mobile values. How can we do that?

We can use the keys keyword argument, passing a list of strings that'll serve as the outer-layer names for our data frame, serving as a reminder of which original data frame each set of data came from. Here's the final query that I performed, assigning the result of pd.concat to df:

filename = '1-5-fixed-and-mobile-broadband-penetration.xls'

df = pd.concat([one_sheet.T for
                       one_sheet in pd.read_excel(filename, 
                                                  sheet_name=[0,1],
                                                  header=3,
                                                  index_col=0,
                                                  nrows=38,
                                                  na_values='..').values()],
          keys=['fixed', 'mobile'],
          axis='columns')
      

The resulting data frame has 41 rows and 76 columns.

Turn the data frame's index (containing the quarters) into datetime values, choosing the final date of the named quarter.

First, you can tell from the fact that the question was phrased in the plural that I changed my mind while writing these questions about whether to have a single, multi-indexed data frame or two separate ones. We'll stick with the single, multi-indexed data frame in this question and in the rest of them.

The index values that we got from the Excel file were all strings of the form "Q2-2021" and "Q4-2023". In all cases, the format was Q, followed by 2 or 4, followed by a minus sign and then a year. This is fine, except that we cannot easily calculate with it. I thus decided to change these into datetime values for the final day of the second quarter (i.e., June 30th) and the final day of the fourth quarter (i.e., December 31st).

First, we need to turn the strings into something that pd.to_datetime will recognize. I chose to do that with another list comprehension, this time over all of the index values. Since each index value is a string, we can run str.replace , changing Q2- to 30-06- and Q4- to 31-12-.

If we make this transformation, and set dayfirst=True in our call to pd.to_datetime, then we'll get a datetime value. The list comprehension returns a list of datetime values, which we assign back to the index, turning our data frame into a time series:

df.index = [pd.to_datetime(one_index
                           .replace('Q2-', '30-06-')
                           .replace('Q4-', '31-12-'), dayfirst=True)
                     for one_index in df.index]

Our data frame is now what we originally wanted – the two sheets from the Excel file have been merged into a single data frame, albeit transposed from how they were presented there.