Process Google Sheets data in Python with Pandas (example: running distance over time)

Every now and then I do a little bit of data mangling for personal use, using tools that I came to appreciate mainly during professional work.

In this blog post I would like to share an example of simple yet powerful end-to-end data processing using Google Sheets, Python, and Pandas (and I have to note: my first blog post about data analysis with Pandas is already more than six years old; time flies).

Raw data in a spreadsheet: a date and a distance for every run

I have a chaotic spreadsheet in Google Sheets where I keep track of my runs. It is easy to edit from my smartphone, and it is synchronized across devices.

This is a screenshot of a part of that spreadsheet (hiding some irrelevant columns, showing only a small fraction of the rows):

Each row in the date column refers to a specific day using the text format YYYY-MM-DD. Some rows in the km column contain numerical values. Each of these means that on the corresponding day I went for a run of the distance given by the value, in kilometers. Missing values in the km column mean that on those days I did not run (or forgot to record it). In the screenshot, it looks like days in the date column are represented without gaps, but that is not important.
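
To make the structure of the raw data concrete, here is a minimal sketch of how such a CSV export might look when loaded into Pandas. The values are illustrative (loosely based on the examples further below), not a dump of my actual spreadsheet:

import io

import pandas as pd

# Hypothetical sample mimicking the spreadsheet's CSV export: a `date`
# column (text, YYYY-MM-DD) and a `km` column that is empty on days
# without a recorded run.
csv_text = """date,km
2019-05-27,2.7
2019-05-28,
2019-05-29,
2019-06-06,2.9
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
#          date   km
# 0  2019-05-27  2.7
# 1  2019-05-28  NaN
# 2  2019-05-29  NaN
# 3  2019-06-06  2.9

Days without a run end up as NaN in the km column, which is exactly what the df.km.notnull() filter further below relies on.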

Now you know the kind of data about my runs that I have been entering manually for about half a year.

Analysis goal

My goal today was to do a tiny bit of data analysis — for myself — and to then write this blog post, for you :-).

In terms of data analysis, my goal was to look into the evolution of my running performance over time. I wanted to start by looking at “running distance per week”. Specifically, my goal was to perform a rolling window analysis with a window width wider than a week (to focus on changes on longer time scales rather than on fast fluctuations), and to plot the outcome.

So I built a small tool using Python and Pandas which

  • automatically fetches the current dataset from Google Sheets (as a CSV document; a rough sketch of this step follows the list)
  • processes and prepares the data (removing dirt, filling gaps, …)
  • performs statistical analyses
  • creates a plot (and writes a PNG graphics file)
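
The following sketch shows roughly what the first two steps might look like. It assumes the standard Google Sheets CSV export endpoint and the RUNNI_GSHEET_KEY environment variable from the usage example below; it is a simplified illustration, not the exact code from runni.py:

import io
import os

import pandas as pd
import requests

# The sheet must have link sharing enabled; its ID/key is taken from
# an environment variable (see the usage example below).
sheet_key = os.environ["RUNNI_GSHEET_KEY"]
csv_url = f"https://docs.google.com/spreadsheets/d/{sheet_key}/export?format=csv"

# Fetch the current state of the spreadsheet as a CSV document.
resp = requests.get(csv_url)
resp.raise_for_status()

# Parse the CSV document into a DataFrame for further processing.
df = pd.read_csv(io.StringIO(resp.text))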

Results first

Here is some code: https://github.com/jgehrcke/runni/blob/master/runni.py

Here is how I use it, and how you can use it, too:

# Get the code.
$ git clone https://github.com/jgehrcke/runni && cd runni
 
# Enable link sharing to the Google Sheet (so that anyone
# with the link can access the sheet). Get the corresponding
# ID/key from the URL, and set it as an environment variable
# (it's sensitive data).
$ export RUNNI_GSHEET_KEY='[snip]'
 
# Dependencies: Python 3, pandas, matplotlib, requests
 
# Run the analysis program.
$ python runni.py
...
200112-19:57:39.060 INFO: Writing PNG figure to 2020-01-12_running-distance-per-week-over-time.png

The resulting plot:
 

For each day in the data interval, a small gray data point explicitly shows the distance I ran on that day (on most days this is 0 km). In most cases, a non-zero gray data point corresponds to a single run (the only run on the corresponding day), but more generally it is the sum of the distances of all runs on that day.

The thick black line is the “distance per week”, derived from a rolling window analysis with a window width of 14 days.

In-Pandas data processing in more detail (the non-trivial part)

The following code block (from here) with its code comments shows the core of the in-Pandas data processing and is the main reason why I wrote this blog post: I think this is the non-trivial part. In previous projects (goeffel, bouncer-log-analysis, dcos-dev-prod-analysis) I put a bit of thought into how to do a meaningful rolling window analysis with Pandas, and here I am simply re-using what I learned before. But some of it is still non-trivial and deserves an explanation.

The code comments are supposed to provide a hopefully helpful level of explanation. From the comments, it should at least be obvious that several decisions need to be made before the data in the spreadsheet can be analyzed in a meaningful way in a rolling/sliding window analysis.

# Keep only those rows that have a value set in the `km` column. That is
# the criterion for having made a run on the corresponding day.
df = df[df.km.notnull()]
 
# Parse text in `date` column into a pd.Series of `datetime64` type values
# and make this series be the new index of the data frame.
df.index = pd.to_datetime(df["date"])
 
# Sort data frame by index (sort from past to future).
df = df.sort_index()
 
# Turn the data frame into a `pd.Series` object, representing the distance
# ran over time. Every row / data point in this series represents a run:
# the index field is the date (day) of the run and the value is the
# distance of the run.
km_per_run = df["km"]
 
# There may have been more than one run per day. In these cases, sum up the
# distances and have a single row represent all runs of the day.
 
# Example: two runs on 07-11:
# 2019-07-10    3.2
# 2019-07-11    4.5
# 2019-07-11    5.4
# 2019-07-17    4.5
 
# Group events per day and sum up the run distance:
km_per_run = km_per_run.groupby(km_per_run.index).sum()
 
# Outcome for the above example:
# 2019-07-10    3.2
# 2019-07-11    9.9
# 2019-07-17    4.5
 
# The time series index is expected to have gaps: days on which no run was
# recorded. Up-sample the time index to fill these gaps, with 1 day
# resolution. Fill the missing values with zeros. This is not strictly
# necessary for the subsequent analysis but makes the series easier to
# reason about, and makes the rolling window analysis a little simpler: it
# will contain one data point per day, precisely, within the represented
# time interval.
#
# Before:
#   In [28]: len(km_per_run)
#   Out[28]: 75
#
#   In[27]: km_per_run.head()
#   Out[27]:
#   2019-05-27    2.7
#   2019-06-06    2.9
#   2019-06-11    4.6
#   ...
#
# After:
#   In [30]: len(km_per_run)
#   Out[30]: 229
#
#   In [31]: km_per_run.head()
#   Out[31]:
#   2019-05-27    2.7
#   2019-05-28    0.0
#   2019-05-29    0.0
#   2019-05-30    0.0
#   ...
#
km_per_run = km_per_run.asfreq("1D", fill_value=0)
 
# Should be >= 7 to be meaningful.
window_width_days = opts.window_width_days
window = km_per_run.rolling(window="%sD" % window_width_days)
 
# For each window position get the sum of distances. For normalization,
# divide this by the window width (in days) to get values of the unit
# km/day -- and then convert to the new desired unit of km/week with an
# additional factor of 7.
km_per_week = window.sum() / (window_width_days / 7.0)
 
# During the rolling window analysis the value derived from the current
# window position is assigned to the right window boundary (i.e. to the
# newest timestamp in the window). For presentation it is more convenient
# and intuitive to have it assigned to the temporal center of the time
# window. Invoking `rolling(..., center=True)` however yields
# `NotImplementedError: center is not implemented for datetimelike and
# offset based windows`. As a workaround, shift the data by half the window
# size to 'the left': shift the timestamp index by a constant / offset.
offset = pd.DateOffset(days=window_width_days / 2.0)
km_per_week.index = km_per_week.index - offset
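
The plotting code itself is not shown above. For completeness, here is a minimal Matplotlib sketch (not the exact code from runni.py) of how the two series from above might be turned into the figure described earlier: small gray points for the per-day distance, and a thicker black line for the rolling “distance per week”:

from matplotlib import pyplot as plt

plt.figure()

# Small gray data points: distance per individual day (mostly 0 km).
plt.plot(km_per_run.index, km_per_run.values, linestyle="None",
         marker="o", markersize=3, color="gray")

# Thick black line: the rolling window result, in km per week.
plt.plot(km_per_week.index, km_per_week.values, linewidth=2, color="black")

plt.xlabel("time")
plt.ylabel("running distance [km/week]")
plt.tight_layout()
plt.savefig("running-distance-per-week-over-time.png", dpi=150)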

Closing remarks

  • What is shown above is, I think, a well-confined, simple example of real-world data analysis. I like the architecture of maintaining the raw data in Google Sheets and then consuming it via HTTP for analysis with proper tooling (the data analysis and plotting options within Google Sheets are very limited, heck). With that example, I hope I can inspire some of you out there to do similar things. There are endless possibilities: in my spreadsheet, I have other columns such as the run duration, … :-).
  • I will keep adding data points to my spreadsheet after pretty much every run and will keep re-generating the graph more or less regularly, for my own entertainment.
  • The meaning and impact of the window width in the rolling window analysis are critical. I have not explained that above. I think one of the best ways to grasp it is to play with it visually; that is what the --window-width-days argument helps with.
  • Again, you can find the code for inspiration here: https://github.com/jgehrcke/runni

A denial-of-service attacker in my Ethernet: Philips screen 329P9H

I use a Philips P-line 329P9H screen. A strong reason why I chose this particular model is its 1 Gbit/s Ethernet (RJ-45) port: the 329P9H has the rare property that I can attach it to my network switch with a common Ethernet cable.

When I then attach my notebook to the screen through a single USB-C cable, that connection transmits not only video and peripheral (keyboard, mouse, …) data but also Ethernet. Quite neat, right?

I thought: that way, I could easily keep enjoying the stability of a cabled Ethernet connection at my desk, just like I did for most of my professional life, even though my current notebook does not even have an RJ-45 port anymore.

I also listen to music quite a lot with my music streaming system, which is connected to the Internet through the same cabled Ethernet, attached to the same switch as the screen.

Since I have had this new screen, the music stream has stopped more often than before. And I could not quite believe it: when the music stops and I then unplug the screen’s Ethernet cable from the switch, the music starts again within one or two seconds.

The first time I tried this was when I had run out of options debugging the absence of Internet connectivity at my music streaming device. I looked at the switch, and the activity LED for the screen’s Ethernet port (on the switch) was flashing at its highest frequency. At that very moment, I did not have the notebook attached to the screen. That was fishy: where would the Ethernet activity come from if not from the (idle, sleeping) screen itself? I powered off the screen, and Ethernet (and therefore Internet) connectivity came back quasi-instantaneously, for my music streaming device and for all other devices attached to the switch.

I have observed this about five times so far. It is no coincidence; it seems to be an actual fault in the screen’s Ethernet “controller”. It only seems to happen when no notebook is attached to the screen and the screen is in ‘sleep’ mode. Power-cycling the screen is sufficient to make the problem go away (until it comes back after a couple of days or weeks).

The Ethernet switch I use is consumer-grade, but it is a decent 5-port 1 Gbit/s switch that has served me well over the years; several different networking devices have worked just fine with it. This new screen is the first device that trips it up.

This happened again today. I think I will now leave the screen unplugged from the switch and just use the wireless network from my notebook. :-(

VS Code spell check: simplicity wins

Spell Right is a spell checker extension for VS Code. Its README is a little convoluted. Here is how to set it up on a Linux system.

Step 1: set up dictionary files:

cd $HOME/.config/Code/
mkdir Dictionaries && cd Dictionaries
git clone https://github.com/titoBouzout/Dictionaries .

Step 2: In VS Code press Ctrl+P and run

 ext install ban.spellright

Step 3: restart VS Code

Step 4: open the document to spell-check in VS Code. Then click the eye icon (similar to 👁️) in the bottom-right corner of the VS Code status bar. Select the language, press OK. The document is now being spell-checked. Screenshot:

(Screenshot: VS Code spell check with Spell Right)

Remarks

For my use cases I prefer Spell Right over the more complex Code Spell Checker extension. Spell Right seems to be pretty similar to the simple, unobtrusive experience provided by Sublime’s built-in spell checker (which I was happy with for many years, also using the titoBouzout/Dictionaries).

Setting and using variables in a Makefile: pitfalls

A debug session. I ran into a problem in CI where accessing /var/run/docker.sock from within a container failed with a permission error. I had this specific part working before. So I did a bit of bisecting, added debug output, and found the critical difference. In this case I get a permission error (EACCES):

uid=2000(buildkite-agent) gid=0(root) groups=0(root)

In this case not:

uid=2000(buildkite-agent) gid=1001 groups=1001

The difference is in the Unix group membership of user uid=2000(buildkite-agent) within the container. If the user is a member of gid=1001 then access is allowed; if the user is a member of gid=0 then access is denied.

I found that in the erroneous case I was passing this command line argument to docker run ...:

-u 2000:

Nothing after the colon, which is where the group ID (gid) belongs. docker run ... did not error out. That is not good. This input was treated the same as -u 2000 or -u 2000:0.

But why was there no gid after the colon when I have this in my Makefile?

-u $(shell id -u):${DOCKER_GID_HOST}

Because when this line was run, DOCKER_GID_HOST (a Make variable, not an environment variable) was actually not set.

A shell program run with the -o nounset option would catch this case (a variable being used but not set) and error out. Make does not have this kind of protection; as far as I know, there is no general protection against using variables that are not set.

Okay, but why was the variable DOCKER_GID_HOST not set when I have

DOCKER_GID_HOST := $(shell getent group docker | awk -F: '{print $$3}')

right before executing docker run? Well, because this is a Makefile, where you cannot set a variable in one line of a recipe and then use it in the next one (a pattern that we use in almost every other programming environment).

The lines of a recipe are executed in independent shells. The next line’s state is very much decoupled from the previous line’s state; that is how Makefiles work.
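
Here is a minimal, hypothetical Makefile that illustrates the pitfall and two common ways around it; the target names are made up for illustration, and the getent/awk snippet is the one from above:

# Pitfall: each recipe line runs in its own shell, so GID set on the
# first line is gone again on the second line.
broken:
	GID=$$(getent group docker | awk -F: '{print $$3}')
	echo "gid is: $$GID"    # prints "gid is: "

# Fix 1: assign the Make variable at the top level (evaluated by Make
# itself, not inside a recipe shell), as in the snippet above.
DOCKER_GID_HOST := $(shell getent group docker | awk -F: '{print $$3}')

fixed-toplevel:
	echo "gid is: $(DOCKER_GID_HOST)"

# Fix 2: keep it one logical shell invocation, joined with backslashes.
fixed-oneline:
	GID=$$(getent group docker | awk -F: '{print $$3}'); \
	echo "gid is: $$GID"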

This is probably the most important thing to know about Make, and one of the most common mistakes, and I certainly knew this before, and I certainly made this same mistake before. And I made it again, like probably every single time that I had to do things with Make.

Makefiles are flexible and great, and sometimes they make you waste a great deal of time compared to other development environments, man.

Stack Overflow threads on the matter, with goodies like workarounds, best practices, and generally helpful discussion:

Building Pandas from source: conflicting types for ‘_xgetbv’ (while building BLOSC)

Quick note: I tried building pandas from source (Fedora 30).

During

$ python -m pip install -r requirements-dev.txt

I ran into

blosc/shuffle.c:177:1: error: conflicting types for ‘_xgetbv’

A little more context:

$ python -m pip install -r requirements-dev.txt
 
[...]
 
Installing collected packages: blosc, bottleneck, numexpr, soupsieve, beautifulsoup4, llvmlite, numba, thrift, fastparquet, html5lib, lxml, jdcal, et-xmlfile, openpyxl, pyarrow, PyQt5-sip, pyqt5, tables, python-snappy, s3fs, sqlalchemy, xarray, xlrd, xlsxwriter, xlwt, odfpy, pyreadstat, pandas-sphinx-theme
  Running setup.py install for blosc ... error
    Complete output from command /home/jp/.pyenv/versions/374-pandas-dev/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-fydbjk8m/blosc/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-x9bwxqkv/install-record.txt --single-version-externally-managed --compile --install-headers /home/jp/.pyenv/versions/374-pandas-dev/include/site/python3.7/blosc:
    SSE2 detected
    AVX2 detected
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.7
    creating build/lib.linux-x86_64-3.7/blosc
 
[...]
 
    c-blosc/blosc/shuffle.c:177:1: error: conflicting types for ‘_xgetbv’
      177 | _xgetbv(uint32_t xcr) {
          | ^~~~~~~
    In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/immintrin.h:43,
                     from c-blosc/blosc/blosc-common.h:71,
                     from c-blosc/blosc/shuffle.h:18,
                     from c-blosc/blosc/shuffle.c:10:
    /usr/lib/gcc/x86_64-redhat-linux/9/include/xsaveintrin.h:60:1: note: previous definition of ‘_xgetbv’ was here
       60 | _xgetbv (unsigned int __A)
          | ^~~~~~~
    In file included from c-blosc/blosc/shuffle.c:11:
    c-blosc/blosc/shuffle-generic.h:61:13: warning: ‘unshuffle_generic_inline’ defined but not used [-Wunused-function]
       61 | static void unshuffle_generic_inline(const size_t type_size,
          |             ^~~~~~~~~~~~~~~~~~~~~~~~
    c-blosc/blosc/shuffle-generic.h:32:13: warning: ‘shuffle_generic_inline’ defined but not used [-Wunused-function]
       32 | static void shuffle_generic_inline(const size_t type_size,
          |             ^~~~~~~~~~~~~~~~~~~~~~
    error: command 'gcc' failed with exit status 1

The python-blosc documentation says:

Compiler specific optimisations are automatically enabled by inspecting the CPU flags building Blosc. They can be manually disabled by setting the following environmental variables: DISABLE_BLOSC_SSE2 and DISABLE_BLOSC_AVX2.

I ignorantly retried building with AVX2 instructions disabled:

$ DISABLE_BLOSC_AVX2=true pip install blosc
Collecting blosc
  Using cached https://files.pythonhosted.org/packages/6d/3b/2b707cd330a205ba5c69b5e8bfa9c05691442e45ce9ce882c4c8d343e61a/blosc-1.8.1.tar.gz
Installing collected packages: blosc
  Running setup.py install for blosc ... done
Successfully installed blosc-1.8.1
$ python -m pip install -r requirements-dev.txt
...
$ python setup.py build_ext --inplace -j 4
...

That worked. In my development setup I do not care about BLOSC performance, which is why I am OK with that workaround.

By the way, I needed to install the following dependencies on Fedora 30:

sudo dnf install @development-tools
sudo dnf install libzstd-devel gcc-c++ snappy-devel