Category Archives: Python

Process Google Sheets data in Python with Pandas (example: running distance over time)

Every now and then I do a little bit of data mangling for personal purposes, using tools that I have come to appreciate, mainly through professional work.

In this blog post I would like to share an example of simple yet powerful end-to-end data processing using Google Sheets, Python, and Pandas (and I have to note: my first blog post about data analysis with Pandas is already more than six years old; time flies).

Raw data in a spreadsheet: a date and a distance for every run

I have a chaotic spreadsheet in Google Sheets where I keep track of my runs. It is easy to edit from my smartphone, and it stays synchronized across devices.

This is a screenshot of a part of that spreadsheet (hiding some irrelevant columns, showing only a small fraction of the rows):

Each row in the date column refers to a specific day, in the text format YYYY-MM-DD. Some rows in the km column contain numerical values. Each such value means that on the corresponding day I went for a run of that distance, in kilometers. Missing values in the km column mean that on those days I did not run (or forgot to keep track of it). In the screenshot it looks as if the days in the date column have no gaps, but that is not important.
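
To make the format concrete, here is a purely illustrative sketch (made-up values, not my actual data) of the two relevant columns once loaded into a Pandas data frame; the processing code further below operates on a frame of this shape:

import numpy as np
import pandas as pd

# Made-up example rows, purely to illustrate the structure: a `date` column
# with YYYY-MM-DD text, and a `km` column that is empty (NaN) on days
# without a recorded run.
df = pd.DataFrame({
    "date": ["2019-07-09", "2019-07-10", "2019-07-11", "2019-07-12"],
    "km": [np.nan, 3.2, 4.5, np.nan],
})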

Now you know the kind of data about my runs that I have been entering manually for about half a year.

Analysis goal

My goal today was to do a tiny bit of data analysis — for myself — and to then write this blog post, for you :-).

In terms of data analysis, my goal was to look into the evolution of my running performance over time. I wanted to start by looking at “running distance per week”. Specifically, my goal was to perform a rolling window analysis with a window width wider than a week (to focus on changes on longer time scales more than on fast fluctuations), and to plot the outcome.

So I built a small tool using Python and Pandas which

  • automatically fetches the current dataset from Google Sheets (as a CSV document; a sketch of this step follows right after this list)
  • processes and prepares the data (removing dirt, filling gaps, …)
  • performs statistical analyses
  • creates a plot (and writes a PNG graphics file)
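
The first of these steps, fetching the sheet as a CSV document, deserves a brief illustration. The following is a minimal sketch of how such a fetch can look; it is not a verbatim excerpt from runni.py (the exact URL construction and error handling there may differ), and it assumes that the sheet key is provided via the RUNNI_GSHEET_KEY environment variable, as in the usage instructions below:

import io
import os

import pandas as pd
import requests

# Google Sheets exposes a CSV export endpoint for link-shared documents.
sheet_key = os.environ["RUNNI_GSHEET_KEY"]
csv_url = "https://docs.google.com/spreadsheets/d/%s/export?format=csv" % sheet_key

resp = requests.get(csv_url)
resp.raise_for_status()

# Parse the CSV text into a Pandas data frame.
df = pd.read_csv(io.StringIO(resp.text))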

Results first

Here is some code: https://github.com/jgehrcke/runni/blob/master/runni.py

Here is how I use it, and how you can use it, too:

# Get the code.
$ git clone https://github.com/jgehrcke/runni && cd runni
 
# Enable link sharing to the Google Sheet (so that anyone
# with the link can access the sheet). Get the corresponding
# ID/key from the URL, and set it as an environment variable
# (it's sensitive data).
$ export RUNNI_GSHEET_KEY='[snip]'
 
# Dependencies: Python 3, pandas, matplotlib, requests
 
# Run the analysis program.
$ python runni.py
...
200112-19:57:39.060 INFO: Writing PNG figure to 2020-01-12_running-distance-per-week-over-time.png

The resulting plot:
 

For each day in the data interval, a small gray data point shows the distance I ran on that very day (on most days this is 0 km). In most cases a non-zero gray data point corresponds to a single run (the only run on the corresponding day); more generally, it is the sum of the distances of all runs on that day.

The thick black line is the “distance per week”, derived from a rolling window analysis with a window width of 14 days.

In-Pandas data processing in more detail (the non-trivial part)

The following code block (from here), with its code comments, shows the core of the in-Pandas data processing and is the main reason why I wrote this blog post: I think this is the non-trivial part. In previous projects (goeffel, bouncer-log-analysis, dcos-dev-prod-analysis) I put a bit of thought into how to do a meaningful rolling window analysis with Pandas, and here I am simply re-using what I learned before. But some of it is still non-trivial and deserves an explanation.

The code comments are meant to provide a helpful level of explanation. They should at least make it obvious that several decisions have to be made before the data in the spreadsheet can be fed into a rolling/sliding window analysis in a meaningful way.

# Keep only those rows that have a value set in the `km` column. That is
# the criterion for having made a run on the corresponding day.
df = df[df.km.notnull()]
 
# Parse text in `date` column into a pd.Series of `datetime64` type values
# and make this series be the new index of the data frame.
df.index = pd.to_datetime(df["date"])
 
# Sort data frame by index (sort from past to future).
df = df.sort_index()
 
# Turn the data frame into a `pd.Series` object, representing the distance
# ran over time. Every row / data point in this series represents a run:
# the index field is the date (day) of the run and the value is the
# distance of the run.
km_per_run = df["km"]
 
# There may have been more than one run per day. In these cases, sum up the
# distances and have a single row represent all runs of the day.
 
# Example: two runs on 07-11:
# 2019-07-10    3.2
# 2019-07-11    4.5
# 2019-07-11    5.4
# 2019-07-17    4.5
 
# Group events per day and sum up the run distance:
km_per_run = km_per_run.groupby(km_per_run.index).sum()
 
# Outcome for the example above:
# 2019-07-10    3.2
# 2019-07-11    9.9
# 2019-07-17    4.5
 
# The time series index is expected to have gaps: days on which no run was
# recorded. Up-sample the time index to fill these gaps, with 1 day
# resolution. Fill the missing values with zeros. This is not strictly
# necessary for the subsequent analysis but makes the series easier to
# reason about, and makes the rolling window analysis a little simpler: it
# will contain one data point per day, precisely, within the represented
# time interval.
#
# Before:
#   In [28]: len(km_per_run)
#   Out[28]: 75
#
#   In[27]: km_per_run.head()
#   Out[27]:
#   2019-05-27    2.7
#   2019-06-06    2.9
#   2019-06-11    4.6
#   ...
#
# After:
#   In [30]: len(km_per_run)
#   Out[30]: 229
#
#   In [31]: km_per_run.head()
#   Out[31]:
#   2019-05-27    2.7
#   2019-05-28    0.0
#   2019-05-29    0.0
#   2019-05-30    0.0
#   ...
#
km_per_run = km_per_run.asfreq("1D", fill_value=0)
 
# Should be >= 7 to be meaningful.
window_width_days = opts.window_width_days
window = km_per_run.rolling(window="%sD" % window_width_days)
 
# For each window position get the sum of distances. For normalization,
# divide this by the window width (in days) to get values of the unit
# km/day -- and then convert to the new desired unit of km/week with an
# additional factor of 7.
km_per_week = window.sum() / (window_width_days / 7.0)
 
# During the rolling window analysis the value derived from the current
# window position is assigned to the right window boundary (i.e. to the
# newest timestamp in the window). For presentation it is more convenient
# and intuitive to have it assigned to the temporal center of the time
# window. Invoking `rolling(..., center=True)` however yields
# `NotImplementedError: center is not implemented for datetimelike and
# offset based windows`. As a workaround, shift the data by half the window
# size to 'the left': shift the timestamp index by a constant / offset.
offset = pd.DateOffset(days=window_width_days / 2.0)
km_per_week.index = km_per_week.index - offset
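
The plotting step is not part of the block above. Purely as a sketch (the actual plotting code in runni.py is more elaborate and may differ in its details), the two plot elements described earlier can be drawn from km_per_run and km_per_week with matplotlib roughly like this:

import matplotlib.pyplot as plt

# Gray dots: distance per individual day (mostly 0 km).
plt.plot(
    km_per_run.index, km_per_run.values,
    linestyle="None", marker="o", markersize=3, color="gray",
    label="distance per day",
)

# Black line: rolling-window "distance per week".
plt.plot(
    km_per_week.index, km_per_week.values,
    color="black", linewidth=2,
    label="distance per week (%s-day window)" % window_width_days,
)

plt.xlabel("date")
plt.ylabel("km")
plt.legend()
plt.savefig("running-distance-per-week-over-time.png", dpi=150)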

Closing remarks

  • What is shown above is, I think, a well-confined, simple example of real-world data analysis. I like the architecture of maintaining raw data in Google Sheets and then consuming it via HTTP for analysis with proper tooling (the data analysis and plotting options within Google Sheets are very limited, heck). With this example, I hope I can inspire some of you out there to do similar things. There are endless possibilities: in my spreadsheet, I have other columns, such as the run duration, … :-).
  • I will keep adding data points to my spreadsheet after more or less every run, and I will keep re-generating the graph more or less regularly, for my own entertainment.
  • The meaning and impact of the window width in the rolling window analysis are critical. I have not explained that above. I think one of the best ways to grasp it is to visually play with it — that’s what the --window-width-days argument can help with.
  • Again, you can find the code for inspiration here: https://github.com/jgehrcke/runni

Building Pandas from source: conflicting types for ‘_xgetbv’ (while building BLOSC)

Quick note: I tried building pandas from source (Fedora 30).

During

$ python -m pip install -r requirements-dev.txt

I ran into

blosc/shuffle.c:177:1: error: conflicting types for ‘_xgetbv’

A little more context:

$ python -m pip install -r requirements-dev.txt
 
[...]
 
Installing collected packages: blosc, bottleneck, numexpr, soupsieve, beautifulsoup4, llvmlite, numba, thrift, fastparquet, html5lib, lxml, jdcal, et-xmlfile, openpyxl, pyarrow, PyQt5-sip, pyqt5, tables, python-snappy, s3fs, sqlalchemy, xarray, xlrd, xlsxwriter, xlwt, odfpy, pyreadstat, pandas-sphinx-theme
  Running setup.py install for blosc ... error
    Complete output from command /home/jp/.pyenv/versions/374-pandas-dev/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-fydbjk8m/blosc/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-x9bwxqkv/install-record.txt --single-version-externally-managed --compile --install-headers /home/jp/.pyenv/versions/374-pandas-dev/include/site/python3.7/blosc:
    SSE2 detected
    AVX2 detected
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.7
    creating build/lib.linux-x86_64-3.7/blosc
 
[...]
 
    c-blosc/blosc/shuffle.c:177:1: error: conflicting types for ‘_xgetbv’
      177 | _xgetbv(uint32_t xcr) {
          | ^~~~~~~
    In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/immintrin.h:43,
                     from c-blosc/blosc/blosc-common.h:71,
                     from c-blosc/blosc/shuffle.h:18,
                     from c-blosc/blosc/shuffle.c:10:
    /usr/lib/gcc/x86_64-redhat-linux/9/include/xsaveintrin.h:60:1: note: previous definition of ‘_xgetbv’ was here
       60 | _xgetbv (unsigned int __A)
          | ^~~~~~~
    In file included from c-blosc/blosc/shuffle.c:11:
    c-blosc/blosc/shuffle-generic.h:61:13: warning: ‘unshuffle_generic_inline’ defined but not used [-Wunused-function]
       61 | static void unshuffle_generic_inline(const size_t type_size,
          |             ^~~~~~~~~~~~~~~~~~~~~~~~
    c-blosc/blosc/shuffle-generic.h:32:13: warning: ‘shuffle_generic_inline’ defined but not used [-Wunused-function]
       32 | static void shuffle_generic_inline(const size_t type_size,
          |             ^~~~~~~~~~~~~~~~~~~~~~
    error: command 'gcc' failed with exit status 1

The python-blosc documentation says:

Compiler specific optimisations are automatically enabled by inspecting the CPU flags building Blosc. They can be manually disabled by setting the following environmental variables: DISABLE_BLOSC_SSE2 and DISABLE_BLOSC_AVX2.

I ignorantly retried building with AVX2 instructions disabled:

$ DISABLE_BLOSC_AVX2=true pip install blosc
Collecting blosc
  Using cached https://files.pythonhosted.org/packages/6d/3b/2b707cd330a205ba5c69b5e8bfa9c05691442e45ce9ce882c4c8d343e61a/blosc-1.8.1.tar.gz
Installing collected packages: blosc
  Running setup.py install for blosc ... done
Successfully installed blosc-1.8.1
$ python -m pip install -r requirements-dev.txt
...
$ python setup.py build_ext --inplace -j 4
...

That worked. In my development setup I do not care about BLOSC performance, which is why I am OK with that workaround.

By the way, I needed to install the following dependencies on Fedora 30:

sudo dnf install @development-tools
sudo dnf install libzstd-devel gcc-c++ snappy-devel

pyenv has proven itself over the years

Back around 2010 I was manually managing production-critical CPython installations, compiled in special ways. I learned a lot about how to properly isolate Python installations from each other. I also learned how important it is to never confuse a “system Python” (which many Linux distributions ship) with a Python build that you are in control of. Since then I have not touched a system Python anymore (it is the system’s Python after all, and none of my business).

For developing, testing, and running my software, I needed to manage custom Python installations. That involved building CPython from source, many times. At some point around 2015 I adopted pyenv for doing exactly that. Back then, I also adopted pyenv-virtualenv for managing so-called “virtual environments” derived from the individual Python builds managed by pyenv.

Let me just say: this method has been a joy to use over the years. It has never let me down; it has always provided simplicity, reliability, and a predictable outcome on which serious engineering work could be based. I have adopted it for continuous integration pipelines; I use it for building container images, towards more reproducible test environments. It always works.

I want to share the boring stuff I did today with pyenv/pyenv-virtualenv: the same things I have done for years, always with a smile, appreciating that it “just worked as expected”.

Today, for a program I am developing right now, I wanted to set up a CPython 3.6.9 build, with a virtual environment derived from it for installing specific dependencies. I started with

cd ~/.pyenv && git pull

for updating my local pyenv installation so that it learns about current Python releases (I have been using the same pyenv installation for years and have migrated it between development machines). I then issued the following command for pyenv to download a CPython 3.6.9 source archive and build it:

pyenv install 3.6.9

I found that the resulting build didn’t provide the lzma module because I didn’t have the corresponding header files available on my Fedora installation. I installed the relevant header files with a

dnf install xz-devel

and simply repeated

pyenv install 3.6.9

This resulted in a CPython build including the lzma module. I then went ahead and derived a virtual environment from that build, for the specific purpose of developing my current pet project:

pyenv virtualenv 3.6.9 venv369-goeffel

I then activated said virtual environment with

pyenv activate venv369-goeffel

and started to use it for specific development work.

Not spectacular, right? That’s it, precisely. The power here comes from the simplicity and reliability. A big thank you goes out to the pyenv maintainers.

The following listing probably shows that I should start deleting old things, but it also shows the use case where pyenv/pyenv-virtualenv really shines:

$ pyenv versions
* system (set by /home/jp/.pyenv/version)
  2.7.15
  2.7.15/envs/gipcdev-2715
  3.5.2
  3.5.2/envs/venv352-analysis-2
  3.5.2/envs/venv352-beets
  3.5.2/envs/venv352-bouncer
  3.5.2/envs/venv352-bouncer19
  3.5.2/envs/venv352-bouncer19-2
  3.5.2/envs/venv352-bouncer-open
  3.5.2/envs/venv352-dcos-commons-simumation
  3.5.2/envs/venv352-jupyter
  3.5.2/envs/venv352-modern-crypto
  3.6.6
  3.6.6/envs/dcos-docker
  3.6.6/envs/dcos-e2e-zk35
  3.6.6/envs/gipc-py-366-gevent14
  3.6.6/envs/messer-hdf5
  3.6.6/envs/pandas-dev-env
  3.6.6/envs/test-gipc-100-release
  3.6.6/envs/test-gipc-101-release
  3.6.6/envs/venv366-bouncer-enterprise
  3.6.6/envs/venv366-bouncer-open
  3.6.6/envs/venv366-dataanalysis
  3.6.9
  bouncer-deptest
  cryptotest
  dataanalysis
  dcos-docker
  dcos-e2e-zk35
  gipcdev-2715
  gipc-py-366-gevent14
  messer-hdf5
  pandas-dev-env
  pypy2.7-5.9.0
  pypy2.7-5.9.0/envs/pypy27-gevent-gipc-dev
  pypy27-gevent-gipc-dev
  pypy3.5-5.9.0
  pypy3.5-5.9.0/envs/pypy35-gevent-gipc-dev
  pypy3.5-6.0.0
  pypy3.5-6.0.0/envs/venvpypy35600-gipc
  pypy35-gevent-gipc-dev
  pypy-5.6.0
  pypy-5.6.0/envs/venvpypy560-gevent120-gipc
  test-gipc-100-release
  test-gipc-101-release
  venv2710-kazoo
  venv2710-pysamltest
  venv2710-samldev
  venv342
  venv342-bouncer
  venv342-bouncer2
  venv342-bouncerdeps
  venv342-bouncerdeps2
  venv342-bouncerold
  venv342-bouncertmep
  venv342-pyoidc
  venv342-samldev
  venv342-test
  venv342-tmp
  venv343-bouncer
  venv343-saml
  venv344-bouncer
  venv344-consensus
  venv344-dev
  venv344-gunicorndev
  venv345-bouncer
  venv345-cryptography-patch
  venv345-kazam
  venv352-analysis-2
  venv352-beets
  venv352-bouncer
  venv352-bouncer19
  venv352-bouncer19-2
  venv352-bouncer-open
  venv352-dcos-commons-simumation
  venv352-jupyter
  venv352-modern-crypto
  venv366-bouncer-enterprise
  venv366-bouncer-open
  venv366-dataanalysis
  venv-dcos-342
  venv-graphtool
  venv-kazam
  venvpy3boto3
  venvpypy35600-gipc
  venvpypy560-gevent120-gipc
  venvtmp

gipc 1.0.0 release

More than three years after the 0.6.0 release, I have published gipc version 1.0.0 today.


Release highlights

This release focuses on reliability and platform compatibility. It brings a number of changes relevant for running gipc on Windows and macOS, as well as for running it under PyPy.

Both gevent 1.2 and 1.3 are now officially supported. On Linux, gipc now officially supports CPython 2.7, 3.4, 3.5, 3.6, PyPy2.7, and PyPy3. On Windows, gipc officially supports gevent 1.3 on CPython 2.7, 3.4, 3.5, 3.6, and 3.7. Support for gevent 1.1 and CPython 3.3 has been dropped.

The API did not change. In view of the API’s stability over recent years, I thought it was time to officially declare it as such and to follow point 5 of the semantic versioning spec: “Version 1.0.0 defines the public API” :-).
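
For readers who have not used gipc before, here is a minimal usage sketch of that (unchanged) API. It follows the pattern of the documented example programs; please refer to the documentation at https://gehrcke.de/gipc for the authoritative examples:

import gipc


def child(reader):
    # Runs in a child process; receives a single message through the pipe.
    print("child received: %s" % reader.get())


if __name__ == "__main__":
    with gipc.pipe() as (reader, writer):
        p = gipc.start_process(target=child, args=(reader,))
        writer.put("hello from the parent process")
        p.join()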

For this release most of the work went into

  • fixing a small number of platform-dependent bugs (this one was interesting, and this one was pretty insightful and ugly).
  • setting up a continuous integration (CI) pipeline for Linux and macOS (on Travis CI) as well as for Windows (via AppVeyor).
  • rewriting and restyling significant parts of the documentation: the new docs are online and can be found at https://gehrcke.de/gipc (for comparison: old docs).
  • moving the repository from Bitbucket to GitHub (I also migrated issues using this well-engineered helper).
  • making tests more stable.
  • running the example programs as part of CI, on all supported platforms (required a number of consolidations).

Acknowledgements

I would like to thank the following people who have helped with this release, be it by submitting bug reports, asking great questions, testing, or contributing a bit of code: Heungsub Lee, James Addison, Akhil Acharya, Oliver Margetts.

Changelog for this release

For the record, here is the complete changelog for this release, copied from CHANGELOG.rst:

New platform support:

  • Add support for PyPy on Linux. Thanks to Oliver Margetts and to Heungsub Lee for patches.

Fixes:

  • Fix a bug as of which gipc crashed when passing “raw” pipe handles between processes on Windows (see issue #63).
  • Fix can't pickle gevent._semaphore.Semaphore error on Windows.
  • Fix ModuleNotFoundError in test_wsgi_scenario.
  • Fix signal handling in example infinite_send_to_child.py.
  • Work around segmentation fault after fork on Mac OS X (affected test_wsgi_scenario and example program wsgimultiprocessing.py).

Test / continuous integration changes:

  • Fix a rare instability in test_exitcode_previous_to_join.
  • Make test_time_sync more stable.
  • Run the example programs as part of CI (run all on Linux and Mac, run most on Windows).
  • Linux main test matrix (all combinations are covered):
    • gevent dimension: gevent 1.2.x, gevent 1.3.x.
    • Python implementation dimension: CPython 2.7, 3.4, 3.5, 3.6, PyPy2.7, PyPy3.
  • Also test on Linux: CPython 3.7, pyenv-based PyPy3 and PyPy2.7 (all with gevent 1.3.x only).
  • Mac OS X tests (all with gevent 1.3.x):
    • pyenv Python builds: CPython 2.7, 3.6, PyPy3
    • system CPython
  • On Windows, test with gevent 1.3.x and CPython 2.7, 3.4, 3.5, 3.6, 3.7.

Potentially breaking changes:

  • gevent 1.1 is not tested anymore.
  • CPython 3.3 is not tested anymore.

How to raise UnicodeDecodeError in Python 3

For debugging work, I needed to manually raise UnicodeDecodeError in CPython 3(.4). Its constructor requires 5 arguments:

>>> raise UnicodeDecodeError()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: function takes exactly 5 arguments (0 given)

The docs specify which properties an already instantiated UnicodeError (the base class) has: https://docs.python.org/3/library/exceptions.html#UnicodeError

These are five properties, and it is obvious that the constructor needs these five pieces of information. However, passing them requires knowing the expected order, since the constructor does not take keyword arguments. The signature is also not documented by help(UnicodeDecodeError), which suggests that this interface is implemented in C (which makes sense, as it sits in the bowels of CPython’s text processing).
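
For reference, these five properties can be inspected on any caught instance (the exact reason string may vary across CPython versions):

>>> try:
...     b'\xff'.decode('utf-8')
... except UnicodeDecodeError as exc:
...     print(exc.encoding, exc.object, exc.start, exc.end, exc.reason)
...
utf-8 b'\xff' 0 1 invalid start byte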

So, the only true reference for finding out the expected order of arguments is the C code implementing the constructor. It is defined by the function UnicodeDecodeError_init in the file Objects/exceptions.c in the CPython repository. The essential part is these lines:

if (!PyArg_ParseTuple(args, "O!OnnO!",
     &PyUnicode_Type, &ude->encoding,
     &ude->object,
     &ude->start,
     &ude->end,
     &PyUnicode_Type, &ude->reason)) {
         ude->encoding = ude->object = ude->reason = NULL;
         return -1;
}

That is, the order is the following:

  1. encoding (unicode object, i.e. type str)
  2. the object that was being decoded (a bytes object makes sense here; in fact, the only requirement is that the object provides the buffer interface)
  3. start (integer)
  4. end (integer)
  5. reason (type str)

Hence, now we know how to artificially raise a UnicodeDecodeError:

>>> o = b'\x00\x00'
>>> raise UnicodeDecodeError('funnycodec', o, 1, 2, 'This is just a fake reason!')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'funnycodec' codec can't decode byte 0x00 in position 1: This is just a fake reason!