Category Archives: Software architecture

Atomically switch a directory tree served by nginx

This post briefly demonstrates how to atomically switch a directory tree of static files served by nginx.

Consider the following minimal nginx config file:

$ cat conf/nginx.conf
events {
    use epoll;
}

http {
    server {
        location / {
            root /static/current;
        }
    }
}

The goal is to replace the directory /static/current atomically while nginx is running.

This snippet shows the directory layout that I started out with:

$ tree
├── conf
│   └── nginx.conf
└── static
    ├── version-1
    │   └── hello.html
    └── version-2
        └── hello.html

conf/nginx.conf is shown above. The static directory contains two subtrees, and the goal is to switch from version-1 to version-2.

For this demonstration I have started a containerized nginx from its official Docker image:

$ docker run -v $(realpath static):/static:ro -v $(realpath conf):/etc/nginx:ro -p nginx nginx -g 'daemon off;'

This mounts the ./static directory as well as the nginx configuration file into the container, and exposes nginx listening on port 8088 on the local network interface of the host machine.

Then, in the ./static directory one can choose the directory tree served by nginx by setting a symbolic link, and one can subsequently switch the directory tree atomically, as follows:

1) No symbolic link is set yet — leading to a 404 HTTP response (the path /static/current does not exist in the container from nginx’ point of view):

$ curl http://localhost:8088/hello.html
<head><title>404 Not Found</title></head>
<center><h1>404 Not Found</h1></center>

2) Set the current symlink to serve version-1:

$ cd static
$ ln -s version-1 current && curl http://localhost:8088/hello.html
hello 1

3) Prepare a new symlink for version-2 (but don’t switch yet):

$ ln -s version-2 newcurrent

4) Atomically switch to serving version-2:

$ mv -fT newcurrent current && curl http://localhost:8088/hello.html
hello 2

In step (4) it is essential to use mv -fT, which changes the symlink with a single rename() system call. ln -sfn would also appear to work, but it uses two system calls under the hood (an unlink() followed by a symlink()), and therefore leaves a brief time window during which opening files can fail because the path is invalid.
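The same atomic switch can be performed from Python via os.rename(), which issues the rename() system call directly. This is a minimal sketch (the function name is mine); directory names follow the example above:

```python
import os

def switch_current(static_dir, new_version):
    """Atomically repoint static_dir/current to new_version, like `mv -fT`."""
    tmp_link = os.path.join(static_dir, "newcurrent")
    cur_link = os.path.join(static_dir, "current")
    # Create the new symlink under a temporary name first ...
    os.symlink(new_version, tmp_link)
    # ... then rename() it over the old one. rename() replaces the target
    # atomically, so there is no moment at which "current" does not exist.
    os.rename(tmp_link, cur_link)
```

This mirrors steps (3) and (4) of the shell session above.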

Final directory layout including the symlink current (currently pointing to version-2):

$ tree
├── conf
│   └── nginx.conf
└── static
    ├── current -> version-2
    ├── version-1
    │   └── hello.html
    └── version-2
        └── hello.html


gipc 0.6.0 released

I have just released gipc 0.6.0 introducing support for Python 3. This release has been motivated by gevent 1.1 being just around the corner, which will also introduce Python 3 support.

Changes under the hood

The new gipc version has been verified to work on CPython 2.6.9, 2.7.10, 3.3.6, and 3.4.3. Python 3.4 support required significant changes under the hood: internally, gipc uses multiprocessing.Process, whose implementation drastically changed from Python 3.3 to 3.4. Most notably, on Windows, the arguments to the hard-coded CreateProcess() system call were changed, preventing automatic inheritance of file descriptors. Hence, implementation of a new approach for file descriptor transferral between processes was required for supporting Python 3.4 as well as future Python versions. Reassuringly, the unit test suite required close-to-zero adjustments.

The docs got a fresh look

I have used this opportunity to amend the documentation: it now has a fresh look based on a brutally modified RTD theme. This provides a much better reading experience as well as great support for narrow screens (i.e. mobile support). I hope you like it.

Who’s using gipc?

I have searched the web a bit for finding interesting use cases. These great projects use gipc:

Are you successfully applying gipc in production? That is always great to hear, so please drop me a line!


As usual, the release is available via PyPI. The project documentation provides API documentation, code examples, installation notes, and further in-depth information.

Git: list authors sorted by the time of their first contribution

I have created a micro Python script, git-authors, which parses the output of git log and outputs the authors in order of their first contribution to the repository.

By itself, this is a rather boring task. However, I think the resulting code is quite interesting, because it applies a couple of really important concepts and Python idioms within just a couple of lines. Hence, this small piece of code is not only useful in practice; it also serves an educational purpose.

The latter is what this article focuses on. What can you expect to learn? Less than ten lines of code are discussed. At the center of this discussion are efficient stream processing and proper usage of data structures: with a very simple and yet efficient architecture, git-authors can analyze more than 500,000 commits of the Linux git repository with a near-zero memory footprint, consuming only about 1 s of CPU time on a weak machine.

Usage: pipe formatted data from git log to git-authors

The recommended usage of git-authors is:

$ git log --encoding=utf-8 --full-history --reverse "--format=format:%at;%an;%ae" | \
    git-authors > authors-by-first-contrib.txt

A little background information might help for understanding this. git-authors expects to be fed with a data stream on standard input (stdin), composed of newline-separated chunks. Each chunk (line) is expected to represent one commit, in the following format (matching --format=format:%at;%an;%ae):

<unix timestamp>;<author name>;<author email>

Furthermore, git-authors expects to retrieve these commits sorted by time, ascendingly (newest last).

git log can be configured to output the required format (via --format=format:%at;%an;%ae) and to output commits sorted by time (default) with the earliest commits first (using --reverse).

git log writes its output data to standard output (stdout). The canonical method for connecting stdout of one process to stdin of another process is a pipe, a transport mechanism provided by the operating system.
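The pipe mechanism itself can be exercised directly from Python. A minimal sketch (with a made-up record, unrelated to git):

```python
import os

# Create an OS-level pipe: a read end and a write end (two file descriptors).
r, w = os.pipe()

# The "producer" writes into one end ...
os.write(w, b"1113690343;Jane Doe;jane@example.org\n")
os.close(w)

# ... and the "consumer" reads from the other end.
data = os.read(r, 4096)
os.close(r)
```

In the shell pipeline above, the shell sets up exactly this kind of channel between git log and git-authors.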

The command line options --encoding=utf-8 and --full-history are not discussed here, for simplicity.

The input evaluation loop

Remember, the git-authors program expects to retrieve a data stream via stdin. An example snippet of such a data stream could look like this:

1113690343;Bert Wesarg;
1113690343;Ken Chen;
1113690344;Christoph Hellwig;
1113690345;Bernard Blackham;
1113690346;Jan Kara;

The core of git-authors is a loop construct built of seven lines of code. It processes the input stream in a line-by-line fashion. Let’s have a look at it:

  1. seen = set()
  2. for line in stdin:
  3.     timestamp, name, mail = line.strip().split(";")
  4.     if name not in seen:
  5.         seen.add(name)
  6.         day = time.strftime("%Y-%m-%d", time.gmtime(float(timestamp)))
  7.         stdout.write("%04d (%s): %s (%s)\n" % (len(seen), day, name, mail))

There are a couple of remarks to be made about this code:

  • This code processes the stream retrieved at standard input in a line-by-line fashion: in line 2, the script makes use of the fact that Python streams (implemented via IOBase) support the iterator protocol, meaning that they can be iterated over, with a single line being yielded from the stream upon each iteration (until the resource has been entirely consumed).

  • The data flow in the loop is constructed in a way that the majority of the payload data (each line’s content) is processed right away and not stored for later use. This is a crucial concept, ensuring a small memory footprint and, even more importantly, a memory footprint that is (almost) independent of the size of the input data (which also means that the memory consumption is largely time-independent). The minimum amount of data that this program needs to keep track of across loop iterations is the collection of unique author names already seen (repetitions are to be discarded). And that is exactly what is stored in the set called seen. How large might this set become? An estimate: how many unique author names will the largest git project ever accumulate? A couple of thousand, maybe? As of summer 2015, the Linux git repository counts more than half a million commits and about 13,000 unique author names, and Linux should have one of the largest if not the largest git history. It can safely be assumed that O(10^5) short strings is the maximum amount of data this program will ever need to store in memory. How much memory is required for storing a couple of thousand short strings in Python? You might want to measure this, but it is almost nothing, at least compared to how much memory is built into smartphones. When analyzing the Linux repository, the memory footprint of git-authors stayed well below 10 MB.

  • Line 3 demonstrates how powerful Python’s string methods are, especially when cascaded. It also shows how useful multiple assignment via sequence unpacking can be.

  • “Use the right data structure(s) for any given problem!” is an often-preached rule, for ensuring that the time complexity of the applied algorithm is not larger than the problem requires. Lines 1, 4, and 5 are a great example for this, I think. Here, we want to keep track of the authors that have already been observed in the stream (“seen”). This kind of problem naturally requires a data structure allowing for lookup (Have I seen you yet?) and insert (I’ve seen you!) operations. Generally, a hash table-like data structure fits these problems best, because it provides O(1) (constant) complexity for both lookup and insertion. In Python, the dictionary implementation as well as the set implementation are based on a hash table (in fact, these implementations share a lot of code). Both dicts and sets also report their length via len() in constant time, which I have made use of in line 7. Hence, the run time of this algorithm is proportional to the input size (to the number of lines in the input). It is impossible to scale better than that (every line needs to be looked at), but there are many sub-optimal ways to implement a solution that scales worse than linearly.

  • The reason why I chose to use a set instead of a dictionary is rather subtle: I think the add() semantics of set fit the given problem really well, and here we really just want to keep track of keys (the typical key-value association of the dictionary is not required here). Performance-wise, the choice shouldn’t make a significant difference.

  • Line 6 demonstrates the power of the time module. Take a Unix timestamp and generate a human-readable time string from it: easy. In my experience, strftime() is one of the most-used methods in the time module, and learning its format specifiers by heart can be really handy.

  • Line 7: old-school powerful Python string formatting with a very compact syntax. Yes, the “new” way for string formatting (PEP 3101) has been around for years, and deprecation of the old style was once planned. Truth is, however, that the old-style formatting is just too beloved and established, and will probably never even become deprecated, let alone removed. Its functionality was just extended in Python 3.5 via PEP 461.
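As a standalone illustration of lines 6 and 7, the following processes a single made-up author record (name and email are placeholders):

```python
import time

# One input chunk, as emitted by `git log --format=format:%at;%an;%ae`.
timestamp, name, mail = "1113690343;Jane Doe;jane@example.org".split(";")

# Line 6: Unix timestamp -> human-readable UTC date via strftime().
day = time.strftime("%Y-%m-%d", time.gmtime(float(timestamp)))

# Line 7: old-style %-formatting, with a zero-padded four-digit counter.
record = "%04d (%s): %s (%s)" % (1, day, name, mail)
# record == "0001 (2005-04-16): Jane Doe (jane@example.org)"
```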

Preparation of input and output streams

What is not shown in the snippet above is the preparation of the stdin and stdout objects. I have come up with the following method:

import io
import sys

kwargs = {"errors": "replace", "encoding": "utf-8", "newline": "\n"}
stdin = io.open(sys.stdin.fileno(), **kwargs)
stdout = io.open(sys.stdout.fileno(), mode="w", **kwargs)

This is an extremely powerful recipe for obtaining the same behavior on Python 2 as well as on 3, but also on Windows as well as on POSIX-compliant platforms. There is a long story behind this which should not be the focus of this very article. In essence, Python 2 and Python 3 treat sys.stdin/sys.stdout very differently. Grabbing the underlying file descriptors via fileno() and creating TextIOWrapper stream objects on top of them is a powerful way to disable much of Python’s automagic and therefore to normalize behavior among platforms. The automagic I am referring to here especially includes Python 3’s platform-dependent automatic input decoding and output encoding, and universal newline support. Both really can add an annoying amount of complexity in certain situations, and this here is one such case.
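Putting the pieces together, a complete sketch of git-authors might look as follows (the function names are mine, not necessarily those of the actual script):

```python
import io
import sys
import time

def print_authors(instream, outstream):
    # Emit each author once, at the time of their first commit in the stream.
    seen = set()
    for line in instream:
        timestamp, name, mail = line.strip().split(";")
        if name not in seen:
            seen.add(name)
            day = time.strftime("%Y-%m-%d", time.gmtime(float(timestamp)))
            outstream.write(
                "%04d (%s): %s (%s)\n" % (len(seen), day, name, mail))

def main():
    # Normalize stdin/stdout behavior across Python versions and platforms.
    kwargs = {"errors": "replace", "encoding": "utf-8", "newline": "\n"}
    stdin = io.open(sys.stdin.fileno(), **kwargs)
    stdout = io.open(sys.stdout.fileno(), mode="w", **kwargs)
    print_authors(stdin, stdout)
    stdout.flush()
```

print_authors() operates on any pair of text streams, which also makes it easy to test without a real pipe.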

Example run on CPython’s (unofficial) git repository

I applied git-authors to the current state of the unofficial CPython repository hosted at GitHub. As a side note, this required about 0.1 s of CPU time on my test machine. I am showing the output in full length below, because I find its content rather interesting. We have to appreciate that the commit history is not entirely broken, despite CPython having switched between different version control systems over the last 25 years. Did you know that Just van Rossum also was a committer? :-)

0001 (1990-08-09): Guido van Rossum (
0002 (1992-08-04): Sjoerd Mullender (
0003 (1992-08-13): Jack Jansen (
0004 (1993-01-10): cvs2svn (
0005 (1994-07-25): Barry Warsaw (
0006 (1996-07-23): Fred Drake (
0007 (1996-12-09): Roger E. Masse (
0008 (1997-08-13): Jeremy Hylton (
0009 (1998-03-03): Ken Manheimer (
0010 (1998-04-09): Andrew M. Kuchling (
0011 (1998-12-18): Greg Ward (
0012 (1999-01-22): Just van Rossum (
0013 (1999-11-07): Greg Stein (
0014 (2000-05-12): Gregory P. Smith (
0015 (2000-06-06): Trent Mick (
0016 (2000-06-07): Marc-André Lemburg (
0017 (2000-06-09): Mark Hammond (
0018 (2000-06-29): Fredrik Lundh (
0019 (2000-06-30): Skip Montanaro (
0020 (2000-06-30): Tim Peters (
0021 (2000-07-01): Paul Prescod (
0022 (2000-07-10): Vladimir Marangozov (
0023 (2000-07-10): Peter Schneider-Kamp (
0024 (2000-07-10): Eric S. Raymond (
0025 (2000-07-14): Thomas Wouters (
0026 (2000-07-29): Moshe Zadka (
0027 (2000-08-15): David Scherer (
0028 (2000-09-07): Thomas Heller (
0029 (2000-09-08): Martin v. Löwis (
0030 (2000-09-15): Neil Schemenauer (
0031 (2000-09-21): Lars Gustäbel (
0032 (2000-09-24): Nicholas Riley (
0033 (2000-10-03): Ka-Ping Yee (
0034 (2000-10-06): Jim Fulton (
0035 (2001-01-10): Charles G. Waldman (
0036 (2001-03-22): Steve Purcell (
0037 (2001-06-25): Steven M. Gava (
0038 (2001-07-04): Kurt B. Kaiser (
0039 (2001-07-04): unknown (
0040 (2001-07-20): Piers Lauder (
0041 (2001-08-23): Finn Bock (
0042 (2001-08-27): Michael W. Hudson (
0043 (2001-10-31): Chui Tey (
0044 (2001-12-19): Neal Norwitz (
0045 (2001-12-21): Anthony Baxter (
0046 (2002-02-17): Andrew MacIntyre (
0047 (2002-03-21): Walter Dörwald (
0048 (2002-05-12): Raymond Hettinger (
0049 (2002-05-15): Jason Tishler (
0050 (2002-05-28): Christian Tismer (
0051 (2002-06-14): Steve Holden (
0052 (2002-09-23): Tony Lownds (
0053 (2002-11-05): Gustavo Niemeyer (
0054 (2003-01-03): David Goodger (
0055 (2003-04-19): Brett Cannon (
0056 (2003-04-22): Alex Martelli (
0057 (2003-05-17): Samuele Pedroni (
0058 (2003-06-09): Andrew McNamara (
0059 (2003-10-24): Armin Rigo (
0060 (2003-12-10): Hye-Shik Chang (
0061 (2004-02-18): David Ascher (
0062 (2004-02-20): Vinay Sajip (
0063 (2004-03-21): Nicholas Bastin (
0064 (2004-03-25): Phillip J. Eby (
0065 (2004-08-04): Matthias Klose (
0066 (2004-08-09): Edward Loper (
0067 (2004-08-09): Dave Cole (
0068 (2004-08-14): Johannes Gijsbers (
0069 (2004-09-17): Sean Reifschneider (
0070 (2004-10-16): Facundo Batista (
0071 (2004-10-21): Peter Astrand (
0072 (2005-03-28): Bob Ippolito (
0073 (2005-06-03): Georg Brandl (
0074 (2005-11-16): Nick Coghlan (
0075 (2006-03-30): Ronald Oussoren (
0076 (2006-04-17): George Yoshida (
0077 (2006-04-23): Gerhard Häring (
0078 (2006-05-23): Richard Jones (
0079 (2006-05-24): Andrew Dalke (
0080 (2006-05-25): Kristján Valur Jónsson (
0081 (2006-05-25): Jack Diederich (
0082 (2006-05-26): Martin Blais (
0083 (2006-07-28): Matt Fleming (
0084 (2006-09-05): Sean Reifscheider (
0085 (2007-03-08): Collin Winter (
0086 (2007-03-11): Žiga Seilnacht (
0087 (2007-06-07): Alexandre Vassalotti (
0088 (2007-08-16): Mark Summerfield (
0089 (2007-08-18): Travis E. Oliphant (
0090 (2007-08-22): Jeffrey Yasskin (
0091 (2007-08-25): Eric Smith (
0092 (2007-08-29): Bill Janssen (
0093 (2007-10-31): Christian Heimes (
0094 (2007-11-10): Amaury Forgeot d'Arc (
0095 (2008-01-08): Mark Dickinson (
0096 (2008-03-17): Steven Bethard (
0097 (2008-03-18): Trent Nelson (
0098 (2008-03-18): David Wolever (
0099 (2008-03-25): Benjamin Peterson (
0100 (2008-03-26): Jerry Seutter (
0101 (2008-04-16): Jeroen Ruigrok van der Werven (
0102 (2008-05-13): Jesus Cea (
0103 (2008-05-24): Guilherme Polo (
0104 (2008-06-01): Robert Schuppenies (
0105 (2008-06-10): Josiah Carlson (
0106 (2008-06-10): Armin Ronacher (
0107 (2008-06-18): Jesse Noller (
0108 (2008-06-23): Senthil Kumaran (
0109 (2008-07-22): Antoine Pitrou (
0110 (2008-08-14): Hirokazu Yamamoto (
0111 (2008-12-24): Tarek Ziadé (
0112 (2009-03-30): R. David Murray (
0113 (2009-04-01): Michael Foord (
0114 (2009-04-11): Chris Withers (
0115 (2009-05-08): Philip Jenvey (
0116 (2009-06-25): Ezio Melotti (
0117 (2009-08-02): Frank Wierzbicki (
0118 (2009-09-20): Doug Hellmann (
0119 (2010-01-30): Victor Stinner (
0120 (2010-02-23): Dirkjan Ochtman (
0121 (2010-02-24): Larry Hastings (
0122 (2010-02-26): Florent Xicluna (
0123 (2010-03-25): Brian Curtin (
0124 (2010-04-01): Stefan Krah (
0125 (2010-04-10): Jean-Paul Calderone (
0126 (2010-04-18): Giampaolo Rodolà (
0127 (2010-05-26): Alexander Belopolsky (
0128 (2010-08-06): Tim Golden (
0129 (2010-08-14): Éric Araujo (
0130 (2010-08-22): Daniel Stutzbach (
0131 (2010-09-18): Brian Quinlan (
0132 (2010-11-05): David Malcolm (
0133 (2010-11-09): Ask Solem (
0134 (2010-11-10): Terry Reedy (
0135 (2010-11-10): Łukasz Langa (
0136 (2012-06-24): Ned Deily (
0137 (2011-01-14): Eli Bendersky (
0138 (2011-03-10): Eric V. Smith (
0139 (2011-03-10): R David Murray (
0140 (2011-03-12): orsenthil (
0141 (2011-03-14): Ross Lagerwall (
0142 (2011-03-14): Reid Kleckner (
0143 (2011-03-14): briancurtin (
0144 (2011-03-24): guido (
0145 (2011-03-30): Kristjan Valur Jonsson (
0146 (2011-04-04): brian.curtin (
0147 (2011-04-12): Nadeem Vawda (
0148 (2011-04-19): Giampaolo Rodola' (
0149 (2011-05-04): Alexis Metaireau (
0150 (2011-05-09): Gerhard Haering (
0151 (2011-05-09): Petri Lehtinen (
0152 (2011-05-24): Charles-François Natali (
0153 (2011-07-17): Alex Gaynor (
0154 (2011-07-27): Jason R. Coombs (
0155 (2011-08-02): Sandro Tosi (
0156 (2011-09-28): Meador Inge (
0157 (2012-01-09): Terry Jan Reedy (
0158 (2011-05-19): Tarek Ziade (
0159 (2011-05-22): Martin v. Loewis (
0160 (2011-05-31): Ralf Schmitt (
0161 (2011-09-12): Jeremy Kloth (
0162 (2012-03-14): Andrew Svetlov (
0163 (2012-03-21): krisvale (
0164 (2012-04-24): Marc-Andre Lemburg (
0165 (2012-04-30): Richard Oudkerk (
0166 (2012-05-15): Hynek Schlawack (
0167 (2012-06-20): doko (
0168 (2012-07-16): Atsuo Ishimoto (
0169 (2012-09-02): Zbigniew Jędrzejewski-Szmek (
0170 (2012-09-06): Eric Snow (
0171 (2012-09-25): Chris Jerdonek (
0172 (2012-12-27): Serhiy Storchaka (
0173 (2013-03-31): Roger Serwy (
0174 (2013-03-31): Charles-Francois Natali (
0175 (2013-05-10): Andrew Kuchling (
0176 (2013-06-14): Ethan Furman (
0177 (2013-08-12): Felix Crux (
0178 (2013-10-21): Peter Moody (
0179 (2013-10-25): bquinlan (
0180 (2013-11-04): Zachary Ware (
0181 (2013-12-02): Walter Doerwald (
0182 (2013-12-21): Donald Stufft (
0183 (2014-01-03): Daniel Holth (
0184 (2014-01-27): Yury Selivanov (
0185 (2014-04-15): Kushal Das (
0186 (2014-06-29): Berker Peksag (
0187 (2014-07-16): Tal Einat (
0188 (2014-10-08): Steve Dower (
0189 (2014-10-18): Robert Collins (
0190 (2015-03-22): Paul Moore (


Similar functionality is provided by the more full-blown frameworks grunt-git-authors and gitstats.

Some resources that might be insightful for you:

Official WordPress themes should have an official change log

Officially supported themes: TwentyXXX

My website is WordPress-backed. WordPress front-ends are called “themes”. There are official themes, released by WordPress/Automattic, and there are thousands of themes released by third parties. While the WordPress project has released many themes, not all of them are equally “important”. There is only one specific series of WordPress themes that is, so to speak, the most official: themes from the TwentyXXX series.

The issue: no update release notes

In this series, WordPress releases one theme per year (there was TwentyEleven, TwentyTwelve, TwentyThirteen, you get the point). The most recent one of these themes is included with every major release of WordPress. In other words: it does not get more official. Correspondingly, themes from this series enjoy long-term support by the WordPress project. That is, they receive maintenance updates even years after their initial release (TwentyEleven was last updated at the end of 2014, for instance). That is great, really! However, there is one very negative aspect to these updates: there are no official release notes. That’s horrible, thinking in engineering terms, and considering the release ethics applied in other serious open source software projects.

Background: dependency hell

TwentyXXX theme updates are released rather silently: suddenly, the WordPress dashboard shows that there is an update. But there is no official change log or release note which one could base a decision on. Nothing, apart from an increased version number. That is different from updating WordPress plugins, where the change log usually is only one click away from the WordPress dashboard. Also, the theme version number cannot be relied upon to be semantically expressive (AFAIK WordPress themes are not promised to follow semantic versioning, right?).

Now, some of you may think that newer always is better. Just update and trust the developers. But that is not how things work in real life. Generally, we should stick to the paradigm of “never change a running system”, unless […]: sometimes, an update might change behavior, which might not be desired. Sometimes an update might fix a security issue, which one should know about and apply immediately. Or the update resolves a usability issue. Such considerations apply to updates for any kind of software. But, in the context of WordPress, there is an even more important topic to consider when updating a theme: an update might break child themes. Or, as expressed by xkcd: “Every change breaks someone’s workflow”:

A theme can be used by other developers, as a so-called parent theme, in a library fashion — it provides a programming interface. This affects many websites, like mine: a couple of years ago I decided to base the theme used on my website (here) on the TwentyTwelve theme. I went ahead and created a child theme, which inherits most of its code from TwentyTwelve and changes layout and behavior only in a few aspects. I definitely cannot blindly press the “update” button when TwentyTwelve receives an update: this might immediately change the interface I developed my child theme against, and can consequently break any component of my child theme. Obviously, I cannot just try this out on my live/public website. So, I have to test such an update beforehand, in a development environment which is not public.

If proper release notes were available, I could possibly skip that testing and apply such an update right away if it’s just a minor one. Or, I would be alerted that there is a security hole fixed with a breaking change in the parent theme, and I’d know that I have to quickly react and re-work my child theme so that I can safely apply the update to the parent. These things need to be communicated, like in any other open source project with a decent release policy.

Concluding remarks

Yes, there are ways to reconstruct and analyze the code changes that were made. This URL structure actually is quite helpful for generating diffs between theme versions: That URL shows differences between TwentyTwelve 1.4 and 1.6. The same structure can be used for other official themes and version combinations. However, this does not replace a proper change log. WordPress is a mature, large-scale open source project with a huge developer community. Themes from the TwentyXXX series are a major component of this project. The project should provide change logs and/or release notes for every update — for compliance with expectations, and for enabling sound engineering decisions. Others want this, too:

Can any one point me to the release notes for 1.2 or a list of the applied changes? Updating from 1.1 has caused some minor, but unexpected presentation changes on one of my child themes, and I’d like to know what else has changed and what to test for before I upgrade further sites.

Songkick events for Google’s Knowledge Graph

Google can display upcoming concert events in the Knowledge Graph of musical artists (as announced in March 2014). This is a great feature, and probably many people in the field of music marketing, and especially record labels, aim to get this kind of data into the Knowledge Graph for their artists. However, Google does not magically find this data on its own. It needs to be informed via a special kind of data structure (in the recently standardized JSON-LD format) contained within the artist’s website.

While of great interest to record labels, finding a proper technical solution to create and provide this data to Google still might be a challenge. I have prepared a web service that greatly simplifies the process of generating the required data structure. It pulls concert data from Songkick and translates them into the JSON-LD representation as required by Google. In the next section I explain the process by means of an example.

Web service usage example

The concert data of the band Milky Chance is published and maintained via Songkick, a service that many artists use. The following website shows — among others — all upcoming events of Milky Chance: My web service translates the data held by Songkick into the data structure that Google requires in order to make this concert data appear in their Knowledge Graph. This is the corresponding service URL that needs to be called to retrieve the data:

That URL is made of the base URL of the web service, the Songkick ID of the artist (6395144 in this case), the artist name, and the artist website URL. Try accessing that service URL in your browser. It currently yields this:

[
  {
    "@context": "",
    "@type": "MusicEvent",
    "name": "Milky Chance",
    "startDate": "2014-12-12",
    "url": "",
    "location": {
      "address": {
        "addressLocality": "Kiel",
        "postalCode": "24116",
        "streetAddress": "Eichhofstra\u00dfe 1",
[ ... SNIP ~ 1000 lines of data ... ]
    "performer": {
      "sameAs": "",
      "@type": "MusicGroup",
      "name": "Milky Chance"
    }
  }
]
This piece of data needs to be included in the HTML source code of the artist website. Google then automatically finds this data and eventually displays the concert data in the Knowledge Graph (within a couple of days). That’s it — pretty simple, right? The good thing is that this method does not require layout changes to your website. This data can literally be included in any website, right now.
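As a sketch of the embedding step, the following renders a JSON-LD payload inside the script element that Google’s structured-data parsers look for (the payload here is a minimal placeholder, not real event data):

```python
import json

# Placeholder payload; in practice this is the list returned by the service.
events = [{
    "@type": "MusicEvent",
    "name": "Milky Chance",
    "startDate": "2014-12-12",
}]

# JSON-LD is embedded in HTML via a script element of this specific type.
snippet = '<script type="application/ld+json">\n%s\n</script>' % (
    json.dumps(events, indent=2),)
```

The resulting snippet can be pasted anywhere into the page’s HTML, which is why no layout changes are required.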

That is what happened in the case of Milky Chance: some time ago, the data created by the web service was fed into the Milky Chance website. Consequently, their concert data is displayed in their Knowledge Graph. See for yourself: access and look out for upcoming events on the right-hand side. Screenshot:


Google Knowledge Graph generated for Milky Chance. Note the upcoming events section: for this to appear, Google needs to find the event data in a special markup within the artist’s website.

So, in summary, when would you want to use this web service?

  • You have an interest in presenting the concert data of an artist in Google’s Knowledge Graph (you are a record label, or are otherwise interested in improved marketing and user experience).
  • You have access to the artist website or know someone who has access.
  • The artist concert data already is present on Songkick or will be present in the future.

Then all you need is a specialized service URL, which you can generate with a small form I have prepared for you here:

Background: why Songkick?

Of course, the event data shown in the Knowledge Graph should be up to date and in sync with presentations of the same data in other places (bands usually display their concert data in many places: on Facebook, on their website, within third-party services, …). Fortunately, a lot of bands actually do manage this data in a central place (any other solution would be tedious). This central place/platform/service often is Songkick, because Songkick really did a nice job of providing people with what they need. My web service reflects recent changes made within Songkick.

Technical detail

The core of the web service is a piece of software that translates the data provided by Songkick into the JSON-LD data as required and specified by Google. The Songkick data is retrieved via Songkick’s JSON API (I applied for and got a Songkick API key). Large parts of this software deal with the unfortunate business of data format translation while handling certain edge cases.

The service is implemented in Python and hosted on Google App Engine. Its architecture is quite well thought through (for instance, it uses memcache and asynchronous urlfetch wherever possible). It is ready to scale, so to speak. Some technical highlights:

  • The web service enforces transport encryption (HTTPS).
  • Songkick back-end is queried via HTTPS only.
  • Songkick back-end is queried concurrently whenever possible.
  • Songkick responses are cached for several hours in order to reduce load on their service.
  • Responses of this web service are cached for several hours; cached responses are served within milliseconds.

This is an overview of the data flow:

  1. Incoming request, specifying Songkick artist ID, artist name, and artist website.
  2. Using the Songkick API (SKA), all upcoming events are queried for this artist (one or more SKA requests, depending on the number of events).
  3. For each event, the venue ID is extracted, if possible.
  4. All venues are queried for further details (this implies as many SKA requests as venue IDs extracted).
  5. A JSON-LD representation of an event is constructed from a combination of
    • event data
    • venue data
    • user-given data (artist name and artist website)
  6. All event representations are combined and returned.
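Step 5 can be sketched as a pure translation function. The field names below are hypothetical, loosely modeled after the output excerpt above, and do not necessarily match Songkick’s actual API schema or the service’s real code:

```python
def event_to_jsonld(event, venue, artist_name, artist_url):
    # Combine event data, venue data, and user-given data (artist name and
    # website) into one schema.org MusicEvent representation.
    return {
        "@type": "MusicEvent",
        "name": artist_name,
        "startDate": event["date"],
        "location": {"address": venue.get("address", {})},
        "performer": {
            "@type": "MusicGroup",
            "name": artist_name,
            "sameAs": artist_url,
        },
    }
```

Keeping this step free of I/O is what makes the surrounding request handling easy to cache and parallelize.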

Some notable points in this context:

  • A single request to this web service might implicate many requests to the Songkick API. This is why SKA responses are aggressively cached:
    • An example artist with 54 upcoming events requires 2 upcoming-events API requests (two pages, which cannot be requested concurrently) and roughly 50 venue API requests (which can be requested concurrently). Summed up, this implies that my web service cannot respond faster than three SKA round trips take.
    • If none of the SKA responses has been cached before, the retrieval of about 2 + 50 SKA responses might easily take about 2 seconds.
    • This web service cannot be faster than Songkick delivers.
  • This web service applies graceful degradation when extracting data from Songkick (many special cases are handled, which is especially relevant for the venue address).

Generate your service URL

This blog post is just an introduction, and sheds some light on the implementation and decision-making. For general reference, I have prepared this document to get you started:

It contains a web form where you can enter the (currently) three input parameters required for using the service. It returns a service URL for you. This URL points to my application hosted on Google App Engine. Using this URL, the service returns the JSON data that is to be included in an artist’s website. That’s all, it’s really pretty simple.

So, please go ahead and use this tool. I’d love to receive some feedback. Look closely at the data it returns, and keep your eyes open for subtle bugs. If you see something weird, please report it. I am very open to suggestions, and also interested in your questions regarding future plans, release cycle, etc. Also, if you need support for (dynamically) including this kind of data in your artist’s website, feel free to contact me.