Category Archives: Software architecture

gipc 0.6.0 released

I have just released gipc 0.6.0, which introduces support for Python 3. This release was motivated by gevent 1.1 being just around the corner, which will also introduce Python 3 support.

Changes under the hood

The new gipc version has been verified to work on CPython 2.6.9, 2.7.10, 3.3.6, and 3.4.3. Python 3.4 support required significant changes under the hood: internally, gipc uses multiprocessing.Process, whose implementation drastically changed from Python 3.3 to 3.4. Most notably, on Windows, the arguments to the hard-coded CreateProcess() system call were changed, preventing automatic inheritance of file descriptors. Hence, implementation of a new approach for file descriptor transferral between processes was required for supporting Python 3.4 as well as future Python versions. Reassuringly, the unit test suite required close-to-zero adjustments.

The docs got a fresh look

I have used this opportunity to amend the documentation: it now has a fresh look based on a brutally modified RTD theme. This provides a much better reading experience as well as great support for narrow screens (i.e. mobile support). I hope you like it: https://gehrcke.de/gipc

Who’s using gipc?

I have searched the web a bit to find interesting use cases. These great projects use gipc:

Are you successfully applying gipc in production? That is always great to hear, so please drop me a line!

Availability

As usual, the release is available via PyPI (https://pypi.python.org/pypi/gipc). Please visit https://gehrcke.de/gipc for API documentation, code examples, installation notes, and further in-depth information.

Git: list authors sorted by the time of their first contribution

I have created a micro Python script, git-authors, which parses the output of git log and outputs the authors in order of their first contribution to the repository.

By itself, this is a rather boring task. However, I think the resulting code is quite interesting, because it applies a couple of really important concepts and Python idioms within just a couple of lines. Hence, this small piece of code is not only useful in practice; it also serves an educational purpose.

The latter is what this article focuses on. What can you expect to learn? Fewer than ten lines of code are discussed. At the center of this discussion are efficient stream processing and proper usage of data structures: with a very simple yet efficient architecture, git-authors can analyze the more than 500,000 commits of the Linux git repository with near-zero memory requirements, consuming only about 1 s of CPU time on a weak machine.

Usage: pipe formatted data from git log to git-authors

The recommended usage of git-authors is:

$ git log --encoding=utf-8 --full-history --reverse "--format=format:%at;%an;%ae" | \
    git-authors > authors-by-first-contrib.txt

A little background information might help for understanding this. git-authors expects to be fed with a data stream on standard input (stdin), composed of newline-separated chunks. Each chunk (line) is expected to represent one commit, and is expected to be of a certain format:

timestamp;authorname;authoremail

Furthermore, git-authors expects to receive these commits sorted by time in ascending order (newest last).

git log can be configured to output the required format (via --format=format:%at;%an;%ae) and to output commits sorted by time (default) with the earliest commits first (using --reverse).

git log writes its output data to standard output (stdout). The canonical method for connecting stdout of one process to stdin of another process is a pipe, a transport mechanism provided by the operating system.

The command line options --encoding=utf-8 and --full-history are not discussed here, for simplicity.

The input evaluation loop

Remember, the git-authors program expects to receive a data stream via stdin. An example snippet of such a data stream could look like this:

[...]
1113690343;Bert Wesarg;wesarg@informatik.uni-halle.de
1113690343;Ken Chen;kenneth.w.chen@intel.com
1113690344;Christoph Hellwig;hch@lst.de
1113690345;Bernard Blackham;bernard@blackham.com.au
1113690346;Jan Kara;jack@suse.cz
[...]

The core of git-authors is a loop construct built of seven lines of code. It processes this input stream line by line. Let’s have a look at it:

  1. seen = set()
  2. for line in stdin:
  3.     timestamp, name, mail = line.strip().split(";")
  4.     if name not in seen:
  5.         seen.add(name)
  6.         day = time.strftime("%Y-%m-%d", time.gmtime(float(timestamp)))
  7.         stdout.write("%04d (%s): %s (%s)\n" % (len(seen), day, name, mail))

There are a couple of remarks to be made about this code:

  • This code processes the stream received on standard input in a line-by-line fashion: in line 2, the script makes use of the fact that Python streams (implemented via IOBase) support the iterator protocol, meaning that they can be iterated over, with a single line being yielded from the stream upon each iteration (until the resource has been entirely consumed).

  • The data flow in the loop is constructed so that the majority of the payload data (the line’s content) is processed right away and not stored for later usage. This is a crucial concept: it ensures a small memory footprint and, even more importantly, a memory footprint that is (almost) independent of the size of the input data (which also means that the memory consumption stays largely constant over the run time). The minimum amount of data that this program needs to keep track of across loop iterations is the collection of unique author names already seen (repetitions are to be discarded). And that is exactly what is stored in the set called seen. How large might this set become? An estimate: how many unique author names will the largest git project ever accumulate? A couple of thousand maybe? As of summer 2015, the Linux git repository counts more than half a million commits and about 13,000 unique author names. Linux should have one of the largest git histories, if not the largest. It can safely be assumed that O(10^5) short strings is the maximum amount of data this program will ever need to store in memory. How much memory is required for storing a couple of thousand short strings in Python? You might want to measure this, but it is almost nothing, at least compared to how much memory is built into smartphones. When analyzing the Linux repository, the memory footprint of git-authors stayed well below 10 MB.

  • Line 3 demonstrates how powerful Python’s string methods are, especially when cascaded. It also shows how useful multiple assignment via sequence unpacking can be.

  • “Use the right data structure(s) for any given problem!” is an often-preached rule for ensuring that the time complexity of the applied algorithm is not larger than the problem requires. Lines 1, 4, and 5 are a great example of this, I think. Here, we want to keep track of the authors that have already been observed in the stream (“seen”). This kind of problem naturally requires a data structure allowing for lookup (Have I seen you yet?) and insert (I’ve seen you!) operations. Generally, a hash table-like data structure fits these problems best, because it provides O(1) (constant) average complexity for both lookup and insertion. In Python, the dictionary implementation as well as the set implementation are both based on a hash table (in fact, these implementations share a lot of code). Both dicts and sets also support the built-in len() function with constant complexity, which I have made use of in line 7. Hence, the run time of this algorithm is proportional to the input size (to the number of lines in the input). It is impossible to scale better than that (every line needs to be looked at), but there are many sub-optimal ways to implement a solution that scales worse than linearly.

  • The reason why I chose to use a set instead of a dictionary is rather subtle: I think the add() semantics of set fit the given problem really well, and here we really just want to keep track of keys (the typical key-value association of the dictionary is not required here). Performance-wise, the choice shouldn’t make a significant difference.

  • Line 6 demonstrates the power of the time module. Take a Unix timestamp and generate a human-readable time string from it: easy. In my experience, strftime() is one of the most-used methods in the time module, and learning its format specifiers by heart can be really handy.

  • Line 7: old-school powerful Python string formatting with a very compact syntax. Yes, the “new” way for string formatting (PEP 3101) has been around for years, and deprecation of the old style was once planned. Truth is, however, that the old-style formatting is just too beloved and established, and will probably never even become deprecated, let alone removed. Its functionality was just extended in Python 3.5 via PEP 461.
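The remarks above can be tried out without a git repository. What follows is a minimal sketch, not the actual git-authors source: it wraps the same loop in a generator function (the name first_contributions is made up for this illustration) and feeds it from an in-memory stream. As one deliberate deviation, it splits the timestamp off from the left and the email off from the right, on the assumption that an author name might contain a semicolon. The third sample line is a made-up repeat that only serves to show the deduplication.

```python
import io
import time


def first_contributions(lines):
    """Yield one report line per author, in order of first appearance.

    `lines` is any iterable of "timestamp;name;email" strings sorted by
    time in ascending order, for example an open text stream.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Timestamp from the left, email from the right: a ";" inside
        # the author name then does not break the parsing.
        timestamp, rest = line.split(";", 1)
        name, mail = rest.rsplit(";", 1)
        if name not in seen:
            seen.add(name)
            day = time.strftime("%Y-%m-%d", time.gmtime(float(timestamp)))
            yield "%04d (%s): %s (%s)" % (len(seen), day, name, mail)


# In-memory stand-in for stdin; the last line repeats an earlier author.
sample = io.StringIO(
    "1113690343;Bert Wesarg;wesarg@informatik.uni-halle.de\n"
    "1113690343;Ken Chen;kenneth.w.chen@intel.com\n"
    "1113690344;Bert Wesarg;wesarg@informatik.uni-halle.de\n"
)
report = list(first_contributions(sample))
for entry in report:
    print(entry)
```

Because the function is a generator, it keeps the streaming property discussed above: each input line is handled and forgotten, and only the set of names grows.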

Preparation of input and output streams

What is not shown in the snippet above is the preparation of the stdin and stdout objects. I have come up with the following method:

import io
import sys
# Wrap the raw file descriptors in text streams with explicit settings.
kwargs = {"errors": "replace", "encoding": "utf-8", "newline": "\n"}
stdin = io.open(sys.stdin.fileno(), **kwargs)
stdout = io.open(sys.stdout.fileno(), mode="w", **kwargs)

This is an extremely powerful recipe for obtaining the same behavior on Python 2 as well as on 3, but also on Windows as well as on POSIX-compliant platforms. There is a long story behind this which should not be the focus of this very article. In essence, Python 2 and Python 3 treat sys.stdin/sys.stdout very differently. Grabbing the underlying file descriptors by their balls via fileno() and creating TextIOWrapper stream objects on top of them is a powerful way to disable much of Python’s automagic and therefore to normalize behavior among platforms. The automagic I am referring to here especially includes Python 3’s platform-dependent automatic input decoding and output encoding, and universal newline support. Both really can add an annoying amount of complexity in certain situations, and this here is one such case.
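To see what these keyword arguments do, here is a small sketch for Python 3 that applies the same io.open() call to an operating system pipe instead of the real stdin. The helper name open_text_fd and the sample bytes (including the deliberately invalid byte 0xff and the Windows-style line ending) are made up for this illustration.

```python
import io
import os


def open_text_fd(fd, mode="r"):
    """Wrap a raw file descriptor in a text stream with explicit settings."""
    return io.open(fd, mode=mode, encoding="utf-8", errors="replace", newline="\n")


# An OS-level pipe serves as a stand-in for stdin.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"1113690343;Ken Chen;kenneth.w.chen@intel.com\n")
# Second line: an invalid UTF-8 byte (0xff) and a "\r\n" line ending.
os.write(write_fd, b"1113690344;Broken \xff Name;broken@example.com\r\n")
os.close(write_fd)

with open_text_fd(read_fd) as reader:
    lines = list(reader)

# errors="replace" turns the undecodable byte into U+FFFD instead of raising.
# newline="\n" splits on "\n" only and leaves the "\r" in the data untouched.
for line in lines:
    print(repr(line))
```

The point of the sketch is that decoding and newline handling are now spelled out in the code rather than chosen by the interpreter depending on platform and locale.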

Example run on CPython’s (unofficial) git repository

I applied git-authors to the current state of the unofficial CPython repository hosted at GitHub. As a side note, this required about 0.1 s of CPU time on my test machine. I am showing the output in full length below, because I find its content rather interesting. We have to appreciate that the commit history is not entirely broken, despite CPython having switched between different version control systems over the last 25 years. Did you know that Just van Rossum also was a committer? :-)

0001 (1990-08-09): Guido van Rossum (guido@python.org)
0002 (1992-08-04): Sjoerd Mullender (sjoerd@acm.org)
0003 (1992-08-13): Jack Jansen (jack.jansen@cwi.nl)
0004 (1993-01-10): cvs2svn (tools@python.org)
0005 (1994-07-25): Barry Warsaw (barry@python.org)
0006 (1996-07-23): Fred Drake (fdrake@acm.org)
0007 (1996-12-09): Roger E. Masse (rmasse@newcnri.cnri.reston.va.us)
0008 (1997-08-13): Jeremy Hylton (jeremy@alum.mit.edu)
0009 (1998-03-03): Ken Manheimer (klm@digicool.com)
0010 (1998-04-09): Andrew M. Kuchling (amk@amk.ca)
0011 (1998-12-18): Greg Ward (gward@python.net)
0012 (1999-01-22): Just van Rossum (just@lettererror.com)
0013 (1999-11-07): Greg Stein (gstein@lyra.org)
0014 (2000-05-12): Gregory P. Smith (greg@mad-scientist.com)
0015 (2000-06-06): Trent Mick (trentm@activestate.com)
0016 (2000-06-07): Marc-André Lemburg (mal@egenix.com)
0017 (2000-06-09): Mark Hammond (mhammond@skippinet.com.au)
0018 (2000-06-29): Fredrik Lundh (fredrik@pythonware.com)
0019 (2000-06-30): Skip Montanaro (skip@pobox.com)
0020 (2000-06-30): Tim Peters (tim.peters@gmail.com)
0021 (2000-07-01): Paul Prescod (prescod@prescod.net)
0022 (2000-07-10): Vladimir Marangozov (vladimir.marangozov@t-online.de)
0023 (2000-07-10): Peter Schneider-Kamp (nowonder@nowonder.de)
0024 (2000-07-10): Eric S. Raymond (esr@thyrsus.com)
0025 (2000-07-14): Thomas Wouters (thomas@python.org)
0026 (2000-07-29): Moshe Zadka (moshez@math.huji.ac.il)
0027 (2000-08-15): David Scherer (dscherer@cmu.edu)
0028 (2000-09-07): Thomas Heller (theller@ctypes.org)
0029 (2000-09-08): Martin v. Löwis (martin@v.loewis.de)
0030 (2000-09-15): Neil Schemenauer (nascheme@enme.ucalgary.ca)
0031 (2000-09-21): Lars Gustäbel (lars@gustaebel.de)
0032 (2000-09-24): Nicholas Riley (nriley@sabi.net)
0033 (2000-10-03): Ka-Ping Yee (ping@zesty.ca)
0034 (2000-10-06): Jim Fulton (jim@zope.com)
0035 (2001-01-10): Charles G. Waldman (cgw@alum.mit.edu)
0036 (2001-03-22): Steve Purcell (steve@pythonconsulting.com)
0037 (2001-06-25): Steven M. Gava (elguavas@python.net)
0038 (2001-07-04): Kurt B. Kaiser (kbk@shore.net)
0039 (2001-07-04): unknown (tools@python.org)
0040 (2001-07-20): Piers Lauder (piers@cs.su.oz.au)
0041 (2001-08-23): Finn Bock (bckfnn@worldonline.dk)
0042 (2001-08-27): Michael W. Hudson (mwh@python.net)
0043 (2001-10-31): Chui Tey (chui.tey@advdata.com.au)
0044 (2001-12-19): Neal Norwitz (nnorwitz@gmail.com)
0045 (2001-12-21): Anthony Baxter (anthonybaxter@gmail.com)
0046 (2002-02-17): Andrew MacIntyre (andymac@bullseye.apana.org.au)
0047 (2002-03-21): Walter Dörwald (walter@livinglogic.de)
0048 (2002-05-12): Raymond Hettinger (python@rcn.com)
0049 (2002-05-15): Jason Tishler (jason@tishler.net)
0050 (2002-05-28): Christian Tismer (tismer@stackless.com)
0051 (2002-06-14): Steve Holden (steve@holdenweb.com)
0052 (2002-09-23): Tony Lownds (tony@lownds.com)
0053 (2002-11-05): Gustavo Niemeyer (gustavo@niemeyer.net)
0054 (2003-01-03): David Goodger (goodger@python.org)
0055 (2003-04-19): Brett Cannon (bcannon@gmail.com)
0056 (2003-04-22): Alex Martelli (aleaxit@gmail.com)
0057 (2003-05-17): Samuele Pedroni (pedronis@openend.se)
0058 (2003-06-09): Andrew McNamara (andrewm@object-craft.com.au)
0059 (2003-10-24): Armin Rigo (arigo@tunes.org)
0060 (2003-12-10): Hye-Shik Chang (hyeshik@gmail.com)
0061 (2004-02-18): David Ascher (david.ascher@gmail.com)
0062 (2004-02-20): Vinay Sajip (vinay_sajip@yahoo.co.uk)
0063 (2004-03-21): Nicholas Bastin (nick.bastin@gmail.com)
0064 (2004-03-25): Phillip J. Eby (pje@telecommunity.com)
0065 (2004-08-04): Matthias Klose (doko@ubuntu.com)
0066 (2004-08-09): Edward Loper (edloper@gradient.cis.upenn.edu)
0067 (2004-08-09): Dave Cole (djc@object-craft.com.au)
0068 (2004-08-14): Johannes Gijsbers (jlg@dds.nl)
0069 (2004-09-17): Sean Reifschneider (jafo@tummy.com)
0070 (2004-10-16): Facundo Batista (facundobatista@gmail.com)
0071 (2004-10-21): Peter Astrand (astrand@lysator.liu.se)
0072 (2005-03-28): Bob Ippolito (bob@redivi.com)
0073 (2005-06-03): Georg Brandl (georg@python.org)
0074 (2005-11-16): Nick Coghlan (ncoghlan@gmail.com)
0075 (2006-03-30): Ronald Oussoren (ronaldoussoren@mac.com)
0076 (2006-04-17): George Yoshida (dynkin@gmail.com)
0077 (2006-04-23): Gerhard Häring (gh@ghaering.de)
0078 (2006-05-23): Richard Jones (richard@commonground.com.au)
0079 (2006-05-24): Andrew Dalke (dalke@dalkescientific.com)
0080 (2006-05-25): Kristján Valur Jónsson (kristjan@ccpgames.com)
0081 (2006-05-25): Jack Diederich (jackdied@gmail.com)
0082 (2006-05-26): Martin Blais (blais@furius.ca)
0083 (2006-07-28): Matt Fleming (mattjfleming@googlemail.com)
0084 (2006-09-05): Sean Reifscheider (jafo@tummy.com)
0085 (2007-03-08): Collin Winter (collinw@gmail.com)
0086 (2007-03-11): Žiga Seilnacht (ziga.seilnacht@gmail.com)
0087 (2007-06-07): Alexandre Vassalotti (alexandre@peadrop.com)
0088 (2007-08-16): Mark Summerfield (list@qtrac.plus.com)
0089 (2007-08-18): Travis E. Oliphant (oliphant@enthought.com)
0090 (2007-08-22): Jeffrey Yasskin (jyasskin@gmail.com)
0091 (2007-08-25): Eric Smith (eric@trueblade.com)
0092 (2007-08-29): Bill Janssen (janssen@parc.com)
0093 (2007-10-31): Christian Heimes (christian@cheimes.de)
0094 (2007-11-10): Amaury Forgeot d'Arc (amauryfa@gmail.com)
0095 (2008-01-08): Mark Dickinson (dickinsm@gmail.com)
0096 (2008-03-17): Steven Bethard (steven.bethard@gmail.com)
0097 (2008-03-18): Trent Nelson (trent.nelson@snakebite.org)
0098 (2008-03-18): David Wolever (david@wolever.net)
0099 (2008-03-25): Benjamin Peterson (benjamin@python.org)
0100 (2008-03-26): Jerry Seutter (jseutter@gmail.com)
0101 (2008-04-16): Jeroen Ruigrok van der Werven (asmodai@in-nomine.org)
0102 (2008-05-13): Jesus Cea (jcea@jcea.es)
0103 (2008-05-24): Guilherme Polo (ggpolo@gmail.com)
0104 (2008-06-01): Robert Schuppenies (okkotonushi@googlemail.com)
0105 (2008-06-10): Josiah Carlson (josiah.carlson@gmail.com)
0106 (2008-06-10): Armin Ronacher (armin.ronacher@active-4.com)
0107 (2008-06-18): Jesse Noller (jnoller@gmail.com)
0108 (2008-06-23): Senthil Kumaran (orsenthil@gmail.com)
0109 (2008-07-22): Antoine Pitrou (solipsis@pitrou.net)
0110 (2008-08-14): Hirokazu Yamamoto (ocean-city@m2.ccsnet.ne.jp)
0111 (2008-12-24): Tarek Ziadé (ziade.tarek@gmail.com)
0112 (2009-03-30): R. David Murray (rdmurray@bitdance.com)
0113 (2009-04-01): Michael Foord (fuzzyman@voidspace.org.uk)
0114 (2009-04-11): Chris Withers (chris@simplistix.co.uk)
0115 (2009-05-08): Philip Jenvey (pjenvey@underboss.org)
0116 (2009-06-25): Ezio Melotti (ezio.melotti@gmail.com)
0117 (2009-08-02): Frank Wierzbicki (fwierzbicki@gmail.com)
0118 (2009-09-20): Doug Hellmann (doug.hellmann@gmail.com)
0119 (2010-01-30): Victor Stinner (victor.stinner@haypocalc.com)
0120 (2010-02-23): Dirkjan Ochtman (dirkjan@ochtman.nl)
0121 (2010-02-24): Larry Hastings (larry@hastings.org)
0122 (2010-02-26): Florent Xicluna (florent.xicluna@gmail.com)
0123 (2010-03-25): Brian Curtin (brian.curtin@gmail.com)
0124 (2010-04-01): Stefan Krah (stefan@bytereef.org)
0125 (2010-04-10): Jean-Paul Calderone (exarkun@divmod.com)
0126 (2010-04-18): Giampaolo Rodolà (g.rodola@gmail.com)
0127 (2010-05-26): Alexander Belopolsky (alexander.belopolsky@gmail.com)
0128 (2010-08-06): Tim Golden (mail@timgolden.me.uk)
0129 (2010-08-14): Éric Araujo (merwok@netwok.org)
0130 (2010-08-22): Daniel Stutzbach (daniel@stutzbachenterprises.com)
0131 (2010-09-18): Brian Quinlan (brian@sweetapp.com)
0132 (2010-11-05): David Malcolm (dmalcolm@redhat.com)
0133 (2010-11-09): Ask Solem (askh@opera.com)
0134 (2010-11-10): Terry Reedy (tjreedy@udel.edu)
0135 (2010-11-10): Łukasz Langa (lukasz@langa.pl)
0136 (2012-06-24): Ned Deily (nad@acm.org)
0137 (2011-01-14): Eli Bendersky (eliben@gmail.com)
0138 (2011-03-10): Eric V. Smith (eric@trueblade.com)
0139 (2011-03-10): R David Murray (rdmurray@bitdance.com)
0140 (2011-03-12): orsenthil (orsenthil@gmail.com)
0141 (2011-03-14): Ross Lagerwall (rosslagerwall@gmail.com)
0142 (2011-03-14): Reid Kleckner (reid@kleckner.net)
0143 (2011-03-14): briancurtin (brian.curtin@gmail.com)
0144 (2011-03-24): guido (guido@google.com)
0145 (2011-03-30): Kristjan Valur Jonsson (sweskman@gmail.com)
0146 (2011-04-04): brian.curtin (brian@python.org)
0147 (2011-04-12): Nadeem Vawda (nadeem.vawda@gmail.com)
0148 (2011-04-19): Giampaolo Rodola' (g.rodola@gmail.com)
0149 (2011-05-04): Alexis Metaireau (alexis@notmyidea.org)
0150 (2011-05-09): Gerhard Haering (gh@ghaering.de)
0151 (2011-05-09): Petri Lehtinen (petri@digip.org)
0152 (2011-05-24): Charles-François Natali (neologix@free.fr)
0153 (2011-07-17): Alex Gaynor (alex.gaynor@gmail.com)
0154 (2011-07-27): Jason R. Coombs (jaraco@jaraco.com)
0155 (2011-08-02): Sandro Tosi (sandro.tosi@gmail.com)
0156 (2011-09-28): Meador Inge (meadori@gmail.com)
0157 (2012-01-09): Terry Jan Reedy (tjreedy@udel.edu)
0158 (2011-05-19): Tarek Ziade (tarek@ziade.org)
0159 (2011-05-22): Martin v. Loewis (martin@v.loewis.de)
0160 (2011-05-31): Ralf Schmitt (ralf@systemexit.de)
0161 (2011-09-12): Jeremy Kloth (jeremy.kloth@gmail.com)
0162 (2012-03-14): Andrew Svetlov (andrew.svetlov@gmail.com)
0163 (2012-03-21): krisvale (sweskman@gmail.com)
0164 (2012-04-24): Marc-Andre Lemburg (mal@egenix.com)
0165 (2012-04-30): Richard Oudkerk (shibturn@gmail.com)
0166 (2012-05-15): Hynek Schlawack (hs@ox.cx)
0167 (2012-06-20): doko (doko@ubuntu.com)
0168 (2012-07-16): Atsuo Ishimoto (ishimoto@gembook.org)
0169 (2012-09-02): Zbigniew Jędrzejewski-Szmek (zbyszek@in.waw.pl)
0170 (2012-09-06): Eric Snow (ericsnowcurrently@gmail.com)
0171 (2012-09-25): Chris Jerdonek (chris.jerdonek@gmail.com)
0172 (2012-12-27): Serhiy Storchaka (storchaka@gmail.com)
0173 (2013-03-31): Roger Serwy (roger.serwy@gmail.com)
0174 (2013-03-31): Charles-Francois Natali (cf.natali@gmail.com)
0175 (2013-05-10): Andrew Kuchling (amk@amk.ca)
0176 (2013-06-14): Ethan Furman (ethan@stoneleaf.us)
0177 (2013-08-12): Felix Crux (felixc@felixcrux.com)
0178 (2013-10-21): Peter Moody (python@hda3.com)
0179 (2013-10-25): bquinlan (brian@sweetapp.com)
0180 (2013-11-04): Zachary Ware (zachary.ware@gmail.com)
0181 (2013-12-02): Walter Doerwald (walter@livinglogic.de)
0182 (2013-12-21): Donald Stufft (donald@stufft.io)
0183 (2014-01-03): Daniel Holth (dholth@fastmail.fm)
0184 (2014-01-27): Yury Selivanov (yselivanov@sprymix.com)
0185 (2014-04-15): Kushal Das (kushaldas@gmail.com)
0186 (2014-06-29): Berker Peksag (berker.peksag@gmail.com)
0187 (2014-07-16): Tal Einat (taleinat@gmail.com)
0188 (2014-10-08): Steve Dower (steve.dower@microsoft.com)
0189 (2014-10-18): Robert Collins (rbtcollins@hp.com)
0190 (2015-03-22): Paul Moore (p.f.moore@gmail.com)

Resources

Similar functionality is provided by the more full-blown frameworks grunt-git-authors and gitstats.

Some resources that might be insightful for you:

Official WordPress themes should have an official change log

Officially supported themes: TwentyXXX

My website is WordPress-backed. WordPress front-ends are called “themes”. There are official themes, released by WordPress/Automattic. And there are thousands of themes released by third parties. While the WordPress project has released many themes, not all of them are equally “important”. There is only one specific series of WordPress themes that is, so to speak, most official: themes from the TwentyXXX series.

The issue: no update release notes

In this series, WordPress releases one theme per year (there was TwentyEleven, TwentyTwelve, TwentyThirteen, you get the point). The most recent one of these themes is included with every major release of WordPress. In other words: it does not get more official. Correspondingly, themes from this series enjoy long-term support by the WordPress project. That is, they receive maintenance updates even years after their initial release (TwentyEleven was last updated at the end of 2014, for instance). That is great, really! However, there is one very negative aspect of these updates: there are no official release notes. That’s horrible, thinking in engineering terms, and considering the release ethics applied in other serious open source software projects.

Background: dependency hell

TwentyXXX theme updates are released rather silently: suddenly, the WordPress dashboard shows that there is an update. But there is no official change log or release note which one could base a decision on. Nothing, apart from an increased version number. That is different from updating WordPress plugins, where the change log usually is only one click away from the WordPress dashboard. Also, the theme version number cannot be relied upon to be semantically expressive (AFAIK WordPress themes are not promised to follow semantic versioning, right?)

Now, some of you may think that newer always is better. Just update and trust the developers. But that is not how things work in real life. Generally, we should stick to the paradigm of “never change a running system”, unless […]: sometimes, an update might change behavior, which might not be desired. Sometimes an update might fix a security issue, which one should know about and apply immediately. Or the update resolves a usability issue. Such considerations apply to updates for any kind of software. But, in the context of WordPress, there is an even more important topic to consider when updating a theme: an update might break child themes. Or, as expressed by xkcd: “Every change breaks someone’s workflow”:

http://xkcd.com/1172

A theme can be used by other developers as a so-called parent theme, in a library fashion: it provides a programming interface. This affects many websites, like mine: a couple of years ago I decided to base the theme used on my website (here) on the TwentyTwelve theme. I went ahead and created a child theme, which inherits most of its code from TwentyTwelve and changes layout and behavior only in a few aspects. I definitely cannot blindly press the “update” button when TwentyTwelve receives an update. This might immediately change the interface I developed my child theme against, and can consequently break any component of my child theme. Obviously, I cannot just try this out with my live/public website. So, I have to test this update beforehand, in a development environment which is not public.

If proper release notes were available, I could possibly skip that testing and apply such an update right away if it’s just a minor one. Or, I would be alerted that there is a security hole fixed with a breaking change in the parent theme, and I’d know that I have to quickly react and re-work my child theme so that I can safely apply the update to the parent. These things need to be communicated, like in any other open source project with a decent release policy.

Concluding remarks

Yes, there are ways to reconstruct and analyze the code changes that were made. This URL structure actually is quite helpful for generating diffs between theme versions: https://themes.trac.wordpress.org/changeset?old_path=/twentytwelve/1.4&new_path=/twentytwelve/1.6. That URL shows differences between TwentyTwelve 1.4 and 1.6. The same structure can be used for other official themes and version combinations. However, this does not replace a proper change log. WordPress is a mature, large-scale open source project with a huge developer community. Themes from the TwentyXXX series are a major component of this project. The project should provide change logs and/or release notes for every update — for compliance with expectations, and for enabling sound engineering decisions. Others want this, too:

Can any one point me to the release notes for 1.2 or a list of the applied changes? Updating from 1.1 has caused some minor, but unexpected presentation changes on one of my child themes, and I’d like to know what else has changed and what to test for before I upgrade further sites.
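The changeset URL structure mentioned in the concluding remarks lends itself to a tiny helper. The following is a sketch only: the function name theme_diff_url is made up, and the assumption that every theme and version pair follows the same /theme/version path layout rests on nothing more than the single example URL given above.

```python
from urllib.parse import urlencode

TRAC_CHANGESET_URL = "https://themes.trac.wordpress.org/changeset"


def theme_diff_url(theme, old_version, new_version):
    """Build a Trac changeset URL that diffs two versions of a theme."""
    query = [
        ("old_path", "/%s/%s" % (theme, old_version)),
        ("new_path", "/%s/%s" % (theme, new_version)),
    ]
    # safe="/" keeps the slashes readable, as in the URL shown above.
    return TRAC_CHANGESET_URL + "?" + urlencode(query, safe="/")


url = theme_diff_url("twentytwelve", "1.4", "1.6")
print(url)
```

Such a helper makes it easier to review a diff before pressing the update button, but it does not replace a change log.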

Songkick events for Google’s Knowledge Graph

Google can display upcoming concert events in the Knowledge Graph of musical artists (as announced in March 2014). This is a great feature, and probably many people in the field of music marketing and especially record labels aim to get this kind of data into the Knowledge Graph for their artists. However, Google does not magically find this data on its own. It needs to be informed, with a special kind of data structure (in the recently standardized JSON-LD format) contained within the artist’s website.

While of great interest to record labels, finding a proper technical solution to create and provide this data to Google still might be a challenge. I have prepared a web service that greatly simplifies the process of generating the required data structure. It pulls concert data from Songkick and translates it into the JSON-LD representation required by Google. In the next section I explain the process by means of an example.

Web service usage example

The concert data of the band Milky Chance is published and maintained via Songkick, a service that many artists use. The following website shows — among others — all upcoming events of Milky Chance: http://www.songkick.com/artists/6395144-milky-chance. My web service translates the data held by Songkick into the data structure that Google requires in order to make this concert data appear in their Knowledge Graph. This is the corresponding service URL that needs to be called to retrieve the data:

https://jsonld-events.appspot.com/api/songkick/artist?skid=6395144&name=Milky+Chance&weburl=http%3A%2F%2Fmilkychanceofficial.com

That URL is made up of the base URL of the web service, the Songkick ID of the artist (6395144 in this case), the artist name, and the artist website URL. Try accessing this service URL in your browser. It currently yields this:

[
  {
    "@context": "http://schema.org", 
    "@type": "MusicEvent", 
    "name": "Milky Chance", 
    "startDate": "2014-12-12", 
    "url": "http://www.songkick.com/concerts/21926613-milky-chance-at-max-nachttheater?utm_source=30793&utm_medium=partner", 
    "location": {
      "address": {
        "addressLocality": "Kiel", 
        "postalCode": "24116", 
        "streetAddress": "Eichhofstra\u00dfe 1", 
 
[ ... SNIP ~ 1000 lines of data ... ]
 
    "performer": {
      "sameAs": "http://milkychanceofficial.com", 
      "@type": "MusicGroup", 
      "name": "Milky Chance"
    }
  }
]

This piece of data needs to be included in the HTML source code of the artist website. Google then automatically finds this data and eventually displays the concert data in the Knowledge Graph (within a couple of days). That’s it — pretty simple, right? The good thing is that this method does not require layout changes to your website. This data can literally be included in any website, right now.
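For the inclusion step, JSON-LD is commonly embedded in a script element of type application/ld+json. The sketch below shows one way to produce such an element from the service response. The function name jsonld_script_tag is made up, and the escaping of "</" is a precaution added here, not something the web service is known to do.

```python
import json


def jsonld_script_tag(events):
    """Serialize event data and wrap it in a JSON-LD script element."""
    payload = json.dumps(events, indent=2)
    # Prevent a literal "</script>" inside the data from closing the
    # element early; "<\/" is an equivalent, valid JSON escape.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n%s\n</script>' % payload


# Shortened stand-in for the service response shown above.
events = [
    {
        "@context": "http://schema.org",
        "@type": "MusicEvent",
        "name": "Milky Chance",
        "startDate": "2014-12-12",
    }
]
tag = jsonld_script_tag(events)
print(tag)
```

The resulting string can be placed anywhere in the HTML source of the artist website, which is why no layout changes are needed.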

That is what happened in the case of Milky Chance: some time ago, the data created by the web service was fed into the Milky Chance website. Consequently, their concert data is displayed in their Knowledge Graph. See for yourself: access https://www.google.com/search?q=milky+chance and look out for upcoming events on the right-hand side. Screenshot:

Google Knowledge Graph generated for Milky Chance. Note the upcoming events section: for this to appear, Google needs to find the event data in a special markup within the artist’s website.

So, in summary, when would you want to use this web service?

  • You have an interest in presenting the concert data of an artist in Google’s Knowledge Graph (you are a record label or otherwise interested in improved marketing and user experience).
  • You have access to the artist website or know someone who has access.
  • The artist concert data already is present on Songkick or will be present in the future.

Then all you need is a specialized service URL, which you can generate with a small form I have prepared for you here: http://gehrcke.de/google-jsonld-events
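If you prefer to assemble the service URL yourself, the following sketch reproduces the example URL shown earlier from its three parameters. The function name service_url is made up; the parameter names skid, name, and weburl are taken from that example URL.

```python
from urllib.parse import urlencode

SERVICE_BASE_URL = "https://jsonld-events.appspot.com/api/songkick/artist"


def service_url(songkick_id, artist_name, artist_website):
    """Assemble the service URL from the three input parameters."""
    query = [
        ("skid", songkick_id),
        ("name", artist_name),
        ("weburl", artist_website),
    ]
    # urlencode() takes care of percent-encoding, e.g. of the website URL.
    return SERVICE_BASE_URL + "?" + urlencode(query)


url = service_url(6395144, "Milky Chance", "http://milkychanceofficial.com")
print(url)
```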

Background: why Songkick?

Of course, the event data shown in the Knowledge Graph should be up to date and in sync with presentations of the same data in other places (bands usually display their concert data in many places: on Facebook, on their website, within third-party services, …). Fortunately, a lot of bands actually do manage this data in a central place (any other solution would be tedious). This central place/platform/service often is Songkick, because Songkick really did a nice job of providing people with what they need. My web service reflects recent changes made within Songkick.

Technical detail

The core of the web service is a piece of software that translates the data provided by Songkick into the JSON-LD data as required and specified by Google. The Songkick data is retrieved via Songkick’s JSON API (I applied for and got a Songkick API key). Large parts of this software deal with the unfortunate business of data format translation while handling certain edge cases.

The service is implemented in Python and hosted on Google App Engine. Its architecture is quite well thought out (for instance, it uses memcache and asynchronous urlfetch wherever possible). It is ready to scale, so to speak. Some technical highlights:

  • The web service enforces transport encryption (HTTPS).
  • Songkick back-end is queried via HTTPS only.
  • Songkick back-end is queried concurrently whenever possible.
  • Songkick responses are cached for several hours in order to reduce load on their service.
  • Responses of this web service are cached for several hours. These are served within milliseconds.
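The caching behavior listed above can be illustrated with a small time-to-live cache. This is only a sketch in plain JavaScript, not the code of the service (which is written in Python and relies on App Engine's memcache). The helper name `makeTtlCache`, the injectable `now` clock, and the three-hour lifetime are assumptions made for illustration.

```javascript
// Minimal time-to-live cache sketch (hypothetical helper, not the service's code).
// `now` is an injectable clock so that expiry can be exercised deterministically.
function makeTtlCache(ttlMs, now) {
  now = now || Date.now;
  var store = new Map();
  return {
    get: function (key) {
      var entry = store.get(key);
      if (!entry) return undefined;
      if (now() - entry.storedAt >= ttlMs) {
        store.delete(key);  // Entry expired: drop it and report a miss.
        return undefined;
      }
      return entry.value;
    },
    set: function (key, value) {
      store.set(key, {value: value, storedAt: now()});
    }
  };
}

// Usage: cache a (fake) venue response for three hours.
var fakeTime = 0;
var cache = makeTtlCache(3 * 60 * 60 * 1000, function () { return fakeTime; });
cache.set('venue:123', {name: 'Example Hall'});
var hit = cache.get('venue:123');   // still fresh
fakeTime = 4 * 60 * 60 * 1000;      // four hours later
var miss = cache.get('venue:123');  // expired
```

An expired entry is treated like a miss, so the caller would simply query the back-end again and store the fresh response.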

This is an overview of the data flow:

  1. Incoming request, specifying Songkick artist ID, artist name, and artist website.
  2. Using the Songkick API (SKA), all upcoming events are queried for this artist (one or more SKA requests, depending on number of events).
  3. For each event, the venue ID is extracted, if possible.
  4. All venues are queried for further details (this entails as many SKA requests as venue IDs extracted).
  5. A JSON-LD representation of an event is constructed from a combination of
    • event data
    • venue data
    • user-given data (artist name and artist website)
  6. All event representations are combined and returned.
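Steps 5 and 6 of the data flow can be sketched as follows. This is a simplified illustration in plain JavaScript, not the actual Python implementation. The input shapes (`event.name`, `event.startDate`, `event.venueId`, `venue.address`) are hypothetical and do not mirror the Songkick API; the output property names follow the schema.org MusicEvent vocabulary, and the exact set Google expects should be checked against its documentation.

```javascript
// Sketch of step 5: merge event, venue, and user-given data into one
// JSON-LD object. Input shapes are hypothetical simplifications.
function buildEventJsonLd(event, venue, artist) {
  var doc = {
    '@context': 'http://schema.org',
    '@type': 'MusicEvent',
    name: event.name,
    startDate: event.startDate,
    performer: {
      '@type': 'MusicGroup',
      name: artist.name,
      sameAs: artist.website
    }
  };
  // Graceful degradation: only add a location if venue data is available.
  if (venue) {
    doc.location = {'@type': 'Place', name: venue.name};
    if (venue.address) doc.location.address = venue.address;
  }
  return doc;
}

var artist = {name: 'Example Band', website: 'http://example.com'};
var events = [
  {name: 'Example Band at Example Hall', startDate: '2015-06-01', venueId: 1},
  {name: 'Example Band at Unknown Venue', startDate: '2015-06-02', venueId: null}
];
var venues = {1: {name: 'Example Hall', address: 'Example Street 1, Berlin'}};

// Step 6: combine all event representations into one list.
var jsonLd = events.map(function (ev) {
  return buildEventJsonLd(ev, venues[ev.venueId] || null, artist);
});
```

The second event has no venue ID, so its representation is still produced, just without a `location` property.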

Some notable points in this context:

  • A single request to this web service might entail many requests to the Songkick API. This is why SKA responses are aggressively cached:
    • An example artist with 54 upcoming events requires 2 upcoming-events API requests (two pages, which cannot be requested concurrently) and roughly 50 venue API requests (which can be requested concurrently). In sum, this means that my web service cannot respond faster than three SKA round-trip times.
    • If none of the SKA responses has been cached before, the retrieval of about 2 + 50 SKA responses might easily take about 2 seconds.
    • This web service cannot be faster than Songkick delivers.
  • This web service applies graceful degradation when extracting data from Songkick (many special cases are handled, which is especially relevant for the venue address).
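The round-trip argument from the list above can be written down as a tiny model. It assumes, as described, that event pages must be fetched one after another while all venue requests form one concurrent batch; the function name `minRoundTrips` is made up for this sketch.

```javascript
// Lower bound on sequential Songkick API round trips for one uncached request.
// Simplified model: event pages are sequential, venues are one concurrent batch.
function minRoundTrips(eventPages, venueCount) {
  return eventPages + (venueCount > 0 ? 1 : 0);
}

// The example artist: two event pages and about 50 venues.
var roundTrips = minRoundTrips(2, 50);
```

For the example artist this yields three sequential round trips, which is the lower bound on response time named above.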

Generate your service URL

This blog post is just an introduction, and sheds some light on the implementation and decision-making. For general reference, I have prepared this document to get you started:

http://gehrcke.de/google-jsonld-events

It contains a web form where you can enter the (currently) three input parameters required for using the service. It returns a service URL for you. This URL points to my application hosted on Google App Engine. Using this URL, the service returns the JSON data that is to be included in an artist’s website. That’s all, it’s really pretty simple.

So, please go ahead and use this tool. I’d love to receive some feedback. Look closely at the data it returns, and keep your eyes open for subtle bugs. If you see something weird, report it, please. I am very open to suggestions, and also interested in your questions regarding future plans, release cycle etc. Also, if you need support for (dynamically) including this kind of data in your artist’s website, feel free to contact me.

Sharing state in AngularJS: be aware of $watch issues and race conditions during app initialization

This article is about concise and precise communication of shared state updates from AngularJS services to AngularJS controllers. It warns about race conditions upon AngularJS application bootstrap, and points out advantages of $broadcast over $watch. The topics discussed in this article are supported by minimal working code examples. Finally, this article provides code that can hopefully serve as a best-practice snippet for your own application.

Note: this article has been written with AngularJS version 1.3.X in mind. Future versions of Angular, especially the announced version 2.0, might behave differently.

Introduction to the problem

I have worked with AngularJS for a couple of days now, designing an application that needs to interact with a web service. In this application, I use a small local database (basically a large JavaScript object) that is used by different views in different ways. From time to time, this database object needs to be updated from a remote resource. In the AngularJS ecosystem it seems obvious that such data should be part of an application-wide shared state object and that it needs to be managed by a central entity: an AngularJS service (remember: services in AngularJS can be considered as globally available entities, i.e. they are the perfect choice for communicating between controllers and for sharing state). The two main questions that came to my mind considering this scenario:

  1. How should I handle the automatic initial retrieval of remote data upon application startup?
  2. How should I communicate updates of this piece of shared data to controllers?

The answers to these questions must make sure that the following boundary conditions are fulfilled: controllers need to be informed about all state updates (including the initial one) independently of

  • the application startup time (which is defined by the computing power of the device and the application complexity) and independently of
  • the latency between request and response when querying the remote resource.

An obvious solution (with not-so-obvious issues)

The HTML

Let us get right into code and discuss a possible solution, by means of a small working example. This is the HTML:

<!DOCTYPE html>
<html data-ng-app="testApp">
  <head>
    <script data-require="angular.js@1.3.1" data-semver="1.3.1" src="//code.angularjs.org/1.3.1/angular.js"></script>
  </head>
  <body data-ng-controller="Ctrl">
    Please watch the JavaScript console.<br>
    <button ng-click="buttonclick(false)">updateState(constant)</button>
    <button ng-click="buttonclick(true)">updateState(random)</button>
    <script src="script.js"></script>
  </body>
</html>

It includes the AngularJS framework and custom JavaScript code from script.js. The ng main module is called testApp and the body is subject to the ng controller called Ctrl. There are two buttons whose meaning is explained later.

The service ‘StateService’

So, what do we have in script.js? There is the obligatory line for defining the application’s main module:

var app = angular.module('testApp', []);

And there is the definition of a service for this application:

var UPDATE_STATE_DELAY = 1000;
 
app.factory('StateService', ['$rootScope', '$timeout',
function($rootScope, $timeout) {
 
  console.log('StateService: startup.');
  var service = {state: {data: null}};
 
  service.updateState = function(rnd) {
    console.log("StateService: updateState(). Retrieving data...");
    $timeout(function() {
      console.log("StateService: ...got data, assign it to state.data.");
      if (rnd)
        service.state.data = Math.floor(Math.random()*1000);
      else
        service.state.data = "constantpayload";
    }, UPDATE_STATE_DELAY);
  };
 
  // Update state automatically once upon service (app) startup.
  service.updateState();
 
  return service;
}]);

I have called it 'StateService' because this service should just be responsible for sharing state between controllers. The property service.state.data is what simulates the shared data — this is what controllers are interested in! This piece of data is first initialized with null.

Subsequently, an updateState() method is defined. It simulates delayed retrieval of data from a remote resource via a timeout-controlled async call which eventually results in assignment of “new” data to service.state.data. This method can be called in two ways:

  • One way results in service.state.data being set to a hard-coded string.
  • The other results in service.state.data being set to a random number.

The length of the delay after which the pseudo remote data comes in is set to about 1 second, as defined by var UPDATE_STATE_DELAY = 1000.

The service factory (that piece of code shown above) is automatically executed by AngularJS when loading the application. It is important to note that before the service factory returns the service object, service.updateState() is called. That is, when the application bootstraps and this service becomes initialized, it automatically performs one state update. This triggers “the automatic initial retrieval of remote data upon application startup” I talked about in the introduction.

Consequently, about 1 second after this service has been initialized, the service.state.data object is updated with pseudo remote data. Subsequent calls to updateState() can only be triggered externally, as I will show later.

The controller ‘Ctrl’

StateService in place. So far, so good. This is what a controller that makes use of it can look like:

app.controller('Ctrl', ['$scope', 'StateService',
function($scope, stateService) {
 
  function useStateData() {
    console.log("Ctrl: useStateData(): " + stateService.state.data);
  }
 
  function init() {
    console.log('Ctrl: init. Install watcher for stateService.state.data.');
    $scope.$watch(
      function() {return stateService.state.data;},
      function(newValue, oldValue) {
        console.log("Ctrl: stateService.state.data watcher: triggered.");
        if (newValue !== oldValue) {
          console.log("Ctrl: stateService.state.data watcher: data changed, use data.");
          useStateData();
        }
        else
          console.log("Ctrl: stateService.state.data watcher: data did not change: " + oldValue);
      }
    );
  }
 
  init();
 
  $scope.buttonclick = function(random) {
    console.log("Ctrl: Call stateService.updateState() due to button click.");
    stateService.updateState(random);
  };
}]);

For being able to communicate state changes from the StateService to the controller, the service is injected into the controller as the stateService object. That just means: we can use this object within the code body of the controller to access service properties, including stateService.state.data.

In the controller, first of all, I define a dummy function called useStateData(). Its sole purpose is to simulate complex usage of the shared state data. In this case, if the function is called, the data is simply logged to the console.

Subsequently, an init() function is defined and called right after that (I could have put that code right into the body of the controller, but further below in the article the call to init() is wrapped with a timeout, and that is why I already separate it here).

Now we come to the essential part: In summary, the basic idea is to have a mechanism applied in the controller that automatically calls useStateData() after stateService.state.data has changed.

For automatic communication of state changes from the service to the controller, AngularJS provides different mechanisms. In very simplistic scenarios we could just bind stateService.state.data to any of the model properties in the controller’s scope and rely on Angular’s “automatic” two-way binding. However, in this article the goal is to discuss more complex scenarios where we need to take absolute control of the state update and where we want to react to a state update in a more general way, i.e. by calling a function in response to the update (here, this is useStateData()).

That is what Angular’s $scope.$watch() is good for. It gets (at least) two arguments. A “watcher function” is defined with the first argument. In this case here, this watcher function just returns the value of stateService.state.data. This watcher function is called in every Angular event loop iteration (upon each call to $digest()). If the value that it watches changes between two iterations, the listener function is called. The listener is defined by the second argument to $scope.$watch(). In our simple example here, the purpose of the listener function is to just use the data, i.e. to call useStateData().

The controller contains some additional code that gives a purpose to the two buttons included in the HTML shown before. One button calls updateState(true), triggering a state update in which the data is set to a random number. The other button calls updateState(false) where the data is set to a hard-coded string (a constant).

Fine, sounds good so far, the controller is ready to respond to state updates. But wait …

Three traps with $scope.$watch()

Run the example shown above via this plunk and watch your JavaScript console. This is the output right after (< 1 s) loading the application:

StateService: startup.
StateService: updateState(). Retrieving data...
Ctrl: init. Install watcher for stateService.state.data.
Ctrl: stateService.state.data watcher: triggered.
Ctrl: stateService.state.data watcher: data did not change: null

trap 1: $watch() listener requires case analysis

Let us go through things in order. First, the service is initialized and triggers updateState(), as planned. We expect a state update about 1 second after that. Next thing in the log is output emitted by the controller code: it installs the watcher via $scope.$watch(). Immediately after that the watcher already calls the listener function. The pseudo remote update still did not happen, so why is that function being called? This is explained in the Angular docs:

After a watcher is registered with the scope, the listener fn is called asynchronously to initialize the watcher. In rare cases, this is undesirable because the listener is called when the result of watchExpression didn’t change. To detect this scenario within the listener fn, you can compare the newVal and oldVal.

Wuah, what? I did not explain this before, but this is the reason why the listener function code shown above needs a case analysis. We need to manually compare the old value to the new value via

function(newValue, oldValue) {
  console.log("Ctrl: stateService.state.data watcher: triggered.");
  if (newValue !== oldValue) {
    console.log("Ctrl: stateService.state.data watcher: data changed, use data.");
    useStateData();
  }
  else
    console.log("Ctrl: stateService.state.data watcher: data did not change: " + oldValue);
}

If you prefer to simply rely on the trigger and forget to do the case analysis, you may already have a hard-to-debug issue in your code:

function() {
  console.log("Ctrl: stateService.state.data watcher: triggered, use data.");
  useStateData();
  // Wait, maybe that here was just called due to the watcher init, oops!
}

Okay, looking at the log output above again, indeed, the first invocation of the listener function resulted in “data did not change: null”. That is, newValue !== oldValue was false. I have not put timestamps into the log, but the following lines are the remaining output of the application (they appeared after about 1 second):

StateService: ...got data, assign it to state.data.
Ctrl: stateService.state.data watcher: triggered.
Ctrl: stateService.state.data watcher: data changed, use data.
Ctrl: useStateData(): constantpayload

As expected, the StateService retrieves its pseudo remote data and re-assigns its state.data object. The $timeout service triggers an Angular event loop iteration, i.e. the assignment is wrapped by an Angular-internal call to $digest(). Consequently, the change is observed by Angular and the listener function of the installed watcher gets called. This time, the (annoying) case analysis causes useStateData() to be called. It prints the updated data.

So far, we have found a way to communicate a state change from a service to a controller, via $watch(). Sounds great. However, this method involves potential false-positive calls to the listener function. To properly deal with this awkward situation, a case analysis is required within the listener itself. This case analysis is, in my opinion, either a mean trap (if you forget to implement it) or unnecessary code bloat. It simply should not be required.
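The initialization call discussed in this section can be reproduced without AngularJS using a strongly simplified model of $watch() and $digest(). This is not Angular's implementation: it performs a single pass per digest and mimics only the initialization call and the default !== comparison. `makeScope` and the wrapper `changesOnly` are hypothetical names introduced for this sketch.

```javascript
// Simplified model of $watch()/$digest(). NOT Angular's implementation; it
// only mimics the initialization call and the default !== comparison.
function makeScope() {
  var UNINITIALIZED = {};  // sentinel: this watcher has never run
  var watchers = [];
  return {
    $watch: function (watchFn, listenerFn) {
      watchers.push({get: watchFn, fn: listenerFn, last: UNINITIALIZED});
    },
    $digest: function () {  // single pass; the real digest loops until stable
      watchers.forEach(function (w) {
        var value = w.get();
        if (value !== w.last) {
          // On the first run, report the current value as oldValue too.
          var oldValue = (w.last === UNINITIALIZED) ? value : w.last;
          w.last = value;
          w.fn(value, oldValue);
        }
      });
    }
  };
}

// Hypothetical helper: wrap a listener so that it only runs on real changes.
function changesOnly(listener) {
  return function (newValue, oldValue) {
    if (newValue !== oldValue) listener(newValue, oldValue);
  };
}

var state = {data: null};
var rawCalls = 0;
var filteredCalls = 0;
var scope = makeScope();
scope.$watch(function () { return state.data; },
             function () { rawCalls += 1; });
scope.$watch(function () { return state.data; },
             changesOnly(function () { filteredCalls += 1; }));

scope.$digest();                  // initialization: only the raw listener runs
state.data = 'constantpayload';
scope.$digest();                  // real change: both listeners run
```

Moving the comparison into a wrapper keeps the case analysis out of each listener, but it is still something one has to remember to apply.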

trap 2: $watch() might swallow special updates

Let us proceed with the same minimal working example. The application is initialized. The state service retrieved its initial update from a pseudo remote source and notified the controller about this update. Now, you can go ahead and play with the button “updateState(random)” of the minimal working example. The console log should display something along these lines for each button click:

Ctrl: Call stateService.updateState() due to button click.
StateService: updateState(). Retrieving data...
StateService: ...got data, assign it to state.data.
Ctrl: stateService.state.data watcher: triggered.
Ctrl: stateService.state.data watcher: data changed, use data.
Ctrl: useStateData(): 148

The chain is working: a button click results in a timeout being set. After about 1 second the data property of the state service gets assigned a new (random number) value. The watcher detects the change and immediately calls the listener function which, in turn, calls the useStateData() method of the controller.

Now, please press “updateState(constant)”, two times at least. What is happening? This is the log (after the second click):

Ctrl: Call stateService.updateState() due to button click.
StateService: updateState(). Retrieving data...
StateService: ...got data, assign it to state.data.

The button click is logged. The StateService invokes its update function. After about 1 second, the string “constantpayload” is again assigned to the data property of the state object. As expected, so far. And….? The listener function in the controller does not get called. Never. Why? Because the watched property, data, holds the same value before and after the update. In my code example, the same string (created from one single string literal) is re-assigned to data upon every click on that button, and two equal string primitives are indistinguishable under JavaScript’s strict comparison. According to the AngularJS docs, the $watch()-internal comparison is done by reference (that is the default, at least), which for primitives means: by value. Hence, if I had written

service.state.data = new String("constantpayload");

in the stateService.updateState() function, the listener function would be triggered upon each click on that button, because a new String object would be created each time and data’s reference would change.
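The difference between string primitives and String wrapper objects under strict comparison can be checked in isolation. The snippet is plain JavaScript and independent of AngularJS; the variable names are arbitrary.

```javascript
// String primitives compare by value under strict (in)equality ...
var a = 'constantpayload';
var b = 'constant' + 'payload';
var primitivesDiffer = (a !== b);  // false: same value

// ... whereas String wrapper objects compare by identity.
var c = new String('constantpayload');
var d = new String('constantpayload');
var objectsDiffer = (c !== d);     // true: two distinct objects
```

A watcher that compares with !== therefore sees no change in the first case and a change in the second, although the text content is identical in both.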

Let us reflect. Just a minute ago, in the case discussed before, special $watch() behavior forced us to do a manual case analysis in the listener function in order to decide whether there was a real update or not. Now we found a situation in which we do not even get the chance to manually process an update event in the listener function, because Angular’s $watch() mechanism decided internally that this was not an update. Whether an update that does not change the value counts as an update is a philosophical question. Meaning: it should not be answered for you; that is too much artificial intelligence. You might want to deal with this question yourself in your controller, e.g. for knowing when the last update occurred, even if the data did not change. If you have hard-coded objects in your application and combine these with $watch(), you might end up with rather complex code paths that you possibly did not expect to even exist. All of this is documented, but it is a trap.

Hence, my opinion is that this behavior of $watch() is too subtle to be considered for concise event transmission.

(At the same time, I appreciate that in many situations developers are not interested in propagating such updates that are no real updates, and are just fine with how $watch() behaves, be it by accident or by strategy).

trap 3: $watch() might seriously affect application performance

This one is really important for architectural decisions. Consider a scenario in which the shared state object is just a “container object” with a rather complex internal structure with many properties that can potentially change during an update. Then, as we have learned before, $watch() cannot simply detect changes in this object. The watched property always points to the container object, i.e. this reference does not change when the internals change. AngularJS provides two solutions to this: $watchCollection() and $watch() with the third argument (objectEquality) set to true. In both cases, the computational complexity of change detection depends on the complexity of the watched object. $watch(watcher, listener, true) performs a thorough analysis of the watched object, it “compares for object equality using angular.equals instead of comparing for reference equality.” The docs warn:

This therefore means that watching complex objects will have adverse memory and performance implications.

You can read more about the internals of $watch() in the “Scope” part of the AngularJS developer guide. In fact, this analysis requires the container object to be deeply inspected for changes. This implies saving a deep copy of the container object and comparing many values. This is costly on its own. But the important thing is: this is executed upon each $digest() round-trip of the framework. That is: often. And definitely upon each user interaction. Consequently, I would say that one should never watch complex objects in such fashion, because the associated complexity usually is not required. In a software project, the complexity of watched objects might grow from release to release, and developers might not be aware of the performance implications, especially in collaborative work. I find that the computational complexity for detecting an update and sending a notification about it should ideally never depend on the size of an object; it should just be O(1). Let’s face it: people use $watch() for getting notified, they might forget about its performance implications, and that is why $watch() should be O(1) or throw an error, in my opinion. But this questions the entire dirty-checking approach of Angular, so this is out of scope right now. Anyway, the associated complexity is hidden behind the scenes and will only become visible upon profiling. Just be aware of it.
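To get a feeling for how the cost of a deep comparison grows with the watched object, the following sketch counts the primitive comparisons performed by a naive deep-equality check. It is a stand-in for a deep comparison such as angular.equals, not its implementation (plain nested objects and primitives only, no cycle handling); `deepEqualsCounting` and `makeDatabase` are hypothetical helpers.

```javascript
// Count how many primitive comparisons a naive deep equality check performs.
// Stand-in for a deep comparison; handles plain nested objects and primitives.
function deepEqualsCounting(a, b, counter) {
  if (a !== null && b !== null &&
      typeof a === 'object' && typeof b === 'object') {
    var keysA = Object.keys(a);
    if (keysA.length !== Object.keys(b).length) return false;
    return keysA.every(function (key) {
      return Object.prototype.hasOwnProperty.call(b, key) &&
        deepEqualsCounting(a[key], b[key], counter);
    });
  }
  counter.comparisons += 1;
  return a === b;
}

// Build a "database-like" container with n records of two fields each.
function makeDatabase(n) {
  var db = {};
  for (var i = 0; i < n; i++) db['record' + i] = {id: i, name: 'item' + i};
  return db;
}

var small = {comparisons: 0};
var smallEqual = deepEqualsCounting(makeDatabase(10), makeDatabase(10), small);
var large = {comparisons: 0};
var largeEqual = deepEqualsCounting(makeDatabase(1000), makeDatabase(1000), large);
// A reference comparison (a !== b) is one operation regardless of object size.
```

In this model the number of comparisons grows linearly with the number of records, and that work would be repeated in every digest cycle.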

In the beginning of the article I stated that I want to have a database-like object as part of the shared state. Clearly, a $watch()-based method for automatic change detection is not a good option, because of trap number 3. But traps number 1 and 2 do not make me like $watch() too much, either. You feel it: we are working our way more and more toward simple event broadcasts, and we will get to those further down in the article. But before getting there, let us discuss another crucial issue with the architecture shown so far: a race condition.

Race condition: initial remote resource query vs. application startup time

Upon application start, the aforementioned little database needs to be populated with data from a remote resource. It makes sense to request this data from within the service initialization code, as shown above, via the automatic call to updateState(). Obviously, the point in time when the corresponding response arrives over the wire is not predictable. That is a racer. Let us name him racer A. We do not know how long it takes for him to arrive.

An AngularJS application starts up piece-wise. Various services, controllers and directives need to be initialized. The exact order and timing of actions depends on (at least)

  • the complexity of the application,
  • the order in which things are coded,
  • the way in which Angular is designed to bootstrap itself,
  • the computational performance of the device loading the application, and
  • the load on the device loading the application.

Hence, it is unpredictable at which point in time some controller code is executed which registers a watcher/listener for a certain event. Formally speaking, we cannot predict how much time T passes between

  • initial code execution of the shared state service and
  • initial code execution of any given controller consuming this service.

That is racer B. Racer B needs the unknown time T to arrive. Clearly, racer A and B compete. And that is the race condition: depending on the outcome of the race, the status update event might be available before or after certain view controllers register corresponding event listeners.

The code shown so far assumes that T is small compared to the time required for the service to obtain its initial update from the remote resource: the first update event is expected to fly in after the watcher has been installed. Clearly, if that assumption is wrong, the first update event is simply missed by the controller.

Demonstration

I have prepared this plunk for demonstrating the race condition: http://embed.plnkr.co/y52thV.

It contains the same code as shown before, with two small modifications: the pseudo remote resource delay is reduced to half a second, and the controller initialization is artificially delayed by one second. That is, state.data is changed before the watcher is installed via $scope.$watch(). The controller does not automatically become notified about the initial state update.
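The outcome of the race can also be reproduced deterministically, without real timers. The sketch below uses a tiny virtual-time scheduler and a generic listener list as a stand-in for the notification mechanism; it does not use AngularJS, and `makeScheduler` and `simulate` are hypothetical names.

```javascript
// Tiny virtual-time scheduler: runs queued tasks in order of their due time.
function makeScheduler() {
  var now = 0;
  var tasks = [];
  return {
    timeout: function (fn, delay) { tasks.push({at: now + delay, fn: fn}); },
    run: function () {
      while (tasks.length > 0) {
        tasks.sort(function (x, y) { return x.at - y.at; });
        var task = tasks.shift();
        now = task.at;
        task.fn();
      }
    }
  };
}

function simulate(updateDelay, controllerInitDelay) {
  var scheduler = makeScheduler();
  var state = {data: null};
  var listeners = [];
  var seenByController = [];

  // Racer A: the service receives its data after updateDelay and notifies.
  scheduler.timeout(function () {
    state.data = 'payload';
    listeners.forEach(function (listener) { listener(state.data); });
  }, updateDelay);

  // Racer B: the controller registers its listener after controllerInitDelay.
  scheduler.timeout(function () {
    listeners.push(function (data) { seenByController.push(data); });
  }, controllerInitDelay);

  scheduler.run();
  return seenByController;
}

var controllerFirst = simulate(1000, 0);  // listener is installed in time
var updateFirst = simulate(500, 1000);    // update happens before listener exists
```

With the delays of the plunk (500 ms versus 1000 ms) the controller in this model never sees the initial update.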

This race condition and all discussed $watch-related traps are fixed/non-existing in the solution provided in the next section.

A better solution

$broadcast() / $emit() instead of $watch()

Many on-line resources discourage overusing $broadcast / $emit in AngularJS applications. While that may be good advice in principle, I want to use this opportunity to speak in favor of $broadcast. I think that in my described use case this technique is a perfect fit. Compared to the $watch-based solution discussed above, the simple $broadcast / $emit event semantics have clear advantages. Why is that? Because $broadcast allows for cleanly decoupling three processes:

  1. Construction/modification of the shared data.
  2. Update detection.
  3. Event transmission.

These three processes are inseparably intertwined when one uses $watch(). Having them decoupled provides flexibility. This flexibility can be translated into the following advantages:

  1. “Change detection” code is not executed upon each $digest() cycle. It needs to be explicitly invoked and can usually be derived from foreign triggers (such as an AJAX call done callback/promise).
  2. Event transmission is of constant complexity (O(1)). It will always be, even if the “watched object” changes.
  3. There is no artificial intelligence working behind the scenes that re-interprets what a data change might have meant. The situation becomes as simple as possible: one event has one meaning. If that is what is wanted, then the event is emitted. Event emission and event absorption both are under precise control of the developer.

I have therefore modified the architecture shown before:

  • After having retrieved data from the remote resource, the service now broadcasts the event state_updated through the $rootScope. This event gets emitted to all scopes, and is therefore visible to all controllers (although in our example there is only one controller).
  • The controller installs a listener for this event and simply calls useStateData() when the event flies in. No case analysis required — we know what this event means, its emission is under our precise control, and we react to it always in the same way.

This is the code:

var UPDATE_STATE_DELAY = 500;
var CONTROLLER_INIT_DELAY = 1000;
var app = angular.module('testApp', []);
 
 
app.factory('StateService', ['$rootScope', '$timeout',
function($rootScope, $timeout) {
 
  console.log('StateService: startup.');
  var service = {state: {data: null}};
 
  service.updateState = function() {
    // Simulate data retrieval from a remote resource: data assignment (and
    // event broadcast) happens some time after service initialization.
    console.log("StateService: updateState(). Retrieving data...");
    $timeout(function() {
      console.log("StateService: ...got data, broadcast state_updated");
      service.state.data = "payload";
      $rootScope.$broadcast('state_updated');
    }, UPDATE_STATE_DELAY);
  };
 
  // Update state automatically once upon service (app) startup.
  service.updateState();
 
  return service;
}]);
 
 
app.controller('Ctrl', ['$scope', '$timeout', 'StateService',
function($scope, $timeout, stateService) {
 
  function useStateData() {
    console.log("Ctrl: useStateData(): " + stateService.state.data);
  }
 
  function init() {
    console.log('Ctrl: init. Install event handler for state_updated');
    // Install event handler, for being responsive to future state updates.
    // Handler is attached to local $scope, so it gets automatically destroyed
    // upon controller destruction.
    $scope.$on('state_updated', function () {
      console.log("Ctrl: state_updated event retrieved. Use data.");
      useStateData();
    });
    // If there have been state updates in the past (between application start
    // and controller initialization), handle the last one of those updates.
    if (stateService.state.data) {
      console.log("Ctrl: init: there is some data already. Use it!");
      useStateData();
    }
  }
 
  // Simulate longish app init time: delay execution of this controller init.  
  $timeout(function() {
    init();
  }, CONTROLLER_INIT_DELAY);
 
  // Provide the user with a method to trigger updateState() via button click.
  $scope.buttonclick = function () {
    console.log("Ctrl: Call stateService.updateState() due to button click.");
    stateService.updateState();
  };
}]);

$broadcast event handlers created in controllers and listening on $rootScope need to be destroyed manually if not needed anymore, otherwise they survive as long as the application lives, possibly resulting in a memory leak. This can be prevented by destroying such event listeners upon controller destruction. As noted in the code right above, this is not necessary when listening on the child scope: Controller destruction triggers destruction of its child scope, which itself triggers destruction of all event handlers. Great.

Strictly speaking, the complexity of calling $broadcast() depends on the number of child scopes existing in the application at the time of event emission. This number usually is not large at all and about constant. Using $emit() on the root scope, event emission can be made a real O(1) operation. It notifies just the root scope and therefore does not require iterating through the child scopes. However, when doing so, one needs to inject the root scope into controllers, and attach event handlers to it. As stated before, such handlers should be manually removed upon controller destruction. This benchmark shows that for 100 child scopes, $emit() is significantly faster than $broadcast().
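The difference in traversal effort can be illustrated with a naive scope tree. This is a model only: it is not Angular's implementation, the real $broadcast() and $emit() do not return a visit count, and AngularJS itself may apply optimizations that this sketch ignores. The `Scope` constructor and its `fire` helper are invented for the illustration.

```javascript
// Naive scope tree, just enough to count how many scopes an event visits.
function Scope(parent) {
  this.parent = parent || null;
  this.children = [];
  this.listeners = {};
  if (parent) parent.children.push(this);
}
Scope.prototype.$on = function (name, fn) {
  (this.listeners[name] = this.listeners[name] || []).push(fn);
};
Scope.prototype.fire = function (name) {
  (this.listeners[name] || []).forEach(function (fn) { fn(); });
};
// Downwards: visit this scope and all descendants. Returns the visit count.
Scope.prototype.$broadcast = function (name) {
  var visited = 1;
  this.fire(name);
  this.children.forEach(function (child) {
    visited += child.$broadcast(name);
  });
  return visited;
};
// Upwards: visit this scope and all ancestors. Returns the visit count.
Scope.prototype.$emit = function (name) {
  var visited = 0;
  for (var scope = this; scope; scope = scope.parent) {
    scope.fire(name);
    visited += 1;
  }
  return visited;
};

var root = new Scope();
for (var i = 0; i < 100; i++) new Scope(root);  // 100 child scopes

var received = 0;
root.$on('state_updated', function () { received += 1; });

var broadcastVisits = root.$broadcast('state_updated');  // root + 100 children
var emitVisits = root.$emit('state_updated');            // root only
```

In both cases the listener on the root scope is called exactly once per event; only the number of visited scopes differs.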

Race condition eliminated

The race condition discussed before has been eliminated from the last code example by simply calling useStateData() in the controller, right after installing the event handler, if stateService.state.data is not null. Why does this work, and doesn’t this introduce even more subtle race conditions? Can’t this cause useStateData() to be called twice on the same data?

The main reason why that works is that we can make certain assumptions about the execution flow, as discussed in the following paragraph. Let us have a careful look at init() in the controller code:

  1.   function init() {
  2.     $scope.$on('state_updated', function () {
  3.       useStateData();
  4.     });
  5.     if (stateService.state.data) {
  6.       useStateData();
  7.     }
  8.   }

The first action is that the event handler is installed. The essential insight is: the handler function will for sure not be invoked before init() returns. Why? JavaScript can be considered single-threaded (there is no simultaneous code execution, there is only one (virtual) execution thread). In fact, JavaScript functions run to completion: they cannot be interrupted by other code, so each call is an atomic execution unit. That is, once the execution flow enters init(), it does not leave it until init()’s end is reached. There is simply no time slice for the registered event handler to be invoked before init() returns. That means: if there have been state updates in the past (before init() was invoked),

  • the event listener is installed after the last update event was emitted by the service,
  • stateService.state.data is not null anymore when init() reaches line 5 (the developer needs to guarantee that no update ever resets that property to null) and, consequently,
  • useStateData() in line 6 becomes invoked.

Any (previous or future) foreign call to StateService.updateState() from elsewhere in the application results — at some point in time — in execution of this function (defined in the service code):

function() {
  service.state.data = "payload";
  $rootScope.$broadcast('state_updated');
}

This itself is an atomic execution unit where data modification and event emission are condensed within a single transaction (they do not go at all or they go together). Given the above considerations, this execution unit is not invoked before the end of the init() function is reached. Consequently, the code in init() guarantees that the two calls to useStateData() (lines 3 & 6) are always separated by an assignment (via the = operator) to service.state.data.
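The ordering argument can be checked with a small model in which the update and init() are run as whole units in different orders. It is plain JavaScript without AngularJS; the function `run` and the task names are hypothetical, and the model uses an explicit comparison with null where the article's code uses a truthiness check.

```javascript
// Model of the init() pattern. Each task runs to completion before the next
// one starts, mirroring how JavaScript executes callbacks one at a time.
function run(order) {
  var state = {data: null};
  var handlers = [];
  var used = [];  // values seen by useStateData()

  function useStateData() { used.push(state.data); }

  var tasks = {
    update: function () {  // assignment and notification in one unit
      state.data = 'payload';
      handlers.forEach(function (handler) { handler(); });
    },
    init: function () {
      handlers.push(useStateData);              // install the handler first
      if (state.data !== null) useStateData();  // then catch up on past updates
    }
  };

  order.forEach(function (name) { tasks[name](); });
  return used;
}

var updateBeforeInit = run(['update', 'init']);        // caught by the check in init()
var updateAfterInit = run(['init', 'update']);         // caught by the event handler
var updateTwice = run(['update', 'init', 'update']);   // one call per assignment
```

In every ordering, the number of useStateData() calls made after init() equals the number of assignments that the controller could have observed, and no assignment is handled twice.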

Best-practice MWE

The following piece of code is based on all considerations made above and stripped of comments and console output. Play with it (run it using the “Preview” tab) and feel free to reuse it:

(Download plunk)

Summary

I hope to have shown you that in certain cases a $watch()-based solution may result in undesired code behavior, and that using $broadcast()- or $emit()-based communication of state updates might yield simpler and yet more reliable code. Also, please remember that $watch() has the potential to produce a severe performance regression. In the last part of the article I pointed out that one should not accidentally make startup code depend on the difference between application loading time and remote resource query latency. This introduces race conditions which usually are difficult to reproduce and debug.

Thanks for reading, and of course I’d be happy to receive some feedback.