WP-GeSHi-Highlight 1.3.0 released

I have released version 1.3.0 of WP-GeSHi-Highlight. WP-GeSHi-Highlight is a popular code syntax highlighting plugin for WordPress, based on the established PHP highlighting library GeSHi.

These are the changes compared to version 1.2.3:

  • Enhance compatibility of the default stylesheet with a large range of themes by increasing the specificity of certain CSS selectors and by adding more style directives. This ensures a better out-of-the-box experience. Thanks to Pascal Krause for reporting an incompatilibity with Twenty Ten.
  • Increase compatibility with CDNs: fix double slash appearing in CSS file URL.
  • Remove redundant call to wp_register_style().
  • Change style sheet ID prefix, add newline characters to GeSHi CSS code output.
  • Improve code documentation and readability.

In fact, this update contains a considerably consolidated default stylesheet (with just a minimal set of changes). I have verified that it works well with all recent versions of the important official themes:

  • Twenty Ten (version 1.9)
  • Twenty Eleven (version 2.1)
  • Twenty Twelve (version 1.7)
  • Twenty Thirteen (version 1.5)
  • Twenty Fourteen (version 1.4)
  • Twenty Fifteen (version 1.2)

I have also tested the new default style sheet with the following popular themes: Vantage, Customizr, ColorWay, Zerif Lite, Responsive, Storefront, Virtue, evolve, Make, Sparkling, Spacious, Enigma, Sydney, Point, Interface, SinglePage.

I want to use this opportunity to stress how much I like the recent community developments around WP-GeSHi-Highlight. Almost five years have passed since I first released this plugin to the WP repository, and I never really put effort into spreading it. Nevertheless, according to the WordPress statistics, it is used on more than one thousand websites. I have searched the web to find a few examples about who uses this plugin for which purpose, and want to share two of my findings:

  • In this StackOverflow thread, Aaron Bertrand, a ninja database expert, revealed that he likes the GeSHi-based approach so much (and he uses WP-GeSHi-Highlight in his blogs), that he was willing to provide a huge bounty to someone helping him out to improve the syntax highlighting for the T-SQL language in subtle ways. Another ninja joined the discussion for helping Aaron, and Aaron had more wishes like “There’s always one little extra thing, right? I promise this is it. I want to color the N prefix on Unicode strings in red.”. They found their solution, great. I proposed to them to contribute their findings upstream, straight to GeSHi, so that I, further downstream, can incorporate them in WP-GeSHi-Highlight at some point. I love this.

  • On Make Create Reiterate, Tina Dvyver uses WP-GeSHi-Highlight (and writes about it) as of its support for an exotic language in the world of highlighters, X++. What I like about her blog is that she actually took her time to adjust the default stylesheet to her needs, and that she came up with a neat way to hide the dots in the numbered list, when using line numbering. She writes about it here.

One more thing I want to mention: in the past days I had a productive discussion with two users of this plugin, in this support thread. Turns out that WP-GeSHi-Highlight was not responsible for the issue reported there, but this discussion still lead to fixing the double-slash issue mentioned above in the change log. It also lead to me knowing of one more person who actively transitioned from WP-Syntax to WP-GeSHi-Highlight, and to one more review on the wordpress.org plugin page.

I realize that I should put more effort into spreading this plugin, as it has obvious architectural advantages over heavy client-side highlighters. Unfortunately, most of the “Top X syntax highlighting plugins for WordPress” blog posts around the web do not really clarify the significant difference between back-end highlighters and client-side ones. I’ll possibly cover that in a future blog post.

Git: list authors sorted by the time of their first contribution

I have created a micro Python script, git-authors, which parses the output of git log and outputs the authors in order of their first contribution to the repository.

By itself, this is a rather boring task. However, I think the resulting code is quite interesting, because it applies a couple of really important concepts and Python idioms within just a couple of lines. Hence, this small piece of code is not only useful in practice; it also serves an educational purpose.

The latter is what this article focuses on. What can you expect to learn? Less than ten lines of code are discussed. Center of this discussion is efficient stream processing and proper usage of data structures: with a very simple and yet efficient architecture, git-authors can analyze more than 500.000 commits of the Linux git repository with near-zero memory requirements and consumption of only 1 s of CPU time on a weak machine.

Usage: pipe formatted data from git log to git-authors

The recommended usage of git-authors is:

$ git log --encoding=utf-8 --full-history --reverse "--format=format:%at;%an;%ae" | \
    git-authors > authors-by-first-contrib.txt

A little background information might help for understanding this. git-authors expects to be fed with a data stream on standard input (stdin), composed of newline-separated chunks. Each chunk (line) is expected to represent one commit, and is expected to be of a certain format:

timestamp;authorname;authoremail

Furthermore, git-authors expects to retrieve these commits sorted by time, ascendingly (newest last).

git log can be configured to output the required format (via --format=format:%at;%an;%ae) and to output commits sorted by time (default) with the earliest commits first (using --reverse).

git log writes its output data to standard output (stdout). The canonical method for connecting stdout of one process to stdin of another process is a pipe, a transport mechanism provided by the operating system.

The command line options --encoding=utf-8 and --full-history should not be discussed here, for simplicity.

The input evaluation loop

Remember, the git-authors program expects to retrieve a data stream via stdin. An example snippet of such a data stream could look like this:

[...]
1113690343;Bert Wesarg;wesarg@informatik.uni-halle.de
1113690343;Ken Chen;kenneth.w.chen@intel.com
1113690344;Christoph Hellwig;hch@lst.de
1113690345;Bernard Blackham;bernard@blackham.com.au
1113690346;Jan Kara;jack@suse.cz
[...]

The core of git-authors is a loop construct built of seven lines of code. It processes named input stream in a line-by-line fashion. Let’s have a look at it:

  1. seen = set()
  2. for line in stdin:
  3.     timestamp, name, mail = line.strip().split(";")
  4.     if name not in seen:
  5.         seen.add(name)
  6.         day = time.strftime("%Y-%m-%d", time.gmtime(float(timestamp)))
  7.         stdout.write("%04d (%s): %s (%s)\n" % (len(seen), day, name, mail))

There are a couple of remarks to be made about this code:

  • This code processes the stream retrieved at standard input in a line-by-line fashion: in line 2, the script makes use of the fact that Python streams (implemented via IOBase) support the iterator protocol, meaning that they can be iterated over, whereas a single line is yielded from the stream upon each iteration (until the resource has been entirely consumed).

  • The data flow in the loop is constructed in a way that the majority of the payload data (the line’s content) is processed right away and not stored for later usage. This is a crucial concept, ensuring a small memory footprint and, even more important, a memory footprint that does (almost) not depend on the size of the input data (which also means that the memory consumption becomes largely time-independent). The minimum amount of data that this program requires to keep track of across loop iterations is a collection of unique author names already seen (repetitions are to be discarded). And that is exactly what is stored in the set called seen. How large might this set become? An estimation: how many unique author names will the largest git project ever accumulate? A couple of thousand maybe? As of summer 2015, the Linux git repository counts more than half a million commits and about 13.000 unique author names. Linux should have one of the largest if not the largest git history. It can safely be assumed that O(10^5) short strings is the maximum amount of data this program will ever need to store in memory. How much memory is required for storing a couple of thousand short strings in Python? You might want to measure this, but it is almost nothing, at least compared to how much memory is built into smartphones. When analyzing the Linux repository, the memory footprint of git-authors stayed well below 10 MB.

  • Line 3 demonstrates how powerful Python’s string methods are, especially when cascaded. It also shows how useful multiple assignment via sequence unpacking can be.

  • “Use the right data structure(s) for any given problem!” is an often-preached rule, for ensuring that the time complexity of the applied algorithm is not larger than the problem requires. Lines 1, 4, and 5 are a great example for this, I think. Here, we want to keep track of the authors that have already been observed in the stream (“seen”). This kind of problem naturally requires a data structure allowing for lookup (Have I seen you yet?) and insert (I’ve seen you!) operations. Generally, a hash table-like data structure fits these problems best, because it provides O(1) (constant) complexity for both, lookup and insertion. In Python, the dictionary implementation as well as the set implementation are both based on a hash table (in fact, these implementations share a lot of code). Both dicts and sets also provide a len() method of constant complexity, which I have made use of in line 7. Hence, the run time of this algorithm is proportional to the input size (to the number of lines in the input). It is impossible to scale better than that (every line needs to be looked at), but there are many sub-optimal ways to implement a solution that scales worse than linearly.

  • The reason why I chose to use a set instead of a dictionary is rather subtle: I think the add() semantics of set fit the given problem really well, and here we really just want to keep track of keys (the typical key-value association of the dictionary is not required here). Performance-wise, the choice shouldn’t make a significant difference.

  • Line 6 demonstrates the power of the time module. Take a Unix timestamp and generate a human-readable time string from it: easy. In my experience, strftime() is one of the most-used methods in the time module, and learning its format specifiers by heart can be really handy.

  • Line 7: old-school powerful Python string formatting with a very compact syntax. Yes, the “new” way for string formatting (PEP 3101) has been around for years, and deprecation of the old style was once planned. Truth is, however, that the old-style formatting is just too beloved and established, and will probably never even become deprecated, let alone removed. Its functionality was just extended in Python 3.5 via PEP 461.

Preparation of input and output streams

What is not shown in the snippet above is the preparation of the stdin and stdout objects. I have come up with the following method:

kwargs = {"errors": "replace", "encoding": "utf-8", "newline": "\n"}
stdin = io.open(sys.stdin.fileno(), **kwargs)
stdout = io.open(sys.stdout.fileno(), mode="w", **kwargs)

This is an extremely powerful recipe for obtaining the same behavior on Python 2 as well as on 3, but also on Windows as well as on POSIX-compliant platforms. There is a long story behind this which should not be the focus of this very article. In essence, Python 2 and Python 3 treat sys.stdin/sys.stdout very differently. Grabbing the underlying file descriptors by their balls via fileno() and creating TextIOWrapper stream objects on top of them is a powerful way to disable much of Python’s automagic and therefore to normalize behavior among platforms. The automagic I am referring to here especially includes Python 3’s platform-dependent automatic input decoding and output encoding, and universal newline support. Both really can add an annoying amount of complexity in certain situations, and this here is one such case.

Example run on CPython’s (inofficial) git repository

I applied git-authors to the current state of the inofficial CPython repository hosted at GitHub. As a side node, this required about 0.1 s of CPU time on my test machine. I am showing the output in full length below, because I find its content rather interesting. We have to appreciate that the commit history is not entirely broken, despite CPython having switched between different version control systems over the last 25 years. Did you know that Just van Rossum also was a committer? :-)

0001 (1990-08-09): Guido van Rossum (guido@python.org)
0002 (1992-08-04): Sjoerd Mullender (sjoerd@acm.org)
0003 (1992-08-13): Jack Jansen (jack.jansen@cwi.nl)
0004 (1993-01-10): cvs2svn (tools@python.org)
0005 (1994-07-25): Barry Warsaw (barry@python.org)
0006 (1996-07-23): Fred Drake (fdrake@acm.org)
0007 (1996-12-09): Roger E. Masse (rmasse@newcnri.cnri.reston.va.us)
0008 (1997-08-13): Jeremy Hylton (jeremy@alum.mit.edu)
0009 (1998-03-03): Ken Manheimer (klm@digicool.com)
0010 (1998-04-09): Andrew M. Kuchling (amk@amk.ca)
0011 (1998-12-18): Greg Ward (gward@python.net)
0012 (1999-01-22): Just van Rossum (just@lettererror.com)
0013 (1999-11-07): Greg Stein (gstein@lyra.org)
0014 (2000-05-12): Gregory P. Smith (greg@mad-scientist.com)
0015 (2000-06-06): Trent Mick (trentm@activestate.com)
0016 (2000-06-07): Marc-André Lemburg (mal@egenix.com)
0017 (2000-06-09): Mark Hammond (mhammond@skippinet.com.au)
0018 (2000-06-29): Fredrik Lundh (fredrik@pythonware.com)
0019 (2000-06-30): Skip Montanaro (skip@pobox.com)
0020 (2000-06-30): Tim Peters (tim.peters@gmail.com)
0021 (2000-07-01): Paul Prescod (prescod@prescod.net)
0022 (2000-07-10): Vladimir Marangozov (vladimir.marangozov@t-online.de)
0023 (2000-07-10): Peter Schneider-Kamp (nowonder@nowonder.de)
0024 (2000-07-10): Eric S. Raymond (esr@thyrsus.com)
0025 (2000-07-14): Thomas Wouters (thomas@python.org)
0026 (2000-07-29): Moshe Zadka (moshez@math.huji.ac.il)
0027 (2000-08-15): David Scherer (dscherer@cmu.edu)
0028 (2000-09-07): Thomas Heller (theller@ctypes.org)
0029 (2000-09-08): Martin v. Löwis (martin@v.loewis.de)
0030 (2000-09-15): Neil Schemenauer (nascheme@enme.ucalgary.ca)
0031 (2000-09-21): Lars Gustäbel (lars@gustaebel.de)
0032 (2000-09-24): Nicholas Riley (nriley@sabi.net)
0033 (2000-10-03): Ka-Ping Yee (ping@zesty.ca)
0034 (2000-10-06): Jim Fulton (jim@zope.com)
0035 (2001-01-10): Charles G. Waldman (cgw@alum.mit.edu)
0036 (2001-03-22): Steve Purcell (steve@pythonconsulting.com)
0037 (2001-06-25): Steven M. Gava (elguavas@python.net)
0038 (2001-07-04): Kurt B. Kaiser (kbk@shore.net)
0039 (2001-07-04): unknown (tools@python.org)
0040 (2001-07-20): Piers Lauder (piers@cs.su.oz.au)
0041 (2001-08-23): Finn Bock (bckfnn@worldonline.dk)
0042 (2001-08-27): Michael W. Hudson (mwh@python.net)
0043 (2001-10-31): Chui Tey (chui.tey@advdata.com.au)
0044 (2001-12-19): Neal Norwitz (nnorwitz@gmail.com)
0045 (2001-12-21): Anthony Baxter (anthonybaxter@gmail.com)
0046 (2002-02-17): Andrew MacIntyre (andymac@bullseye.apana.org.au)
0047 (2002-03-21): Walter Dörwald (walter@livinglogic.de)
0048 (2002-05-12): Raymond Hettinger (python@rcn.com)
0049 (2002-05-15): Jason Tishler (jason@tishler.net)
0050 (2002-05-28): Christian Tismer (tismer@stackless.com)
0051 (2002-06-14): Steve Holden (steve@holdenweb.com)
0052 (2002-09-23): Tony Lownds (tony@lownds.com)
0053 (2002-11-05): Gustavo Niemeyer (gustavo@niemeyer.net)
0054 (2003-01-03): David Goodger (goodger@python.org)
0055 (2003-04-19): Brett Cannon (bcannon@gmail.com)
0056 (2003-04-22): Alex Martelli (aleaxit@gmail.com)
0057 (2003-05-17): Samuele Pedroni (pedronis@openend.se)
0058 (2003-06-09): Andrew McNamara (andrewm@object-craft.com.au)
0059 (2003-10-24): Armin Rigo (arigo@tunes.org)
0060 (2003-12-10): Hye-Shik Chang (hyeshik@gmail.com)
0061 (2004-02-18): David Ascher (david.ascher@gmail.com)
0062 (2004-02-20): Vinay Sajip (vinay_sajip@yahoo.co.uk)
0063 (2004-03-21): Nicholas Bastin (nick.bastin@gmail.com)
0064 (2004-03-25): Phillip J. Eby (pje@telecommunity.com)
0065 (2004-08-04): Matthias Klose (doko@ubuntu.com)
0066 (2004-08-09): Edward Loper (edloper@gradient.cis.upenn.edu)
0067 (2004-08-09): Dave Cole (djc@object-craft.com.au)
0068 (2004-08-14): Johannes Gijsbers (jlg@dds.nl)
0069 (2004-09-17): Sean Reifschneider (jafo@tummy.com)
0070 (2004-10-16): Facundo Batista (facundobatista@gmail.com)
0071 (2004-10-21): Peter Astrand (astrand@lysator.liu.se)
0072 (2005-03-28): Bob Ippolito (bob@redivi.com)
0073 (2005-06-03): Georg Brandl (georg@python.org)
0074 (2005-11-16): Nick Coghlan (ncoghlan@gmail.com)
0075 (2006-03-30): Ronald Oussoren (ronaldoussoren@mac.com)
0076 (2006-04-17): George Yoshida (dynkin@gmail.com)
0077 (2006-04-23): Gerhard Häring (gh@ghaering.de)
0078 (2006-05-23): Richard Jones (richard@commonground.com.au)
0079 (2006-05-24): Andrew Dalke (dalke@dalkescientific.com)
0080 (2006-05-25): Kristján Valur Jónsson (kristjan@ccpgames.com)
0081 (2006-05-25): Jack Diederich (jackdied@gmail.com)
0082 (2006-05-26): Martin Blais (blais@furius.ca)
0083 (2006-07-28): Matt Fleming (mattjfleming@googlemail.com)
0084 (2006-09-05): Sean Reifscheider (jafo@tummy.com)
0085 (2007-03-08): Collin Winter (collinw@gmail.com)
0086 (2007-03-11): Žiga Seilnacht (ziga.seilnacht@gmail.com)
0087 (2007-06-07): Alexandre Vassalotti (alexandre@peadrop.com)
0088 (2007-08-16): Mark Summerfield (list@qtrac.plus.com)
0089 (2007-08-18): Travis E. Oliphant (oliphant@enthought.com)
0090 (2007-08-22): Jeffrey Yasskin (jyasskin@gmail.com)
0091 (2007-08-25): Eric Smith (eric@trueblade.com)
0092 (2007-08-29): Bill Janssen (janssen@parc.com)
0093 (2007-10-31): Christian Heimes (christian@cheimes.de)
0094 (2007-11-10): Amaury Forgeot d'Arc (amauryfa@gmail.com)
0095 (2008-01-08): Mark Dickinson (dickinsm@gmail.com)
0096 (2008-03-17): Steven Bethard (steven.bethard@gmail.com)
0097 (2008-03-18): Trent Nelson (trent.nelson@snakebite.org)
0098 (2008-03-18): David Wolever (david@wolever.net)
0099 (2008-03-25): Benjamin Peterson (benjamin@python.org)
0100 (2008-03-26): Jerry Seutter (jseutter@gmail.com)
0101 (2008-04-16): Jeroen Ruigrok van der Werven (asmodai@in-nomine.org)
0102 (2008-05-13): Jesus Cea (jcea@jcea.es)
0103 (2008-05-24): Guilherme Polo (ggpolo@gmail.com)
0104 (2008-06-01): Robert Schuppenies (okkotonushi@googlemail.com)
0105 (2008-06-10): Josiah Carlson (josiah.carlson@gmail.com)
0106 (2008-06-10): Armin Ronacher (armin.ronacher@active-4.com)
0107 (2008-06-18): Jesse Noller (jnoller@gmail.com)
0108 (2008-06-23): Senthil Kumaran (orsenthil@gmail.com)
0109 (2008-07-22): Antoine Pitrou (solipsis@pitrou.net)
0110 (2008-08-14): Hirokazu Yamamoto (ocean-city@m2.ccsnet.ne.jp)
0111 (2008-12-24): Tarek Ziadé (ziade.tarek@gmail.com)
0112 (2009-03-30): R. David Murray (rdmurray@bitdance.com)
0113 (2009-04-01): Michael Foord (fuzzyman@voidspace.org.uk)
0114 (2009-04-11): Chris Withers (chris@simplistix.co.uk)
0115 (2009-05-08): Philip Jenvey (pjenvey@underboss.org)
0116 (2009-06-25): Ezio Melotti (ezio.melotti@gmail.com)
0117 (2009-08-02): Frank Wierzbicki (fwierzbicki@gmail.com)
0118 (2009-09-20): Doug Hellmann (doug.hellmann@gmail.com)
0119 (2010-01-30): Victor Stinner (victor.stinner@haypocalc.com)
0120 (2010-02-23): Dirkjan Ochtman (dirkjan@ochtman.nl)
0121 (2010-02-24): Larry Hastings (larry@hastings.org)
0122 (2010-02-26): Florent Xicluna (florent.xicluna@gmail.com)
0123 (2010-03-25): Brian Curtin (brian.curtin@gmail.com)
0124 (2010-04-01): Stefan Krah (stefan@bytereef.org)
0125 (2010-04-10): Jean-Paul Calderone (exarkun@divmod.com)
0126 (2010-04-18): Giampaolo Rodolà (g.rodola@gmail.com)
0127 (2010-05-26): Alexander Belopolsky (alexander.belopolsky@gmail.com)
0128 (2010-08-06): Tim Golden (mail@timgolden.me.uk)
0129 (2010-08-14): Éric Araujo (merwok@netwok.org)
0130 (2010-08-22): Daniel Stutzbach (daniel@stutzbachenterprises.com)
0131 (2010-09-18): Brian Quinlan (brian@sweetapp.com)
0132 (2010-11-05): David Malcolm (dmalcolm@redhat.com)
0133 (2010-11-09): Ask Solem (askh@opera.com)
0134 (2010-11-10): Terry Reedy (tjreedy@udel.edu)
0135 (2010-11-10): Łukasz Langa (lukasz@langa.pl)
0136 (2012-06-24): Ned Deily (nad@acm.org)
0137 (2011-01-14): Eli Bendersky (eliben@gmail.com)
0138 (2011-03-10): Eric V. Smith (eric@trueblade.com)
0139 (2011-03-10): R David Murray (rdmurray@bitdance.com)
0140 (2011-03-12): orsenthil (orsenthil@gmail.com)
0141 (2011-03-14): Ross Lagerwall (rosslagerwall@gmail.com)
0142 (2011-03-14): Reid Kleckner (reid@kleckner.net)
0143 (2011-03-14): briancurtin (brian.curtin@gmail.com)
0144 (2011-03-24): guido (guido@google.com)
0145 (2011-03-30): Kristjan Valur Jonsson (sweskman@gmail.com)
0146 (2011-04-04): brian.curtin (brian@python.org)
0147 (2011-04-12): Nadeem Vawda (nadeem.vawda@gmail.com)
0148 (2011-04-19): Giampaolo Rodola' (g.rodola@gmail.com)
0149 (2011-05-04): Alexis Metaireau (alexis@notmyidea.org)
0150 (2011-05-09): Gerhard Haering (gh@ghaering.de)
0151 (2011-05-09): Petri Lehtinen (petri@digip.org)
0152 (2011-05-24): Charles-François Natali (neologix@free.fr)
0153 (2011-07-17): Alex Gaynor (alex.gaynor@gmail.com)
0154 (2011-07-27): Jason R. Coombs (jaraco@jaraco.com)
0155 (2011-08-02): Sandro Tosi (sandro.tosi@gmail.com)
0156 (2011-09-28): Meador Inge (meadori@gmail.com)
0157 (2012-01-09): Terry Jan Reedy (tjreedy@udel.edu)
0158 (2011-05-19): Tarek Ziade (tarek@ziade.org)
0159 (2011-05-22): Martin v. Loewis (martin@v.loewis.de)
0160 (2011-05-31): Ralf Schmitt (ralf@systemexit.de)
0161 (2011-09-12): Jeremy Kloth (jeremy.kloth@gmail.com)
0162 (2012-03-14): Andrew Svetlov (andrew.svetlov@gmail.com)
0163 (2012-03-21): krisvale (sweskman@gmail.com)
0164 (2012-04-24): Marc-Andre Lemburg (mal@egenix.com)
0165 (2012-04-30): Richard Oudkerk (shibturn@gmail.com)
0166 (2012-05-15): Hynek Schlawack (hs@ox.cx)
0167 (2012-06-20): doko (doko@ubuntu.com)
0168 (2012-07-16): Atsuo Ishimoto (ishimoto@gembook.org)
0169 (2012-09-02): Zbigniew Jędrzejewski-Szmek (zbyszek@in.waw.pl)
0170 (2012-09-06): Eric Snow (ericsnowcurrently@gmail.com)
0171 (2012-09-25): Chris Jerdonek (chris.jerdonek@gmail.com)
0172 (2012-12-27): Serhiy Storchaka (storchaka@gmail.com)
0173 (2013-03-31): Roger Serwy (roger.serwy@gmail.com)
0174 (2013-03-31): Charles-Francois Natali (cf.natali@gmail.com)
0175 (2013-05-10): Andrew Kuchling (amk@amk.ca)
0176 (2013-06-14): Ethan Furman (ethan@stoneleaf.us)
0177 (2013-08-12): Felix Crux (felixc@felixcrux.com)
0178 (2013-10-21): Peter Moody (python@hda3.com)
0179 (2013-10-25): bquinlan (brian@sweetapp.com)
0180 (2013-11-04): Zachary Ware (zachary.ware@gmail.com)
0181 (2013-12-02): Walter Doerwald (walter@livinglogic.de)
0182 (2013-12-21): Donald Stufft (donald@stufft.io)
0183 (2014-01-03): Daniel Holth (dholth@fastmail.fm)
0184 (2014-01-27): Yury Selivanov (yselivanov@sprymix.com)
0185 (2014-04-15): Kushal Das (kushaldas@gmail.com)
0186 (2014-06-29): Berker Peksag (berker.peksag@gmail.com)
0187 (2014-07-16): Tal Einat (taleinat@gmail.com)
0188 (2014-10-08): Steve Dower (steve.dower@microsoft.com)
0189 (2014-10-18): Robert Collins (rbtcollins@hp.com)
0190 (2015-03-22): Paul Moore (p.f.moore@gmail.com)

Resources

Similar functionality is provided by the more full-blown frameworks grunt-git-authors and gitstats.

Some resources that might be insightful for you:

Insights into SoundCloud culture

SoundCloud (cf. Wikipedia) is a young Berlin-based company “under the laws of England & Wales”, and with Swedish origin. Six years ago, in 2009, they acquired their first big funding. Since then, they experienced a tremendous growth and were able to regularly raise investment capital. Until today, SoundCloud has created and put itself into a unique position with a convincing product (which I am using — and paying — myself), but can also be considered as competitor of big brands such as Spotify and Beats Music. In fact, according to the Wall Street Journal, SoundCloud can be expected to join the party of billion-dollar IT companies quite soon:

SoundCloud, a popular music and audio-sharing service, is in discussions to raise about $150 million in new financing at a valuation that is expected to top $1.2 billion, according to two people with knowledge of the negotiations.

Having these facts in mind, it is impressive to hear that SoundCloud still only employs 300 people, in just a handful of offices around the world. Just like me, you might be curious about getting to know details of this kind, and about the SoundCloud story in itself. So, I was really eager to listen to Episode 17 of “Hipster & Hack“, featuring an interview with David Noël (Twitter profile, LinkedIn profile). David has accompanied SoundCloud for six years now and currently leads Internal Communications. He clearly is in the position to provide authoritative information about how SoundCloud’s vision was translated into reality over time, but also about how culture and communication within SoundCloud evolved. The latter is what he mainly talks about in the interview, providing insights about the structure and tools applied for defining a culture, for keeping it under control, and for communicating it to employees from the very first moment on until even after they have left the company. David defines culture as the living manifestation of core values and comes to insightful statements such as

Living your values is your culture at any moment in time.

In the interview, we learn that one of SoundCloud’s core values is being open, in the context of internal communications. The culture and communication topic really seems to have a high priority in the company, judging based on methods like the “all-hands” meeting that David refers to in the interview. Personally, I cannot overstate how much I value this, coming from classical research where such elements are often just neglected.

So, if that raises your interest, I recommend listening to three quite likable guys here (listening to minutes 4 to 29 suffices, the rest is enjoyable overhead ;-)):

JavaScript in a browser: create a binary blob from a base64-encoded string, and decode it using UTF-8.

For an AngularJS-based web application I am currently working on, I want to put arbitrary text information into a URL, send it to the client, and have it decoded by the browser. With the right tools this should be a trivial process, shouldn’t it?

The basic idea is:

  1. Start with the original text, which is a sequence of unicode code points.
  2. Encode the original text into binary data; with the UTF-8 codec.
  3. base64-encode the resulting binary data (and replace the URL-unsafe characters / and + with, for instance, – and _).
  4. The result is a URL-safe string. Send it. Let’s assume it ends up in a browser (in window.location, for instance).
  5. Invert the entire procedure in the browser:
    1. The data arrives as DOMString type (unicode text, so to say).
    2. Transform URL-safe base64 representation to canonical base64 (replace _ and – characters).
    3. Decode the base64 string into a real binary data type (Uint8Array, for instance).
    4. Decode the binary blob into a DOMString containing the original text, using the UTF-8 codec.

Unfortunately, there so far are no official and no established ways for performing steps 5.3 and 5.4 in a browser environment. There is no obvious way for obtaining a binary blob from base64-encoded data. Further below, I’ll show three different methods for executing this step. Proceeding from here, I realized that there also is still no established way for decoding a binary blob into a DOMString using a given codec (UTF-8 in this case). I’ll show two different methods for performing this task.

The original text and its URL-safe representation

I’ll start with a Python snippet defining the original text and creating its URL-safe base64 representation:

# -*- coding: utf-8 -*-
 
from __future__ import unicode_literals
from base64 import urlsafe_b64encode
 
text = """
«küßЌύБЇ”ﻈﻉﻌﻍﻎ㌀㌁㌂❶❷❸⍝⍞⍟⍠ோௌ«ταБЬℓσ»:
n20%of٩(-̮̮̃-̃)۶٩(●̮̮̃•̃)۶٩(͡๏̯͡๏)۶٩(-̮̮̃•̃)!॒॑॓ஙசປຜἑἔℇ∆∇▆▇█
✆✇✈✉✌✍うぇえぉおㆅㆇ㉫㉬㍍㍎㍏沈拾若掠略亮兩凉梁糧
ﯗﯘﯙﯚﯝﯠﯡﯢカキクケコサシス'kosme':"κόσμε"/?#+퟿
"""
 
data = text.encode("utf-8")
datab64 = urlsafe_b64encode(data)
 
print("text length: %s" % len(text))
print("urlsafe base64 representation of the binary data:\n\n%s" % datab64)

The original text (the variable named text) is meant to contain code points from many different Unicode character blocks. The following resources helped me assembling this test text:

As you can see, the text is first encoded using the UTF-8 codec. The resulting binary data then is put into urlsafe_b64encode(), yielding a URL-safe byte sequence. So, execution of the named Python script yields the following URL-safe representation of the original text:

CsKra8O8w5_DrsK74oCc0IzPjdCR0IfigJ3vu4jvu4nvu4zvu43vu47jjIDjjIHjjILinbbinbfinbjijZ3ijZ7ijZ_ijaDgr4vgr4zCq8-EzrHQkdCs4oSTz4PCuzoKbjIwJW9m2akoLcyuzK7Mgy3MgynbttmpKOKXj8yuzK7Mg-KAosyDKdu22akozaHguY_Mr82h4LmPKdu22akoLcyuzK7Mg-KAosyDKSHgpZHgpZLgpZPgrpngrprgupvgupzhvJHhvJTihIfiiIbiiIfilobilofilogK4pyG4pyH4pyI4pyJ4pyM4pyN44GG44GH44GI44GJ44GK44aF44aH44mr44ms442N442O442P76Wy76Wz76W076W176W276W376W476W576W676W7Cu-vl--vmO-vme-vmu-vne-voO-voe-vou-9tu-9t--9uO-9ue-9uu-9u--9vO-9vSdrb3NtZSc6Is664b25z4POvM61Ii8_Iyvtn78K

The Python script also tells you that the text is 167 characters long, which is a useful reference for later comparison.

Decoding the URL-safe representation in the browser

Here is the test document that implements four different methods for obtaining the original text back from the URL-safe representation (it just doesn’t show anything yet!): http://gehrcke.de/files/perm/blog/2015/jsdecode/test.html

Remember, the text we want to transport is encoded in a URL-safe way, so just for fun I want to make use of this fact in this small demonstration here, and communicate the information via the URL. To that end, the test document executes JavaScript code that extracts a string from the anchor/hash part of the URL:

// Get URL-safe text representation from the address bar.
// This is a DOMString type, as indicated by the text_ variable name prefix.
var text_urlsafe_b64data = window.location.hash.substring(1);

Okay, let’s put the test data into the URL (the output from the Python script above):

http://gehrcke.de/files/perm/blog/2015/jsdecode/test.html#CsKra8O8w5_DrsK74oCc0IzPjdCR0IfigJ3vu4jvu4nvu4zvu43vu47jjIDjjIHjjILinbbinbfinbjijZ3ijZ7ijZ_ijaDgr4vgr4zCq8-EzrHQkdCs4oSTz4PCuzoKbjIwJW9m2akoLcyuzK7Mgy3MgynbttmpKOKXj8yuzK7Mg-KAosyDKdu22akozaHguY_Mr82h4LmPKdu22akoLcyuzK7Mg-KAosyDKSHgpZHgpZLgpZPgrpngrprgupvgupzhvJHhvJTihIfiiIbiiIfilobilofilogK4pyG4pyH4pyI4pyJ4pyM4pyN44GG44GH44GI44GJ44GK44aF44aH44mr44ms442N442O442P76Wy76Wz76W076W176W276W376W476W576W676W7Cu-vl–vmO-vme-vmu-vne-voO-voe-vou-9tu-9t–9uO-9ue-9uu-9u–9vO-9vSdrb3NtZSc6Is664b25z4POvM61Ii8_Iyvtn78K

This is a long URL now, indeed. But it is not too long ;-).

When you access the URL above, an HTML document should show up. It should have four panels, whereas each panel should show the exact same text as initially defined in the Python script above, including all newlines. Each of the individual panels is the result of a different decoding implementation. I recommend looking at the source code of this HTML document, I have commented the different methods sufficiently. I will now quickly go through the different methods.

The first step common to all four methods is to convert from URL-safe base64 encoding to canonical base64 encoding:

function urlsafeb64_to_b64(s) {
  // Replace - with + and _ with /
  return s.replace(/-/g, '+').replace(/_/g, '/');
}
// Create canonical base64 data from whatever Python's urlsafe_b64encode() produced.
var text_b46data = urlsafeb64_to_b64(text_urlsafe_b64data);

Method 1

Starting from there, the ugliest way to obtain the original text is what is designated “Method 1″ in the source of the test document:

var text_original_1 = decodeURIComponent(escape(window.atob(text_b46data)));

This is a hacky all-in-one solution. It uses the deprecated escape() function and implicitly performs the UTF-8 decoding. Who the hell can explain why this really works? Monsur can:
monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. However, this really is a black magic approach with ugly semantics, and the tools involved never were designed for this purpose. There is no specification that guarantees proper behavior. I recommend to not use this method, especially for its really bad semantics and the use of a now deprecated function. However, if you love to confuse your peers with cryptic one-liners, then this is the way to go.

Method 2

This article states that there is “a better, more faithful and less expensive solution” involving native binary data types. In my opinion, this distinct two-step process is easy to understand and has quite clear semantics. So, my favorite decoding scheme is what is designated “Method 2″ in the source of the test document:

// Step 1: decode the base64-encoded data into a binary blob (a Uint8Array).
var binary_utf8data_1 = base64DecToArr(text_b46data);
// Step 2: decode the binary data into a DOMString. Use a custom UTF-8 decoder.
var text_original_2 = UTF8ArrToStr(binary_utf8data_1);

The functions base64DecToArr() and UTF8ArrToStr() are lightweight custom implementations, taken from the Mozilla Knowledge Base. They should work in old as well as modern browsers, and should have a decent performance. The custom functions are not really lengthy and can be shipped with your application. Just look at the source of test.html.

Method 3

The custom UTF8ArrToStr() function used in method 2 can at some point be replaced by a TextDecoder()-based method, which is part of the so-called encoding standard. This standard is a WHATWG living standard, and still labeled to be an experimental feature. Nevertheless, it is already available in modern Firefox and Chrome versions, and there also is a promising polyfill project on GitHub. Prior to using TextDecoder(), the base64-encoded data (a DOMString) must still be decoded into binary data, so the first part is the same as in method 2:

var binary_utf8data_1 = base64DecToArr(text_b46data);
var text_original_3 = new TextDecoder("utf-8").decode(binary_utf8data_1);

Method 4

The fourth method I am showing here uses an alternative approach for base64DecToArr(), i.e. for decoding the base64-encoded data (DOMString) into binary data (Uint8Array). It is shorter and easier to understand than base64DecToArr(), but presumably also of lower performance. Let’s look at base64_to_uint8array() (based on this answer on StackOverflow):

function base64_to_uint8array(s) {
  var byteChars = atob(s);
  var l = byteChars.length;
  var byteNumbers = new Array(l);
  for (var i = 0; i < l; i++) {
    byteNumbers[i] = byteChars.charCodeAt(i);
  }
  return new Uint8Array(byteNumbers);
}

Let’s combine it with the already introduced UTF8ArrToStr() (see method 2):

var binary_utf8data_2 = base64_to_uint8array(text_b46data)
var text_original_4 = UTF8ArrToStr(binary_utf8data_2);

Final words

By carefully looking at the rendered test document, one can infer that all four methods work for the test data used here. In my application scenario I am currently using method 4, since the conversion I am doing there is not performance-critical (in which case I would use method 2). A disadvantage of method 2 would be the usage of the atob() function, which is not available in IE 8 and 9. If this was a core component of an application, I’d probably start using the TextDecoder()-based method with a polyfill for older browsers. The disadvantage here is that the polyfill itself is quite a heavy dependency.

I hope these examples are of use, let me know what you think.

In-memory SQLite database and Flask: a threading trap

In a Flask development environment people love to use SQLAlchemy with Python’s built-in sqlite backend. When configured with

app.config['SQLALCHEMY_DATABASE_URI'] = "sqlite://"

the database, created via db = SQLAlchemy(app) is stored in memory instead of being persisted to disk. This is a nice feature for development and testing.

With some model classes being defined, at some point the tables should actually be created within the database, which is what the db.create_all() call is usually used for. But where should this be invoked and what difference does it make? Let’s look at two possible places:

  • In the app’s bootstrap code, before app.run() is called (such as in the __init__.py file of your application package.
  • After the development server has been started via app.run(), in a route handler.

What difference does it make? A big one, in certain cases: the call happens in different threads (you can easily convince yourself of this by calling threading.current_thread() in the two places, the result is different).

You might think this should be an implementation detail of how Flask’s development server works under the hood. I agree, and usually this detail is not important to be aware of. However, in case of an in-memory SQLite database this implementation detail makes a significant difference: the two threads see independent databases. And the mean thing is: the same db object represents different databases, depending on the thread from which it is being used.

Example: say you call db.create_all() and pre-populate the database in the main thread, before invoking app.run(). When you then access the database via the same db object (equivalent id()) from within a route handler (which does not run in the main thread), the interaction takes place with a different database, which of course is still empty, the tables are not created. A simple read/query would yield unexpected results.

Again, in other words: although you push around and interact the same db object in your modules, you might still access different databases, depending on from which thread the interaction takes place.

This is not well-documented and leads to nasty errors. I guess most often people run into OperationalError: (OperationalError) no such table:, although they think they have already created the corresponding table.

There are two reasonable solutions:

  • Bootstrap the database from within a route handler. All route handlers run within the same thread, so all route handlers interact with the same database.
  • Do not use the in-memory feature. Then db represents the same physical database across threads.

People have been bitten by this, as can be seen in these StackOverflow threads:

This has been observed with Python 2.7.5 and Flask 0.10.1.