Category Archives: Python

How to raise UnicodeDecodeError in Python 3

For debugging work, I needed to manually raise UnicodeDecodeError in CPython 3(.4). Its constructor requires 5 arguments:

>>> raise UnicodeDecodeError()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: function takes exactly 5 arguments (0 given)

The docs specify which properties an already instantiated UnicodeError (the base class) has: https://docs.python.org/3/library/exceptions.html#UnicodeError

These are five properties, and it is plausible that the constructor needs exactly these five pieces of information. However, passing them requires knowing the expected order, since the constructor does not accept keyword arguments. The signature is also not documented by help(UnicodeDecodeError), which suggests that this interface is implemented in C (which makes sense, as it sits in the bowels of CPython’s text processing).

So, the only true reference for finding out the expected order of arguments is the C code implementing the constructor: the function UnicodeDecodeError_init in the file Objects/exceptions.c of the CPython repository. The essential lines are these:

if (!PyArg_ParseTuple(args, "O!OnnO!",
     &PyUnicode_Type, &ude->encoding,
     &ude->object,
     &ude->start,
     &ude->end,
     &PyUnicode_Type, &ude->reason)) {
         ude->encoding = ude->object = ude->reason = NULL;
         return -1;
}

That is, the order is the following:

  1. encoding (unicode object, i.e. type str)
  2. object that was attempted to be decoded (a bytes object makes sense here, although strictly the only requirement is that this object provides the buffer interface)
  3. start (integer)
  4. end (integer)
  5. reason (type str)

Hence, now we know how to artificially raise a UnicodeDecodeError:

>>> o = b'\x00\x00'
>>> raise UnicodeDecodeError('funnycodec', o, 1, 2, 'This is just a fake reason!')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'funnycodec' codec can't decode byte 0x00 in position 1: This is just a fake reason!
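
The constructor arguments end up in the documented properties of the instance again, which can be quickly verified by catching and inspecting such an artificially raised exception:

>>> try:
...     raise UnicodeDecodeError('funnycodec', b'\x00\x00', 1, 2, 'fake reason')
... except UnicodeDecodeError as e:
...     print(e.encoding, e.object, e.start, e.end, e.reason)
...
funnycodec b'\x00\x00' 1 2 fake reason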

gipc 0.6.0 released

I have just released gipc 0.6.0 introducing support for Python 3. This release has been motivated by gevent 1.1 being just around the corner, which will also introduce Python 3 support.

Changes under the hood

The new gipc version has been verified to work on CPython 2.6.9, 2.7.10, 3.3.6, and 3.4.3. Python 3.4 support required significant changes under the hood: internally, gipc uses multiprocessing.Process, whose implementation changed drastically from Python 3.3 to 3.4. Most notably, on Windows, the arguments to the hard-coded CreateProcess() system call were changed, preventing automatic inheritance of file descriptors. Hence, a new approach to transferring file descriptors between processes had to be implemented to support Python 3.4 as well as future Python versions. Reassuringly, the unit test suite required close-to-zero adjustments.

The docs got a fresh look

I have used this opportunity to amend the documentation: it now has a fresh look based on a brutally modified RTD theme. This provides a much better reading experience as well as great support for narrow screens (i.e. mobile support). I hope you like it: https://gehrcke.de/gipc

Who’s using gipc?

I have searched the web a bit for interesting use cases, and found several great projects that use gipc.

Are you successfully applying gipc in production? That is always great to hear, so please drop me a line!

Availability

As usual, the release is available via PyPI (https://pypi.python.org/pypi/gipc). Please visit https://gehrcke.de/gipc for API documentation, code examples, installation notes, and further in-depth information.
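
Since the package lives on PyPI, installation should boil down to a single pip invocation (shown here for the current Python environment):

$ pip install gipc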

Git: list authors sorted by the time of their first contribution

I have created a micro Python script, git-authors, which parses the output of git log and outputs the authors in order of their first contribution to the repository.

By itself, this is a rather boring task. However, I think the resulting code is quite interesting, because it applies a couple of really important concepts and Python idioms within just a couple of lines. Hence, this small piece of code is not only useful in practice; it also serves an educational purpose.

The latter is what this article focuses on. What can you expect to learn? Less than ten lines of code are discussed. At the center of this discussion are efficient stream processing and the proper use of data structures: with a very simple and yet efficient architecture, git-authors can analyze more than 500.000 commits of the Linux git repository with near-zero memory requirements, consuming only about 1 s of CPU time on a weak machine.

Usage: pipe formatted data from git log to git-authors

The recommended usage of git-authors is:

$ git log --encoding=utf-8 --full-history --reverse "--format=format:%at;%an;%ae" | \
    git-authors > authors-by-first-contrib.txt

A little background information might help to understand this. git-authors expects to be fed a data stream on standard input (stdin), composed of newline-separated chunks. Each chunk (line) is expected to represent one commit, and to follow a certain format:

timestamp;authorname;authoremail

Furthermore, git-authors expects to retrieve these commits sorted by time, in ascending order (newest last).

git log can be configured to output the required format (via --format=format:%at;%an;%ae) and to output commits sorted by time (default) with the earliest commits first (using --reverse).

git log writes its output data to standard output (stdout). The canonical method for connecting stdout of one process to stdin of another process is a pipe, a transport mechanism provided by the operating system.

The command line options --encoding=utf-8 and --full-history should not be discussed here, for simplicity.

The input evaluation loop

Remember, the git-authors program expects to retrieve a data stream via stdin. An example snippet of such a data stream could look like this:

[...]
1113690343;Bert Wesarg;wesarg@informatik.uni-halle.de
1113690343;Ken Chen;kenneth.w.chen@intel.com
1113690344;Christoph Hellwig;hch@lst.de
1113690345;Bernard Blackham;bernard@blackham.com.au
1113690346;Jan Kara;jack@suse.cz
[...]

The core of git-authors is a loop construct built of seven lines of code. It processes the named input stream in a line-by-line fashion. Let’s have a look at it:

  1. seen = set()
  2. for line in stdin:
  3.     timestamp, name, mail = line.strip().split(";")
  4.     if name not in seen:
  5.         seen.add(name)
  6.         day = time.strftime("%Y-%m-%d", time.gmtime(float(timestamp)))
  7.         stdout.write("%04d (%s): %s (%s)\n" % (len(seen), day, name, mail))

There are a couple of remarks to be made about this code:

  • This code processes the stream retrieved at standard input in a line-by-line fashion: in line 2, the script makes use of the fact that Python streams (implemented via IOBase) support the iterator protocol, meaning that they can be iterated over, with a single line being yielded from the stream upon each iteration (until the resource has been entirely consumed).

  • The data flow in the loop is constructed such that the majority of the payload data (the line’s content) is processed right away and not stored for later usage. This is a crucial concept, ensuring a small memory footprint and, even more importantly, a memory footprint that is (almost) independent of the size of the input data (which also means that the memory consumption becomes largely time-independent). The minimum amount of data that this program needs to keep track of across loop iterations is the collection of unique author names already seen (repetitions are to be discarded). And that is exactly what is stored in the set called seen. How large might this set become? An estimate: how many unique author names will the largest git project ever accumulate? A couple of thousand, maybe? As of summer 2015, the Linux git repository counts more than half a million commits and about 13.000 unique author names, and Linux should have one of the largest, if not the largest, git histories. It can safely be assumed that O(10^5) short strings is the maximum amount of data this program will ever need to store in memory. How much memory is required for storing that many short strings in Python? You might want to measure it, but it is almost nothing, at least compared to how much memory is built into smartphones. When analyzing the Linux repository, the memory footprint of git-authors stayed well below 10 MB.

  • Line 3 demonstrates how powerful Python’s string methods are, especially when cascaded. It also shows how useful multiple assignment via sequence unpacking can be.

  • “Use the right data structure(s) for any given problem!” is an often-preached rule, ensuring that the time complexity of the applied algorithm is not larger than the problem requires. Lines 1, 4, and 5 are a great example of this, I think. Here, we want to keep track of the authors that have already been observed in the stream (“seen”). This kind of problem naturally requires a data structure allowing for lookup (Have I seen you yet?) and insert (I’ve seen you!) operations. Generally, a hash table-like data structure fits these problems best, because it provides O(1) (constant) complexity for both lookup and insertion. In Python, the dictionary implementation as well as the set implementation are based on a hash table (in fact, these implementations share a lot of code). Both dicts and sets also support len() with constant complexity, which I have made use of in line 7. Hence, the run time of this algorithm is proportional to the input size (the number of lines in the input). It is impossible to scale better than that (every line needs to be looked at), but there are many sub-optimal ways to implement a solution that scales worse than linearly.

  • The reason why I chose to use a set instead of a dictionary is rather subtle: I think the add() semantics of set fit the given problem really well, and here we really just want to keep track of keys (the typical key-value association of the dictionary is not required here). Performance-wise, the choice shouldn’t make a significant difference.

  • Line 6 demonstrates the power of the time module. Take a Unix timestamp and generate a human-readable time string from it: easy. In my experience, strftime() is one of the most-used methods in the time module, and learning its format specifiers by heart can be really handy (see the small demonstration right after this list).

  • Line 7: old-school powerful Python string formatting with a very compact syntax. Yes, the “new” way for string formatting (PEP 3101) has been around for years, and deprecation of the old style was once planned. Truth is, however, that the old-style formatting is just too beloved and established, and will probably never even become deprecated, let alone removed. Its functionality was just extended in Python 3.5 via PEP 461.
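
To briefly demonstrate the strftime()/gmtime() combination from line 6, here is an interactive session using the first timestamp from the example stream shown above:

>>> import time
>>> time.strftime("%Y-%m-%d", time.gmtime(float("1113690343")))
'2005-04-16'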

Preparation of input and output streams

What is not shown in the snippet above is the preparation of the stdin and stdout objects. I have come up with the following method:

kwargs = {"errors": "replace", "encoding": "utf-8", "newline": "\n"}
stdin = io.open(sys.stdin.fileno(), **kwargs)
stdout = io.open(sys.stdout.fileno(), mode="w", **kwargs)

This is an extremely powerful recipe for obtaining the same behavior on Python 2 as well as on 3, but also on Windows as well as on POSIX-compliant platforms. There is a long story behind this which should not be the focus of this very article. In essence, Python 2 and Python 3 treat sys.stdin/sys.stdout very differently. Grabbing the underlying file descriptors by their balls via fileno() and creating TextIOWrapper stream objects on top of them is a powerful way to disable much of Python’s automagic and therefore to normalize behavior among platforms. The automagic I am referring to here especially includes Python 3’s platform-dependent automatic input decoding and output encoding, and universal newline support. Both really can add an annoying amount of complexity in certain situations, and this here is one such case.
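
For completeness, here is a minimal, self-contained sketch of git-authors that simply assembles the stream preparation and the evaluation loop discussed above (the real script may differ in details such as argument handling):

#!/usr/bin/env python
import io
import sys
import time

# Normalize stdin/stdout behavior across Python 2/3 and across platforms.
kwargs = {"errors": "replace", "encoding": "utf-8", "newline": "\n"}
stdin = io.open(sys.stdin.fileno(), **kwargs)
stdout = io.open(sys.stdout.fileno(), mode="w", **kwargs)

# Stream-process the input: remember each author name upon first occurrence.
seen = set()
for line in stdin:
    timestamp, name, mail = line.strip().split(";")
    if name not in seen:
        seen.add(name)
        day = time.strftime("%Y-%m-%d", time.gmtime(float(timestamp)))
        stdout.write("%04d (%s): %s (%s)\n" % (len(seen), day, name, mail))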

Example run on CPython’s (unofficial) git repository

I applied git-authors to the current state of the unofficial CPython repository hosted at GitHub. As a side note, this required about 0.1 s of CPU time on my test machine. I am showing the output in full length below, because I find its content rather interesting. We have to appreciate that the commit history is not entirely broken, despite CPython having switched between different version control systems over the last 25 years. Did you know that Just van Rossum also was a committer? :-)

0001 (1990-08-09): Guido van Rossum (guido@python.org)
0002 (1992-08-04): Sjoerd Mullender (sjoerd@acm.org)
0003 (1992-08-13): Jack Jansen (jack.jansen@cwi.nl)
0004 (1993-01-10): cvs2svn (tools@python.org)
0005 (1994-07-25): Barry Warsaw (barry@python.org)
0006 (1996-07-23): Fred Drake (fdrake@acm.org)
0007 (1996-12-09): Roger E. Masse (rmasse@newcnri.cnri.reston.va.us)
0008 (1997-08-13): Jeremy Hylton (jeremy@alum.mit.edu)
0009 (1998-03-03): Ken Manheimer (klm@digicool.com)
0010 (1998-04-09): Andrew M. Kuchling (amk@amk.ca)
0011 (1998-12-18): Greg Ward (gward@python.net)
0012 (1999-01-22): Just van Rossum (just@lettererror.com)
0013 (1999-11-07): Greg Stein (gstein@lyra.org)
0014 (2000-05-12): Gregory P. Smith (greg@mad-scientist.com)
0015 (2000-06-06): Trent Mick (trentm@activestate.com)
0016 (2000-06-07): Marc-André Lemburg (mal@egenix.com)
0017 (2000-06-09): Mark Hammond (mhammond@skippinet.com.au)
0018 (2000-06-29): Fredrik Lundh (fredrik@pythonware.com)
0019 (2000-06-30): Skip Montanaro (skip@pobox.com)
0020 (2000-06-30): Tim Peters (tim.peters@gmail.com)
0021 (2000-07-01): Paul Prescod (prescod@prescod.net)
0022 (2000-07-10): Vladimir Marangozov (vladimir.marangozov@t-online.de)
0023 (2000-07-10): Peter Schneider-Kamp (nowonder@nowonder.de)
0024 (2000-07-10): Eric S. Raymond (esr@thyrsus.com)
0025 (2000-07-14): Thomas Wouters (thomas@python.org)
0026 (2000-07-29): Moshe Zadka (moshez@math.huji.ac.il)
0027 (2000-08-15): David Scherer (dscherer@cmu.edu)
0028 (2000-09-07): Thomas Heller (theller@ctypes.org)
0029 (2000-09-08): Martin v. Löwis (martin@v.loewis.de)
0030 (2000-09-15): Neil Schemenauer (nascheme@enme.ucalgary.ca)
0031 (2000-09-21): Lars Gustäbel (lars@gustaebel.de)
0032 (2000-09-24): Nicholas Riley (nriley@sabi.net)
0033 (2000-10-03): Ka-Ping Yee (ping@zesty.ca)
0034 (2000-10-06): Jim Fulton (jim@zope.com)
0035 (2001-01-10): Charles G. Waldman (cgw@alum.mit.edu)
0036 (2001-03-22): Steve Purcell (steve@pythonconsulting.com)
0037 (2001-06-25): Steven M. Gava (elguavas@python.net)
0038 (2001-07-04): Kurt B. Kaiser (kbk@shore.net)
0039 (2001-07-04): unknown (tools@python.org)
0040 (2001-07-20): Piers Lauder (piers@cs.su.oz.au)
0041 (2001-08-23): Finn Bock (bckfnn@worldonline.dk)
0042 (2001-08-27): Michael W. Hudson (mwh@python.net)
0043 (2001-10-31): Chui Tey (chui.tey@advdata.com.au)
0044 (2001-12-19): Neal Norwitz (nnorwitz@gmail.com)
0045 (2001-12-21): Anthony Baxter (anthonybaxter@gmail.com)
0046 (2002-02-17): Andrew MacIntyre (andymac@bullseye.apana.org.au)
0047 (2002-03-21): Walter Dörwald (walter@livinglogic.de)
0048 (2002-05-12): Raymond Hettinger (python@rcn.com)
0049 (2002-05-15): Jason Tishler (jason@tishler.net)
0050 (2002-05-28): Christian Tismer (tismer@stackless.com)
0051 (2002-06-14): Steve Holden (steve@holdenweb.com)
0052 (2002-09-23): Tony Lownds (tony@lownds.com)
0053 (2002-11-05): Gustavo Niemeyer (gustavo@niemeyer.net)
0054 (2003-01-03): David Goodger (goodger@python.org)
0055 (2003-04-19): Brett Cannon (bcannon@gmail.com)
0056 (2003-04-22): Alex Martelli (aleaxit@gmail.com)
0057 (2003-05-17): Samuele Pedroni (pedronis@openend.se)
0058 (2003-06-09): Andrew McNamara (andrewm@object-craft.com.au)
0059 (2003-10-24): Armin Rigo (arigo@tunes.org)
0060 (2003-12-10): Hye-Shik Chang (hyeshik@gmail.com)
0061 (2004-02-18): David Ascher (david.ascher@gmail.com)
0062 (2004-02-20): Vinay Sajip (vinay_sajip@yahoo.co.uk)
0063 (2004-03-21): Nicholas Bastin (nick.bastin@gmail.com)
0064 (2004-03-25): Phillip J. Eby (pje@telecommunity.com)
0065 (2004-08-04): Matthias Klose (doko@ubuntu.com)
0066 (2004-08-09): Edward Loper (edloper@gradient.cis.upenn.edu)
0067 (2004-08-09): Dave Cole (djc@object-craft.com.au)
0068 (2004-08-14): Johannes Gijsbers (jlg@dds.nl)
0069 (2004-09-17): Sean Reifschneider (jafo@tummy.com)
0070 (2004-10-16): Facundo Batista (facundobatista@gmail.com)
0071 (2004-10-21): Peter Astrand (astrand@lysator.liu.se)
0072 (2005-03-28): Bob Ippolito (bob@redivi.com)
0073 (2005-06-03): Georg Brandl (georg@python.org)
0074 (2005-11-16): Nick Coghlan (ncoghlan@gmail.com)
0075 (2006-03-30): Ronald Oussoren (ronaldoussoren@mac.com)
0076 (2006-04-17): George Yoshida (dynkin@gmail.com)
0077 (2006-04-23): Gerhard Häring (gh@ghaering.de)
0078 (2006-05-23): Richard Jones (richard@commonground.com.au)
0079 (2006-05-24): Andrew Dalke (dalke@dalkescientific.com)
0080 (2006-05-25): Kristján Valur Jónsson (kristjan@ccpgames.com)
0081 (2006-05-25): Jack Diederich (jackdied@gmail.com)
0082 (2006-05-26): Martin Blais (blais@furius.ca)
0083 (2006-07-28): Matt Fleming (mattjfleming@googlemail.com)
0084 (2006-09-05): Sean Reifscheider (jafo@tummy.com)
0085 (2007-03-08): Collin Winter (collinw@gmail.com)
0086 (2007-03-11): Žiga Seilnacht (ziga.seilnacht@gmail.com)
0087 (2007-06-07): Alexandre Vassalotti (alexandre@peadrop.com)
0088 (2007-08-16): Mark Summerfield (list@qtrac.plus.com)
0089 (2007-08-18): Travis E. Oliphant (oliphant@enthought.com)
0090 (2007-08-22): Jeffrey Yasskin (jyasskin@gmail.com)
0091 (2007-08-25): Eric Smith (eric@trueblade.com)
0092 (2007-08-29): Bill Janssen (janssen@parc.com)
0093 (2007-10-31): Christian Heimes (christian@cheimes.de)
0094 (2007-11-10): Amaury Forgeot d'Arc (amauryfa@gmail.com)
0095 (2008-01-08): Mark Dickinson (dickinsm@gmail.com)
0096 (2008-03-17): Steven Bethard (steven.bethard@gmail.com)
0097 (2008-03-18): Trent Nelson (trent.nelson@snakebite.org)
0098 (2008-03-18): David Wolever (david@wolever.net)
0099 (2008-03-25): Benjamin Peterson (benjamin@python.org)
0100 (2008-03-26): Jerry Seutter (jseutter@gmail.com)
0101 (2008-04-16): Jeroen Ruigrok van der Werven (asmodai@in-nomine.org)
0102 (2008-05-13): Jesus Cea (jcea@jcea.es)
0103 (2008-05-24): Guilherme Polo (ggpolo@gmail.com)
0104 (2008-06-01): Robert Schuppenies (okkotonushi@googlemail.com)
0105 (2008-06-10): Josiah Carlson (josiah.carlson@gmail.com)
0106 (2008-06-10): Armin Ronacher (armin.ronacher@active-4.com)
0107 (2008-06-18): Jesse Noller (jnoller@gmail.com)
0108 (2008-06-23): Senthil Kumaran (orsenthil@gmail.com)
0109 (2008-07-22): Antoine Pitrou (solipsis@pitrou.net)
0110 (2008-08-14): Hirokazu Yamamoto (ocean-city@m2.ccsnet.ne.jp)
0111 (2008-12-24): Tarek Ziadé (ziade.tarek@gmail.com)
0112 (2009-03-30): R. David Murray (rdmurray@bitdance.com)
0113 (2009-04-01): Michael Foord (fuzzyman@voidspace.org.uk)
0114 (2009-04-11): Chris Withers (chris@simplistix.co.uk)
0115 (2009-05-08): Philip Jenvey (pjenvey@underboss.org)
0116 (2009-06-25): Ezio Melotti (ezio.melotti@gmail.com)
0117 (2009-08-02): Frank Wierzbicki (fwierzbicki@gmail.com)
0118 (2009-09-20): Doug Hellmann (doug.hellmann@gmail.com)
0119 (2010-01-30): Victor Stinner (victor.stinner@haypocalc.com)
0120 (2010-02-23): Dirkjan Ochtman (dirkjan@ochtman.nl)
0121 (2010-02-24): Larry Hastings (larry@hastings.org)
0122 (2010-02-26): Florent Xicluna (florent.xicluna@gmail.com)
0123 (2010-03-25): Brian Curtin (brian.curtin@gmail.com)
0124 (2010-04-01): Stefan Krah (stefan@bytereef.org)
0125 (2010-04-10): Jean-Paul Calderone (exarkun@divmod.com)
0126 (2010-04-18): Giampaolo Rodolà (g.rodola@gmail.com)
0127 (2010-05-26): Alexander Belopolsky (alexander.belopolsky@gmail.com)
0128 (2010-08-06): Tim Golden (mail@timgolden.me.uk)
0129 (2010-08-14): Éric Araujo (merwok@netwok.org)
0130 (2010-08-22): Daniel Stutzbach (daniel@stutzbachenterprises.com)
0131 (2010-09-18): Brian Quinlan (brian@sweetapp.com)
0132 (2010-11-05): David Malcolm (dmalcolm@redhat.com)
0133 (2010-11-09): Ask Solem (askh@opera.com)
0134 (2010-11-10): Terry Reedy (tjreedy@udel.edu)
0135 (2010-11-10): Łukasz Langa (lukasz@langa.pl)
0136 (2012-06-24): Ned Deily (nad@acm.org)
0137 (2011-01-14): Eli Bendersky (eliben@gmail.com)
0138 (2011-03-10): Eric V. Smith (eric@trueblade.com)
0139 (2011-03-10): R David Murray (rdmurray@bitdance.com)
0140 (2011-03-12): orsenthil (orsenthil@gmail.com)
0141 (2011-03-14): Ross Lagerwall (rosslagerwall@gmail.com)
0142 (2011-03-14): Reid Kleckner (reid@kleckner.net)
0143 (2011-03-14): briancurtin (brian.curtin@gmail.com)
0144 (2011-03-24): guido (guido@google.com)
0145 (2011-03-30): Kristjan Valur Jonsson (sweskman@gmail.com)
0146 (2011-04-04): brian.curtin (brian@python.org)
0147 (2011-04-12): Nadeem Vawda (nadeem.vawda@gmail.com)
0148 (2011-04-19): Giampaolo Rodola' (g.rodola@gmail.com)
0149 (2011-05-04): Alexis Metaireau (alexis@notmyidea.org)
0150 (2011-05-09): Gerhard Haering (gh@ghaering.de)
0151 (2011-05-09): Petri Lehtinen (petri@digip.org)
0152 (2011-05-24): Charles-François Natali (neologix@free.fr)
0153 (2011-07-17): Alex Gaynor (alex.gaynor@gmail.com)
0154 (2011-07-27): Jason R. Coombs (jaraco@jaraco.com)
0155 (2011-08-02): Sandro Tosi (sandro.tosi@gmail.com)
0156 (2011-09-28): Meador Inge (meadori@gmail.com)
0157 (2012-01-09): Terry Jan Reedy (tjreedy@udel.edu)
0158 (2011-05-19): Tarek Ziade (tarek@ziade.org)
0159 (2011-05-22): Martin v. Loewis (martin@v.loewis.de)
0160 (2011-05-31): Ralf Schmitt (ralf@systemexit.de)
0161 (2011-09-12): Jeremy Kloth (jeremy.kloth@gmail.com)
0162 (2012-03-14): Andrew Svetlov (andrew.svetlov@gmail.com)
0163 (2012-03-21): krisvale (sweskman@gmail.com)
0164 (2012-04-24): Marc-Andre Lemburg (mal@egenix.com)
0165 (2012-04-30): Richard Oudkerk (shibturn@gmail.com)
0166 (2012-05-15): Hynek Schlawack (hs@ox.cx)
0167 (2012-06-20): doko (doko@ubuntu.com)
0168 (2012-07-16): Atsuo Ishimoto (ishimoto@gembook.org)
0169 (2012-09-02): Zbigniew Jędrzejewski-Szmek (zbyszek@in.waw.pl)
0170 (2012-09-06): Eric Snow (ericsnowcurrently@gmail.com)
0171 (2012-09-25): Chris Jerdonek (chris.jerdonek@gmail.com)
0172 (2012-12-27): Serhiy Storchaka (storchaka@gmail.com)
0173 (2013-03-31): Roger Serwy (roger.serwy@gmail.com)
0174 (2013-03-31): Charles-Francois Natali (cf.natali@gmail.com)
0175 (2013-05-10): Andrew Kuchling (amk@amk.ca)
0176 (2013-06-14): Ethan Furman (ethan@stoneleaf.us)
0177 (2013-08-12): Felix Crux (felixc@felixcrux.com)
0178 (2013-10-21): Peter Moody (python@hda3.com)
0179 (2013-10-25): bquinlan (brian@sweetapp.com)
0180 (2013-11-04): Zachary Ware (zachary.ware@gmail.com)
0181 (2013-12-02): Walter Doerwald (walter@livinglogic.de)
0182 (2013-12-21): Donald Stufft (donald@stufft.io)
0183 (2014-01-03): Daniel Holth (dholth@fastmail.fm)
0184 (2014-01-27): Yury Selivanov (yselivanov@sprymix.com)
0185 (2014-04-15): Kushal Das (kushaldas@gmail.com)
0186 (2014-06-29): Berker Peksag (berker.peksag@gmail.com)
0187 (2014-07-16): Tal Einat (taleinat@gmail.com)
0188 (2014-10-08): Steve Dower (steve.dower@microsoft.com)
0189 (2014-10-18): Robert Collins (rbtcollins@hp.com)
0190 (2015-03-22): Paul Moore (p.f.moore@gmail.com)

Resources

Similar functionality is provided by the more full-blown frameworks grunt-git-authors and gitstats.


JavaScript in a browser: create a binary blob from a base64-encoded string, and decode it using UTF-8.

For an AngularJS-based web application I am currently working on, I want to put arbitrary text information into a URL, send it to the client, and have it decoded by the browser. With the right tools this should be a trivial process, shouldn’t it?

The basic idea is:

  1. Start with the original text, which is a sequence of unicode code points.
  2. Encode the original text into binary data; with the UTF-8 codec.
  3. base64-encode the resulting binary data (and replace the URL-unsafe characters / and + with, for instance, - and _).
  4. The result is a URL-safe string. Send it. Let’s assume it ends up in a browser (in window.location, for instance).
  5. Invert the entire procedure in the browser:
    1. The data arrives as DOMString type (unicode text, so to say).
    2. Transform the URL-safe base64 representation to canonical base64 (replace the - and _ characters).
    3. Decode the base64 string into a real binary data type (Uint8Array, for instance).
    4. Decode the binary blob into a DOMString containing the original text, using the UTF-8 codec.

Unfortunately, so far there are no official, established ways to perform steps 5.3 and 5.4 in a browser environment. There is no obvious way to obtain a binary blob from base64-encoded data. Further below, I’ll show three different methods for executing this step. Likewise, I realized that there is still no established way to decode a binary blob into a DOMString using a given codec (UTF-8 in this case). I’ll show two different methods for performing this task.

The original text and its URL-safe representation

I’ll start with a Python snippet defining the original text and creating its URL-safe base64 representation:

# -*- coding: utf-8 -*-
 
from __future__ import unicode_literals
from base64 import urlsafe_b64encode
 
text = """
«küßЌύБЇ”ﻈﻉﻌﻍﻎ㌀㌁㌂❶❷❸⍝⍞⍟⍠ோௌ«ταБЬℓσ»:
n20%of٩(-̮̮̃-̃)۶٩(●̮̮̃•̃)۶٩(͡๏̯͡๏)۶٩(-̮̮̃•̃)!॒॑॓ஙசປຜἑἔℇ∆∇▆▇█
✆✇✈✉✌✍うぇえぉおㆅㆇ㉫㉬㍍㍎㍏沈拾若掠略亮兩凉梁糧
ﯗﯘﯙﯚﯝﯠﯡﯢカキクケコサシス'kosme':"κόσμε"/?#+퟿
"""
 
data = text.encode("utf-8")
datab64 = urlsafe_b64encode(data)
 
print("text length: %s" % len(text))
print("urlsafe base64 representation of the binary data:\n\n%s" % datab64)

The original text (the variable named text) is meant to contain code points from many different Unicode character blocks.

As you can see, the text is first encoded using the UTF-8 codec. The resulting binary data then is put into urlsafe_b64encode(), yielding a URL-safe byte sequence. So, execution of the named Python script yields the following URL-safe representation of the original text:

CsKra8O8w5_DrsK74oCc0IzPjdCR0IfigJ3vu4jvu4nvu4zvu43vu47jjIDjjIHjjILinbbinbfinbjijZ3ijZ7ijZ_ijaDgr4vgr4zCq8-EzrHQkdCs4oSTz4PCuzoKbjIwJW9m2akoLcyuzK7Mgy3MgynbttmpKOKXj8yuzK7Mg-KAosyDKdu22akozaHguY_Mr82h4LmPKdu22akoLcyuzK7Mg-KAosyDKSHgpZHgpZLgpZPgrpngrprgupvgupzhvJHhvJTihIfiiIbiiIfilobilofilogK4pyG4pyH4pyI4pyJ4pyM4pyN44GG44GH44GI44GJ44GK44aF44aH44mr44ms442N442O442P76Wy76Wz76W076W176W276W376W476W576W676W7Cu-vl--vmO-vme-vmu-vne-voO-voe-vou-9tu-9t--9uO-9ue-9uu-9u--9vO-9vSdrb3NtZSc6Is664b25z4POvM61Ii8_Iyvtn78K

The Python script also tells you that the text is 167 characters long, which is a useful reference for later comparison.
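
Before heading into the browser, the inverse transformation (steps 5.2 to 5.4, so to say) can be sanity-checked within Python itself. A small sketch, to be appended to the script above; urlsafe_b64decode() is the natural counterpart of the encoding function used before:

from base64 import urlsafe_b64decode

recovered = urlsafe_b64decode(datab64).decode("utf-8")
assert recovered == text
print("round trip fine, recovered %s characters" % len(recovered))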

Decoding the URL-safe representation in the browser

Here is the test document that implements four different methods for obtaining the original text back from the URL-safe representation (it just doesn’t show anything yet!): http://gehrcke.de/files/perm/blog/2015/jsdecode/test.html

Remember, the text we want to transport is encoded in a URL-safe way, so just for fun I want to make use of this fact in this small demonstration here, and communicate the information via the URL. To that end, the test document executes JavaScript code that extracts a string from the anchor/hash part of the URL:

// Get URL-safe text representation from the address bar.
// This is a DOMString type, as indicated by the text_ variable name prefix.
var text_urlsafe_b64data = window.location.hash.substring(1);

Okay, let’s put the test data into the URL (the output from the Python script above):

http://gehrcke.de/files/perm/blog/2015/jsdecode/test.html#CsKra8O8w5_DrsK74oCc0IzPjdCR0IfigJ3vu4jvu4nvu4zvu43vu47jjIDjjIHjjILinbbinbfinbjijZ3ijZ7ijZ_ijaDgr4vgr4zCq8-EzrHQkdCs4oSTz4PCuzoKbjIwJW9m2akoLcyuzK7Mgy3MgynbttmpKOKXj8yuzK7Mg-KAosyDKdu22akozaHguY_Mr82h4LmPKdu22akoLcyuzK7Mg-KAosyDKSHgpZHgpZLgpZPgrpngrprgupvgupzhvJHhvJTihIfiiIbiiIfilobilofilogK4pyG4pyH4pyI4pyJ4pyM4pyN44GG44GH44GI44GJ44GK44aF44aH44mr44ms442N442O442P76Wy76Wz76W076W176W276W376W476W576W676W7Cu-vl--vmO-vme-vmu-vne-voO-voe-vou-9tu-9t--9uO-9ue-9uu-9u--9vO-9vSdrb3NtZSc6Is664b25z4POvM61Ii8_Iyvtn78K

This is a long URL now, indeed. But it is not too long ;-).

When you access the URL above, an HTML document should show up. It should have four panels, and each panel should show the exact same text as initially defined in the Python script above, including all newlines. Each of the individual panels is the result of a different decoding implementation. I recommend looking at the source code of this HTML document; I have commented the different methods sufficiently. I will now quickly go through the different methods.

The first step common to all four methods is to convert from URL-safe base64 encoding to canonical base64 encoding:

function urlsafeb64_to_b64(s) {
  // Replace - with + and _ with /
  return s.replace(/-/g, '+').replace(/_/g, '/');
}
// Create canonical base64 data from whatever Python's urlsafe_b64encode() produced.
var text_b46data = urlsafeb64_to_b64(text_urlsafe_b64data);

Method 1

Starting from there, the ugliest way to obtain the original text is what is designated “Method 1” in the source of the test document:

var text_original_1 = decodeURIComponent(escape(window.atob(text_b46data)));

This is a hacky all-in-one solution. It uses the deprecated escape() function and implicitly performs the UTF-8 decoding. Who the hell can explain why this really works? Monsur can: monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. However, this really is a black magic approach with ugly semantics, and the tools involved were never designed for this purpose. There is no specification that guarantees proper behavior. I recommend not using this method, especially because of its really bad semantics and its use of a now-deprecated function. However, if you love to confuse your peers with cryptic one-liners, then this is the way to go.

Method 2

The article from which the Mozilla helper functions below are taken states that there is “a better, more faithful and less expensive solution” involving native binary data types. In my opinion, this distinct two-step process is easy to understand and has quite clear semantics. So, my favorite decoding scheme is what is designated “Method 2” in the source of the test document:

// Step 1: decode the base64-encoded data into a binary blob (a Uint8Array).
var binary_utf8data_1 = base64DecToArr(text_b46data);
// Step 2: decode the binary data into a DOMString. Use a custom UTF-8 decoder.
var text_original_2 = UTF8ArrToStr(binary_utf8data_1);

The functions base64DecToArr() and UTF8ArrToStr() are lightweight custom implementations, taken from the Mozilla Knowledge Base. They should work in old as well as modern browsers, and should have a decent performance. The custom functions are not really lengthy and can be shipped with your application. Just look at the source of test.html.

Method 3

The custom UTF8ArrToStr() function used in method 2 can at some point be replaced by a TextDecoder()-based method, which is part of the so-called Encoding Standard. This is a WHATWG living standard, and still labeled experimental. Nevertheless, it is already available in modern Firefox and Chrome versions, and there also is a promising polyfill project on GitHub. Prior to using TextDecoder(), the base64-encoded data (a DOMString) must still be decoded into binary data, so the first part is the same as in method 2:

var binary_utf8data_1 = base64DecToArr(text_b46data);
var text_original_3 = new TextDecoder("utf-8").decode(binary_utf8data_1);

Method 4

The fourth method I am showing here uses an alternative to base64DecToArr(), i.e. another way of decoding the base64-encoded data (a DOMString) into binary data (a Uint8Array). It is shorter and easier to understand than base64DecToArr(), but presumably also of lower performance. Let’s look at base64_to_uint8array() (based on this answer on StackOverflow):

function base64_to_uint8array(s) {
  var byteChars = atob(s);
  var l = byteChars.length;
  var byteNumbers = new Array(l);
  for (var i = 0; i < l; i++) {
    byteNumbers[i] = byteChars.charCodeAt(i);
  }
  return new Uint8Array(byteNumbers);
}

Let’s combine it with the already introduced UTF8ArrToStr() (see method 2):

var binary_utf8data_2 = base64_to_uint8array(text_b46data);
var text_original_4 = UTF8ArrToStr(binary_utf8data_2);

Final words

By carefully looking at the rendered test document, one can infer that all four methods work for the test data used here. In my application scenario I am currently using method 4, since the conversion I am doing there is not performance-critical (otherwise I would use method 2). A disadvantage of method 4 is its use of the atob() function, which is not available in IE 8 and 9. If this were a core component of an application, I’d probably start using the TextDecoder()-based method with a polyfill for older browsers. The disadvantage there is that the polyfill itself is quite a heavy dependency.

I hope these examples are of use, let me know what you think.

In-memory SQLite database and Flask: a threading trap

In a Flask development environment, people love to use SQLAlchemy with Python’s built-in SQLite backend. When configured with

app.config['SQLALCHEMY_DATABASE_URI'] = "sqlite://"

the database, created via db = SQLAlchemy(app), is stored in memory instead of being persisted to disk. This is a nice feature for development and testing.

Once some model classes are defined, at some point the tables should actually be created within the database, which is what the db.create_all() call is usually used for. But where should it be invoked, and what difference does that make? Let’s look at two possible places:

  • In the app’s bootstrap code, before app.run() is called (such as in the __init__.py file of your application package).
  • After the development server has been started via app.run(), in a route handler.

What difference does it make? A big one, in certain cases: the call happens in different threads (you can easily convince yourself of this by calling threading.current_thread() in the two places; the results differ).

You might think this should be an implementation detail of how Flask’s development server works under the hood. I agree, and usually this detail is not important to be aware of. However, in the case of an in-memory SQLite database this implementation detail makes a significant difference: the two threads see independent databases. And the mean thing is: the same db object represents different databases, depending on the thread from which it is being used.

Example: say you call db.create_all() and pre-populate the database in the main thread, before invoking app.run(). When you then access the database via the same db object (same id()) from within a route handler (which does not run in the main thread), the interaction takes place with a different database, which of course is still empty: the tables have not been created there. A simple read/query would yield unexpected results.

Again, in other words: although you pass around and interact with the same db object in your modules, you might still access different databases, depending on the thread from which the interaction takes place.

This is not well-documented and leads to nasty errors. I guess most often people run into OperationalError: (OperationalError) no such table:, although they think they have already created the corresponding table.
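
The following minimal sketch reproduces this scenario (the trivial User model and the print calls are made up for demonstration; with Flask-SQLAlchemy versions of that era, db.create_all() can be invoked without an explicit application context):

import threading
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = "sqlite://"
db = SQLAlchemy(app)

class User(db.Model):
    id = db.Column(db.Integer, primary_key=True)

# Bootstrap and pre-populate the database; this runs in the main thread.
db.create_all()
db.session.add(User())
db.session.commit()
print("bootstrap: %s" % threading.current_thread())

@app.route("/")
def index():
    # Served from a different thread: with an in-memory SQLite database,
    # this sees a different, still empty database -- expect the infamous
    # "no such table" OperationalError upon the query below.
    print("handler: %s" % threading.current_thread())
    return "user count: %s" % User.query.count()

if __name__ == "__main__":
    app.run()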

There are two reasonable solutions:

  • Bootstrap the database from within a route handler, or a hook running in the serving thread (see the sketch after this list). All route handlers run within the same thread, so all route handlers interact with the same database.
  • Do not use the in-memory feature. Then db represents the same physical database across threads.
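
The first option can, for instance, be implemented via Flask’s before_first_request hook (a sketch, continuing the made-up example from above; the hook runs in the serving thread):

@app.before_first_request
def bootstrap_database():
    # Executed right before the first request is handled, i.e. in the
    # same thread that runs the route handlers: those subsequently see
    # the populated in-memory database.
    db.create_all()
    db.session.add(User())
    db.session.commit()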

People have been bitten by this, as can be seen in various StackOverflow threads.

This has been observed with Python 2.7.5 and Flask 0.10.1.