I have created a micro Python script, git-authors, which parses the output of git log and outputs the authors in order of their first contribution to the repository.
By itself, this is a rather boring task. However, I think the resulting code is quite interesting, because it applies a couple of really important concepts and Python idioms within just a couple of lines. Hence, this small piece of code is not only useful in practice; it also serves an educational purpose.
The latter is what this article focuses on. What can you expect to learn? Less than ten lines of code are discussed. Center of this discussion is efficient stream processing and proper usage of data structures: with a very simple and yet efficient architecture, git-authors
can analyze more than 500.000 commits of the Linux git repository with near-zero memory requirements and consumption of only 1 s of CPU time on a weak machine.
Usage: pipe formatted data from git log
to git-authors
The recommended usage of git-authors
is:
$ git log --encoding=utf-8 --full-history --reverse "--format=format:%at;%an;%ae" | \ git-authors > authors-by-first-contrib.txt
A little background information might help for understanding this. git-authors
expects to be fed with a data stream on standard input (stdin
), composed of newline-separated chunks. Each chunk (line) is expected to represent one commit, and is expected to be of a certain format:
timestamp;authorname;authoremail
Furthermore, git-authors
expects to retrieve these commits sorted by time, ascendingly (newest last).
git log
can be configured to output the required format (via --format=format:%at;%an;%ae
) and to output commits sorted by time (default) with the earliest commits first (using --reverse
).
git log
writes its output data to standard output (stdout
). The canonical method for connecting stdout
of one process to stdin
of another process is a pipe, a transport mechanism provided by the operating system.
The command line options --encoding=utf-8
and --full-history
should not be discussed here, for simplicity.
The input evaluation loop
Remember, the git-authors
program expects to retrieve a data stream via stdin
. An example snippet of such a data stream could look like this:
[...] 1113690343;Bert Wesarg;wesarg@informatik.uni-halle.de 1113690343;Ken Chen;kenneth.w.chen@intel.com 1113690344;Christoph Hellwig;hch@lst.de 1113690345;Bernard Blackham;bernard@blackham.com.au 1113690346;Jan Kara;jack@suse.cz [...]
The core of git-authors
is a loop construct built of seven lines of code. It processes named input stream in a line-by-line fashion. Let’s have a look at it:
seen = set()
for line in stdin:
timestamp, name, mail = line.strip().split(";")
if name not in seen:
seen.add(name)
day = time.strftime("%Y-%m-%d", time.gmtime(float(timestamp)))
stdout.write("%04d (%s): %s (%s)\n" % (len(seen), day, name, mail))
There are a couple of remarks to be made about this code:
- This code processes the stream retrieved at standard input in a line-by-line fashion: in line 2, the script makes use of the fact that Python streams (implemented via IOBase) support the iterator protocol, meaning that they can be iterated over, whereas a single line is yielded from the stream upon each iteration (until the resource has been entirely consumed).
-
The data flow in the loop is constructed in a way that the majority of the payload data (the line’s content) is processed right away and not stored for later usage. This is a crucial concept, ensuring a small memory footprint and, even more important, a memory footprint that does (almost) not depend on the size of the input data (which also means that the memory consumption becomes largely time-independent). The minimum amount of data that this program requires to keep track of across loop iterations is a collection of unique author names already seen (repetitions are to be discarded). And that is exactly what is stored in the set called
seen
. How large might this set become? An estimation: how many unique author names will the largest git project ever accumulate? A couple of thousand maybe? As of summer 2015, the Linux git repository counts more than half a million commits and about 13.000 unique author names. Linux should have one of the largest if not the largest git history. It can safely be assumed that O(10^5) short strings is the maximum amount of data this program will ever need to store in memory. How much memory is required for storing a couple of thousand short strings in Python? You might want to measure this, but it is almost nothing, at least compared to how much memory is built into smartphones. When analyzing the Linux repository, the memory footprint ofgit-authors
stayed well below 10 MB. -
Line 3 demonstrates how powerful Python’s string methods are, especially when cascaded. It also shows how useful multiple assignment via sequence unpacking can be.
-
“Use the right data structure(s) for any given problem!” is an often-preached rule, for ensuring that the time complexity of the applied algorithm is not larger than the problem requires. Lines 1, 4, and 5 are a great example for this, I think. Here, we want to keep track of the authors that have already been observed in the stream (“seen”). This kind of problem naturally requires a data structure allowing for lookup (Have I seen you yet?) and insert (I’ve seen you!) operations. Generally, a hash table-like data structure fits these problems best, because it provides
O(1)
(constant) complexity for both, lookup and insertion. In Python, the dictionary implementation as well as the set implementation are both based on a hash table (in fact, these implementations share a lot of code). Both dicts and sets also provide alen()
method of constant complexity, which I have made use of in line 7. Hence, the run time of this algorithm is proportional to the input size (to the number of lines in the input). It is impossible to scale better than that (every line needs to be looked at), but there are many sub-optimal ways to implement a solution that scales worse than linearly. -
The reason why I chose to use a set instead of a dictionary is rather subtle: I think the
add()
semantics ofset
fit the given problem really well, and here we really just want to keep track of keys (the typical key-value association of the dictionary is not required here). Performance-wise, the choice shouldn’t make a significant difference. -
Line 6 demonstrates the power of the
time
module. Take a Unix timestamp and generate a human-readable time string from it: easy. In my experience,strftime()
is one of the most-used methods in thetime
module, and learning its format specifiers by heart can be really handy. -
Line 7: old-school powerful Python string formatting with a very compact syntax. Yes, the “new” way for string formatting (PEP 3101) has been around for years, and deprecation of the old style was once planned. Truth is, however, that the old-style formatting is just too beloved and established, and will probably never even become deprecated, let alone removed. Its functionality was just extended in Python 3.5 via PEP 461.
Preparation of input and output streams
What is not shown in the snippet above is the preparation of the stdin
and stdout
objects. I have come up with the following method:
kwargs = {"errors": "replace", "encoding": "utf-8", "newline": "\n"} stdin = io.open(sys.stdin.fileno(), **kwargs) stdout = io.open(sys.stdout.fileno(), mode="w", **kwargs)
This is an extremely powerful recipe for obtaining the same behavior on Python 2 as well as on 3, but also on Windows as well as on POSIX-compliant platforms. There is a long story behind this which should not be the focus of this very article. In essence, Python 2 and Python 3 treat sys.stdin
/sys.stdout
very differently. Grabbing the underlying file descriptors by their balls via fileno()
and creating TextIOWrapper stream objects on top of them is a powerful way to disable much of Python’s automagic and therefore to normalize behavior among platforms. The automagic I am referring to here especially includes Python 3’s platform-dependent automatic input decoding and output encoding, and universal newline support. Both really can add an annoying amount of complexity in certain situations, and this here is one such case.
Example run on CPython’s (inofficial) git repository
I applied git-authors
to the current state of the inofficial CPython repository hosted at GitHub. As a side node, this required about 0.1 s of CPU time on my test machine. I am showing the output in full length below, because I find its content rather interesting. We have to appreciate that the commit history is not entirely broken, despite CPython having switched between different version control systems over the last 25 years. Did you know that Just van Rossum also was a committer? :-)
0001 (1990-08-09): Guido van Rossum (guido@python.org) 0002 (1992-08-04): Sjoerd Mullender (sjoerd@acm.org) 0003 (1992-08-13): Jack Jansen (jack.jansen@cwi.nl) 0004 (1993-01-10): cvs2svn (tools@python.org) 0005 (1994-07-25): Barry Warsaw (barry@python.org) 0006 (1996-07-23): Fred Drake (fdrake@acm.org) 0007 (1996-12-09): Roger E. Masse (rmasse@newcnri.cnri.reston.va.us) 0008 (1997-08-13): Jeremy Hylton (jeremy@alum.mit.edu) 0009 (1998-03-03): Ken Manheimer (klm@digicool.com) 0010 (1998-04-09): Andrew M. Kuchling (amk@amk.ca) 0011 (1998-12-18): Greg Ward (gward@python.net) 0012 (1999-01-22): Just van Rossum (just@lettererror.com) 0013 (1999-11-07): Greg Stein (gstein@lyra.org) 0014 (2000-05-12): Gregory P. Smith (greg@mad-scientist.com) 0015 (2000-06-06): Trent Mick (trentm@activestate.com) 0016 (2000-06-07): Marc-André Lemburg (mal@egenix.com) 0017 (2000-06-09): Mark Hammond (mhammond@skippinet.com.au) 0018 (2000-06-29): Fredrik Lundh (fredrik@pythonware.com) 0019 (2000-06-30): Skip Montanaro (skip@pobox.com) 0020 (2000-06-30): Tim Peters (tim.peters@gmail.com) 0021 (2000-07-01): Paul Prescod (prescod@prescod.net) 0022 (2000-07-10): Vladimir Marangozov (vladimir.marangozov@t-online.de) 0023 (2000-07-10): Peter Schneider-Kamp (nowonder@nowonder.de) 0024 (2000-07-10): Eric S. Raymond (esr@thyrsus.com) 0025 (2000-07-14): Thomas Wouters (thomas@python.org) 0026 (2000-07-29): Moshe Zadka (moshez@math.huji.ac.il) 0027 (2000-08-15): David Scherer (dscherer@cmu.edu) 0028 (2000-09-07): Thomas Heller (theller@ctypes.org) 0029 (2000-09-08): Martin v. Löwis (martin@v.loewis.de) 0030 (2000-09-15): Neil Schemenauer (nascheme@enme.ucalgary.ca) 0031 (2000-09-21): Lars Gustäbel (lars@gustaebel.de) 0032 (2000-09-24): Nicholas Riley (nriley@sabi.net) 0033 (2000-10-03): Ka-Ping Yee (ping@zesty.ca) 0034 (2000-10-06): Jim Fulton (jim@zope.com) 0035 (2001-01-10): Charles G. Waldman (cgw@alum.mit.edu) 0036 (2001-03-22): Steve Purcell (steve@pythonconsulting.com) 0037 (2001-06-25): Steven M. Gava (elguavas@python.net) 0038 (2001-07-04): Kurt B. Kaiser (kbk@shore.net) 0039 (2001-07-04): unknown (tools@python.org) 0040 (2001-07-20): Piers Lauder (piers@cs.su.oz.au) 0041 (2001-08-23): Finn Bock (bckfnn@worldonline.dk) 0042 (2001-08-27): Michael W. Hudson (mwh@python.net) 0043 (2001-10-31): Chui Tey (chui.tey@advdata.com.au) 0044 (2001-12-19): Neal Norwitz (nnorwitz@gmail.com) 0045 (2001-12-21): Anthony Baxter (anthonybaxter@gmail.com) 0046 (2002-02-17): Andrew MacIntyre (andymac@bullseye.apana.org.au) 0047 (2002-03-21): Walter Dörwald (walter@livinglogic.de) 0048 (2002-05-12): Raymond Hettinger (python@rcn.com) 0049 (2002-05-15): Jason Tishler (jason@tishler.net) 0050 (2002-05-28): Christian Tismer (tismer@stackless.com) 0051 (2002-06-14): Steve Holden (steve@holdenweb.com) 0052 (2002-09-23): Tony Lownds (tony@lownds.com) 0053 (2002-11-05): Gustavo Niemeyer (gustavo@niemeyer.net) 0054 (2003-01-03): David Goodger (goodger@python.org) 0055 (2003-04-19): Brett Cannon (bcannon@gmail.com) 0056 (2003-04-22): Alex Martelli (aleaxit@gmail.com) 0057 (2003-05-17): Samuele Pedroni (pedronis@openend.se) 0058 (2003-06-09): Andrew McNamara (andrewm@object-craft.com.au) 0059 (2003-10-24): Armin Rigo (arigo@tunes.org) 0060 (2003-12-10): Hye-Shik Chang (hyeshik@gmail.com) 0061 (2004-02-18): David Ascher (david.ascher@gmail.com) 0062 (2004-02-20): Vinay Sajip (vinay_sajip@yahoo.co.uk) 0063 (2004-03-21): Nicholas Bastin (nick.bastin@gmail.com) 0064 (2004-03-25): Phillip J. Eby (pje@telecommunity.com) 0065 (2004-08-04): Matthias Klose (doko@ubuntu.com) 0066 (2004-08-09): Edward Loper (edloper@gradient.cis.upenn.edu) 0067 (2004-08-09): Dave Cole (djc@object-craft.com.au) 0068 (2004-08-14): Johannes Gijsbers (jlg@dds.nl) 0069 (2004-09-17): Sean Reifschneider (jafo@tummy.com) 0070 (2004-10-16): Facundo Batista (facundobatista@gmail.com) 0071 (2004-10-21): Peter Astrand (astrand@lysator.liu.se) 0072 (2005-03-28): Bob Ippolito (bob@redivi.com) 0073 (2005-06-03): Georg Brandl (georg@python.org) 0074 (2005-11-16): Nick Coghlan (ncoghlan@gmail.com) 0075 (2006-03-30): Ronald Oussoren (ronaldoussoren@mac.com) 0076 (2006-04-17): George Yoshida (dynkin@gmail.com) 0077 (2006-04-23): Gerhard Häring (gh@ghaering.de) 0078 (2006-05-23): Richard Jones (richard@commonground.com.au) 0079 (2006-05-24): Andrew Dalke (dalke@dalkescientific.com) 0080 (2006-05-25): Kristján Valur Jónsson (kristjan@ccpgames.com) 0081 (2006-05-25): Jack Diederich (jackdied@gmail.com) 0082 (2006-05-26): Martin Blais (blais@furius.ca) 0083 (2006-07-28): Matt Fleming (mattjfleming@googlemail.com) 0084 (2006-09-05): Sean Reifscheider (jafo@tummy.com) 0085 (2007-03-08): Collin Winter (collinw@gmail.com) 0086 (2007-03-11): Žiga Seilnacht (ziga.seilnacht@gmail.com) 0087 (2007-06-07): Alexandre Vassalotti (alexandre@peadrop.com) 0088 (2007-08-16): Mark Summerfield (list@qtrac.plus.com) 0089 (2007-08-18): Travis E. Oliphant (oliphant@enthought.com) 0090 (2007-08-22): Jeffrey Yasskin (jyasskin@gmail.com) 0091 (2007-08-25): Eric Smith (eric@trueblade.com) 0092 (2007-08-29): Bill Janssen (janssen@parc.com) 0093 (2007-10-31): Christian Heimes (christian@cheimes.de) 0094 (2007-11-10): Amaury Forgeot d'Arc (amauryfa@gmail.com) 0095 (2008-01-08): Mark Dickinson (dickinsm@gmail.com) 0096 (2008-03-17): Steven Bethard (steven.bethard@gmail.com) 0097 (2008-03-18): Trent Nelson (trent.nelson@snakebite.org) 0098 (2008-03-18): David Wolever (david@wolever.net) 0099 (2008-03-25): Benjamin Peterson (benjamin@python.org) 0100 (2008-03-26): Jerry Seutter (jseutter@gmail.com) 0101 (2008-04-16): Jeroen Ruigrok van der Werven (asmodai@in-nomine.org) 0102 (2008-05-13): Jesus Cea (jcea@jcea.es) 0103 (2008-05-24): Guilherme Polo (ggpolo@gmail.com) 0104 (2008-06-01): Robert Schuppenies (okkotonushi@googlemail.com) 0105 (2008-06-10): Josiah Carlson (josiah.carlson@gmail.com) 0106 (2008-06-10): Armin Ronacher (armin.ronacher@active-4.com) 0107 (2008-06-18): Jesse Noller (jnoller@gmail.com) 0108 (2008-06-23): Senthil Kumaran (orsenthil@gmail.com) 0109 (2008-07-22): Antoine Pitrou (solipsis@pitrou.net) 0110 (2008-08-14): Hirokazu Yamamoto (ocean-city@m2.ccsnet.ne.jp) 0111 (2008-12-24): Tarek Ziadé (ziade.tarek@gmail.com) 0112 (2009-03-30): R. David Murray (rdmurray@bitdance.com) 0113 (2009-04-01): Michael Foord (fuzzyman@voidspace.org.uk) 0114 (2009-04-11): Chris Withers (chris@simplistix.co.uk) 0115 (2009-05-08): Philip Jenvey (pjenvey@underboss.org) 0116 (2009-06-25): Ezio Melotti (ezio.melotti@gmail.com) 0117 (2009-08-02): Frank Wierzbicki (fwierzbicki@gmail.com) 0118 (2009-09-20): Doug Hellmann (doug.hellmann@gmail.com) 0119 (2010-01-30): Victor Stinner (victor.stinner@haypocalc.com) 0120 (2010-02-23): Dirkjan Ochtman (dirkjan@ochtman.nl) 0121 (2010-02-24): Larry Hastings (larry@hastings.org) 0122 (2010-02-26): Florent Xicluna (florent.xicluna@gmail.com) 0123 (2010-03-25): Brian Curtin (brian.curtin@gmail.com) 0124 (2010-04-01): Stefan Krah (stefan@bytereef.org) 0125 (2010-04-10): Jean-Paul Calderone (exarkun@divmod.com) 0126 (2010-04-18): Giampaolo Rodolà (g.rodola@gmail.com) 0127 (2010-05-26): Alexander Belopolsky (alexander.belopolsky@gmail.com) 0128 (2010-08-06): Tim Golden (mail@timgolden.me.uk) 0129 (2010-08-14): Éric Araujo (merwok@netwok.org) 0130 (2010-08-22): Daniel Stutzbach (daniel@stutzbachenterprises.com) 0131 (2010-09-18): Brian Quinlan (brian@sweetapp.com) 0132 (2010-11-05): David Malcolm (dmalcolm@redhat.com) 0133 (2010-11-09): Ask Solem (askh@opera.com) 0134 (2010-11-10): Terry Reedy (tjreedy@udel.edu) 0135 (2010-11-10): Łukasz Langa (lukasz@langa.pl) 0136 (2012-06-24): Ned Deily (nad@acm.org) 0137 (2011-01-14): Eli Bendersky (eliben@gmail.com) 0138 (2011-03-10): Eric V. Smith (eric@trueblade.com) 0139 (2011-03-10): R David Murray (rdmurray@bitdance.com) 0140 (2011-03-12): orsenthil (orsenthil@gmail.com) 0141 (2011-03-14): Ross Lagerwall (rosslagerwall@gmail.com) 0142 (2011-03-14): Reid Kleckner (reid@kleckner.net) 0143 (2011-03-14): briancurtin (brian.curtin@gmail.com) 0144 (2011-03-24): guido (guido@google.com) 0145 (2011-03-30): Kristjan Valur Jonsson (sweskman@gmail.com) 0146 (2011-04-04): brian.curtin (brian@python.org) 0147 (2011-04-12): Nadeem Vawda (nadeem.vawda@gmail.com) 0148 (2011-04-19): Giampaolo Rodola' (g.rodola@gmail.com) 0149 (2011-05-04): Alexis Metaireau (alexis@notmyidea.org) 0150 (2011-05-09): Gerhard Haering (gh@ghaering.de) 0151 (2011-05-09): Petri Lehtinen (petri@digip.org) 0152 (2011-05-24): Charles-François Natali (neologix@free.fr) 0153 (2011-07-17): Alex Gaynor (alex.gaynor@gmail.com) 0154 (2011-07-27): Jason R. Coombs (jaraco@jaraco.com) 0155 (2011-08-02): Sandro Tosi (sandro.tosi@gmail.com) 0156 (2011-09-28): Meador Inge (meadori@gmail.com) 0157 (2012-01-09): Terry Jan Reedy (tjreedy@udel.edu) 0158 (2011-05-19): Tarek Ziade (tarek@ziade.org) 0159 (2011-05-22): Martin v. Loewis (martin@v.loewis.de) 0160 (2011-05-31): Ralf Schmitt (ralf@systemexit.de) 0161 (2011-09-12): Jeremy Kloth (jeremy.kloth@gmail.com) 0162 (2012-03-14): Andrew Svetlov (andrew.svetlov@gmail.com) 0163 (2012-03-21): krisvale (sweskman@gmail.com) 0164 (2012-04-24): Marc-Andre Lemburg (mal@egenix.com) 0165 (2012-04-30): Richard Oudkerk (shibturn@gmail.com) 0166 (2012-05-15): Hynek Schlawack (hs@ox.cx) 0167 (2012-06-20): doko (doko@ubuntu.com) 0168 (2012-07-16): Atsuo Ishimoto (ishimoto@gembook.org) 0169 (2012-09-02): Zbigniew JÄ™drzejewski-Szmek (zbyszek@in.waw.pl) 0170 (2012-09-06): Eric Snow (ericsnowcurrently@gmail.com) 0171 (2012-09-25): Chris Jerdonek (chris.jerdonek@gmail.com) 0172 (2012-12-27): Serhiy Storchaka (storchaka@gmail.com) 0173 (2013-03-31): Roger Serwy (roger.serwy@gmail.com) 0174 (2013-03-31): Charles-Francois Natali (cf.natali@gmail.com) 0175 (2013-05-10): Andrew Kuchling (amk@amk.ca) 0176 (2013-06-14): Ethan Furman (ethan@stoneleaf.us) 0177 (2013-08-12): Felix Crux (felixc@felixcrux.com) 0178 (2013-10-21): Peter Moody (python@hda3.com) 0179 (2013-10-25): bquinlan (brian@sweetapp.com) 0180 (2013-11-04): Zachary Ware (zachary.ware@gmail.com) 0181 (2013-12-02): Walter Doerwald (walter@livinglogic.de) 0182 (2013-12-21): Donald Stufft (donald@stufft.io) 0183 (2014-01-03): Daniel Holth (dholth@fastmail.fm) 0184 (2014-01-27): Yury Selivanov (yselivanov@sprymix.com) 0185 (2014-04-15): Kushal Das (kushaldas@gmail.com) 0186 (2014-06-29): Berker Peksag (berker.peksag@gmail.com) 0187 (2014-07-16): Tal Einat (taleinat@gmail.com) 0188 (2014-10-08): Steve Dower (steve.dower@microsoft.com) 0189 (2014-10-18): Robert Collins (rbtcollins@hp.com) 0190 (2015-03-22): Paul Moore (p.f.moore@gmail.com)
Resources
Similar functionality is provided by the more full-blown frameworks grunt-git-authors and gitstats.
Some resources that might be insightful for you:
- https://docs.python.org/3.5/library/io.html#io.TextIOBase
- https://docs.python.org/2/library/io.html#io.TextIOWrapper
- https://www.ics.uci.edu/~pattis/ICS-33/lectures/complexitypython.txt
- https://git-scm.com/docs/git-log
- https://docs.python.org/2/tutorial/datastructures.html#sets
- http://stackoverflow.com/a/11857467/145400
- http://stackoverflow.com/a/16549381/145400
- http://unix.stackexchange.com/a/31339/13256
- http://bugs.python.org/issue18553
- https://bugs.python.org/issue10791
- https://bugs.python.org/issue13848
- https://bugs.python.org/issue12591
Leave a Reply