Category Archives: Security

Good to know checkrestart from debian-goodies

The security audits triggered by Heartbleed led to the discovery of an increasing number of security issues in libraries such as GnuTLS and OpenSSL over the past weeks. These issues were dealt with responsibly, i.e. they were usually published together with the corresponding binary updates by major Linux distributions such as Debian (subscribing to debian-security-announce or comparable sources is highly recommended if you want to keep track of these developments).

After downloading an updated binary of a shared library, it is important to restart all services (processes) that are linked against this library in order for the update to take effect. Usually, a process loads a shared library's code segment into random access memory once during startup (actually, the program loader / runtime linker does this) and does not reload that code afterwards throughout its lifetime, which may be weeks or months in the case of server processes. Prime examples are Nginx/Apache/Exim being linked against OpenSSL or GnuTLS: if you update the latter but do not restart the former, you have changed your disk contents but not updated your service. In the worst case, the system is still vulnerable. It should therefore be the habit of a good system administrator to ask which services are using a certain shared library and to restart them after a shared library update, if required.

There is a neat helper utility in Debian, called checkrestart. It comes with the debian-goodies package. First of all, what is debian-goodies? Let’s see (quote from Wheezy’s package description):

Small toolbox-style utilities for Debian systems
 
These programs are designed to integrate with standard shell tools, extending them to operate on the Debian packaging system.
 
 dgrep  - Search all files in specified packages for a regex
 dglob  - Generate a list of package names which match a pattern
 
These are also included, because they are useful and don't justify their own packages:
 
 debget          - Fetch a .deb for a package in APT's database
 dpigs           - Show which installed packages occupy the most space
 debman          - Easily view man pages from a binary .deb without extracting
 debmany         - Select manpages of installed or uninstalled packages
 checkrestart    - Help to find and restart processes which are using old
                   versions of upgraded files (such as libraries)
 popbugs         - Display a customized release-critical bug list based on
                   packages you use (using popularity-contest data)
 which-pkg-broke - find which package might have broken another

Checkrestart is a Python application wrapping lsof (“list open files”). It tries to identify files that are in use by processes but no longer present in the file system. How can that happen?

Note that during an update a binary file is replaced: the new version is first downloaded to disk and then rename()ed in order to overwrite the original. During a POSIX rename(), the old file is deleted. But the old file is still in use! The standard says that if any process still has a file open during its deletion, that file will remain “in existence” until the last file descriptor referring to it is closed. While such files are held “in existence” for running processes by the operating system, they are no longer listed in the file system. They can, however, easily be identified via the lsof tool. And this is exactly what checkrestart does.
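These unlink semantics are easy to demonstrate from Python. The following sketch (standard library only, POSIX systems) deletes a file while a descriptor to it is still open and shows that the content stays readable until the descriptor is closed:

```python
import os
import tempfile

# Demonstrate POSIX unlink semantics: a file deleted while a process
# still holds an open file descriptor remains "in existence" until the
# last descriptor referring to it is closed.
fd, path = tempfile.mkstemp()
os.write(fd, b"old library code")

os.unlink(path)                    # directory entry is gone ...
assert not os.path.exists(path)

os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 100)            # ... but the content is still there
assert data == b"old library code"

os.close(fd)                       # only now is the inode released
```

This is exactly the situation a long-running server process is in after a library update: it keeps executing the old, now-unlisted code until it is restarted.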

Hence, checkrestart “compares” the open files used by running processes to the corresponding files in the file system. If the file system contains other (e.g. newer) data than the process is currently using, then checkrestart proposes to restart that process. In a tidy server environment, this usually is the case only for updated shared library files. Below you can find example output after updating Java, Python, OpenSSL, and GnuTLS:

# checkrestart
Found 12 processes using old versions of upgraded files
(5 distinct programs)
(5 distinct packages)
 
Of these, 3 seem to contain init scripts which can be used to restart them:
The following packages seem to have init scripts that could be used
to restart them:
nginx-extras:
    20534   /usr/sbin/nginx
    20533   /usr/sbin/nginx
    20532   /usr/sbin/nginx
    19113   /usr/sbin/nginx
openssh-server:
    3124    /usr/sbin/sshd
    22964   /usr/sbin/sshd
    25724   /usr/sbin/sshd
    22953   /usr/sbin/sshd
    25719   /usr/sbin/sshd
exim4-daemon-light:
    3538    /usr/sbin/exim4
 
These are the init scripts:
service nginx restart
service ssh restart
service exim4 restart
 
These processes do not seem to have an associated init script to restart them:
python2.7-minimal:
    2548    /usr/bin/python2.7
openjdk-7-jre-headless:amd64:
    4348    /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java

Nginx (web server) and OpenSSH (SSH server) are linked against OpenSSL, and Exim (mail transfer agent) is linked against GnuTLS, so that output makes sense. Obviously, after an update of Python or Java, processes using the corresponding interpreter or VM also need to be restarted in order to use the new code. Checkrestart is extremely helpful. Thanks for this nice tool.
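Under the hood, the check boils down to spotting memory-mapped files that the kernel marks “(deleted)” in /proc/&lt;pid&gt;/maps, the same information lsof exposes. A minimal sketch of the idea in Python (the helpers below are my own illustration, not checkrestart's actual code):

```python
import glob
import re

def deleted_mappings(maps_text):
    """Extract paths of memory-mapped files that the kernel marks as
    '(deleted)' in /proc/<pid>/maps output."""
    paths = set()
    for line in maps_text.splitlines():
        m = re.search(r"\s(/\S[^\n]*) \(deleted\)$", line)
        if m:
            paths.add(m.group(1))
    return paths

def processes_needing_restart():
    """Scan all readable /proc/<pid>/maps files for deleted mappings
    (Linux only); maps each PID to its set of stale file paths."""
    result = {}
    for maps_path in glob.glob("/proc/[0-9]*/maps"):
        try:
            with open(maps_path) as f:
                stale = deleted_mappings(f.read())
        except OSError:
            continue  # process vanished or access denied
        if stale:
            result[maps_path.split("/")[2]] = stale
    return result
```

Run as root, such a scan reports every process still executing code from a file that has since been replaced on disk.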

Update 2014-09-03:

With the upcoming Debian 8 (Jessie, currently unstable), a new package has been introduced: needrestart. It is a more modern version of checkrestart and integrates tightly with the systemd service infrastructure. Notably, needrestart ships as its own package, with the goal of becoming more popular than checkrestart (which was hidden in debian-goodies, as discussed above). Currently, a discussion is going on on the debian-security mailing list about installing and running needrestart in a default Debian installation.

The challenges of secure asynchronous group messaging

I just want to draw your attention to another nice blog post by Open WhisperSystems, describing the technical challenges of securely implementing group chat. So head over to https://whispersystems.org/blog/private-groups/ and have a nice read. If you are new to OTR and secure instant messaging architectures in general, you might be surprised by the complexity, and the text might be a bit too difficult for you. There are, however, some more accessibly written posts on the Open WhisperSystems blog that might help you understand the basic concepts.

By the way, yes, I absolutely recommend TextSecure if you are interested in a secure instant messaging solution for your mobile phone. To me, it currently is the most convincing solution, especially from the political point of view, while also being sound in all its technical concepts. So far, the technical implementation may not be perfect, and an iOS client is still missing, but thanks to the open source community and a great project lead, these issues will resolve over time. No need to look at Threema and other possibly commercial and closed-source applications. If you want security, go for TextSecure. If security does not matter to you, go ahead and keep using WhatsApp (I do, and for many things it serves the purpose just fine).

And remember, IT security is not always what it seems to be.

GnuTLS vulnerability: is unit testing a matter of language culture?

You have probably heard about this major security issue in GnuTLS, publicly announced on March 3, 2014, with the following words in a patch note on the GnuTLS mailinglist:

This fixes is an important (and at the same time embarrassing) bug
discovered during an audit for Red Hat. Everyone is urged to upgrade.

The official security advisory describes the issue in these general terms:

A vulnerability was discovered that affects the certificate verification functions of all gnutls versions. A specially crafted certificate could bypass certificate validation checks. The vulnerability was discovered during an audit of GnuTLS for Red Hat.

Obviously, media and tech bloggers pointed out the significance of this issue. If you are interested in the technical details, I recommend a well-written LWN article on the topic: A longstanding GnuTLS certificate validation botch. As it turns out, the bug was introduced by a code change that refactored the error/success communication between functions. Generally speaking, the problem is that two communication partners went out of sync: when the sender said ‘Careful, error!’, the recipient actually understood ‘Cool, success.’. Bah. We are used to a modern, test-driven development culture. Consequently, most of us immediately think: “WTF, don’t they test their code?”.

An automated test suite should have immediately spotted that invalid commit, right? But wait a second: that faulty commit was pushed in the year 2000, the language we are talking about is C, and unit testing for C is not exactly established. Given that, did you really, honestly, expect a C code base that reaches back more than a decade to be covered by ideal unit tests, by modern standards? No? Me neither (although I would have expected a security-relevant library such as GnuTLS to be under significant third-party test coverage; does everybody trust the authors?).

We seem to excuse, or at least acknowledge and tolerate, that old system software written in C is not well tested by modern standards of test-driven development. To be sure, there is modern software out there applying ideal testing strategies, but it has only a few users. At the same time, old software is circulating, used by millions, yet not applying modern testing strategies. Why is that? And should we tolerate it? There was an interesting discussion about this topic right underneath the above-mentioned LWN article. I'd like to quote one comment that I particularly agree with, although it asks more questions than it provides answers:

> In addition to the culture of limited testing you alluded to,
> I think there are some language issues here as well

Yes, true. But I wonder if discussing type systems is also a
distraction from the more pressing issue here? After all, even
with all the help of Haskell’s type system, you *will* still
have bugs.

It seems to me that the lack of rigorous testing was:
(a) The most immediate cause of these bugs
(b) More common in projects written in C

I find it frustrating that discussions of these issues continually
drift towards language wars, rather than towards modern ideas about
unit testing, software composability, test-driven development, and
code coverage tracking.

Aren’t these the more pressing questions?
(1) Where are the GnuTLS unit tests, so I can review and add more?
(2) Where is the new regression test covering this bug?
(3) What is the command to run a code coverage tool on the test
suite, so that I can see what coverage is missing?

Say what you will about “toy” languages, but that is what would
happen in any halfway mature Ruby or Python or Javascript project,
and I’m happy to provide links to back that up.

Say what you will about the non-systems languages on the JVM, but
that is also what would happen in any halfway mature Scala, Java,
or Clojure project.

It’s only in C, the systems language in which so many of these
vital libraries are written, that this is not the case. Isn’t it
time to ask why?

Someone answered, and I think this view makes sense:

For example, I suspect that the reason “C culture” seems impervious to adopting the lessons of test-driven development has a lot to do with the masses of developers who are interested in it, by following your advice, are moving to other languages and practicing it there.

In other words, by complecting the issue of unit testing and test coverage with the choice of language, are we not actively *contributing* to the continuing absence of these ideas from C culture, and thus from the bulk of our existing systems?

Food for thought, at least, I hope!

I agree: the effort for improved testing of old, but essential, C libraries must come from the open source community. Someone has to do it.
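To make this concrete in the unit-testing spirit the commenters ask for, here is a minimal, hypothetical regression-test sketch in Python (invented function and error code; real GnuTLS tests would of course be written in C). The point is that the error-code contract itself becomes a tested property:

```python
def verify_cert_chain(chain):
    """Toy stand-in for a verification routine with the C-style contract:
    0 on success, a negative error code on failure."""
    return 0 if chain and all(chain) else -43  # -43: made-up error code

def test_invalid_chain_is_rejected():
    # The crucial property: a failed verification must never be mistaken
    # for success, no matter how the internals are refactored.
    assert verify_cert_chain([True, False]) < 0
    assert verify_cert_chain([]) < 0

def test_valid_chain_is_accepted():
    assert verify_cert_chain([True, True]) == 0

# A test runner such as pytest would collect these automatically;
# here we simply call them.
test_invalid_chain_is_rejected()
test_valid_chain_is_accepted()
```

A refactoring that flips the success/error convention, like the one behind the GnuTLS bug, would make the first test fail loudly.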

A brief interjection on the subject of WotzApp

A new star in the sky of the billion-dollar corporations. WhatsApp should not be praised so much for being simple to use and for just working. Others can do that, too. Other things matter. You may remember: until recently, it was very easy to send WhatsApp messages in other people's names. That has become harder, but it is still possible. So what can we actually be sure of about WhatsApp?

  • The user is the product.
  • Privacy has low priority.

The first point has been clear since February 19, 2014, at the latest. As for the second point: messages sent over the local Wi-Fi, for example, can be read along with relatively simple means. There are surely many other privacy problems, small and large, but all of these things interest only a small fraction of the general public.

WhatsApp: no standards, no security. Just 90s-style chat for your phone.

The WhatsApp engineers worked quick & dirty from the start and did not align the fundamentals of their architecture with established standards. IT security and privacy without adhering to standards? Such an endeavor is ill-posed, in the mathematical sense, from the outset. The carelessness in WhatsApp's technical implementation has been apparent for years. After all, it is clear what WhatsApp is at its core: a very simple, entirely indifferent form of chat. Security and privacy do not matter at all. One thing already struck me years ago: you never have to “log in” to WhatsApp; there is no (shared) secret.

A crypto primer

You do not even have to look deeper into WhatsApp to harbor serious privacy skepticism towards it. People write to each other without first exchanging a secret. Time for a very short mini crypto excursion; maybe I can reach a narrow mass, i.e. a small part of the broad mass. A secret is something that only YOU know. Your phone number is not a secret. Your IMEI is not a secret either. Without a secret, you can

  • not securely authenticate yourself (unambiguously prove your identity),
  • not securely encrypt data,
  • not guarantee the integrity of transmitted data.

Conversely, this means for communication without a secret:

  • Anyone can (with more or less effort) send messages in your name.
  • Anyone on the communication path between you and the recipient (Wi-Fi, ISP, …) can read your messages.
  • Anyone on the communication path between you and the recipient (Wi-Fi, ISP, …) can alter your messages.
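What a shared secret buys you can be sketched in a few lines of Python: with a secret, both sides can authenticate messages via an HMAC; anyone without the secret can neither forge a valid message nor tamper with one unnoticed. (This only illustrates authentication and integrity; encryption would additionally require a cipher.)

```python
import hashlib
import hmac

secret = b"only-you-and-your-peer-know-this"

def sign(message, key):
    """Authenticate a message with a shared secret (HMAC-SHA256)."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(message, tag, key):
    """Constant-time check that the tag belongs to this message and key."""
    return hmac.compare_digest(sign(message, key), tag)

tag = sign(b"see you at 8", secret)
assert verify(b"see you at 8", tag, secret)            # genuine message
assert not verify(b"see you at 9", tag, secret)        # tampered content
assert not verify(b"see you at 8", tag, b"attacker")   # wrong secret
```

Phone numbers and IMEIs cannot play the role of `secret` here, because everyone on the network path knows them.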

What do people actually want? Security, not so much.

That is enough for the primer. But let's be honest: e-mail and SMS suffer from the same problems. ICQ and Skype do not offer theoretically complete protection either, even though a secret is used there, namely the login credentials (that is the wrong form of secret, but we will not go into that here). And DE-Mail is something to get upset about, because it promises a security that does not exist.

Hardly anyone cares about all this. And I believe that is the core, an important insight: people do not necessarily want real security. Most of the time they simply do not care whether “someone” can read along. Is the broad mass somewhere between naive and delusional here? Maybe, but it does not matter. Conceptually perfect IT security and everyday human communication do not really go together. We saw the same indifference in the NSA scandal. Where is the outcry? #aufschrei? #aufschrei3000? To some extent I do not care either: after all, I use ICQ, send e-mails and text messages, and recently became a WhatsApp user myself. And for each of these technologies I know exactly how it can be attacked. Have you really never used tcpdump on your router to capture your roommates' messages ;-)?

But please do be aware of what is happening here.

What matters, in my opinion, is awareness of who can collect data about whom, and on what scale. And awareness that you may, under certain circumstances, be selling yourself. Have a look at this official WhatsApp blog post from 2012:

Remember, when advertising is involved you the user are the product.

There they explain that the WhatsApp user is not the product because they do not use advertising. Then they go on about their oh-so-great architecture, claiming that the simple form of communication is their product, not the user or his data:

That’s our product and that’s our passion. Your data isn’t even in the picture. We are simply not interested in any of it.

$19 billion for what, exactly? Oh right, of course.

The statements above always sounded dirty. Lately, however, they appear in a particularly mangy light. Let me put it simply:

  • $19 billion for an (IT) architecture that many could have built better? Nope.
  • $19 billion for a huge user base? Yes.

So what is the product? The users, exactly, as always. Be aware of that.

One more thing: heml.is and BitTorrent Chat

If you ever really need secure communication, you have to know where to get it. The media have just settled on Threema. Good for the Swiss; their cash registers are surely ringing. As far as I can tell, it is cryptographically solid. They adhered to recognized standards. And they work with real secrets. The payload, i.e. the message contents, appears to be secure. You have to know, however, that the Threema people can of course still see and collect metadata: who communicates with whom, when, and how much (actually everything except what, and perhaps why :-)). Besides, Threema is not free of charge, unlike some of its competitors. By the way, you do not need a guilty conscience when installing a free app. The Flappy Bird guy earned $50,000 a day in advertising revenue merely through his presence in the respective app stores.

I would like to draw your attention to heml.is and BitTorrent Chat. https://heml.is/ has been under development by three Swedes for quite a while now. Through their planning and communication policy they make a highly likable and professional impression. They are close to release, and people are already tweeting at them that it should best be out right now, blah blah, but they react quite coolly:

A car without wheels may be 99% complete but is pretty useless, right?

So, https://heml.is/, remember that one. It makes a better impression than Threema. BitTorrent Chat is also very promising, in a different way. As always with anything “torrent”, a decentralized approach is pursued here: a self-regulating P2P network. Only with such an approach can anonymity be realized (similar to Tor), and only in this way can the efficient collection of metadata be prevented. BitTorrent Chat is not finished yet either, but it is close to release.

Recover a gzipped HTML response from the browser cache

Recently, an incident with the server running this website resulted in total data loss. I had set up a daily remote backup of my WordPress database (to Dropbox, i.e. Amazon S3) and was able to restore an approximately 24-hour-old state of my blog. Unfortunately, one article that I had written and published only a few hours before the incident was not contained in the latest database backup and was therefore lost for the moment.

I knew that I had checked the article’s final version from the random visitor’s perspective using Chrome right after publishing it. So the browser cache was the only hope for me to restore the article, at least in HTML form. Consequently, I immediately archived my Chrome cache for further investigation. Thankfully, with a tiny forensics exercise, I was able to retrieve the final contents of the article from a gzipped and cached HTML response. I used Python for extracting the HTML content in clear text and figured that the applied procedure is worth a small blog post on its own. In particular, I think that this is a nice example of why Python has earned the “batteries included” attribute.

I am going to lead you through this by means of an example — a 403 error page in this case, as retrieved by accessing http://gehrcke.de/files/perm/. It has the following HTML source:

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

The moment you accessed the URL given above, the response was most likely already cached by your browser. Let's assume that this response contains valuable data and that our goal is to restore it from the cache.

Chrome’s browser cache is stored in larger binary files; it cannot be conveniently searched or queried using file system tools or less / grep alone. Chrome does, however, bring along a rudimentary tool for searching the cache: chrome://cache/. Search that list for “http://gehrcke.de/files/perm” (Ctrl + F) and you will find a corresponding entry. When clicking it, the details of this cache entry are displayed on a simplistic web page:

http://gehrcke.de/files/perm/
HTTP/1.1 403 Forbidden
Date: Mon, 16 Sep 2013 14:03:22 GMT
Content-Type: text/html
Content-Encoding: gzip
 
 
00000000:  9a  00  00  00  03  00  00  00  e3  cf  68  f3  16  45  2e  00  ..........h..E..
00000010:  51  3d  69  f3  16  45  2e  00  6b  00  00  00  48  54  54  50  Q=i..E..k...HTTP
00000020:  2f  31  2e  31  20  34  30  33  20  46  6f  72  62  69  64  64  /1.1 403 Forbidd
00000030:  65  6e  00  44  61  74  65  3a  20  4d  6f  6e  2c  20  31  36  en.Date: Mon, 16
00000040:  20  53  65  70  20  32  30  31  33  20  31  34  3a  30  33  3a   Sep 2013 14:03:
00000050:  32  32  20  47  4d  54  00  43  6f  6e  74  65  6e  74  2d  54  22 GMT.Content-T
00000060:  79  70  65  3a  20  74  65  78  74  2f  68  74  6d  6c  00  43  ype: text/html.C
00000070:  6f  6e  74  65  6e  74  2d  45  6e  63  6f  64  69  6e  67  3a  ontent-Encoding:
00000080:  20  67  7a  69  70  00  00  00  0d  00  00  00  33  37  2e  32   gzip.......37.2
00000090:  32  31  2e  31  39  34  2e  37  32  00  00  00  50  00          21.194.72...P.
 
 
00000000:  1f  8b  08  00  00  00  00  00  00  03  ed  8e  b1  0e  c2  30  ...............0
00000010:  0c  44  77  24  fe  c1  74  8f  02  82  31  64  41  20  31  30  .Dw$..t...1dA 10
00000020:  f1  05  49  6d  92  48  69  82  4c  24  e8  df  93  96  22  21  ..Im.Hi.L$...."!
00000030:  66  46  36  fb  ee  fc  ce  ca  97  2e  ea  f9  4c  79  32  a8  fF6.........Ly2.
00000040:  55  09  25  92  de  2c  d7  70  c8  6c  03  22  25  25  5f  a2  U.%..,.p.l."%%_.
00000050:  92  63  a4  46  6d  c6  1e  ac  6b  73  cc  bc  6d  ee  3e  14  .c.Fm...ks..m.>.
00000060:  6a  06  bd  a5  54  88  b5  f2  ab  6f  42  55  94  9c  ec  a1  j...T....oBU....
00000070:  ab  86  a6  2d  b9  90  1e  9f  9e  1c  e8  e3  f0  fe  6c  21  ...-..........l!
00000080:  04  18  b8  1a  c4  90  1c  94  0c  18  6e  c6  46  82  d3  f9  ..........n.F...
00000090:  b8  07  93  10  76  9e  73  47  70  e1  40  09  63  0f  c4  9c  ....v.sGp.@.c...
000000a0:  b9  5e  38  02  21  fe  88  5f  23  9e  f1  7a  0e  0d  34  02  .^8.!.._#..z..4.
000000b0:  00  00                                                          ..

When looking at the HTML source of this page, three pre blocks stand out:

  • The first pre block contains a formatted version of the HTTP response header.
  • The second pre block contains a hexdump of the response header.
  • The third pre block contains a hexdump of the response body.

The hexdumps are formatted in a way similar to how hexdump -C would print stuff to stdout: the first column shows an address offset, the second column shows space-separated hex representations of single bytes, the third column shows an ASCII interpretation of single bytes.

From Content-Encoding: gzip we see that this response was delivered in gzipped form by the webserver. Hence, the ASCII representation in the third column of the hexdumps is not human-readable. The programming goal now is to restore the original HTML document from this cache entry web page as displayed by Google Chrome (unfortunately, this obvious feature is not built into the browser itself). As a first step, save the cache entry web page to a file (right click, “Save As” …). I called it cache.html.

I wrote a Python script, recover.py. It reads cache.html, restores the original HTML document, and prints it to stdout:

$ python recover.py 
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

The source of recover.py is:

  1. import re
  2. from binascii import unhexlify
  3. from gzip import GzipFile
  4. from itertools import chain
  5. from StringIO import StringIO
  6.  
  7. with open("cache.html", "rb") as f:
  8.     html = f.read().decode("utf-8")
  9.  
  10. hexlines = re.findall("<pre>(.*?)</pre>", html, flags=re.S)[2].splitlines()
  11. hexdata = ''.join(chain.from_iterable(l[11:73].split() for l in hexlines))
  12. print GzipFile(fileobj=StringIO(unhexlify(hexdata))).read()

I tested this with Python 2.7. In lines 1-5, a selection of packages, classes, and functions is imported from the Python standard library. As you can already infer from the names, we have a tool at hand for converting data in hex representation to raw binary data (unhexlify from the binascii module), as well as a tool for decompressing data that has previously been compressed according to the gzip file format standard. StringIO provides in-memory file handling; that way we get around writing an actual file containing the gzipped data to disk. re is Python’s regular expression package; I am using it for extracting the contents of the third pre block, i.e. Chrome’s hexdump of the gzipped HTTP response body (as explained above).

A step-by-step walk-through:

  • In lines 7 and 8, the entire content of Chrome’s HTML representation of the cached response (cache.html) is read. Since Chrome tells us that it had encoded cache.html using the UTF-8 codec, we use the same codec to decode the file into a Python unicode object.
  • In line 10, the hexdump representation of the gzipped response body is extracted. A regular expression is used for matching the content of all pre blocks. The third of those is selected. The formatted hexdump is split into a list of single lines for further processing.
  • In line 11, the three-column formatted hexdump is converted to a raw hex representation, free of any whitespace characters. Only the middle column (characters 12 to 73) is extracted from each line. Finally, all characters are concatenated into one single string.
  • In line 12, the data in hex representation is converted to binary data and written to an in-memory file object. This is treated as gzip file and decompressed. The result is printed to stdout. It is the original HTML response body as sent by the web server.
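The script above targets Python 2.7. For reference, here is the same procedure as I would write it for Python 3 (a straightforward port: bytes.fromhex and gzip.decompress replace unhexlify and GzipFile-over-StringIO; the function name is my own):

```python
import gzip
import re

def recover_body(cache_html):
    """Python 3 port of recover.py: extract the third <pre> hexdump from a
    saved chrome://cache entry page and gunzip it."""
    pre_blocks = re.findall("<pre>(.*?)</pre>", cache_html, flags=re.S)
    hexlines = pre_blocks[2].splitlines()
    # keep only the middle (hex) column, characters 12 to 73, of each line
    hexdata = "".join("".join(line[11:73].split()) for line in hexlines)
    return gzip.decompress(bytes.fromhex(hexdata))

# Usage:
#   with open("cache.html", encoding="utf-8") as f:
#       print(recover_body(f.read()).decode("utf-8"))
```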

Hopefully this is useful to someone who also has to retrieve important data from Chrome’s cache…