Monthly Archives: September 2013

Recover a gzipped HTML response from the browser cache

Recently, an incident with the server running this website resulted in a total data loss. I had set up a daily remote backup of my WordPress database (to Dropbox, i.e. Amazon S3) and was able to restore a roughly 24 hour old state of my blog. Unfortunately, one article that I wrote and published only a few hours before the incident was not contained in the last database backup and was therefore lost, at least for the moment.

I knew that I had checked the article’s final version from a random visitor’s perspective using Chrome right after publishing it. So the browser cache was my only hope of restoring the article, at least in HTML form. Consequently, I immediately archived my Chrome cache for further investigation. Thankfully, with a tiny forensics exercise, I was able to retrieve the final contents of the article from a gzipped and cached HTML response. I used Python to extract the HTML content as clear text and figured that the procedure is worth a small blog post on its own. In particular, I think it is a nice example of why Python has earned the “batteries included” attribute.

I am going to lead you through this by means of an example — a 403 error page in this case, as retrieved by accessing http://gehrcke.de/files/perm/. It has the following HTML source:

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

The moment you access the URL given above, your browser usually caches the response right away. Let’s assume that this response contains valuable data and that our goal is to restore it from the cache.

Chrome’s browser cache is stored in a few large binary files; it cannot conveniently be searched or queried with file system tools or less / grep alone. However, Chrome brings along a rudimentary tool for searching the cache: chrome://cache/. Search that list for “http://gehrcke.de/files/perm” (Ctrl + F) and you will find a corresponding entry. When you click it, the details of this cache entry are displayed on a simplistic web page:

http://gehrcke.de/files/perm/
HTTP/1.1 403 Forbidden
Date: Mon, 16 Sep 2013 14:03:22 GMT
Content-Type: text/html
Content-Encoding: gzip
 
 
00000000:  9a  00  00  00  03  00  00  00  e3  cf  68  f3  16  45  2e  00  ..........h..E..
00000010:  51  3d  69  f3  16  45  2e  00  6b  00  00  00  48  54  54  50  Q=i..E..k...HTTP
00000020:  2f  31  2e  31  20  34  30  33  20  46  6f  72  62  69  64  64  /1.1 403 Forbidd
00000030:  65  6e  00  44  61  74  65  3a  20  4d  6f  6e  2c  20  31  36  en.Date: Mon, 16
00000040:  20  53  65  70  20  32  30  31  33  20  31  34  3a  30  33  3a   Sep 2013 14:03:
00000050:  32  32  20  47  4d  54  00  43  6f  6e  74  65  6e  74  2d  54  22 GMT.Content-T
00000060:  79  70  65  3a  20  74  65  78  74  2f  68  74  6d  6c  00  43  ype: text/html.C
00000070:  6f  6e  74  65  6e  74  2d  45  6e  63  6f  64  69  6e  67  3a  ontent-Encoding:
00000080:  20  67  7a  69  70  00  00  00  0d  00  00  00  33  37  2e  32   gzip.......37.2
00000090:  32  31  2e  31  39  34  2e  37  32  00  00  00  50  00          21.194.72...P.
 
 
00000000:  1f  8b  08  00  00  00  00  00  00  03  ed  8e  b1  0e  c2  30  ...............0
00000010:  0c  44  77  24  fe  c1  74  8f  02  82  31  64  41  20  31  30  .Dw$..t...1dA 10
00000020:  f1  05  49  6d  92  48  69  82  4c  24  e8  df  93  96  22  21  ..Im.Hi.L$...."!
00000030:  66  46  36  fb  ee  fc  ce  ca  97  2e  ea  f9  4c  79  32  a8  fF6.........Ly2.
00000040:  55  09  25  92  de  2c  d7  70  c8  6c  03  22  25  25  5f  a2  U.%..,.p.l."%%_.
00000050:  92  63  a4  46  6d  c6  1e  ac  6b  73  cc  bc  6d  ee  3e  14  .c.Fm...ks..m.>.
00000060:  6a  06  bd  a5  54  88  b5  f2  ab  6f  42  55  94  9c  ec  a1  j...T....oBU....
00000070:  ab  86  a6  2d  b9  90  1e  9f  9e  1c  e8  e3  f0  fe  6c  21  ...-..........l!
00000080:  04  18  b8  1a  c4  90  1c  94  0c  18  6e  c6  46  82  d3  f9  ..........n.F...
00000090:  b8  07  93  10  76  9e  73  47  70  e1  40  09  63  0f  c4  9c  ....v.sGp.@.c...
000000a0:  b9  5e  38  02  21  fe  88  5f  23  9e  f1  7a  0e  0d  34  02  .^8.!.._#..z..4.
000000b0:  00  00                                                          ..

When looking at the HTML source of this page, three pre blocks stand out:

  • The first pre block contains a formatted version of the HTTP response header.
  • The second pre block contains a hexdump of the response header.
  • The third pre block contains a hexdump of the response body.

The hexdumps are formatted similarly to the output of hexdump -C: the first column shows an address offset, the second column shows space-separated hex representations of the individual bytes, and the third column shows an ASCII interpretation of those bytes.
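To make the column layout concrete, here is how one line of the body hexdump shown above decomposes with plain string slicing. The column boundaries are read off from Chrome’s formatting as displayed here, so treat them as an assumption that may need adjustment for other cache entries:

line = "00000000:  1f  8b  08  00  00  00  00  00  00  03  ed  8e  b1  0e  c2  30  ...............0"
offset_col = line[:9]      # '00000000:'
hex_col    = line[11:73]   # '1f  8b  08  ...  c2  30' (the middle column only)
ascii_col  = line[75:]     # '...............0'
print hex_col.split()      # ['1f', '8b', '08', ..., 'c2', '30']

The recovery script further down relies on exactly this fixed-width layout.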

From Content-Encoding: gzip we see that this response was delivered in gzipped form by the web server. Hence, the ASCII representation in the third column of the body hexdump is not human-readable. The programming goal now is to restore the original HTML document from this cache entry web page as displayed by Google Chrome (unfortunately, this obvious feature is not built into the browser itself). As a first step, save the cache entry web page to a file (right click, “Save As” …). I called it cache.html.

I wrote a Python script, recover.py. It reads cache.html, restores the original HTML document, and prints it to stdout:

$ python recover.py 
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

The source of recover.py is:

  1. import re
  2. from binascii import unhexlify
  3. from gzip import GzipFile
  4. from itertools import chain
  5. from StringIO import StringIO
  6.  
  7. with open("cache.html", "rb") as f:
  8.     html = f.read().decode("utf-8")
  9.  
  10. hexlines = re.findall("<pre>(.*?)</pre>", html, flags=re.S)[2].splitlines()
  11. hexdata = ''.join(chain.from_iterable(l[11:73].split() for l in hexlines))
  12. print GzipFile(fileobj=StringIO(unhexlify(hexdata))).read()

I tested this with Python 2.7. In lines 1-5, a selection of packages, classes, and functions is imported from the Python standard library. As you can already infer from the names, we have a tool at hand for converting data from hex representation to raw binary data (unhexlify from the binascii module), as well as a tool for decompressing data that has previously been compressed according to the gzip file format standard (GzipFile). StringIO provides in-memory file handling; that way we get around writing an actual file containing the gzipped data to disk. re is Python’s regular expression package; I use it for extracting the contents of the third pre block, i.e. Chrome’s hexdump of the gzipped HTTP response body (as explained above).
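To illustrate how these building blocks fit together, here is a minimal, self-contained round trip (a sketch for Python 2.7, independent of any cache file): gzip-compress a small string in memory, turn the result into a hex string, and recover the original text the same way the script does.

from binascii import hexlify, unhexlify
from gzip import GzipFile
from StringIO import StringIO

# Compress a small HTML snippet in memory.
buf = StringIO()
gz = GzipFile(fileobj=buf, mode="wb")
gz.write("<html>hello</html>")
gz.close()

# Hex representation, comparable to the middle column of Chrome's hexdump.
hexdata = hexlify(buf.getvalue())   # e.g. '1f8b0800...'

# Convert back to binary, treat it as a gzip file, and decompress it.
print GzipFile(fileobj=StringIO(unhexlify(hexdata))).read()   # <html>hello</html>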

A step-by-step walk-through:

  • In lines 7 and 8, the entire content of Chrome’s HTML representation of the cached response (cache.html) is read. Since Chrome tells us that it had encoded cache.html using the UTF-8 codec, we use the same codec to decode the file into a Python unicode object.
  • In line 10, the hexdump representation of the gzipped response body is extracted. A regular expression is used for matching the content of all pre blocks. The third of those is selected. The formatted hexdump is split into a list of single lines for further processing.
  • In line 11, the three-column formatted hexdump is converted to a raw hex representation, free of any whitespace characters. Only the middle column of each line is extracted (characters 12 to 73). Finally, all characters are concatenated into one single string.
  • In line 12, the data in hex representation is converted to binary data and written to an in-memory file object. This is treated as a gzip file and decompressed. The result is printed to stdout. It is the original HTML response body as sent by the web server.
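The script above targets Python 2.7 (print statement, StringIO). In case you want to run it under Python 3, a minimally adapted sketch could look like the following; it makes the same assumption that cache.html has the layout shown above:

import re
from binascii import unhexlify
from gzip import GzipFile
from io import BytesIO
from itertools import chain

with open("cache.html", "rb") as f:
    html = f.read().decode("utf-8")

# The third <pre> block contains the hexdump of the gzipped response body.
hexlines = re.findall("<pre>(.*?)</pre>", html, flags=re.S)[2].splitlines()
# Keep only the middle (hex) column of each line and join everything.
hexdata = ''.join(chain.from_iterable(l[11:73].split() for l in hexlines))
print(GzipFile(fileobj=BytesIO(unhexlify(hexdata))).read().decode("utf-8"))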

Hopefully this is useful to someone who also has to retrieve important data from Chrome’s cache…

Up again

I had some bad luck. Last week, my hosting provider performed previously announced routine maintenance on the server running this website. During this maintenance, they managed to destroy the RAID (thanks, guys). All data lost. I did not have a very recent remote system snapshot and therefore needed to re-create the entire system from scratch.

A couple of months ago, I had already put Cloudflare in front of my website, so at first I thought: hey, that gives me some time to restore things while the incident stays largely transparent to the visitors of my website. I had activated Cloudflare’s “always online” feature months ago. That’s basically just a cache that jumps in when the backend is down. But wait, what does “down” mean? Cloudflare says that “always online” is triggered when the backend sends a 502 or 504 type response. Currently, when the backend is just dead (not responsive), this cache does not serve anything at all (in a support ticket, they said that they will “add this” in the future). My server went down at night, and by the time I had a system and web server running again that also sent a 502 response in order to trigger Cloudflare’s “always online” (that was about 10 hours after the machine went down), the Cloudflare cache already assumed that my website did not exist anymore. Pah, “always online”, useless. So, I am sorry, the downtime lasted quite long. I had to put all components back together manually.

The good thing is that I had up-to-date remote backups of my WordPress database (via Dropbox) and of my nginx config and other configs (via Bitbucket), as well as a more or less up-to-date remote backup of the directory structure behind my website. While setting things up again, I used the opportunity to re-structure my website a bit. I am now running a modified version of WordPress’ Twenty Twelve theme, for a cleaner appearance and a certain responsiveness.
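For reference, the database part of such a backup setup can be as small as a daily cron-driven script along these lines. This is only a rough sketch; the database name, credentials, and paths are placeholders, not my actual configuration.

import gzip
import subprocess
from datetime import datetime

DB = "wordpress"                      # placeholder database name
TARGET = "/home/user/Dropbox/backup"  # any directory synced by Dropbox

# Dump the database and write the gzipped result into the synced directory.
dump = subprocess.check_output(
    ["mysqldump", "--user=backup", "--password=secret", DB])
path = "%s/%s-%s.sql.gz" % (TARGET, DB, datetime.now().strftime("%Y-%m-%d"))
with gzip.open(path, "wb") as f:
    f.write(dump)

Run daily from cron, this yields exactly the kind of roughly 24 hour old restore point mentioned above.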

A few hours before all data was lost, I wrote another blog post. That one was not contained in the latest (daily) WordPress database backup performed before the crash. When I realized that, the first thing I did was to cp -a my Firefox and Chrome browser caches on the machine where I had written that post. I then started digging in these caches in order to find residual pieces of the article’s content. And I found a golden piece: Chrome had cached a gzipped HTTP response containing the final version of my article, found via the chrome://cache/ list. Chrome displays the contents in a hexdump -C fashion. I copied this text and used a Python script to parse the dump, re-create the binary data, and decompress it. Based on the resulting HTML view of my article, I could quickly add it back to the WordPress system.

Some lessons learned:

  • Don’t trust scheduled routine maintenance; perform a backup *right* before it.
  • The Cloudflare cache does not help in the situation where you need it most (that is the current behavior, at least).
  • Caches can still be of essential help in such a worst-case scenario. I recovered an article from the browser cache, and I use the Google cache in order to see whether something is still missing from the new version of my website.
  • Google Webmaster Tools is pretty convenient in the sense that it informs you about crawling errors; I frequently check its interface and still discover missing pieces of my web presence (files, mostly).
  • Using (remote) code repositories for configuration stuff is the best you can do. First of all, it’s perfect bookkeeping if done properly. Secondly, if regularly pushed, it’s the best configuration backup you can have.

If you are still missing a file here and there or find some dead links, please don’t hesitate to notify me. Thanks.