Recover a gzipped HTML response from the browser cache

Recently, an incident with the server running this website resulted in a total data loss. I had set up a daily remote backup of my WordPress database (to Dropbox, i.e. Amazon S3) and was able to restore a ~24 hour old state of my blog. Unfortunately, one article that I wrote and published only a few hours before the incident was not contained in the last database backup and therefore lost for the moment.

I knew that I had checked the article’s final version from the random visitor’s perspective using Chrome right after publishing it. So the browser cache was the only hope for me to restore the article, at least in HTML form. Consequently, I immediately archived my Chrome cache for further investigation. Thankfully, with a tiny forensics exercise, I was able to retrieve the final contents of the article from a gzipped and cached HTML response. I used Python for extracting the HTML content in clear text and figured that the applied procedure is worth a small blog post on its own. In particular, I think that this is a nice example of why Python has earned the “batteries included” attribute.

I am going to lead you through this by means of an example — a 403 error page in this case, as retrieved by accessing http://gehrcke.de/files/perm/. It has the following HTML source:

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

In the moment you have accessed the above-given URL, it is usually already cached by your browser. Let’s assume that this response contains valuable data and our goal is to restore it from the cache.

Chrome’s browser cache is stored in larger binary files, it cannot be conveniently searched or queried by using file system tools or less / grep only. Chrome brings along a rudimentary tool for searching the cache: chrome://cache/. Search that list for “http://gehrcke.de/files/perm” (Strg + F) and you will find a corresponding entry. When clicking it, the details of this cache entry are displayed on a simplistic web page:

http://gehrcke.de/files/perm/
HTTP/1.1 403 Forbidden
Date: Mon, 16 Sep 2013 14:03:22 GMT
Content-Type: text/html
Content-Encoding: gzip
 
 
00000000:  9a  00  00  00  03  00  00  00  e3  cf  68  f3  16  45  2e  00  ..........h..E..
00000010:  51  3d  69  f3  16  45  2e  00  6b  00  00  00  48  54  54  50  Q=i..E..k...HTTP
00000020:  2f  31  2e  31  20  34  30  33  20  46  6f  72  62  69  64  64  /1.1 403 Forbidd
00000030:  65  6e  00  44  61  74  65  3a  20  4d  6f  6e  2c  20  31  36  en.Date: Mon, 16
00000040:  20  53  65  70  20  32  30  31  33  20  31  34  3a  30  33  3a   Sep 2013 14:03:
00000050:  32  32  20  47  4d  54  00  43  6f  6e  74  65  6e  74  2d  54  22 GMT.Content-T
00000060:  79  70  65  3a  20  74  65  78  74  2f  68  74  6d  6c  00  43  ype: text/html.C
00000070:  6f  6e  74  65  6e  74  2d  45  6e  63  6f  64  69  6e  67  3a  ontent-Encoding:
00000080:  20  67  7a  69  70  00  00  00  0d  00  00  00  33  37  2e  32   gzip.......37.2
00000090:  32  31  2e  31  39  34  2e  37  32  00  00  00  50  00          21.194.72...P.
 
 
00000000:  1f  8b  08  00  00  00  00  00  00  03  ed  8e  b1  0e  c2  30  ...............0
00000010:  0c  44  77  24  fe  c1  74  8f  02  82  31  64  41  20  31  30  .Dw$..t...1dA 10
00000020:  f1  05  49  6d  92  48  69  82  4c  24  e8  df  93  96  22  21  ..Im.Hi.L$...."!
00000030:  66  46  36  fb  ee  fc  ce  ca  97  2e  ea  f9  4c  79  32  a8  fF6.........Ly2.
00000040:  55  09  25  92  de  2c  d7  70  c8  6c  03  22  25  25  5f  a2  U.%..,.p.l."%%_.
00000050:  92  63  a4  46  6d  c6  1e  ac  6b  73  cc  bc  6d  ee  3e  14  .c.Fm...ks..m.>.
00000060:  6a  06  bd  a5  54  88  b5  f2  ab  6f  42  55  94  9c  ec  a1  j...T....oBU....
00000070:  ab  86  a6  2d  b9  90  1e  9f  9e  1c  e8  e3  f0  fe  6c  21  ...-..........l!
00000080:  04  18  b8  1a  c4  90  1c  94  0c  18  6e  c6  46  82  d3  f9  ..........n.F...
00000090:  b8  07  93  10  76  9e  73  47  70  e1  40  09  63  0f  c4  9c  ....v.sGp.@.c...
000000a0:  b9  5e  38  02  21  fe  88  5f  23  9e  f1  7a  0e  0d  34  02  .^8.!.._#..z..4.
000000b0:  00  00                                                          ..

When looking at the HTML source of this page, three pre blocks stand out:

  • The first pre block contains a formatted version of the HTTP response header.
  • The second pre block contains a hexdump of the response header.
  • The third pre block contains a hexdump of the response body.

The hexdumps are formatted in a way similar to how hexdump -C would print stuff to stdout: the first column shows an address offset, the second column shows space-separated hex representations of single bytes, the third column shows an ASCII interpretation of single bytes.

From Content-Encoding: gzip we see that this response was delivered in gzipped form by the webserver. Hence, the ASCII representation in the third column of the hexdumps is not human-readable. The programming goal now is to restore the original HTML document from this cache entry web page as displayed by Google Chrome (unfortunately, this obvious feature is not built into the browser itself). As a first step, save the cache entry web page to a file (right click, “Save As” …). I called it cache.html.

I wrote a Python script, recover.py. It reads cache.html, restores the original HTML document, and prints it to stdout:

$ python recover.py 
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

The source of recover.py is:

  1. import re
  2. from binascii import unhexlify
  3. from gzip import GzipFile
  4. from itertools import chain
  5. from StringIO import StringIO
  6.  
  7. with open("cache.html", "rb") as f:
  8.     html = f.read().decode("utf-8")
  9.  
  10. hexlines = re.findall("<pre>(.*?)</pre>", html, flags=re.S)[2].splitlines()
  11. hexdata = ''.join(chain.from_iterable(l[11:73].split() for l in hexlines))
  12. print GzipFile(fileobj=StringIO(unhexlify(hexdata))).read()

I tested this with Python 2.7. In lines 1-5, a selection of packages, classes, and functions is imported from the Python standard library. As you can already infer from the names, we have a tool at hand for the conversion from data in hex representation to raw binary data (unhexlify from the binascii module), as well as a tool for decompressing data that has previously been compressed according to the gzip file format standard. StringIO provides in-memory file handling — that way we can get around writing an actual file to disk containing the gzipped data. re is Python’s regular expression package, I am using it for extracting the contents of the third pre block, i.e. Chrome’s hexdump of the gzipped HTTP response body (as explained above).

A step-by-step walk-through:

  • In lines 7 and 8, the entire content of Chrome’s HTML representation of the cached response (cache.html) is read. Since Chrome tells us that it had encoded cache.html using the UTF-8 codec, we use the same codec to decode the file into a Python unicode object.
  • In line 10, the hexdump representation of the gzipped response body is extracted. A regular expression is used for matching the content of all pre blocks. The third of those is selected. The formatted hexdump is split into a list of single lines for further processing.
  • In line 11, the three-column formatted hexdump is converted to a raw hex representation, free of any whitespace characters. Only the middle column is extracted of each line (characters 12 to 73). Finally, all characters are concatenated to one single string.
  • In line 12, the data in hex representation is converted to binary data and written to an in-memory file object. This is treated as gzip file and decompressed. The result is printed to stdout. It is the original HTML response body as sent by the web server.

Hopefully this is useful to someone who also has to retrieve important data from Chrome’s cache…

  • Thank you sooo much! I used your code to recover a very very important post on http://blog.franceprix.fr
    But I made a change to you code so in order to work for Firefox html cache page: line 10 should end in “[0].splitlines()” instead of “[2].splitlines()”.

    The rest is perfect for Firefox, too.

    Thank you again!

  • Marjo

    (Apologies for replying to a rather old post.)

    Thanks for writing this down! However, I’ve tried to use the code, and at line 10 I get the following error:

    Traceback (most recent call last):
    File “”, line 1, in
    IndexError: list index out of range

    Any idea what might be causing this and how it can be solved? I should mention that I have no experience in python, so it might be something quite simple.
    (Also worth noting is that I’m using this for a firefox file. I tried Steve J.’s change as well, but with the same result.)

    Thanks in advance for having a look at this! Any help would be greatly appreciated.

  • Nace Sapundjiev

    Thank you! For me it wasn’t working and the fix was to make a small change at line 11:

    chain.from_iterable(l[10:59]