
Mojibake: Beatport’s ID3 text encoding is broken

Mojibake is a name for garbled text, arising from systematic errors along a text encoding-transfer-decoding chain. What does it have to do with Beatport? This:


This is a screenshot from a playlist of the VLC player, showing MP3 meta data. I downloaded the corresponding track from Beatport. Garbage is displayed where the German Umlaut “Ü” should appear. Why is that? Does the player not support the meta data version, or more specifically the meta data encoding used by Beatport MP3s?

After some investigation I found that Beatport provides MP3 files with invalid meta data. The invalid meta data is the result of a tremendously flawed text encoding procedure in the bowels of Beatport, where text is first encoded via UTF-8, the resulting raw binary data is then interpreted as a unicode code point sequence, and subsequently encoded via UTF-8 again. Horrific, and unsurprisingly the outcome is garbage. The invalid title tag shown above can easily be fixed in Python:

>>> from mutagen.id3 import ID3, TIT2
>>> data = ID3("test.mp3")
>>> corrected_title = unicode(data["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
>>> data.add(TIT2(encoding=3, text=corrected_title))

You do not need to understand that code right now. In the following paragraphs I will explain the issue step by step and slowly work towards this solution. The issue is the result of another developer (team?) not taking enough care with character encodings, although this topic is one of the most important in modern information technology, and ignorance in this regard has led to tons of bugs in a plethora of software projects. It is time to refer to Joel’s article “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” again, which you may want to read later on if you have not done so yet.

Raw data in the ID3 tag

Meta data in MP3s is usually stored within an ID3 meta data container, as explained on Wikipedia and laid out in the ID3 specification. Different versions of this container format specification are available. First of all, let us find out which ID3 tag version the MP3 files from Beatport use. I have renamed the Beatport MP3 file in question to test.mp3. The following snippet shows the first five bytes of the file:

$ hexdump -C -n 5 test.mp3
00000000  49 44 33 04 00                                    |ID3..|

Quote from here: The first three bytes of the tag are always “ID3”, to indicate that this is an ID3v2 tag, directly followed by the two version bytes. The first byte of ID3v2 version is its major version, while the second byte is its revision number. Hence, this MP3 file contains an ID3 tag in version 2.4.0.
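The same check can be done programmatically. What follows is a minimal sketch in Python 3 (the snippets in this post are Python 2), working on the five bytes from the hexdump above. The helper name parse_id3v2_version is made up for illustration and is not part of any library.

```python
def parse_id3v2_version(header):
    """Return (major, revision) of an ID3v2 header, or None if the
    data does not start with the "ID3" marker."""
    if len(header) < 5 or header[:3] != b"ID3":
        return None
    # In Python 3, indexing a bytes object yields integers.
    return header[3], header[4]


# The five bytes shown by hexdump above.
first_bytes = bytes.fromhex("49 44 33 04 00")
major, revision = parse_id3v2_version(first_bytes)
print("ID3v2.%d.%d" % (major, revision))
```

To apply this to a real file, read the first five bytes in binary mode, e.g. with open("test.mp3", "rb").read(5).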

The ID3 data is composed of frames. For example, the so-called TIT2 frame is designed to contain the track title. I have used hexdump to look for that frame within the first kilobytes of the MP3 file (the ID3 tag may also contain image data, so the size of the entire ID3v2 container can be several kilobytes). The following partial dump shows all the bytes belonging to the TIT2 frame in this file, as well as some bytes before and after it.

00004900  49 54 31 00 00 00 08 00  00 03 4b 6f 6d 70 61 6b  |IT1.......Kompak|
00004910  74 54 49 54 32 00 00 00  1d 00 00 03 c3 83 c2 9c  |tTIT2...........|
00004920  62 65 72 73 70 72 75 6e  67 20 28 4f 72 69 67 69  |bersprung (Origi|
00004930  6e 61 6c 20 4d 69 78 29  54 4b 45 59 00 00 00 05  |nal Mix)TKEY....|

Text encoding in ID3 v2.4.0

It is clear that the above dump contains the track title in encoded form (there always is some kind of text encoding, there is no such thing as plain text, this should not surprise you). What is the exact format of the piece of data shown above? Which character encodings does the ID3 v2.4.0 specification allow for? Is the encoding itself specified in the file? Let’s have a look at the specification, these are relevant parts:

   All ID3v2 frames consists of one frame header followed by one or more
   fields containing the actual information. The header is always 10
   bytes and laid out as follows:

     Frame ID      $xx xx xx xx  (four characters)
     Size      4 * %0xxxxxxx
     Flags         $xx xx

   The frame ID is followed by a size descriptor containing the size of
   the data in the final frame, after encryption, compression and
   unsynchronisation. The size is excluding the frame header ('total
   frame size' - 10 bytes) and stored as a 32 bit synchsafe integer.
   In the frame header the size descriptor is followed by two flag
   bytes. These flags are described in section 4.1.

What follows is the isolated frame data, i.e. all raw bytes belonging to the TIT2 frame (nothing else prepended or appended):

   54 49 54 32 00 00 00  1d 00 00 03 c3 83 c2 9c  |TIT2...........|
62 65 72 73 70 72 75 6e  67 20 28 4f 72 69 67 69  |bersprung (Origi|
6e 61 6c 20 4d 69 78 29                           |nal Mix)|

  • Frame ID: 54 49 54 32. This is the TIT2 label, indicating that this is the frame containing information about the track title.
  • Size: 00 00 00 1d. This is 29 (Python: int("0x1d", 0)). You can count for yourself: there are 39 bytes shown in the dump above, and the ID3 specification says that the frame size is the total frame size minus 10 bytes, so that fits.
  • Flags: 00 00. No flags.
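The three header fields can also be pulled apart in code. This is a minimal Python 3 sketch that operates on the 39 frame bytes from the dump above. The function names are my own, and the sketch ignores everything the flags could signal (compression, encryption, unsynchronisation). Note that the size is a synchsafe integer: only the lower seven bits of each byte count, which makes no difference for a small value such as 0x1d.

```python
def decode_synchsafe(size_bytes):
    """Decode a synchsafe integer: only the lower 7 bits of each
    byte carry data, the top bit is always zero."""
    value = 0
    for byte in size_bytes:
        value = (value << 7) | (byte & 0x7F)
    return value


def parse_frame_header(frame):
    """Split the 10-byte ID3v2.4 frame header into its three fields."""
    frame_id = frame[0:4].decode("ascii")
    size = decode_synchsafe(frame[4:8])
    flags = frame[8:10]
    return frame_id, size, flags


# The 39 bytes of the TIT2 frame from the dump above.
frame = bytes.fromhex(
    "54 49 54 32 00 00 00 1d 00 00 03 c3 83 c2 9c"
    "62 65 72 73 70 72 75 6e 67 20 28 4f 72 69 67 69"
    "6e 61 6c 20 4d 69 78 29"
)
frame_id, size, flags = parse_frame_header(frame)
print(frame_id, size, flags, len(frame) - 10 == size)
```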

What about text encoding? This is specified in section 4.2 of the specification:

   All the text information frames have the following format:

     <Header for 'Text information frame', ID: "T000" - "TZZZ",
     excluding "TXXX" described in 4.2.6.>
     Text encoding                $xx
     Information                  <text string(s) according to encoding>

The specification also informs us about possible encodings:

   Frames that allow different types of text encoding contains a text
   encoding description byte. Possible encodings:
     $00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
     $01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
           strings in the same frame SHALL have the same byteorder.
           Terminated with $00 00.
     $02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
           Terminated with $00 00.
     $03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.

In the raw data above, after frame type, size and flags we see a 03 byte. According to the specification above, this byte means that the following text is encoded in UTF-8. Hence, the file itself tells us that it contains the title tag encoded in UTF-8.

What follows is the byte representation of the title text, extracted from the dump shown above (frame header and text encoding marker removed). It is important to note that the following byte sequence has been created by Beatport (bytes shown in hex representation, as before):

c3 83 c2 9c 62 65 72 73 70 72 75 6e 67 20
28 4f 72 69 67 69 6e 61 6c 20 4d 69 78 29

Now, just decode this raw byte sequence using the UTF-8 codec and we have our title, right? Let’s see.

Decoding the raw title data: something is wrong.

Using the \x prefix, we can easily get the raw data just shown (which should encode the title text) into a Python (2) byte string:

>>> raw = "\xc3\x83\xc2\x9c\x62\x65\x72\x73\x70\x72\x75\x6e\x67\x20\x28\x4f\x72\x69\x67\x69\x6e\x61\x6c\x20\x4d\x69\x78\x29"

The ID3 tag itself makes us believe that the original text has been encoded using UTF-8, so in order to retrieve the original text, this operation needs to be inverted. This is easily done in Python, by calling the decode() method on a byte string, providing the codec to be used:

>>> raw.decode("utf-8")
u'\xc3\x9cbersprung (Original Mix)'

The data type returned by this operation is a unicode string, i.e. a sequence of characters, not bytes. And this sequence of characters looks flawed. What is that \xc3\x9c thing there, actually? Does it make sense? To be clarified in the next section.
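For readers on Python 3, the same decoding step can be sketched as follows, this time driven by the encoding byte instead of a hard-coded codec. The mapping follows the encoding list quoted from the specification; the names ID3_TEXT_CODECS and decode_text_frame_body are hypothetical, and the sketch ignores string terminators and frames holding several strings.

```python
# Text encoding byte -> Python codec name, following the list in the
# ID3v2.4.0 specification quoted above.
ID3_TEXT_CODECS = {
    0x00: "iso-8859-1",
    0x01: "utf-16",      # the BOM decides the byte order
    0x02: "utf-16-be",
    0x03: "utf-8",
}


def decode_text_frame_body(body):
    """Decode the body of a text information frame (everything after
    the 10-byte frame header) into a Python string."""
    codec = ID3_TEXT_CODECS[body[0]]
    return body[1:].decode(codec)


# Encoding byte 03 followed by the 28 title bytes from the dump.
body = bytes.fromhex(
    "03 c3 83 c2 9c 62 65 72 73 70 72 75 6e 67 20"
    "28 4f 72 69 67 69 6e 61 6c 20 4d 69 78 29"
)
title = decode_text_frame_body(body)
# ascii() escapes non-ASCII characters, which makes the two
# unexpected code points visible on any terminal.
print(ascii(title))
```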

Reverse-engineering the issue

First, let us verify what happened here. We decoded a raw byte sequence via UTF-8 and retrieved two weird unicode code points in the output. This is the inverse process, starting from the two unexpected unicode code points C3 and 9C:

>>> u"\xc3\x9c".encode("utf-8")
'\xc3\x83\xc2\x9c'

The Python code above defines a sequence of unicode code points, and then encodes this “text” using UTF-8, yielding the very same byte sequence contained in the Beatport ID3 raw data which we have seen before. Now we know which “text” they encoded in order to create the meta data in the file they provide for download. But what is that text? We are still missing the German umlaut Ü here, aren’t we? Let us look at the common character representation of these code points:

>>> print u"\xc3\x9c"

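By having a look at a Unicode code chart, we can clarify what the code points C3 and 9C really represent:

U+00C3  LATIN CAPITAL LETTER A WITH TILDE (Ã)
U+009C  STRING TERMINATOR (a control character)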

The print statement above attempted to display these characters on my terminal. The A with tilde appears as expected, followed by a rectangle (you might or might not see that here), representing a control character.

So now we have identified the actual text that Beatport encoded as UTF-8 and saved in the file as raw byte sequence. The VLC player in the figure at the top is behaving correctly: it decodes this byte sequence using UTF-8 and just displays the resulting characters: the A with the tilde and the control character, which has no glyph, and which is therefore represented with a rectangle.
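The nature of these two code points can also be checked with Python's standard unicodedata module. A small Python 3 sketch; it only relies on unicodedata.name() and unicodedata.category(), both part of the standard library:

```python
import unicodedata

for char in "\u00c3\u009c":
    # Control characters have no name in the Unicode database, hence
    # the fallback value passed as second argument.
    name = unicodedata.name(char, "<no name>")
    category = unicodedata.category(char)
    print("U+%04X  category=%s  name=%s" % (ord(char), category, name))
```

The category Lu stands for an uppercase letter, while Cc marks a control character, which is why a player has no glyph to draw for the second one.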

The question left is: why does Beatport encode invalid text in the first place?

The magic of encoding text multiple times.

When you regularly deal with character encodings you probably have an idea already. I had a suspicion. The correct title text starts with a capital German Umlaut Ü. The unicode codepoint for Ü actually is 00DC. What is the raw byte sequence representation of this code point when using the UTF-8 codec?

>>> u"Ü".encode("utf-8")
'\xc3\x9c'
>>> u"\xdc".encode("utf-8")
'\xc3\x9c'

Right. It is c3 9c in hex notation. You have seen that a minute ago. Confused? Above, we learned that code points C3 and 9C were considered part of the original text, which was then encoded to its UTF-8 representation, i.e. the UTF-8 representations of the characters U+00C3 and U+009C ended up in the raw data. Now, we have learned that the two bytes c3 9c actually encode the character U+00DC in UTF-8. Still confused?


The original text was encoded twice: the raw byte string resulting from the first encoding was erroneously interpreted as a unicode code point sequence and then encoded again.

Reproduction of Beatport’s broken text encoding

Let us reproduce this step by step. First, we encode U+00DC (the German Umlaut Ü) to UTF-8:

>>> u"\xdc".encode("utf-8")
'\xc3\x9c'

Now it is time to go into detail of defining unicode literals in Python 2: with the u in front of the literal, Python is instructed to parse the characters in the literal as unicode code points. One code point can be given with different methods. The first 256 unicode code points (there are many more!) can be given in hex notation. This is what happens above, the \xdc is the U+00DC code point in hex notation.

The output of the above call to encode() is a raw byte string, where the bytes are shown in hex notation. Now we can go ahead and attach a u in front of the raw byte string. This little prefix fundamentally changes the meaning of this string literal. Now, the hex notation does not describe single raw bytes anymore, it describes unicode code points. The two resulting entities are entirely unrelated:

>>> print '\xc3\x9c'
>>> print u'\xc3\x9c'

The outputs of the two statements have nothing meaningful in common, by concept. The first is a byte string, implicitly decoded via the UTF-8 codec by my terminal (careful, that is magic!). The second is a sequence of two unicode code points.

This is like saying “hey, give me item number 156 and 195 from that shelf there, and then also give me number 156 and 195 from the other shelf over there”, whereas the shelves contain entirely different things. All these two statements have in common is the way the “numbers” are represented in hex notation.

It does not matter which programming language Beatport is using for creating the ID3 meta data, but somehow they managed to do a very weird thing: after having the text encoded in UTF-8 (technically it could also have been Latin-1, as Thomas pointed out in the comments, but that is not that likely), they

  • re-interpret that binary data (most likely in hex representation) as a unicode code point sequence
  • and re-encode this unicode code point sequence with UTF-8.

With our small example, this is the process:

# Encode the text with UTF-8.
>>> u"Ü".encode("utf-8")
'\xc3\x9c'
# Take the hex representation of the raw byte sequence and
# re-interpret it as unicode code point sequence. Encode this
# with UTF-8 again.
>>> u'\xc3\x9c'.encode("utf-8")
'\xc3\x83\xc2\x9c'

The latter is exactly the invalid raw byte sequence found in the ID3 meta data of Beatport’s MP3 file. The last step in reproducing the entire encoding-transfer-decoding chain is to do what an MP3 player would do: decode that data using UTF-8 and display the corresponding characters:

>>> print '\xc3\x83\xc2\x9c'.decode("utf-8")

The above is exactly what happens within e.g. VLC player or any other player that properly parses the ID3 tag data.
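In Python 3, byte strings and text are strictly separate types, so the trick of putting a u in front of a byte string literal is not available. The misinterpretation can still be modelled: decoding with Latin-1 maps every byte value N to the code point U+00NN, which is exactly the "bytes taken as code points" step. A minimal sketch of the whole chain under that assumption (how Beatport's actual pipeline looks is unknown):

```python
original = "\u00dcbersprung (Original Mix)"   # starts with Ü

# Step 1: correct UTF-8 encoding.
utf8_bytes = original.encode("utf-8")

# Step 2 (the mistake): treat each byte value as a code point.
# Decoding with Latin-1 does exactly that, byte N -> U+00NN.
misread = utf8_bytes.decode("latin-1")

# Step 3: encode the misread text with UTF-8 again.
double_encoded = misread.encode("utf-8")

# Hex representation of the first four bytes, to compare with the
# raw ID3 data shown earlier.
print(double_encoded[:4].hex())
```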

Indeed, this is Beatport’s fault. Within the entire process of text processing, one needs to be aware of the actual representation of the text. At some point in Beatport’s text processing, a developer assumed text to be a Unicode sequence object, whereas it really was a UTF-8-encoded byte string. The conceptual problem is: never make assumptions about the text representation in your code. Always take control of the data and be 100 % sure about the type of text data you are handling.

Otherwise millions of MP3 downloads will be erroneous.

A systematic fix based on raw_unicode_escape

The process that led to the erroneous raw byte sequence is now well-understood. Fortunately, this process does not involve any loss of information. The information is just in bad shape. With the help of some Python magic we can invert that process.

The issue is that the byte sequence \xc3\x9c was interpreted as unicode code point sequence, yielding the raw byte sequence \xc3\x83\xc2\x9c after encoding. The Python codec raw_unicode_escape can invert this (kudos to this SO thread):

>>> u'\xc3\x9c'.encode('raw_unicode_escape')
'\xc3\x9c'

Couldn’t we just have taken away the u? Yes. It is that simple. Manually. But using .encode('raw_unicode_escape') is a straight-forward automatic procedure to achieve the same effect: keep the item representation, change the item meaning from unicode code points to raw bytes. Note that this only holds for code points below 256, which is all that can show up in text mangled this way.

Likewise, the invalid raw byte sequence can be fixed using this technique:

>>> raw = '\xc3\x83\xc2\x9c'
# Decode the byte sequence to a unicode object.
>>> raw.decode("utf-8")
u'\xc3\x9c'
# Encode this unicode object, while keeping the item "numbering".
# This yields the UTF-8-encoded text as it was before Beatport
# corrupted it.
>>> raw.decode("utf-8").encode('raw_unicode_escape')
'\xc3\x9c'
# Decode that text.
>>> raw.decode("utf-8").encode('raw_unicode_escape').decode("utf-8")
u'\xdc'

As you remember, the code point U+00DC is the Ü. Great! All mangled together, and printed:

>>> print '\xc3\x83\xc2\x9c'.decode("utf-8").encode('raw_unicode_escape').decode("utf-8")
Ü

Yes, that’s it: the Ü is restored from the invalid byte sequence, using the knowledge derived above.
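The same chain carries over to Python 3 unchanged, apart from the b prefix for byte literals. The sketch below also compares raw_unicode_escape with the Latin-1 codec: for code points below 256 both produce identical bytes, but they behave differently above that range. The variable names are only for illustration.

```python
broken = b"\xc3\x83\xc2\x9c"

# Same chain as above, in Python 3 syntax.
fixed = broken.decode("utf-8").encode("raw_unicode_escape").decode("utf-8")
# ascii() keeps the output independent of the terminal encoding.
print(ascii(fixed))

# For code points below 256, Latin-1 yields the same bytes.
text = broken.decode("utf-8")
same = text.encode("latin-1") == text.encode("raw_unicode_escape")

# Above U+00FF the two differ: raw_unicode_escape emits a literal
# backslash escape, whereas Latin-1 raises UnicodeEncodeError.
escaped = "\u20ac".encode("raw_unicode_escape")
print(same, escaped)
```

The Latin-1 variant has the advantage of failing loudly when a string contains characters that cannot stem from this kind of double encoding.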

Fix the title in an MP3 file using Mutagen

There is an awesome Python module called Mutagen for handling audio file meta data. First of all, let us use Mutagen for directly and comfortably accessing the title data in our MP3 file:

>>> from mutagen.id3 import ID3
>>> data = ID3("test.mp3")
>>> title = data["TIT2"]
>>> title
TIT2(encoding=3, text=[u'\xc3\x9cbersprung (Original Mix)'])
>>> unicode(title)
u'\xc3\x9cbersprung (Original Mix)'

In the above code, unicode(title) yields the same as raw.decode("utf-8") in the section before. Starting from there, we can apply our systematic fix. Loading a Beatport MP3 file, retrieving the title tag, and generating the proper title text in one line:

>>> print unicode(ID3("test.mp3")["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
Übersprung (Original Mix)

All in all, load an MP3 file, generate the corrected title from the invalid one, and save the corrected title back to the file:

>>> from mutagen.id3 import ID3, TIT2
# Load ID3 meta data from MP3 file.
>>> data = ID3("test.mp3")
# Build corrected title.
>>> corrected_title = unicode(data["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
# Update ID3 data object with corrected title.
>>> data.add(TIT2(encoding=3, text=corrected_title))
# Write updated ID3 data to MP3 file.
>>> data.save()

After pulling that file into the player, we see that the title issue is fixed:


How to fix all ID3 text frames in all files.

We could now assume that Beatport is making the same mistake with all ID3 text frames. Actually, I have seen invalid Artist strings. Obviously, the task would then be to iterate through a collection of files, and for each file iterate through all ID3 text frames and fix them as shown above. Since I am not sure about the assumption stated before, I will not show the corresponding code here. I think you will manage to do that in case you have a collection of broken files from Beatport and know at least some Python. If not, it is a good exercise :-). But back up your MP3 files first!
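One way to hedge against that uncertainty is to repair a string only if the round trip actually succeeds, and to leave it alone otherwise. The following Python 3 sketch does not touch any files and does not use Mutagen; it only shows the per-string decision. The function name and the example frame contents (apart from the title) are made up for illustration.

```python
def repair_double_encoding(text):
    """Return the repaired text if `text` looks double-encoded,
    otherwise return it unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a code point above U+00FF or bytes that are not
        # valid UTF-8: the text was not mangled in this way.
        return text


# Hypothetical frame contents, keyed by ID3 frame ID.
frames = {
    "TIT2": "\u00c3\u009cbersprung (Original Mix)",  # broken title
    "TPE1": "M\u00fcller",                           # already correct
    "TIT1": "Kompakt",                               # plain ASCII
}
repaired = {key: repair_double_encoding(value) for key, value in frames.items()}
for key in sorted(repaired):
    print(key, ascii(repaired[key]))
```

This is a heuristic: a correct string that happens to consist of characters forming valid UTF-8 when taken as bytes would be altered as well, so a backup remains mandatory.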



Are you using matplotlib for Python from time to time? You might want to have a look at prettyplotlib! In short: it makes your plots appear more harmonic and comfortable by using modern design standards and color schemes (cf. colorbrewer). These styles are nice for publishing plots on a website or in a magazine, they obviously do not add scientific value.

Technically, prettyplotlib wraps matplotlib + brewer2mpl.

Thermal energy in calories

A short note. In chemistry and related fields we quite often describe energies in units of calories, instead of using Joules from the SI system. From physics, we know that the thermal energy at room temperature is about 25 meV. The question is: how does the thermal energy relate to an interaction energy (e.g. between two molecules) if that energy is provided in units of kcal/mol?

The thermal energy in Joule is k_B*T. That energy is “accessible” to any given molecule, i.e. for comparison with a value normalized to one mole (as an energy given in units of kcal/mol obviously is) we need to multiply it by the Avogadro constant A, yielding the thermal energy in Joule/mol. One kcal (1000 calories) is about 4190 Joule, i.e. the thermal energy in kcal/mol is k_B * T * A / 4190. Given that, we can relate any given interaction energy to the thermal energy. A quick Python implementation:

import sys
# Thermal energy for 300 K:
# T_300 = 4.14 * 10^-21 J
# (k_B * 300 K, with k_B = 1.38 * 10^-23 J/K)
# The relation between Joule and (kilo)calories:
joule_per_kcal = 4190
# Thermal energy in kcal/mol:
# T_300_kcalpermol = T_300 * A / joule_per_kcal
# with A = 6.02 * 10^23 (Avogadro constant)
T_300_kcalpermol = 4.14 * 6.02 * 100 / joule_per_kcal
print "Thermal energy (at 300 K): %.2f kcal/mol" % (
    T_300_kcalpermol)
# Read actual energy in kcal/mol.
energy_kcalpermol = float(sys.argv[1])
energy_per_thermal_energy = energy_kcalpermol / T_300_kcalpermol 
print "%.3f kcal/mol divided by the thermal energy: %.1f" % (
    energy_kcalpermol, energy_per_thermal_energy)

So how does an interaction energy of 2 kcal/mol relate to the thermal energy?

$ python 2
Thermal energy (at 300 K): 0.59 kcal/mol
2.000 kcal/mol divided by the thermal energy: 3.4

First of all, the thermal energy is 0.59 kcal/mol, something to keep in mind when dealing with kcal/mol-based energy values on a regular basis. We learn that an interaction energy of 2 kcal/mol is already more than three times larger than the thermal energy, i.e. this kind of interaction may easily dominate diffusion and stands out from the thermal noise.
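The calculation generalizes easily to other temperatures. Below is a small Python 3 sketch using the exact SI values of the constants and the thermochemical calorie (4184 J per kcal) instead of the rounded numbers used in the script above; the function names are my own. It also converts k_B*T to meV, which recovers the "about 25 meV" rule of thumb.

```python
K_B = 1.380649e-23                   # Boltzmann constant in J/K
N_A = 6.02214076e23                  # Avogadro constant in 1/mol
JOULE_PER_KCAL = 4184.0              # thermochemical calorie
ELEMENTARY_CHARGE = 1.602176634e-19  # in C, converts J to eV


def thermal_energy_kcal_per_mol(temperature_kelvin):
    """k_B * T per mole, expressed in kcal/mol."""
    return K_B * temperature_kelvin * N_A / JOULE_PER_KCAL


def thermal_energy_mev(temperature_kelvin):
    """k_B * T per particle, expressed in meV."""
    return K_B * temperature_kelvin / ELEMENTARY_CHARGE * 1000.0


kt = thermal_energy_kcal_per_mol(300.0)
print("kT at 300 K: %.2f kcal/mol, %.1f meV" % (kt, thermal_energy_mev(300.0)))
print("2 kcal/mol is %.1f kT" % (2.0 / kt))
```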

Travis CI finally supports Python 3.4

Python 3.4 was released over a month ago. According to this announcement, Travis CI will finally support Python 3.4 in only a few hours. This has been long awaited by the community, given the many "+1" postings in Travis CI issue 1989 and the countless "Add 3.4 to .travis.yml"-style commit messages referencing this issue.

Many believe that Python 3.4 will be the breakthrough for Python 3 and we can expect it to become quite popular. Although Python 2.7 security and bug fixes have recently been “guaranteed” for up to 2020 by Guido, I got the impression that the dominance of Python 2.7 is finally decreasing, slowly but steadily. For developers in the open source community this means that Python 3.4 compatibility is an important target to aim for now (you might even want to ignore all releases up to 3.3).

By the way, Ubuntu has made Python 3.4 the default Python 3 in their recently released 14.04 LTS (which will be supported for 5 years). They even considered shipping it as the default Python, which they did not do in the end. Their recommendation, however, is

“to best support future versions of Ubuntu you should consider porting your code to Python 3”

So, go ahead, use the great Travis CI and make your code run on both Python 2.7 and Python 3.4!

Presenting timegaps, a tool for thinning out your data

I have released timegaps, a command line program for sorting a set of items into rejected and accepted ones, based on the age of each item and user-given time categorization rules. While this general description sounds quite abstract, the concept is simple to grasp considering timegaps’ main use case (quote from the readme file):

Timegaps allows for thinning out a collection of items, whereas the time gaps between accepted items become larger with increasing age of items. This is useful for keeping backups “logarithmically” distributed in time, e.g. one for each of the last 24 hours, one for each of the last 30 days, one for each of the last 8 weeks, and so on.

A word in advance: I would very much appreciate your feedback on timegaps. And, if you like it, spread the word. Thanks!

Motivation: simple implementation of backup retention policies

Backup strategies must be very well thought through. An important question is at which point old backups are to be deleted or, in other words, which old backups are to be kept for how long. This is generally implemented as a so-called data retention policy — a quote from Wikipedia (Backup):

The secondary purpose of backups is to recover data from an earlier time, according to a user-defined data retention policy, typically configured within a backup application for how long copies of data are required.

Why is the implementation of such a policy important? Obviously, storing all periodically (e.g. daily) created snapshots wastes valuable storage space. A backup retention policy allows you to determine precisely which snapshots will be kept for how long. It allows users to find a trade-off between data restoration needs and the cost of backup storage.

People usually implement an automatic backup solution which takes periodic snapshots/backups of a certain data repository. Additionally, unless the data is very small compared to the available backup space, the user has to implement a retention policy which automatically deletes old backups. At this point, people unfortunately tend to take the simplest possible approach and automatically delete snapshots older than X days. This is easily implemented using standard command line tools. However, a clearly more sophisticated and safer backup retention strategy is to also keep very old backups, just not all of them.

An obvious solution is to retain backups “logarithmically” distributed in time. The well-established backup solution rsnapshot does this. It creates a structure of hourly / daily / weekly / ... snapshots on the fly. Unfortunately, other backup approaches often lack such a fine-grained logic for eliminating old backups, and people tend to hack simple filters themselves. Furthermore, even rsnapshot is not able to post-process and thin out an existing set of snapshots. This is where timegaps comes in: you can use the backup solution of your choice for periodically (e.g. hourly) creating a snapshot. You can then — independently and at any time — process this set of snapshots with timegaps and identify those snapshots that need to be eliminated (removed or displaced) in order to maintain a certain “logarithmic” distribution of snapshots in time. This is the main motivation behind timegaps, but of course you can use it for filtering any kind of time-dependent data.

Usage example

Consider the following situation: all *.tar.gz files in the current working directory happen to be daily snapshots of something. The task is to accept one snapshot for each of the last 20 days, one for each of the last 8 weeks, and one for each of the last 12 months, and to move all others to the directory notneededanymore. Using timegaps, this is a simple task:

$ mkdir notneededanymore
$ timegaps --move notneededanymore days20,weeks8,months12 *.tar.gz
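To make the idea behind such rules more tangible, here is a deliberately simplified Python 3 sketch of this kind of thinning. It is an illustration of the concept only and not the algorithm timegaps implements: the function name, the bucket definition (age divided by bucket length, newest item per bucket wins) and the approximation of a month by 30 days are all my assumptions for this example.

```python
def thin_out(items, now, rules):
    """Split (name, mtime) pairs into accepted and rejected lists.

    `rules` maps a bucket length in seconds to the number of buckets
    to fill. Within each bucket only the newest item is accepted.
    """
    newest = {}  # (bucket length, bucket index) -> (name, mtime)
    for name, mtime in items:
        age = now - mtime
        if age < 0:
            continue  # item from the future: never accepted here
        for length, count in rules.items():
            index = age // length
            if index >= count:
                continue  # too old for this category
            key = (length, index)
            if key not in newest or mtime > newest[key][1]:
                newest[key] = (name, mtime)
    accepted_names = {name for name, _ in newest.values()}
    accepted = [item for item in items if item[0] in accepted_names]
    rejected = [item for item in items if item[0] not in accepted_names]
    return accepted, rejected


DAY = 86400
# Rough equivalent of days20,weeks8,months12 (month = 30 days).
rules = {DAY: 20, 7 * DAY: 8, 30 * DAY: 12}

# 400 made-up daily snapshots; the newest one is half a day old.
now = 1000000000
items = [("snap-%03d.tar.gz" % i, now - i * DAY - DAY // 2) for i in range(400)]
accepted, rejected = thin_out(items, now, rules)
print(len(accepted), "accepted,", len(rejected), "rejected")
```

The rejected list corresponds to what the command above would move to notneededanymore; for real data, use timegaps itself rather than this sketch.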


Design goals and development notes

Timegaps aims to be a slick, simple, reliable command line tool, ready to be applied in serious system administration workflows that actually touch data. It follows the Unix philosophy, has a well-defined command line interface, and well-defined behavior with respect to stdin, stdout, stderr and its exit code, so I expect it to be applied in combination with other command line tools such as find. You should head over to the project page to see more usage examples and a detailed specification.

The timegaps Python code runs on both Unix and Windows, as well as on both Python 2 and 3. The same code base is used in all environments, so no automatic 2to3 conversion is involved. I made an effort to make the program support unicode command line arguments on Windows at the same time as byte string paths on Unix, so I am pretty sure that timegaps works well with all kinds of exotic characters in your file names. The program respects the PYTHONIOENCODING environment variable when reading items from stdin and when writing items to stdout. That way, the user has full control over item decoding and encoding.

For general quality assurance and testing the stability of behavior, timegaps is continuously checked against two classes of unit tests:

  • API tests, testing internally used functionality, such as the time categorization logic. Some tests are fed with huge random input data sets, and the output is checked against what is statistically expected.

  • Command line interface (CLI) tests, testing the program from the user’s perspective. To that end, I have started implementing a Python CLI testing framework. Currently, it is included in the timegaps code repository. At some point, I will probably create an independent open source project from that.