Mojibake: Beatport’s ID3 text encoding is broken

Mojibake is a name for garbled text, arising from systematic errors along a text encoding-transfer-decoding chain. What does it have to do with Beatport? This:

vlc_mojibake

This is a screenshot from a playlist of the VLC player, showing MP3 meta data. I downloaded the corresponding track from Beatport. Garbage is displayed where the German Umlaut “Ü” should appear. Why is that? Does the player not support the meta data version, or more specifically the meta data encoding used by Beatport MP3s?

After some investigation I found that Beatport provides MP3 files with invalid meta data. The invalid meta data is the result from a tremendously flawed text encoding procedure in the bowels of Beatport, where text is first encoded via UTF-8, the resulting raw binary data then is interpreted as a unicode code point sequence, and subsequently encoded via UTF-8 again. Horrific, and unsurprisingly the outcome is garbage. The invalid title tag shown above can easily be fixed in Python:

>>> from mutagen.id3 import ID3, TIT2
>>> data = ID3("test.mp3")
>>> corrected_title = unicode(data["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
>>> data.add(TIT2(encoding=3, text=corrected_title))
>>> data.save()

You do not need to understand that code right now. In the following paragraphs I will explain the issue step by step and slowly work towards this solution. The issue is a result of another developer (team?) not taking enough care of character encodings, although in fact this topic is one of the most important topics in modern information technology, and ignorance in this regard has led to tons of bugs in a plethora of software projects. It is time to refer to Joel’s article “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” again, which you may want to read later on, if you did not so far.

Raw data in the ID3 tag

Meta data in MP3s is usually stored within an ID3 meta data container, as explained on Wikipedia and specified on id3.org. Different versions of this container format specification are available. First of all, let us find out which ID3 tag version the MP3 files from Beatport use. I have renamed the Beatport MP3 file in question to test.mp3. The following snippet shows the first five bytes of the file:

$ hexdump -C -n 5 test.mp3
00000000  49 44 33 04 00                                    |ID3..|

Quote from here: The first three bytes of the tag are always “ID3”, to indicate that this is an ID3v2 tag, directly followed by the two version bytes. The first byte of ID3v2 version is its major version, while the second byte is its revision number. Hence, this MP3 file contains an ID3 tag in version 2.4.0.

The ID3 data is comprised of frames. For example, the so-called TIT2 frame is designed to contain the track title. I have used hexdump to look for that frame within the first kilobytes of the MP3 file (the ID3 tag may also contain image data, so the size of the entire ID3v2 container can be several kilobytes). The following partial dump shows all the bytes belonging to the TIT2 frame in this file, as well as some stuff before and behind that.

00004900  49 54 31 00 00 00 08 00  00 03 4b 6f 6d 70 61 6b  |IT1.......Kompak|
00004910  74 54 49 54 32 00 00 00  1d 00 00 03 c3 83 c2 9c  |tTIT2...........|
00004920  62 65 72 73 70 72 75 6e  67 20 28 4f 72 69 67 69  |bersprung (Origi|
00004930  6e 61 6c 20 4d 69 78 29  54 4b 45 59 00 00 00 05  |nal Mix)TKEY....|

Text encoding in ID3 v2.4.0

It is clear that the above dump contains the track title in encoded form (there always is some kind of text encoding, there is no such thing as plain text, this should not surprise you). What is the exact format of the piece of data shown above? Which character encodings does the ID3 v2.4.0 specification allow for? Is the encoding itself specified in the file? Let’s have a look at the specification, these are relevant parts:

   All ID3v2 frames consists of one frame header followed by one or more
   fields containing the actual information. The header is always 10
   bytes and laid out as follows:
 
     Frame ID      $xx xx xx xx  (four characters)
     Size      4 * %0xxxxxxx
     Flags         $xx xx
 
[...]
 
The frame ID is followed by a size descriptor containing the size of
   the data in the final frame, after encryption, compression and
   unsynchronisation. The size is excluding the frame header ('total
   frame size' - 10 bytes) and stored as a 32 bit synchsafe integer.
 
   In the frame header the size descriptor is followed by two flag
   bytes. These flags are described in section 4.1.

What follows is the isolated frame data, i.e. all raw bytes belonging to the TIT2 frame (nothing else prepended or appended):

   54 49 54 32 00 00 00  1d 00 00 03 c3 83 c2 9c  |TIT2...........|
62 65 72 73 70 72 75 6e  67 20 28 4f 72 69 67 69  |bersprung (Origi|
6e 61 6c 20 4d 69 78 29                           |nal Mix)|
  • Frame ID: 54 49 54 32. This is the TIT2 label, indicating that this is the frame containing information about the track title.
  • Size: 00 00 00 1d. This is 29 (Python: int("0x1d", 0)). You can count for yourself, there are 39 bytes shown in the dump above, and the ID3 specification says that the frame size is the total frame size minus 10 bytes, so that fits.
  • Flags: 00 00. No flags.

What about text encoding? This is specified in section 4.2 of http://id3.org/id3v2.4.0-frames:

All the text information frames have the following format:
 
     <Header for 'Text information frame', ID: "T000" - "TZZZ",
     excluding "TXXX" described in 4.2.6.>
     Text encoding                $xx
     Information                  <text string(s) according to encoding>

http://id3.org/id3v2.4.0-structure informs us about possible encodings:

Frames that allow different types of text encoding contains a text
   encoding description byte. Possible encodings:
 
     $00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
     $01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
           strings in the same frame SHALL have the same byteorder.
           Terminated with $00 00.
     $02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
           Terminated with $00 00.
     $03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.

In the raw data above, after frame type, size and flags we see a 03 byte. According to the specification above, this byte means that the following text is encoded in UTF-8. Hence, the file itself tells us that it contains the title tag encoded in UTF-8.

What follows is the byte representation of the title text, extracted from the dump shown above (frame header and text encoding marker removed). It is important to note that the following byte sequence has been created by Beatport (bytes shown hex representation, as before):

c3 83 c2 9c 62 65 72 73 70 72 75 6e 67 20
28 4f 72 69 67 69 6e 61 6c 20 4d 69 78 29

Now, just decode this raw byte sequence using the UTF-8 codec and we have our title, right? Let’s see.

Decoding the raw title data: something is wrong.

Using the \x prefix, we can easily get the raw data just shown (which should encode the title text) into a Python (2) byte string:

>>> raw = "\xc3\x83\xc2\x9c\x62\x65\x72\x73\x70\x72\x75\x6e\x67\x20\x28\x4f\x72\x69\x67\x69\x6e\x61\x6c\x20\x4d\x69\x78\x29"

The ID3 tag itself makes us believe that the original text has been encoded using UTF-8, so in order to retrieve the original text, this operation needs to be inverted. This is easily done in Python, by calling the decode() method on a byte string, providing the codec to be used:

>>> raw.decode("utf-8")
u'\xc3\x9cbersprung (Original Mix)'

The data type returned by this operation is a unicode string, i.e. a sequence of characters, not bytes. And this sequence of characters looks flawed. What is that \xc3\x9c thing there, actually? Does it make sense? To be clarified in the next section.

Reverse-engineering the issue

First, let us verify what happened here. We decoded a raw byte sequence via UTF-8 and retrieved two weird unicode code points in the output. This is the inverse process, starting from the two unexpected unicode code points C3 and 9C:

>>> u"\xc3\x9c".encode("utf-8")
'\xc3\x83\xc2\x9c'

The Python code above defines a sequence of unicode code points, and then encodes this “text” using UTF-8, yielding the very same byte sequence contained in the Beatport ID3 raw data which we have seen before. Now we know which “text” they encoded in order create the meta data in the file they provide for download. But what is that text? We are still missing the German umlaut Ü here, aren’t we? Let us look at the common character representation of these code points:

>>> print u"\xc3\x9c"
Ü▯

By having a look at http://codepoints.net we can clarify what the code points C3 and 9C really represent:

  • U+00C3 LATIN CAPITAL LETTER A WITH TILDE
  • U+009C STRING TERMINATOR*

The print statement above attempted to display these characters on my terminal. The A with tilde appears as expected, followed by a rectangle (you might or might not see that here), representing a control character.

So now we have identified the actual text that Beatport encoded as UTF-8 and saved in the file as raw byte sequence. The VLC player in the figure at the top is behaving correctly: it decodes this byte sequence using UTF-8 and just displays the resulting characters: the A with the tilde and the control character, which has no glyph, and which is therefore represented with a rectangle.

The question left is: why does Beatport encode invalid text in the first place?

The magic of encoding text multiple times.

When you regularly deal with character encodings you probably have an idea already. I had a suspicion. The correct title text starts with a capital German Umlaut Ü. The unicode codepoint for Ü actually is 00DC. What is the raw byte sequence representation of this code point when using the UTF-8 codec?

>>> u"Ü".encode("utf-8")
'\xc3\x9c'
>>> u"\xdc".encode("utf-8")
'\xc3\x9c'

Right. It is c3 9c in hex notation. You have seen that a minute ago. Confused? Above, we learned that code points C3 and 9C were considered part of the original text, which was then encoded to its UTF-8 representation, i.e. the UTF-8 representations of the characters U+00C3 and U+009C ended up in the raw data. Now, we have learned that the two bytes c3 9c actually encode the character U+00DC in UTF-8. Still confused?

Explanation:

The original text was encoded twice, whereas the raw byte string representation after the first encoding was erroneously interpreted as unicode code point sequence.

Reproduction of Beatport’s broken text encoding

Let us reproduce this step by step. First, we encode U+00DC (the German Umlaut Ü) to UTF-8:

>>> u"\xdc".encode("utf-8")
'\xc3\x9c'

Now it is time to go into detail of defining unicode literals in Python 2: with the u in front of the literal, Python is instructed to parse the characters in the literal as unicode code points. One code point can be given with different methods. The first 256 unicode code points (there are many more!) can be given in hex notation. This is what happens above, the \xdc is the U+00DC code point in hex notation.

The output of the above call to encode() is a raw byte string, where the bytes are shown in hex notation. Now we can go ahead and attach a u in front of the raw byte string. This little prefix fundamentally changes the meaning of this string literal. Now, the hex notation does not describe single raw bytes anymore, it describes unicode code points. The two resulting entities are entirely unrelated:

>>> print '\xc3\x9c'
Ü
>>> print u'\xc3\x9c'
Ü▯

The return value of both statements has nothing meaningful in common, by concept. The first is a byte string, implicitly decoded via the UTF-8 codec by my terminal (careful, that is magic!). The second is a sequence of two unicode code points.

This is like saying “hey, give me item number 156 and 195 from that shelve there, and then also give me number 156 and 195 from the other shelve over there”, whereas the shelves contain entirely different things. All these two statements have in common is the way the “numbers” are represented in hex notation.

It does not matter which programming language Beatport is using for creating the ID3 meta data, but somehow they managed to do a very weird thing: after having the text encoded in UTF-8 (technically it could also have been Latin-1, as Thomas pointed out in the comments, but that is not that likely), they

  • re-interprete that binary data (most likely in hex representation) again as unicode code point sequence
  • and re-encode this unicode code sequence again with UTF-8.

With our small example, this is the process:

# Encode the text with UTF-8.
>>> u"Ü".encode("utf-8")
'\xc3\x9c'
 
# Take the hex representation of the raw byte sequence and
# re-interpret it as unicode code point sequence. Encode this
# with UTF-8 again.
>>> u'\xc3\x9c'.encode("utf-8")
'\xc3\x83\xc2\x9c'

The latter is exactly the invalid raw byte sequence found in the ID3 meta data of Beatport’s MP3 file. The last step in reproducing the entire encoding-transfer-decoding chain is to do what a MP3 player would do: decode that data using UTF-8 and display the corresponding characters:

>>> print '\xc3\x83\xc2\x9c'.decode("utf-8")
Ü▯

The above is exactly what happens within e.g. VLC player or any other player that properly parses the ID3 tag data.

Indeed, this is Beatport’s fault. Within the entire process of text processing, one needs to be aware of the actual representation of the text. At some point in Beatport’s text processing, a developer assumed text to be a Unicode sequence object, whereas it really was an UTF-8-encoded byte string. The conceptual problem is: never make assumptions about the text representation in your code. Always take control of the data and be 100 % sure about the type of text data you are handling.

Otherwise millions of MP3 downloads will be are erroneous.

A systematic fix based on raw_unicode_escape

The process that lead to the erroneous raw byte sequence is now well-understood. Fortunately, this process does not involve any loss of information. The information is just in bad shape. With the help of some Python magic we can invert that process.

The issue is that the byte sequence \xc3\x9c was interpreted as unicode code point sequence, yielding the raw byte sequence \xc3\x83\xc2\x9c after encoding. The Python codec raw_unicode_escape can invert this (kudos to this SO thread):

>>> u'\xc3\x9c'.encode('raw_unicode_escape')
'\xc3\x9c'

Couldn’t we just have taken away the u? Yes. It is that simple. Manually. But using .encode('raw_unicode_escape') is the only straight-forward automatic procedure to achieve the same effect: keep the item representation, change the item meaning from unicode code points to raw bytes.

Likewise, the invalid raw byte sequence can be fixed using this technique:

>>> raw = '\xc3\x83\xc2\x9c'
 
# Decode the byte sequence to a unicode object.
>>> raw.decode("utf-8")
u'\xc3\x9c'
 
# Encode this unicode object, while keeping the item "numbering".
# This yields the UTF-8-encoded text as it was before Beatport
# corrupted it.
>>> raw.decode("utf-8").encode('raw_unicode_escape')
'\xc3\x9c'
 
# Decode that text.
>>> raw.decode("utf-8").encode('raw_unicode_escape').decode("utf-8")
u'\xdc'

As you remember, the code point U+00DC is the Ü. Great! All mangled together, and printed:

>>> print '\xc3\x83\xc2\x9c'.decode("utf-8").encode('raw_unicode_escape').decode("utf-8")
Ü

Yes, that’s it: the Ü is restored from the invalid byte sequence, using the knowledge derived above.

Fix the title in an MP3 file using Mutagen

There is an awesome Python module called Mutagen for handling audio file meta data. First of all, let us use Mutagen for directly and comfortably accessing the title data in our MP3 file:

>>> from mutagen.id3 import ID3
>>> data = ID3("test.mp3")
>>> title = data["TIT2"]
>>> title
TIT2(encoding=3, text=[u'\xc3\x9cbersprung (Original Mix)'])
>>> unicode(title)
u'\xc3\x9cbersprung (Original Mix)'

In the above code, unicode(title) yields the same as raw.decode("utf-8") in the section before. Starting from there, we can apply our systematic fix. Loading a Beatport MP3 file, retrieving the title tag, and generating the proper title text in one line:

>>> print unicode(ID3("test.mp3")["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
Übersprung (Original Mix)

All in all, load an MP3 file, generate the corrected title from the invalid one, and save the corrected title back to the file:

>>> from mutagen.id3 import ID3, TIT2
 
# Load ID3 meta data from MP3 file.
>>> data = ID3("test.mp3")
 
# Build corrected title.
>>> corrected_title = unicode(data["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
 
# Update ID3 data object with corrected title.
>>> data.add(TIT2(encoding=3, text=corrected_title))
 
# Write updated ID3 date to MP3 file.
>>> data.save()

After pulling that file into the player, we see that the title issue is fixed:

vlc_correct

How to fix all ID3 text frames in all files.

We could now assume that Beatport is doing the same mistake with all ID3 text frames. Actually, I have seen invalid Artist strings. Obviously, the task would then be to iterate through a collection of files, and for each file iterate through all ID3 text frames and fix them as shown above. Since I am not sure about the assumption stated before, I will not show the corresponding code here. I think you will manage to do that in case you have a collection of broken files from Beatport and know at least some Python. If not, it is a good exercise :-). But back up your MP3 files before!

  • Thomas

    I think you can use ‘latin-1’ in place of ‘raw_unicode_escape’, because the first 256 unicode code points are equivalent to the single byte characters available in latin 1.