Category Archives: Python

Discourse Docker container: send mail through Exim

Introduction

The Discourse deployment was greatly simplified by introducing Docker support (as I have written about before). Discourse heavily depends on e-mail, and its ability to send mail to arbitrary recipients is essential. While the recommended way is to use an external service like Mandrill, it is also possible to use a local MTA, such as Exim. However, when you set up the vanilla Discourse Docker container, it does not contain an pre-configured MTA, which is fine, since many have a well-configured MTA running on the host already. The question is how to use that MTA for letting Discourse send mail.

Usually, MTAs on smaller machines are configured to listen on localhost only, to not be exposed to the Internet and to not be mis-used for spam. localhost on the host itself, however, is different from localhost within a Docker container. The network within the container is a virtual one, and it is cleanly separated from the host. That is, when Discourse running in a container tries to reach an SMTP server on localhost, it cannot reach an MTA listening on localhost outside of the container. There is a straight-forward solution: Docker comes along with a network bridge. In fact, it provides a private network (in the 172.17.x.x range) that connects single containers with the host. This network can be used for establishing connectivity between a network application within a Docker container and the host.

Exim’s network configuration

Likewise, I have set up Exim4 on the Debian host for relaying mails that are incoming from localhost or from the local virtual Docker network. First I looked up the IP address of the docker bridge on the host, being 172.17.42.1 in my case (got that from /sbin/ifconfig). I then instructed Exim to treat this as local interface and listen on it. Also, Exim was explicitly told to relay mail incoming from the subnet 172.17.0.0/16, otherwise it would reject incoming mails from that network. These are the relevant keys in /etc/exim4/update-exim4.conf.conf:

dc_local_interfaces='127.0.0.1;172.17.42.1'
dc_relay_nets='172.17.0.0/16'

The config update is in place after calling update-exim4.conf and restarting Exim via service exim4 restart.

Testing SMTP access from within container

I tested if Exim’s SMTP server can be reached from within the container. I used the bare-bones SMTP implementation of Python’s smtplib for that. First of all, I SSHd into the container by calling launcher ssh app. I then called python. The following Python session demonstrates how I attempted to establish an SMTP connection right to the host via its IP address in Docker’s private network:

>>> import smtplib
>>> server = smtplib.SMTP('172.17.42.1')
>>> server.set_debuglevel(1)
>>> server.sendmail("rofl@gehrcke.de", "jgehrcke@googlemail.com", "test")
send: 'ehlo [172.17.0.54]\r\n'
reply: '250-localhost Hello [172.17.0.54] [172.17.0.54]\r\n'
reply: '250-SIZE 52428800\r\n'
reply: '250-8BITMIME\r\n'
reply: '250-PIPELINING\r\n'
reply: '250 HELP\r\n'
reply: retcode (250); Msg: localhost Hello [172.17.0.54] [172.17.0.54]
SIZE 52428800
8BITMIME
PIPELINING
HELP
send: 'mail FROM:<rofl@gehrcke.de> size=4\r\n'
reply: '250 OK\r\n'
reply: retcode (250); Msg: OK
send: 'rcpt TO:<jgehrcke@googlemail.com>\r\n'
reply: '250 Accepted\r\n'
reply: retcode (250); Msg: Accepted
send: 'data\r\n'
reply: '354 Enter message, ending with "." on a line by itself\r\n'
reply: retcode (354); Msg: Enter message, ending with "." on a line by itself
data: (354, 'Enter message, ending with "." on a line by itself')
send: 'test\r\n.\r\n'
reply: '250 OK id=1X9bpF-0000st-Od\r\n'
reply: retcode (250); Msg: OK id=1X9bpF-0000st-Od
data: (250, 'OK id=1X9bpF-0000st-Od')
{}

Indeed, the mail arrived at my Google Mail account. This test shows that the Exim4 server running on the host is reachable via SMTP from within the Discourse Docker instance. Until I got the configuration right, I observed essentially two different classes of errors:

  • socket.error: [Errno 111] Connection refused in case there is no proper network routing or connectivity established.
  • smtplib.SMTPRecipientsRefused: {'jgehrcke@googlemail.com': (550, 'relay not permitted')} in case the Exim4 SMTP server is reachable, but rejecting your mail (for this to solve I had to add the dc_relay_nets='172.17.0.0/16' to the config shown above).

Obviously, in order to make Discourse use that SMTP server, it needs to be configured with DISCOURSE_SMTP_ADDRESS being set to the IP address of the host in the Docker network, i.e. 172.17.42.1 in my case.

Hope that helps!

Mojibake: Beatport’s ID3 text encoding is broken

Mojibake is a name for garbled text, arising from systematic errors along a text encoding-transfer-decoding chain. What does it have to do with Beatport? This:

vlc_mojibake

This is a screenshot from a playlist of the VLC player, showing MP3 meta data. I downloaded the corresponding track from Beatport. Garbage is displayed where the German Umlaut “Ü” should appear. Why is that? Does the player not support the meta data version, or more specifically the meta data encoding used by Beatport MP3s?

After some investigation I found that Beatport provides MP3 files with invalid meta data. The invalid meta data is the result from a tremendously flawed text encoding procedure in the bowels of Beatport, where text is first encoded via UTF-8, the resulting raw binary data then is interpreted as a unicode code point sequence, and subsequently encoded via UTF-8 again. Horrific, and unsurprisingly the outcome is garbage. The invalid title tag shown above can easily be fixed in Python:

>>> from mutagen.id3 import ID3, TIT2
>>> data = ID3("test.mp3")
>>> corrected_title = unicode(data["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
>>> data.add(TIT2(encoding=3, text=corrected_title))
>>> data.save()

You do not need to understand that code right now. In the following paragraphs I will explain the issue step by step and slowly work towards this solution. The issue is a result of another developer (team?) not taking enough care of character encodings, although in fact this topic is one of the most important topics in modern information technology, and ignorance in this regard has led to tons of bugs in a plethora of software projects. It is time to refer to Joel’s article “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” again, which you may want to read later on, if you did not so far.

Raw data in the ID3 tag

Meta data in MP3s is usually stored within an ID3 meta data container, as explained on Wikipedia and specified on id3.org. Different versions of this container format specification are available. First of all, let us find out which ID3 tag version the MP3 files from Beatport use. I have renamed the Beatport MP3 file in question to test.mp3. The following snippet shows the first five bytes of the file:

$ hexdump -C -n 5 test.mp3
00000000  49 44 33 04 00                                    |ID3..|

Quote from here: The first three bytes of the tag are always “ID3”, to indicate that this is an ID3v2 tag, directly followed by the two version bytes. The first byte of ID3v2 version is its major version, while the second byte is its revision number. Hence, this MP3 file contains an ID3 tag in version 2.4.0.

The ID3 data is comprised of frames. For example, the so-called TIT2 frame is designed to contain the track title. I have used hexdump to look for that frame within the first kilobytes of the MP3 file (the ID3 tag may also contain image data, so the size of the entire ID3v2 container can be several kilobytes). The following partial dump shows all the bytes belonging to the TIT2 frame in this file, as well as some stuff before and behind that.

00004900  49 54 31 00 00 00 08 00  00 03 4b 6f 6d 70 61 6b  |IT1.......Kompak|
00004910  74 54 49 54 32 00 00 00  1d 00 00 03 c3 83 c2 9c  |tTIT2...........|
00004920  62 65 72 73 70 72 75 6e  67 20 28 4f 72 69 67 69  |bersprung (Origi|
00004930  6e 61 6c 20 4d 69 78 29  54 4b 45 59 00 00 00 05  |nal Mix)TKEY....|

Text encoding in ID3 v2.4.0

It is clear that the above dump contains the track title in encoded form (there always is some kind of text encoding, there is no such thing as plain text, this should not surprise you). What is the exact format of the piece of data shown above? Which character encodings does the ID3 v2.4.0 specification allow for? Is the encoding itself specified in the file? Let’s have a look at the specification, these are relevant parts:

   All ID3v2 frames consists of one frame header followed by one or more
   fields containing the actual information. The header is always 10
   bytes and laid out as follows:
 
     Frame ID      $xx xx xx xx  (four characters)
     Size      4 * %0xxxxxxx
     Flags         $xx xx
 
[...]
 
The frame ID is followed by a size descriptor containing the size of
   the data in the final frame, after encryption, compression and
   unsynchronisation. The size is excluding the frame header ('total
   frame size' - 10 bytes) and stored as a 32 bit synchsafe integer.
 
   In the frame header the size descriptor is followed by two flag
   bytes. These flags are described in section 4.1.

What follows is the isolated frame data, i.e. all raw bytes belonging to the TIT2 frame (nothing else prepended or appended):

   54 49 54 32 00 00 00  1d 00 00 03 c3 83 c2 9c  |TIT2...........|
62 65 72 73 70 72 75 6e  67 20 28 4f 72 69 67 69  |bersprung (Origi|
6e 61 6c 20 4d 69 78 29                           |nal Mix)|
  • Frame ID: 54 49 54 32. This is the TIT2 label, indicating that this is the frame containing information about the track title.
  • Size: 00 00 00 1d. This is 29 (Python: int("0x1d", 0)). You can count for yourself, there are 39 bytes shown in the dump above, and the ID3 specification says that the frame size is the total frame size minus 10 bytes, so that fits.
  • Flags: 00 00. No flags.

What about text encoding? This is specified in section 4.2 of http://id3.org/id3v2.4.0-frames:

All the text information frames have the following format:
 
     <Header for 'Text information frame', ID: "T000" - "TZZZ",
     excluding "TXXX" described in 4.2.6.>
     Text encoding                $xx
     Information                  <text string(s) according to encoding>

http://id3.org/id3v2.4.0-structure informs us about possible encodings:

Frames that allow different types of text encoding contains a text
   encoding description byte. Possible encodings:
 
     $00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
     $01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
           strings in the same frame SHALL have the same byteorder.
           Terminated with $00 00.
     $02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
           Terminated with $00 00.
     $03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.

In the raw data above, after frame type, size and flags we see a 03 byte. According to the specification above, this byte means that the following text is encoded in UTF-8. Hence, the file itself tells us that it contains the title tag encoded in UTF-8.

What follows is the byte representation of the title text, extracted from the dump shown above (frame header and text encoding marker removed). It is important to note that the following byte sequence has been created by Beatport (bytes shown hex representation, as before):

c3 83 c2 9c 62 65 72 73 70 72 75 6e 67 20
28 4f 72 69 67 69 6e 61 6c 20 4d 69 78 29

Now, just decode this raw byte sequence using the UTF-8 codec and we have our title, right? Let’s see.

Decoding the raw title data: something is wrong.

Using the \x prefix, we can easily get the raw data just shown (which should encode the title text) into a Python (2) byte string:

>>> raw = "\xc3\x83\xc2\x9c\x62\x65\x72\x73\x70\x72\x75\x6e\x67\x20\x28\x4f\x72\x69\x67\x69\x6e\x61\x6c\x20\x4d\x69\x78\x29"

The ID3 tag itself makes us believe that the original text has been encoded using UTF-8, so in order to retrieve the original text, this operation needs to be inverted. This is easily done in Python, by calling the decode() method on a byte string, providing the codec to be used:

>>> raw.decode("utf-8")
u'\xc3\x9cbersprung (Original Mix)'

The data type returned by this operation is a unicode string, i.e. a sequence of characters, not bytes. And this sequence of characters looks flawed. What is that \xc3\x9c thing there, actually? Does it make sense? To be clarified in the next section.

Reverse-engineering the issue

First, let us verify what happened here. We decoded a raw byte sequence via UTF-8 and retrieved two weird unicode code points in the output. This is the inverse process, starting from the two unexpected unicode code points C3 and 9C:

>>> u"\xc3\x9c".encode("utf-8")
'\xc3\x83\xc2\x9c'

The Python code above defines a sequence of unicode code points, and then encodes this “text” using UTF-8, yielding the very same byte sequence contained in the Beatport ID3 raw data which we have seen before. Now we know which “text” they encoded in order create the meta data in the file they provide for download. But what is that text? We are still missing the German umlaut Ü here, aren’t we? Let us look at the common character representation of these code points:

>>> print u"\xc3\x9c"
Ü▯

By having a look at http://codepoints.net we can clarify what the code points C3 and 9C really represent:

  • U+00C3 LATIN CAPITAL LETTER A WITH TILDE
  • U+009C STRING TERMINATOR*

The print statement above attempted to display these characters on my terminal. The A with tilde appears as expected, followed by a rectangle (you might or might not see that here), representing a control character.

So now we have identified the actual text that Beatport encoded as UTF-8 and saved in the file as raw byte sequence. The VLC player in the figure at the top is behaving correctly: it decodes this byte sequence using UTF-8 and just displays the resulting characters: the A with the tilde and the control character, which has no glyph, and which is therefore represented with a rectangle.

The question left is: why does Beatport encode invalid text in the first place?

The magic of encoding text multiple times.

When you regularly deal with character encodings you probably have an idea already. I had a suspicion. The correct title text starts with a capital German Umlaut Ü. The unicode codepoint for Ü actually is 00DC. What is the raw byte sequence representation of this code point when using the UTF-8 codec?

>>> u"Ü".encode("utf-8")
'\xc3\x9c'
>>> u"\xdc".encode("utf-8")
'\xc3\x9c'

Right. It is c3 9c in hex notation. You have seen that a minute ago. Confused? Above, we learned that code points C3 and 9C were considered part of the original text, which was then encoded to its UTF-8 representation, i.e. the UTF-8 representations of the characters U+00C3 and U+009C ended up in the raw data. Now, we have learned that the two bytes c3 9c actually encode the character U+00DC in UTF-8. Still confused?

Explanation:

The original text was encoded twice, whereas the raw byte string representation after the first encoding was erroneously interpreted as unicode code point sequence.

Reproduction of Beatport’s broken text encoding

Let us reproduce this step by step. First, we encode U+00DC (the German Umlaut Ü) to UTF-8:

>>> u"\xdc".encode("utf-8")
'\xc3\x9c'

Now it is time to go into detail of defining unicode literals in Python 2: with the u in front of the literal, Python is instructed to parse the characters in the literal as unicode code points. One code point can be given with different methods. The first 256 unicode code points (there are many more!) can be given in hex notation. This is what happens above, the \xdc is the U+00DC code point in hex notation.

The output of the above call to encode() is a raw byte string, where the bytes are shown in hex notation. Now we can go ahead and attach a u in front of the raw byte string. This little prefix fundamentally changes the meaning of this string literal. Now, the hex notation does not describe single raw bytes anymore, it describes unicode code points. The two resulting entities are entirely unrelated:

>>> print '\xc3\x9c'
Ü
>>> print u'\xc3\x9c'
Ü▯

The return value of both statements has nothing meaningful in common, by concept. The first is a byte string, implicitly decoded via the UTF-8 codec by my terminal (careful, that is magic!). The second is a sequence of two unicode code points.

This is like saying “hey, give me item number 156 and 195 from that shelve there, and then also give me number 156 and 195 from the other shelve over there”, whereas the shelves contain entirely different things. All these two statements have in common is the way the “numbers” are represented in hex notation.

It does not matter which programming language Beatport is using for creating the ID3 meta data, but somehow they managed to do a very weird thing: after having the text encoded in UTF-8 (technically it could also have been Latin-1, as Thomas pointed out in the comments, but that is not that likely), they

  • re-interprete that binary data (most likely in hex representation) again as unicode code point sequence
  • and re-encode this unicode code sequence again with UTF-8.

With our small example, this is the process:

# Encode the text with UTF-8.
>>> u"Ü".encode("utf-8")
'\xc3\x9c'
 
# Take the hex representation of the raw byte sequence and
# re-interpret it as unicode code point sequence. Encode this
# with UTF-8 again.
>>> u'\xc3\x9c'.encode("utf-8")
'\xc3\x83\xc2\x9c'

The latter is exactly the invalid raw byte sequence found in the ID3 meta data of Beatport’s MP3 file. The last step in reproducing the entire encoding-transfer-decoding chain is to do what a MP3 player would do: decode that data using UTF-8 and display the corresponding characters:

>>> print '\xc3\x83\xc2\x9c'.decode("utf-8")
Ü▯

The above is exactly what happens within e.g. VLC player or any other player that properly parses the ID3 tag data.

Indeed, this is Beatport’s fault. Within the entire process of text processing, one needs to be aware of the actual representation of the text. At some point in Beatport’s text processing, a developer assumed text to be a Unicode sequence object, whereas it really was an UTF-8-encoded byte string. The conceptual problem is: never make assumptions about the text representation in your code. Always take control of the data and be 100 % sure about the type of text data you are handling.

Otherwise millions of MP3 downloads will be are erroneous.

A systematic fix based on raw_unicode_escape

The process that lead to the erroneous raw byte sequence is now well-understood. Fortunately, this process does not involve any loss of information. The information is just in bad shape. With the help of some Python magic we can invert that process.

The issue is that the byte sequence \xc3\x9c was interpreted as unicode code point sequence, yielding the raw byte sequence \xc3\x83\xc2\x9c after encoding. The Python codec raw_unicode_escape can invert this (kudos to this SO thread):

>>> u'\xc3\x9c'.encode('raw_unicode_escape')
'\xc3\x9c'

Couldn’t we just have taken away the u? Yes. It is that simple. Manually. But using .encode('raw_unicode_escape') is the only straight-forward automatic procedure to achieve the same effect: keep the item representation, change the item meaning from unicode code points to raw bytes.

Likewise, the invalid raw byte sequence can be fixed using this technique:

>>> raw = '\xc3\x83\xc2\x9c'
 
# Decode the byte sequence to a unicode object.
>>> raw.decode("utf-8")
u'\xc3\x9c'
 
# Encode this unicode object, while keeping the item "numbering".
# This yields the UTF-8-encoded text as it was before Beatport
# corrupted it.
>>> raw.decode("utf-8").encode('raw_unicode_escape')
'\xc3\x9c'
 
# Decode that text.
>>> raw.decode("utf-8").encode('raw_unicode_escape').decode("utf-8")
u'\xdc'

As you remember, the code point U+00DC is the Ü. Great! All mangled together, and printed:

>>> print '\xc3\x83\xc2\x9c'.decode("utf-8").encode('raw_unicode_escape').decode("utf-8")
Ü

Yes, that’s it: the Ü is restored from the invalid byte sequence, using the knowledge derived above.

Fix the title in an MP3 file using Mutagen

There is an awesome Python module called Mutagen for handling audio file meta data. First of all, let us use Mutagen for directly and comfortably accessing the title data in our MP3 file:

>>> from mutagen.id3 import ID3
>>> data = ID3("test.mp3")
>>> title = data["TIT2"]
>>> title
TIT2(encoding=3, text=[u'\xc3\x9cbersprung (Original Mix)'])
>>> unicode(title)
u'\xc3\x9cbersprung (Original Mix)'

In the above code, unicode(title) yields the same as raw.decode("utf-8") in the section before. Starting from there, we can apply our systematic fix. Loading a Beatport MP3 file, retrieving the title tag, and generating the proper title text in one line:

>>> print unicode(ID3("test.mp3")["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
Übersprung (Original Mix)

All in all, load an MP3 file, generate the corrected title from the invalid one, and save the corrected title back to the file:

>>> from mutagen.id3 import ID3, TIT2
 
# Load ID3 meta data from MP3 file.
>>> data = ID3("test.mp3")
 
# Build corrected title.
>>> corrected_title = unicode(data["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
 
# Update ID3 data object with corrected title.
>>> data.add(TIT2(encoding=3, text=corrected_title))
 
# Write updated ID3 date to MP3 file.
>>> data.save()

After pulling that file into the player, we see that the title issue is fixed:

vlc_correct

How to fix all ID3 text frames in all files.

We could now assume that Beatport is doing the same mistake with all ID3 text frames. Actually, I have seen invalid Artist strings. Obviously, the task would then be to iterate through a collection of files, and for each file iterate through all ID3 text frames and fix them as shown above. Since I am not sure about the assumption stated before, I will not show the corresponding code here. I think you will manage to do that in case you have a collection of broken files from Beatport and know at least some Python. If not, it is a good exercise :-). But back up your MP3 files before!

Prettyplotlib!

hist_prettyplotlib

Are you using matplotlib for Python from time to time? You might want to have a look at prettyplotlib! In short: it makes your plots appear more harmonic and comfortable by using modern design standards and color schemes (cf. colorbrewer). These styles are nice for publishing plots on a website or in a magazine, they obviously do not add scientific value.

Technically, prettyplotlib wraps matplotlib + brewer2mpl.

Thermal energy in calories

A short note. In chemistry and related fields we quite often describe energies in units of calories, instead of using Joules from the SI system. From physics, we know that the thermal energy at room temperature is about 25 meV. The question is: how does the thermal energy relate to an interaction energy (e.g. between two molecules) if that energy is provided in units of kcal/mol?

The thermal energy in Joule is k_B*T. That energy is “accessible” to any given molecule, i.e. for comparison with a value normalized on a mole (as an energy provided in units kcal/mol obviously is) we need to multiply it with the Avogadro constant A, yielding the thermal energy in Joule/mol. About 4190 Joule comprise one kcal (1000 calories), i.e. the thermal energy in kcal/mol is k_B * T * A / 4190. Given that, we can relate any given interaction energy to the thermal energy. A quick Python implementation:

import sys
 
# Thermal energy for 300 K:
# T_300 = 4.14 * 10^-21 J
# (k_B * 300 K, with k_B = 1.38 * 10^-23 J/K)
# The relation between Joule and (kilo)calories:
joule_per_kcal = 4190
 
# Thermal energy in kcal/mol:
# T_300_kcalpermol = T_300 * A / joule_per_kcal
# with A = 6.02 * 10^23 (Avogadro constant)
T_300_kcalpermol = 4.14 * 6.02 * 100 / joule_per_kcal
print "Thermal energy (at 300 K): %.2f kcal/mol" % (
    T_300_kcalpermol)
 
# Read actual energy in kcal/mol.
energy_kcalpermol = float(sys.argv[1])
 
energy_per_thermal_energy = energy_kcalpermol / T_300_kcalpermol 
print "%.3f kcal/mol divided by the thermal energy: %.1f" % (
    energy_kcalpermol, energy_per_thermal_energy)

So how does an interaction energy of 2 kcal/mol relate to the thermal energy?

$ python foo.py 2
Thermal energy (at 300 K): 0.59 kcal/mol
2.000 kcal/mol divided by the thermal energy: 3.4

First of all, the thermal energy is 0.59 kcal/mol — something to keep in mind when dealing with kcal/mol-based energy values on a regular basis. We learn that an interaction energy of 2 kcal/mol is already more than three times larger than the thermal energy, i.e. this kind of interaction may easily dominate diffusion and stands out of the thermal noise.

Travis CI finally supports Python 3.4

Python 3.4 was released over a month ago. According to this announcement, Travis CI will finally support Python 3.4 in only a few hours. This has been long awaited by the community, given the many "+1" postings in Travis CI issue 1989 and the countless "Add 3.4 to .travis.yml"-style commit messages referencing this issue.

Many believe that Python 3.4 will be the breakthrough for Python 3 and we can expect it to become quite popular. Although Python 2.7 security and bug fixes have recently been “guaranteed” for up to 2020 by Guido, I got the impression that the dominance of Python 2.7 finally decreases — slowly, but steadily. For developers in the open source community this means that Python 3.4 compatibility is an important target to aim for now (you might even want to ignore all releases up to 3.3).

By the way, Ubuntu has made Python 3.4 the default Python 3 in their recently released 14.04 LTS (which will be supported for 5 years). They even considered to ship it as the default Python which they did not do in the end — their recommendation, however, is

“to best support future versions of Ubuntu you should consider porting your code to Python 3”

So, go ahead, use the great Travis CI and make your code run on both, Python 2.7 and Python 3.4!