
How to raise UnicodeDecodeError in Python 3

For debugging work, I needed to manually raise UnicodeDecodeError in CPython 3(.4). Its constructor requires 5 arguments:

>>> raise UnicodeDecodeError()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: function takes exactly 5 arguments (0 given)

The docs specify which properties an already instantiated UnicodeError (the base class) has: https://docs.python.org/3/library/exceptions.html#UnicodeError

There are five such properties, so obviously the constructor needs these five pieces of information. However, passing them requires knowing the expected order, since the constructor does not take keyword arguments. The signature is also not documented by help(UnicodeDecodeError), which suggests that this interface is implemented in C (which makes sense, as it sits in the bowels of CPython’s text processing).

So, the only true reference for finding out the expected order of arguments is the C code implementing the constructor. It is defined by the function UnicodeDecodeError_init in the file Objects/exceptions.c in the CPython repository. The essential lines are these:

if (!PyArg_ParseTuple(args, "O!OnnO!",
     &PyUnicode_Type, &ude->encoding,
     &ude->object,
     &ude->start,
     &ude->end,
     &PyUnicode_Type, &ude->reason)) {
         ude->encoding = ude->object = ude->reason = NULL;
         return -1;
}

That is, the order is the following:

  1. encoding (unicode object, i.e. type str)
  2. object that was attempted to be decoded (a bytes object makes sense here, although strictly the only requirement is that the object provide the buffer interface)
  3. start (integer)
  4. end (integer)
  5. reason (type str)

Hence, now we know how to artificially raise a UnicodeDecodeError:

>>> o = b'\x00\x00'
>>> raise UnicodeDecodeError('funnycodec', o, 1, 2, 'This is just a fake reason!')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'funnycodec' codec can't decode byte 0x00 in position 1: This is just a fake reason!
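
For completeness: the five constructor arguments come straight back out as the documented attributes of the instance (see the UnicodeError docs linked above). A quick check with the same fake error:

>>> try:
...     raise UnicodeDecodeError('funnycodec', b'\x00\x00', 1, 2, 'fake reason')
... except UnicodeDecodeError as e:
...     print(e.encoding, e.object, e.start, e.end, e.reason)
...
funnycodec b'\x00\x00' 1 2 fake reason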

Mojibake: Beatport’s ID3 text encoding is broken

Mojibake is a name for garbled text, arising from systematic errors along a text encoding-transfer-decoding chain. What does it have to do with Beatport? This:

[Screenshot: a VLC playlist entry showing the garbled track title]

This is a screenshot from a playlist of the VLC player, showing MP3 meta data. I downloaded the corresponding track from Beatport. Garbage is displayed where the German Umlaut “Ü” should appear. Why is that? Does the player not support the meta data version, or more specifically the meta data encoding used by Beatport MP3s?

After some investigation I found that Beatport provides MP3 files with invalid meta data. The invalid meta data is the result of a tremendously flawed text encoding procedure in the bowels of Beatport: text is first encoded via UTF-8, the resulting raw binary data is then interpreted as a unicode code point sequence, and subsequently encoded via UTF-8 again. Horrific, and unsurprisingly the outcome is garbage. The invalid title tag shown above can easily be fixed in Python:

>>> from mutagen.id3 import ID3, TIT2
>>> data = ID3("test.mp3")
>>> corrected_title = unicode(data["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
>>> data.add(TIT2(encoding=3, text=corrected_title))
>>> data.save()

You do not need to understand that code right now. In the following paragraphs I will explain the issue step by step and slowly work towards this solution. The issue is the result of another developer (team?) not taking enough care of character encodings, although this topic is in fact one of the most important in modern information technology, and ignorance in this regard has led to tons of bugs in a plethora of software projects. It is time to refer to Joel’s article “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” again, which you may want to read later on if you have not done so yet.

Raw data in the ID3 tag

Meta data in MP3s is usually stored within an ID3 meta data container, as explained on Wikipedia and specified on id3.org. Different versions of this container format specification are available. First of all, let us find out which ID3 tag version the MP3 files from Beatport use. I have renamed the Beatport MP3 file in question to test.mp3. The following snippet shows the first five bytes of the file:

$ hexdump -C -n 5 test.mp3
00000000  49 44 33 04 00                                    |ID3..|

Quoting the ID3v2 specification: “The first three bytes of the tag are always ‘ID3’, to indicate that this is an ID3v2 tag, directly followed by the two version bytes. The first byte of ID3v2 version is its major version, while the second byte is its revision number.” Hence, this MP3 file contains an ID3 tag in version 2.4.0.
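
The same check can be performed directly in Python (a quick sketch, reading the first five bytes of the file as above):

>>> with open("test.mp3", "rb") as f:
...     header = f.read(5)
...
>>> header
'ID3\x04\x00'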

The ID3 data is comprised of frames. For example, the so-called TIT2 frame is designed to contain the track title. I have used hexdump to look for that frame within the first kilobytes of the MP3 file (the ID3 tag may also contain image data, so the size of the entire ID3v2 container can be several kilobytes). The following partial dump shows all the bytes belonging to the TIT2 frame in this file, as well as some bytes before and after it.

00004900  49 54 31 00 00 00 08 00  00 03 4b 6f 6d 70 61 6b  |IT1.......Kompak|
00004910  74 54 49 54 32 00 00 00  1d 00 00 03 c3 83 c2 9c  |tTIT2...........|
00004920  62 65 72 73 70 72 75 6e  67 20 28 4f 72 69 67 69  |bersprung (Origi|
00004930  6e 61 6c 20 4d 69 78 29  54 4b 45 59 00 00 00 05  |nal Mix)TKEY....|

Text encoding in ID3 v2.4.0

It is clear that the above dump contains the track title in encoded form (there always is some kind of text encoding; there is no such thing as plain text, and this should not surprise you). What is the exact format of the piece of data shown above? Which character encodings does the ID3 v2.4.0 specification allow for? Is the encoding itself specified in the file? Let’s have a look at the specification; these are the relevant parts:

   All ID3v2 frames consists of one frame header followed by one or more
   fields containing the actual information. The header is always 10
   bytes and laid out as follows:
 
     Frame ID      $xx xx xx xx  (four characters)
     Size      4 * %0xxxxxxx
     Flags         $xx xx
 
[...]
 
   The frame ID is followed by a size descriptor containing the size of
   the data in the final frame, after encryption, compression and
   unsynchronisation. The size is excluding the frame header ('total
   frame size' - 10 bytes) and stored as a 32 bit synchsafe integer.
 
   In the frame header the size descriptor is followed by two flag
   bytes. These flags are described in section 4.1.

What follows is the isolated frame data, i.e. all raw bytes belonging to the TIT2 frame (nothing else prepended or appended):

   54 49 54 32 00 00 00  1d 00 00 03 c3 83 c2 9c  |TIT2...........|
62 65 72 73 70 72 75 6e  67 20 28 4f 72 69 67 69  |bersprung (Origi|
6e 61 6c 20 4d 69 78 29                           |nal Mix)|
  • Frame ID: 54 49 54 32. This is the TIT2 label, indicating that this is the frame containing information about the track title.
  • Size: 00 00 00 1d. This is 29 (Python: int("0x1d", 0)). You can count for yourself, there are 39 bytes shown in the dump above, and the ID3 specification says that the frame size is the total frame size minus 10 bytes, so that fits.
  • Flags: 00 00. No flags.
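
Parsing this 10-byte frame header by hand in Python (2) confirms these values. Note that the size is a synchsafe integer, i.e. only the lower 7 bits of each byte carry information; for a value this small the result happens to equal a plain big-endian read:

>>> header = '\x54\x49\x54\x32\x00\x00\x00\x1d\x00\x00'
>>> size = 0
>>> for byte in header[4:8]:
...     size = (size << 7) | ord(byte)
...
>>> header[:4], size
('TIT2', 29)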

What about text encoding? This is specified in section 4.2 of http://id3.org/id3v2.4.0-frames:

All the text information frames have the following format:
 
     <Header for 'Text information frame', ID: "T000" - "TZZZ",
     excluding "TXXX" described in 4.2.6.>
     Text encoding                $xx
     Information                  <text string(s) according to encoding>

http://id3.org/id3v2.4.0-structure informs us about possible encodings:

Frames that allow different types of text encoding contains a text
   encoding description byte. Possible encodings:
 
     $00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
     $01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
           strings in the same frame SHALL have the same byteorder.
           Terminated with $00 00.
     $02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
           Terminated with $00 00.
     $03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.

In the raw data above, after frame type, size and flags we see a 03 byte. According to the specification above, this byte means that the following text is encoded in UTF-8. Hence, the file itself tells us that it contains the title tag encoded in UTF-8.

What follows is the byte representation of the title text, extracted from the dump shown above (frame header and text encoding marker removed). It is important to note that the following byte sequence has been created by Beatport (bytes shown in hex representation, as before):

c3 83 c2 9c 62 65 72 73 70 72 75 6e 67 20
28 4f 72 69 67 69 6e 61 6c 20 4d 69 78 29

Now, just decode this raw byte sequence using the UTF-8 codec and we have our title, right? Let’s see.

Decoding the raw title data: something is wrong.

Using the \x prefix, we can easily get the raw data just shown (which should encode the title text) into a Python (2) byte string:

>>> raw = "\xc3\x83\xc2\x9c\x62\x65\x72\x73\x70\x72\x75\x6e\x67\x20\x28\x4f\x72\x69\x67\x69\x6e\x61\x6c\x20\x4d\x69\x78\x29"

The ID3 tag itself makes us believe that the original text has been encoded using UTF-8, so in order to retrieve the original text, this operation needs to be inverted. This is easily done in Python, by calling the decode() method on a byte string, providing the codec to be used:

>>> raw.decode("utf-8")
u'\xc3\x9cbersprung (Original Mix)'

The data type returned by this operation is a unicode string, i.e. a sequence of characters, not bytes. And this sequence of characters looks flawed. What is that \xc3\x9c thing there, actually? Does it make sense? To be clarified in the next section.

Reverse-engineering the issue

First, let us verify what happened here. We decoded a raw byte sequence via UTF-8 and retrieved two weird unicode code points in the output. This is the inverse process, starting from the two unexpected unicode code points C3 and 9C:

>>> u"\xc3\x9c".encode("utf-8")
'\xc3\x83\xc2\x9c'

The Python code above defines a sequence of unicode code points, and then encodes this “text” using UTF-8, yielding the very same byte sequence contained in the Beatport ID3 raw data which we have seen before. Now we know which “text” they encoded in order to create the meta data in the file they provide for download. But what is that text? We are still missing the German umlaut Ü here, aren’t we? Let us look at the common character representation of these code points:

>>> print u"\xc3\x9c"
Ã▯

By having a look at http://codepoints.net we can clarify what the code points C3 and 9C really represent:

  • U+00C3 LATIN CAPITAL LETTER A WITH TILDE
  • U+009C STRING TERMINATOR

The print statement above attempted to display these characters on my terminal. The A with tilde appears as expected, followed by a rectangle (you might or might not see that here), representing a control character.
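
The unicodedata module confirms the first of the two (U+009C is a control character without a name of its own, so unicodedata.name() raises ValueError for it):

>>> import unicodedata
>>> unicodedata.name(u'\xc3')
'LATIN CAPITAL LETTER A WITH TILDE'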

So now we have identified the actual text that Beatport encoded as UTF-8 and saved in the file as raw byte sequence. The VLC player in the figure at the top is behaving correctly: it decodes this byte sequence using UTF-8 and just displays the resulting characters: the A with the tilde and the control character, which has no glyph, and which is therefore represented with a rectangle.

The question left is: why does Beatport encode invalid text in the first place?

The magic of encoding text multiple times.

When you regularly deal with character encodings you probably have an idea already. I had a suspicion. The correct title text starts with a capital German Umlaut Ü. The unicode codepoint for Ü actually is 00DC. What is the raw byte sequence representation of this code point when using the UTF-8 codec?

>>> u"Ü".encode("utf-8")
'\xc3\x9c'
>>> u"\xdc".encode("utf-8")
'\xc3\x9c'

Right. It is c3 9c in hex notation. You have seen that a minute ago. Confused? Above, we learned that code points C3 and 9C were considered part of the original text, which was then encoded to its UTF-8 representation, i.e. the UTF-8 representations of the characters U+00C3 and U+009C ended up in the raw data. Now, we have learned that the two bytes c3 9c actually encode the character U+00DC in UTF-8. Still confused?

Explanation:

The original text was encoded twice: the raw byte string produced by the first encoding was erroneously interpreted as a unicode code point sequence and then encoded again.

Reproduction of Beatport’s broken text encoding

Let us reproduce this step by step. First, we encode U+00DC (the German Umlaut Ü) to UTF-8:

>>> u"\xdc".encode("utf-8")
'\xc3\x9c'

Now it is time to go into the details of defining unicode literals in Python 2: with the u in front of the literal, Python is instructed to parse the characters in the literal as unicode code points. A code point can be given in different ways. The first 256 unicode code points (there are many more!) can be given in \xHH hex notation; higher ones require \u or \U escapes. This is what happens above: the \xdc is the U+00DC code point in hex notation.
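
For illustration, all of the following notations spell the very same one-code-point string:

>>> u'\xdc' == u'\u00dc' == u'\N{LATIN CAPITAL LETTER U WITH DIAERESIS}'
True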

The output of the above call to encode() is a raw byte string, where the bytes are shown in hex notation. Now we can go ahead and attach a u in front of the raw byte string. This little prefix fundamentally changes the meaning of this string literal. Now, the hex notation does not describe single raw bytes anymore, it describes unicode code points. The two resulting entities are entirely unrelated:

>>> print '\xc3\x9c'
Ü
>>> print u'\xc3\x9c'
Ã▯

The two printed values have nothing meaningful in common, by concept. The first is a byte string, implicitly decoded via the UTF-8 codec by my terminal (careful, that is magic!). The second is a sequence of two unicode code points.

This is like saying “hey, give me item number 195 and 156 from that shelf there, and then also give me number 195 and 156 from the other shelf over there”, whereas the shelves contain entirely different things. All these two statements have in common is the way the “numbers” are represented in hex notation.

It does not matter which programming language Beatport is using for creating the ID3 meta data, but somehow they managed to do a very weird thing: after having the text encoded in UTF-8 (technically it could also have been Latin-1, as Thomas pointed out in the comments, but that is not that likely), they

  • re-interpret that binary data (most likely in hex representation) again as a unicode code point sequence
  • and re-encode this unicode code point sequence again with UTF-8.

With our small example, this is the process:

# Encode the text with UTF-8.
>>> u"Ü".encode("utf-8")
'\xc3\x9c'
 
# Take the hex representation of the raw byte sequence and
# re-interpret it as unicode code point sequence. Encode this
# with UTF-8 again.
>>> u'\xc3\x9c'.encode("utf-8")
'\xc3\x83\xc2\x9c'

The latter is exactly the invalid raw byte sequence found in the ID3 meta data of Beatport’s MP3 file. The last step in reproducing the entire encoding-transfer-decoding chain is to do what an MP3 player would do: decode that data using UTF-8 and display the corresponding characters:

>>> print '\xc3\x83\xc2\x9c'.decode("utf-8")
Ã▯

The above is exactly what happens within e.g. VLC player or any other player that properly parses the ID3 tag data.

Indeed, this is Beatport’s fault. Throughout the entire process of text processing, one needs to be aware of the actual representation of the text. At some point in Beatport’s text processing, a developer assumed text to be a unicode sequence object, whereas it really was a UTF-8-encoded byte string. The lesson is: never make assumptions about the text representation in your code. Always take control of the data and be 100 % sure about the type of text data you are handling.

Otherwise, millions of MP3 downloads end up erroneous.
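
In Python 2 terms, a defensive helper at the data boundary might look like this minimal sketch (the helper name is made up, and UTF-8 input is an assumption):

def ensure_unicode(s):
    # Decode exactly once, at the data boundary; never encode or
    # decode text that is already a unicode object a second time.
    if isinstance(s, unicode):
        return s
    return s.decode("utf-8")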

A systematic fix based on raw_unicode_escape

The process that led to the erroneous raw byte sequence is now well-understood. Fortunately, this process does not involve any loss of information. The information is just in bad shape. With the help of some Python magic we can invert the process.

The issue is that the byte sequence \xc3\x9c was interpreted as a unicode code point sequence, yielding the raw byte sequence \xc3\x83\xc2\x9c after encoding. The Python codec raw_unicode_escape can invert this (kudos to this SO thread):

>>> u'\xc3\x9c'.encode('raw_unicode_escape')
'\xc3\x9c'

Couldn’t we just have taken away the u? Yes, it is that simple, when done manually. But using .encode('raw_unicode_escape') is the only straightforward automatic procedure to achieve the same effect: keep the item representation, change the item meaning from unicode code points to raw bytes.

Likewise, the invalid raw byte sequence can be fixed using this technique:

>>> raw = '\xc3\x83\xc2\x9c'
 
# Decode the byte sequence to a unicode object.
>>> raw.decode("utf-8")
u'\xc3\x9c'
 
# Encode this unicode object, while keeping the item "numbering".
# This yields the UTF-8-encoded text as it was before Beatport
# corrupted it.
>>> raw.decode("utf-8").encode('raw_unicode_escape')
'\xc3\x9c'
 
# Decode that text.
>>> raw.decode("utf-8").encode('raw_unicode_escape').decode("utf-8")
u'\xdc'

As you remember, the code point U+00DC is the Ü. Great! All chained together, and printed:

>>> print '\xc3\x83\xc2\x9c'.decode("utf-8").encode('raw_unicode_escape').decode("utf-8")
Ü

Yes, that’s it: the Ü is restored from the invalid byte sequence, using the knowledge derived above.

Fix the title in an MP3 file using Mutagen

There is an awesome Python module called Mutagen for handling audio file meta data. First of all, let us use Mutagen for directly and comfortably accessing the title data in our MP3 file:

>>> from mutagen.id3 import ID3
>>> data = ID3("test.mp3")
>>> title = data["TIT2"]
>>> title
TIT2(encoding=3, text=[u'\xc3\x9cbersprung (Original Mix)'])
>>> unicode(title)
u'\xc3\x9cbersprung (Original Mix)'

In the above code, unicode(title) yields the same as raw.decode("utf-8") in the section before. Starting from there, we can apply our systematic fix. Loading a Beatport MP3 file, retrieving the title tag, and generating the proper title text in one line:

>>> print unicode(ID3("test.mp3")["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
Übersprung (Original Mix)

All in all, load an MP3 file, generate the corrected title from the invalid one, and save the corrected title back to the file:

>>> from mutagen.id3 import ID3, TIT2
 
# Load ID3 meta data from MP3 file.
>>> data = ID3("test.mp3")
 
# Build corrected title.
>>> corrected_title = unicode(data["TIT2"]).encode('raw_unicode_escape').decode("utf-8")
 
# Update ID3 data object with corrected title.
>>> data.add(TIT2(encoding=3, text=corrected_title))
 
# Write updated ID3 data to MP3 file.
>>> data.save()

After pulling that file into the player, we see that the title issue is fixed:

[Screenshot: the VLC playlist now shows the corrected title]

How to fix all ID3 text frames in all files.

We could now assume that Beatport makes the same mistake with all ID3 text frames. Indeed, I have also seen invalid artist strings. The task would then be to iterate through a collection of files and, for each file, iterate through all ID3 text frames and fix them as shown above. Since I am not sure about the assumption stated before, treat the sketch below as a starting point rather than a finished tool. And back up your MP3 files before!
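
Under that assumption, a minimal sketch could look like this (not battle-tested; the frame filter and the health check are my own heuristics). It walks over all text frames (the “T...” keys), skips frames that cannot stem from the double encoding, and inverts the process derived above:

import glob

from mutagen.id3 import ID3

def fix_all_text_frames(path):
    data = ID3(path)
    for key, frame in data.items():
        # Only text frames ("T...") carry a text attribute.
        if not key.startswith("T") or not hasattr(frame, "text"):
            continue
        texts = [unicode(t) for t in frame.text]
        # Code points above U+00FF cannot result from the double
        # encoding, so such a frame is (probably) healthy.
        if any(ord(c) > 0xff for t in texts for c in t):
            continue
        try:
            fixed = [t.encode('raw_unicode_escape').decode("utf-8")
                     for t in texts]
        except UnicodeDecodeError:
            # Not valid UTF-8 after inversion: leave the frame alone.
            continue
        frame.text = fixed
        frame.encoding = 3
    data.save()

for path in glob.glob("*.mp3"):
    fix_all_text_frames(path)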

Python 2 on Windows: how to read command line arguments containing Unicode code points

While in the Unix world UTF-8 is the de-facto standard for terminal input and output encoding, the situation on Windows is a bit more complex. In general, Windows is even a step ahead compared to Unix systems: Unicode code points in command line arguments are supported natively when using cmd.exe or the Powershell. The Win 32 API has corresponding functions for retrieving such strings as native Unicode data types.

Python 2(.7), however, does not make use of these functions. Instead, it tries to read arguments as byte sequences. Characters not included in the 7-bit ASCII range end up as ? in the byte strings in sys.argv.

Another issue might be that by default Python does not use UTF-8 for encoding characters in the stdout stream (for me, the default stdout encoding is the more limited code page cp437).
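
You can inspect this in an interactive session (cp437 is what I get in cmd.exe; your code page may differ):

>>> import sys
>>> sys.stdout.encoding
'cp437'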

I don’t want to lose too many words now; there are quite reliable workarounds for both issues. The stdout encoding can be enforced with the PYTHONIOENCODING environment variable, and chcp 65001 sets the console code page to a UTF-8-like encoding, so that special characters can be used as command line arguments in a UTF-8-encoded batch file, such as this test.bat:

@chcp 65001 > nul
@set PYTHONIOENCODING=utf-8
python test.py ☺

This is the Python script test.py for printing information about the retrieved command line arguments:

import sys

# win32_unicode_argv() is defined further below.
sys.argv = win32_unicode_argv()
print repr(sys.argv)
for a in sys.argv:
    print a.encode(sys.stdout.encoding)

Open a terminal (cmd.exe) and execute

c:\> test.bat > out

Then have a look into the file out in which we just redirected the stdout stream of the Python script (tell your editor/file viewer to decode the file using UTF-8 and use a proper font having special glyphs!):

c:\> python test.py ☺ 
[u'test.py', u'\u263a']
test.py
☺

As you can see, the items in argv are unicode strings. This is the magic performed by the function win32_unicode_argv() which I will show below. When encoding these unicode strings to sys.stdout.encoding (which here is UTF-8, due to the environment variable PYTHONIOENCODING), the special Unicode code point ☺ becomes properly encoded.

All in all, using chcp 65001 + PYTHONIOENCODING="utf-8" + win32_unicode_argv(), we got a well-behaved information stream from the UTF-8-encoded input file test.bat to the UTF-8-encoded output file out.

This is win32_unicode_argv() which is making use of the ctypes module for using the Win 32 API functions that are provided by Windows for retrieving command line arguments as native Win 32 Unicode strings:

import sys
def win32_unicode_argv():
    # Solution copied from http://stackoverflow.com/a/846931/145400
 
    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR
 
    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR
 
    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)
 
    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

Kudos to http://stackoverflow.com/a/846931/145400.

A command line argument is raw binary data. It comes with limitations and needs interpretation.

How are command line arguments interpreted? Can arbitrary data be exchanged between the calling program on the one hand and the called program on the other hand? Some might ask: are command line interfaces “unicode-aware”?

Command line interfaces comprise an essential class of interfaces used in system architecture, and the above questions deserve precise answers. In this article I try to clarify why command line arguments are nothing but raw byte sequences that deserve proper interpretation in the receiving program. To some degree, the article dives into the topic of character encoding. Towards the end, I provide simple and expressive code examples based on bash and Python. Please note that the article only applies to Unix-like systems. Certain concepts are also true for Windows, but the main difference is that Windows actually has a fully unicode-aware command line argument API (although not all programs make use of it; Python 2, for instance, does not), while Unix-like systems don’t.

Program invocation: behind the scenes it is always execve

On (largely) POSIX-compliant operating systems (e.g. Mac OS, Linuxes including your Android phone, all BSDs), all program invocation scenarios have one system call in common. Eventually, the system call execve() is the entry point for all program invocations on these platforms. It instructs the operating system to run the loader, which prepares the new program for execution and eventually brings it into running state (and leaves it to itself). One argument provided to execve() is a character string — a file system path pointing to the executable file on disk. One task of the loader is to read the program’s machine code from that file and place it into memory.

argv: nothing but raw binary data

Another essential step the loader performs before triggering the actual invocation of the program is to copy “the command line arguments” onto the stack of the new program. These arguments were provided to the loader via the argv argument to execve() — argv means argument vector. Simply put, this is a set of strings. More precisely, each of these strings is a null-terminated C char array.

One could say that each element in a C char array is a character. A character, however, is quite an abstract concept. The Greek Σ is a character, right? Now that we have a real character abstraction, the Unicode code point, we should call each element in a C char array what it is: a byte of raw data. Each element in such an array stores one byte of information. In a null-terminated char array, each byte may assume any value between 0000 0001 (x01) and 1111 1111 (xFF). The first byte with the value 0000 0000 (x00) terminates the “string”.

In essence, the data in argv (which by itself is a pointer to an array of pointers to char arrays) as provided to execve() takes, so to speak, a long journey through the kernel and the loader, and finally ends up as the second argument to the main() function in the new program (the first argument is the argument count). That’s why, when you write a C program, you usually use the following signature for the main function: main(argc, argv).

An argument may contain arbitrary binary data, with certain limitations

  • A command line argument is nothing but a sequence of bytes. These bytes are raw binary data that may mean anything. It is up to the retrieving program to make sense of these bytes (to decode them into something meaningful).
  • Any kind of binary data can be provided within the byte sequence of a command line argument. However, there is one important exception: the x00 byte cannot be included in such a sequence. It always terminates the byte sequence. If x00 is the first or the only byte in the sequence, then the sequence is considered empty.
  • Since argv data is initially and entirely written to the stack of the new program, the total amount of data that may be provided is limited by the operating system. These limits are defined in the system header files. xargs --show-limits can be used to evaluate these limits in a convenient way:
    $ xargs --show-limits
    Your environment variables take up 4478 bytes
    POSIX upper limit on argument length (this system): 2090626
    POSIX smallest allowable upper limit on argument length (all systems): 4096
    Maximum length of command we could actually use: 2086148
    Size of command buffer we are actually using: 131072

    The value “Maximum length of command we could actually use” is about 2 MB (this holds true for my machine and operating system and may be entirely different in other environments; see the Python check below this list).
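
The same limit can also be queried from Python on a POSIX system (a quick check; the number is machine-specific):

import os

# ARG_MAX limits the total size of argv plus the environment passed
# to execve(); a typical Linux value is 2097152 bytes (2 MiB).
print os.sysconf('SC_ARG_MAX')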

Provide argument data to execve() and read it in the new program: in theory

The minimal example for demonstrating the behavior explained above would require two compiled C programs. The first, the recipient, would have a main(argc, argv) function which evaluates the contents of argv and prints them in some human-readable form (in hex representation, for example) to stdout. The second program, the sender, would do the following (a Python sketch of the sender follows the list):

  1. prepare the arguments by setting up certain byte sequences (pointers to char arrays),
  2. call one of the exec*() functions (which wrap the execve() system call), providing the path to the compiled recipient program and argv, a pointer to an array of pointers to char arrays: the arguments.
  3. Upon execution of the execve() system call, the calling process (the sender) is replaced by the new program, which is now the receiver. The operating system loader takes care of copying the argv data to the stack of the new program.
  4. The receiver, compiled against (g)libc, goes through the _start() function (provided by (g)libc) and eventually executes its main(argc, argv) function, which evaluates the byte sequences that we call command line arguments.
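
Since os.execv() in Python 2 wraps the execve() system call, the sender can be sketched in a few lines of Python instead of C. This is just a sketch; it assumes the receiver script dump_args.py (introduced in the next section) sits in the working directory:

import os
import sys

# Arbitrary byte sequences as arguments (a x00 byte would terminate
# an argument, so it cannot be embedded).
args = ["dump_args.py", "\x62", "\xe2\x98\xba"]

# Replace the current process image with the receiver program; by
# convention, argv[0] repeats the program path.
os.execv(sys.executable, [sys.executable] + args)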

argv programming interface? We want an actual command line interface!

Knowing how the above works is useful in order to understand that, internally, command line arguments are just a portion of memory copied to the stack of the main thread of the new program. You might have noticed that in this picture, however, there is no actual command line involved.

So far, we have discussed a programming interface provided by the operating system that enables us to use argv in order to bootstrap a newly invoked program with some input data that is not coming from any file, socket, or other data sources.

The concept of argv quickly translates to the concept of command line arguments. A command line interface is something that enables us to call programs in the established program arg1 arg2 arg3 ... fashion. Right, that is one of the main features a shell provides! This is what happens behind the scenes: the shell translates the arguments provided on the command line to a set of C char arrays, spawns a child process and eventually calls the new program via execve().

In other words, the shell program takes parts of the user input on the command line and translates these parts to C char arrays that it later provides to execve(). Hence, it does all the things that our hypothetical sender program from above would have done (and more).

Provide command line arguments and read them in the new program: simple practice, yei!

An example for a shell is bash. An example for a ‘new program’ is Python (2, in this case). Python 2 is a useful tool in this case, because (in contrast to Python 3) it provides raw access to the byte sequences provided via argv. Consider this example program dump_args.py:

import sys
for a in sys.argv[1:]:
    print repr(a)

We type python dump_args.py b in our command line interface provided by the shell. It makes the shell spawn the Python executable in a child process. This program consumes the first command line argument, which is the path to our little Python script. The remaining arguments (one in this case) are left for our script. They are accessible in the sys.argv list. Each item in sys.argv is of Python 2 type ‘str’. This type carries raw byte sequences. This is the output:

$ python dump_args.py b
'b'

The single quotes ‘ are added by Python’s repr() function (it tries to reveal the true content of the variable a in the Python code — the quotes show where the bytestring starts and ends). The fact that b in the input translates to b in the output seems normal. I don’t want to go into all the details here, but you need to appreciate that the process starting with the keystroke on your keyboard that leads to “b” being displayed on the command line, and ending with “b” being displayed in your terminal as the output of our Python script, involves several encoding and decoding steps, i.e. data interpretation steps. These interpretation steps do not always do the thing you would expect. The common ground of most of these encoders and decoders is the 7-bit ASCII character set (a 2-way translation table between byte values and characters). That is why for simple characters such as “b” things seem to be simple and ‘work’ out of the box. As you will see below, it is not always that simple and often times you need to understand the details of the data interpretation steps involved.

Provide command line arguments and read them in the new program: binary data

From an ASCII character table like this we can infer that the letter b corresponds to the byte value 0110 0010, or x62 in hexadecimal representation. Let us now try to explicitly use this byte value as command line argument to our little Python script.

There is one difficulty: how do you construct arbitrary binary data on the command line? Memorizing the extended 8-bit ASCII character set (i.e. all characters and their corresponding byte values) and typing the matching characters is not an option :-).

There are a couple of possibilities. I like one of them particularly: in bash, the $'...' notation (discussed here) is allowed to be used together with \x00-like escape sequences for constructing arbitrary byte sequences from the hexadecimal notation. Let us create the same output as before, but with a more interesting input:

$ python dump_args.py  $'\x62'
'b'

This worked as expected. The input chain is clear: this command line explicitly instructs the shell to create a certain byte sequence (1 byte long in this case) and provide this as first argument to our script. I guess that the shell internally actually terminates this byte sequence properly with a null byte before calling execve(). sys.argv in our Python script has the same contents as before. Therefore, it does not surprise that the output is the same as before. This example again suggests that there is some magic happening between stdout of the Python script and our terminal. Some decoder expected to retrieve ASCII (or UTF-8, of which ASCII is a subset) as input and correspondingly interpreted this byte as ‘b’ — our terminal displays it as such.

Let us now provide two arguments in explicit binary form. We expect one to translate to “b”, the other to “c” (according to ASCII):

$ python dump_args.py  $'\x62' $'\x63'
'b'
'c'

Cool. Now, I mentioned the null-termination of arguments. Difficult to create with the keyboard, right? Straightforward with the hex notation:

$ python dump_args.py  $'\x00' $'\x63\x00' $'\x63\x00\x63'
''
'c'
'c'

That proves that a null byte actually terminates an argument byte sequence. The first one arrives as an empty byte sequence, because it only contains a null byte. The second and the third one arrive as the single byte \x63 (“c” according to ASCII), because the next byte in the input is a null byte.

More fun? As a matter of fact, the Unicode character ☺ (a smiley) is encoded as the byte sequence \xe2\x98\xba in UTF-8. Send it:

$ python dump_args.py  $'\xe2\x98\xba'
'\xe2\x98\xba'

Python’s repr() prints every single byte in this byte sequence in hex notation. It’s just a fallback to a readable representation when a certain byte is not representable as ASCII character. None of these three bytes has a character correspondence in the 7-bit ASCII table. The fact that both ‘strings’ look the same is because the hex notation for defining the input is the same as the hex notation for representing the output. We could have defined the input with a different notation representing the same byte sequence and would have gotten the same output.

It is clear: to our little Python script these three bytes just look like random binary data. It cannot make sense of it without us defining how to interpret this data. As I said earlier, these three bytes are the UTF-8 encoded form of a smiley. In order to make sense of this data, the Python script needs to decode it. The modified version of the script:

import sys
for a in sys.argv[1:]:
    print repr(a)
    da = a.decode("utf-8")
    print repr(da)
    print da

This is the output:

$ python dump_args.py  $'\xe2\x98\xba'
'\xe2\x98\xba'
u'\u263a'
☺

It first prints the raw representation of the byte string via repr() (the same as before). Secondly, it decodes the data using the explicitly defined codec UTF-8. This leads to a unicode data type da containing a certain code point representing a character. repr(da) tells us the number of this code point. See the 263a? This may not ring a bell for you, but it actually is the abstract and unambiguous description of our character here: http://www.charbase.com/263a-unicode-white-smiling-face. print da then actually makes us see the smiley in the terminal. The fact that this works involves Python being aware of the terminal’s expected character encoding. So when Python prints this unicode data type, it actually encodes it in the encoding as expected by the terminal. The terminal then decodes it again and displays the character (if the terminal font has a glyph for it).

I hope the article made clear that command line arguments are nothing but byte sequences (with certain limitations) that deserve proper interpretation in the receiving program. I intend to report more about the details of Python’s behavior when starting programs with the subprocess module which also allows passing command line arguments from within Python. At this point, Python 2 and 3 behave quite differently.