Category Archives: Linux

Python 2 on Windows: how to read command line arguments containing Unicode code points

While in the Unix world UTF-8 is the de facto standard for terminal input and output encoding, the situation on Windows is a bit more complex. In one respect, Windows is even a step ahead of Unix systems: Unicode code points in command line arguments are supported natively when using cmd.exe or PowerShell. The Win32 API has corresponding functions for retrieving such strings as native Unicode data types.

Python 2(.7), however, does not make use of these functions. Instead, it retrieves the arguments as byte sequences. Characters outside the 7-bit ASCII range end up as ? in the byte strings in sys.argv.

Another issue might be that by default Python does not use UTF-8 for encoding characters in the stdout stream (for me, the default stdout encoding is the more limited code page cp437).
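A quick way to check which encoding your Python 2 build picks for stdout is to print sys.stdout.encoding (nothing more than a small check):

import sys

# When stdout is redirected to a file or a pipe, Python 2 reports None here
# (unless PYTHONIOENCODING is set).
print sys.stdout.encoding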

I don’t want to go into too much detail here; there are quite reliable workarounds for both issues. The stdout encoding can be enforced with the PYTHONIOENCODING environment variable. chcp 65001 sets the console code page to a UTF-8-like encoding, so that special characters can be used as command line arguments in a UTF-8-encoded batch file, such as this test.bat:

@chcp 65001 > nul
@set PYTHONIOENCODING=utf-8
python test.py ☺

This is the Python script test.py for printing information about the retrieved command line arguments:

import sys
# win32_unicode_argv() is defined further below.
sys.argv = win32_unicode_argv()
print repr(sys.argv)
for a in sys.argv:
    print(a.encode(sys.stdout.encoding))

Open a terminal (cmd.exe) and execute

c:\> test.bat > out

Then have a look into the file out into which we just redirected the stdout stream of the Python script (tell your editor/file viewer to decode the file using UTF-8, and use a proper font having the required glyphs!):

c:\> python test.py ☺ 
[u'test.py', u'\u263a']
test.py
☺

As you can see, the items in argv are unicode strings. This is the magic performed by the function win32_unicode_argv(), which I will show below. When encoding these unicode strings to sys.stdout.encoding (which here is UTF-8, as set via the environment variable PYTHONIOENCODING), the special Unicode code point ☺ becomes properly encoded.

All in all, using chcp 65001 + PYTHONIOENCODING="utf-8" + win32_unicode_argv(), we got a well-behaved information stream from the UTF-8-encoded input file test.bat to the UTF-8-encoded output file out.

This is win32_unicode_argv(); it uses the ctypes module to call the Win32 API functions that Windows provides for retrieving the command line arguments as native Unicode strings:

import sys
def win32_unicode_argv():
    # Solution copied from http://stackoverflow.com/a/846931/145400
 
    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR
 
    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR
 
    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)
 
    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

Kudos to http://stackoverflow.com/a/846931/145400.

Save single page from PDF file as PNG image file

In the open source world, the best choice for PDF command line foo (and PDF foo in general) is almost always Ghostscript. This is a quick way to extract a single page from a PDF file and save it as a PNG file with a given resolution (in dpi):

#!/bin/bash
INFILE="$1"
OUTFILE="$2"
PAGE="$3"
RES="$4"
 
gs -dBATCH -dNOPAUSE -sDEVICE=png16m \
    -r$RES \
    -dFirstPage=$PAGE \
    -dLastPage=$PAGE \
    "-sOutputFile=$OUTFILE" \
    "$INFILE"

Example usage:

./pdf_page_to_png.sh input.pdf output_p3.png 3 200
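The same invocation can also be driven from Python via the subprocess module. This is just a sketch (it assumes that gs is on the PATH and uses the same flags as the shell script above):

import subprocess

def pdf_page_to_png(infile, outfile, page, dpi):
    # Same Ghostscript flags as in the shell script above.
    cmd = [
        "gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m",
        "-r%d" % dpi,
        "-dFirstPage=%d" % page,
        "-dLastPage=%d" % page,
        "-sOutputFile=%s" % outfile,
        infile,
    ]
    subprocess.check_call(cmd)

pdf_page_to_png("input.pdf", "output_p3.png", 3, 200)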

A command line argument is raw binary data. It comes with limitations and needs interpretation.

How are command line arguments interpreted? Can arbitrary data be exchanged between the calling program on the one hand and the called program on the other hand? Some might ask: are command line interfaces “unicode-aware”?

Command line interfaces comprise an essential class of interfaces used in system architecture, and the above questions deserve precise answers. In this article I try to clarify why command line arguments are nothing but raw byte sequences that deserve proper interpretation in the receiving program. To some degree, the article dives into the topic of character encoding. Towards the end, I provide simple and expressive code examples based on bash and Python. Please note that the article only applies to Unix-like systems. Certain concepts also hold for Windows, but the main difference is that Windows actually has a fully Unicode-aware command line argument API (although not all programs make use of it, Python 2 being one example), while Unix-like systems do not.

Program invocation: behind the scenes it is always execve

On (largely) POSIX-compliant operating systems (e.g. Mac OS, Linuxes including your Android phone, all BSDs), all program invocation scenarios have one system call in common: eventually, execve() is the entry point for every program invocation on these platforms. It instructs the operating system to run the loader, which prepares the new program for execution and eventually brings it into running state (and then leaves it to itself). One argument provided to execve() is a character string, a file system path pointing to the executable file on disk. One task of the loader is to read the program’s machine code from that file and place it into memory.

argv: nothing but raw binary data

Another essential step the loader performs before triggering the actual invocation of the program is to copy “the command line arguments” onto the stack of the new program. These arguments were provided to the loader via the argv argument to execve(); argv means argument vector. Simply put, this is a list of strings. More precisely, each of these strings is a null-terminated C char array.

One could say that each element in a C char array is a character. A character, however, is quite an abstract concept. The Greek Σ is a character, right? In times of real character abstractions, the Unicode code points, we should call each element in a C char array what it really is: a byte of raw data. Each element in such an array stores one byte of information. In a null-terminated char array, each byte may assume any value between 0000 0001 (x01) and 1111 1111 (xFF). The first byte with the value 0000 0000 (x00) terminates the “string”.

In essence, the data in argv (which by itself is a pointer to an array of pointers to char arrays) as provided to execve() takes, so to speak, a long journey through the kernel and the loader, and finally ends up as the second argument to the main() function in the new program (the first argument is the argument count). That is why, when you write a C program, you usually use the following signature for the main function: main(argc, argv).

An argument may contain arbitrary binary data, with certain limitations

  • A command line argument is nothing but a sequence of bytes. These bytes are raw binary data that may mean anything. It is up to the retrieving program to make sense of these bytes (to decode them into something meaningful).
  • Any kind of binary data can be provided within the byte sequence of a command line argument. However, there is one important exception: the x00 byte cannot be included in such a sequence. It always terminates the byte sequence. If x00 is the first or the only byte in the sequence, then the sequence is considered empty.
  • Since argv data is initially and entirely written to the stack of the new program, the total amount of data that may be provided is limited by the operating system. These limits are defined in the system header files. xargs --show-limits can be used to evaluate these limits in a convenient way:
    $ xargs --show-limits
    Your environment variables take up 4478 bytes
    POSIX upper limit on argument length (this system): 2090626
    POSIX smallest allowable upper limit on argument length (all systems): 4096
    Maximum length of command we could actually use: 2086148
    Size of command buffer we are actually using: 131072

    The value “Maximum length of command we could actually use” is about 2 MB (this holds true for my machine and operating system and may be entirely different in other environments).
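The raw kernel limit can also be queried from within Python via os.sysconf(). A quick check (Unix only; the usable length for arguments is smaller, since the environment is copied to the same memory area):

import os

# SC_ARG_MAX: the POSIX limit for the combined size of argv and the
# environment that execve() accepts.
print os.sysconf("SC_ARG_MAX")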

Provide argument data to execve() and read it in the new program: in theory

The minimal example for demonstrating the behavior explained above would require two compiled C programs. The first, the recipient, would have a main(argc, argv) function which evaluates the contents of argv and prints them in some human-readable form (in hex representation, for example) to stdout. The second program, the sender, would

  • 1) prepare the arguments by setting up certain byte sequences (char arrays).
  • 2) call one of the exec*() functions (which wrap the execve() system call). It would provide the path to the compiled recipient program, and argv: a pointer to an array of pointers to char arrays, the arguments.
  • 3) upon execution of the execve() system call, the calling process (the sender) is replaced by the new program, which is now the receiver. The operating system loader takes care of copying the argv data to the stack of the new program.
  • 4) The receiver, compiled against (g)libc, goes through the _start() function (provided by (g)libc) and eventually executes its main(argc, argv) function, which evaluates the byte sequences that we call command line arguments.
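Instead of two compiled C programs, the sender side of this thought experiment can also be sketched with Python’s thin wrappers around these system calls. A minimal sketch (assuming a Unix system; /bin/echo acts as a stand-in receiver that simply prints the arguments it was handed):

import os

pid = os.fork()
if pid == 0:
    # Child process: replace it with /bin/echo via os.execv(), Python's
    # wrapper around the execve() system call. By convention, the first
    # element of the argument list is the program name; the remaining
    # elements are the raw byte sequences that the loader copies onto the
    # stack of the new program.
    os.execv("/bin/echo", ["echo", "\x62", "\x63"])
else:
    # Parent process (the "sender"): wait for the receiver to terminate.
    os.waitpid(pid, 0)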

argv programming interface? We want an actual command line interface!

Knowing how the above works is useful in order to understand that, internally, command line arguments are just a portion of memory copied to the stack of the main thread of the new program. You might have noticed that in this picture, however, there is no actual command line involved.

So far, we have discussed a programming interface provided by the operating system that enables us to use argv in order to bootstrap a newly invoked program with some input data that is not coming from any file, socket, or other data sources.

The concept of argv quickly translates to the concept of command line arguments. A command line interface is something that enables us to call programs in the established program arg1 arg2 arg3 ... fashion. Right, that is one of the main features a shell provides! This is what happens behind the scenes: the shell translates the arguments provided on the command line to a set of C char arrays, spawns a child process and eventually calls the new program via execve().

In other words, the shell program takes parts of the user input on the command line and translates these parts to C char arrays that it later provides to execve(). Hence, it does all the things that our hypothetical sender program from above would have done (and more).

Provide command line arguments and read them in the new program: simple practice, yei!

An example for a shell is bash. An example for a ‘new program’ is Python (2, in this case). Python 2 is a useful tool in this case, because (in contrast to Python 3) it provides raw access to the byte sequences provided via argv. Consider this example program dump_args.py:

import sys
for a in sys.argv[1:]:
    print repr(a)

We type python dump_args.py b in our command line interface provided by the shell. It makes the shell spawn the Python executable in a child process. This program consumes the first command line argument, which is the path to our little Python script. The remaining arguments (one in this case) are left for our script. They are accessible in the sys.argv list. Each item in sys.argv is of Python 2 type ‘str’. This type carries raw byte sequences. This is the output:

$ python dump_args.py b
'b'

The single quotes ‘ are added by Python’s repr() function (it tries to reveal the true content of the variable a in the Python code; the quotes show where the byte string starts and ends). The fact that b in the input translates to b in the output seems normal. I don’t want to go into all the details here, but you need to appreciate that the process starting with the keystroke on your keyboard that leads to “b” being displayed on the command line, and ending with “b” being displayed in your terminal as the output of our Python script, involves several encoding and decoding steps, i.e. data interpretation steps. These interpretation steps do not always do the thing you would expect. The common ground of most of these encoders and decoders is the 7-bit ASCII character set (a 2-way translation table between byte values and characters). That is why for simple characters such as “b” things seem to be simple and ‘work’ out of the box. As you will see below, it is not always that simple, and oftentimes you need to understand the details of the data interpretation steps involved.

Provide command line arguments and read them in the new program: binary data

From an ASCII character table like this we can infer that the letter b corresponds to the byte value 0110 0010, or x62 in hexadecimal representation. Let us now try to explicitly use this byte value as command line argument to our little Python script.

There is one difficulty: how do you construct arbitrary binary data on the command line? Having the extended 8-bit ASCII character set in mind (i.e. all characters and their corresponding byte values) is not an option :-).

There are a couple of possibilities. I like one of them particularly: in bash, the $'...' notation (discussed here) can be used together with \x00-like escape sequences for constructing arbitrary byte sequences in hexadecimal notation. Let us create the same output as before, but with a more interesting input:

$ python dump_args.py  $'\x62'
'b'

This worked as expected. The input chain is clear: this command line explicitly instructs the shell to create a certain byte sequence (1 byte long in this case) and provide it as the first argument to our script. I guess that the shell internally actually terminates this byte sequence properly with a null byte before calling execve(). sys.argv in our Python script has the same contents as before. Therefore, it is no surprise that the output is the same as before. This example again suggests that there is some magic happening between stdout of the Python script and our terminal: some decoder expected to retrieve ASCII (or UTF-8, of which ASCII is a subset) as input and correspondingly interpreted this byte as ‘b’, and our terminal displays it as such.

Let us now provide two arguments in explicit binary form. We expect one to translate to “b”, the other to “c” (according to ASCII):

$ python dump_args.py  $'\x62' $'\x63'
'b'
'c'

Cool. Now, I mentioned the null termination of arguments. Difficult to create with the keyboard, right? Straightforward with the hex notation:

$ python dump_args.py  $'\x00' $'\x63\x00' $'\x63\x00\x63'
''
'c'
'c'

That proves that a null byte actually terminates an argument byte sequence. The first argument arrives as an empty byte sequence, because it only contains a null byte. The second and the third ones arrive as the single byte \x63 (“c” according to ASCII), because the next byte in the input is a null byte.

More fun? As a matter of fact, the Unicode character ☺ (a smiley) is encoded as the byte sequence \xe2\x98\xba in UTF-8. Send it:

$ python dump_args.py  $'\xe2\x98\xba'
'\xe2\x98\xba'

Python’s repr() prints every single byte of this byte sequence in hex notation. It is just a fallback to a readable representation when a byte is not representable as an ASCII character; none of these three bytes has a character correspondence in the 7-bit ASCII table. The fact that both ‘strings’ look the same is because the hex notation for defining the input is the same as the hex notation for representing the output. We could have defined the input with a different notation representing the same byte sequence and would have gotten the same output.

It is clear: to our little Python script these three bytes just look like random binary data. It cannot make sense of it without us defining how to interpret this data. As I said earlier, these three bytes are the UTF-8 encoded form of a smiley. In order to make sense of this data, the Python script needs to decode it. The modified version of the script:

import sys
for a in sys.argv[1:]:
    print repr(a)
    da = a.decode("utf-8")
    print repr(da)
    print da

This is the output:

$ python dump_args.py  $'\xe2\x98\xba'
'\xe2\x98\xba'
u'\u263a'
☺

It first prints the raw representation of the byte string via repr() (the same as before). Secondly, it decodes the data using the explicitly defined codec UTF-8. This leads to a unicode data type da containing a certain code point representing a character. repr(da) tells us the number of this code point. See the 263a? This may not ring a bell for you, but it actually is the abstract and unambiguous description of our character here: http://www.charbase.com/263a-unicode-white-smiling-face. print da then actually makes us see the smiley in the terminal. The fact that this works involves Python being aware of the terminal’s expected character encoding. So when Python prints this unicode data type, it actually encodes it in the encoding as expected by the terminal. The terminal then decodes it again and displays the character (if the terminal font has a glyph for it).

I hope the article made clear that command line arguments are nothing but byte sequences (with certain limitations) that deserve proper interpretation in the receiving program. I intend to report more about the details of Python’s behavior when starting programs with the subprocess module which also allows passing command line arguments from within Python. At this point, Python 2 and 3 behave quite differently.

Filter web server logs for missing file errors (a grep, sort, uniq example)

I had the suspicion that some (image) files belonging to some pages of my web presence might not be in the proper place anymore. To find these cases systematically, I had a look into my nginx error logs and saw various missing file errors, such as

2014/01/23 09:18:18 [error] 22000#0: *3901754 open() "/XXX/apple-touch-icon.png" failed (2: No such file or directory), client: 173.245.53.224, server: gehrcke.de, request: "GET /apple-touch-icon.png HTTP/1.1", host: "gehrcke.de"

The above is perfectly normal (Apple devices by default poll for apple-touch-icon* files). For finding real problems, I then wanted to filter all missing file errors and sort them by frequency. This one-liner helps:

cat nginx_error.log | grep -Eo 'open\(\) "/.+" failed' | sort | uniq --count | sort -nk 1 | less

At the tail of the output I found that I am really missing two important image files on the server that belong to one blog post:

2681 open() "/XXX/wp/blog_content/websocket_test_empty.png" failed
2688 open() "/XXX/wp/blog_content/websocket_test.png" failed

The above one-liner works in the following way:

  • First, cat writes the nginx error log to stdout.
  • Then, grep reads these lines from stdin and processes them line by line, looking for a pattern interpreted in extended regex mode (-E option). It writes lines containing the pattern to stdout, although due to the -o option it only writes the part of the line corresponding to the matched pattern, not the entire line.
  • sort brings these lines into order, i.e. repetitive occurrences (duplicate lines) become adjacent to each other.
  • uniq --count merges repetitive occurrences into one single occurrence and adds the number of occurrences to the beginning of the merged line.
  • sort -nk 1 sorts these merged lines by the first whitespace-separated field (-k 1 option) in numerical mode (-n option), in ascending order.
  • The final less visualizes the outcome. Go to the end of the output and you find the most frequent missing file errors.
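For the record, the same aggregation can be done with a few lines of Python. A sketch (it assumes the log fits into memory and prints the most frequent entries first, i.e. in the opposite order of the shell pipeline above):

import re
from collections import Counter

with open("nginx_error.log") as f:
    # Same pattern as in the grep command above; non-greedy here, since we
    # match against the whole file content instead of single lines.
    matches = re.findall(r'open\(\) "/.+?" failed', f.read())

for message, count in Counter(matches).most_common():
    print count, message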

Up again

I had some bad luck. Last week, my hosting provider performed a previously announced routine maintenance on the server running this website. During this maintenance, they managed to destroy the RAID (thanks, guys). All data was lost. I did not have a very recent remote system snapshot and therefore needed to re-create the entire system from scratch.

I had put Cloudflare in front of my website a couple of months ago already, so at first I thought: hey, that gives me some time to restore things while the incident is largely transparent to the visitors of my website. I had activated Cloudflare’s “always online” feature months ago. That’s basically just a cache that jumps in when the backend is down. But wait, what does “down” mean? Cloudflare says that “always online” is triggered when the backend sends a 502 or 504 type response. Currently, when the backend is just dead (not responsive), this cache does not serve anything at all (in a support ticket, they said that they will “add this” in the future). My server went down at night, and by the time I had a system and web server running again (and also sending a 502 response in order to trigger Cloudflare’s “always online”), about 10 hours after the machine went down, the Cloudflare cache already assumed that my website does not exist anymore. Pah, “always online” was useless. So, I am sorry, the downtime lasted quite long. I had to put all components back together manually.

The good thing is that I had up-to-date remote backups of my WordPress database (via Dropbox), my nginx config and other configs (via Bitbucket), and a more or less up-to-date remote backup of the directory structure behind my website. While setting things up again, I used the opportunity to restructure my website a bit. I am now running a modified version of WordPress’ TwentyTwelve theme, for a cleaner appearance and a certain responsiveness.

A few hours before all data was lost, I wrote another blog post. That one was not contained in the latest (daily) WordPress database backup performed before the crash. When I realized that, the first thing I did was cp -a ing my Firefox and Chrome browser caches on the machine where I was writing that post. I then started digging in these caches in order to find residual pieces of the article’s content. And I found a golden piece: Chrome had cached a gzipped HTTP response containing the final version of my article, found via the chrome://cache/ list. Chrome displays the contents in a hexdump -C fashion. I copied this text and used a Python script to parse the dump, re-create the binary data, and unzip this data. Based on the resulting HTML view of my article, I could quickly add it again to the WordPress system.
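For the curious: such a recovery script does not need to be fancy. A rough sketch of what it can look like (cache_dump.txt is a hypothetical file containing the copied hexdump -C style text; the gzip handling assumes the cached response body was gzip-compressed, as it was in my case):

import re
import gzip
from StringIO import StringIO

raw_bytes = []
with open("cache_dump.txt") as f:
    for line in f:
        # Drop the trailing ASCII column of a hexdump -C style line...
        hex_part = line.split("|")[0]
        # ...and the leading offset column, keeping only two-digit hex
        # tokens, each of which represents a single byte.
        tokens = hex_part.split()[1:]
        raw_bytes.extend(chr(int(t, 16)) for t in tokens
                         if re.match(r"^[0-9a-fA-F]{2}$", t))

data = "".join(raw_bytes)
# Decompress the reconstructed gzipped HTTP response body.
html = gzip.GzipFile(fileobj=StringIO(data)).read()
print html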

Some lessons learned:

  • Don’t trust scheduled routine maintenance, perform a backup *right* before that.
  • Cloudflare cache does not help in the situation where you need it most (that’s the current situation, at least).
  • Caches can still be of essential help in such a worst-case scenario. I recovered an article from the browser cache. I use the Google cache in order to see if something is still missing on the new version of my website.
  • Google Webmaster Tools is pretty convenient in the sense that it informs you about crawling errors; I frequently check its interface and realize that there are still some missing pieces of my web presence (mostly files).
  • Using (remote) code repositories for configuration stuff is the best you can do. First of all, it’s perfect bookkeeping if done properly. Secondly, if regularly pushed, it’s the best configuration backup you can have.

If you are still missing a file here and there or find some dead links, please don’t hesitate to notify me. Thanks.