A command line argument is raw binary data. It comes with limitations and needs interpretation.

How are command line arguments interpreted? Can arbitrary data be exchanged between the calling program on the one hand and the called program on the other hand? Some might ask: are command line interfaces “unicode-aware”?

Command line interfaces comprise an essential class of interfaces used in system architecture and the above questions deserve precise answers. In this article I try to clarify why command line arguments are nothing but raw byte sequences that deserve proper interpretation in the receiving program. To some degree, the article dives into the topic of character encoding. Towards the end, I provide simple and expressive code examples based on bash and Python. Please note that the article only applies to Unix-like systems. Certain concepts are also true for Windows, but the main difference is that Windows actually has a fully unicode-aware command line argument API (but not all programs make use of it, such as Python 2) while Unix-like systems don’t.

Program invocation: behind the scenes it is always execve

On (largely) POSIX-compliant operating systems (e.g. Mac OS, Linuxes including your Android phone, all BSDs), all program invocation scenarios have one system call in common. Eventually, the system call execve() is the entry point for all program invocations on these platforms. It instructs the operating system to run the loader, which prepares the new program for execution and eventually brings it into running state (and leaves it to itself). One argument provided to execve() is a character string — a file system path pointing to the executable file on disk. One task of the loader is to read the program’s machine code from that file and place it into memory.

argv: nothing but raw binary data

Another essential step the loader performs before triggering the actual invocation of the program is to copy “the command line arguments” on the stack of the new program. These arguments were provided to the loader via the argv argument to execve() — argv means argument vector. Simply spoken, this is a set of strings. More precisely, each of these strings is a null-terminated C char array.

One could say that each element in a C char array is a character. A character, however, has quite an abstract meaning. The greek Σ is a character, right? In times of real character abstractions, the Unicode code points, we should call each element in a C char array what it is: it is a byte of raw data. Each element in such an array stores one byte of information. In a null-terminated char array, each byte may assume all values between 0000 0001 (x01) and 1111 1111 (xFF). The first byte with the value 0000 0000 (x00) terminates the “string”.

In essence, the data in argv (which by itself is a pointer to an array of pointers to type char arrays) as provided to execve() so to say takes a long journey through the kernel, the launcher, and finally ends up as second argument to the main() function in the new program (the first argument is the argument count). That’s why, when you write a C program, you usually use the following signature for the main function: main(argc, argv).

An argument may contain arbitrary binary data, with certain limitations

A command line argument is nothing but a sequence of bytes. These bytes are raw binary data that may mean anything. It is up to the retrieving program to make sense of these bytes (to decode them into something meaningful).

Any kind of binary data can be provided within the byte sequence of a command line argument. However, there is one important exception: the x00 byte cannot be included in such a sequence. It always terminates the byte sequence. If x00 is the first or the only byte in the sequence, then the sequence is considered empty.

Since argv data is initially and entirely written to the stack of the new program, the total amount of data that may be provided is limited by the operating system. These limits are defined in the system header files. xargs --show-limits can be used to evaluate these limits in a convenient way:
$ xargs --show-limits Your environment variables take up 4478 bytes POSIX upper limit on argument length (this system): 2090626 POSIX smallest allowable upper limit on argument length (all systems): 4096 Maximum length of command we could actually use: 2086148 Size of command buffer we are actually using: 131072
The value “Maximum length of command we could actually use” is about 2 MB (this holds true for my machine and operating system and may be entirely different in other environments).

Provide argument data to execve() and read it in the new program: in theory

The minimal example for demonstrating the behavior explained above would require two compiled C programs. The first, the recipient, would have a main(argc, argv) function which evaluates the contents of argv and prints them in some human readable form to stdout (in Hex representation, for example). The second program, the sender, would

1) prepare the arguments by setting up certain byte sequences (pointers to type char arrays).
2) call one of the exec*() system calls (that actually wrap the execve() system call). It would provide the path to the compiled recipient program, and argv — a pointer to an array of pointers to type char arrays: the arguments.
3) upon execution of the execve() system call the calling process (the sender) becomes replaced by the new program, which is now the receiver. The operating system launcher takes care of copying the argv data to the stack of the new program.
4) The receiver, compiled against (g)libc, goes through the _start() function (provided by (g)libc) and eventually executes its main(argc, argv) function, which evaluates the byte sequences that we call command line arguments.

argv programming interface? We want an actual command line interface!

Knowing how the above works is useful in order to understand that, internally, command line arguments are just a portion of memory copied to the stack of the main thread of the new program. You might have noticed that in this picture, however, there is no actual command line involved.

So far, we have discussed a programming interface provided by the operating system that enables us to use argv in order to bootstrap a newly invoked program with some input data that is not coming from any file, socket, or other data sources.

The concept of argv quickly translates to the concept of command line arguments. A command line interface is something that enables us to call programs in the established program arg1 arg2 arg3 ... fashion. Right, that is one of the main features a shell provides! This is what happens behind the scenes: the shell translates the arguments provided on the command line to a set of C char arrays, spawns a child process and eventually calls the new program via execve().

In other words, the shell program takes parts of the user input on the command line and translates these parts to C char arrays that it later provides to execve(). Hence, it does all the things that our hypothetical sender program from above would have done (and more).

Provide command line arguments and read them in the new program: simple practice, yei!

An example for a shell is bash. An example for a ‘new program’ is Python (2, in this case). Python 2 is a useful tool in this case, because (in contrast to Python 3) it provides raw access to the byte sequences provided via argv. Consider this example program dump_args.py:

import sys
for a in sys.argv[1:]:
    print repr(a)

We type python dump_args.py b in our command line interface provided by the shell. It makes the shell spawn the Python executable in a child process. This program consumes the first command line argument, which is the path to our little Python script. The remaining arguments (one in this case) are left for our script. They are accessible in the sys.argv list. Each item in sys.argv is of Python 2 type ‘str’. This type carries raw byte sequences. This is the output:

$ python dump_args.py b
'b'

The single quotes ‘ are added by Python’s repr() function (it tries to reveal the true content of the variable a in the Python code — the quotes show where the bytestring starts and ends). The fact that b in the input translates to b in the output seems normal. I don’t want to go into all the details here, but you need to appreciate that the process starting with the keystroke on your keyboard that leads to “b” being displayed on the command line, and ending with “b” being displayed in your terminal as the output of our Python script, involves several encoding and decoding steps, i.e. data interpretation steps. These interpretation steps do not always do the thing you would expect. The common ground of most of these encoders and decoders is the 7-bit ASCII character set (a 2-way translation table between byte values and characters). That is why for simple characters such as “b” things seem to be simple and ‘work’ out of the box. As you will see below, it is not always that simple and often times you need to understand the details of the data interpretation steps involved.

Provide command line arguments and read them in the new program: binary data

From an ASCII character table like this we can infer that the letter b corresponds to the byte value 0110 0010, or x62 in hexadecimal representation. Let us now try to explicitly use this byte value as command line argument to our little Python script.

There is one difficulty: on the command line — how do you construct arbitrary binary data? Having the extended 8-bit ASCII character set in mind (i.e. all characters and their corresponding byte values) is not an option :-).

There are a couple of possibilities. I like one of them particularly: in bash, the $'...' notation (discussed here) is allowed to be used together with \x00-like escape sequences for constructing arbitrary byte sequences from the hexadecimal notation. Let us create the same output as before, but with a more interesting input:

$ python dump_args.py  $'\x62'
'b'

This worked as expected. The input chain is clear: this command line explicitly instructs the shell to create a certain byte sequence (1 byte long in this case) and provide this as first argument to our script. I guess that the shell internally actually terminates this byte sequence properly with a null byte before calling execve(). sys.argv in our Python script has the same contents as before. Therefore, it does not surprise that the output is the same as before. This example again suggests that there is some magic happening between stdout of the Python script and our terminal. Some decoder expected to retrieve ASCII (or UTF-8, of which ASCII is a subset) as input and correspondingly interpreted this byte as ‘b’ — our terminal displays it as such.

Let us now provide two arguments in explicit binary form. We expect one to translate to “b”, the other to “c” (according to ASCII):

$ python dump_args.py  $'\x62' $'\x63'
'b'
'c'

Cool. Now, I mentioned the null-termination of arguments. Difficult to create with the keyboard, right? Straight-forward with the hex notation:

$ python dump_args.py  $'\x00' $'\x63\x00' $'\x63\x00\x63'
''
'c'
'c'

That proves that a null byte actually terminates an argument byte sequence. The first one arrives as an empty byte sequence, because it only contains a null byte. The second and the third one arrives as single byte \x63 (“c” according to ASCII), because the next byte in the input is a null byte.

More fun? For a fact, the Unicode character ☺ (a smiley) is encoded with the byte sequence \xe2\x98\xba in UTF-8. Send it:

$ python dump_args.py  $'\xe2\x98\xba'
'\xe2\x98\xba'

Python’s repr() prints every single byte in this byte sequence in hex notation. It’s just a fallback to a readable representation when a certain byte is not representable as ASCII character. None of these three bytes has a character correspondence in the 7-bit ASCII table. The fact that both ‘strings’ look the same is because the hex notation for defining the input is the same as the hex notation for representing the output. We could have defined the input with a different notation representing the same byte sequence and would have gotten the same output.

It is clear: to our little Python script these three bytes just look like random binary data. It cannot make sense of it without us defining how to interpret this data. As I said earlier, these three bytes are the UTF-8 encoded form of a smiley. In order to make sense of this data, the Python script needs to decode it. The modified version of the script:

import sys
for a in sys.argv[1:]:
    print repr(a)
    da = a.decode("utf-8")
    print repr(da)
    print da

This is the output:

$ python dump_args.py  $'\xe2\x98\xba'
'\xe2\x98\xba'
u'\u263a'
☺

It first prints the raw representation of the byte string via repr() (the same as before). Secondly, it decodes the data using the explicitly defined codec UTF-8. This leads to a unicode data type da containing a certain code point representing a character. repr(da) tells us the number of this code point. See the 263a? This may not ring a bell for you, but it actually is the abstract and unambiguous description of our character here: http://www.charbase.com/263a-unicode-white-smiling-face. print da then actually makes us see the smiley in the terminal. The fact that this works involves Python being aware of the terminal’s expected character encoding. So when Python prints this unicode data type, it actually encodes it in the encoding as expected by the terminal. The terminal then decodes it again and displays the character (if the terminal font has a glyph for it).

I hope the article made clear that command line arguments are nothing but byte sequences (with certain limitations) that deserve proper interpretation in the receiving program. I intend to report more about the details of Python’s behavior when starting programs with the subprocess module which also allows passing command line arguments from within Python. At this point, Python 2 and 3 behave quite differently.

Jan-Philip Gehrcke, PhD