Category Archives: C/C++

GnuTLS vulnerability: is unit testing a matter of language culture?

You have probably heard about this major security issue in GnuTLS, publicly announced on March 3, 2014, with the following words in a patch note on the GnuTLS mailing list:

This fixes is an important (and at the same time embarrassing) bug
discovered during an audit for Red Hat. Everyone is urged to upgrade.

The official security advisory describes the issue in these general terms:

A vulnerability was discovered that affects the certificate verification functions of all gnutls versions. A specially crafted certificate could bypass certificate validation checks. The vulnerability was discovered during an audit of GnuTLS for Red Hat.

Obviously, media and tech bloggers pointed out the significance of this issue. If you are interested in the technical details, I would like to recommend a well-written article on LWN on the topic: A longstanding GnuTLS certificate validation botch. As it turns out, the bug was introduced by a code change that refactored the error/success communication between functions. Generally speaking, the problem is that two communication partners went out of sync: when the sender sent ‘Careful, error!’, the recipient actually understood ‘Cool, success.’. Bah. We are used to a modern, test-driven development culture. Consequently, most of us immediately think “WTF, don’t they test their code?”.

An automated test suite should have immediately spotted that invalid commit, right? But wait a second: that faulty commit was pushed in the year 2000, the language we are talking about is C, and unit testing for C is not exactly established. Given that — did you really, honestly, expect a C code base that reaches back more than a decade to be covered by ideal unit tests, by modern standards? No? Me neither (although I would have expected a security-relevant library such as GnuTLS to be under significant third-party test coverage — does everybody trust the authors?).

We seem to excuse, or at least acknowledge and tolerate, that old system software written in C is not well-tested by modern standards of test-driven development. For sure, there is modern software out there applying ideal testing strategies — but having only a few users. At the same time, old software is circulating, used by millions, but not applying modern testing strategies. Why is that? And should we tolerate it? There was an interesting discussion about this topic right underneath the above-mentioned LWN article. I’d like to quote one comment that I particularly agree with, although it asks more questions than it provides answers:

> In addition to the culture of limited testing you alluded to,
> I think there are some language issues here as well

Yes, true. But I wonder if discussing type systems is also a
distraction from the more pressing issue here? After all, even
with all the help of Haskell’s type system, you *will* still
have bugs.

It seems to me that the lack of rigorous testing was:
(a) The most immediate cause of these bugs
(b) More common in projects written in C

I find it frustrating that discussions of these issues continually
drift towards language wars, rather than towards modern ideas about
unit testing, software composability, test-driven development, and
code coverage tracking.

Aren’t these the more pressing questions?
(1) Where are the GnuTLS unit tests, so I can review and add more?
(2) Where is the new regression test covering this bug?
(3) What is the command to run a code coverage tool on the test
suite, so that I can see what coverage is missing?

Say what you will about “toy” languages, but that is what would
happen in any halfway mature Ruby or Python or Javascript project,
and I’m happy to provide links to back that up.

Say what you will about the non-systems languages on the JVM, but
that is also what would happen in any halfway mature Scala, Java,
or Clojure project.

It’s only in C, the systems language in which so many of these
vital libraries are written, that this is not the case. Isn’t it
time to ask why?

Someone answered, and I think this view makes sense:

For example, I suspect that the reason “C culture” seems impervious to adopting the lessons of test-driven development has a lot to do with the masses of developers who are interested in it, by following your advice, are moving to other languages and practicing it there.

In other words, by complecting the issue of unit testing and test coverage with the choice of language, are we not actively *contributing* to the continuing absence of these ideas from C culture, and thus from the bulk of our existing systems?

Food for thought, at least, I hope!

I agree: the effort for improved testing of old, but essential, C libraries must come from the open source community. Someone has to do it.

A command line argument is raw binary data. It comes with limitations and needs interpretation.

How are command line arguments interpreted? Can arbitrary data be exchanged between the calling program on the one hand and the called program on the other hand? Some might ask: are command line interfaces “unicode-aware”?

Command line interfaces comprise an essential class of interfaces used in system architecture, and the above questions deserve precise answers. In this article I try to clarify why command line arguments are nothing but raw byte sequences that deserve proper interpretation in the receiving program. To some degree, the article dives into the topic of character encoding. Towards the end, I provide simple and expressive code examples based on bash and Python. Please note that the article only applies to Unix-like systems. Certain concepts also hold for Windows, but the main difference is that Windows actually has a fully unicode-aware command line argument API (which not all programs make use of — Python 2, for instance, does not), while Unix-like systems don’t.

Program invocation: behind the scenes it is always execve

On (largely) POSIX-compliant operating systems (e.g. Mac OS, Linuxes including your Android phone, all BSDs), all program invocation scenarios have one system call in common: execve() is ultimately the entry point for every program invocation on these platforms. It instructs the operating system to run the loader, which prepares the new program for execution and eventually brings it into the running state (and then leaves it to itself). One argument provided to execve() is a character string — a file system path pointing to the executable file on disk. One task of the loader is to read the program’s machine code from that file and place it into memory.

argv: nothing but raw binary data

Another essential step the loader performs before triggering the actual invocation of the program is to copy “the command line arguments” onto the stack of the new program. These arguments were provided to the loader via the argv argument to execve() — argv means argument vector. Simply put, this is a set of strings. More precisely, each of these strings is a null-terminated C char array.

One could say that each element in a C char array is a character. A character, however, is quite an abstract concept. The Greek Σ is a character, right? In times of a real character abstraction — Unicode code points — we should call each element in a C char array what it really is: a byte of raw data. Each element in such an array stores one byte of information. In a null-terminated char array, each byte may assume any value between 0000 0001 (x01) and 1111 1111 (xFF). The first byte with the value 0000 0000 (x00) terminates the “string”.
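This can be made concrete with a minimal C sketch (the byte values below are arbitrary and serve only as an illustration):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Raw bytes in a char array; the x00 byte acts as terminator. */
    char data[] = {0x62, 0x00, 0x63, 0x00};
    printf("%zu\n", strlen(data));  /* prints 1: strlen() stops at the first x00 */
    printf("%s\n", data);           /* prints "b": the x63 behind the x00 is never considered */
    return 0;
}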

In essence, the data in argv (which by itself is a pointer to an array of pointers to char arrays) as provided to execve() takes a long journey through the kernel and the loader, and finally ends up as the second argument to the main() function in the new program (the first argument is the argument count). That’s why, when you write a C program, you usually use the following signature for the main function: main(argc, argv).

An argument may contain arbitrary binary data, with certain limitations

  • A command line argument is nothing but a sequence of bytes. These bytes are raw binary data that may mean anything. It is up to the receiving program to make sense of these bytes (to decode them into something meaningful).
  • Any kind of binary data can be provided within the byte sequence of a command line argument. However, there is one important exception: the x00 byte cannot be included in such a sequence. It always terminates the byte sequence. If x00 is the first or the only byte in the sequence, then the sequence is considered empty.
  • Since argv data is initially and entirely written to the stack of the new program, the total amount of data that may be provided is limited by the operating system. These limits are defined in the system header files. xargs --show-limits can be used to evaluate these limits in a convenient way:
    $ xargs --show-limits
    Your environment variables take up 4478 bytes
    POSIX upper limit on argument length (this system): 2090626
    POSIX smallest allowable upper limit on argument length (all systems): 4096
    Maximum length of command we could actually use: 2086148
    Size of command buffer we are actually using: 131072

    The value “Maximum length of command we could actually use” is about 2 MB (this holds true for my machine and operating system and may be entirely different in other environments). The same limit can also be queried programmatically, as sketched right after this list.
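Here is a minimal C sketch for that, assuming a POSIX system: sysconf(_SC_ARG_MAX) reports the upper limit for the combined size of the argument and environment data handed to execve().

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Upper limit for the combined size of argv and environment data. */
    long arg_max = sysconf(_SC_ARG_MAX);
    if (arg_max == -1) {
        perror("sysconf");
        return 1;
    }
    printf("ARG_MAX: %ld bytes\n", arg_max);
    return 0;
}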

Provide argument data to execve() and read it in the new program: in theory

The minimal example for demonstrating the behavior explained above would require two compiled C programs. The first one, the recipient, would have a main(argc, argv) function which evaluates the contents of argv and prints them in some human-readable form (hex representation, for example) to stdout. The second program, the sender, would (a sketch of both programs follows after this list):

  1) prepare the arguments by setting up certain byte sequences (pointers to char arrays).
  2) call one of the exec*() functions (which wrap the execve() system call), providing the path to the compiled recipient program as well as argv (a pointer to an array of pointers to char arrays: the arguments).
  3) Upon execution of the execve() system call, the calling process (the sender) is replaced by the new program, which is now the receiver. The operating system loader takes care of copying the argv data to the stack of the new program.
  4) The receiver, compiled against (g)libc, goes through the _start() function (provided by (g)libc) and eventually executes its main(argc, argv) function, which evaluates the byte sequences that we call command line arguments.
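The following is a minimal sketch of both programs. The file names and the path ./recipient are made up for this example; the sender assumes that the compiled recipient is available under that path. The recipient prints each received argument as a sequence of hex byte values. The sender hands a fixed set of byte sequences to execv(), a thin wrapper around execve() that passes the caller's environment along. Note that argv[0] carries the program name by convention, and that the argument array must be terminated by a NULL pointer.

/* recipient.c: print each command line argument as a sequence of hex byte values. */
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;
    for (i = 1; i < argc; i++) {
        const unsigned char *p = (const unsigned char *) argv[i];
        printf("argv[%d]:", i);
        for (; *p != 0; p++)
            printf(" %02x", *p);
        printf("\n");
    }
    return 0;
}

/* sender.c: replace this process with ./recipient, handing over raw byte sequences. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char arg0[] = "./recipient";     /* argv[0]: the program name, by convention */
    char arg1[] = "\x62";            /* one byte: 'b' in ASCII */
    char arg2[] = "\xe2\x98\xba";    /* three bytes: a UTF-8-encoded smiley */
    char *args[] = {arg0, arg1, arg2, NULL};

    execv(args[0], args);            /* does not return on success */
    perror("execv");                 /* reached only if execv() failed */
    return 1;
}

Running the compiled sender should produce output along the lines of argv[1]: 62 and argv[2]: e2 98 ba, i.e. exactly the byte values that the sender placed into argv.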

argv programming interface? We want an actual command line interface!

Knowing how the above works is useful in order to understand that, internally, command line arguments are just a portion of memory copied to the stack of the main thread of the new program. You might have noticed that in this picture, however, there is no actual command line involved.

So far, we have discussed a programming interface provided by the operating system that enables us to use argv in order to bootstrap a newly invoked program with some input data that does not come from a file, a socket, or any other data source.

The concept of argv quickly translates to the concept of command line arguments. A command line interface is something that enables us to call programs in the established program arg1 arg2 arg3 ... fashion. Right, that is one of the main features a shell provides! This is what happens behind the scenes: the shell translates the arguments provided on the command line to a set of C char arrays, spawns a child process and eventually calls the new program via execve().

In other words, the shell program takes parts of the user input on the command line and translates these parts to C char arrays that it later provides to execve(). Hence, it does all the things that our hypothetical sender program from above would have done (and more).
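For illustration, here is a heavily simplified C sketch of that part of a shell’s job, reusing the hypothetical ./recipient program from the previous section: the “shell” forks a child process, the child replaces itself with the new program via execv(), and the parent waits for the child to terminate. A real shell does much more, of course (tokenizing and quoting of the command line, redirections, job control, and so on):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* The arguments as a shell would have assembled them from the
       command line input: ./recipient b */
    char arg0[] = "./recipient";
    char arg1[] = "b";
    char *args[] = {arg0, arg1, NULL};

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: become the new program. */
        execv(args[0], args);
        perror("execv");   /* reached only if execv() failed */
        _exit(127);
    }
    /* Parent (the "shell"): wait for the child to terminate. */
    int status;
    waitpid(pid, &status, 0);
    return 0;
}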

Provide command line arguments and read them in the new program: simple practice, yay!

An example of a shell is bash. An example of a ‘new program’ is Python (2, in this case). Python 2 is a useful tool here, because (in contrast to Python 3) it provides raw access to the byte sequences provided via argv. Consider this example program dump_args.py:

import sys
for a in sys.argv[1:]:
    print repr(a)

We type python dump_args.py b into the command line interface provided by our shell. This makes the shell spawn the Python executable in a child process. The Python interpreter consumes the first command line argument, which is the path to our little Python script. The remaining arguments (one in this case) are left to our script. They are accessible via the sys.argv list. Each item in sys.argv is of Python 2 type ‘str’, a type that carries raw byte sequences. This is the output:

$ python dump_args.py b
'b'

The single quotes ‘ are added by Python’s repr() function (it tries to reveal the true content of the variable a in the Python code — the quotes show where the byte string starts and ends). The fact that b in the input translates to b in the output seems normal. I don’t want to go into all the details here, but you need to appreciate that the process starting with the keystroke on your keyboard that leads to “b” being displayed on the command line, and ending with “b” being displayed in your terminal as the output of our Python script, involves several encoding and decoding steps, i.e. data interpretation steps. These interpretation steps do not always do what you would expect. The common ground of most of these encoders and decoders is the 7-bit ASCII character set (a two-way translation table between byte values and characters). That is why, for simple characters such as “b”, things seem to be simple and ‘work’ out of the box. As you will see below, it is not always that simple, and often you need to understand the details of the data interpretation steps involved.

Provide command line arguments and read them in the new program: binary data

From an ASCII character table like this we can infer that the letter b corresponds to the byte value 0110 0010, or x62 in hexadecimal representation. Let us now try to explicitly use this byte value as a command line argument to our little Python script.

There is one difficulty: how do you construct arbitrary binary data on the command line? Memorizing the extended 8-bit ASCII character set (i.e. all characters and their corresponding byte values) and typing the matching characters is not an option :-).

There are a couple of possibilities. I particularly like one of them: in bash, the $'...' notation (discussed here) can be used together with \xNN-style escape sequences to construct arbitrary byte sequences from hexadecimal notation. Let us create the same output as before, but with a more interesting input:

$ python dump_args.py  $'\x62'
'b'

This worked as expected. The input chain is clear: this command line explicitly instructs the shell to create a certain byte sequence (1 byte long in this case) and provide it as the first argument to our script. Presumably, the shell internally terminates this byte sequence properly with a null byte before calling execve(). sys.argv in our Python script has the same contents as before. Therefore, it is not surprising that the output is the same as before. This example again suggests that there is some magic happening between stdout of the Python script and our terminal: some decoder expected to retrieve ASCII (or UTF-8, of which ASCII is a subset) as input and correspondingly interpreted this byte as ‘b’ — our terminal displays it as such.

Let us now provide two arguments in explicit binary form. We expect one to translate to “b”, the other to “c” (according to ASCII):

$ python dump_args.py  $'\x62' $'\x63'
'b'
'c'

Cool. Now, I mentioned the null termination of arguments. Difficult to create with the keyboard, right? Straightforward with the hex notation:

$ python dump_args.py  $'\x00' $'\x63\x00' $'\x63\x00\x63'
''
'c'
'c'

That proves that a null byte actually terminates an argument byte sequence. The first argument arrives as an empty byte sequence, because it contains only a null byte. The second and the third arguments each arrive as the single byte \x63 (“c” according to ASCII), because the next byte in the input is a null byte.

More fun? As a matter of fact, the Unicode character ☺ (a smiley) is encoded by the byte sequence \xe2\x98\xba in UTF-8. Send it:

$ python dump_args.py  $'\xe2\x98\xba'
'\xe2\x98\xba'

Python’s repr() prints every single byte of this byte sequence in hex notation. It is just a fallback to a readable representation when a certain byte is not representable as an ASCII character. None of these three bytes has a character correspondence in the 7-bit ASCII table. The fact that both ‘strings’ look the same is because the hex notation used for defining the input is the same as the hex notation used for representing the output. We could have defined the input with a different notation representing the same byte sequence and would have gotten the same output.

It is clear: to our little Python script, these three bytes just look like random binary data. It cannot make sense of them without us defining how this data is to be interpreted. As I said earlier, these three bytes are the UTF-8-encoded form of a smiley. In order to make sense of this data, the Python script needs to decode it. This is the modified version of the script:

import sys
for a in sys.argv[1:]:
    print repr(a)             # the raw byte string, as received via argv
    da = a.decode("utf-8")    # interpret the bytes using the UTF-8 codec
    print repr(da)            # a unicode object; repr() shows the code point
    print da                  # gets encoded for the terminal upon printing

This is the output:

$ python dump_args.py  $'\xe2\x98\xba'
'\xe2\x98\xba'
u'\u263a'
☺

It first prints the raw representation of the byte string via repr() (the same as before). Second, it decodes the data using the explicitly specified codec UTF-8. This yields a unicode object da containing a certain code point that represents a character. repr(da) tells us the number of this code point. See the 263a? This may not ring a bell for you, but it actually is the abstract and unambiguous description of our character here: http://www.charbase.com/263a-unicode-white-smiling-face. print da then actually makes us see the smiley in the terminal. For this to work, Python needs to be aware of the character encoding expected by the terminal. When Python prints this unicode object, it encodes it in the encoding expected by the terminal. The terminal then decodes it again and displays the character (if the terminal font has a glyph for it).

I hope this article made clear that command line arguments are nothing but byte sequences (with certain limitations) that deserve proper interpretation in the receiving program. I intend to report more about the details of Python’s behavior when starting programs with the subprocess module, which also allows passing command line arguments from within Python. In this regard, Python 2 and 3 behave quite differently.

Reading files line by line in C++ using ifstream: dealing correctly with badbit, failbit, eofbit, and perror()

Motivated by this POV-Ray issue, I tried to find a reliable way to read a file line by line in C++ using std::ifstream in combination with std::getline(). While doing so, the goal was to handle all underlying stream errors as well as file opening errors, and to emit error messages that are as precise as possible. In a high-level programming language such as Python, this level of reliability and usability is not difficult to achieve. In C++, however, this turned out to be a rather complex topic.

Proper handling of the stream error bits eofbit, failbit, and badbit requires a tremendous amount of care, as discussed for example here, here, and here, and finally at cplusplus.com. It is worth mentioning that although cplusplus.com is a convenient reference, it does not provide us with a rock-solid solution for the above-stated problem and also does not mention all the important details.

When it comes to providing meaningful error messages, things become quite complicated. Properly evaluating errno, or using perror(), in response to the stream error bits is not a trivial task, as can be inferred from discussions like this and this. From these discussions we learn that most of the related uncertainty stems from a lack of centralized, or even existing, documentation. The exact behavior of C++ code with respect to file handling and stream manipulation is defined by an intertwining of the language specification (C++ in this case), the operating system interface (e.g. POSIX), and low-level APIs (provided by e.g. libc) — they all are documented in different places and to different extents. We expect, for example, that when fopen() returns NULL, errno is set to something meaningful. But where is this actually documented?

In order to understand the relation between the language and operating system constructs involved, I did quite a bit of research and testing. Of course there are many obvious and non-obvious ways to write unreliable code. As expected, there are also best practices, or “recipes”, to follow for writing reliable code. To name, explain, and share those with the community is the goal of this article. We all know that re-using established recipes saves time and improves software quality in the long term.

Update (January 18th, 2015): In the meantime, this article has made it into the top Google search results for “c++ read file ifstream”. It is one of the most-visited articles on my website. Thanks for commenting and sharing!

Update (July 7th, 2011): I revised the article after an important insight provided by Alexandre Duret-Lutz (see the comments).

Continue reading