Prompted by this POV-Ray issue, I tried to figure out a reliable way to read a file line by line in C++ using std::ifstream in combination with std::getline(). A further goal was to emit error messages that are as precise as possible.
As it turned out, proper handling of the stream error bits eofbit, failbit, and badbit requires a tremendous amount of care, as discussed for example here, here, and here, and finally at cplusplus.com. It is worth mentioning that although cplusplus.com is a convenient reference, it does not provide us with a rock-solid solution for the above-stated problem and also does not mention all the important details.
When it comes to meaningful error messages, things become even more complicated. Proper evaluation of errno, respectively perror(), in response to the stream error bits is not a trivial task as can be inferred from discussions like this and this. From these discussions we learn that most of the related uncertainty comes from a lack of centralized documentation or even missing documentation. The exact behavior of C++ code with respect to file handling and stream manipulation is defined by an intertwining of language specification (C++ in this case), operating system interface (e.g. POSIX) and low-level APIs (provided by e.g. libc) — they all are documented in different places and to a different extent. We for example expect that when fopen() returns NULL, errno is set to something meaningful. But where is this actually documented?
In order to understand the relation between the language and operating system constructs involved, I performed quite some research and testing. Of course there are many obvious and non-obvious ways to write unreliable code. As expected, there also are one or two “best ways” or “recipes” to follow. To explain and name the latter ones is the goal of this article.
Update (July 7th, 2011): I revised the whole article after an important insight provided by Alexandre Duret-Lutz (see the comments).
If you just want to have a look at the results of this small investigation, I recommend scrolling down to the ideal solutions section. Otherwise, before continuing, you should make yourself briefly familiar with eofbit, failbit, badbit of the ios class.
Note: all the code shown in this post can also be downloaded in a tarball.
Obey the two rules of ifstream iteration
The task is to iteratively process the lines read from a file by means of an ifstream (why ifstream?). Therefore, we try to open a file by invoking ifstream s ("file"). For trying to get a line from the file, we use std::getline(s, line), where line is a std::string to store the data to. The goal is to process the data read from the file, line by line, via process(line). Of course, we want to call process(line) only if the preceding getline() was able to extract meaningful data and store it in line. Usually it is also the goal to get the last line of the file, even if it is not terminated by a newline character.
After the investigation described below, I am pretty sure that the simplest rock-solid language construct for this task is:
string line;
ifstream f ("file");
while(getline(f, line)) {
    process(&line);
}
This is so simple and at the same time good, because it is the shortest approach following the two basic rules we must follow when using std::getline() (or any other I/O operation):
- Before processing data obtained from the stream, check for errors reported by getline() (this holds true for any other IO operation on streams).
- If getline() (or any other IO operation on a stream) has set the stream’s failbit or badbit, do not process the data. eofbit is not required to be checked in the loop and does not necessarily have to prevent data processing.
The origin of these rules will become clearer while reading the rest of the article.
Only two rules to follow. Isn’t that easy? Still, this is often not done, as you can infer from the links in the introduction. In fact, not following these rules led to the bug in POV-Ray mentioned in the beginning.
How does the simple code snippet above follow these rules? The loop first tries to obtain data from the stream via the IO operation getline(). It is totally okay to try this even on a bad/empty/non-existing file, because it just tries and afterwards sets the stream’s error bits correctly, as defined here. After getline(), failbit and badbit are checked via the ifstream’s bool operator: getline() actually returns the stream object, which is evaluated in a bool expression in the loop header. Only if neither bit is set can one be sure that there is data in line. In this case the loop body is evaluated. It processes the data obtained from the stream. Then, upon the next loop iteration, the loop tries to read the next line, and so on. For each iteration, the chronological order of IO operation, error check, and data processing is preserved.
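The ordering described above (IO operation, then error check, then data processing) can be tried out without any file, since std::getline() works on every std::istream. The following is only a minimal sketch: the helper name read_lines is hypothetical and not part of the downloadable code, and a std::istringstream stands in for the ifstream.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects every line that getline() managed to extract.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    // getline() returns the stream. In a boolean context the stream is
    // true only if neither failbit nor badbit is set, so the loop body
    // never sees a line from a failed read.
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

Feeding it a std::istringstream holding "first\nsecond" should yield two entries, the second one being the unterminated last line, while an empty input should yield none.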
Do you wonder why we do not need to check the eofbit within the loop? This is answered further below.
Now, how does the code snippet above behave when the file exists but is empty? Or if it does not even exist? If it is a directory? If the executing process is not allowed to access the file? It just does not enter the loop body. It does not read the file. The code snippet above cannot be surprised. It deals with all types of errors transparently.
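A small sketch of this transparent behavior, assuming a hypothetical helper count_lines that is not part of the downloadable code: when the file cannot be opened, the stream already has its failbit set, the first getline() fails, and the loop body is never entered.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: counts the lines that could be read from a path.
// A file that cannot be opened simply results in zero iterations.
std::size_t count_lines(const std::string& path) {
    std::ifstream f(path.c_str());
    std::string line;
    std::size_t n = 0;
    while (std::getline(f, line)) {
        ++n;
    }
    return n;
}
```

Calling count_lines() with a path that does not exist should return 0 without any special-case code for the failed open.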
Transparent error handling is good. Sometimes, however, meaningful error messages must be emitted. How to do that? According to my findings, the following snippet is the best that can be done:
string line;
ifstream f ("file");
if (!f.is_open())
    perror("error while opening file");
while(getline(f, line)) {
    process(&line);
}
if (f.bad())
    perror("error while reading file");
Why? Discussed in the next part.
How to catch errors specifically? Testing ifstream’s behavior
Let me start with
Two important things to know:
- Consider a call to std::getline() detecting the end of file. It then sets eofbit. But: “Notice that some eofbit cases will also set failbit.” (reference). This will be very important and we will figure out in which cases exactly we have either only eofbit or both eofbit and failbit set.
- perror() evaluates the current setting of errno and prints a meaningful error message. errno is a global error variable which is set by low-level functions of your current operating system. An errno setting is sticky: it stays until the next error happens, overwriting the state of the last error. Therefore, perror() must only be called in a context that for sure has updated errno right before. Otherwise, the printed error message may not make any sense at all in the current context.
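Because errno is sticky, one defensive pattern is to clear it before the operation of interest and to copy it immediately afterwards. The sketch below illustrates this for opening a file. The helper name open_error_message is hypothetical, and the code assumes that the standard library's open call sets errno on failure, which the C++ standard does not promise.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical helper: returns an empty string if the file could be
// opened, otherwise the system's description of errno as it was right
// after the failed open.
// Assumption: the library's open call sets errno on failure.
std::string open_error_message(const std::string& path) {
    errno = 0;                    // discard any stale error state
    std::ifstream f(path.c_str());
    const int saved = errno;      // copy before another call can overwrite it
    if (f.is_open()) {
        return std::string();
    }
    return std::string(std::strerror(saved));
}
```

On a typical Linux system, a non-existing path should produce a text such as "No such file or directory". The exact wording depends on the platform.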
As you already can imagine, for providing meaningful error messages, it is required to understand when exactly the eofbit, failbit and badbit are set. Also, one has to know when exactly it is safe to call perror() in the context of stream methods. Unfortunately, at this point we enter system-dependency and the proper documentation(s) are difficult to find or even missing. In order to understand the behavior of my system (a 2.6.27 Linux at the time of writing this article), I went down the empirical path and implemented test cases. All source files are provided in a tarball and it will be very easy for you to run these tests on your system.
The test suite:
The test suite can be summarized as follows:
It starts off with ifstream s ("file") and then checks the state of the stream via
- s.is_open()
- s.fail() (same as !s: check for failbit and badbit)
- s.bad() (check for only badbit)
- s.eof() (check for only eofbit)
- errno (evaluated via perror())
while opening/reading
- a non-existing file
- an empty file
- an existing file with content
- an existing file with the last line not terminated by a newline character (could be considered being an invalid file format, since lines mostly are considered to be newline-terminated, not newline-separated).
- a file with content that is opened by another process for reading
- a file with content that is opened by another process for writing
- a file that the test program has no access to
- a directory
Basically, the test evaluates the named quantities at all interesting points and especially after calls to std::getline().
Technically, the test consists of:
- The C++ source of a test program with debug output. It expects an input filename as first command line argument.
- A bash script that is compiling the C++ source code of the test program and setting up the test files for the test. It runs the compiled test program against various input filenames.
This is the shell script (readfile_tests.sh):
#!/bin/bash
COMPILATION_SOURCE=$1
NE_FILE="na"
EMPTY_FILE="empty_file"
ONE_LINE_FILE="one_line_file"
INVALID_LINE_FILE="invalid_line_file"
FILE_READ="file_read"
FILE_WRITTEN="file_written"
FILE_DENIED="/root/.bashrc"
DIR="dir"

# compile test program, resulting in a.out executable
g++ $COMPILATION_SOURCE

# create test files / directories and put them in the desired state
touch $EMPTY_FILE
if [[ ! -d $DIR ]]; then
    mkdir $DIR
fi
echo "rofl" > $ONE_LINE_FILE
echo -ne "validline\ninvalidline" > $INVALID_LINE_FILE
echo "i am opened to read from" > $FILE_READ
python -c 'import time; f = open("'$FILE_READ'"); time.sleep(4)' &
echo "i am opened to write to" > $FILE_WRITTEN
python -c 'import time; f = open("'$FILE_WRITTEN'", "a"); time.sleep(4)' &

# execute test cases
echo "******** testing on non-existent file.."
./a.out $NE_FILE
echo
echo "******** testing on empty file.."
./a.out $EMPTY_FILE
echo
echo "******** testing on valid file with one line content"
./a.out $ONE_LINE_FILE
echo
echo "******** testing on a file with one valid and one invalid line"
./a.out $INVALID_LINE_FILE
echo
echo "******** testing on a file that is read by another process"
./a.out $FILE_READ
echo
echo "******** testing on a file that is written to by another process"
./a.out $FILE_WRITTEN
echo
echo "******** testing on a /root/.bashrc (access should be denied)"
./a.out $FILE_DENIED
echo
echo "******** testing on a directory"
./a.out $DIR
This is the source of the C++ program readfile_debug.cpp:
#include <cstdio>   // perror()
#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int check_error_bits(ifstream* f) {
    int stop = 0;
    if (f->eof()) {
        perror("stream eofbit. error state");
        // EOF after std::getline() is not the criterion to stop processing
        // data: In case there is data between the last delimiter and EOF,
        // getline() extracts it and sets the eofbit.
        stop = 0;
    }
    if (f->fail()) {
        perror("stream failbit (or badbit). error state");
        stop = 1;
    }
    if (f->bad()) {
        perror("stream badbit. error state");
        stop = 1;
    }
    return stop;
}

int main(int argc, char* argv[]) {
    string line;
    int getlinecount = 1;

    if(argc != 2) {
        cerr << "provide one argument" << endl;
        return 1;
    }
    cout << "* trying to open and read: " << argv[1] << endl;
    ifstream f (argv[1]);
    perror("error state after ifstream constructor");
    if (!f.is_open())
        perror("is_open() returned false. error state");
    else
        cout << "is_open() returned true." << endl;

    cout << "* checking error bits once before first getline" << endl;
    check_error_bits(&f);
    while(1) {
        cout << "* perform getline() # " << getlinecount << endl;
        getline(f, line);
        cout << "* checking error bits after getline" << endl;
        if (check_error_bits(&f)) {
            cout << "* skip operation on data, break loop" << endl;
            break;
        }
        // This is the actual operation on the data obtained and we want to
        // protect it from errors during the last IO operation on the stream
        cout << "data line " << getlinecount << ": " << line << endl;
        getlinecount++;
    }
    f.close();
    return 0;
}
Let’s run it:
$ ./readfile_tests.sh readfile_debug.cpp
The output:
******** testing on non-existent file..
* trying to open and read: na
error state after ifstream constructor: No such file or directory
is_open() returned false. error state: No such file or directory
* checking error bits once before first getline
stream failbit (or badbit). error state: No such file or directory
* perform getline() # 1
* checking error bits after getline
stream failbit (or badbit). error state: No such file or directory
* skip operation on data, break loop

******** testing on empty file..
* trying to open and read: empty_file
error state after ifstream constructor: Success
is_open() returned true.
* checking error bits once before first getline
* perform getline() # 1
* checking error bits after getline
stream eofbit. error state: Success
stream failbit (or badbit). error state: Success
* skip operation on data, break loop

******** testing on valid file with one line content
* trying to open and read: one_line_file
error state after ifstream constructor: Success
is_open() returned true.
* checking error bits once before first getline
* perform getline() # 1
* checking error bits after getline
data line 1: rofl
* perform getline() # 2
* checking error bits after getline
stream eofbit. error state: Success
stream failbit (or badbit). error state: Success
* skip operation on data, break loop

******** testing on a file with one valid and one invalid line
* trying to open and read: invalid_line_file
error state after ifstream constructor: Success
is_open() returned true.
* checking error bits once before first getline
* perform getline() # 1
* checking error bits after getline
data line 1: validline
* perform getline() # 2
* checking error bits after getline
stream eofbit. error state: Success
data line 2: invalidline
* perform getline() # 3
* checking error bits after getline
stream eofbit. error state: Success
stream failbit (or badbit). error state: Success
* skip operation on data, break loop

******** testing on a file that is read by another process
* trying to open and read: file_read
error state after ifstream constructor: Success
is_open() returned true.
* checking error bits once before first getline
* perform getline() # 1
* checking error bits after getline
data line 1: i am opened to read from
* perform getline() # 2
* checking error bits after getline
stream eofbit. error state: Success
stream failbit (or badbit). error state: Success
* skip operation on data, break loop

******** testing on a file that is written to by another process
* trying to open and read: file_written
error state after ifstream constructor: Success
is_open() returned true.
* checking error bits once before first getline
* perform getline() # 1
* checking error bits after getline
data line 1: i am opened to write to
* perform getline() # 2
* checking error bits after getline
stream eofbit. error state: Success
stream failbit (or badbit). error state: Success
* skip operation on data, break loop

******** testing on a /root/.bashrc (access should be denied)
* trying to open and read: /root/.bashrc
error state after ifstream constructor: Permission denied
is_open() returned false. error state: Permission denied
* checking error bits once before first getline
stream failbit (or badbit). error state: Permission denied
* perform getline() # 1
* checking error bits after getline
stream failbit (or badbit). error state: Permission denied
* skip operation on data, break loop

******** testing on a directory
* trying to open and read: dir
error state after ifstream constructor: Success
is_open() returned true.
* checking error bits once before first getline
* perform getline() # 1
* checking error bits after getline
stream failbit (or badbit). error state: Is a directory
stream badbit. error state: Is a directory
* skip operation on data, break loop
The test results (important things to know: part 2):
There are many things to learn from this output. The following conclusions are only a subset. All of this makes a lot of sense:
- The ifstream s ("file") constructor sets errno when opening fails (in the tests: a non-existing file and a file without access permission).
- is_open() does not set errno.
- is_open() does not catch the case when trying to open a directory.
- is_open() only catches the cases in which opening itself fails: the non-existing file and, as the /root/.bashrc test shows, the file the program has no access to.
Conclusion: perror() right after is_open() right after ifstream construction is safe. According to the test, two problems may be identified via this method: a non-existing file and denied access. Hence, the error message can be made precise.
Other observations:
- In almost all test cases, the eofbit has been set at the same time as the failbit (verifying “Notice that some eofbit cases will also set failbit.” as stated above). A closer look reveals that the failbit is only set by getline() if it did not manage to extract any data at all. Note that this is a regular scenario, when the last character in a file is a line delimiter. The eofbit on the other hand means that getline() reached EOF while searching for the next line delimiter: If there is data between the last delimiter and EOF, getline() extracts this data and sets eofbit.
- The badbit is only set in case of trying to get a line from a directory.
- getline() only changes errno in case of trying to get a line from a directory. In all other error cases it does not change errno.
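The eofbit and failbit combinations observed above can be reproduced without files. The following sketch uses a std::istringstream instead of an ifstream; the helper names state_flags and flags_after_getlines are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: names the error bits that are currently set.
std::string state_flags(const std::ios& s) {
    std::string out;
    if (s.eof()) out += "eofbit ";
    // fail() is also true for badbit, so test failbit itself via rdstate().
    if ((s.rdstate() & std::ios::failbit) != 0) out += "failbit ";
    if (s.bad()) out += "badbit ";
    if (out.empty()) return "goodbit";
    out.erase(out.size() - 1);  // drop the trailing space
    return out;
}

// Calls getline() the given number of times on the content and reports
// the stream state after the last call.
std::string flags_after_getlines(const std::string& content, int calls) {
    std::istringstream in(content);
    std::string line;
    for (int i = 0; i < calls; ++i) {
        std::getline(in, line);
    }
    return state_flags(in);
}
```

For the content "abc" (no trailing newline), one call should report only eofbit, because data was extracted before EOF was hit. For "abc\n", the first call should leave the stream good, and the second call, which extracts nothing, should report eofbit together with failbit.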
Conclusion 1: When getline() on stream s has evaluated to False, i.e. !s and s.fail() are True, do not blindly use perror() to print an error message, because it is likely to be wrong in the current context. This is because the bool evaluation of the stream is sensitive to both badbit and failbit. Since failbit may occur in common cases, it is not qualified for detecting an exceptional state (although its name suggests so). Only a set badbit identifies an exception. Therefore, perror() right after an I/O operation on a stream must be preceded by a positive s.bad() evaluation.
Conclusion 2: In order to process residual data between the last line delimiter and EOF, a positive eofbit must not prevent data processing.
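One way to express Conclusion 1 in code is to look at the stream state once the read loop has ended and to treat only badbit as a real read error. This is just a sketch under that assumption: the enum ReadOutcome and the function drain_and_classify are hypothetical names, and the badbit in the usage description is set by hand because a string stream does not produce one on its own.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical classification of why the read loop ended.
enum ReadOutcome { ReachedEnd, ReadError, StoppedEarly };

// Reads the stream line by line, then inspects which bit ended the loop.
ReadOutcome drain_and_classify(std::istream& in) {
    std::string line;
    while (std::getline(in, line)) {
        // process(&line) would go here
    }
    if (in.bad()) return ReadError;   // the only case where perror() is justified
    if (in.eof()) return ReachedEnd;  // ordinary end of input
    return StoppedEarly;              // failbit alone, e.g. the stream was never usable
}
```

A stream over "a\nb\n" should end as ReachedEnd. A stream on which setstate(std::ios::badbit) was called beforehand should end as ReadError, and one that only had failbit set beforehand, similar to a file that could not be opened, should end as StoppedEarly.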
Ideal solutions
With the knowledge from above, ideal code solutions in form of ready-to-compile-examples can be proposed for two cases:
- one, in which error messages are not important
- one, in which we do the best we can to extract error messages
Ideal solution including meaningful error messages
It was shown that with C++’s standard means it is difficult to catch specific errors. The following readfile_stable_errors.cpp tries to provide as precise error messages as possible:
#include <cstdio>   // perror()
#include <iostream>
#include <fstream>
#include <string>

using namespace std;

void process(string* line) {
    cout << "line read: " << *line << endl;
}

int main(int argc, char* argv[]) {
    string line;

    if(argc != 2) {
        cerr << "One argument is required." << endl;
        return 1;
    }
    string filename(argv[1]);

    cout << "* trying to open and read: " << filename << endl;
    ifstream f (argv[1]);

    // After this attempt to open a file, we can safely use perror() only
    // in case f.is_open() returns False.
    if (!f.is_open())
        perror(("error while opening file " + filename).c_str());

    // Read the file via std::getline(). Rules obeyed:
    // - first the I/O operation, then error check, then data processing
    // - failbit and badbit prevent data processing, eofbit does not
    while(getline(f, line)) {
        process(&line);
    }

    // Only in case of set badbit we are sure that errno has been set in
    // the current context. Use perror() to print error details.
    if (f.bad())
        perror(("error while reading file " + filename).c_str());

    f.close();
    return 0;
}
Of course this can be run against the test shell script from above:
./readfile_tests.sh readfile_stable_errors.cpp
The output:
******** testing on non-existent file..
* trying to open and read: na
error while opening file na: No such file or directory

******** testing on empty file..
* trying to open and read: empty_file

******** testing on valid file with one line content
* trying to open and read: one_line_file
line read: rofl

******** testing on a file with one valid and one invalid line
* trying to open and read: invalid_line_file
line read: validline
line read: invalidline

******** testing on a file that is read by another process
* trying to open and read: file_read
line read: i am opened to read from

******** testing on a file that is written to by another process
* trying to open and read: file_written
line read: i am opened to write to

******** testing on a /root/.bashrc (access should be denied)
* trying to open and read: /root/.bashrc
error while opening file /root/.bashrc: Permission denied

******** testing on a directory
* trying to open and read: dir
error while reading file dir: Is a directory
Congratulations, “no such file or directory”, “is a directory”, and “permission denied” are caught. Also, the data in the “invalid” line was read.
Ideal solution without printing error messages:
The following source of readfile_stable_no_errors.cpp deals with all errors transparently and extracts residual data from an “invalid” last line:
#include <iostream>
#include <fstream>
#include <string>

using namespace std;

void process(string* line) {
    cout << "line read: " << *line << endl;
}

int main(int argc, char* argv[]) {
    string line;

    if(argc != 2) {
        cerr << "provide one argument" << endl;
        return 1;
    }

    cout << "* trying to open and read: " << argv[1] << endl;
    ifstream f (argv[1]);

    // Note that we can omit checking for f.is_open(), because
    // all errors will be caught correctly by f.fail() (!f) and
    // we do not want to print error messages here.
    // Also note that during the loop, the following rules are obeyed:
    // - first the IO operation, then error check, then data processing
    // - failbit and badbit prevent data processing, eofbit does not
    while(getline(f, line)) {
        process(&line);
    }

    f.close();
    return 0;
}
The test:
./readfile_tests.sh readfile_stable_no_errors.cpp
Output:
******** testing on non-existent file..
* trying to open and read: na

******** testing on empty file..
* trying to open and read: empty_file

******** testing on valid file with one line content
* trying to open and read: one_line_file
line read: rofl

******** testing on a file with one valid and one invalid line
* trying to open and read: invalid_line_file
line read: validline
line read: invalidline

******** testing on a file that is read by another process
* trying to open and read: file_read
line read: i am opened to read from

******** testing on a file that is written to by another process
* trying to open and read: file_written
line read: i am opened to write to

******** testing on a /root/.bashrc (access should be denied)
* trying to open and read: /root/.bashrc

******** testing on a directory
* trying to open and read: dir
The intention of this program is to transparently handle file opening and stream I/O errors. This succeeds: whenever there is data to extract, it is extracted. All error test cases result in no data being read.
Remember, all code shown here can be downloaded.
Please let me know if I have to correct certain points or if we can do better than with the presented solutions (thanks again to Alexandre at this point).