order of arguments for GNU find: be careful where to specify actions such as -print0.

Today, I used a find | xargs ls | awk pipeline to sum up the sizes of all files ending with a certain suffix:

$ find . -print0 -name "*.rst" -type f | xargs -0 /bin/ls -l | awk '{t += $5} END {print t, "bytes."}'
1612918975712 bytes.

I knew that this number was too large and finally got it “right”:

$ find . -name "*.rst" -type f -print0 | xargs -0 /bin/ls -l | awk '{t += $5} END {print t, "bytes."}'
5789476750 bytes.

The only difference between the two commands is the position of the -print0 argument to the find command. One might think of -print0 as a mere option that determines the output format. In that case, it would be quite counter-intuitive for its position relative to the other arguments to matter at all. However, find is a fairly complex application, and its command line interface behaves differently from that of many other programs. So, does the observed behavior make sense? Why does it matter where -print0 is specified? This is explained (more or less implicitly) on the man page. Below, I try to explain it systematically.

How does find evaluate the command line arguments? The main scheme is the following:

     find [search_path1 [search_path2] ...] [expression]

Hence, the arguments comprise a search path or multiple search paths and the expression. In case of find . -print0 -name "*.rst" -type f, the search path is . and the expression is -print0 -name "*.rst" -type f. Important facts to know about the expression:

  • The expression consists of one or more so-called primaries. Each primary is a separate command line argument to find, possibly followed by its own arguments (as in -name "*.rst"). find evaluates the expression once for each item (file/directory) it processes.
  • The expression may consist of four types of primaries: options, tests, actions, and operators connecting them. Where no operator is given between two primaries, the logical AND is implied.
  • For each item in the search path, the entire expression is evaluated from left to right, step by step, i.e. primary by primary.
  • Each test, option, or action returns either True or False upon evaluation.
  • As soon as it is clear that the entire expression evaluates to False, its evaluation is aborted (this concept is called short-circuit evaluation). In that case, some primaries may never be executed. find then continues with the next item in the search path.
  • While the expression is evaluated step by step from left to right, actions are performed right away. Actions have side-effects. Consequently, these side-effects become visible while the expression is evaluated, even if the evaluation is aborted in the next step.
  • The default action, when no other action is specified, is -print. It prints the current item to stdout (newline-terminated). It is executed when the expression is entirely evaluated and has returned True.
  • Tests are the natural way to filter files/directories, i.e. to abort the evaluation of the expression before the default action (normal print) is performed.
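The rules above can be tried out on a scratch directory. The following is only a minimal sketch, assuming a POSIX shell with find and mktemp available; the file names a.rst and b.txt are made up for illustration. It uses the plain -print action so that the output is easy to read.

```shell
#!/bin/sh
# Scratch directory with two hypothetical files.
demo=$(mktemp -d)
touch "$demo/a.rst" "$demo/b.txt"

# Action first: -print runs for every item, before -name is even tested.
early=$(cd "$demo" && find . -print -name "*.rst" | sort)

# Test first: -print is only reached when -name has returned True.
late=$(cd "$demo" && find . -name "*.rst" -print | sort)

printf 'action first:\n%s\n' "$early"
printf 'test first:\n%s\n' "$late"

rm -r "$demo"
```

With the action first, all three items (., ./a.rst and ./b.txt) are printed; with the test first, only ./a.rst is.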

The magic behind the observation made above is that -print0 is an action, not an option. The “side-effect” of this action is that the file path is printed to stdout (NUL-terminated). When specified as the first primary in the expression, it is executed for each item in the search path, before any test has had a chance to abort the evaluation. The subsequent filter tests then no longer affect what is printed. That’s why the reported file size sum was higher than expected in the first case: every file and directory below . was passed on to ls -l, not just the *.rst files.
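To see the effect in numbers without touching real data, one can count the NUL-terminated paths that find emits for both argument orders. This is a sketch under assumptions: it needs a find that supports -print0 (GNU find does), the scratch tree and its file names are invented, and count_paths is a hypothetical helper defined here, not part of any tool.

```shell
#!/bin/sh
# Scratch tree: two .rst files, one other file, one subdirectory.
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/one.rst" "$demo/sub/two.rst" "$demo/notes.txt"

# Hypothetical helper: count NUL bytes on stdin, i.e. the number of paths.
count_paths() { tr -cd '\0' | wc -c | tr -d ' '; }

# -print0 as the first primary: executed for every item.
first=$(cd "$demo" && find . -print0 -name "*.rst" -type f | count_paths)

# -print0 as the last primary: only reached when both tests are True.
last=$(cd "$demo" && find . -name "*.rst" -type f -print0 | count_paths)

echo "-print0 first: $first paths"
echo "-print0 last:  $last paths"

rm -r "$demo"
```

On this tree, the first variant emits 5 paths (the three files plus . and ./sub), the second one only the 2 .rst files.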

Btw, an alternative way for summing up the file sizes based on du (in --apparent-size --block-size=1 mode) would be:

$ find . -name "*.rst" -type f -print0 | du -b --files0-from=- | awk '{t += $1} END {print "Total:", t, "bytes."}'
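Yet another variant, shown here only as a sketch, lets find report the sizes itself, so that neither ls nor du is needed. It assumes GNU find, whose -printf action supports the %s directive (file size in bytes). The scratch files and their contents are made up so that the expected total is known.

```shell
#!/bin/sh
# Scratch tree with known file sizes: 3 + 5 bytes in .rst files.
demo=$(mktemp -d)
mkdir "$demo/sub"
printf 'abc' > "$demo/a.rst"
printf 'hello' > "$demo/sub/b.rst"
printf 'ignored' > "$demo/notes.txt"

# -printf '%s\n' is an action as well, so it goes after the tests.
total=$(find "$demo" -name "*.rst" -type f -printf '%s\n' \
    | awk '{t += $1} END {print t + 0}')

echo "Total: $total bytes."

rm -r "$demo"
```

Here the total is 8 bytes; notes.txt is filtered out by the -name test before the action is reached.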
  • Nicholas K

    The Internets brought me here…

    Now I know! Too bad though, I’ve just lost days of work by executing “find . -print0 -iname *.aux | xargs -0 remove -v” in a directory (with rm instead of “remove”). I wanted just to delete some LaTeX auxiliary files. I ended up nuking everything. The latest backup is days old.

    Fun fact #1: I’ve been a Linux user for many years, using GNU find and xargs very often. I usually put -print0 at the end, but never questioned the order (or perhaps I did, years ago, and forgot since.)

    Fun fact #2: This was perhaps the only time that I used such a command without running it first with “xargs -0 echo”. When you are hasty…

    Fun fact #3: This was on an SSD with ATA TRIM enabled. Yep, 0% chance of recovery, even if I had the resources of the NSA and FBI combined.

    • I can’t do anything but express my heartfelt sympathy :/ This is hard. Thanks for posting here, anyway! Btw, I usually version-control (LaTeX) documents / manuscripts while writing them and push quite regularly to a remote repository (to Bitbucket in my case, since it allows for unlimited free private repositories). I have the feeling this is one of the safest backup strategies you can get when writing a document…

      • Nicholas K

        Thank you! Funny that you mention it, since there was also a .git subdirectory that got nuked as well (I’m a newbie when it comes to LaTeX, and I consider the ability to use proper versioning/diffing one of its biggest strengths, along with the incredible quality of the output.)