Monthly Archives: April 2013

order of arguments for GNU find: be careful where to specify actions such as -print0.

Today, I used a find | xargs ls | awk pipeline to sum up the sizes of all files ending with a certain suffix:

$ find . -print0 -name "*.rst" -type f | xargs -0 /bin/ls -l | awk '{t += $5} END {print t, "bytes."}'
1612918975712 bytes.

I knew that this number was too large and finally got it “right”:

$ find . -name "*.rst" -type f -print0 | xargs -0 /bin/ls -l | awk '{t += $5} END {print t, "bytes."}'
5789476750 bytes.

The only difference between the two commands is the position of the -print0 argument to the find command. One could think of -print0 as just an option that determines the output format. In that case, it would be pretty counter-intuitive that its position relative to other arguments should matter at all. However, find is quite a complex application, and its command line interface behaves differently from that of many other programs. So, does the observed behavior make sense? Why does it matter where -print0 is specified? This is (more or less implicitly) explained on the man page. Below, I try to explain it systematically.

How does find evaluate the command line arguments? The main scheme is the following:

     find [search_path1 [search_path2] ...] [expression]

Hence, the arguments comprise one or more search paths followed by the expression. In the case of find . -print0 -name "*.rst" -type f, the search path is . and the expression is -print0 -name "*.rst" -type f. Important facts to know about the expression:

  • The expression consists of one or more so-called primaries. Each primary is given on the command line as its own argument, followed by any parameters it takes (e.g. -name "*.rst"). find evaluates the expression each time it processes an item (file/directory).
  • The expression may consist of four types of primaries: options, tests, actions, and operators connecting them. Where no operator is given between two primaries, logical AND (-a) is assumed.
  • For each item in the search path, the entire expression is evaluated from left to right, step by step, i.e. primary by primary.
  • Each test, option, and action returns either True or False upon evaluation.
  • As soon as it is clear that the entire expression evaluates to False, the evaluation is aborted for the current item (this concept is called short-circuit evaluation). In this case, some primaries may never have been executed. find then continues with the next item in the search path.
  • While the expression is evaluated step by step from left to right, actions are performed right away. Actions have side-effects. Consequently, these side-effects become visible while the expression is evaluated, even if the evaluation is aborted in the next step.
  • The default action, when no other action is specified, is -print. It prints the current item to stdout (newline-terminated). It is executed when the expression is entirely evaluated and has returned True.
  • Tests are the natural way to filter files/directories, i.e. to abort the evaluation of the expression before the default action (normal print) is performed.
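
The evaluation rules above can be reproduced on a small scratch tree. The following is a minimal sketch, not taken from the original session: the directory layout and the variable names (demo_dir, count_action_first, count_action_last) are made up for illustration, and -print is used instead of -print0 so that wc -l can count the output lines.

```shell
#!/bin/sh
# Hypothetical scratch tree, created only for this demonstration.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
touch "$demo_dir/a.rst" "$demo_dir/sub/b.rst" "$demo_dir/c.txt"

# Action first: -print runs for every item, before any test is evaluated.
count_action_first=$(find "$demo_dir" -print -name "*.rst" -type f | wc -l | tr -d ' ')

# Action last: -print runs only if both tests have returned True.
count_action_last=$(find "$demo_dir" -name "*.rst" -type f -print | wc -l | tr -d ' ')

echo "action first: $count_action_first items"
echo "action last:  $count_action_last items"

rm -rf "$demo_dir"
```

With the action placed first, all five items are printed (the scratch directory itself, sub, and the three files). With the action placed last, only the two .rst files are.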

The magic behind the observation made above is that -print0 is an action, not an option. The “side-effect” of this action is that the file path is printed to stdout (terminated by a NUL character). When specified as the first primary in the expression, it is executed for each item in the search path, directories included, before any test has had a chance to abort the evaluation. The subsequent filter tests no longer affect the output, and since an explicit action is present, the default -print is not added either. That’s why the reported file size sum was higher than expected in the first case.
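
The size of the gap between the two sums has a second ingredient: once directories end up in the argument list, ls -l lists their contents, so files inside them are counted again. The sketch below reproduces this on a hypothetical scratch tree with files of known size. The names demo_dir, total_wrong and total_right are assumptions for illustration, and plain ls is used in place of /bin/ls.

```shell
#!/bin/sh
# Hypothetical scratch tree with known file sizes.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
printf '0123456789' > "$demo_dir/a.rst"        # 10 bytes
printf '%0100d' 0   > "$demo_dir/sub/b.rst"    # 100 bytes
printf 'ignored'    > "$demo_dir/c.txt"        # 7 bytes, wrong suffix

# -print0 first: every item, including both directories, reaches ls -l.
total_wrong=$(find "$demo_dir" -print0 -name "*.rst" -type f \
    | xargs -0 ls -l | awk '{t += $5} END {print t}')

# -print0 last: only the two .rst files reach ls -l.
total_right=$(find "$demo_dir" -name "*.rst" -type f -print0 \
    | xargs -0 ls -l | awk '{t += $5} END {print t}')

echo "action first: $total_wrong bytes"
echo "action last:  $total_right bytes"

rm -rf "$demo_dir"
```

The second total is 110 bytes, the combined size of the two .rst files. The first total is larger; its exact value depends on the directory sizes the file system reports.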

Btw, an alternative way of summing up the file sizes is based on du, where -b is shorthand for --apparent-size --block-size=1:

$ find . -name "*.rst" -type f -print0 | du -b --files0-from=- | awk '{t += $1} END {print "Total:", t, "bytes."}'
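
The du variant can be checked against files of known size. A further option, not used in the commands above and assuming GNU find, is the -printf action with the %s directive (file size in bytes), which lets find report the sizes itself. The following is a sketch on a hypothetical scratch tree; demo_dir, total_du and total_printf are illustrative names, and du -b as well as --files0-from are GNU coreutils options.

```shell
#!/bin/sh
# Hypothetical scratch tree with known file sizes.
demo_dir=$(mktemp -d)
printf '0123456789' > "$demo_dir/a.rst"    # 10 bytes
printf '%0100d' 0   > "$demo_dir/b.rst"    # 100 bytes
printf 'ignored'    > "$demo_dir/c.txt"    # 7 bytes, wrong suffix

# du reads the NUL-separated file list from stdin and prints apparent sizes.
total_du=$(find "$demo_dir" -name "*.rst" -type f -print0 \
    | du -b --files0-from=- | awk '{t += $1} END {print t}')

# GNU find prints each matching file's size in bytes, one per line.
total_printf=$(find "$demo_dir" -name "*.rst" -type f -printf '%s\n' \
    | awk '{t += $1} END {print t}')

echo "du:      $total_du bytes"
echo "-printf: $total_printf bytes"

rm -rf "$demo_dir"
```

Both pipelines report 110 bytes here. Note that -printf, like -print0, is an action, so it also belongs after the tests.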