Today, I used a find | ls | awk combination to sum up the sizes of files ending with a certain suffix:
$ find . -print0 -name "*.rst" -type f | xargs -0 /bin/ls -l | awk '{t += $5} END {print t, "bytes."}'
1612918975712 bytes.
I knew that this number was too large and finally got it “right”:
$ find . -name "*.rst" -type f -print0 | xargs -0 /bin/ls -l | awk '{t += $5} END {print t, "bytes."}'
5789476750 bytes.
The only difference between the two commands is the position of the -print0 argument to the find command. One could think of -print0 as just an option that determines the output format. In that case, it would be pretty counter-intuitive that its position relative to the other arguments should matter at all. However, find is quite a complex application, and its command line interface behaves differently from that of many other programs. So, does the observed behavior make sense? Why does it matter where -print0 is specified? This is (more or less implicitly) explained on the man page. Below, I try to explain it systematically.
How does find evaluate the command line arguments? The main scheme is the following:
find [search_path1 [search_path2] ...] [expression]
Hence, the arguments comprise one or more search paths and the expression. In the case of find . -print0 -name "*.rst" -type f, the search path is . and the expression is -print0 -name "*.rst" -type f. Important facts to know about the expression:
- The expression consists of one or more so-called primaries, each of which is given as a separate command line argument to find (some primaries, such as -name, take a value as the following argument).
- find evaluates the expression each time it processes an item (file/directory).
- The expression itself may consist of four types of primaries: options, tests, actions, and binary operators connecting them. The default operator is the logical AND.
- For each item in the search path, the entire expression is evaluated from left to right, step by step, i.e. primary by primary.
- Each test, option, or action returns either True or False upon evaluation.
- As soon as it is clear that the entire expression evaluates to False, the evaluation of the expression is aborted (this concept is called short-circuit evaluation). In this case, some primaries may never be executed. find continues with the next item in the search path.
- While the expression is evaluated step by step from left to right, actions are performed right away. Actions have side effects. Consequently, these side effects become visible while the expression is evaluated, even if the evaluation is aborted in the next step.
- The default action, when no other action is specified, is -print. It prints the current item to stdout (newline-terminated). It is executed when the expression has been evaluated entirely and has returned True.
- Tests are the natural way to filter files/directories, i.e. to abort the evaluation of the expression before the default action (normal print) is performed.
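These rules can be tried out on a small scale. The following is a minimal sketch, not taken from the original commands: the scratch directory and the file names a.rst and b.txt are made up for illustration, and -print is used instead of -print0 so that the output is readable.

```shell
# Hypothetical scratch directory and file names, used only for this demo.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.rst" "$demo_dir/b.txt"

# 1) Test only: the implicit default action (-print) runs for matches only.
default_action=$(cd "$demo_dir" && find . -name "*.rst" | LC_ALL=C sort)

# 2) Test first, explicit action second: same result as 1).
test_first=$(cd "$demo_dir" && find . -name "*.rst" -print | LC_ALL=C sort)

# 3) Action first, test second: -print runs for every item (including "."),
#    before the -name test gets a chance to return False.
action_first=$(cd "$demo_dir" && find . -print -name "*.rst" | LC_ALL=C sort)

printf '1) default action:\n%s\n' "$default_action"
printf '2) test first:\n%s\n' "$test_first"
printf '3) action first:\n%s\n' "$action_first"

rm -rf "$demo_dir"
```

Variants 1 and 2 should list only the .rst file, while variant 3 should list every item in the directory, including the directory itself.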
The magic behind the observation made above is that -print0 is an action, not an option. The “side effect” of this action is that the file path is printed to stdout (NUL-terminated). When specified as the first primary in the expression, it is executed for every item in the search path, before any test has had a chance to abort the evaluation. The subsequent filter tests are still evaluated, but they no longer influence what is printed. That’s why the reported file size sum was higher than expected in the first case.
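The effect on the sum can be reproduced in a scratch directory. This is only a sketch under assumptions: the directory, the file names, and the sum_sizes helper are hypothetical, and it relies on ls -l reporting the size in field 5, just like the pipeline above. It also hints at why the wrong number can be so much larger: once the search path itself is handed to ls, ls lists that directory's contents in addition to the files passed explicitly, so sizes can be counted more than once.

```shell
# Hypothetical scratch directory and file names, used only for this demo.
sum_dir=$(mktemp -d)
printf 'hello' > "$sum_dir/a.rst"      # 5 bytes, matches the filter
printf 'goodbye' > "$sum_dir/b.txt"    # 7 bytes, should be ignored

# Hypothetical helper: add up field 5 (the size column of ls -l).
sum_sizes() { awk '{t += $5} END {print t + 0}'; }

# Tests first, -print0 last: only a.rst reaches ls.
right_sum=$(cd "$sum_dir" && find . -name "*.rst" -type f -print0 | xargs -0 ls -l | sum_sizes)

# -print0 first: ".", a.rst and b.txt all reach ls, and ls lists the
# contents of "." in addition to the two files named explicitly.
wrong_sum=$(cd "$sum_dir" && find . -print0 -name "*.rst" -type f | xargs -0 ls -l | sum_sizes)

printf 'tests first:   %s bytes\n' "$right_sum"
printf '-print0 first: %s bytes\n' "$wrong_sum"

rm -rf "$sum_dir"
```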
Btw, an alternative way of summing up the file sizes, based on du (in --apparent-size --block-size=1 mode), would be:
$ find . -name "*.rst" -type f -print0 | du -b --files0-from=- | awk '{t += $1} END {print "Total:", t, "bytes."}'