Michael Hönnig

Ever had a long-running find/xargs pipe running in a Linux bash shell and wished you could look inside to see how far it has progressed? Here is a little bash shell code snippet that makes this possible.

The Story Behind

Up until recently, I still had one quite old external HDD as my main backup storage. Unlike the other two backup storages of my 3-2-1 backup strategy, it also included videos, photos and my MP3 collection, and was thus pretty huge. Also unlike the others, it was not yet encrypted. So I used rsync to copy all these files to a new external HDD with encryption. To be honest, I did not trust myself to have used rsync correctly and wanted a second 'opinion' on whether the copy was really complete.

Thus, I started a naive little script to compare all files on the old HDD with their counterparts on the new HDD. It ran endlessly (days!), and it was still logging file names, so it was still running; I just had no idea how far it had progressed and how much longer I would have to wait. So I interrupted it to find a more efficient way to do the task. This time, I also wanted to watch the progress.

The Problem

Wouldn’t it be cool to have something like this?

find ... | xargs --show-progress-bar ...
output
78.0 B 0:00:27 [2.98 B/s] [==============>           ] 51% ETA 0:00:25

Unfortunately, this option does not exist in xargs [1] (see https://www.gnu.org/software/findutils/manual/html_node/find_html/Invoking-the-shell-from-xargs.html). And I was really surprised not to find a workaround on the Internet. Bits and pieces, yes, and also some solutions for special cases, but none that I was able to use. Thus, I put together those bits and pieces myself.

The Solution

I came up with something resembling the following little bash shell script snippet:

ENTRIES=`mktemp --tmpdir entries.XXXXX`; echo "entries file: $ENTRIES"
ERRORS=`mktemp --tmpdir errors.XXXXX`; echo "errors file: $ERRORS"
RESULT=`mktemp --tmpdir result.XXXXX`; echo "result file: $RESULT"
echo -n "collecting files ..."
find {YOUR-CRITERIA} -print0 >$ENTRIES
let COUNT=`tr '\0' '\n' <$ENTRIES | wc -l`
echo -e "\b\b\bfound $COUNT entries, processing:"
sort -z <$ENTRIES |
    xargs -r0 -L{YOUR-CHUNKSIZE} sh -c '{YOUR-COMMAND} "$@" >>'$RESULT' 2>>'$ERRORS'; printf %$#s' arg0dummy |
    pv --progress --timer --eta --rate --bytes --size $COUNT >/dev/null
cat $ERRORS >&2
cat $RESULT
rm $ENTRIES $ERRORS $RESULT

(I give this code to the public domain. Links to my website michael.hoennig.de or mentioning my Twitter handle @javagol are welcome.)

Please note the parameterization (replace each placeholder including the braces; these are placeholders, not variables):

  • {YOUR-CRITERIA}: any find criteria you need to apply

  • {YOUR-CHUNKSIZE}: the number of entries to be processed at once

  • {YOUR-COMMAND}: whatever you need to process
    (receives the entries chunk-by-chunk as arguments)

An Executable Example

Wanna see it in action? Then open your bash, change to a directory that contains a few thousand files, and execute these commands:

ENTRIES=`mktemp --tmpdir entries.XXXXX`; echo "entries file: $ENTRIES"
ERRORS=`mktemp --tmpdir errors.XXXXX`; echo "errors file: $ERRORS"
RESULT=`mktemp --tmpdir result.XXXXX`; echo "result file: $RESULT"
echo -n "collecting files ..."
find . -type f -perm -g+r -print0 >$ENTRIES
let COUNT=`tr '\0' '\n' <$ENTRIES | wc -l`
echo -e "\b\b\bfound $COUNT entries, processing:"
sort -z <$ENTRIES |
    xargs -r0 -L100 sh -c 'md5sum "$@" >>'$RESULT' 2>>'$ERRORS'; printf %$#s' processing |
    pv --progress --timer --eta --rate --bytes --size $COUNT >/dev/null
cat $ERRORS >&2
cat $RESULT
rm $ENTRIES $ERRORS $RESULT

It will calculate the MD5 checksums of all the files in the directory while showing a progress bar.

How it Works - A Summary

After some initialization it uses find to create a list of entries to process and stores these in a file. You could also use an environment variable, but in my case the environment space was exceeded (I got "Argument list too long" on subsequent commands), thus I used a temporary file.

The next step is to determine the number of entries to process to get some sensible progress bar. If the time to process a single item has a huge variation, like in my case due to the size of the processed files, the progress might still not be totally accurate, but as long as there are not very few files taking most of the processing time, this should be a good compromise.

Now comes the main step: the list of entries is processed in chunks using xargs. I chose a chunked approach because, had I passed all or too many arguments at once, the progress bar might just have jumped from 0% to 100% at the very end. Pay attention to the quoting in this line; it is very important, otherwise the variables with the file names cannot be resolved properly.

The last step is to print any errors and the processing result and to clean up.

This is still a bit overwhelming? No problem, I am going to explain it step by step.

How it Works - In Detail

Initialization

ENTRIES=`mktemp --tmpdir entries.XXXXX`; echo "entries file: $ENTRIES"
ERRORS=`mktemp --tmpdir errors.XXXXX`; echo "errors file: $ERRORS"
RESULT=`mktemp --tmpdir result.XXXXX`; echo "result file: $RESULT"

This creates three unique files in the system's temp-file folder (usually /tmp): one for the list of entries to be processed, one for the error output of the actual processing, and one for the processing result. The processing errors and results are not printed immediately because that would destroy the progress bar visualization. But as the snippet prints the file names at the beginning, you can always peek into them from another shell (tail -f …).

Collecting the Entries to Process

echo -n "collecting files ..."
find {YOUR-CRITERIA} -print0 >$ENTRIES

This line creates a list of entries to be processed (file and directory names) and stores them in a temporary file. It separates the entries with null characters to avoid any problems with whitespace in file names.
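
Why the null separation matters can be seen with a quick, self-contained sketch (the file name with a space in it is made up for the demo):

```shell
# Create a throwaway directory with a file whose name contains a space:
DIR=`mktemp -d`
touch "$DIR/two words"
# -print0 keeps the name intact as one null-terminated entry, and
# xargs -0 hands it over as a single argument:
find $DIR -type f -print0 | xargs -r0 -L1 sh -c 'echo "got: $1"' arg0dummy
# prints: got: <tmpdir>/two words  -- the name is not split in two
rm -r $DIR
```

With newline- or whitespace-separated entries, the same file name would have arrived as two broken arguments.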

As this find alone can take quite some time, an echo statement informs the user that some pre-processing is going on. The "-n" means not to end with a line break, just for a little effect in the next step.

Because it’s not known in advance how many entries there are to be processed, no progress bar can be shown for this step anyway. One improvement could be to show some other kind of indicator, like a spinning wheel, but for now I’ve skipped this optimization.
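
As a hedged sketch of that skipped optimization (not part of the snippet above): a small loop can animate a spinner while the find runs in the background; here, sleep 2 merely stands in for a long-running find.

```shell
# Stand-in for a long-running 'find ... -print0 >$ENTRIES':
sleep 2 &
FIND_PID=$!
SPINNER='|/-\'
i=0
# Animate until the background job terminates:
while kill -0 $FIND_PID 2>/dev/null; do
    i=$(( (i + 1) % 4 ))
    printf '\rcollecting files %s' "${SPINNER:$i:1}"
    sleep 0.1
done
wait $FIND_PID
printf '\rcollecting files done\n'
```

The carriage return ("\r") redraws the same line on each iteration, so the spinner appears to rotate in place.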

Determining the Number of Entries to Process

let COUNT=`tr '\0' '\n' <$ENTRIES | wc -l`
echo -e "\b\b\bfound $COUNT entries, processing:"

The command wc (word count) is used to determine the number of entries to process. The option "-l" makes it count just the lines, but as the entries are separated by null characters, not newlines, the null characters first have to be converted to line breaks using tr.

Next, the number of entries to be processed is printed. The "-e" makes echo interpret the "\b" (backspace) characters, which remove the three dots from the previous echo statement.
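
To see this counting pipeline in isolation, here is a tiny, self-contained sketch (the entry names are made up):

```shell
# Three null-separated entries, just as find -print0 would emit them:
DEMO=`mktemp --tmpdir entries.XXXXX`
printf 'file one\0file two\0file three\0' >$DEMO
# tr turns each null character into a newline, wc -l counts the lines:
let COUNT=`tr '\0' '\n' <$DEMO | wc -l`
echo $COUNT    # prints: 3
rm $DEMO
```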

Processing the Entries in Chunks

sort -z <$ENTRIES |
    xargs -r0 -L{YOUR-CHUNKSIZE} sh -c '{YOUR-COMMAND} "$@" >>'$RESULT' 2>>'$ERRORS'; printf %$#s' arg0dummy |
    pv --progress --timer --eta --rate --bytes --size $COUNT >/dev/null

The sort step is optional; I added it because it’s a bit tricky if you need the input sorted. In my special case I actually needed it, because I wanted to compare two runs of this process on two distinct directory trees which resembled each other (original and copy). The "-z" means to sort not lines, but null-character-separated entries.
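
sort -z can be tried on its own with two dummy entries; tr just makes the null-separated result human-readable:

```shell
# Sort null-separated entries, then convert nulls to newlines for display:
printf 'banana\0apple\0' | sort -z | tr '\0' '\n'
# prints:
# apple
# banana
```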

Next, the entries are processed using xargs, chunk by chunk, each chunk with the given size. Depending on IO buffering and on how your command works, chunking is necessary to give the progress bar some minimum granularity. Maybe your processing command even needs a chunk size of 1 because it can only take one entry at a time.

Next, we need a trick to show progress chunk by chunk while also doing the actual processing chunk by chunk. In my case, I could easily have used the output of my actual processing as the basis for progress, because it was 1:1 for each input entry. But to be a more general solution, this snippet does not rely on such special conditions. Instead, it passes the chunk entries to an embedded shell script which does two things:

  1. it executes your actual processing command and appends its output to the result and error files

  2. it prints as many spaces as there are entries in the chunk
    (keep in mind, the last chunk might be shorter than the chunk size)
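
These two steps can be observed in isolation. In the sketch below, ':' (the shell no-op) stands in for the real processing command, so the only bytes reaching the downstream tool are the spaces, one per entry:

```shell
# Three null-separated dummy entries, chunk size 2, so the chunks are
# (a, bb) and (ccc); printf %$#s emits 2 spaces, then 1 space:
printf 'a\0bb\0ccc\0' |
    xargs -r0 -L2 sh -c ': "$@"; printf %$#s' arg0dummy |
    wc -c    # prints: 3 -- one byte per processed entry
```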

The arg0dummy at the end of the shell command is used as $0 by the subshell. It is not actually used, but if we omitted it, the first entry of each chunk would be consumed as $0 and would thus neither get processed nor counted for the progress bar.
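
A quick experiment (with made-up entries a, b, c) shows the effect:

```shell
# Without a dummy, 'a' is consumed as $0 and vanishes from "$@":
printf 'a\0b\0c\0' | xargs -r0 sh -c 'echo "$@"'             # prints: b c
# With the dummy in the $0 slot, all three entries survive as arguments:
printf 'a\0b\0c\0' | xargs -r0 sh -c 'echo "$@"' arg0dummy   # prints: a b c
```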

Finally, and this is the core of the whole thing, a progress bar is displayed based on the output of xargs. pv receives the total number of entries via the "--size" option, while the actual stdout and stderr of the processing are redirected to the temporary files so they do not destroy the progress bar. Look up the meaning of the other options with man pv if you are interested.

The pv output looks like this example:

78.0 B 0:00:27 [2.98 B/s] [==============>           ] 51% ETA 0:00:25

It means that it has already processed 78 entries and has been running for 27 seconds so far. On average, it has processed 2.98 entries per second. It has already processed 51% of the entries and estimates that it will run for another 25 seconds. The estimated remaining time can go up once in a while if some entries take much more time than the others did so far.

You can replace --eta with --fineta in the pv command to show the estimated end time (according to the timezone of your environment) instead of the estimated remaining duration.

Finishing Up

cat $ERRORS >&2
cat $RESULT
rm $ENTRIES $ERRORS $RESULT

Finally, we print out the errors and results from the actual processing and remove the temporary files.

Downsides, Alternatives and Comments

I see some downsides of this solution:

  • If your computer is suspended during the process, the bad news is that the estimated remaining duration will usually be way off afterwards.

  • Encapsulating this snippet into a reusable script would make the quoting even more complicated, thus I did not bother doing so for this blog article. But if somebody has frequent use for it, it’s doable for sure.

  • It still lacks some progress indicator for the time needed to collect the entries to process.

In some cases, an alternative could be GNU Parallel [3], which has a progress bar included. But its main job is the parallel execution of tasks, which is sometimes not wanted. In many other cases you can use pv [2] just as it is and don’t need this tricky fixture.
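
For illustration, a minimal sketch of using pv as-is (assuming pv is installed; the demo file is made up): when the data really is one byte stream, pv sits directly in the pipe and needs none of the scaffolding above.

```shell
# Create a 16 KiB demo file, then compress it with a progress bar;
# pv determines the total size itself when given a file name:
dd if=/dev/zero of=demofile bs=1024 count=16 2>/dev/null
pv demofile | gzip -c >demofile.gz
rm demofile demofile.gz
```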

Any comments? Then please drop me a note by email or on Twitter!