
Thin out your ZFS snapshot collection with timegaps

Recently, I released timegaps, a command line tool that can be used, among other things, to implement backup retention policies. In this article I demonstrate how timegaps can be applied to filter ZFS snapshots, i.e. to identify those snapshots that can be deleted according to a certain backup retention policy.

Start by listing the names of the snapshots (in my case of the usbbackup/synctargets dataset):

$ zfs list -r -H -t snapshot -o name usbbackup/synctargets

As you can see, I have encoded the snapshot creation time in the snapshot name. This is a prerequisite for the method presented here.

In the following command line, we provide this list of snapshot names to timegaps — via stdin. We advise timegaps to keep the following snapshots:

  • one recent snapshot (i.e. younger than 1 hour)
  • one snapshot for each of the last 10 hours
  • one snapshot for each of the last 30 days
  • one snapshot for each of the last 12 weeks
  • one snapshot for each of the last 14 months
  • one snapshot for each of the last 3 years

… and to print the other ones — the rejected ones — to stdout. This is the command line:

$ zfs list -r -H -t snapshot -o name usbbackup/synctargets | timegaps \
      --stdin --time-from-string 'usbbackup/synctargets@%Y%m%d-%H%M%S' \
      recent1,hours10,days30,weeks12,months14,years3

As you can see, the rules are provided to timegaps via the argument string recent1,hours10,days30,weeks12,months14,years3. The switch --time-from-string 'usbbackup/synctargets@%Y%m%d-%H%M%S' informs timegaps about how to parse the snapshot creation time from a snapshot name. Obviously, --stdin advises timegaps to read items from stdin (instead of from the command line, which would be the default).
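The directives in the pattern are standard strptime placeholders (timegaps is written in Python, so this is presumably what it uses internally). A quick way to verify that your pattern actually matches your snapshot names is a Python one-liner; this sketch assumes python3 is on your PATH:

```shell
# Parse one snapshot name from the listing above with the same pattern
# that is handed to --time-from-string:
$ python3 -c 'from datetime import datetime; print(datetime.strptime(
      "usbbackup/synctargets@20140227-180824",
      "usbbackup/synctargets@%Y%m%d-%H%M%S"))'
2014-02-27 18:08:24
```

If the pattern does not match, strptime raises a ValueError, which is an easy way to catch a typo in the format string before feeding it to timegaps.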

See it in action:

$ zfs list -r -H -t snapshot -o name usbbackup/synctargets | timegaps \
      --stdin --time-from-string 'usbbackup/synctargets@%Y%m%d-%H%M%S' \
      recent1,hours10,days30,weeks12,months14,years3

You don’t really see the difference here because I cropped the output. The following is proof that (for my data) timegaps decided (according to the rules) that 41 of 73 snapshots are to be rejected:

$ zfs list -r -H -t snapshot -o name usbbackup/synctargets | wc -l
73
$ zfs list -r -H -t snapshot -o name usbbackup/synctargets | timegaps \
    --stdin --time-from-string 'usbbackup/synctargets@%Y%m%d-%H%M%S' \
    recent1,hours10,days30,weeks12,months14,years3 | wc -l
41

That command line can easily be extended to create a little script that actually deletes these snapshots. sed is useful here for prepending the string 'zfs destroy ' to each output line (each line corresponds to one rejected snapshot):

$ zfs list -r -H -t snapshot -o name usbbackup/synctargets | timegaps \
    --stdin --time-from-string 'usbbackup/synctargets@%Y%m%d-%H%M%S' \
    recent1,hours10,days30,weeks12,months14,years3 | \
    sed 's/^/zfs destroy /' >
$ cat
zfs destroy usbbackup/synctargets@20140227-180824
zfs destroy usbbackup/synctargets@20140228-201639
zfs destroy usbbackup/synctargets@20140325-215800
zfs destroy usbbackup/synctargets@20140313-235809

Timegaps is well tested via unit tests, and I use it in production. However, at the time of writing, I have not received any feedback from others. Therefore, please review the generated commands and convince yourself that they make sense before executing them.

I expect this post to raise some questions regarding data safety in general, and possibly regarding the synchronization between snapshot creation and deletion. I would very much appreciate questions and feedback in the comments section below. Thanks!

Presenting timegaps, a tool for thinning out your data

I have released timegaps, a command line program for sorting a set of items into rejected and accepted ones, based on the age of each item and user-given time categorization rules. While this general description sounds quite abstract, the concept is simple to grasp considering timegaps’ main use case (quote from the readme file):

Timegaps allows for thinning out a collection of items, whereas the time gaps between accepted items become larger with increasing age of items. This is useful for keeping backups “logarithmically” distributed in time, e.g. one for each of the last 24 hours, one for each of the last 30 days, one for each of the last 8 weeks, and so on.

A word in advance: I would very much appreciate your feedback on timegaps. And if you like it, spread the word. Thanks!

Motivation: simple implementation of backup retention policies

Backup strategies must be very well thought through. An important question is at which point old backups are to be deleted or, in other words, which old backups are to be kept for how long. This is generally implemented as a so-called data retention policy — a quote from Wikipedia (Backup):

The secondary purpose of backups is to recover data from an earlier time, according to a user-defined data retention policy, typically configured within a backup application for how long copies of data are required.

Why is the implementation of such a policy important? Obviously, storing all periodically (e.g. daily) created snapshots wastes valuable storage space. A backup retention policy determines precisely which snapshots are kept for how long, allowing users to find a trade-off between data restoration needs and the cost of backup storage.

People usually implement an automatic backup solution which takes periodic snapshots/backups of a certain data repository. Additionally, unless the data is very small compared to the available backup space, the user has to implement a retention policy which automatically deletes old backups. At this point, people unfortunately tend to take the simplest possible approach and automatically delete snapshots older than X days. This is easily implemented using standard command line tools. However, a clearly more sophisticated and safer backup retention strategy is to also keep very old backups, just not all of them.
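The "delete everything older than X days" approach mentioned above is typically a one-liner like the following (a sketch with a hypothetical backup directory; -mtime +30 matches files last modified more than 30 days ago):

```shell
# Naive retention: delete every backup older than 30 days.
# Simple, but it discards all long-term history.
$ find /backups -name '*.tar.gz' -mtime +30 -delete
```

This is exactly the kind of policy that keeps no old backups at all, which is the gap a more fine-grained retention scheme fills.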

An obvious solution is to retain backups “logarithmically” distributed in time. The well-established backup solution rsnapshot does this. It creates a structure of hourly / daily / weekly / ... snapshots on the fly. Unfortunately, other backup approaches often lack such a fine-grained logic for eliminating old backups, and people tend to hack simple filters themselves. Furthermore, even rsnapshot is not able to post-process and thin out an existing set of snapshots. This is where timegaps comes in: you can use the backup solution of your choice for periodically (e.g. hourly) creating a snapshot. You can then — independently and at any time — process this set of snapshots with timegaps and identify those snapshots that need to be eliminated (removed or displaced) in order to maintain a certain “logarithmic” distribution of snapshots in time. This is the main motivation behind timegaps, but of course you can use it for filtering any kind of time-dependent data.
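To see the basic idea behind such time-bucket filtering, here is a toy shell sketch of a single rule, "keep only the newest snapshot per calendar day". This is an illustration only, not how timegaps is implemented; real rules also involve item age and the hours/weeks/months/years categories:

```shell
# Three toy snapshot names, two of them from the same day; sort newest
# first, then keep the first name seen for each YYYYMMDD day bucket.
$ printf '%s\n' \
      'usbbackup/synctargets@20140227-180824' \
      'usbbackup/synctargets@20140227-120000' \
      'usbbackup/synctargets@20140226-090000' |
      sort -r | awk -F@ '!seen[substr($2, 1, 8)]++'
usbbackup/synctargets@20140227-180824
usbbackup/synctargets@20140226-090000
```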

Usage example

Consider the following situation: all *.tar.gz files in the current working directory happen to be daily snapshots of something. The task is to accept one snapshot for each of the last 20 days, one for each of the last 8 weeks, and one for each of the last 12 months, and to move all others to the directory notneededanymore. Using timegaps, this is a simple task:

$ mkdir notneededanymore
$ timegaps --move notneededanymore days20,weeks8,months12 *.tar.gz


Design goals and development notes

Timegaps aims to be a slick, simple, reliable command line tool, ready to be applied in serious system administration workflows that actually touch data. It follows the Unix philosophy, has a well-defined command line interface, and well-defined behavior with respect to stdin, stdout, stderr, and its exit code, so I expect it to be combined with other command line tools such as find. Head over to the project page to see more usage examples and a detailed specification.

The timegaps Python code runs on both Unix and Windows, and on both Python 2 and 3. The same code base is used in all environments, so no automatic 2to3 conversion is involved. I put some effort into making the program support unicode command line arguments on Windows as well as byte string paths on Unix, so I am pretty sure that timegaps works well with all kinds of exotic characters in your file names. The program respects the PYTHONIOENCODING environment variable when reading items from stdin and when writing items to stdout. That way, the user has full control over item decoding and encoding.
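The effect of PYTHONIOENCODING can be demonstrated with any Python process, not just timegaps; the following is a generic sketch assuming python3 is on the PATH:

```shell
# With an ASCII-only I/O encoding, printing a non-ASCII item name
# fails loudly with a UnicodeEncodeError:
$ PYTHONIOENCODING=ascii python3 -c 'print("snapshot-\u00e4")'
# Forcing UTF-8 lets the same item round-trip cleanly:
$ PYTHONIOENCODING=utf-8 python3 -c 'print("snapshot-\u00e4")'
snapshot-ä
```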

For general quality assurance and testing the stability of behavior, timegaps is continuously checked against two classes of unit tests:

  • API tests, testing internally used functionality, such as the time categorization logic. Some tests are fed with huge random input data sets, and the output is checked against what is statistically expected.

  • Command line interface (CLI) tests, testing the program from the user’s perspective. To that end, I have started implementing a Python CLI testing framework, which is currently included in the timegaps code repository. At some point, I will probably turn it into an independent open source project.