Presenting timegaps, a tool for thinning out your data

I have released timegaps, a command line program for sorting a set of items into rejected and accepted ones, based on the age of each item and user-given time categorization rules. While this general description sounds quite abstract, the concept is simple to grasp considering timegaps’ main use case (quote from the readme file):

Timegaps allows for thinning out a collection of items, whereas the time gaps between accepted items become larger with increasing age of items. This is useful for keeping backups “logarithmically” distributed in time, e.g. one for each of the last 24 hours, one for each of the last 30 days, one for each of the last 8 weeks, and so on.

A word in advance: I would very much appreciate to receive your feedback on timegaps. And, if you like it, spread it — Thanks!

Motivation: simple implementation of backup retention policies

Backup strategies must be very well thought through. An important question is at which point old backups are to be deleted or, in other words, which old backups are to be kept for how long. This is generally implemented as a so-called data retention policy — a quote from Wikipedia (Backup):

The secondary purpose of backups is to recover data from an earlier time, according to a user-defined data retention policy, typically configured within a backup application for how long copies of data are required.

Why is the implementation of such a policy important? Obviously, storing all periodically (e.g. daily) created snapshots wastes valuable storage space. A backup retention policy allows to precisely determine which snapshots will be kept for how long. It allows users to find a trade-off between data restoration needs and the cost of backup storage.

People usually implement an automatic backup solution which takes periodic snapshots/backups of a certain data repository. Additionally, unless the data is very small compared to the available backup space, the user has to implement a retention policy which automatically deletes old backups. At this point, people unfortunately tend to take the simplest possible approach and automatically delete snapshots older than X days. This is easily implemented using standard command line tools. However, a clearly more sophisticated and safer backup retention strategy is to also keep very old backups, just not all of them.

An obvious solution is to retain backups “logarithmically” distributed in time. The well-established backup solution rsnapshot does this. It creates a structure of hourly / daily / weekly / ... snapshots on the fly. Unfortunately, other backup approaches often lack such a fine-grained logic for eliminating old backups, and people tend to hack simple filters themselves. Furthermore, even rsnapshot is not able to post-process and thin out an existing set of snapshots. This is where timegaps comes in: you can use the backup solution of your choice for periodically (e.g. hourly) creating a snapshot. You can then — independently and at any time — process this set of snapshots with timegaps and identify those snapshots that need to be eliminated (removed or displaced) in order to maintain a certain “logarithmic” distribution of snapshots in time. This is the main motivation behind timegaps, but of course you can use it for filtering any kind of time-dependent data.

Usage example

Consider the following situation: all *.tar.gz files in the current working directory happen to be daily snapshots of something. The task is to accept one snapshot for each of the last 20 days, one for each for the last 8 weeks, and one for each of the last 12 months, and to move all others to the directory notneededanymore. Using timegaps, this is a simple task:

$ mkdir notneededanymore
$ timegaps --move notneededanymore days20,weeks8,months12 *.tar.gz

Done.

Design goals and development notes

Timegaps aims to be a slick, simple, reliable command line tool — ready to be applied in serious system administration work flows that actually touch data. It follows the Unix philosophy, has a well-defined command line interface, and well-defined behavior with respect to stdin, stdout, stderr and its exit code, so I expect it to be applied in combination with other command line tools such as find. You should head over to the project page for seeing more usage examples and a detailed specification.

The timegaps Python code runs on both, Unix and Windows as well as on both, Python 2 and 3. The same code base is used in all environments, so no automatic 2to3 conversion is involved. I undertook some efforts to make the program support unicode command line arguments on Windows at the same time as byte string paths on Unix, so I am pretty sure that timegaps works well with all kinds of exotic characters in your file names. The program respects the PYTHONIOENCODING environment variable when reading items from stdin and when writing items to stdout. That way, the user has the definite control over item de- and encoding.

For general quality assurance and testing the stability of behavior, timegaps is continuously checked against two classes of unit tests:

  • API tests, testing internally used functionality, such as the time categorization logic. Some tests are fed with huge random input data sets, and the output is checked against what is statistically expected.

  • Command line interface (CLI) tests, testing the program from the user’s perspective. To that end, I have started implementing a Python CLI testing framework, clitest.py. Currently, it is included in the timegaps code repository. At some point, I will probably create an independent open source project from that.

Resources

Leave a Reply

Your email address will not be published. Required fields are marked *

Human? Please fill this out: * Time limit is exhausted. Please reload CAPTCHA.

  1. […] I have released timegaps, a command line tool for — among others — implementing backup retention policies. In […]