Monthly Archives: March 2014

Presenting timegaps, a tool for thinning out your data

I have released timegaps, a command line program that sorts a set of items into accepted and rejected ones, based on the age of each item and user-given time categorization rules. While this general description sounds rather abstract, the concept is easy to grasp from timegaps' main use case (quoted from the readme file):

Timegaps allows for thinning out a collection of items, whereby the time gaps between accepted items become larger with increasing item age. This is useful for keeping backups "logarithmically" distributed in time, e.g. one for each of the last 24 hours, one for each of the last 30 days, one for each of the last 8 weeks, and so on.
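The core idea can be sketched in a few lines of Python. This is a simplified, hypothetical illustration of the concept only; the function name, the `rules` dictionary, and the fixed-width bucket arithmetic are my own inventions and do not reflect timegaps' actual rule syntax or implementation:

```python
# A simplified illustration of the thinning concept (hypothetical code,
# not timegaps' actual implementation or rule syntax).

# Category widths in hours; real calendar arithmetic is more subtle
# than fixed-width buckets.
WIDTH = {"hours": 1, "days": 24, "weeks": 7 * 24, "months": 30 * 24}

def thin_out(ages, rules):
    """Split item ages (given in hours) into accepted and rejected lists.

    rules such as {"hours": 24, "days": 30} mean: keep one item for
    each of the last 24 hours and one for each of the last 30 days.
    """
    accepted, rejected = [], []
    seen = set()  # (category, bucket index) pairs already filled
    for age in sorted(ages):  # youngest items get first pick
        keep = False
        for cat, count in rules.items():
            bucket = int(age // WIDTH[cat])
            if bucket < count and (cat, bucket) not in seen:
                seen.add((cat, bucket))
                keep = True
        (accepted if keep else rejected).append(age)
    return accepted, rejected
```

For hourly snapshots with ages 0 through 99 hours and rules `{"hours": 24, "days": 3}`, this sketch accepts the 24 youngest items plus one representative each for the second and third day bucket, and rejects everything else: exactly the "gaps grow with age" behavior described above.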

A word in advance: I would very much appreciate your feedback on timegaps. And if you like it, spread the word. Thanks!

Motivation: simple implementation of backup retention policies

Backup strategies must be very well thought through. An important question is at which point old backups are to be deleted or, in other words, which old backups are to be kept for how long. This is generally implemented as a so-called data retention policy — a quote from Wikipedia (Backup):

The secondary purpose of backups is to recover data from an earlier time, according to a user-defined data retention policy, typically configured within a backup application for how long copies of data are required.

Why is the implementation of such a policy important? Obviously, storing all periodically (e.g. daily) created snapshots wastes valuable storage space. A backup retention policy determines precisely which snapshots will be kept for how long, letting users find a trade-off between data restoration needs and the cost of backup storage.

People usually implement an automatic backup solution which takes periodic snapshots/backups of a certain data repository. Additionally, unless the data is very small compared to the available backup space, the user has to implement a retention policy which automatically deletes old backups. At this point, people unfortunately tend to take the simplest possible approach and automatically delete snapshots older than X days. This is easily implemented using standard command line tools. However, a clearly more sophisticated and safer backup retention strategy is to also keep very old backups, just not all of them.

An obvious solution is to retain backups “logarithmically” distributed in time. The well-established backup solution rsnapshot does this. It creates a structure of hourly / daily / weekly / ... snapshots on the fly. Unfortunately, other backup approaches often lack such a fine-grained logic for eliminating old backups, and people tend to hack simple filters themselves. Furthermore, even rsnapshot is not able to post-process and thin out an existing set of snapshots. This is where timegaps comes in: you can use the backup solution of your choice for periodically (e.g. hourly) creating a snapshot. You can then — independently and at any time — process this set of snapshots with timegaps and identify those snapshots that need to be eliminated (removed or displaced) in order to maintain a certain “logarithmic” distribution of snapshots in time. This is the main motivation behind timegaps, but of course you can use it for filtering any kind of time-dependent data.

Usage example

Consider the following situation: all *.tar.gz files in the current working directory happen to be daily snapshots of something. The task is to accept one snapshot for each of the last 20 days, one for each of the last 8 weeks, and one for each of the last 12 months, and to move all others to the directory notneededanymore. Using timegaps, this is a simple task:

$ mkdir notneededanymore
$ timegaps --move notneededanymore days20,weeks8,months12 *.tar.gz

Done.

Design goals and development notes

Timegaps aims to be a slick, simple, reliable command line tool, ready to be applied in serious system administration workflows that actually touch data. It follows the Unix philosophy, has a well-defined command line interface, and well-defined behavior with respect to stdin, stdout, stderr and its exit code, so I expect it to be applied in combination with other command line tools such as find. You should head over to the project page to see more usage examples and a detailed specification.

The timegaps Python code runs on both Unix and Windows, and on both Python 2 and 3. The same code base is used in all environments, so no automatic 2to3 conversion is involved. I put some effort into making the program support Unicode command line arguments on Windows and byte string paths on Unix at the same time, so I am fairly confident that timegaps copes with all kinds of exotic characters in your file names. The program respects the PYTHONIOENCODING environment variable when reading items from stdin and when writing items to stdout; that way, the user has full control over item decoding and encoding.
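The PYTHONIOENCODING behavior described above can be emulated in plain Python. The following sketch is not timegaps' actual code; the function name and the newline-separated item format are assumptions for illustration, but the environment variable handling is standard Python behavior:

```python
import os
import sys

def read_items(stream=None):
    """Read newline-separated items from a binary stream and decode
    them with the encoding the user requested via PYTHONIOENCODING,
    falling back to the interpreter default."""
    if stream is None:
        stream = sys.stdin.buffer  # raw bytes on Python 3
    # PYTHONIOENCODING may look like "utf-8" or "utf-8:strict";
    # only the part before the colon names the codec.
    enc = os.environ.get("PYTHONIOENCODING", "").split(":")[0]
    enc = enc or sys.getdefaultencoding()
    return [line.rstrip(b"\n").decode(enc) for line in stream]
```

Reading from a pipe, the user can then force a specific item encoding with e.g. `PYTHONIOENCODING=utf-8` in front of the command, independently of the platform default.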

For general quality assurance and testing the stability of behavior, timegaps is continuously checked against two classes of unit tests:

  • API tests, testing internally used functionality, such as the time categorization logic. Some tests are fed with huge random input data sets, and the output is checked against what is statistically expected.

  • Command line interface (CLI) tests, testing the program from the user’s perspective. To that end, I have started implementing a Python CLI testing framework, clitest.py. Currently, it is included in the timegaps code repository. At some point, I will probably create an independent open source project from that.
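The statistical testing idea from the first bullet point can be illustrated with a self-contained property test. This is a hypothetical sketch, not timegaps' actual test suite: it uses a toy one-item-per-day filter and asserts invariants that must hold for any random input:

```python
import random

def keep_one_per_day(ages, n_days):
    """Toy filter: accept at most one item age (in hours) per day
    bucket, for the last n_days days."""
    accepted, seen = [], set()
    for age in sorted(ages):
        bucket = int(age // 24)
        if bucket < n_days and bucket not in seen:
            seen.add(bucket)
            accepted.append(age)
    return accepted

random.seed(42)
for _ in range(50):
    # 2000 item ages, uniformly spread over 60 days.
    ages = [random.uniform(0, 24 * 60) for _ in range(2000)]
    acc = keep_one_per_day(ages, 30)
    # Invariants that must hold for any input:
    assert len(acc) <= 30
    assert len({int(a // 24) for a in acc}) == len(acc)  # distinct buckets
    # Statistical expectation: with 2000 uniform samples over 60 days,
    # every one of the 30 day buckets is filled with overwhelming
    # probability, so exactly 30 items are accepted.
    assert len(acc) == 30
```

The appeal of this style of test is that it exercises the categorization logic on inputs no hand-written fixture would cover, while the assertions stay simple.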


FreeNAS buries Perl in favor of Python

I am a happy user of FreeNAS (a great open source storage server solution) and sporadically follow its development. A couple of months ago, William Grzybowski committed revision 22ebffb6 to the FreeNAS code repository. He crafted a lovely commit message:

Dear perl,

You’re very brave, you have been fighting against us for a long, long time.
The time has come to tear you apart and bury you very deep.

Rest In Peace

Indeed, the FreeNAS team chose to build their management system on top of CPython (2.7, in this case). A great choice for development efficiency and for community involvement, I guess.

GnuTLS vulnerability: is unit testing a matter of language culture?

You have probably heard about this major security issue in GnuTLS, publicly announced on March 3, 2014, with the following words in a patch note on the GnuTLS mailing list:

This fixes is an important (and at the same time embarrassing) bug
discovered during an audit for Red Hat. Everyone is urged to upgrade.

The official security advisory describes the issue in these general terms:

A vulnerability was discovered that affects the certificate verification functions of all gnutls versions. A specially crafted certificate could bypass certificate validation checks. The vulnerability was discovered during an audit of GnuTLS for Red Hat.

Obviously, media and tech bloggers pointed out the significance of this issue. If you are interested in the technical details, I recommend a well-written LWN article on the topic: A longstanding GnuTLS certificate validation botch. As it turns out, the bug was introduced by a code change that refactored the error/success communication between functions. Generally speaking, the problem is that two communication partners went out of sync: when the sender said "Careful, error!", the recipient understood "Cool, success.". Bah. We are used to a modern, test-driven development culture, so most of us immediately think "WTF, don't they test their code?".
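This class of bug is easy to reproduce in any language. Here is a deliberately simplified Python illustration (not the actual GnuTLS code, which is C, and with a made-up error code): a verification routine signals failure with a negative return code, while a refactored caller tests the result for mere truthiness, so the negative error code reads as "certificate is fine":

```python
CERT_INVALID = -43  # hypothetical error code; negative means failure

def verify_certificate(cert):
    """Return 1 if the certificate checks out, or a negative
    error code if verification failed."""
    if cert.get("signature") != "valid":
        return CERT_INVALID
    return 1

def is_trusted(cert):
    """Buggy caller: treats any nonzero result as success, so the
    negative error code is silently interpreted as 'trusted'."""
    return bool(verify_certificate(cert))

bad_cert = {"signature": "forged"}
assert verify_certificate(bad_cert) == CERT_INVALID
assert is_trusted(bad_cert)  # the bug: a forged certificate passes
```

In C the same confusion arises even more easily, because `if (result)` happily treats a negative error code as true; the illustration above only transplants the sender/recipient mismatch into Python.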

An automated test suite should have immediately spotted that faulty commit, right? But wait a second: that commit was pushed in the year 2000, the language we are talking about is C, and unit testing for C is not exactly established. Given that, did you really, honestly, expect a C code base that reaches back more than a decade to be under the surveillance of ideal unit tests, by modern standards? No? Me neither (although I would have expected a security-relevant library such as GnuTLS to have significant third-party test coverage; does everybody just trust the authors?).

We seem to excuse, or at least acknowledge and tolerate, that old system software written in C is not well tested by modern standards of test-driven development. For sure, there is modern software out there applying ideal testing strategies, but having only a few users. At the same time, old software is circulating, used by millions, that does not apply modern testing strategies. Why is that? And should we tolerate it? There was an interesting discussion about this topic right underneath the above-mentioned LWN article. I'd like to quote one comment that I particularly agree with, although it asks more questions than it provides answers:

> In addition to the culture of limited testing you alluded to,
> I think there are some language issues here as well

Yes, true. But I wonder if discussing type systems is also a
distraction from the more pressing issue here? After all, even
with all the help of Haskell’s type system, you *will* still
have bugs.

It seems to me that the lack of rigorous testing was:
(a) The most immediate cause of these bugs
(b) More common in projects written in C

I find it frustrating that discussions of these issues continually
drift towards language wars, rather than towards modern ideas about
unit testing, software composability, test-driven development, and
code coverage tracking.

Aren’t these the more pressing questions?
(1) Where are the GnuTLS unit tests, so I can review and add more?
(2) Where is the new regression test covering this bug?
(3) What is the command to run a code coverage tool on the test
suite, so that I can see what coverage is missing?

Say what you will about “toy” languages, but that is what would
happen in any halfway mature Ruby or Python or Javascript project,
and I’m happy to provide links to back that up.

Say what you will about the non-systems languages on the JVM, but
that is also what would happen in any halfway mature Scala, Java,
or Clojure project.

It’s only in C, the systems language in which so many of these
vital libraries are written, that this is not the case. Isn’t it
time to ask why?

Someone answered, and I think this view makes sense:

For example, I suspect that the reason "C culture" seems impervious to adopting the lessons of test-driven development has a lot to do with the fact that the masses of developers who are interested in those lessons are, by following your advice, moving to other languages and practicing them there.

In other words, by complecting the issue of unit testing and test coverage with the choice of language, are we not actively *contributing* to the continuing absence of these ideas from C culture, and thus from the bulk of our existing systems?

Food for thought, at least, I hope!

I agree: the effort for improved testing of old, but essential, C libraries must come from the open source community. Someone has to do it.

On Soap Bubbles

Time for soap bubbles, in words and pictures. Soap bubbles have fascinating physical and mathematical properties. Their shape can be abstracted considerably, as for example in On soap bubbles and isoperimetric regions in noncompact symmetric spaces (PDF).

A soap bubble is largely transparent, so the background dominates the bubble's visual appearance. At the same time, a soap bubble is highly reflective. To make the reflection of the ambient light off the bubble stand out clearly, the background of the scene should be chosen dark and homogeneous, while the reflected surroundings should be bright and rich in detail. Homogeneous background, interesting reflection: Richard Heeks has staged this impressively in his photographs.

I find it remarkable that the optical reflection of the surroundings in an ideally spherical soap bubble is fundamentally point-symmetric. We can observe this particularly well in Richard Heeks' photos, but also in most soap bubble photos that can generally be found (Google image search).

This symmetry leads to an interesting phenomenon: the reflection of the surroundings in the bubble appears to consist of two equal parts. The visual boundary between these two parts bisects the sphere exactly. If the surroundings consist mostly of wide-open landscape, this boundary within the soap bubble runs parallel to the horizon (if we define the horizon as the boundary line between sky and earth). All of this holds independently of the viewing angle.

In the following scene I photographed a soap bubble in front of the horizon. The reflected surroundings are a wide-open park, so the visual boundary of the point symmetry runs parallel to the horizon. The background is not homogeneous; it is itself divided into two halves by the horizon and clearly overlays the reflection of the surroundings. While the half-and-half division within the bubble is independent of the viewing angle, the half-and-half division of the background is determined by viewing angle and framing.

Soap bubble half-and-half, background half-and-half. EOS 600 D, 100 mm, 1/160 s, f/2.5, ISO 1000.

As a small bonus, a large bubble that is just trying to minimize its surface area:

Soap bubble shortly before separation. EOS 600 D, 100 mm, 1/200 s, f/3.2, ISO 100.