Exploiting sharing in arbtt

My automatic rule-based time tracker (arbtt), which is written in Haskell, takes a data sample every minute, consisting mainly of the list of currently open windows (window title and program name). Naturally, this log grows rather large. Since October of last year, I have collected 70,000 samples. I had already moved from a text-based file format to a binary format using Data.Binary, which gave a big performance boost.

But by now, I was afraid that this was not enough. My log file is now 30 MB in size. Judging by the memory graph in gnome-panel, processing it takes up more than half of my memory. When running arbtt-stats, the Haskell runtime system reports 569 MB total memory in use, and the command finishes after 28.5 seconds.

Naturally, the log file is highly redundant: compressing it with bzip2 shrinks it to 1.6 MB. But since I would like to preserve the ability to simply append samples at the end, without having to read the file, I chose not to just add bzip2 or gzip compression. Rather, I am now exploiting a very obvious redundancy: two adjacent samples usually list exactly the same windows, and a focus change only changes a flag. So now, when storing a string that is part of a sample, the code checks whether this string was already present in the previous sample and, if so, stores just the number of that string (one byte). Only if the string was not present does it write a zero byte followed by the string. When reading a sample, the process is reversed.

This greatly reduces the file size: it is down to 6.2 MB. It also improves memory consumption, thanks to Haskell's support for sharing: when a reference to a string in a previous sample is read, only one instance of this string is kept in memory, even if it occurs several times in the log. This brings the memory consumption down to 264 MB and the runtime to 17 seconds.
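The back-reference scheme and the sharing it enables can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code in arbtt: the `Token` type and the functions `encodeSample`, `decodeSample`, `encodeLog` and `decodeLog` are hypothetical names, a sample is reduced to a plain list of strings, and the byte-level layout is only described in the comments.

```haskell
import Data.List (elemIndex)

-- | One encoded string (hypothetical type, not arbtt's on-disk format).
--   'Ref n' stands for the single byte n (1..255), meaning "the n-th string
--   of the previous sample". 'Lit s' stands for a zero byte followed by s.
data Token = Ref Int | Lit String
  deriving (Eq, Show)

-- | Encode the strings of one sample relative to the previous sample.
encodeSample :: [String] -> [String] -> [Token]
encodeSample prev = map enc
  where
    enc s = case elemIndex s prev of
      Just i | i < 255 -> Ref (i + 1)  -- must fit into one non-zero byte
      _                -> Lit s        -- unseen string: store it in full

-- | Decode one sample relative to the (already decoded) previous sample.
--   A 'Ref' returns the string value held by the previous sample itself,
--   so no second copy of the string is allocated.
decodeSample :: [String] -> [Token] -> [String]
decodeSample prev = map dec
  where
    dec (Ref n) = prev !! (n - 1)
    dec (Lit s) = s

-- | Encode a whole log; the first sample has an empty predecessor.
encodeLog :: [[String]] -> [[Token]]
encodeLog samples = zipWith encodeSample ([] : samples) samples

-- | Decode a whole log, threading each decoded sample into the next step.
decodeLog :: [[Token]] -> [[String]]
decodeLog = drop 1 . scanl decodeSample []
```

In GHCi, `encodeSample ["xterm","Firefox"] ["Firefox","xterm","gedit"]` evaluates to `[Ref 2,Ref 1,Lit "gedit"]`. Because each sample only refers to its direct predecessor, appending a new sample needs just the last sample that was written, not the whole file.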

I released the changes as version to Hackage, Debian and as a Windows installer. The log file is not converted automatically, but new samples will be written in the compressed format. If you want to convert your whole file, you have to stop arbtt-capture, run arbtt-recover, and then move the hopefully noticeably smaller ~/.arbtt/capture.log.recovered to ~/.arbtt/capture.log.

The required code changes were not too big. I somewhat isolated the relevant code in the Data.Binary.StringRef module. Unfortunately, I have to use OverlappingInstances to be able to provide the special instance for String – is there a cleaner way (besides the trick used for the Show class)?
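For illustration, one commonly used way to avoid overlapping instances is a newtype wrapper around String. The sketch below is not taken from arbtt: the class `Describe` is a hypothetical stand-in for a serialisation class, and `Str` is an assumed wrapper name. The price is that the data types holding the strings have to use the wrapper instead of plain String.

```haskell
-- Hypothetical class standing in for a serialisation class; not arbtt's API.
class Describe a where
  describe :: a -> String

instance Describe Int where
  describe n = "int " ++ show n

-- Generic instance for lists of any describable element type.
instance Describe a => Describe [a] where
  describe xs = "list of " ++ show (length xs) ++ " elements"

-- A direct "instance Describe String" would overlap with the list instance
-- above, because String is [Char]. Wrapping the string in a newtype gives it
-- a distinct type, so no language extension is needed.
newtype Str = Str String

instance Describe Str where
  describe (Str s) = "string " ++ show s
```

With these definitions, `describe (Str "xterm")` selects the special instance, while `describe [1, 2, 3 :: Int]` still goes through the generic list instance.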


But as I would like to preserve the ability to just append samples at the end, without having to read the file, I chose not just to add bzip2 or gzip compression

This is why I love gzip's feature of being able to append additional compressed data to existing compressed files.
#1 Joey Hess (Homepage) on 2010-03-01T03:14:30+00:00
Hmm, without reading any of the previous? But then the compression will be local to the appended part...

OTOH, my current code is no better than that either. But it does provide the chance for sharing also in the RAM...
#2 Joachim Breitner (Homepage) on 2010-03-01T10:03:31+00:00
Why not store changes in the list of open windows rather than the list itself?
#3 Jake McArthur on 2010-03-02T18:05:47+00:00
That’s also an interesting idea. But the lists are relatively small (< 10 elements), and with my compression scheme, it’s three bytes each (one for a boolean, two string referencing bytes). I don’t think a delta algorithm would pay off here.
#4 Joachim Breitner (Homepage) on 2010-03-02T20:20:16+00:00
