Joachim Breitner's Homepage

Categories

Just about three years ago, I created the “automatic rule-based time tracker” (arbtt), a tool that quietly runs in the background, records your current windows and their titles, and allows you to later process this data using, well, rules. By now, I have almost half a million records, spanning 317 days of me working at the computer. Processing this amount of data has become very memory-hungry: It takes 1G of RAM to process my data. Other users, with a presumably higher sampling rate, report even worse performance, which finally made me approach the problem again.

Naturally, one wonders why it is such a memory hog, after all it takes just one run over the data to produce a report. The data itself is a lazy list so the garbage collector should the be able to drop the seen elements immediately.

But arbtt is able to print several reports in one run; there is even
an option (`--each-category`

) that creates a number of reports based on
the data itself. The code was structured that first a number of values
that are possibly used by several reports are first created one and then
re-used by the reports. And then there is the issue of the filters: The
user specifies a what predicates are interesting (e.g., those where he
was active) and some reports need to know not only which samples were
matched, but also distinguish the consecutive runs of selected samples.

Lots of reasons that made it hard to not use the list of of samples several times, which causes it to be completely retained in memory. The idea of completely rewriting everything in a much more imperative way was not appealing. But luckily Haskell’s good abstraction possibilities can help here.

What is the “correct” way to traverse a lazy list and calculate some information about it? It is a strict left fold, i.e. an algorithm that can be represented as arguments to

foldl' :: (s -> x -> s) -> s -> [x] -> s

For those who are new to folds: The first argument is a function that processes one element from the list, returning the new state of the processing. The second argument is the initial state. And `foldl'`

can apply such an algorithm to a list to return a final state.

So obviously I had to express my reports as left folds. To think clearer about it, I gave them a name. I also added a finishing step to the mix, which is handy:

data LeftFold x a = forall s. LeftFold { start :: s, process :: s -> x -> s, finish :: s -> a }

A value of type `LeftFold x a`

is then an algorithm that traverses a list of `x`

es and returns something of type `a`

. The `forall s.`

means that the user of such a processor must not make any assumptions about the type of the state. I also have a function that runs such a list processor, simply by invoking `foldl'`

with the arguments stored in the `LeftFold`

constructor:

runLeftFold :: LeftFold x a -> [x] -> a runLeftFold (LeftFold st1 p1 f1) xs = f1 (foldl' p1 st1 xs)

Now, as a Haskell programmer, I am searching for monads everywhere, and it really would have been nice if this were a monad, as they are very easy to compose. But unfortunately, they do not form a monad. If they did then I could decide after running one processor and looking at the result how I would want to process the list now, but this obviously contradicts with the desire to traverse the list only once.

But there is the monad’s smaller ugly sibling, the applicative functor. One can still somewhat nicely compose them to form more complex algorithms, just with more restrictions. Here is my instance:

instance Functor (LeftFold x) where fmap f (LeftFold st1 p1 f2) = LeftFold st1 p1 (f . f2) instance Applicative (LeftFold x) where pure x = LeftFold () const (const x) LeftFold st1 p1 f1 <*> LeftFold st2 p2 f2 = LeftFold { start = st1 :!: st2, process = \(s1 :!: s2) x -> p1 s1 x :!: p2 s2 x, finish = \(s1 :!: s2) -> f1 s1 (f2 s2) }

The `:!:`

operator is a strict tuple constructor. This is important, as otherwise the fold would be come a lazy one, I would be building up huge numbers of thunks and performance would be very bad. For the same reason every list processor should make sure that its state does not build thunks once it is in weak head normal form.

Now I am at a abstraction level that I feel comfortable working in, by writing functions that combine the list processors in various ways. Probably the most important one for me is already present in the libraries (here with a specialized type):

sequenceA :: [LeftFold x a] -> LeftFold x [a]

This is the whole trick: Now I can, based on the user input, run any number and selection of reports in one single list traversal. I find it very elegant that “under the hood”, the strict tuples are nested at precisely the right depth depending on the length of the list.

As hinted at before, I do some interesting stuff that was previously represented as lists of lists of values. Expressing these as left folds was a bit tricky, but again, the complexity is contained in the combinators that I wrote (although it still gets confusing, check out `processIntervalReport`

in `Stats.hs`

):

monoidFold :: Monoid m => LeftFold m m mapElems :: LeftFold y a -> (x -> y) -> LeftFold x a filterWith :: (x -> Bool) -> LeftFold (Bool :!: x) a -> LeftFold x a onSelected :: LeftFold x a -> LeftFold (Bool :!: x) a onJusts :: LeftFold x a -> LeftFold (Maybe x) a onAll :: LeftFold x a -> LeftFold (Bool :!: x) a runOnGroups :: (x -> x -> Bool) -> LeftFold x y -> LeftFold y z -> LeftFold x z runOnIntervals :: LeftFold x y -> LeftFold y z -> LeftFold (Bool :!: x) z lfLength :: LeftFold x Int lfFirst :: LeftFold x (Maybe x) lfLast :: LeftFold x (Maybe x) toList :: LeftFold x [x] concatFold :: LeftFold [x] [x]

As a result, the program now only requires 10MB and runs 18% faster.

If you wonder why, in the definition of `runLeftFold`

, I do not apply the finish function strictly: That is intentional; I am actually passing the result of a processor that calculates some often required data (e.g. total time recorded) as an argument to all the list processors. (Near the end of the main function in `stats-main.hs`

) This “loop” works fine as long as I am using these only in the `finish`

functions.

There is one downside to this approach: I cannot easily calculate stuff on demand *and* share it between two reports. Previously, I could just bind this to a variable, pass it to all reports, only if at least one is interested it would be evaluated and if several are interested, the result would be shared. I do not have a good solution for this yet.

The code above is in the self-contained module `LeftFold.hs`

in the arbtt sources.

Several People have had this idea before, here is one blog post by Max Rabkin.

You mentioned that you'd like to write this as a monad rather than as a fold. I'd suggest looking at the Conduit package, which makes it easy to do this type of data processing efficiently as a monad.

#1 Anonymous am 2012-09-28T17:06:25+00:00

Have something to say? You can post a comment by sending an e-Mail to me at <mail@joachim-breitner.de>, and I will include it here.