Document Archival Tool?

Yesterday I stumbled over an interesting article and I thought “hey, it would be nice to save it for some time later”. Then I thought “just saving is not enough, because I want to know where I found it”. But manually writing the URL down somewhere as well feels ugly. Therefore I am looking for some kind of document archival tool. Here is how it might work, given a command line interface and the invented name “nodoc”:

$ nodoc get http://labs.google.com/papers/disk_failures.pdf
$ nodoc list
$ nodoc info disk
ambiguous file name “disk”
$ nodoc info failures
URL: http://labs.google.com/papers/disk_failures.pdf
Date: Sat, 03 Mar 2007 17:46:03 +0100
LastUpdate: Sat, 03 Mar 2007 17:46:03 +0100
Title: Failure Trends in a Large Disk Drive Population
$ nodoc tags failu + google disk statistics
$ nodoc list
disk_failures.pdf google,disk,statistics
$ nodoc update
disk_failures.pdf: File up to date
old_files_on_disks.pdf: New version available.... done downloading.
$ nodoc search happyness
happyness not found (full text search in 2 documents)

I think you get the idea. Note that I expect it to read the metadata from PDF files (which the mentioned article does not contain, but it’s just an example). Does anyone know of a simple Linux program, text-based or graphical, that does that? I am considering writing it myself, but I don’t want to reinvent the wheel.
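
To make the idea a bit more concrete, here is a minimal Python sketch of the bookkeeping behind “nodoc get”, “nodoc info” and “nodoc tags”. Everything in it is an assumption made for illustration: the Archive class, the index.json file and the field names are invented, and nothing like this exists as a finished tool. It only records where a file came from and when, and resolves name fragments the way the session above does (“disk” is ambiguous, “failures” is not). Reading PDF metadata, updating and full text search are left out.

```python
import json
import os
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import urlparse


class AmbiguousName(Exception):
    """Raised when a name fragment matches more than one archived file."""


def local_name(url):
    """Derive a local file name from the last path component of the URL."""
    return os.path.basename(urlparse(url).path) or "index.html"


class Archive:
    """Hypothetical archive: a directory of files plus a JSON metadata index."""

    def __init__(self, root):
        self.root = root
        self.index_path = os.path.join(root, "index.json")
        os.makedirs(root, exist_ok=True)
        if os.path.exists(self.index_path):
            with open(self.index_path, encoding="utf-8") as f:
                self.index = json.load(f)
        else:
            self.index = {}

    def _save(self):
        with open(self.index_path, "w", encoding="utf-8") as f:
            json.dump(self.index, f, indent=2)

    def record(self, url, when=None):
        """Store the source URL and a timestamp for a file; return its name."""
        name = local_name(url)
        stamp = format_datetime(when or datetime.now(timezone.utc))
        self.index[name] = {"URL": url, "Date": stamp,
                            "LastUpdate": stamp, "Tags": []}
        self._save()
        return name

    def get(self, url):
        """Download a URL into the archive, then record it (needs network)."""
        target = os.path.join(self.root, local_name(url))
        urllib.request.urlretrieve(url, target)
        return self.record(url)

    def find(self, fragment):
        """Resolve a file-name fragment to exactly one archived file."""
        matches = [name for name in self.index if fragment in name]
        if len(matches) > 1:
            raise AmbiguousName(f"ambiguous file name “{fragment}”")
        if not matches:
            raise KeyError(fragment)
        return matches[0]

    def info(self, fragment):
        """Return the metadata dictionary of the single matching file."""
        return self.index[self.find(fragment)]

    def tag(self, fragment, *tags):
        """Add tags to the matching file, skipping ones it already has."""
        entry = self.info(fragment)
        entry["Tags"].extend(t for t in tags if t not in entry["Tags"])
        self._save()
```

Keeping the index in one JSON file next to the documents is just the simplest thing that works for a handful of files; a real tool would probably want a proper database or a search index once full text search comes into play.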


Hi J,
I started hacking on a basic bash script based upon your idea, since it has some reasonable examples of expected behavior. Xapian may be what you need to extract metadata and keywords from the document. It has plugins to work with some file formats. But I don't think I've seen a CLI program that has that command set. Interesting idea.
#1 Kevin Mark (Homepage) am 2007-03-03T21:40:15+00:00
This sounds like it could be a good exercise as a beagle plugin. While beagle does have GUI components, there's also the command-line beagle-query client. And since it can already track which pages you visit in Iceweasel it would be relatively simple (I think :) ) to write a plugin that would combine the URL you got a document from with the saved document.
#2 Alex (Homepage) am 2007-03-03T22:33:04+00:00
Referencer seems like a good start: http://davyd.livejournal.com/209243.html
#3 James am 2007-03-04T01:21:57+00:00
This looks very close to what I want, but it lacks a download feature.
#4 Joachim Breitner (Homepage) am 2007-03-04T15:18:19+00:00
I downloaded version 1.0.2. It needs BibTeX files to be really useful. I gave it disk_failures.pdf: it did not know its URL and did not store a text version of the document that would let you search for terms in it. It can keep track of text and PDF files, but only in a very basic way where you type the basic metadata by hand. A good base of code to start with, nonetheless.
#5 Kevin Mark (Homepage) am 2007-03-04T19:29:01+00:00
What about enhancing a del.icio.us-like webapp (e.g. http://scuttle.org/) with a "mirror" function?
#6 mrf am 2007-03-04T12:11:47+00:00
Call me 1.0, but I don’t really digg that stuff :-) (at least at the moment)
#7 Joachim Breitner (Homepage) am 2007-03-04T15:20:06+00:00
Trackback: Links - 04.03.2007 (http://www.emplify.de/archives/529-Links-04.03.2007.html)
So, a new category called GWT, which will be filled in the near future... Bitter: because of a crack, WordPress released a version with a trojan in its code... Oops. We computer scientists should take a hard look at ourselves here. The production of a Compute
#8 emplify (Homepage) am 2007-03-04T22:44:29+00:00
What you describe sounds very much like Pinot's My Web Pages functionality. At least that's how I use it. Whenever I come across something interesting on the Web, I import that document in Pinot and give it a label. I then search the My Web Pages index or browse its contents whenever I need to find a document. User-editable metadata is currently limited to title and labels (basically tags) but it's only a matter of time before I extend that.
Pinot is based on Xapian, which Kevin Mark mentioned.
#9 Fabrice Colin (Homepage) am 2007-03-08T08:16:13+00:00
Trackback: Document archival (http://developer.berlios.de/blog/archives/99-Document-archival.html)
I came across this post on Joachim Breitner's blog where he describes his ideas on how a document archival tool should work.
All that time I had been using Pinot's My Web Pages for document archival without realizing it
Sure, things could be improved i
#10 BerliOS Developer Blogs (Homepage) am 2007-03-10T08:38:55+00:00
A similar request can be found at http://www.earth.li/~noodles/blog/dms.html
#11 Joachim Breitner (Homepage) am 2007-06-11T10:08:01+00:00

Have something to say? You can post a comment by sending an e-mail to me at <mail@joachim-breitner.de>, and I will include it here.