Advanced filesystems beyond metacrap

DISCLAIMER: This site is a mirror of original one that was once available at http://iki.fi/~tuomov/b/

With the Redmond megacorporation doing it, implementing database filesystems is popular these days. Without much success. Metacrap is not the only problem. The lack of human-readable unique identifiers is another potential problem, at least from the point of view of the command line user, and a programmer of a program that wants access to the program's data files in a known location.

An object on a conventional file system is uniquely identified by a string of characters, its path. In mathematical notation, the identifier is an element of an admissible subset of the set A^* of all strings formable from the alphabet A. The alphabet may be the characters of some encoding (A={'1','2',…,'a','b', …}), but for our purposes it is more convenient to consider A itself to be the set of all strings formable from a character set minus unallowed characters such as directory separators (A={"usr", "bin", "home", "README", "11 - Some Song.mp3", … }). For the abstract treatment it is, however, not necessary to set down any particular alphabet.

The nice thing about this kind of file systems is that the unique identifiers are most human readable, and can also be known in advance. Contrast this with a generic databases, where searches with any terms may return multiple answers, and any unique identifiers must be allocated before they can be known, and are unlikely to be human-readable. Alternatively, uniqueness must otherwise enforced on some indexing criteria, which will simply mean that the database file system is lying on top of a normal one, or the other way around. In mathematic notation, slightly simplifying, one may consider a database consisting of multiple sets B_key of pairs (value, object), where each object may be referenced multiple times, and each value may occur multiple times for different objects and the same key. An object may be uniquely found only by knowing a set of (key, value) pairs that no other objects match all. Such identification, however, isn't permanent unless uniqueness is enforced for values of some keys.

There is, however, a way to improve filesystems without going all the way to generic databases. The idea is to use subsets of the alphabet A instead of strings (ordered sets) to identify a file. That is, each object o on the file system is associated with a finite subset R_o ⊂ A, and no other object is associated with the same subset, but may be associated with a smaller or greater set. If A is finite, at most 2^#A objects can be identified. Therefore we need A to be an infinite set, such as the set of all words formable from an admissible subset of a character set, as already discussed.

In more everyday terms, a file is assigned to multiple categories, and the set of categories a file is assigned to, uniquely identifies it. No other file may be assigned to the exactly same set of categories. For example, the system-wide configuration file for the program "foo" might be identified with the set B = {"cfg", "foo"}, while the user configuration file might be identified with C = {"cfg", "foo", "username"}. The user might sometimes want to, for example, create a backup copy of this file as D = C ∪ {"backup"}.

So far, this doesn't look so different from a conventional file system that supports nested files (needed for the backup example to have to path components listed above). The difference lies in how these files may be accessed. Since sets do not have inherent order, every ordering of the elements of an identifying set as a path should be usable for accessing the file. /cfg/foo/username, /username/cfg/foo, /foo/cfg/username, and so on, should all work. Yes, this can be achieved with a symlink setup in conventional file systems, but with a lot of work.

How would one then navigate and view such a non-hierarchical collection? Obviously, it should be possible to search for files belonging to a certain set of categories. Thus a search for the category cfg should find all the files B, C and D, and many others. Some pruning of the results is obviously necessary. The idea is to prune all results that are supersets of others, i.e. take the minimal sets among the search results. Because B ⊂ C, D, in our example only the result B, i.e. the system-wide configuration file would then be shown, together with a hint indicating that there are more results under it, or that it has a dual-purpose as a directory, if you prefer the notion. In a shell interface to the file system, with cd setting the default search criteria, and ls performing the search, just the text foo/ (along with other results for other programs) could be shown for the command sequence cd /cfg; ls -F. (The -F flag to GNU ls has the function to add the final slash to indicate directories.) Therefore, so far it's just like navigating a directory hierarchy.

However, sometimes the minimal sets among the search results may not differ from the search set ("current directory") by just one element. Suppose, for example, that we instead had B={"cfg", "foo", "system"}. Then ls /cfg would have to list all the files, B, C and D as results. In some cases this may be desirable, and in some cases not. The manual remedy is to add a dummy file {"cfg", "foo"} with no content. Call it a directory, if you like. Often this would seem even desirable. For example, the file {"mp3", "Some Artist", "Some Album"} could contain information on the album, possibly a playlist of the files on it if you like playlists. The file {"mp3", "Some Artist"}, on the other hand, is probably something that is not even wanted to exist (if this pruning criteria is used). This way ls /mp3/ would list all your albums, but ls /mp3/Some\ Artist would list all the albums by this artist.

But, there should be no inherent need for such manual creation of "directories". The file system navigation tools can automatically prune search results. For all the results whose identifier sets are not supersets of those results whose identifier set differs from the search set by one element, the tools can find a number of sets (differing by one element from the search set) that best describe them. For example, for each search result, take all the subsets of this identifier set that consist of the search criteria and one extra element, and order all such subsets by their occurence frequency in all the search results. Then start picking out most frequent of those sets until each search result is a superset of at least one. This operation does need a bit of computation (polynomial time), but I do not expect it to be a big problem in practice. If it were, heuristics could be developed.

I think this kind of file system would be far more useful and especially usable from the command line, than what the popular plans for database file systems with support for arbitrary searches and so on seem to be (I have not delved into them in too much detail). Such can, of course, be implemented on top of this proposal, and there's nothing to stop using categories with "key=value" pairs. I also think that this kind of file system has a quality that helps avoiding metacrap. You don't have to fill in a zillion fields (possibly predetermined by the high priests of Semantopia) for the file's metadata. It simply is in a category or it isn't. You decide the categories to use, and as the categories are part of the file's "name", there's no complex process needed to create them. But you also do not have to decide on a compromise hierarchy, unlike with a conventional file system. There simply isn't any one, and yet there's all of them.

Article: