Monday, 22 March 2010

Infix search using a Trie data structure

I am authoring a jQuery plugin to enhance drop down selects; UFD: The Unobtrusive Fast-filter Drop-down. Try it out easily on your own app with the bookmarklet. Be amazed. Show your friends.

One of the important requirements was filtering of the the drop down options from the input. The naive way to do this is a linear search through each list item, but this is way too slow for large lists; the list needs to be re-filtered after every keystroke.

I decided on a Trie implementation, which gives great prefix search performance. A Trie stores objects on a tree, constructed by slicing the key-string for each object into characters. Starting at the root, each character maps to a node - see the incoming vertex in the diagram below.

Each node has a map of character to child nodes, and an optional target object. Target objects are placed where the chain of characters maps to the objects key, so you can easily traverse the nodes with a key-string to find a given the target object. The structure is a tree, but the path to a given object is similar to a linked list of each character in the key string.

trie example

The blue nodes are ones that have objects on them - in this example we see the words: to,tea,ted,ten,it,int,into. Notice how objects are on the leaf nodes, unless their key is a substring of another key.

It is easy to see how we can do quick prefix searching:start at the root node (which represents the empty string) and traverse the nodes, consuming each character of the key prefix string. Once the prefix key string is consumed, that subtree has all the matches of the given key prefix.

But what if we want to find all nodes that have a given key infix? After failing to find a better data structure, I decided to hack an extension to the data structure.

Is there a better data structure that I should be using? Suggestions welcome. Anyway:

Instead of starting at the root node, we need an array of start points. During creation I put a reference to each new node in a list. There is one list for every unique character across the whole key set. This means that each list has references to all nodes of that character. These lists of start points are called the infixRoots.

Yes, some of these lists could be quite large. For example, the "e" list for a 1000-item list may have over 1000 items.

Lets look at a filtering example: lets search for the objects that match the infix key "to".

trie example: t

We start by getting the infix list matching the start character ("t") from the infixRoots set. The highlighted nodes and all of their child nodes, by definition, have "t" in their key - no surprises there.

trie example: to

We next iterate over the that list of "t" nodes, creating a new list with child nodes that match the next character; "o". Any node without a match is not added to the new list. We keep iteratively mapping node lists from one character to the next, until we consume the infix keys' characters.

The nodes left in the final set, and all of their child nodes, all are matches. Iterate over each subtree and get the attached objects. In our example above, the nodes representing to,into are found. Had the key top been in the tree, it would also match, as it would be a child of to.

Have a look at the javascript implementation for UFD, the InfixTrie is a self-contained prototype inside the source near the bottom.

Tuesday, 2 March 2010

At last: a proper multi booting OSX/Win7/Linux box lives!

I have nutted out the final few problems with my multi-boot drama.
  • (hd0): MBR - grub2, win7, Liunx
  • (hd1): dataz
  • (hd2): bootable OSX 10.6 with GPT partition, Chameleon booter etc.
I can point my bios to hd2 and OSX boots, but it took ages to get grub2 to boot my OSX disc (hd2).  Ended up being real simple; in /etc/grub.d/40_custom :

menuentry "Mac OSX" {
  set root=(hd2)
  chainloader +1
}

and then a sudo update-grub

Yay, it boots.  My second major drama was with macFUSE and fuse-ext2; it doesn't work with the 64-bit kernel.  Here is the console output:

toolmans-Mac-Pro:Volumes toolman$ sudo fuse-ext2 /dev/disk2s1 /Volumes/Untitled\ 30
fuse-ext2: version:'0.0.7', fuse_version:'27' [main (../../fuse-ext2/fuse-ext2.c:324)]
fuse-ext2: enter [do_probe (../../fuse-ext2/do_probe.c:30)]
fuse-ext2: leave [do_probe (../../fuse-ext2/do_probe.c:55)]
Mounting /dev/disk2s1 Read-Only.
Use 'force' or 'rw+' options to enable Read-Write mode
fuse-ext2: opts.device: /dev/disk2s1 [main (../../fuse-ext2/fuse-ext2.c:351)]
fuse-ext2: opts.mnt_point: /Volumes/Untitled 30 [main (../../fuse-ext2/fuse-ext2.c:352)]
fuse-ext2: opts.volname:  [main (../../fuse-ext2/fuse-ext2.c:353)]
fuse-ext2: opts.options: (null) [main (../../fuse-ext2/fuse-ext2.c:354)]
fuse-ext2: parsed_options: allow_other,local,noappledouble,ro,fsname=/dev/disk2s1,fstypename=ext2,volname=disk2s1 [main (../../fuse-ext2/fuse-ext2.c:355)]
fuse-ext2: mounting read-only [main (../../fuse-ext2/fuse-ext2.c:371)]
/Library/Filesystems/fusefs.fs/Support/fusefs.kext failed to load - (libkern/kext) link error; check the system/kernel logs for errors or try kextutil(8).
the MacFUSE file system is not available (71)
toolmans-Mac-Pro:Volumes toolman$


The highlighted line is the clincher; the 32-bit kext won't work with 64 bit kernel.  thankfully, I can run 64 bit apps on the 32-bit kernel anyway :)

It took me ages to figure out that chameleon uses

/Extra/com.apple.Boot.plist

not the default kernel boot file !

/Library/Preferences/SystemConfiguration/com.apple.Boot.plist

By editing the correct file, and updating the kernel flags

Kernel Flags
-x32

And now I have grub2 booting to any OS I want, with my ext3 mounting in all OSes :D