Michael Mentele

software

Oh Prefix Trees... You Complete Me

You may have caught a clue from the goofy title but prefix trees have a well known usecase: autocomplete.

Given what you have typed so far, what would some possible completions be?

For example, given ‘ca’ I might see ‘cat’ or ‘catch’ or ‘california’.

How do we do that? How does that work?

This question will drive our discussion of a useful data structure for precomputing information about a colletion of strings.

The Auto Complete Problem

If I have a dictionary with n words of average size m, given a prefix of size k, how can I quickly return all words with that prefix?

Brute Force

Just search every word in the dictionary!

Time: O(n*m)

You might think – why not O(n*k)? Because I need to try my prefix on every word, and if it is a match – copy the word. The matching is n*k but the copying is n*m.

Space: O(1)

At home, the time complexity doesn’t matter, there might be a few hundred thousand words in the dictionary with an average length O(10). A few million operations is not a problem for modern systems. But let’s pretend we are Google and have high volume.

Time efficiency matters.

Hash it All!

The next obvious approach might be to hash EVERY prefix to a list of matching words…

Now, for any string of length m we will have m + m-1 + m-2 … 1 prefixes. For example, prefixes for cat: c, ca, cat might correspond to a dictionary of cat, catch, car like:

{
  'c': ['cat', 'car', 'catch'],
  'ca': ['cat', 'car', 'catch'],
  'cat': ['cat', 'catch']
}

This is a triangular series where every pair m-1 … 1 sums to m and there are m/2 such pairs. Long story short this works out to O(m^2). And so we will have O(n*m^2) space just for the prefixes.

What about the word lists we would store at each prefix? In the worstcase we could store n words for a single prefix – for example the prefix ‘a’ will have some proportion (1/26) of the words. So the space for any given prefix is n*m and there are n*m^2 then total space becomes O(n^2*m^3)

The time cost of retrieving all the matches is pretty good though – the lookup time of a hash is O(1). But the retrieval of the values is actually O(n*m) because we still have to copy all the words out. If we were just returning a count of the number of words it’s a prefix for it would be O(1).

One nit though is hashing the input string will take O(m) – so bear in mind hashmaps aren’t technically O(1) though we usual call it such.

The above is really still based on the size of the prefix being looked up.

We can’t do better in time complexity – but we can do better in space! Much better.

Prefix Tree AKA Trie

If we think about the hash solution for cat above where do we see duplication?

  1. the words are duplicated
  2. the prefixes overlap

You may have taken the leap already but we can effectively represent this as a tree like:

        t$ -> c -> h$
       /
 c -> a
       \
        r$

Where the $ represents a word. Notice how much more compact this is – we’ve removed all duplication!

This is as fast as hashing because in both case we are traversing the prefix. If we use a hash to store the children pointers we can also guarantee constant time lookup of the children.

Visit this repl for an example implementation.

Radix or PATRICIA Trees

You might also notice we have a lot of waste in the trie above. What is the point of having a node with 1 child? This is the key insight of a radix tree and a PATRICIA tree which are compressed versions of the trie.

The worst class complexity doesn’t change (in the worst case there is nothing to compress) but from a practical standpoint prefix trees are often sparse so it makes a huge difference in the average case.

        t -> ch$
       / \
    ca    $
       \
        r$

I’m not going to cover the implementation here so feel free to google this one if interested :)

Conclusion

We only scratched the surface of the usefulness of a prefix tree. Sure, autocomplete is one usecase. But we might also use a tree for doing IP address lookups or any number of other applications.

Whenever you need to prune down the search space based on a prefix i.e. what you’ve seen so far, a prefix tree is a natural fit.

Happy coding!