Michael Mentele

software

Why Strings are Sexy

Strings. What are they and why do we care?

These may seem like silly questions – as programmers we use strings everyday as primitives in our code. But there are more ways to view strings then just as text or templates to inject data into.

What are they? How would you define them?

Abstractly, strings are sequences that can be formed out of a bounded set of tokens called an alphabet.

So the question ‘why do we care about strings’ could become why do we care about sequences?

In a nutshell, sequences encode information – both in the order of the tokens as well as tokens themselves. Language is a great example of this. The meaning of a word depends on the characters and their sequence (relationship).

You could perhaps even go so far to decribe everything in computer science as an abstraction on top of a sequence. Afterall, RAM and memory is more or less a giant array of blocks, which themselves are arrays of smaller blocks of memory and so on until you get down to the register. And what is binary? Sequences that use a two token alphabet.

In any case, aknowledging the abstraction of a string as the sequence of a set of tokens that compress information is important because it expands our perspective and we realize that language, arrays, strings, etc. are all special cases of a sequence of tokens.

So, why do we care about strings? Strings are representave of a ubiquitous method to compress and communicate information.

This leads to the next question; how do we make use of that information? How do we access it? How do we search it? How do we store it?

These are more interesting questions!

How do we search strings?

First – what does it mean to search?

There are many examples of searching strings such as cmd+f, regular expressions, the google search bar, etc. What do these examples have in common?

Pattern matching. Searching strings is all about pattern matching.

What are the ways we can pattern match?

  • subsequence matching (ordered but non-contiguous)
  • substrings (ordered contiguous sequences)
  • set matching (token membership)

Definitions:

  • a substring is a contiguous sequence of a string. For example ‘apple’ has the substring ‘ppl’. Note that prefixes and suffixes are special substrings.
  • a subsequence is similar to a substring but it need not be contiguous. For example the string ‘apple’ has the subsequence ‘ae’.
  • set matching refers to membership of tokens of a string in another string. For example ‘apple’ has ‘a’ while ‘ardvark’ also has a. We can use this idea to compute edit distances and ‘likeness’.

In essence any search boils down to more or less complex combinations of these primitives. A simple autocomplete can be made with substring matching that has a special rule for prefixes. Regex describes ways of describing sets of sequences that are acceptable matches where strings contain values or not, contain substrings or not, contain subsequences or not. Consider \d* which matches any substring that starts with a digit. Or .*y.* which finds any string that contains a token y.

With these primitives we can ask questions like:

  • does a term exist in a text?
  • where does the term exist?
  • how many places does it exist?

If strings are about storing information and pattern matching is the key to unlocking that information – what is the next concern? Speed. Reading and witing information in the form of strings are the bread and butter actions.

What impacts speed?

The search space and how we make comparisons within that search space. How do we find substrings, subsequences, and check membership?

String algorithms!

How we Make Comparisons

First, we’ve talked about how strings are sequences – this is the fundamental way strings are stored – as a character array. But there are secondary data structures that represent strings in different ways. Precomputations to enable faster future search.

Popular data structures for precomputations on strings include:

  • prefix tree
  • prefix-suffix tables
  • suffix tree
  • radix tree
  • hashes
  • sets
  • special arrays (ropes)

Operating on these data structures we have some useful algorithms:

  • KMP & Rabin Karpe for substring search
  • k pointer algorithms
  • sliding window algorithms

Conclusion

The abstraction of a string is a fundamental strategy to store information that humans have been using ever since we invented language. As simple as they are, a few basic primitive ideas can lead to vast complexity and usefulness in many applications in computer science.