Title: Suffix Trees Author: JSDurand Created: 2020-01-03 ------------------- ====================================================================== Motivation to implement this algorithm in Emacs ====================================================================== The reason I want to implement this algorithm is a problem I encountered in using the package "orderless", which provides a completion-style for the built-in completion system in Emacs. The problem is that the function "orderless-try-completion" does not handle completion aggressively in a way that conforms to the documentation of "try-completion". To be more precise, if there is only one match for the current text input, then this function returns that match (rather than comforming to the requirement in the documentation of "try-completion" by returning t, since it wants to highlight the matches by itself). This isn't a problem from the perspective of the user, and if there are no matches, then this correctly returns nil. But in any other cases, this function returns the original string, instead of the longest common substring, as a user might desire. Initially, this does not seem like a serious concern: the user could still select a completion from the "*Completions*" buffer once the list of candidates becomes small enough to easily manage. But as time goes by, this starts to frustrate me. For example, when there are only two candidates left, but one is a substring of another, then I cannot use the completion feature to quickly select (or "complete to") the shorter candidate, unless I type out the full candidate string, or to choose from the "*Completions*" buffer. For me this kind of defeats the main purpose of the built-in completion system. So I start wondering, is it possible to fix the problem by finding the longest common substring in all the matches? From a naïve first impression, this seems to be what the user might expect in most cases. ====================================================================== Choice of the algorithm ====================================================================== After some thinking and searching through the internet, I found that perhaps the most flexible and performant solution to the problem of finding the longest common substring(s) of multiple strings is to build a "generalized suffix tree" of them, and then use a tree traversal to find the longest common substring(s). Well, this is all fun and great. The only problem is that it is difficult to build a suffix tree (or a generalized one). So I decide to implement the seemingly fastest algorithm to construct a suffix tree of strings and hope that this can not only solve my problem but also help others out in some other problems in the future. ====================================================================== Definitions ====================================================================== I describe the basic definition of a suffix tree briefly below. In this document a string is a sequence of alphabets. In particular, for the case of Emacs, these alphabets are just numbers. We begin by considering a string S of length n. A suffix of S is a substring of S of the form S[i..n] for some i from 1 to n. And we also say the empty string is a suffix of S. A suffix tree of S is defined as a rooted tree T (so it has a node called "root" that is the ancestor of every other node) whose every edge has a label which is a substring of S that satisfies the following conditions: - Starting from the root of T, walking down any path to a leaf, and concatenating the labels along the way, then we will get a suffix of S. And every suffix of S is obtained in this way as well. - Every node has at least two out-going edges. - For every node, every two out-going edges cannot have labels that start with the same letter. (So two edges with labels both starting with 'a' cannot emanate from the same node.) Intuitively speaking, this is to list all suffixes of S as an edge from the root to a leaf, and then "merge" these suffixes so that any common prefix among some of them is in only one edge. As a concrete example, the suffix tree for "hah$" is given as follows. I am using the library "hierarchy.el" to visualize trees. root h $ ah$ ah$ $ ====================================================================== Naive description of the algorithm ====================================================================== Below is a breif description of the algorithm. For a description in "plain English", see the accepted answer to this Stack Overflow post. https://stackoverflow.com/questions/9452701/ Fonr a more detailed survey on the principle behind the algorithm and on many other related topics, see the book "Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology" by Dan Gusfield, or if you prefer reading the original paper, then the original paper by Ukkonen is as follows. https://link.springer.com/article/10.1007/BF01206331 (I am fortunate enough to be able to access the article. If you want to read that PDF and don't want to pay Springer, let me know and I can send you the file.) Given a string S of length n, we will first append a symbol that is not present anywhere in S, in order to ensure that no suffix of S is also the prefix of another suffix of S; otherwise S cannot have a suffix tree. I refer to this terminating symbol as $. Also an edge of the tree is not labelled explicitly by strings. To do so would violate already the linear time constraint. Instead, we represent each edge with a pair of integers, interpreted as the indices of the starting and the ending positions of the associated substring in T. The algorithm has n + 1 iterations. We start with the following variables: - A root node. - An active node. - An active edge. - The length of the active edge. - "remainder" with value 0. Then we want to add n + 1 symbols to the tree iteratively. In the i-th iteration we look to add the substring S[(i-remainder+1)..i] of S. In Emacs Lisp this is expressed as (substring S (- i remainder -1) (1+ i)). And in each iteration we do the following things (the description that follows is not rigorous; see the sources mentioned above for more detailed and rigorous expositions): - Check if the "active point" specified by the triple (active node, active edge, active edge length) has an edge going out whose first letter of the label matches the i-th character of S. - If it matches, then that means we don't have to add anything to the tree: we just update the specifications of the active point to point to the right place. - Otherwise, we add the i-th charcter of S as a new leaf of the active point. This may involve splitting an edge. - In both of the sub-cases, if there is a node that we created in the last round, then we add a "suffix link" that points to the current node. This suffix link is the main reason this algorithm can run in linear time. - After updating the active point, we still need to make sure that our specification of the active point is valid. We do this by examining if the active length is >= the length of the active edge. If so, then we set the active point to follow the path. Then we repeat until S(i) equals the terminating symbol.