Title: Suffix Trees Author: JSDurand Created: 2020-01-03 ------------------- ====================================================================== Motivation to implement this algorithm in Emacs ====================================================================== The reason I want to implement this algorithm is a problem I encountered in using the package "orderless", which provides a completion-style for the built-in completion system in Emacs. The problem is that the function "orderless-try-completion" does not handle completion aggressively in a way that conforms to the documentation of "try-completion". To be more precise, if there is only one match for the current text input, then this function returns that match (rather than comforming to the requirement in the documentation of "try-completion" by returning t, since it wants to highlight the matches by itself). This isn't a problem from the perspective of the user, and if there are no matches, then this correctly returns nil. But in any other cases, this function returns the original string, instead of the longest common substring, as a user might desire. Initially, this does not seem like a serious concern: the user could still select a completion from the "*Completions*" buffer once the list of candidates becomes small enough to easily manage. But as time goes by, this starts to frustrate me. For example, when there are only two candidates left, but one is a substring of another, then I cannot use the completion feature to quickly select (or "complete to") the shorter candidate, unless I type out the full candidate string, or to choose from the "*Completions*" buffer. For me this kind of defeats the main purpose of the built-in completion system. So I start wondering, is it possible to fix the problem by finding the longest common substring in all the matches? From a naïve first impression, this seems to be what the user might expect in most cases. ====================================================================== Choice of the algorithm ====================================================================== After some thinking and searching through the internet, I found that perhaps the most flexible and performant solution to the problem of finding the longest common substring(s) of multiple strings is to build a "generalized suffix tree" of them, and then use a tree traversal to find the longest common substring(s). Well, this is all fun and great. The only problem is that it is difficult to build a suffix tree (or a generalized one). So I decide to implement the seemingly fastest algorithm to construct a suffix tree of strings and hope that this can not only solve my problem but also help others out in some other problems in the future. ====================================================================== Definitions ====================================================================== I describe the basic definition of a suffix tree briefly below. In this document a string is a sequence of alphabets. In particular, for the case of Emacs, these alphabets are just numbers. We begin by considering a string S of length n. A suffix of S is a substring of S of the form S[i..n] for some i from 1 to n. And we also say the empty string is a suffix of S. A suffix tree of S is defined as a rooted tree T (so it has a node called "root" that is the ancestor of every other node) whose every edge has a label which is a substring of S that satisfies the following conditions: - Starting from the root of T, walking down any path to a leaf, and concatenating the labels along the way, then we will get a suffix of S. And every suffix of S is obtained in this way as well. - Every node has at least two out-going edges. - For every node, every two out-going edges cannot have labels that start with the same letter. (So two edges with labels both starting with 'a' cannot emanate from the same node.) Intuitively speaking, this is to list all suffixes of S as an edge from the root to a leaf, and then "merge" these suffixes so that any common prefix among some of them is in only one edge. ====================================================================== Description of the algorithm ====================================================================== Below is a breif description of the algorithm. For a description in "plain English", see the accepted answer to this Stack Overflow post. https://stackoverflow.com/questions/9452701/ For a more detailed survey on the principle behind the algorithm and on many other related topics, see the book "Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology" by Dan Gusfield, or if you prefer reading the original paper, then the original paper by Ukkonen is as follows. https://link.springer.com/article/10.1007/BF01206331 (I am fortunate enough to be able to access the article. If you want to read that PDF and don't want to pay Springer, let me know and I can send you the file.) Given a string S of length n, we will first append a symbol that is not present anywhere in S, in order to ensure that no suffix of S is also the prefix of another suffix of S; otherwise S cannot have a suffix tree. I refer to this terminating symbol as $. Also an edge of the tree is not labelled explicitly by strings. To do so would violate already the linear time constraint. Instead, we represent each edge with a pair of integers, interpreted as the indices of the starting and the ending points of the associated substring in T. The algorithm has n + 1 iterations. We start with the following variables: - s = root - k = 1 - i = 0 # - A root node. # - An "active point" with value (root, nil, 0). # - "remainder" with value 1. Then we want to add n + 1 symbols to the tree iteratively. In the i-th iteration we look to add the i-th letter S (i) of S. In Emacs Lisp this is expressed as (aref S i). # substring # S[(i-remainder+1)..i] And in each iteration we do the following things: - (setq i (+ i 1)) - (let ((result (update s k i))) (setq s (car result)) (setq k (cadr result))) We update by adding the i-th letter S(i) to the current active point indicated by (s, k, i - 1), see below. - (let ((result (canonize s k i))) (setq s (car result)) (setq k (cadr result))) We canonize the new active point returned by the update function. The process of canonization is to find the closest node to the point. Then we repeat until S(i) equals the terminating symbol (this way we avoid calculating the length of the string beforehand, since in Emacs Lisp the string does not have the length pre-calculated. Though this does not affect the overall time complexity, it might affect the practical performance. The function "update is described below. It taks three arguments: s, k, and i. First let oldr = root. Then let (end-p, r) be the result of (test-and-split s k (- i 1) S(i)). # begin a "remainder loop". Whenever the remainder # loop ends, we go to the next iteration. We compare the letter S(i) with the labels of edges going out from the active point. That is, if the active point is (node, edge_label, m), then this is the length m point in the edge going out from node with the first letter of label given as edge_label. If the letter is the prefix of some edge label, then we set the active point to (node, edge_label, m+1) and increment remainder by 1. If m+1 is greater than or equal to the length of the current edge, then set the active point to follow that edge to the new point. Then we end the remainder loop and skip to the next iteration. --- If the letter is not the prefix of any edge label emanating from the active point, and m > 0, then we split the point (node, edge_label, m) into a node and add a new leaf to that node, with label S(i). And then we decrement remainder by 1. If this is not the first time we split a node in the current iteration, then we add a "suffix link" from the previously created node to the newly created node. If the "node" in the specification of the active point is the root, then we set the active point to (root, S(i-remainder+1), m-1). If the "node" is not the root, then if the node has a suffix link to other_node, then we set the active point to the following: (other_node, S(i-remainder+1), m-1) If the node has no suffix link, then we set the active point to the following: (root, S(i-remainder+1), m-1) --- If the letter is not the prefix of any edge label emanating from the active point, and if m = 0, then we add a new leaf with label S(i) to node, and decrement remainder by 1. Then we follow the same rules to reset the active point.