summaryrefslogtreecommitdiff
path: root/comb/suffiex-tree.txt
diff options
context:
space:
mode:
Diffstat (limited to 'comb/suffiex-tree.txt')
-rw-r--r--comb/suffiex-tree.txt217
1 files changed, 217 insertions, 0 deletions
diff --git a/comb/suffiex-tree.txt b/comb/suffiex-tree.txt
new file mode 100644
index 0000000..51908a3
--- /dev/null
+++ b/comb/suffiex-tree.txt
@@ -0,0 +1,217 @@
+Title: Suffix Trees
+Author: JSDurand
+Created: 2020-01-03
+-------------------
+
+======================================================================
+ Motivation to implement this algorithm in Emacs
+======================================================================
+
+The reason I want to implement this algorithm is a problem I
+encountered in using the package "orderless", which provides a
+completion-style for the built-in completion system in Emacs. The
+problem is that the function "orderless-try-completion" does not
+handle completion aggressively in a way that conforms to the
+documentation of "try-completion". To be more precise, if there is
+only one match for the current text input, then this function returns
+that match (rather than comforming to the requirement in the
+documentation of "try-completion" by returning t, since it wants to
+highlight the matches by itself). This isn't a problem from the
+perspective of the user, and if there are no matches, then this
+correctly returns nil. But in any other cases, this function returns
+the original string, instead of the longest common substring, as a
+user might desire.
+
+Initially, this does not seem like a serious concern: the user could
+still select a completion from the "*Completions*" buffer once the
+list of candidates becomes small enough to easily manage. But as time
+goes by, this starts to frustrate me. For example, when there are only
+two candidates left, but one is a substring of another, then I cannot
+use the completion feature to quickly select (or "complete to") the
+shorter candidate, unless I type out the full candidate string, or to
+choose from the "*Completions*" buffer. For me this kind of defeats
+the main purpose of the built-in completion system.
+
+So I start wondering, is it possible to fix the problem by finding the
+longest common substring in all the matches? From a naïve first
+impression, this seems to be what the user might expect in most cases.
+
+
+
+======================================================================
+ Choice of the algorithm
+======================================================================
+
+After some thinking and searching through the internet, I found that
+perhaps the most flexible and performant solution to the problem of
+finding the longest common substring(s) of multiple strings is to
+build a "generalized suffix tree" of them, and then use a tree
+traversal to find the longest common substring(s). Well, this is all
+fun and great. The only problem is that it is difficult to build a
+suffix tree (or a generalized one).
+
+So I decide to implement the seemingly fastest algorithm to construct
+a suffix tree of strings and hope that this can not only solve my
+problem but also help others out in some other problems in the future.
+
+
+
+======================================================================
+ Definitions
+======================================================================
+
+I describe the basic definition of a suffix tree briefly below.
+
+In this document a string is a sequence of alphabets. In particular,
+for the case of Emacs, these alphabets are just numbers. We begin by
+considering a string S of length n. A suffix of S is a substring of S
+of the form S[i..n] for some i from 1 to n. And we also say the empty
+string is a suffix of S. A suffix tree of S is defined as a rooted
+tree T (so it has a node called "root" that is the ancestor of every
+other node) whose every edge has a label which is a substring of S
+that satisfies the following conditions:
+
+- Starting from the root of T, walking down any path to a leaf, and
+ concatenating the labels along the way, then we will get a suffix of
+ S. And every suffix of S is obtained in this way as well.
+- Every node has at least two out-going edges.
+- For every node, every two out-going edges cannot have labels that
+ start with the same letter. (So two edges with labels both starting
+ with 'a' cannot emanate from the same node.)
+
+Intuitively speaking, this is to list all suffixes of S as an edge
+from the root to a leaf, and then "merge" these suffixes so that any
+common prefix among some of them is in only one edge.
+
+
+
+======================================================================
+ Description of the algorithm
+======================================================================
+
+Below is a breif description of the algorithm. For a description in
+"plain English", see the accepted answer to this Stack Overflow post.
+
+https://stackoverflow.com/questions/9452701/
+
+For a more detailed survey on the principle behind the algorithm and
+on many other related topics, see the book "Algorithms on Strings,
+Trees, and Sequences: Computer Science and Computational Biology" by
+Dan Gusfield, or if you prefer reading the original paper, then the
+original paper by Ukkonen is as follows.
+
+https://link.springer.com/article/10.1007/BF01206331
+
+(I am fortunate enough to be able to access the article. If you want
+to read that PDF and don't want to pay Springer, let me know and I can
+send you the file.)
+
+Given a string S of length n, we will first append a symbol that is
+not present anywhere in S, in order to ensure that no suffix of S is
+also the prefix of another suffix of S; otherwise S cannot have a
+suffix tree. I refer to this terminating symbol as $.
+
+Also an edge of the tree is not labelled explicitly by strings. To do
+so would violate already the linear time constraint. Instead, we
+represent each edge with a pair of integers, interpreted as the
+indices of the starting and the ending points of the associated
+substring in T.
+
+The algorithm has n + 1 iterations. We start with the following
+variables:
+
+- s = root
+- k = 1
+- i = 0
+# - A root node.
+# - An "active point" with value (root, nil, 0).
+# - "remainder" with value 1.
+
+Then we want to add n + 1 symbols to the tree iteratively.
+
+In the i-th iteration we look to add the i-th letter S (i) of S. In
+Emacs Lisp this is expressed as (aref S i).
+
+# substring
+# S[(i-remainder+1)..i]
+
+And in each iteration we do the following things:
+
+- (setq i (+ i 1))
+- (let ((result (update s k i)))
+ (setq s (car result))
+ (setq k (cadr result)))
+ We update by adding the i-th letter S(i) to the current active point
+ indicated by (s, k, i - 1), see below.
+- (let ((result (canonize s k i)))
+ (setq s (car result))
+ (setq k (cadr result)))
+ We canonize the new active point returned by the update function.
+ The process of canonization is to find the closest node to the
+ point.
+
+Then we repeat until S(i) equals the terminating symbol (this way we
+avoid calculating the length of the string beforehand, since in Emacs
+Lisp the string does not have the length pre-calculated. Though this
+does not affect the overall time complexity, it might affect the
+practical performance.
+
+The function "update is described below.
+
+It taks three arguments: s, k, and i. First let oldr = root. Then let
+(end-p, r) be the result of (test-and-split s k (- i 1) S(i)).
+
+
+
+
+
+
+
+
+
+
+
+
+
+# begin a "remainder loop". Whenever the remainder
+# loop ends, we go to the next iteration.
+
+We compare the letter S(i) with the labels of edges going out from the
+active point. That is, if the active point is (node, edge_label, m),
+then this is the length m point in the edge going out from node with
+the first letter of label given as edge_label.
+
+If the letter is the prefix of some edge label, then we set the
+active point to (node, edge_label, m+1) and increment remainder by 1.
+If m+1 is greater than or equal to the length of the current edge,
+then set the active point to follow that edge to the new point.
+
+Then we end the remainder loop and skip to the next iteration.
+
+---
+
+If the letter is not the prefix of any edge label emanating from the
+active point, and m > 0, then we split the point (node, edge_label, m)
+into a node and add a new leaf to that node, with label S(i). And then
+we decrement remainder by 1.
+
+If this is not the first time we split a node in the current
+iteration, then we add a "suffix link" from the previously created
+node to the newly created node.
+
+If the "node" in the specification of the active point is the root,
+then we set the active point to (root, S(i-remainder+1), m-1).
+
+If the "node" is not the root, then if the node has a suffix link to
+other_node, then we set the active point to the following:
+(other_node, S(i-remainder+1), m-1)
+
+If the node has no suffix link, then we set the active point to the
+following: (root, S(i-remainder+1), m-1)
+
+---
+
+If the letter is not the prefix of any edge label emanating from the
+active point, and if m = 0, then we add a new leaf with label S(i) to
+node, and decrement remainder by 1. Then we follow the same rules to
+reset the active point.