From dc206ebe9397d656971ba7fc3a092009ef4e797a Mon Sep 17 00:00:00 2001 From: JSDurand Date: Sat, 9 Jan 2021 14:22:00 +0800 Subject: temporary state --- comb/suffiex-tree.txt | 217 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 217 insertions(+) create mode 100644 comb/suffiex-tree.txt (limited to 'comb') diff --git a/comb/suffiex-tree.txt b/comb/suffiex-tree.txt new file mode 100644 index 0000000..51908a3 --- /dev/null +++ b/comb/suffiex-tree.txt @@ -0,0 +1,217 @@ +Title: Suffix Trees +Author: JSDurand +Created: 2020-01-03 +------------------- + +====================================================================== + Motivation to implement this algorithm in Emacs +====================================================================== + +The reason I want to implement this algorithm is a problem I +encountered in using the package "orderless", which provides a +completion-style for the built-in completion system in Emacs. The +problem is that the function "orderless-try-completion" does not +handle completion aggressively in a way that conforms to the +documentation of "try-completion". To be more precise, if there is +only one match for the current text input, then this function returns +that match (rather than comforming to the requirement in the +documentation of "try-completion" by returning t, since it wants to +highlight the matches by itself). This isn't a problem from the +perspective of the user, and if there are no matches, then this +correctly returns nil. But in any other cases, this function returns +the original string, instead of the longest common substring, as a +user might desire. + +Initially, this does not seem like a serious concern: the user could +still select a completion from the "*Completions*" buffer once the +list of candidates becomes small enough to easily manage. But as time +goes by, this starts to frustrate me. For example, when there are only +two candidates left, but one is a substring of another, then I cannot +use the completion feature to quickly select (or "complete to") the +shorter candidate, unless I type out the full candidate string, or to +choose from the "*Completions*" buffer. For me this kind of defeats +the main purpose of the built-in completion system. + +So I start wondering, is it possible to fix the problem by finding the +longest common substring in all the matches? From a naïve first +impression, this seems to be what the user might expect in most cases. + + + +====================================================================== + Choice of the algorithm +====================================================================== + +After some thinking and searching through the internet, I found that +perhaps the most flexible and performant solution to the problem of +finding the longest common substring(s) of multiple strings is to +build a "generalized suffix tree" of them, and then use a tree +traversal to find the longest common substring(s). Well, this is all +fun and great. The only problem is that it is difficult to build a +suffix tree (or a generalized one). + +So I decide to implement the seemingly fastest algorithm to construct +a suffix tree of strings and hope that this can not only solve my +problem but also help others out in some other problems in the future. + + + +====================================================================== + Definitions +====================================================================== + +I describe the basic definition of a suffix tree briefly below. + +In this document a string is a sequence of alphabets. In particular, +for the case of Emacs, these alphabets are just numbers. We begin by +considering a string S of length n. A suffix of S is a substring of S +of the form S[i..n] for some i from 1 to n. And we also say the empty +string is a suffix of S. A suffix tree of S is defined as a rooted +tree T (so it has a node called "root" that is the ancestor of every +other node) whose every edge has a label which is a substring of S +that satisfies the following conditions: + +- Starting from the root of T, walking down any path to a leaf, and + concatenating the labels along the way, then we will get a suffix of + S. And every suffix of S is obtained in this way as well. +- Every node has at least two out-going edges. +- For every node, every two out-going edges cannot have labels that + start with the same letter. (So two edges with labels both starting + with 'a' cannot emanate from the same node.) + +Intuitively speaking, this is to list all suffixes of S as an edge +from the root to a leaf, and then "merge" these suffixes so that any +common prefix among some of them is in only one edge. + + + +====================================================================== + Description of the algorithm +====================================================================== + +Below is a breif description of the algorithm. For a description in +"plain English", see the accepted answer to this Stack Overflow post. + +https://stackoverflow.com/questions/9452701/ + +For a more detailed survey on the principle behind the algorithm and +on many other related topics, see the book "Algorithms on Strings, +Trees, and Sequences: Computer Science and Computational Biology" by +Dan Gusfield, or if you prefer reading the original paper, then the +original paper by Ukkonen is as follows. + +https://link.springer.com/article/10.1007/BF01206331 + +(I am fortunate enough to be able to access the article. If you want +to read that PDF and don't want to pay Springer, let me know and I can +send you the file.) + +Given a string S of length n, we will first append a symbol that is +not present anywhere in S, in order to ensure that no suffix of S is +also the prefix of another suffix of S; otherwise S cannot have a +suffix tree. I refer to this terminating symbol as $. + +Also an edge of the tree is not labelled explicitly by strings. To do +so would violate already the linear time constraint. Instead, we +represent each edge with a pair of integers, interpreted as the +indices of the starting and the ending points of the associated +substring in T. + +The algorithm has n + 1 iterations. We start with the following +variables: + +- s = root +- k = 1 +- i = 0 +# - A root node. +# - An "active point" with value (root, nil, 0). +# - "remainder" with value 1. + +Then we want to add n + 1 symbols to the tree iteratively. + +In the i-th iteration we look to add the i-th letter S (i) of S. In +Emacs Lisp this is expressed as (aref S i). + +# substring +# S[(i-remainder+1)..i] + +And in each iteration we do the following things: + +- (setq i (+ i 1)) +- (let ((result (update s k i))) + (setq s (car result)) + (setq k (cadr result))) + We update by adding the i-th letter S(i) to the current active point + indicated by (s, k, i - 1), see below. +- (let ((result (canonize s k i))) + (setq s (car result)) + (setq k (cadr result))) + We canonize the new active point returned by the update function. + The process of canonization is to find the closest node to the + point. + +Then we repeat until S(i) equals the terminating symbol (this way we +avoid calculating the length of the string beforehand, since in Emacs +Lisp the string does not have the length pre-calculated. Though this +does not affect the overall time complexity, it might affect the +practical performance. + +The function "update is described below. + +It taks three arguments: s, k, and i. First let oldr = root. Then let +(end-p, r) be the result of (test-and-split s k (- i 1) S(i)). + + + + + + + + + + + + + +# begin a "remainder loop". Whenever the remainder +# loop ends, we go to the next iteration. + +We compare the letter S(i) with the labels of edges going out from the +active point. That is, if the active point is (node, edge_label, m), +then this is the length m point in the edge going out from node with +the first letter of label given as edge_label. + +If the letter is the prefix of some edge label, then we set the +active point to (node, edge_label, m+1) and increment remainder by 1. +If m+1 is greater than or equal to the length of the current edge, +then set the active point to follow that edge to the new point. + +Then we end the remainder loop and skip to the next iteration. + +--- + +If the letter is not the prefix of any edge label emanating from the +active point, and m > 0, then we split the point (node, edge_label, m) +into a node and add a new leaf to that node, with label S(i). And then +we decrement remainder by 1. + +If this is not the first time we split a node in the current +iteration, then we add a "suffix link" from the previously created +node to the newly created node. + +If the "node" in the specification of the active point is the root, +then we set the active point to (root, S(i-remainder+1), m-1). + +If the "node" is not the root, then if the node has a suffix link to +other_node, then we set the active point to the following: +(other_node, S(i-remainder+1), m-1) + +If the node has no suffix link, then we set the active point to the +following: (root, S(i-remainder+1), m-1) + +--- + +If the letter is not the prefix of any edge label emanating from the +active point, and if m = 0, then we add a new leaf with label S(i) to +node, and decrement remainder by 1. Then we follow the same rules to +reset the active point. -- cgit v1.2.3-18-g5258