diff options
author | JSDurand <mmemmew@gmail.com> | 2021-01-16 22:32:26 +0800 |
---|---|---|
committer | JSDurand <mmemmew@gmail.com> | 2021-01-16 22:32:26 +0800 |
commit | 5f61771063f35f5cd6492708d89eb220e4067f6e (patch) | |
tree | 0083ad3c09a5b5ade92363f7aec15a33da11d85f /suffix tree/suffiex-tree.txt | |
parent | 013253308671031a37eb904d809cd9823cecd14d (diff) |
Some QoL changes
* basic.el (set-mark-command-repeat-pop): Repeat poping
* tab-conf.el (durand-switch-tab-dwim): Show the default tab to close
in the prompt.
Diffstat (limited to 'suffix tree/suffiex-tree.txt')
-rw-r--r-- | suffix tree/suffiex-tree.txt | 168 |
1 files changed, 0 insertions, 168 deletions
diff --git a/suffix tree/suffiex-tree.txt b/suffix tree/suffiex-tree.txt deleted file mode 100644 index 6eb4b5a..0000000 --- a/suffix tree/suffiex-tree.txt +++ /dev/null @@ -1,168 +0,0 @@ -Title: Suffix Trees -Author: JSDurand -Created: 2020-01-03 -------------------- - -====================================================================== - Motivation to implement this algorithm in Emacs -====================================================================== - -The reason I want to implement this algorithm is a problem I -encountered in using the package "orderless", which provides a -completion-style for the built-in completion system in Emacs. The -problem is that the function "orderless-try-completion" does not -handle completion aggressively in a way that conforms to the -documentation of "try-completion". To be more precise, if there is -only one match for the current text input, then this function returns -that match (rather than comforming to the requirement in the -documentation of "try-completion" by returning t, since it wants to -highlight the matches by itself). This isn't a problem from the -perspective of the user, and if there are no matches, then this -correctly returns nil. But in any other cases, this function returns -the original string, instead of the longest common substring, as a -user might desire. - -Initially, this does not seem like a serious concern: the user could -still select a completion from the "*Completions*" buffer once the -list of candidates becomes small enough to easily manage. But as time -goes by, this starts to frustrate me. For example, when there are only -two candidates left, but one is a substring of another, then I cannot -use the completion feature to quickly select (or "complete to") the -shorter candidate, unless I type out the full candidate string, or to -choose from the "*Completions*" buffer. For me this kind of defeats -the main purpose of the built-in completion system. - -So I start wondering, is it possible to fix the problem by finding the -longest common substring in all the matches? From a naïve first -impression, this seems to be what the user might expect in most cases. - - - -====================================================================== - Choice of the algorithm -====================================================================== - -After some thinking and searching through the internet, I found that -perhaps the most flexible and performant solution to the problem of -finding the longest common substring(s) of multiple strings is to -build a "generalized suffix tree" of them, and then use a tree -traversal to find the longest common substring(s). Well, this is all -fun and great. The only problem is that it is difficult to build a -suffix tree (or a generalized one). - -So I decide to implement the seemingly fastest algorithm to construct -a suffix tree of strings and hope that this can not only solve my -problem but also help others out in some other problems in the future. - - - -====================================================================== - Definitions -====================================================================== - -I describe the basic definition of a suffix tree briefly below. - -In this document a string is a sequence of alphabets. In particular, -for the case of Emacs, these alphabets are just numbers. We begin by -considering a string S of length n. A suffix of S is a substring of S -of the form S[i..n] for some i from 1 to n. And we also say the empty -string is a suffix of S. A suffix tree of S is defined as a rooted -tree T (so it has a node called "root" that is the ancestor of every -other node) whose every edge has a label which is a substring of S -that satisfies the following conditions: - -- Starting from the root of T, walking down any path to a leaf, and - concatenating the labels along the way, then we will get a suffix of - S. And every suffix of S is obtained in this way as well. -- Every node has at least two out-going edges. -- For every node, every two out-going edges cannot have labels that - start with the same letter. (So two edges with labels both starting - with 'a' cannot emanate from the same node.) - -Intuitively speaking, this is to list all suffixes of S as an edge -from the root to a leaf, and then "merge" these suffixes so that any -common prefix among some of them is in only one edge. - -As a concrete example, the suffix tree for "hah$" is given as follows. -I am using the library "hierarchy.el" to visualize trees. - -root - h - $ - ah$ - ah$ - $ - - - - -====================================================================== - Naive description of the algorithm -====================================================================== - -Below is a breif description of the algorithm. For a description in -"plain English", see the accepted answer to this Stack Overflow post. - -https://stackoverflow.com/questions/9452701/ - -Fonr a more detailed survey on the principle behind the algorithm and -on many other related topics, see the book "Algorithms on Strings, -Trees, and Sequences: Computer Science and Computational Biology" by -Dan Gusfield, or if you prefer reading the original paper, then the -original paper by Ukkonen is as follows. - -https://link.springer.com/article/10.1007/BF01206331 - -(I am fortunate enough to be able to access the article. If you want -to read that PDF and don't want to pay Springer, let me know and I can -send you the file.) - -Given a string S of length n, we will first append a symbol that is -not present anywhere in S, in order to ensure that no suffix of S is -also the prefix of another suffix of S; otherwise S cannot have a -suffix tree. I refer to this terminating symbol as $. - -Also an edge of the tree is not labelled explicitly by strings. To do -so would violate already the linear time constraint. Instead, we -represent each edge with a pair of integers, interpreted as the -indices of the starting and the ending positions of the associated -substring in T. - -The algorithm has n + 1 iterations. We start with the following -variables: - -- A root node. -- An active node. -- An active edge. -- The length of the active edge. -- "remainder" with value 0. - -Then we want to add n + 1 symbols to the tree iteratively. - -In the i-th iteration we look to add the substring -S[(i-remainder+1)..i] of S. In Emacs Lisp this is expressed as -(substring S (- i remainder -1) (1+ i)). - -And in each iteration we do the following things (the description that -follows is not rigorous; see the sources mentioned above for more -detailed and rigorous expositions): - -- Check if the "active point" specified by the triple - (active node, active edge, active edge length) - has an edge going out whose first letter of the label matches the - i-th character of S. -- If it matches, then that means we don't have to add anything to - the tree: we just update the specifications of the active point to - point to the right place. -- Otherwise, we add the i-th charcter of S as a new leaf of the - active point. This may involve splitting an edge. -- In both of the sub-cases, if there is a node that we created in - the last round, then we add a "suffix link" that points to the - current node. This suffix link is the main reason this algorithm - can run in linear time. -- After updating the active point, we still need to make sure that our - specification of the active point is valid. We do this by examining - if the active length is >= the length of the active edge. If so, - then we set the active point to follow the path. - -Then we repeat until S(i) equals the terminating symbol. |