From 3666deaed5b0baf0a74f14db5872105c9e7865f9 Mon Sep 17 00:00:00 2001 From: JSDurand Date: Wed, 13 Jan 2021 13:01:34 +0800 Subject: A temporary intermeidate step Now I got almost every functionality that we need, including pdf, mu4e, magit, et cetera. --- suffix tree/suffiex-tree.txt | 168 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 168 insertions(+) create mode 100644 suffix tree/suffiex-tree.txt (limited to 'suffix tree/suffiex-tree.txt') diff --git a/suffix tree/suffiex-tree.txt b/suffix tree/suffiex-tree.txt new file mode 100644 index 0000000..6eb4b5a --- /dev/null +++ b/suffix tree/suffiex-tree.txt @@ -0,0 +1,168 @@ +Title: Suffix Trees +Author: JSDurand +Created: 2020-01-03 +------------------- + +====================================================================== + Motivation to implement this algorithm in Emacs +====================================================================== + +The reason I want to implement this algorithm is a problem I +encountered in using the package "orderless", which provides a +completion-style for the built-in completion system in Emacs. The +problem is that the function "orderless-try-completion" does not +handle completion aggressively in a way that conforms to the +documentation of "try-completion". To be more precise, if there is +only one match for the current text input, then this function returns +that match (rather than comforming to the requirement in the +documentation of "try-completion" by returning t, since it wants to +highlight the matches by itself). This isn't a problem from the +perspective of the user, and if there are no matches, then this +correctly returns nil. But in any other cases, this function returns +the original string, instead of the longest common substring, as a +user might desire. + +Initially, this does not seem like a serious concern: the user could +still select a completion from the "*Completions*" buffer once the +list of candidates becomes small enough to easily manage. But as time +goes by, this starts to frustrate me. For example, when there are only +two candidates left, but one is a substring of another, then I cannot +use the completion feature to quickly select (or "complete to") the +shorter candidate, unless I type out the full candidate string, or to +choose from the "*Completions*" buffer. For me this kind of defeats +the main purpose of the built-in completion system. + +So I start wondering, is it possible to fix the problem by finding the +longest common substring in all the matches? From a naïve first +impression, this seems to be what the user might expect in most cases. + + + +====================================================================== + Choice of the algorithm +====================================================================== + +After some thinking and searching through the internet, I found that +perhaps the most flexible and performant solution to the problem of +finding the longest common substring(s) of multiple strings is to +build a "generalized suffix tree" of them, and then use a tree +traversal to find the longest common substring(s). Well, this is all +fun and great. The only problem is that it is difficult to build a +suffix tree (or a generalized one). + +So I decide to implement the seemingly fastest algorithm to construct +a suffix tree of strings and hope that this can not only solve my +problem but also help others out in some other problems in the future. + + + +====================================================================== + Definitions +====================================================================== + +I describe the basic definition of a suffix tree briefly below. + +In this document a string is a sequence of alphabets. In particular, +for the case of Emacs, these alphabets are just numbers. We begin by +considering a string S of length n. A suffix of S is a substring of S +of the form S[i..n] for some i from 1 to n. And we also say the empty +string is a suffix of S. A suffix tree of S is defined as a rooted +tree T (so it has a node called "root" that is the ancestor of every +other node) whose every edge has a label which is a substring of S +that satisfies the following conditions: + +- Starting from the root of T, walking down any path to a leaf, and + concatenating the labels along the way, then we will get a suffix of + S. And every suffix of S is obtained in this way as well. +- Every node has at least two out-going edges. +- For every node, every two out-going edges cannot have labels that + start with the same letter. (So two edges with labels both starting + with 'a' cannot emanate from the same node.) + +Intuitively speaking, this is to list all suffixes of S as an edge +from the root to a leaf, and then "merge" these suffixes so that any +common prefix among some of them is in only one edge. + +As a concrete example, the suffix tree for "hah$" is given as follows. +I am using the library "hierarchy.el" to visualize trees. + +root + h + $ + ah$ + ah$ + $ + + + + +====================================================================== + Naive description of the algorithm +====================================================================== + +Below is a breif description of the algorithm. For a description in +"plain English", see the accepted answer to this Stack Overflow post. + +https://stackoverflow.com/questions/9452701/ + +Fonr a more detailed survey on the principle behind the algorithm and +on many other related topics, see the book "Algorithms on Strings, +Trees, and Sequences: Computer Science and Computational Biology" by +Dan Gusfield, or if you prefer reading the original paper, then the +original paper by Ukkonen is as follows. + +https://link.springer.com/article/10.1007/BF01206331 + +(I am fortunate enough to be able to access the article. If you want +to read that PDF and don't want to pay Springer, let me know and I can +send you the file.) + +Given a string S of length n, we will first append a symbol that is +not present anywhere in S, in order to ensure that no suffix of S is +also the prefix of another suffix of S; otherwise S cannot have a +suffix tree. I refer to this terminating symbol as $. + +Also an edge of the tree is not labelled explicitly by strings. To do +so would violate already the linear time constraint. Instead, we +represent each edge with a pair of integers, interpreted as the +indices of the starting and the ending positions of the associated +substring in T. + +The algorithm has n + 1 iterations. We start with the following +variables: + +- A root node. +- An active node. +- An active edge. +- The length of the active edge. +- "remainder" with value 0. + +Then we want to add n + 1 symbols to the tree iteratively. + +In the i-th iteration we look to add the substring +S[(i-remainder+1)..i] of S. In Emacs Lisp this is expressed as +(substring S (- i remainder -1) (1+ i)). + +And in each iteration we do the following things (the description that +follows is not rigorous; see the sources mentioned above for more +detailed and rigorous expositions): + +- Check if the "active point" specified by the triple + (active node, active edge, active edge length) + has an edge going out whose first letter of the label matches the + i-th character of S. +- If it matches, then that means we don't have to add anything to + the tree: we just update the specifications of the active point to + point to the right place. +- Otherwise, we add the i-th charcter of S as a new leaf of the + active point. This may involve splitting an edge. +- In both of the sub-cases, if there is a node that we created in + the last round, then we add a "suffix link" that points to the + current node. This suffix link is the main reason this algorithm + can run in linear time. +- After updating the active point, we still need to make sure that our + specification of the active point is valid. We do this by examining + if the active length is >= the length of the active edge. If so, + then we set the active point to follow the path. + +Then we repeat until S(i) equals the terminating symbol. -- cgit v1.2.3-18-g5258