summaryrefslogtreecommitdiff
path: root/suffix tree/suffiex-tree.txt
diff options
context:
space:
mode:
Diffstat (limited to 'suffix tree/suffiex-tree.txt')
-rw-r--r--suffix tree/suffiex-tree.txt168
1 files changed, 0 insertions, 168 deletions
diff --git a/suffix tree/suffiex-tree.txt b/suffix tree/suffiex-tree.txt
deleted file mode 100644
index 6eb4b5a..0000000
--- a/suffix tree/suffiex-tree.txt
+++ /dev/null
@@ -1,168 +0,0 @@
-Title: Suffix Trees
-Author: JSDurand
-Created: 2020-01-03
--------------------
-
-======================================================================
- Motivation to implement this algorithm in Emacs
-======================================================================
-
-The reason I want to implement this algorithm is a problem I
-encountered in using the package "orderless", which provides a
-completion-style for the built-in completion system in Emacs. The
-problem is that the function "orderless-try-completion" does not
-handle completion aggressively in a way that conforms to the
-documentation of "try-completion". To be more precise, if there is
-only one match for the current text input, then this function returns
-that match (rather than comforming to the requirement in the
-documentation of "try-completion" by returning t, since it wants to
-highlight the matches by itself). This isn't a problem from the
-perspective of the user, and if there are no matches, then this
-correctly returns nil. But in any other cases, this function returns
-the original string, instead of the longest common substring, as a
-user might desire.
-
-Initially, this does not seem like a serious concern: the user could
-still select a completion from the "*Completions*" buffer once the
-list of candidates becomes small enough to easily manage. But as time
-goes by, this starts to frustrate me. For example, when there are only
-two candidates left, but one is a substring of another, then I cannot
-use the completion feature to quickly select (or "complete to") the
-shorter candidate, unless I type out the full candidate string, or to
-choose from the "*Completions*" buffer. For me this kind of defeats
-the main purpose of the built-in completion system.
-
-So I start wondering, is it possible to fix the problem by finding the
-longest common substring in all the matches? From a naïve first
-impression, this seems to be what the user might expect in most cases.
-
-
-
-======================================================================
- Choice of the algorithm
-======================================================================
-
-After some thinking and searching through the internet, I found that
-perhaps the most flexible and performant solution to the problem of
-finding the longest common substring(s) of multiple strings is to
-build a "generalized suffix tree" of them, and then use a tree
-traversal to find the longest common substring(s). Well, this is all
-fun and great. The only problem is that it is difficult to build a
-suffix tree (or a generalized one).
-
-So I decide to implement the seemingly fastest algorithm to construct
-a suffix tree of strings and hope that this can not only solve my
-problem but also help others out in some other problems in the future.
-
-
-
-======================================================================
- Definitions
-======================================================================
-
-I describe the basic definition of a suffix tree briefly below.
-
-In this document a string is a sequence of alphabets. In particular,
-for the case of Emacs, these alphabets are just numbers. We begin by
-considering a string S of length n. A suffix of S is a substring of S
-of the form S[i..n] for some i from 1 to n. And we also say the empty
-string is a suffix of S. A suffix tree of S is defined as a rooted
-tree T (so it has a node called "root" that is the ancestor of every
-other node) whose every edge has a label which is a substring of S
-that satisfies the following conditions:
-
-- Starting from the root of T, walking down any path to a leaf, and
- concatenating the labels along the way, then we will get a suffix of
- S. And every suffix of S is obtained in this way as well.
-- Every node has at least two out-going edges.
-- For every node, every two out-going edges cannot have labels that
- start with the same letter. (So two edges with labels both starting
- with 'a' cannot emanate from the same node.)
-
-Intuitively speaking, this is to list all suffixes of S as an edge
-from the root to a leaf, and then "merge" these suffixes so that any
-common prefix among some of them is in only one edge.
-
-As a concrete example, the suffix tree for "hah$" is given as follows.
-I am using the library "hierarchy.el" to visualize trees.
-
-root
- h
- $
- ah$
- ah$
- $
-
-
-
-
-======================================================================
- Naive description of the algorithm
-======================================================================
-
-Below is a breif description of the algorithm. For a description in
-"plain English", see the accepted answer to this Stack Overflow post.
-
-https://stackoverflow.com/questions/9452701/
-
-Fonr a more detailed survey on the principle behind the algorithm and
-on many other related topics, see the book "Algorithms on Strings,
-Trees, and Sequences: Computer Science and Computational Biology" by
-Dan Gusfield, or if you prefer reading the original paper, then the
-original paper by Ukkonen is as follows.
-
-https://link.springer.com/article/10.1007/BF01206331
-
-(I am fortunate enough to be able to access the article. If you want
-to read that PDF and don't want to pay Springer, let me know and I can
-send you the file.)
-
-Given a string S of length n, we will first append a symbol that is
-not present anywhere in S, in order to ensure that no suffix of S is
-also the prefix of another suffix of S; otherwise S cannot have a
-suffix tree. I refer to this terminating symbol as $.
-
-Also an edge of the tree is not labelled explicitly by strings. To do
-so would violate already the linear time constraint. Instead, we
-represent each edge with a pair of integers, interpreted as the
-indices of the starting and the ending positions of the associated
-substring in T.
-
-The algorithm has n + 1 iterations. We start with the following
-variables:
-
-- A root node.
-- An active node.
-- An active edge.
-- The length of the active edge.
-- "remainder" with value 0.
-
-Then we want to add n + 1 symbols to the tree iteratively.
-
-In the i-th iteration we look to add the substring
-S[(i-remainder+1)..i] of S. In Emacs Lisp this is expressed as
-(substring S (- i remainder -1) (1+ i)).
-
-And in each iteration we do the following things (the description that
-follows is not rigorous; see the sources mentioned above for more
-detailed and rigorous expositions):
-
-- Check if the "active point" specified by the triple
- (active node, active edge, active edge length)
- has an edge going out whose first letter of the label matches the
- i-th character of S.
-- If it matches, then that means we don't have to add anything to
- the tree: we just update the specifications of the active point to
- point to the right place.
-- Otherwise, we add the i-th charcter of S as a new leaf of the
- active point. This may involve splitting an edge.
-- In both of the sub-cases, if there is a node that we created in
- the last round, then we add a "suffix link" that points to the
- current node. This suffix link is the main reason this algorithm
- can run in linear time.
-- After updating the active point, we still need to make sure that our
- specification of the active point is valid. We do this by examining
- if the active length is >= the length of the active edge. If so,
- then we set the active point to follow the path.
-
-Then we repeat until S(i) equals the terminating symbol.