summaryrefslogtreecommitdiff
path: root/suffix tree/suffiex-tree.txt
blob: 6eb4b5ab119aaee1ecca4e7a517cd2e0cf2e28f8 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
Title: Suffix Trees
Author: JSDurand
Created: 2020-01-03
-------------------

======================================================================
           Motivation to implement this algorithm in Emacs
======================================================================

The reason I want to implement this algorithm is a problem I
encountered in using the package "orderless", which provides a
completion-style for the built-in completion system in Emacs. The
problem is that the function "orderless-try-completion" does not
handle completion aggressively in a way that conforms to the
documentation of "try-completion". To be more precise, if there is
only one match for the current text input, then this function returns
that match (rather than comforming to the requirement in the
documentation of "try-completion" by returning t, since it wants to
highlight the matches by itself). This isn't a problem from the
perspective of the user, and if there are no matches, then this
correctly returns nil. But in any other cases, this function returns
the original string, instead of the longest common substring, as a
user might desire.

Initially, this does not seem like a serious concern: the user could
still select a completion from the "*Completions*" buffer once the
list of candidates becomes small enough to easily manage. But as time
goes by, this starts to frustrate me. For example, when there are only
two candidates left, but one is a substring of another, then I cannot
use the completion feature to quickly select (or "complete to") the
shorter candidate, unless I type out the full candidate string, or to
choose from the "*Completions*" buffer. For me this kind of defeats
the main purpose of the built-in completion system.

So I start wondering, is it possible to fix the problem by finding the
longest common substring in all the matches? From a naïve first
impression, this seems to be what the user might expect in most cases.



======================================================================
                       Choice of the algorithm
======================================================================

After some thinking and searching through the internet, I found that
perhaps the most flexible and performant solution to the problem of
finding the longest common substring(s) of multiple strings is to
build a "generalized suffix tree" of them, and then use a tree
traversal to find the longest common substring(s). Well, this is all
fun and great. The only problem is that it is difficult to build a
suffix tree (or a generalized one).

So I decide to implement the seemingly fastest algorithm to construct
a suffix tree of strings and hope that this can not only solve my
problem but also help others out in some other problems in the future.



======================================================================
                             Definitions
======================================================================

I describe the basic definition of a suffix tree briefly below.

In this document a string is a sequence of alphabets. In particular,
for the case of Emacs, these alphabets are just numbers. We begin by
considering a string S of length n. A suffix of S is a substring of S
of the form S[i..n] for some i from 1 to n. And we also say the empty
string is a suffix of S. A suffix tree of S is defined as a rooted
tree T (so it has a node called "root" that is the ancestor of every
other node) whose every edge has a label which is a substring of S
that satisfies the following conditions:

- Starting from the root of T, walking down any path to a leaf, and
  concatenating the labels along the way, then we will get a suffix of
  S. And every suffix of S is obtained in this way as well.
- Every node has at least two out-going edges.
- For every node, every two out-going edges cannot have labels that
  start with the same letter. (So two edges with labels both starting
  with 'a' cannot emanate from the same node.)

Intuitively speaking, this is to list all suffixes of S as an edge
from the root to a leaf, and then "merge" these suffixes so that any
common prefix among some of them is in only one edge.

As a concrete example, the suffix tree for "hah$" is given as follows.
I am using the library "hierarchy.el" to visualize trees.
                                   
root
  h
    $
    ah$
  ah$
  $




======================================================================
                  Naive description of the algorithm
======================================================================

Below is a breif description of the algorithm. For a description in
"plain English", see the accepted answer to this Stack Overflow post.

https://stackoverflow.com/questions/9452701/

Fonr a more detailed survey on the principle behind the algorithm and
on many other related topics, see the book "Algorithms on Strings,
Trees, and Sequences: Computer Science and Computational Biology" by
Dan Gusfield, or if you prefer reading the original paper, then the
original paper by Ukkonen is as follows.

https://link.springer.com/article/10.1007/BF01206331

(I am fortunate enough to be able to access the article. If you want
to read that PDF and don't want to pay Springer, let me know and I can
send you the file.)

Given a string S of length n, we will first append a symbol that is
not present anywhere in S, in order to ensure that no suffix of S is
also the prefix of another suffix of S; otherwise S cannot have a
suffix tree. I refer to this terminating symbol as $.

Also an edge of the tree is not labelled explicitly by strings. To do
so would violate already the linear time constraint. Instead, we
represent each edge with a pair of integers, interpreted as the
indices of the starting and the ending positions of the associated
substring in T.

The algorithm has n + 1 iterations. We start with the following
variables:

- A root node.
- An active node.
- An active edge.
- The length of the active edge.
- "remainder" with value 0.

Then we want to add n + 1 symbols to the tree iteratively.

In the i-th iteration we look to add the substring
S[(i-remainder+1)..i] of S. In Emacs Lisp this is expressed as
(substring S (- i remainder -1) (1+ i)).

And in each iteration we do the following things (the description that
follows is not rigorous; see the sources mentioned above for more
detailed and rigorous expositions):

- Check if the "active point" specified by the triple
  (active node, active edge, active edge length)
  has an edge going out whose first letter of the label matches the
  i-th character of S.
- If it matches, then that means we don't have to add anything to
  the tree: we just update the specifications of the active point to
  point to the right place.
- Otherwise, we add the i-th charcter of S as a new leaf of the
  active point. This may involve splitting an edge.
- In both of the sub-cases, if there is a node that we created in
  the last round, then we add a "suffix link" that points to the
  current node. This suffix link is the main reason this algorithm
  can run in linear time.
- After updating the active point, we still need to make sure that our
  specification of the active point is valid. We do this by examining
  if the active length is >= the length of the active edge. If so,
  then we set the active point to follow the path.

Then we repeat until S(i) equals the terminating symbol.