summaryrefslogtreecommitdiff
path: root/comb/suffiex-tree.txt
blob: 51908a3609a4ef5bf825d128f79eef1e58c07384 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
Title: Suffix Trees
Author: JSDurand
Created: 2020-01-03
-------------------

======================================================================
           Motivation to implement this algorithm in Emacs
======================================================================

The reason I want to implement this algorithm is a problem I
encountered in using the package "orderless", which provides a
completion-style for the built-in completion system in Emacs. The
problem is that the function "orderless-try-completion" does not
handle completion aggressively in a way that conforms to the
documentation of "try-completion". To be more precise, if there is
only one match for the current text input, then this function returns
that match (rather than comforming to the requirement in the
documentation of "try-completion" by returning t, since it wants to
highlight the matches by itself). This isn't a problem from the
perspective of the user, and if there are no matches, then this
correctly returns nil. But in any other cases, this function returns
the original string, instead of the longest common substring, as a
user might desire.

Initially, this does not seem like a serious concern: the user could
still select a completion from the "*Completions*" buffer once the
list of candidates becomes small enough to easily manage. But as time
goes by, this starts to frustrate me. For example, when there are only
two candidates left, but one is a substring of another, then I cannot
use the completion feature to quickly select (or "complete to") the
shorter candidate, unless I type out the full candidate string, or to
choose from the "*Completions*" buffer. For me this kind of defeats
the main purpose of the built-in completion system.

So I start wondering, is it possible to fix the problem by finding the
longest common substring in all the matches? From a naïve first
impression, this seems to be what the user might expect in most cases.



======================================================================
                       Choice of the algorithm
======================================================================

After some thinking and searching through the internet, I found that
perhaps the most flexible and performant solution to the problem of
finding the longest common substring(s) of multiple strings is to
build a "generalized suffix tree" of them, and then use a tree
traversal to find the longest common substring(s). Well, this is all
fun and great. The only problem is that it is difficult to build a
suffix tree (or a generalized one).

So I decide to implement the seemingly fastest algorithm to construct
a suffix tree of strings and hope that this can not only solve my
problem but also help others out in some other problems in the future.



======================================================================
                             Definitions
======================================================================

I describe the basic definition of a suffix tree briefly below.

In this document a string is a sequence of alphabets. In particular,
for the case of Emacs, these alphabets are just numbers. We begin by
considering a string S of length n. A suffix of S is a substring of S
of the form S[i..n] for some i from 1 to n. And we also say the empty
string is a suffix of S. A suffix tree of S is defined as a rooted
tree T (so it has a node called "root" that is the ancestor of every
other node) whose every edge has a label which is a substring of S
that satisfies the following conditions:

- Starting from the root of T, walking down any path to a leaf, and
  concatenating the labels along the way, then we will get a suffix of
  S. And every suffix of S is obtained in this way as well.
- Every node has at least two out-going edges.
- For every node, every two out-going edges cannot have labels that
  start with the same letter. (So two edges with labels both starting
  with 'a' cannot emanate from the same node.)

Intuitively speaking, this is to list all suffixes of S as an edge
from the root to a leaf, and then "merge" these suffixes so that any
common prefix among some of them is in only one edge.



======================================================================
                     Description of the algorithm
======================================================================

Below is a breif description of the algorithm. For a description in
"plain English", see the accepted answer to this Stack Overflow post.

https://stackoverflow.com/questions/9452701/

For a more detailed survey on the principle behind the algorithm and
on many other related topics, see the book "Algorithms on Strings,
Trees, and Sequences: Computer Science and Computational Biology" by
Dan Gusfield, or if you prefer reading the original paper, then the
original paper by Ukkonen is as follows.

https://link.springer.com/article/10.1007/BF01206331

(I am fortunate enough to be able to access the article. If you want
to read that PDF and don't want to pay Springer, let me know and I can
send you the file.)

Given a string S of length n, we will first append a symbol that is
not present anywhere in S, in order to ensure that no suffix of S is
also the prefix of another suffix of S; otherwise S cannot have a
suffix tree. I refer to this terminating symbol as $.

Also an edge of the tree is not labelled explicitly by strings. To do
so would violate already the linear time constraint. Instead, we
represent each edge with a pair of integers, interpreted as the
indices of the starting and the ending points of the associated
substring in T.

The algorithm has n + 1 iterations. We start with the following
variables:

- s = root
- k = 1
- i = 0
# - A root node.
# - An "active point" with value (root, nil, 0).
# - "remainder" with value 1.

Then we want to add n + 1 symbols to the tree iteratively.

In the i-th iteration we look to add the i-th letter S (i) of S. In
Emacs Lisp this is expressed as (aref S i).

# substring
# S[(i-remainder+1)..i]

And in each iteration we do the following things:

- (setq i (+ i 1))
- (let ((result (update s k i)))
    (setq s (car result))
    (setq k (cadr result)))
  We update by adding the i-th letter S(i) to the current active point
  indicated by (s, k, i - 1), see below.
- (let ((result (canonize s k i)))
    (setq s (car result))
    (setq k (cadr result)))
  We canonize the new active point returned by the update function.
  The process of canonization is to find the closest node to the
  point.

Then we repeat until S(i) equals the terminating symbol (this way we
avoid calculating the length of the string beforehand, since in Emacs
Lisp the string does not have the length pre-calculated. Though this
does not affect the overall time complexity, it might affect the
practical performance.

The function "update is described below.

It taks three arguments: s, k, and i. First let oldr = root. Then let
(end-p, r) be the result of (test-and-split s k (- i 1) S(i)).













#  begin a "remainder loop". Whenever the remainder
# loop ends, we go to the next iteration.

We compare the letter S(i) with the labels of edges going out from the
active point. That is, if the active point is (node, edge_label, m),
then this is the length m point in the edge going out from node with
the first letter of label given as edge_label.

If the letter is the prefix of some edge label, then we set the
active point to (node, edge_label, m+1) and increment remainder by 1.
If m+1 is greater than or equal to the length of the current edge,
then set the active point to follow that edge to the new point.

Then we end the remainder loop and skip to the next iteration.

---

If the letter is not the prefix of any edge label emanating from the
active point, and m > 0, then we split the point (node, edge_label, m)
into a node and add a new leaf to that node, with label S(i). And then
we decrement remainder by 1.

If this is not the first time we split a node in the current
iteration, then we add a "suffix link" from the previously created
node to the newly created node.

If the "node" in the specification of the active point is the root,
then we set the active point to (root, S(i-remainder+1), m-1).

If the "node" is not the root, then if the node has a suffix link to
other_node, then we set the active point to the following:
(other_node, S(i-remainder+1), m-1)

If the node has no suffix link, then we set the active point to the
following: (root, S(i-remainder+1), m-1)

---

If the letter is not the prefix of any edge label emanating from the
active point, and if m = 0, then we add a new leaf with label S(i) to
node, and decrement remainder by 1. Then we follow the same rules to
reset the active point.