\documentclass[12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{grffile}
\usepackage{longtable}
\usepackage{wrapfig}
\usepackage{rotating}
\usepackage[normalem]{ulem}
\usepackage{amsmath}
\usepackage{mathrsfs}
\usepackage{textcomp}
\usepackage{amssymb}
\usepackage{capt-of}
\usepackage{hyperref}
\usepackage{amsfonts, amsthm}
\usepackage{enumitem}
\usepackage{tikz}
\author{Durand}
\date{2023-08-05}
\title{\vspace{-5ex}Algorithm used by the package REP}
\hypersetup{
  pdfauthor={Durand},
  pdftitle={Algorithm used by the package REP},
  pdfkeywords={},
  pdfsubject={},
  pdfcreator={Durand},
  pdflang={English}}

\newtheorem{lem}{Lemma}[section]
\newtheorem{coro}[lem]{Corollary}
\newtheorem{thm}{Theorem}[section]
\newtheorem{prop}{Proposition}[section]
\newtheorem{ex}{Exercise}[section]
\theoremstyle{definition}
\newtheorem*{rmk}{Remark}
\newtheorem{defn}{Definition}[section]

\newcommand{\n}{\mathbb{N}}
\newcommand{\z}{\mathbb{Z}}
\newcommand{\q}{\mathbb{Q}}
\newcommand{\im}{\operatorname{Im}}
\renewcommand{\labelenumi}{\arabic{enumi}.}
\renewcommand{\labelenumii}{(\alph{enumii})}
\begin{document}
\maketitle

\tableofcontents

\begin{abstract}
  This package uses an algorithm that is not widely adopted, at least
  not to my knowledge, so I will briefly explain how it works.

  The essence of the algorithm, in my opinion, is the chain-rule for
  derivatives of languages.  An optimization technique that makes a
  great difference in the performance of the algorithm, though not
  yet implemented in this package, is \textbf{context-free
    memoization}.
\end{abstract}

\section{Grammars}\label{sec:grammars}

Before explaining what parsers are, we first need to explain what
grammars are.

In short, a grammar can be viewed as a \emph{substitution machine}.
That is, a grammar consists of a set of variables, constants, and
rules

\begin{equation*}
X \mapsto Y \ldots Z,
\end{equation*}

where \(X\) is a variable and \(Y \ldots Z\) are either variables or
constants.  The grammar then substitutes every occurrence of \(X\) by
the sequence \(Y\ldots Z\) on the right-hand side.  There is also a
\emph{starting variable}, usually denoted by \(S\).

To sum up, we start from the sequence \(S\), and perform substitutions
according to the rules.  If the resulting sequence cannot be further
substituted, i.e.\ if it consists of constants only, we say that this
constant-only sequence is ``generated'' by the grammar.

One can think of this substitution machine as macros in \TeX{}, except
that it only allows macros without arguments.  Or one can think of the
machine as macros in programming languages like \texttt{Lisp}, again
without arguments.

Notice that a variable can have multiple substitution rules, and a
given sequence can be generated by different sequences of
substitutions.
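
As a concrete illustration, here is a minimal sketch in Python (not
the language of this package; all names are illustrative) of a grammar
as a substitution machine, with the rules \(S \mapsto a S b\) and
\(S \mapsto \varepsilon\):

\begin{verbatim}
import random

RULES = {
    # S -> "a" S "b"  |  ""   (generates a^n b^n)
    "S": [["a", "S", "b"], []],
}

def generate(start="S", rules=RULES):
    """Randomly substitute variables until only constants remain."""
    sentence = [start]
    while any(symbol in rules for symbol in sentence):
        i = next(i for i, symbol in enumerate(sentence)
                 if symbol in rules)
        sentence[i:i + 1] = random.choice(rules[sentence[i]])
    return "".join(sentence)

print(generate())  # e.g. "aabb"
\end{verbatim}

Running \texttt{generate} a few times produces sentences such as the
empty sentence, \texttt{"ab"}, or \texttt{"aabb"}, each obtained by a
different sequence of substitutions.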

\section{Parsers}\label{sec:parsers}

Parsers are \emph{inverses} of grammars.  More precisely, while
grammars perform substitutions to generate sequences, parsers try to
find out what substitutions the grammar performed in order to generate
a given sequence.  Since there might be multiple ways to generate a
given sequence, the parser needs to find all possible ways.

In fact, the number of ways to generate a sequence can even be
infinite.  For example, consider a grammar with the following two
rules:

\begin{align*}
S &\mapsto S\\
S &\mapsto \varepsilon.
\end{align*}

This grammar means that \(S\) can be substituted by \(S\) itself, and
it can also be substituted by the empty sequence \(\varepsilon\).  In
this case, the grammar can only generate the empty sequence, but it
can generate the empty sequence in infinitely many ways: it can first
substitute \(S\) by itself any number of times, before finally
substituting \(S\) by the empty sequence.

Moreover, even if we do not consider the generations that produce
empty sequences, parsing with a general grammar takes \(O(n^{3})\)
time in the worst case for the classical algorithms, where \(O\)
denotes the \textbf{big-O} notation and \(n\) is the length of the
sequence to be parsed.  This is why efficient general parsers that
work for every grammar are hard to construct.

\section{Parser Generators}\label{sec:parser-generators}

Now we come to the topic of this package: parser generators.  Simply
speaking, a parser generator takes a grammar as input, and produces a
parser for that grammar as output.

One can classify parser generators into two types: compiled and
interpreted.  The former means that the generator produces the source
code of a parser, which must then be compiled to obtain the parser.
The latter means that the generator directly produces a parser that is
run by an interpreter.  The well-known \emph{tree-sitter} parser
generator belongs to the first category, for example.

This package belongs to the interpreted type.  But there is a plan to
add a \emph{compiled mode} as well, to potentially make the generated
parsers more performant.

\section{Augmented Backus-Naur Form}\label{sec:augm-back-naur}

There are various ways to write grammars.  This package chooses the
\emph{Augmented Backus-Naur Form} (ABNF).  See
\href{https://en.wikipedia.org/wiki/Augmented_Backus-Naur_Form}{Wiki}
for an overview, and see
\href{https://datatracker.ietf.org/doc/html/rfc5234}{RFC 5234} for its
detailed definition, or
\href{https://www.rfc-editor.org/rfc/rfc5234.txt}{the text version}.

The reason for this choice is that the \textbf{RFC documents} use this
format, so there is an abundance of mature grammars to use.  For
example, the ABNF format itself is described by a grammar written in
ABNF, so we can immediately use the generated parser to parse ABNF
grammars themselves.

There is also a plan to let the users choose their own favourite
format, of course.

\section{Languages and derivatives}\label{sec:lang-deriv}

This section is involved and optional.  One can jump directly to
Section~\ref{sec:tokenization} on tokenization if the details of the
algorithm are not of interest.

To understand the algorithm, we shall first talk about the derivatives
of languages.

For the sake of convenience, let \(\Sigma\) denote a finite set of
symbols, called the \emph{alphabet}; the set of all Unicode code
points is an example.  We call a sequence of elements of \(\Sigma\) a
\emph{sentence}, and we call a (possibly infinite) set of sentences a
\emph{language}.  For example, the set of all grammatically correct
English sentences is a language, to which we may give the name
\emph{English}.

Now, if \(L\) is a language and \(s\) is a sentence, we define the
\textbf{derivative of L with respect to s}, denoted as
\(\frac{dL}{ds}\), as the set of all sentences \(t\) such that the
concatenated sentence \(s \cdot t\) belongs to \(L\).  We can
re-formulate this definition in more mathematical terms:

\[
  \frac{dL}{ds} := \left\{t \text{ is a sentence } \mid s \cdot t
    \text{ belongs to } L\right\}.
\]

As a side note, the reason to call this the \emph{derivative} is not
only the (partial) similarity of the definitions, but also that the
operation of taking the derivative of a language with respect to a
sentence satisfies properties similar to those of \emph{derivatives
  in calculus}.
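
For a finite language the definition can be computed directly.  The
following Python sketch (illustrative only; real languages are
generally infinite and require a symbolic representation) implements
the definition above:

\begin{verbatim}
def derivative(language, s):
    """dL/ds = { t | s . t belongs to L }, for a finite language L."""
    return {t[len(s):] for t in language if t.startswith(s)}

L = {"abc", "abd", "xyz"}
print(derivative(L, "ab"))  # -> {'c', 'd'} (set order may vary)
\end{verbatim}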

\section{Parsing by taking derivatives}\label{sec:pars-taking-deriv}

Why consider derivatives in a document about parser generators?
Because we can parse sentences by taking derivatives.

More precisely, let \(L\) be a language and let \(s\) be a sentence.
If the derivative \(\frac{dL}{ds}\) of \(L\) by \(s\) (which is a set
of sentences) contains the empty sentence, then by definition this
means that \(s\) belongs to \(L\).  This observation is the basis of
parsing by taking derivatives.

\begin{quotation}
  Those who have a rigorous mind are probably frowning at the above
  paragraph because we say we can parse by taking derivatives, whereas
  we actually only determine if the sentence belongs to the language.
  This gap exists because the details of actually constructing the
  parse forests by taking derivatives are too complicated for me to
  explain clearly here.  I might come up with a clear explanation in
  the future, though.
\end{quotation}
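
The observation above already yields a recognizer: derive by one
symbol at a time, and at the end check whether the empty sentence is
present.  Here is a Python sketch on finite languages (the package of
course works on grammar representations, not on explicit sets of
sentences):

\begin{verbatim}
def derivative(language, a):
    """Derivative of a finite language by a single symbol a."""
    return {t[1:] for t in language if t.startswith(a)}

def recognize(language, sentence):
    for a in sentence:              # derive symbol by symbol
        language = derivative(language, a)
    return "" in language           # empty sentence => accepted

print(recognize({"abc", "abd"}, "abd"))  # True
print(recognize({"abc", "abd"}, "abx"))  # False
\end{verbatim}

This loop is exactly the strategy of the algorithm; the difficulty
lies in representing and deriving the infinite languages given by
grammars.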

\section{Regular languages}\label{sec:regular-languages}

With the above definition in mind, we now say that a language \(L\) is
\emph{regular} if and only if the set of derivatives of \(L\) is
finite, i.e.

\[
  \left\{\frac{dL}{ds} \mid s \text{ is a sentence } \right\} \text{
    is finite}.
\]

Note that this is a set of languages, namely, a set of sets of
sentences.  And we say it is finite if it contains finitely many sets.

For those familiar with regular expressions, a language is regular if
and only if there is a regular expression such that a sentence belongs
to the language if and only if it matches that regular expression.  I
chose to define regular languages using derivatives because, in my
opinion, this makes them easier to reason about in what follows.
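
As an example of the definition, the language \(a^{\star}b\) over the
alphabet \(\{a, b\}\) has only three distinct derivatives, so it is
regular.  Tabulating them turns the derivatives into the states of a
finite automaton, as in this illustrative Python sketch:

\begin{verbatim}
# d(a*b)/da = a*b,  d(a*b)/db = {""},  and every further
# derivative of {""} is the empty language.
DERIVATIVES = {
    "a*b":   {"a": "a*b",   "b": "eps"},
    "eps":   {"a": "empty", "b": "empty"},
    "empty": {"a": "empty", "b": "empty"},
}

def recognize(sentence, state="a*b"):
    for a in sentence:
        state = DERIVATIVES[state][a]
    return state == "eps"  # the derivative containing ""

print(recognize("aaab"))  # True
print(recognize("abb"))   # False
\end{verbatim}

The table is precisely a deterministic finite automaton whose states
are the derivatives.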

\section{Languages of grammars}\label{sec:languages-grammars}

If \(G\) is a grammar, we say that \emph{\(G\) generates the language
  \(L\)} consisting of all sentences generated by the grammar.  We
denote the language generated by \(G\) by \(L_G\).

\section{Derivatives of the languages of grammars}
\label{sec:deriv-lang-gramm}

If \(s\) is a sentence, we ask the question:

\begin{quotation}
  What is the derivative \(\frac{dL_{G}}{ds}\) of \(L_G\) by \(s\)?
\end{quotation}

To answer this, note that if \(s\) and \(t\) are sentences, and if
\(L\) is any language, then \(\frac{dL}{d(s\cdot t)}\) is equal to
\(\frac d{dt}\frac{dL}{ds}\), the derivative by \(t\) of the
derivative by \(s\) of \(L\).  This can be seen by definition: a
sentence \(r\) belongs to \(\frac{dL}{d(s\cdot t)}\) if and only if
\((s\cdot t\cdot r)\) belongs to \(L\), which is exactly the condition
for \((t\cdot r)\) to belong to \(\frac{dL}{ds}\).  And this is the
condition for \(r\) to belong to \(\frac d{dt} \frac{dL}{ds}\).

The above observation tells us that we can obtain \(\frac{dL_G}{ds}\)
by successively taking derivatives of \(L_G\) with respect to
sentences consisting of a single symbol.

Now consider \(\frac{dL_G}{da}\), where \(a\in\Sigma\) is a single
symbol.

Suppose \(G\) consists of \(n\) rules

\begin{equation}
  \label{eq:rules_of_g}
  \begin{aligned}
    X_{1}&\mapsto Y_{1}\ldots Y_{m_{1}}\\
    \vdots\\
    X_{n}&\mapsto Z_{1} \ldots Z_{m_{n}},
  \end{aligned}
\end{equation}
where \(X_i\) are variables, \(m_i\) is an integer depending on \(i\),
for \(1\leq i\leq n\), and
\(Y_1, \ldots, Y_{m_1}, \ldots, Z_1, \ldots, Z_{m_n}\) are either
variables or constants.

It is then natural to try to determine \(\frac{dL_G}{da}\) by some
kind of substitution.  So for \(1\leq i\leq n\), let \(X_i\) also
denote the language that is generated if we treat the variable \(X_i\)
as the starting variable in the grammar.  For instance, if \(X_1\) is
the original starting variable for the grammar \(G\), then \(L_G\) is
also regarded as equal to \(X_1\) under this view.

Further, we observe that if \(M_1\) and \(M_2\) are languages, and if
\(M\) is their concatenation

\begin{equation*}
  M := \left\{ s \cdot t\mid s \in M_{1}, t\in M_{2}\right\},
\end{equation*}

then \(\frac{dM}{da}\) is equal to
\[\frac{dM_{1}}{da}\cdot M_{2}\]
if \(M_{1}\) does not contain the empty sentence, and to
\[\frac{dM_{1}}{da} \cdot M_{2} \cup \frac{dM_{2}}{da}\]
if it does.  As an aside, this can be regarded as the
\emph{product-rule} for derivatives of languages.
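
On finite languages the product-rule can be verified directly; the
following Python sketch (finite sets only, illustrative names) mirrors
the two cases:

\begin{verbatim}
def derivative(language, a):
    return {t[1:] for t in language if t.startswith(a)}

def concat(m1, m2):
    return {s + t for s in m1 for t in m2}

def derivative_of_concat(m1, m2, a):
    d = concat(derivative(m1, a), m2)
    if "" in m1:                    # extra term only if M1 has ""
        d |= derivative(m2, a)
    return d

M1, M2 = {"", "a"}, {"b"}           # M = {"b", "ab"}
print(derivative_of_concat(M1, M2, "a"))  # {'b'}
\end{verbatim}

Here \(M = \{b, ab\}\), and since \(M_{1}\) contains the empty
sentence, the extra term \(\frac{dM_{2}}{da}\) is included (it happens
to be empty in this example).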

Consequently, the derivative \(\frac{dX_1}{da}\) is equal to the union
of

\begin{align*}
  &\frac{dY_{1}}{da}\cdot Y_{2}\cdots Y_{m_{1}},\\
  &\frac{dY_{2}}{da}\cdot Y_{3}\cdots Y_{m_{1}},\\
  &\vdots\\
  &\frac{dY_{i}}{da}\cdot Y_{i+1}\cdots Y_{m_{1}},
\end{align*}

where \(i\) is the smallest integer such that the language \(Y_{i}\)
does not contain the empty sentence.  If no such \(i\) exists, then we
set \(i=m_{1}\).  And \emph{mutatis mutandis} for
\(\frac{dX_{2}}{da},\ldots, \frac{dX_{n}}{da}\).

Now we treat the languages \(\frac{dY_i}{da}\) as variables: if
\(Y_i\) is a variable, this denotes the derivative by \(a\) of the
language generated by \(Y_i\); if \(Y_i\) is a constant, then this
denotes either the language consisting of the empty sentence, or the
empty set, depending upon whether \(Y_i\) is equal to \(a\) or not.
Then the previous paragraph defines the substitution rules of these
new variables \(\frac{dY_i}{da}\).  We call these rules \emph{the
  derived rules} of \(\frac{dY_i}{da}\).

It follows that we can obtain the derivative
\(\frac{dL_{G}}{da} = \frac{dX_{1}}{da}\) by \emph{substituting} these
\(\frac{dY_i}{da}\) according to the derived rules.

Combining the above, we see that the derivatives of the language
\(L_G\) by any sentence are generated by grammars.  We denote the
above grammar, which generates \(\frac{dL_G}{da}\), as \emph{the
  derived grammar} \(\frac{dG}{da}\).  Our problem is now transformed
into determining whether the successive derivatives of grammars
\(\frac{dG}{da_1}, \frac d{da_2}\frac{dG}{da_1}, \ldots\) generate
the empty sentence or not.  This is relatively easy to do, once we can
compute the successive derivatives efficiently.
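
To make the successive derivation concrete, here is a naive
recognizer by grammar derivatives, sketched in Python (illustrative
names, not this package's implementation).  It assumes a grammar
without left recursion, so that the naive recursion terminates;
computing the successive derivatives efficiently in general is
exactly what the chain-rule of the next section is about.

\begin{verbatim}
GRAMMAR = {  # S -> "a" S | ""   (the language a*)
    "S": ("alt", ("seq", ("sym", "a"), ("var", "S")), ("eps",)),
}

def nullable(e):
    """Does the language of e contain the empty sentence?"""
    tag = e[0]
    if tag == "eps": return True
    if tag in ("empty", "sym"): return False
    if tag == "alt": return nullable(e[1]) or nullable(e[2])
    if tag == "seq": return nullable(e[1]) and nullable(e[2])
    return nullable(GRAMMAR[e[1]])            # "var"

def deriv(e, a):
    """The derivative of the language of e by the symbol a."""
    tag = e[0]
    if tag in ("eps", "empty"): return ("empty",)
    if tag == "sym": return ("eps",) if e[1] == a else ("empty",)
    if tag == "alt": return ("alt", deriv(e[1], a), deriv(e[2], a))
    if tag == "seq":
        d = ("seq", deriv(e[1], a), e[2])     # product-rule
        if nullable(e[1]):
            d = ("alt", d, deriv(e[2], a))    # extra term
        return d
    return deriv(GRAMMAR[e[1]], a)            # "var": unfold first

def recognize(sentence, start=("var", "S")):
    e = start
    for a in sentence:
        e = deriv(e, a)
    return nullable(e)

print(recognize("aaa"))  # True
print(recognize("ab"))   # False
\end{verbatim}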

\section{Chain-rule of derivatives}\label{sec:chain-rule-deriv}

None of the above was new when the author proposed the algorithm.
The core idea of the algorithm, in my opinion, is the
\textbf{chain-rule} for the operation of taking derivatives of
languages.

In calculus, roughly speaking, the chain-rule of derivatives refers to
the formula

\begin{equation*}
  \frac{df}{dx} =
  \sum_{i=1}^{n} \frac{df}{dy_{i}} \cdot \frac{dy_{i}}{dx},
\end{equation*}

where \(f\) is a function in \(n\) variables \(y_1, \ldots, y_n\),
which are also functions in the variable \(x\).

If we replace the function \(f\) by a language \(L\) generated by a
grammar \(G\), the variables \(y_1, \ldots , y_n\) by the variables in
\(G\), \(x\) by a symbol \(a \in \Sigma\), the multiplication
\(\cdot\) by the concatenation of languages, and the summation \(+\)
by the union of languages, then we obtain a formula writing the
derivative \(\frac{dG}{da}\) as a union of concatenations:

\begin{equation}\label{eq:chain_rule}
  \frac{dG}{da} =
  \bigcup_{i=1}^{n}\frac{dG}{dy_{i}} \cdot \frac{dy_{i}}{da}.
\end{equation}

The only remaining questions are: What does \(\frac{dG}{dy_{i}}\)
actually mean, and is the formula~\eqref{eq:chain_rule} true?

\subsection{Derivatives with respect to variables}
\label{sec:deriv-wrt-var}

We want to generalize the construction in
Section~\ref{sec:deriv-lang-gramm} so as to be able to take
derivatives with respect to variables.

First, for the sake of brevity, let \(\Sigma^{\star}\) denote the set
of all sentences, i.e. possibly empty sequences of elements in
\(\Sigma\).

We define a new alphabet
\(\Sigma' := \Sigma \cup \left\{X_{i}\,\vert\, 1\leq i\leq
  n\right\}\) and a new set of variables
\[
  \left\{ Z_{i}\ \bigg\vert\ 1\leq i\leq n \right\},
\]
where \(X_{i}\) are the variables of \(G\), according to its
rules~\eqref{eq:rules_of_g}.

Then for each rule \(X_{i}\mapsto Y_{1}\ldots Y_{m_{i}}\), we define
\(j\) rules: for \(1\leq k\leq j\),
\begin{equation}\label{eq:closed_grammar_rules}
  Z_{i}\mapsto Z_{k} \cdot Y_{k+1}\cdots Y_{m_{i}},
\end{equation}
where \(j\) is the smallest integer such that the language \(Y_{j}\)
does not contain the empty sequence \(\varepsilon\in\Sigma^{\star}\)
(and \(j = m_{i}\) if no such integer exists), and where \(Z_{k}\)
denotes the new variable corresponding to \(Y_{k}\).  Here if
\(Y_{k}\) belongs to \(\Sigma\), then \(Z_{k}\) is replaced by
\(Y_{k}\) itself, i.e.\ the right-hand side of that rule does not
contain variables.

This defines a grammar, which we call the \emph{closed grammar}
\(\overline G\).

With this closed grammar, we can naturally take derivatives with
respect to the original variables \(X_{i}\), as they are now symbols
in the alphabet of this closed grammar \(\overline G\).
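
As a small illustration, the following Python sketch constructs the
rules of the closed grammar from the rules of \(G\) (illustrative
names; a \texttt{nullable} predicate, deciding whether the language of
a symbol contains the empty sentence, is assumed to be given):

\begin{verbatim}
def close_grammar(rules, nullable):
    """rules: {X: [[Y1, ..., Ym], ...]}; returns rules of G-bar."""
    closed = {}
    for x, alternatives in rules.items():
        z = "Z_" + x
        for body in alternatives:
            # j = index of first non-nullable symbol, or len(body)
            j = next((k + 1 for k, y in enumerate(body)
                      if not nullable(y)), len(body))
            for k in range(1, j + 1):
                head = body[k - 1]
                # a variable Y_k becomes Z_k; a constant stays put
                first = "Z_" + head if head in rules else head
                closed.setdefault(z, []).append([first] + body[k:])
    return closed

rules = {"S": [["a", "S"], []]}   # S -> "a" S | ""
print(close_grammar(rules, nullable=lambda y: y == "S"))
# {'Z_S': [['a', 'S']]}
\end{verbatim}

For the grammar \(S \mapsto a S \mid \varepsilon\), the only produced
rule is \(Z_{S} \mapsto a\,S\), where \(S\) is now a symbol of
\(\Sigma'\).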

\subsection{Proof of the chain-rule}\label{sec:proof-chain-rule}

\begin{prop}\label{prop:proof-of-chain-rule}
  \[
    \frac{dL_{G}}{da} = \bigcup_{i=1}^{n} \frac{dX_{i}}{da} \cdot
    \left( \frac{dL_{\overline G}}{dX_{i}} \bigcap
      \operatorname{P}\left(\Sigma^{\star}\right) \right) ,\, \forall
    a\in\Sigma.
  \]
\end{prop}

\begin{proof}
  For any \(i\), any sequence in \(\frac{dL_{\overline G}}{dX_{i}}\)
  is generated from \(G\) by partial substitutions and removal of
  empty sequences.  So the containment of the right-hand side in the
  left-hand side is obvious.

  Conversely, if a sentence \(s\in\Sigma^{\star}\) begins with \(a\),
  then \(s\) must be obtained by substitution from some sentence in
  \({(\Sigma')}^{\star}\), which means that it belongs to the
  right-hand side.  So the proof is complete.
\end{proof}

\begin{rmk}
  It follows that we can determine whether or not
  \(\frac{dL_{G}}{da}\) contains the empty sequence by examining
  \(\frac{dX_{i}}{da}\) and \(\frac{dL_{\overline G}}{dX_{i}}\)
  respectively.

  Further, since
  \(\frac{dX_{i}}{da} = \frac{dZ_{i}}{da}\bigcap\operatorname{P}
  (\Sigma^{\star})\) and
  \(\frac{dL_{\overline G}}{dX_{i}} = \frac{dZ_{1}}{dX_{i}}\), we
  conclude that we only have to examine \(\frac{dZ_{i}}{dX_{j}}\), for
  \(1\leq i,j\leq n\).
\end{rmk}

What makes this grammar interesting is its regularity.

\subsection{The closed grammar is regular}\label{sec:clos-gramm-regul}

\begin{prop}
  The closed grammar is regular.  In fact, the set
  \[
    \left\{\frac{dZ_{i}}{ds}\ \bigg\vert\
      1\leq i\leq n,\,s\in{(\Sigma')}^{\star}\right\}
  \]
  is finite.
\end{prop}

\begin{proof}
  Observe the rules~\eqref{eq:closed_grammar_rules}.  From the
  definition, we see that the right-hand side of each rule contains at
  most one variable, \(Z_{k}\), and that variable must appear as the
  first symbol of the sequence on the right-hand side.  This
  observation is sufficient to show that the closed grammar is
  regular.  Rigorous-minded readers can prove this statement by first
  showing that the language generated by the closed grammar is a
  finite union of concatenations of languages of two types: either a
  set consisting of a single finite sentence, or a set consisting of
  zero or more self-repetitions of a finite sequence.  Then the reader
  can verify that such languages are regular, by showing that a
  language consisting of zero or more self-repetitions of a finite
  sequence has only finitely many derivatives.  A hint for the last
  step is to notice that the derivatives of the language
  \(s^{\star}\) of zero or more self-repetitions of a sequence
  \(s:=s_{1}\cdots s_{p}\) all belong to the finite set
  \[
    \left\{\left\{s_{i} s_{i+1}\cdots s_{p}\right\}\cdot s^{\star}
      \mid 1\leq i\leq p+1\right\} \cup \left\{\emptyset\right\}.
  \]
\end{proof}

\section{Tokenization}\label{sec:tokenization}

There is another abstraction used by this package.



\end{document}