Diffstat (limited to 'algorithm.tex')
-rw-r--r--  algorithm.tex  486
1 files changed, 486 insertions, 0 deletions
diff --git a/algorithm.tex b/algorithm.tex
new file mode 100644
index 0000000..20fb1ac
--- /dev/null
+++ b/algorithm.tex
@@ -0,0 +1,486 @@
+\documentclass[12pt]{article}
+\usepackage[utf8]{inputenc}
+\usepackage[T1]{fontenc}
+\usepackage{graphicx}
+\usepackage{grffile}
+\usepackage{longtable}
+\usepackage{wrapfig}
+\usepackage{rotating}
+\usepackage[normalem]{ulem}
+\usepackage{amsmath}
+\usepackage{mathrsfs}
+\usepackage{textcomp}
+\usepackage{amssymb}
+\usepackage{capt-of}
+\usepackage{hyperref}
+\usepackage{amsfonts, amsthm}
+\usepackage{enumitem}
+\usepackage{tikz}
+\author{Durand}
+\date{2023-08-05}
+\title{\vspace{-5ex}Algorithm used by the package REP}
+\hypersetup{
+ pdfauthor={Durand},
+ pdftitle={Algorithm used by the package REP},
+ pdfkeywords={},
+ pdfsubject={},
+ pdfcreator={Durand},
+ pdflang={English}}
+
+\newtheorem{lem}{Lemma}[section]
+\newtheorem{coro}[lem]{Corollary}
+\newtheorem{thm}{Theorem}[section]
+\newtheorem{prop}{Proposition}[section]
+\newtheorem{ex}{Exercise}[section]
+\theoremstyle{definition}
+\newtheorem*{rmk}{Remark}
+\newtheorem{defn}{Definition}[section]
+
+\newcommand{\n}{\mathbb{N}}
+\newcommand{\z}{\mathbb{Z}}
+\newcommand{\q}{\mathbb{Q}}
+\newcommand{\im}{\operatorname{Im}}
+\renewcommand{\labelenumi}{\arabic{enumi}.}
+\renewcommand{\labelenumii}{(\alph{enumii})}
+\begin{document}
+\maketitle
+
+\tableofcontents
+
+\begin{abstract}
+  This package uses an algorithm that is not widely adopted, at least
+  not to my knowledge, so I will briefly explain how it works.
+
+  The essence of the algorithm, in my opinion, is the chain-rule for
+  derivatives of languages.  An optimization technique that makes a
+  great difference in the performance of the algorithm, though not
+  yet implemented in this package, is \textbf{context-free
+    memoization}.
+\end{abstract}
+
+\section{Grammars}\label{sec:grammars}
+
+Before explaining what parsers are, we first need to explain what
+grammars are.
+
+In short, a grammar can be viewed as a \emph{substitution machine}.
+That is, a grammar consists of a set of variables, constants, and
+rules
+
+\begin{equation*}
+X \mapsto Y \ldots Z,
+\end{equation*}
+
+where \(X\) is a variable and \(Y \ldots Z\) are either variables or
+constants.  The grammar then substitutes every occurrence of \(X\) by
+the sequence \(Y\ldots Z\) on the right-hand side.  And there is a
+\emph{starting variable}, usually denoted by \(S\).
+
+To sum up, we start from the sequence \(S\), and perform substitutions
+according to the rules. If the substituted sequence cannot be further
+substituted, i.e.\ if the sequence consists of constants only, we say
+that this constant-only sequence is ``generated'' by the grammar.
+
+One can think of this substitution machine as macros in \TeX, except
+that it only allows macros without arguments.  Or one can think of
+the machine as macros in programming languages like \texttt{Lisp},
+again without arguments.
+
+Notice that a variable can have multiple rules to substitute, and a
+given sequence can be generated by different sets of substitutions.
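+
+To make this concrete, here is a minimal sketch of a grammar as a
+Python dictionary; the representation and the names are purely
+illustrative, and are not how this package stores grammars.
+
+\begin{verbatim}
+# Variables map to lists of alternative right-hand sides; each
+# right-hand side is a sequence of variables or constants.
+grammar = {
+    "S": [["N", "V", "N"]],
+    "N": [["cats"], ["dogs"]],
+    "V": [["chase"]],
+}
+
+# One possible chain of substitutions from the starting variable S:
+#   S -> N V N -> cats V N -> cats chase N -> cats chase dogs
+# The constant-only sequence "cats chase dogs" is thus generated.
+\end{verbatim}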
+
+\section{Parsers}\label{sec:parsers}
+
+Parsers are \emph{inverses} of grammars. More precisely, while
+grammars perform substitutions to generate sequences, parsers try to
+find out what substitutions the grammar performed in order to generate
+a given sequence. Since there might be multiple ways to generate a
+given sequence, the parser needs to find all possible ways.
+
+In fact, the number of ways to generate a sequence can even be
+infinite.  For example, suppose the grammar has the following two
+rules:
+
+\begin{align*}
+S &\mapsto S\\
+S &\mapsto \varepsilon.
+\end{align*}
+
+This grammar means that \(S\) can be substituted by \(S\) itself, and
+it can be substituted by the empty sequence \(\varepsilon\) as well.
+In this case, the grammar can only generate the empty sequence, but
+it can generate the empty sequence in infinitely many ways: it can
+substitute \(S\) by itself any number of times before finally
+substituting \(S\) by the empty sequence.
+
+Moreover, even if we do not consider the generations that produce
+empty sequences, the number of ways a grammar can generate a given
+sequence can grow very quickly with its length, and classical general
+parsing algorithms such as CYK or Earley's need \(O(n^{3})\) time in
+the worst case, where \(O\) denotes the \textbf{big-O} notation and
+\(n\) is the length of the sequence to be parsed.  This is why
+efficient general parsers that work for every grammar are hard to
+construct.
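+
+To get a feeling for how quickly the number of ways can grow,
+consider a grammar with the two rules \(S \mapsto S\,S\) and
+\(S \mapsto a\).  The following Python sketch (illustrative only)
+counts the ways to generate a sequence of \(n\) letters:
+
+\begin{verbatim}
+from functools import lru_cache
+
+# Rules: S -> S S and S -> a.  ways(n) counts the generations of a
+# sequence of n letters "a": either a single "a", or a concatenation
+# produced by S S, split at every possible point.
+@lru_cache(maxsize=None)
+def ways(n):
+    if n == 1:
+        return 1
+    return sum(ways(k) * ways(n - k) for k in range(1, n))
+
+# The counts are the Catalan numbers, which grow exponentially:
+assert [ways(n) for n in range(1, 6)] == [1, 1, 2, 5, 14]
+\end{verbatim}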
+
+\section{Parser Generators}\label{sec:parser-generators}
+
+Now we come to the topic of this package: parser generators. Simply
+speaking, a parser generator takes a grammar as input, and produces a
+parser for that grammar as output.
+
+One can classify parser generators into two types: compiled and
+interpreted.  The former means that the generator produces the source
+code of a parser, which must then be compiled to obtain the actual
+parser.  The latter means that the generator directly produces a
+parser that is run by an interpreter.  The well-known
+\emph{tree-sitter} parser generator belongs to the first category,
+for example.
+
+This package is of the interpreted type.  But there is a plan to add
+a \emph{compiled mode}, to potentially make the generated parsers
+more performant.
+
+\section{Augmented Backus-Naur Form}\label{sec:augm-back-naur}
+
+There are various ways to write grammars.  This package chooses the
+\emph{Augmented Backus-Naur Form} (ABNF).  See
+\href{https://en.wikipedia.org/wiki/Augmented_Backus-Naur_Form}{Wiki}
+for an overview, and see
+\href{https://datatracker.ietf.org/doc/html/rfc5234}{RFC 5234} for its
+detailed definition, or
+\href{https://www.rfc-editor.org/rfc/rfc5234.txt}{the text version}.
+
+The reason for this choice is that the \textbf{RFC documents} use
+this format, so there is an abundance of mature grammars to use.  For
+example, the ABNF format itself has a grammar in the ABNF format, so
+we can immediately use the generated parser to parse grammars written
+in ABNF.
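+
+As a taste of the format, here is a small, hypothetical rule set in
+ABNF; \texttt{ALPHA}, \texttt{DIGIT} and \texttt{CRLF} are core rules
+defined in RFC 5234.
+
+\begin{verbatim}
+greeting = "Hello, " name CRLF
+name     = 1*ALPHA            ; one or more letters
+room     = "room " 1*3DIGIT   ; "room " followed by 1 to 3 digits
+\end{verbatim}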
+
+There is also a plan to let the users choose their own favourite
+format, of course.
+
+\section{Languages and derivatives}\label{sec:lang-deriv}
+
+This section is involved and optional.  One can jump directly to
+section~\ref{sec:tokenization} on tokenization if the details of the
+algorithm are not of interest.
+
+To understand the algorithm, we shall first talk about the derivatives
+of languages.
+
+For the sake of convenience, let \(\Sigma\) denote a finite set of
+symbols, called the \emph{alphabet}; the set of all Unicode code
+points is an example.  We call a sequence of elements of \(\Sigma\) a
+\emph{sentence}, and we call a (possibly infinite) set of sentences a
+\emph{language}.  For example, the set of all grammatically correct
+English sentences is a language, to which we may give the name
+\emph{English}.
+
+Now, if \(L\) is a language and \(s\) is a sentence, we define the
+\textbf{derivative of \(L\) with respect to \(s\)}, denoted by
+\(\frac{dL}{ds}\), as the set of all sentences \(t\) such that the
+concatenated sentence \(s \cdot t\) belongs to \(L\).  We can
+re-formulate this definition in more mathematical terms:
+
+\[
+ \frac{dL}{ds} := \left\{t \text{ is a sentence } \mid s \cdot t
+ \text{ belongs to } L\right\}.
+\]
+
+As a side note, the reason to call it the \emph{derivative} is not
+only the (partial) similarity of the definitions, but also that this
+operation of taking the derivative of a language with respect to a
+sentence satisfies properties similar to those of \emph{derivatives
+  in Calculus}.
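+
+As an illustration, here is a minimal Python sketch of this
+definition for \emph{finite} languages, represented as sets of
+strings; the function name is hypothetical and for exposition only.
+
+\begin{verbatim}
+def derivative(language, s):
+    """All sentences t such that s + t belongs to the language."""
+    return {sentence[len(s):] for sentence in language
+            if sentence.startswith(s)}
+
+L = {"abc", "abd", "b"}
+assert derivative(L, "ab") == {"c", "d"}
+assert "" in derivative(L, "b")   # "b" itself belongs to L
+\end{verbatim}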
+
+\section{Parsing by taking derivatives}\label{sec:pars-taking-deriv}
+
+Why consider derivatives in a document about parser generators?
+Because we can parse sentences by taking derivatives.
+
+More precisely, let \(L\) be a language and let \(s\) be a sentence.
+If the derivative \(\frac{dL}{ds}\) of \(L\) by \(s\) (which is a set
+of sentences) contains the empty sentence, then by definition this
+means that \(s\) belongs to \(L\). This observation is the basis of
+parsing by taking derivatives.
+
+\begin{quotation}
+  Those who have a rigorous mind are probably frowning at the above
+  paragraph, because we say we can parse by taking derivatives,
+  whereas we actually only determine whether the sentence belongs to
+  the language.  This gap comes from the fact that the details of
+  actually constructing the parse forests by taking derivatives are
+  too complicated for me to explain clearly here.  I might come up
+  with a clear explanation in the future, though.
+\end{quotation}
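+
+In code, recognition by derivatives looks like the following sketch,
+reusing the \texttt{derivative} function from the previous section
+(again for finite languages, and for illustration only).
+
+\begin{verbatim}
+def recognizes(language, sentence):
+    """Does the language contain the sentence?"""
+    current = language
+    for symbol in sentence:
+        # Take the derivative one symbol at a time.
+        current = derivative(current, symbol)
+    # The sentence belongs to the language exactly when the
+    # remaining derivative contains the empty sentence.
+    return "" in current
+
+assert recognizes({"abc", "b"}, "abc")
+assert not recognizes({"abc", "b"}, "ab")
+\end{verbatim}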
+
+\section{Regular languages}\label{sec:regular-languages}
+
+With the above definition in mind, we now say that a language \(L\) is
+\emph{regular} if and only if the set of derivatives of \(L\) is
+finite, i.e.
+
+\[
+ \left\{\frac{dL}{ds} \mid s \text{ is a sentence } \right\} \text{
+ is finite}.
+\]
+
+Note that this is a set of languages, namely, a set of sets of
+sentences. And we say it is finite if it contains finitely many sets.
+
+For those familiar with regular expressions, a language is regular if
+and only if there is a regular expression such that a sentence
+belongs to the language if and only if it matches that regular
+expression.  I chose to define regular languages using derivatives
+because, in my opinion, it makes them easier to reason about in what
+follows.
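+
+To see the definition in action, here is a small Python sketch.  We
+represent a possibly infinite language by its membership predicate, a
+choice made purely for this illustration, and look at the derivatives
+of the regular language \(a^{\star}\).
+
+\begin{verbatim}
+# The language a*: zero or more repetitions of the symbol "a".
+a_star = lambda t: set(t) <= {"a"}
+
+def deriv_pred(lang, s):
+    """Derivative of a language given as a membership predicate."""
+    return lambda t: lang(s + t)
+
+# Taking the derivative by "a" gives back a* itself, and taking a
+# derivative by any other symbol gives the empty language; so the
+# set of derivatives of a* is finite, hence a* is regular.
+d = deriv_pred(a_star, "a")
+assert d("") and d("aa") and not d("ab")
+assert not deriv_pred(a_star, "b")("aaa")
+\end{verbatim}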
+
+\section{Languages of grammars}\label{sec:languages-grammars}
+
+If \(G\) is a grammar, we say that \emph{\(G\) generates the language
+  \(L\)} consisting of all sentences generated by the grammar, and we
+denote the language generated by \(G\) by \(L_G\).
+
+\section{Derivatives of the languages of grammars}
+\label{sec:deriv-lang-gramm}
+
+If \(s\) is a sentence, we ask the question:
+
+\begin{quotation}
+ What is the derivative \(\frac{dL_{G}}{ds}\) of \(L_G\) by \(s\)?
+\end{quotation}
+
+To answer this, note that if \(s\) and \(t\) are sentences, and if
+\(L\) is any language, then \(\frac{dL}{d(s\cdot t)}\) is equal to
+\(\frac d{dt}\frac{dL}{ds}\), the derivative by \(t\) of the
+derivative by \(s\) of \(L\). This can be seen by definition: a
+sentence \(r\) belongs to \(\frac{dL}{d(s\cdot t)}\) if and only if
+\((s\cdot t\cdot r)\) belongs to \(L\), which is exactly the condition
+for \((t\cdot r)\) to belong to \(\frac{dL}{ds}\). And this is the
+condition for \(r\) to belong to \(\frac d{dt} \frac{dL}{ds}\).
+
+The above observation tells us that we can obtain \(\frac{dL_G}{ds}\)
+by successively taking derivatives of \(L_G\) with respect to
+sentences consisting of a single symbol.
+
+Now fix a single symbol \(a\in\Sigma\) and consider
+\(\frac{dL_G}{da}\).
+
+Suppose \(G\) consists of \(n\) rules
+
+\begin{equation}
+ \label{eq:rules_of_g}
+ \begin{aligned}
+ X_{1}&\mapsto Y_{1}\ldots Y_{m_{1}}\\
+ \vdots\\
+ X_{n}&\mapsto Z_{1} \ldots Z_{m_{n}},
+ \end{aligned}
+\end{equation}
+where \(X_i\) are variables, \(m_i\) is an integer depending on \(i\),
+for \(1\leq i\leq n\), and
+\(Y_1, \ldots, Y_{m_1}, \ldots, Z_1, \ldots, Z_{m_n}\) are either
+variables or constants.
+
+It is then natural to try to determine \(\frac{dL_G}{da}\) by some
+kind of substitution. So for \(1\leq i\leq n\), let \(X_i\) also
+denote the language that is generated if we treat the variable \(X_i\)
+as the starting variable in the grammar. For instance, if \(X_1\) is
+the original starting variable for the grammar \(G\), then \(L_G\) is
+also regarded as equal to \(X_1\) under this view.
+
+Further, we observe that if \(M_1\) and \(M_2\) are both languages,
+and if \(M\) is their concatenation
+
+\begin{equation*}
+  M := \left\{ s \cdot t\mid s \in M_{1}, t\in M_{2}\right\},
+\end{equation*}
+
+then \(\frac{dM}{da}\) is equal to either
+\[\frac{dM_{1}}{da}\cdot M_{2}\] or
+\[\frac{dM_{1}}{da} \cdot M_{2}\cup \frac{dM_{2}}{da},\]
+
+the latter exactly when \(M_1\) contains the empty sentence.  This
+can be regarded as the \emph{product-rule} for derivatives of
+languages, as an aside.
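+
+The product-rule is easy to check on finite languages with the
+\texttt{derivative} sketch from above; the helper below is again
+purely illustrative.
+
+\begin{verbatim}
+def concat(m1, m2):
+    """Concatenation of two finite languages."""
+    return {s + t for s in m1 for t in m2}
+
+M1, M2 = {"", "ab"}, {"c"}
+# M1 contains the empty sentence, so the second case applies:
+lhs = derivative(concat(M1, M2), "a")
+rhs = concat(derivative(M1, "a"), M2) | derivative(M2, "a")
+assert lhs == rhs == {"bc"}
+\end{verbatim}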
+
+Consequently, the derivative \(\frac{dX_1}{da}\) is equal to the union
+of
+
+\begin{align*}
+ &\frac{dY_{1}}{da}\cdot Y_{2}\cdots Y_{m_{1}},\\
+ &\frac{dY_{2}}{da}\cdot Y_{3}\cdots Y_{m_{1}},\\
+ &\vdots\\
+  &\frac{dY_{i}}{da}\cdot Y_{i+1}\cdots Y_{m_{1}},
+\end{align*}
+
+where \(i\) is the smallest integer such that the language \(Y_{i}\)
+does not contain the empty sentence.  If no such \(i\) exists, we set
+\(i=m_{1}\).  And \emph{mutatis mutandis} for
+\(\frac{dX_{2}}{da},\ldots, \frac{dX_{n}}{da}\).
+
+Now we treat the languages \(\frac{dY_i}{da}\) as variables: if
+\(Y_i\) is a variable, this denotes the derivative by \(a\) of the
+language generated by \(Y_i\); if \(Y_i\) is a constant, then this
+denotes either the language consisting of the empty sentence or the
+empty set, depending upon whether \(Y_i\) is equal to \(a\) or not.
+Then the previous paragraph defines the substitution rules of these
+new variables \(\frac{dY_i}{da}\).  We call these rules \emph{the
+  derived rules} of \(\frac{dY_i}{da}\).
+
+It follows that we can obtain the derivative
+\(\frac{dL_{G}}{da} = \frac{dX_{1}}{da}\) by \emph{substituting} these
+\(\frac{dY_i}{da}\) according to the derived rules.
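+
+As a small worked example (not taken from the package itself),
+consider the grammar with the rules \(S \mapsto a\,S\) and
+\(S \mapsto \varepsilon\), which generates the language
+\(a^{\star}\).  For the first rule, the language \(\{a\}\) of the
+first symbol does not contain the empty sentence, so the only derived
+term is \(\frac{da}{da}\cdot S = S\); for the second rule, the
+derivative is the empty set.  Hence \(\frac{dS}{da} = S\), matching
+the fact that the derivative of \(a^{\star}\) by \(a\) is
+\(a^{\star}\) itself.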
+
+Combining the above, we see that the derivatives of the language
+\(L_G\) by any sentence are generated by grammars.  We denote the
+above grammar, which generates \(\frac{dL_G}{da}\), by \emph{the
+  derived grammar} \(\frac{dG}{da}\).  Our problem is now transformed
+into determining whether the successive derivatives of grammars
+\(\frac{dG}{da_1}, \frac d{da_2}\frac{dG}{da_1}, \ldots\) generate
+the empty sentence or not.  This is relatively easy to do, once we
+can compute the successive derivatives efficiently.
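+
+Determining whether a grammar generates the empty sentence can be
+done by a standard fixed-point iteration; the following Python sketch
+(with an illustrative grammar representation, not the package's)
+shows the idea.
+
+\begin{verbatim}
+def nullable_variables(grammar):
+    """Variables that can generate the empty sentence."""
+    nullable = set()
+    changed = True
+    while changed:
+        changed = False
+        for var, alternatives in grammar.items():
+            if var in nullable:
+                continue
+            # A variable is nullable if some right-hand side
+            # consists entirely of nullable symbols.
+            if any(all(sym in nullable for sym in rhs)
+                   for rhs in alternatives):
+                nullable.add(var)
+                changed = True
+    return nullable
+
+# The grammar with rules S -> S and S -> (the empty sequence):
+assert nullable_variables({"S": [["S"], []]}) == {"S"}
+\end{verbatim}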
+
+\section{Chain-rule of derivatives}\label{sec:chain-rule-deriv}
+
+None of the above was new when the author proposed the algorithm.
+The core idea of the algorithm, in my opinion, is the
+\textbf{chain-rule} for the operation of taking derivatives of
+languages.
+
+In calculus, roughly speaking, the chain-rule of derivatives refers to
+the formula
+
+\begin{equation*}
+ \frac{df}{dx} =
+ \sum_{i=1}^{n} \frac{df}{dy_{i}} \cdot \frac{dy_{i}}{dx},
+\end{equation*}
+
+where \(f\) is a function in \(n\) variables \(y_1, \ldots, y_n\),
+which are themselves functions in the variable \(x\).
+
+If we replace the function \(f\) by a language \(L\) generated by a
+grammar \(G\), the variables \(y_1, \ldots , y_n\) by the variables
+in \(G\), \(x\) by a symbol \(a\in\Sigma\), the multiplication
+\(\cdot\) by the concatenation of languages, and the summation \(+\)
+by the union of languages, then we obtain a formula writing the
+derivative \(\frac{dG}{da}\) as a union of concatenations:
+
+\begin{equation}\label{eq:chain_rule}
+ \frac{dG}{da} =
+ \bigcup_{i=1}^{n}\frac{dG}{dy_{i}} \cdot \frac{dy_{i}}{da}.
+\end{equation}
+
+The only remaining questions are: What does \(\frac{dG}{dy_{i}}\)
+actually mean, and is the formula~\eqref{eq:chain_rule} true?
+
+\subsection{Derivatives with respect to variables}
+\label{sec:deriv-wrt-var}
+
+We want to generalize the construction in
+section~\ref{sec:deriv-lang-gramm} so that we can take derivatives
+with respect to variables.
+
+First, for the sake of brevity, let \(\Sigma^{\star}\) denote the set
+of all sentences, i.e. possibly empty sequences of elements in
+\(\Sigma\).
+
+We define a new alphabet
+\(\Sigma' := \Sigma\cup \left\{X_{i}\,\vert\, 1\leq i\leq
+  n\right\}\) and a new set of variables
+\[
+  \left\{ Z_{i}\ \bigg\vert\ 1\leq i\leq n \right\},
+\]
+where the \(X_{i}\) are the variables of \(G\), as in its
+rules~\eqref{eq:rules_of_g}.
+
+Then for each rule \(X_{i}\mapsto Y_{1}\ldots Y_{m_{i}}\), we define
+\(j\) rules: for \(1\leq k\leq j\),
+\begin{equation}\label{eq:closed_grammar_rules}
+  Z_{i}\mapsto Z_{k} \cdot Y_{k+1}\cdots Y_{m_{i}},
+\end{equation}
+where \(j\) is the smallest integer such that the language \(Y_{j}\)
+does not contain the empty sequence \(\varepsilon\in\Sigma^{\star}\),
+and \(Z_{k}\) denotes the new variable corresponding to \(Y_{k}\).
+Here if \(Y_{k}\) belongs to \(\Sigma\), then \(Z_{k}\) is replaced
+by \(Y_{k}\) itself, i.e.\ the right-hand side of that rule does not
+contain variables.
+
+This defines a grammar, which we call the \emph{closed grammar}
+\(\overline G\).
+
+With this closed grammar, we can naturally take derivatives with
+respect to the original variables \(X_{i}\), as they are now symbols
+of the alphabet of this closed grammar \(\overline G\).
+
+\subsection{Proof of the chain-rule}\label{sec:proof-chain-rule}
+
+\begin{prop}\label{prop:proof-of-chain-rule}
+  \[
+    \frac{dL_{G}}{da} = \bigcup_{i=1}^{n} \frac{dX_{i}}{da} \cdot
+    \left( \frac{dL_{\overline G}}{dX_{i}} \cap
+      \operatorname{P}\left(\Sigma^{\star}\right) \right),\quad
+    \forall a\in\Sigma.
+  \]
+\end{prop}
+
+\begin{proof}
+  For any \(i\), any sequence in \(\frac{dL_{\overline G}}{dX_{i}}\)
+  is generated from \(G\) by partial substitutions and removal of
+  empty sequences.  So the containment of the right-hand side in the
+  left-hand side is obvious.
+
+  Conversely, if a sentence \(s\in\Sigma^{\star}\) begins with \(a\),
+  then \(s\) must be obtained by substitution from some sentence in
+  \({(\Sigma')}^{\star}\), which means that it belongs to the
+  right-hand side.  So the proof is complete.
+\end{proof}
+
+\begin{rmk}
+ It follows that we can determine whether or not
+ \(\frac{dL_{G}}{da}\) contains the empty sequence by examining
+ \(\frac{dX_{i}}{da}\) and \(\frac{dL_{\overline G}}{dX_{i}}\)
+ respectively.
+
+  Further, since
+  \(\frac{dX_{i}}{da} = \frac{dZ_{i}}{da}\cap\operatorname{P}
+  (\Sigma^{\star})\) and
+  \(\frac{dL_{\overline G}}{dX_{i}} = \frac{dZ_{1}}{dX_{i}}\), we
+  conclude that we only have to examine \(\frac{dZ_{i}}{dX_{j}}\), for
+  \(1\leq i,j\leq n\).
+\end{rmk}
+
+What makes this grammar interesting is its regularity.
+
+\subsection{The closed grammar is regular}\label{sec:clos-gramm-regul}
+
+\begin{prop}
+ The closed grammar is regular. In fact, the set
+ \[
+ \left\{\frac{dZ_{i}}{ds}\ \bigg\vert\
+ 1\leq i\leq n,\,s\in{(\Sigma')}^{\star}\right\}
+ \]
+ is finite.
+\end{prop}
+
+\begin{proof}
+  Observe the rules~\eqref{eq:closed_grammar_rules}.  From the
+  definition, we see that the right-hand side of a rule contains at
+  most one variable, \(Z_{k}\), and that variable must appear as the
+  first symbol of the sequence on the right-hand side.  This
+  observation suffices to show that the closed grammar is regular:
+  the rigorous-minded reader can prove this statement by first
+  showing that the language generated by the closed grammar is a
+  finite union of concatenations of languages of two types, namely a
+  set consisting of a single finite sentence, or a set consisting of
+  zero or more self-repetitions of a finite sequence.  Then the
+  reader can easily verify that such languages are regular, by
+  showing that the derivatives of a language consisting of zero or
+  more self-repetitions of a finite sequence are finitely many.  A
+  hint for the last step is to notice that the derivatives of the
+  language \(s^{\star}\) of zero or more self-repetitions of a
+  sequence \(s := s_{1}s_{2}\cdots s_{p}\) can be injected into the
+  finite set
+  \[
+    \left\{ s_{i}s_{i+1}\cdots s_{p}\cdot s^{\star}
+      \mid 1\leq i\leq p+1\right\},
+  \]
+  where for \(i = p+1\) the first factor is the empty sentence.
+\end{proof}
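+
+A quick sanity check of the hint in Python, with languages again
+represented as membership predicates (an illustrative choice, as
+before):
+
+\begin{verbatim}
+s = "ab"
+def s_star(t):
+    """The language s*: zero or more repetitions of s."""
+    while t.startswith(s):
+        t = t[len(s):]
+    return t == ""
+
+def deriv_pred(lang, u):
+    return lambda t: lang(u + t)
+
+# The derivative of s* by "a" behaves like the language "b"
+# concatenated with s*, one of the members of the finite set above.
+d = deriv_pred(s_star, "a")
+assert d("b") and d("bab") and not d("ab")
+\end{verbatim}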
+
+\section{Tokenization}\label{sec:tokenization}
+
+There is another abstraction used by this package.
+
+
+
+\end{document}
\ No newline at end of file