Diffstat (limited to 'algorithm.tex')
-rw-r--r--  algorithm.tex  486
1 files changed, 486 insertions, 0 deletions
diff --git a/algorithm.tex b/algorithm.tex
new file mode 100644
index 0000000..20fb1ac
--- /dev/null
+++ b/algorithm.tex
@@ -0,0 +1,486 @@
+\documentclass[12pt]{article}
+\usepackage[utf8]{inputenc}
+\usepackage[T1]{fontenc}
+\usepackage{graphicx}
+\usepackage{grffile}
+\usepackage{longtable}
+\usepackage{wrapfig}
+\usepackage{rotating}
+\usepackage[normalem]{ulem}
+\usepackage{amsmath}
+\usepackage{mathrsfs}
+\usepackage{textcomp}
+\usepackage{amssymb}
+\usepackage{capt-of}
+\usepackage{hyperref}
+\usepackage{amsfonts, amsthm}
+\usepackage{enumitem}
+\usepackage{tikz}
+\author{Durand}
+\date{2023-08-05}
+\title{\vspace{-5ex}Algorithm used by the package REP}
+\hypersetup{
+ pdfauthor={Durand},
+ pdftitle={Algorithm used by the package REP},
+ pdfkeywords={},
+ pdfsubject={},
+ pdfcreator={Durand},
+ pdflang={English}}
+
+\newtheorem{lem}{Lemma}[section]
+\newtheorem{coro}[lem]{Corollary}
+\newtheorem{thm}{Theorem}[section]
+\newtheorem{prop}{Proposition}[section]
+\newtheorem{ex}{Exercise}[section]
+\theoremstyle{definition}
+\newtheorem*{rmk}{Remark}
+\newtheorem{defn}{Definition}[section]
+
+\newcommand{\n}{\mathbb{N}}
+\newcommand{\z}{\mathbb{Z}}
+\newcommand{\q}{\mathbb{Q}}
+\newcommand{\im}{\operatorname{Im}}
+\renewcommand{\labelenumi}{\arabic{enumi}.}
+\renewcommand{\labelenumii}{(\alph{enumii})}
+\begin{document}
+\maketitle
+
+\tableofcontents
+
+\begin{abstract}
+  This package uses an algorithm that is not widely adopted, at least
+  not to my knowledge, so I will briefly explain how it works.
+
+  The essence of the algorithm, in my opinion, is the chain-rule for
+  derivatives of languages.  An optimization technique that makes a
+  great difference in the performance of the algorithm, though not
+  yet implemented in this package, is \textbf{context-free
+    memoization}.
+\end{abstract}
+
+\section{Grammars}\label{sec:grammars}
+
+Before explaining what parsers are, we first need to explain what
+grammars are.
+
+In short, a grammar can be viewed as a \emph{substitution machine}.
+That is, a grammar consists of a set of variables, constants, and
+rules
+
+\begin{equation*}
+X \mapsto Y \ldots Z,
+\end{equation*}
+
+where \(X\) is a variable and \(Y \ldots Z\) are either variables or
+constants.  The grammar then substitutes every occurrence of \(X\) by
+the sequence \(Y\ldots Z\) on the right-hand side.  And there is a
+\emph{starting variable}, usually denoted by \(S\).
+
+To sum up, we start from the sequence \(S\), and perform substitutions
+according to the rules. If the substituted sequence cannot be further
+substituted, i.e.\ if the sequence consists of constants only, we say
+that this constant-only sequence is ``generated'' by the grammar.
+
+One can think of this substitution machine as macros in \TeX, except
+that it only allows macros without arguments.  Or one can think of
+the machine as macros in programming languages like \texttt{Lisp},
+again without arguments.
+
+Notice that a variable can have multiple rules to substitute, and a
+given sequence can be generated by different sets of substitutions.
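+
+To make this concrete, here is a minimal sketch of a grammar as a
+Python dictionary; the representation and the names are purely
+illustrative, and are not how this package stores grammars.
+
+\begin{verbatim}
+# Variables map to lists of alternative right-hand sides; each
+# right-hand side is a sequence of variables or constants.
+grammar = {
+    "S": [["N", "V", "N"]],
+    "N": [["cats"], ["dogs"]],
+    "V": [["chase"]],
+}
+
+# One possible chain of substitutions from the starting variable S:
+#   S -> N V N -> cats V N -> cats chase N -> cats chase dogs
+# The constant-only sequence "cats chase dogs" is thus generated.
+\end{verbatim}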
+
+\section{Parsers}\label{sec:parsers}
+
+Parsers are \emph{inverses} of grammars. More precisely, while
+grammars perform substitutions to generate sequences, parsers try to
+find out what substitutions the grammar performed in order to generate
+a given sequence. Since there might be multiple ways to generate a
+given sequence, the parser needs to find all possible ways.
+
+In fact, the number of ways to generate a sequence can even be
+infinite.  For example, suppose the grammar has the following two
+rules:
+
+\begin{align*}
+S &\mapsto S\\
+S &\mapsto \varepsilon.
+\end{align*}
+
+This grammar means that \(S\) can be substituted by \(S\) itself, and
+it can be substituted by the empty sequence \(\varepsilon\) as well.
+In this case, the grammar can only generate the empty sequence, but
+it can generate the empty sequence in infinitely many ways: it can
+substitute \(S\) by itself any number of times before finally
+substituting \(S\) by the empty sequence.
+
+Moreover, even if we do not consider the generations that produce
+empty sequences, the number of ways a grammar can generate a given
+sequence can grow very quickly with its length, and classical general
+parsing algorithms such as CYK or Earley's need \(O(n^{3})\) time in
+the worst case, where \(O\) denotes the \textbf{big-O} notation and
+\(n\) is the length of the sequence to be parsed.  This is why
+efficient general parsers that work for every grammar are hard to
+construct.
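+
+To get a feeling for how quickly the number of ways can grow,
+consider a grammar with the two rules \(S \mapsto S\,S\) and
+\(S \mapsto a\).  The following Python sketch (illustrative only)
+counts the ways to generate a sequence of \(n\) letters:
+
+\begin{verbatim}
+from functools import lru_cache
+
+# Rules: S -> S S and S -> a.  ways(n) counts the generations of a
+# sequence of n letters "a": either a single "a", or a concatenation
+# produced by S S, split at every possible point.
+@lru_cache(maxsize=None)
+def ways(n):
+    if n == 1:
+        return 1
+    return sum(ways(k) * ways(n - k) for k in range(1, n))
+
+# The counts are the Catalan numbers, which grow exponentially:
+assert [ways(n) for n in range(1, 6)] == [1, 1, 2, 5, 14]
+\end{verbatim}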
+
+\section{Parser Generators}\label{sec:parser-generators}
+
+Now we come to the topic of this package: parser generators. Simply
+speaking, a parser generator takes a grammar as input, and produces a
+parser for that grammar as output.
+
+One can classify parser generators into two types: compiled and
+interpreted.  The former means that the generator produces the source
+code of a parser, which must then be compiled to obtain the actual
+parser.  The latter means that the generator directly produces a
+parser that is run by an interpreter.  The well-known
+\emph{tree-sitter} parser generator belongs to the first category,
+for example.
+
+This package is of the interpreted type.  But there is a plan to add
+a \emph{compiled mode}, to potentially make the generated parsers
+more performant.
+
+\section{Augmented Backus-Naur Form}\label{sec:augm-back-naur}
+
+There are various ways to write grammars.  This package chooses the
+\emph{Augmented Backus-Naur Form} (ABNF).  See
+\href{https://en.wikipedia.org/wiki/Augmented_Backus-Naur_Form}{Wiki}
+for an overview, and see
+\href{https://datatracker.ietf.org/doc/html/rfc5234}{RFC 5234} for its
+detailed definition, or
+\href{https://www.rfc-editor.org/rfc/rfc5234.txt}{the text version}.
+
+The reason for this choice is that the \textbf{RFC documents} use
+this format, so there is an abundance of mature grammars to use.  For
+example, the ABNF format itself has a grammar in the ABNF format, so
+we can immediately use the generated parser to parse grammars written
+in ABNF.
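+
+As a taste of the format, here is a small, hypothetical rule set in
+ABNF; \texttt{ALPHA}, \texttt{DIGIT} and \texttt{CRLF} are core rules
+defined in RFC 5234.
+
+\begin{verbatim}
+greeting = "Hello, " name CRLF
+name     = 1*ALPHA            ; one or more letters
+room     = "room " 1*3DIGIT   ; "room " followed by 1 to 3 digits
+\end{verbatim}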
+
+There is also a plan to let the users choose their own favourite
+format, of course.
+
+\section{Languages and derivatives}\label{sec:lang-deriv}
+
+This section is involved and optional.  One can jump directly to
+section~\ref{sec:tokenization} on tokenization if the details of the
+algorithm are not of interest.
+
+To understand the algorithm, we shall first talk about the derivatives
+of languages.
+
+For the sake of convenience, let \(\Sigma\) denote a finite set of
+symbols, called the \emph{alphabet}; the set of all Unicode code
+points is an example.  We call a sequence of elements of \(\Sigma\) a
+\emph{sentence}, and we call a (possibly infinite) set of sentences a
+\emph{language}.  For example, the set of all grammatically correct
+English sentences is a language, to which we may give the name
+\emph{English}.
+
+Now, if \(L\) is a language and \(s\) is a sentence, we define the
+\textbf{derivative of \(L\) with respect to \(s\)}, denoted by
+\(\frac{dL}{ds}\), as the set of all sentences \(t\) such that the
+concatenated sentence \(s \cdot t\) belongs to \(L\).  We can
+re-formulate this definition in more mathematical terms:
+
+\[
+ \frac{dL}{ds} := \left\{t \text{ is a sentence } \mid s \cdot t
+ \text{ belongs to } L\right\}.
+\]
+
+As a side note, the reason to call it the \emph{derivative} is not
+only the (partial) similarity of the definitions, but also that this
+operation of taking the derivative of a language with respect to a
+sentence satisfies properties similar to those of \emph{derivatives
+  in Calculus}.
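+
+As an illustration, here is a minimal Python sketch of this
+definition for \emph{finite} languages, represented as sets of
+strings; the function name is hypothetical and for exposition only.
+
+\begin{verbatim}
+def derivative(language, s):
+    """All sentences t such that s + t belongs to the language."""
+    return {sentence[len(s):] for sentence in language
+            if sentence.startswith(s)}
+
+L = {"abc", "abd", "b"}
+assert derivative(L, "ab") == {"c", "d"}
+assert "" in derivative(L, "b")   # "b" itself belongs to L
+\end{verbatim}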
+
+\section{Parsing by taking derivatives}\label{sec:pars-taking-deriv}
+
+Why consider derivatives in a document about parser generators?
+Because we can parse sentences by taking derivatives.
+
+More precisely, let \(L\) be a language and let \(s\) be a sentence.
+If the derivative \(\frac{dL}{ds}\) of \(L\) by \(s\) (which is a set
+of sentences) contains the empty sentence, then by definition this
+means that \(s\) belongs to \(L\). This observation is the basis of
+parsing by taking derivatives.
+
+\begin{quotation}
+  Those who have a rigorous mind are probably frowning at the above
+  paragraph, because we say we can parse by taking derivatives,
+  whereas we actually only determine whether the sentence belongs to
+  the language.  This gap comes from the fact that the details of
+  actually constructing the parse forests by taking derivatives are
+  too complicated for me to explain clearly here.  I might come up
+  with a clear explanation in the future, though.
+\end{quotation}
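+
+In code, recognition by derivatives looks like the following sketch,
+reusing the \texttt{derivative} function from the previous section
+(again for finite languages, and for illustration only).
+
+\begin{verbatim}
+def recognizes(language, sentence):
+    """Does the language contain the sentence?"""
+    current = language
+    for symbol in sentence:
+        # Take the derivative one symbol at a time.
+        current = derivative(current, symbol)
+    # The sentence belongs to the language exactly when the
+    # remaining derivative contains the empty sentence.
+    return "" in current
+
+assert recognizes({"abc", "b"}, "abc")
+assert not recognizes({"abc", "b"}, "ab")
+\end{verbatim}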
+
+\section{Regular languages}\label{sec:regular-languages}
+
+With the above definition in mind, we now say that a language \(L\) is
+\emph{regular} if and only if the set of derivatives of \(L\) is
+finite, i.e.
+
+\[
+ \left\{\frac{dL}{ds} \mid s \text{ is a sentence } \right\} \text{
+ is finite}.
+\]
+
+Note that this is a set of languages, namely, a set of sets of
+sentences. And we say it is finite if it contains finitely many sets.
+
+For those familiar with regular expressions, a language is regular if
+and only if there is a regular expression such that a sentence
+belongs to the language if and only if it matches that regular
+expression.  I chose to define regular languages using derivatives
+because, in my opinion, it makes them easier to reason about in what
+follows.
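+
+To see the definition in action, here is a small Python sketch.  We
+represent a possibly infinite language by its membership predicate, a
+choice made purely for this illustration, and look at the derivatives
+of the regular language \(a^{\star}\).
+
+\begin{verbatim}
+# The language a*: zero or more repetitions of the symbol "a".
+a_star = lambda t: set(t) <= {"a"}
+
+def deriv_pred(lang, s):
+    """Derivative of a language given as a membership predicate."""
+    return lambda t: lang(s + t)
+
+# Taking the derivative by "a" gives back a* itself, and taking a
+# derivative by any other symbol gives the empty language; so the
+# set of derivatives of a* is finite, hence a* is regular.
+d = deriv_pred(a_star, "a")
+assert d("") and d("aa") and not d("ab")
+assert not deriv_pred(a_star, "b")("aaa")
+\end{verbatim}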
+
+\section{Languages of grammars}\label{sec:languages-grammars}
+
+If \(G\) is a grammar, we say that \emph{\(G\) generates the language
+  \(L\)} consisting of all sentences generated by the grammar, and we
+denote the language generated by \(G\) by \(L_G\).
+
+\section{Derivatives of the languages of grammars}
+\label{sec:deriv-lang-gramm}
+
+If \(s\) is a sentence, we ask the question:
+
+\begin{quotation}
+ What is the derivative \(\frac{dL_{G}}{ds}\) of \(L_G\) by \(s\)?
+\end{quotation}
+
+To answer this, note that if \(s\) and \(t\) are sentences, and if
+\(L\) is any language, then \(\frac{dL}{d(s\cdot t)}\) is equal to
+\(\frac d{dt}\frac{dL}{ds}\), the derivative by \(t\) of the
+derivative by \(s\) of \(L\). This can be seen by definition: a
+sentence \(r\) belongs to \(\frac{dL}{d(s\cdot t)}\) if and only if
+\((s\cdot t\cdot r)\) belongs to \(L\), which is exactly the condition
+for \((t\cdot r)\) to belong to \(\frac{dL}{ds}\). And this is the
+condition for \(r\) to belong to \(\frac d{dt} \frac{dL}{ds}\).
+
+The above observation tells us that we can obtain \(\frac{dL_G}{ds}\)
+by successively taking derivatives of \(L_G\) with respect to
+sentences consisting of a single symbol.
+
+Now fix a single symbol \(a\in\Sigma\) and consider
+\(\frac{dL_G}{da}\).
+
+Suppose \(G\) consists of \(n\) rules
+
+\begin{equation}
+ \label{eq:rules_of_g}
+ \begin{aligned}
+ X_{1}&\mapsto Y_{1}\ldots Y_{m_{1}}\\
+ \vdots\\
+ X_{n}&\mapsto Z_{1} \ldots Z_{m_{n}},
+ \end{aligned}
+\end{equation}
+where \(X_i\) are variables, \(m_i\) is an integer depending on \(i\),
+for \(1\leq i\leq n\), and
+\(Y_1, \ldots, Y_{m_1}, \ldots, Z_1, \ldots, Z_{m_n}\) are either
+variables or constants.
+
+It is then natural to try to determine \(\frac{dL_G}{da}\) by some
+kind of substitution. So for \(1\leq i\leq n\), let \(X_i\) also
+denote the language that is generated if we treat the variable \(X_i\)
+as the starting variable in the grammar. For instance, if \(X_1\) is
+the original starting variable for the grammar \(G\), then \(L_G\) is
+also regarded as equal to \(X_1\) under this view.
+
+Further, we observe that if \(M_1\) and \(M_2\) are both languages,
+and if \(M\) is their concatenation
+
+\begin{equation*}
+  M := \left\{ s \cdot t\mid s \in M_{1}, t\in M_{2}\right\},
+\end{equation*}
+
+then \(\frac{dM}{da}\) is equal to either
+\[\frac{dM_{1}}{da}\cdot M_{2}\] or
+\[\frac{dM_{1}}{da} \cdot M_{2}\cup \frac{dM_{2}}{da},\]
+
+the latter exactly when \(M_1\) contains the empty sentence.  This
+can be regarded as the \emph{product-rule} for derivatives of
+languages, as an aside.
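+
+The product-rule is easy to check on finite languages with the
+\texttt{derivative} sketch from above; the helper below is again
+purely illustrative.
+
+\begin{verbatim}
+def concat(m1, m2):
+    """Concatenation of two finite languages."""
+    return {s + t for s in m1 for t in m2}
+
+M1, M2 = {"", "ab"}, {"c"}
+# M1 contains the empty sentence, so the second case applies:
+lhs = derivative(concat(M1, M2), "a")
+rhs = concat(derivative(M1, "a"), M2) | derivative(M2, "a")
+assert lhs == rhs == {"bc"}
+\end{verbatim}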
+
+Consequently, the derivative \(\frac{dX_1}{da}\) is equal to the union
+of
+
+\begin{align*}
+ &\frac{dY_{1}}{da}\cdot Y_{2}\cdots Y_{m_{1}},\\
+ &\frac{dY_{2}}{da}\cdot Y_{3}\cdots Y_{m_{1}},\\
+ &\vdots\\
+  &\frac{dY_{i}}{da}\cdot Y_{i+1}\cdots Y_{m_{1}},
+\end{align*}
+
+where \(i\) is the smallest integer such that the language \(Y_{i}\)
+does not contain the empty sentence.  If no such \(i\) exists, we set
+\(i=m_{1}\).  And \emph{mutatis mutandis} for
+\(\frac{dX_{2}}{da},\ldots, \frac{dX_{n}}{da}\).
+
+Now we treat the languages \(\frac{dY_i}{da}\) as variables: if
+\(Y_i\) is a variable, this denotes the derivative by \(a\) of the
+language generated by \(Y_i\); if \(Y_i\) is a constant, then this
+denotes either the language consisting of the empty sentence or the
+empty set, depending upon whether \(Y_i\) is equal to \(a\) or not.
+Then the previous paragraph defines the substitution rules of these
+new variables \(\frac{dY_i}{da}\).  We call these rules \emph{the
+  derived rules} of \(\frac{dY_i}{da}\).
+
+It follows that we can obtain the derivative
+\(\frac{dL_{G}}{da} = \frac{dX_{1}}{da}\) by \emph{substituting} these
+\(\frac{dY_i}{da}\) according to the derived rules.
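+
+As a small worked example (not taken from the package itself),
+consider the grammar with the rules \(S \mapsto a\,S\) and
+\(S \mapsto \varepsilon\), which generates the language
+\(a^{\star}\).  For the first rule, the language \(\{a\}\) of the
+first symbol does not contain the empty sentence, so the only derived
+term is \(\frac{da}{da}\cdot S = S\); for the second rule, the
+derivative is the empty set.  Hence \(\frac{dS}{da} = S\), matching
+the fact that the derivative of \(a^{\star}\) by \(a\) is
+\(a^{\star}\) itself.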
+
+Combining the above, we see that the derivatives of the language
+\(L_G\) by any sentence are generated by grammars.  We denote the
+above grammar, which generates \(\frac{dL_G}{da}\), by \emph{the
+  derived grammar} \(\frac{dG}{da}\).  Our problem is now transformed
+into determining whether the successive derivatives of grammars
+\(\frac{dG}{da_1}, \frac d{da_2}\frac{dG}{da_1}, \ldots\) generate
+the empty sentence or not.  This is relatively easy to do, once we
+can compute the successive derivatives efficiently.
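+
+Determining whether a grammar generates the empty sentence can be
+done by a standard fixed-point iteration; the following Python sketch
+(with an illustrative grammar representation, not the package's)
+shows the idea.
+
+\begin{verbatim}
+def nullable_variables(grammar):
+    """Variables that can generate the empty sentence."""
+    nullable = set()
+    changed = True
+    while changed:
+        changed = False
+        for var, alternatives in grammar.items():
+            if var in nullable:
+                continue
+            # A variable is nullable if some right-hand side
+            # consists entirely of nullable symbols.
+            if any(all(sym in nullable for sym in rhs)
+                   for rhs in alternatives):
+                nullable.add(var)
+                changed = True
+    return nullable
+
+# The grammar with rules S -> S and S -> (the empty sequence):
+assert nullable_variables({"S": [["S"], []]}) == {"S"}
+\end{verbatim}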
+
+\section{Chain-rule of derivatives}\label{sec:chain-rule-deriv}
+
+None of the above was new when the author proposed the algorithm.
+The core idea of the algorithm, in my opinion, is the
+\textbf{chain-rule} for the operation of taking derivatives of
+languages.
+
+In calculus, roughly speaking, the chain-rule of derivatives refers to
+the formula
+
+\begin{equation*}
+ \frac{df}{dx} =
+ \sum_{i=1}^{n} \frac{df}{dy_{i}} \cdot \frac{dy_{i}}{dx},
+\end{equation*}
+
+where \(f\) is a function in \(n\) variables \(y_1, \ldots, y_n\),
+which are themselves functions in the variable \(x\).
+
+If we replace the function \(f\) by a language \(L\) generated by a
+grammar \(G\), the variables \(y_1, \ldots , y_n\) by the variables
+in \(G\), \(x\) by a symbol \(a\in\Sigma\), the multiplication
+\(\cdot\) by the concatenation of languages, and the summation \(+\)
+by the union of languages, then we obtain a formula writing the
+derivative \(\frac{dG}{da}\) as a union of concatenations:
+
+\begin{equation}\label{eq:chain_rule}
+ \frac{dG}{da} =
+ \bigcup_{i=1}^{n}\frac{dG}{dy_{i}} \cdot \frac{dy_{i}}{da}.
+\end{equation}
+
+The only remaining questions are: What does \(\frac{dG}{dy_{i}}\)
+actually mean, and is the formula~\eqref{eq:chain_rule} true?
+
+\subsection{Derivatives with respect to variables}
+\label{sec:deriv-wrt-var}
+
+We want to generalize the construction in
+section~\ref{sec:deriv-lang-gramm} so that we can take derivatives
+with respect to variables.
+
+First, for the sake of brevity, let \(\Sigma^{\star}\) denote the set
+of all sentences, i.e. possibly empty sequences of elements in
+\(\Sigma\).
+
+We define a new alphabet
+\(\Sigma' := \Sigma\cup \left\{X_{i}\,\vert\, 1\leq i\leq
+  n\right\}\) and a new set of variables
+\[
+  \left\{ Z_{i}\ \bigg\vert\ 1\leq i\leq n \right\},
+\]
+where the \(X_{i}\) are the variables of \(G\), as in its
+rules~\eqref{eq:rules_of_g}.
+
+Then for each rule \(X_{i}\mapsto Y_{1}\ldots Y_{m_{i}}\), we define
+\(j\) rules: for \(1\leq k\leq j\),
+\begin{equation}\label{eq:closed_grammar_rules}
+  Z_{i}\mapsto Z_{k} \cdot Y_{k+1}\cdots Y_{m_{i}},
+\end{equation}
+where \(j\) is the smallest integer such that the language \(Y_{j}\)
+does not contain the empty sequence \(\varepsilon\in\Sigma^{\star}\),
+and \(Z_{k}\) denotes the new variable corresponding to \(Y_{k}\).
+Here if \(Y_{k}\) belongs to \(\Sigma\), then \(Z_{k}\) is replaced
+by \(Y_{k}\) itself, i.e.\ the right-hand side of that rule does not
+contain variables.
+
+This defines a grammar, which we call the \emph{closed grammar}
+\(\overline G\).
+
+With this closed grammar, we can naturally take derivatives with
+respect to the original variables \(X_{i}\), as they are now symbols
+of the alphabet of this closed grammar \(\overline G\).
+
+\subsection{Proof of the chain-rule}\label{sec:proof-chain-rule}
+
+\begin{prop}\label{prop:proof-of-chain-rule}
+  \[
+    \frac{dL_{G}}{da} = \bigcup_{i=1}^{n} \frac{dX_{i}}{da} \cdot
+    \left( \frac{dL_{\overline G}}{dX_{i}} \cap
+      \operatorname{P}\left(\Sigma^{\star}\right) \right),\quad
+    \forall a\in\Sigma.
+  \]
+\end{prop}
+
+\begin{proof}
+  For any \(i\), any sequence in \(\frac{dL_{\overline G}}{dX_{i}}\)
+  is generated from \(G\) by partial substitutions and removal of
+  empty sequences.  So the containment of the right-hand side in the
+  left-hand side is obvious.
+
+  Conversely, if a sentence \(s\in\Sigma^{\star}\) begins with \(a\),
+  then \(s\) must be obtained by substitution from some sentence in
+  \({(\Sigma')}^{\star}\), which means that it belongs to the
+  right-hand side.  So the proof is complete.
+\end{proof}
+
+\begin{rmk}
+ It follows that we can determine whether or not
+ \(\frac{dL_{G}}{da}\) contains the empty sequence by examining
+ \(\frac{dX_{i}}{da}\) and \(\frac{dL_{\overline G}}{dX_{i}}\)
+ respectively.
+
+  Further, since
+  \(\frac{dX_{i}}{da} = \frac{dZ_{i}}{da}\cap\operatorname{P}
+  (\Sigma^{\star})\) and
+  \(\frac{dL_{\overline G}}{dX_{i}} = \frac{dZ_{1}}{dX_{i}}\), we
+  conclude that we only have to examine \(\frac{dZ_{i}}{dX_{j}}\), for
+  \(1\leq i,j\leq n\).
+\end{rmk}
+
+What makes this grammar interesting is its regularity.
+
+\subsection{The closed grammar is regular}\label{sec:clos-gramm-regul}
+
+\begin{prop}
+ The closed grammar is regular. In fact, the set
+ \[
+ \left\{\frac{dZ_{i}}{ds}\ \bigg\vert\
+ 1\leq i\leq n,\,s\in{(\Sigma')}^{\star}\right\}
+ \]
+ is finite.
+\end{prop}
+
+\begin{proof}
+  Observe the rules~\eqref{eq:closed_grammar_rules}.  From the
+  definition, we see that the right-hand side of a rule contains at
+  most one variable, \(Z_{k}\), and that variable must appear as the
+  first symbol of the sequence on the right-hand side.  This
+  observation suffices to show that the closed grammar is regular:
+  the rigorous-minded reader can prove this statement by first
+  showing that the language generated by the closed grammar is a
+  finite union of concatenations of languages of two types, namely a
+  set consisting of a single finite sentence, or a set consisting of
+  zero or more self-repetitions of a finite sequence.  Then the
+  reader can easily verify that such languages are regular, by
+  showing that the derivatives of a language consisting of zero or
+  more self-repetitions of a finite sequence are finitely many.  A
+  hint for the last step is to notice that the derivatives of the
+  language \(s^{\star}\) of zero or more self-repetitions of a
+  sequence \(s := s_{1}s_{2}\cdots s_{p}\) can be injected into the
+  finite set
+  \[
+    \left\{ s_{i}s_{i+1}\cdots s_{p}\cdot s^{\star}
+      \mid 1\leq i\leq p+1\right\},
+  \]
+  where for \(i = p+1\) the first factor is the empty sentence.
+\end{proof}
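+
+A quick sanity check of the hint in Python, with languages again
+represented as membership predicates (an illustrative choice, as
+before):
+
+\begin{verbatim}
+s = "ab"
+def s_star(t):
+    """The language s*: zero or more repetitions of s."""
+    while t.startswith(s):
+        t = t[len(s):]
+    return t == ""
+
+def deriv_pred(lang, u):
+    return lambda t: lang(u + t)
+
+# The derivative of s* by "a" behaves like the language "b"
+# concatenated with s*, one of the members of the finite set above.
+d = deriv_pred(s_star, "a")
+assert d("b") and d("bab") and not d("ab")
+\end{verbatim}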
+
+\section{Tokenization}\label{sec:tokenization}
+
+There is another abstraction used by this package.
+
+
+
+\end{document}
\ No newline at end of file