\documentclass[12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{grffile}
\usepackage{longtable}
\usepackage{wrapfig}
\usepackage{rotating}
\usepackage[normalem]{ulem}
\usepackage{amsmath}
\usepackage{mathrsfs}
\usepackage{textcomp}
\usepackage{amssymb}
\usepackage{capt-of}
\usepackage{hyperref}
\usepackage{amsfonts, amsthm}
\usepackage{enumitem}
\usepackage{tikz}
\author{Durand}
\date{2023--08--05}
\title{\vspace{-5ex}Algorithm used by the package REP}
\hypersetup{
 pdfauthor={Durand},
 pdftitle={Algorithm used by the package REP},
 pdfkeywords={},
 pdfsubject={},
 pdfcreator={Durand},
 pdflang={English}}
\newtheorem{lem}{Lemma}[section]
\newtheorem{coro}[lem]{Corollary}
\newtheorem{thm}{Theorem}[section]
\newtheorem{prop}{Proposition}[section]
\newtheorem{ex}{Exercise}[section]
\theoremstyle{definition}
\newtheorem*{rmk}{Remark}
\newtheorem{defn}{Definition}[section]
\newcommand{\n}{\mathbb{N}}
\newcommand{\z}{\mathbb{Z}}
\newcommand{\q}{\mathbb{Q}}
\newcommand{\im}{\operatorname{Im}}
\renewcommand{\labelenumi}{\arabic{enumi}.}
\renewcommand{\labelenumii}{(\alph{enumii})}

\begin{document}

\maketitle
\tableofcontents

\begin{abstract}
This package uses an algorithm that is not widely adopted, at least not to my knowledge, so I think I will briefly explain how it works. The essence of the algorithm, in my opinion, is the chain-rule for derivatives of languages. An optimization technique that makes a great difference to the performance of the algorithm, but is not yet implemented in this package, is \textbf{context-free memoization}.
\end{abstract}

\section{Grammars}\label{sec:grammars}

Before explaining what parsers are, we first need to explain what grammars are. In short, a grammar can be viewed as a \emph{substitution machine}. That is, a grammar consists of a set of variables, a set of constants, and rules
\begin{equation*}
  X \mapsto Y \ldots Z,
\end{equation*}
where \(X\) is a variable and \(Y \ldots Z\) are either variables or constants. The grammar then substitutes every occurrence of \(X\) by the sequence \(Y\ldots Z\) on the right-hand side. There is also a \emph{starting variable}, usually denoted by \(S\).

To sum up, we start from the sequence \(S\) and perform substitutions according to the rules. If the substituted sequence cannot be further substituted, i.e. if the sequence consists of constants only, we say that this constant-only sequence is ``generated'' by the grammar.

One can think of this substitution machine as macros in \TeX, except that it only allows macros without arguments. Or one can think of the machine as macros in programming languages like \texttt{Lisp}, again without arguments. Notice that a variable can have multiple substitution rules, and that a given sequence can be generated by different sequences of substitutions.

\section{Parsers}\label{sec:parsers}

Parsers are \emph{inverses} of grammars. More precisely, while grammars perform substitutions to generate sequences, parsers try to find out which substitutions the grammar performed in order to generate a given sequence. Since there might be multiple ways to generate a given sequence, the parser needs to find all possible ways.

In fact, the number of ways to generate a sequence can even be infinite. For example, suppose the grammar has the following two rules:
\begin{align*}
  S &\mapsto S\\
  S &\mapsto "".
\end{align*}
This grammar means that \(S\) can be substituted by \(S\) itself, and it can be substituted by the empty sequence as well.
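For instance, one possible way to generate the empty sequence with this grammar is
\begin{equation*}
  S \mapsto S \mapsto S \mapsto "",
\end{equation*}
in which \(S\) is substituted by itself twice before finally being substituted by the empty sequence.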
In this case, the grammar can only generate the empty sequence, but it can generate the empty sequence in infinitely many ways: it can substitute \(S\) by itself any number of times before finally substituting \(S\) by the empty sequence.

Moreover, even if we do not consider generations that produce empty sequences, finding all the ways in which a grammar can generate a given sequence takes, in general, \(O(n^{3})\) time, where \(O\) denotes the \textbf{big-O} notation and \(n\) is the length of the sequence to be generated. This is why efficient general parsers that work for every grammar are hard to construct.

\section{Parser Generators}\label{sec:parser-generators}

Now we come to the topic of this package: parser generators. Simply speaking, a parser generator takes a grammar as input and produces a parser for that grammar as output.

One can classify parser generators into two types: compiled and interpreted. The former means that the generator produces the source code of a parser, which then has to be compiled in order to obtain the actual parser. The latter means that the generator directly produces a parser that is run by an interpreter. The well-known \emph{tree-sitter} parser generator belongs to the first category, for example. This package is interpreted, but there is a plan to also offer a \emph{compiled mode}, to potentially make the generated parsers more performant.

\section{Augmented Backus-Naur Form}\label{sec:augm-back-naur}

There are various ways to write down grammars. This package chooses the \emph{Augmented Backus-Naur Form} (ABNF). See the \href{https://en.wikipedia.org/wiki/Augmented_Backus-Naur_Form}{Wikipedia page} for an overview, see \href{https://datatracker.ietf.org/doc/html/rfc5234}{RFC 5234} for its detailed definition, or see \href{https://www.rfc-editor.org/rfc/rfc5234.txt}{the text version}. The reason for this choice is that the \textbf{RFC documents} use this format, so there is an abundance of mature grammars to use. For example, the ABNF format itself has a grammar written in ABNF, so we can immediately use the generated parser to parse grammars written in ABNF. There is also a plan to let users choose their own favourite format, of course.

\section{Languages and derivatives}\label{sec:lang-deriv}

This section is involved and optional. One can jump directly to Section~\ref{sec:tokenization} on tokenization if the details of the algorithm are not of interest.

To understand the algorithm, we shall first talk about the derivatives of languages. For the sake of convenience, let \(\Sigma\) denote a finite set of letters, called the \emph{alphabet}; the set of all Unicode code points can serve as an example. We call a sequence of elements of \(\Sigma\) a \emph{sentence}, and we call a (possibly infinite) set of sentences a \emph{language}. For example, the set of all grammatically correct English sentences is a language, to which we may give the name \emph{English}.

Now, if \(L\) is a language and \(s\) is a sentence, we define the \textbf{derivative of \(L\) with respect to \(s\)}, denoted by \(\frac{dL}{ds}\), as the set of all sentences \(t\) such that the concatenated sentence \(s \cdot t\) belongs to \(L\). We can re-formulate this definition in more mathematical terms:
\[
  \frac{dL}{ds} := \left\{t \text{ is a sentence } \mid s \cdot t \text{ belongs to } L\right\}.
\]
As a side note, the reason for calling it the \emph{derivative} is not only the (partial) similarity of the definitions, but also that the operation of taking the derivative of a language with respect to a sentence satisfies properties similar to those of \emph{derivatives in calculus}.
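As a concrete toy example (made up purely for illustration, and not drawn from any grammar in this document), suppose that \(\Sigma\) contains the letters \(a\), \(b\) and \(c\), and consider the finite language \(L = \left\{ab,\ ac,\ b\right\}\), whose three sentences are written by juxtaposing letters. Then
\[
  \frac{dL}{da} = \left\{b,\ c\right\},\qquad
  \frac{dL}{db} = \left\{""\right\},\qquad
  \frac{dL}{dc} = \emptyset,
\]
since \(ab\) and \(ac\) are the sentences of \(L\) that begin with \(a\), the only sentence of \(L\) that begins with \(b\) is \(b\) itself, and no sentence of \(L\) begins with \(c\). Note that \(\frac{dL}{db}\) contains the empty sentence, which reflects the fact that \(b\) belongs to \(L\); this observation is the subject of the next section.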
\section{Parsing by taking derivatives}\label{sec:pars-taking-deriv}

Why consider derivatives in a document about parser generators? Because we can parse sentences by taking derivatives.

More precisely, let \(L\) be a language and let \(s\) be a sentence. If the derivative \(\frac{dL}{ds}\) of \(L\) by \(s\) (which is a set of sentences) contains the empty sentence, then by definition this means that \(s\) belongs to \(L\). This observation is the basis of parsing by taking derivatives.

\begin{quotation}
  Those who have a rigorous mind are probably frowning at the above paragraph, because we say we can parse by taking derivatives, whereas we actually only determine whether the sentence belongs to the language. This gap comes from the fact that the details of actually constructing the parse forests by taking derivatives are too complicated for me to explain clearly here. I might come up with a clear explanation in the future, though.
\end{quotation}

\section{Regular languages}\label{sec:regular-languages}

With the above definition in mind, we now say that a language \(L\) is \emph{regular} if and only if the set of derivatives of \(L\) is finite, i.e.
\[
  \left\{\frac{dL}{ds} \mid s \text{ is a sentence } \right\} \text{ is finite}.
\]
Note that this is a set of languages, namely a set of sets of sentences, and we say that it is finite if it contains finitely many sets.

For those familiar with regular expressions, a language is regular if and only if there is a regular expression such that a sentence belongs to the language if and only if it matches that regular expression. I chose to define regular languages using derivatives because, in my opinion, this makes them easier to reason about in what follows.

\section{Languages of grammars}\label{sec:languages-grammars}

If \(G\) is a grammar, we say that \emph{\(G\) generates the language \(L\)} that consists of all sentences generated by the grammar. We denote the language generated by \(G\) by \(L_G\).

\section{Derivatives of the languages of grammars}
\label{sec:deriv-lang-gramm}

If \(s\) is a sentence, we ask the question:
\begin{quotation}
  What is the derivative \(\frac{dL_{G}}{ds}\) of \(L_G\) by \(s\)?
\end{quotation}

To answer this, note that if \(s\) and \(t\) are sentences, and if \(L\) is any language, then \(\frac{dL}{d(s\cdot t)}\) is equal to \(\frac{d}{dt}\frac{dL}{ds}\), the derivative by \(t\) of the derivative by \(s\) of \(L\). This can be seen from the definition: a sentence \(r\) belongs to \(\frac{dL}{d(s\cdot t)}\) if and only if \((s\cdot t\cdot r)\) belongs to \(L\), which is exactly the condition for \((t\cdot r)\) to belong to \(\frac{dL}{ds}\), and this in turn is the condition for \(r\) to belong to \(\frac{d}{dt} \frac{dL}{ds}\).

The above observation tells us that we can obtain \(\frac{dL_G}{ds}\) by successively taking derivatives of \(L_G\) with respect to sentences consisting of one single letter. Now consider \(\frac{dL_G}{da}\), where \(a\) is a single letter. Suppose \(G\) consists of the \(n\) rules
\begin{equation}
  \label{eq:rules_of_g}
  \begin{aligned}
    X_{1}&\mapsto Y_{1}\ldots Y_{m_{1}}\\
    \vdots\\
    X_{n}&\mapsto Z_{1} \ldots Z_{m_{n}},
  \end{aligned}
\end{equation}
where the \(X_i\) are variables, \(m_i\) is an integer depending on \(i\), for \(1\leq i\leq n\), and \(Y_1, \ldots, Y_{m_1}, \ldots, Z_1, \ldots, Z_{m_n}\) are either variables or constants. It is then natural to try to determine \(\frac{dL_G}{da}\) by some kind of substitution. So for \(1\leq i\leq n\), let \(X_i\) also denote the language that is generated if we treat the variable \(X_i\) as the starting variable in the grammar.
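In other words, we abuse the notation and write
\[
  X_{i} := \left\{s \text{ is a sentence } \mid s \text{ is generated by } G \text{ if } X_{i} \text{ is taken as the starting variable}\right\}.
\]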
For instance, if \(X_1\) is the original starting variable for the grammar \(G\), then \(L_G\) is also regarded as equal to \(X_1\) under this view.

Further, we observe that if \(M_1\) and \(M_2\) are both languages, and if \(M\) is their concatenation
\begin{equation*}
  M := \left\{ s \cdot t\mid s \in M_{1},\ t\in M_{2}\right\},
\end{equation*}
then \(\frac{dM}{da}\) is equal to either
\[\frac{dM_{1}}{da}\cdot M_{2}\]
or
\[\frac{dM_{1}}{da} \cdot M_{2}\bigcup \frac{dM_{2}}{da},\]
the former if \(M_1\) does not contain the empty sentence, and the latter if it does. As an aside, this can be regarded as the \emph{product-rule} for derivatives of languages.

Consequently, the derivative \(\frac{dX_1}{da}\) is equal to the union of
\begin{align*}
  &\frac{dY_{1}}{da}\cdot Y_{2}\cdots Y_{m_{1}},\\
  &\frac{dY_{2}}{da}\cdot Y_{3}\cdots Y_{m_{1}},\\
  &\vdots\\
  &\frac{dY_{i}}{da}\cdot Y_{i+1}\cdots Y_{m_{1}},
\end{align*}
where \(i\) is the smallest integer such that the language \(Y_{i}\) does not contain the empty sentence; if no such \(i\) exists, we set \(i=m_{1}\). And \emph{mutatis mutandis} for \(\frac{dX_{2}}{da},\ldots, \frac{dX_{n}}{da}\).

Now we treat the languages \(\frac{dY_i}{da}\) as variables: if \(Y_i\) is a variable, this denotes the derivative by \(a\) of the language generated by \(Y_i\); if \(Y_i\) is a constant, then this denotes either the language consisting of only the empty sentence, or the empty set, depending upon whether \(Y_i\) is equal to \(a\) or not. The previous paragraph then defines the substitution rules for these new variables \(\frac{dY_i}{da}\). We call these rules \emph{the derived rules} of \(\frac{dY_i}{da}\). It follows that we can obtain the derivative \(\frac{dL_{G}}{da} = \frac{dX_{1}}{da}\) by \emph{substituting} these \(\frac{dY_i}{da}\) according to the derived rules.

Combining the above, we see that the derivatives of the language \(L_G\) by any sentence are generated by grammars. We call the above grammar, which generates \(\frac{dL_G}{da}\), \emph{the derived grammar}, and denote it by \(\frac{dG}{da}\). Our problem is now transformed into determining whether the successive derivatives of grammars \(\frac{dG}{da_1}, \frac{d}{da_2}\frac{dG}{da_1},\ldots\) generate the empty sentence or not. This is relatively easy to do, once we can compute the successive derivatives efficiently.

\section{Chain-rule of derivatives}\label{sec:chain-rule-deriv}

None of the above was new when the author proposed the algorithm. The core idea of the algorithm, in my opinion, is the \textbf{chain-rule} of the operation of taking derivatives of languages.

In calculus, roughly speaking, the chain-rule of derivatives refers to the formula
\begin{equation*}
  \frac{df}{dx} = \sum_{i=1}^{n} \frac{df}{dy_{i}} \cdot \frac{dy_{i}}{dx},
\end{equation*}
where \(f\) is a function in \(n\) variables \(y_1, \ldots, y_n\), which are themselves functions in the variable \(x\). If we replace the function \(f\) by a language \(L\) generated by a grammar \(G\), the variables \(y_1, \ldots , y_n\) by the variables in \(G\), \(x\) by a letter \(a\), the multiplication \(\cdot\) by the concatenation of languages, and the summation \(+\) by the union of languages, then we obtain a formula writing the derivative \(\frac{dG}{da}\) as a union of concatenations:
\begin{equation}\label{eq:chain_rule}
  \frac{dG}{da} = \bigcup_{i=1}^{n}\frac{dG}{dy_{i}} \cdot \frac{dy_{i}}{da}.
\end{equation}
The only remaining questions are: what does \(\frac{dG}{dy_{i}}\) actually mean, and is formula~\eqref{eq:chain_rule} true?
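We answer these questions in the following subsections. Before doing so, it may help to see how the constructions described so far fit together operationally. The following is a minimal, illustrative sketch in Python of recognition by successive derivatives; the grammar representation, the function names, and the naive (non-memoized) way of building derived rules are all invented here for illustration, and they are not the data structures or the code of this package.

\begin{verbatim}
# Illustrative sketch only: not the data structures or code of this package.
# A grammar is a dict mapping each variable to a list of rule right-hand
# sides; a right-hand side is a list of symbols.  A symbol is a variable
# (a key of the dict) or a constant letter (any other string).
from itertools import count

_fresh = count()  # used to give derived variables unique names


def nullable_set(rules):
    """Return the set of variables that can generate the empty sentence."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for var, bodies in rules.items():
            if var not in nullable and any(
                all(sym in nullable for sym in body) for body in bodies
            ):
                nullable.add(var)
                changed = True
    return nullable


def derive(rules, start, letter):
    """Build the derived grammar with respect to one letter.

    Returns the enlarged rule set and the derived start variable,
    following the construction of the derived rules described above.
    """
    nullable = nullable_set(rules)
    new_rules = dict(rules)  # the old variables still occur on the right
    derived = {}             # original variable -> its derived variable

    def derived_var(var):
        if var not in derived:
            d = "d({})/d({})#{}".format(var, letter, next(_fresh))
            derived[var] = d
            new_rules[d] = []
            for body in rules[var]:
                for k, sym in enumerate(body):
                    rest = body[k + 1:]
                    if sym in rules:                   # a variable
                        new_rules[d].append([derived_var(sym)] + rest)
                        if sym not in nullable:
                            break                      # cannot skip over sym
                    else:                              # a constant letter
                        if sym == letter:
                            new_rules[d].append(rest)  # its derivative is {""}
                        break                          # constants are not nullable
        return derived[var]

    return new_rules, derived_var(start)


def recognize(rules, start, sentence):
    """Decide whether the grammar generates the given sentence."""
    for letter in sentence:
        rules, start = derive(rules, start, letter)
    return start in nullable_set(rules)


# Example: S -> "a" S "b" | "" generates the sentences a^n b^n.
rules = {"S": [["a", "S", "b"], []]}
print(recognize(rules, "S", "aabb"))  # True
print(recognize(rules, "S", "ab"))    # True
print(recognize(rules, "S", "aab"))   # False
\end{verbatim}

The helper \texttt{nullable\_set} computes, by a fixed point, the set of variables that can generate the empty sentence, and \texttt{derive} builds the derived rules following the construction of Section~\ref{sec:deriv-lang-gramm}. This sketch recomputes everything from scratch at every step; it is only meant to make the preceding discussion concrete.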
\subsection{Derivatives with respect to variables}
\label{sec:deriv-wrt-var}

We want to generalize the construction in Section~\ref{sec:deriv-lang-gramm} so that we can take derivatives with respect to variables. First, for the sake of brevity, let \(\Sigma^{\star}\) denote the set of all sentences, i.e. possibly empty sequences of elements of \(\Sigma\).

We define a new alphabet \(\Sigma' := \Sigma\bigcup \left\{X_{i}\,\vert\, 1\leq i\leq n\right\}\) and a new set of variables
\[
  \left\{ Z_{i}\ \bigg\vert\ 1\leq i\leq n \right\},
\]
where the \(X_{i}\) are the variables of \(G\), as in its rules~\eqref{eq:rules_of_g}. Then for each rule \(X_{i}\mapsto Y_{1}\ldots Y_{m_{i}}\), we define \(j\) rules: for \(1\leq k\leq j\),
\begin{equation}\label{eq:closed_grammar_rules}
  Z_{i}\mapsto Z_{k} \cdot Y_{k+1}\cdots Y_{m_{i}},
\end{equation}
where \(j\) is the smallest integer such that the language \(Y_{j}\) does not contain the empty sequence \(\varepsilon\in\Sigma^{\star}\) (if no such integer exists, we set \(j=m_{i}\)). Here \(Z_{k}\) stands for the new variable corresponding to \(Y_{k}\): if \(Y_{k}\) is the variable \(X_{l}\), then \(Z_{k}\) means \(Z_{l}\); and if \(Y_{k}\) belongs to \(\Sigma\), then \(Z_{k}\) is replaced by \(Y_{k}\) itself, i.e. the right-hand side of that rule does not contain variables. The remaining symbols \(Y_{k+1},\ldots,Y_{m_{i}}\) are regarded as letters of the new alphabet \(\Sigma'\).

This defines a grammar, which we call the \emph{closed grammar} \(\overline G\). With this closed grammar, we can naturally take derivatives with respect to the original variables \(X_{i}\), as they are now letters of this closed grammar \(\overline G\).

\subsection{Proof of the chain-rule}\label{sec:proof-chain-rule}

\begin{prop}\label{prop:proof-of-chain-rule}
  \[
    \frac{dL_{G}}{da} = \bigcup_{i=1}^{n} \frac{dX_{i}}{da} \cdot \left( \frac{dL_{\overline G}}{dX_{i}} \bigcap \Sigma^{\star} \right),\quad \forall a\in\Sigma.
  \]
\end{prop}

\begin{proof}
  For any \(i\), any sequence in \(\frac{dL_{\overline G}}{dX_{i}}\) is generated from \(G\) by partial substitutions, after removing empty sequences. So the containment of the right-hand side in the left-hand side is clear. Conversely, if a sentence \(s\in\Sigma^{\star}\) begins with \(a\), then \(s\) must be obtained by substitution from some sentence in \({(\Sigma')}^{\star}\), which means that it belongs to the right-hand side. So the proof is complete.
\end{proof}

\begin{rmk}
  It follows that we can determine whether or not \(\frac{dL_{G}}{da}\) contains the empty sequence by examining \(\frac{dX_{i}}{da}\) and \(\frac{dL_{\overline G}}{dX_{i}}\) respectively. Further, since \(\frac{dX_{i}}{da} = \frac{dZ_{i}}{da}\bigcap \Sigma^{\star}\) and \(\frac{dL_{\overline G}}{dX_{i}} = \frac{dZ_{1}}{dX_{i}}\), we conclude that we only have to examine \(\frac{dZ_{i}}{dX_{j}}\), for \(1\leq i,j\leq n\).
\end{rmk}

What makes this grammar interesting is its regularity.

\subsection{The closed grammar is regular}\label{sec:clos-gramm-regul}

\begin{prop}
  The closed grammar is regular. In fact, the set
  \[
    \left\{\frac{dZ_{i}}{ds}\ \bigg\vert\ 1\leq i\leq n,\,s\in{(\Sigma')}^{\star}\right\}
  \]
  is finite.
\end{prop}

\begin{proof}
  Observe the rules~\eqref{eq:closed_grammar_rules}. From the definition, we see that the right-hand side of each rule contains at most one variable, namely some \(Z_{k}\), and that this variable can only appear as the first symbol of the right-hand side.
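  In symbols, every rule of the closed grammar has one of the two forms
  \[
    Z_{i}\mapsto Z_{k}\, w \qquad\text{or}\qquad Z_{i}\mapsto c\, w,
  \]
  where \(c\) is a letter of \(\Sigma'\) and \(w\) is a (possibly empty) sequence of letters of \(\Sigma'\).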
  This observation is sufficient to show that the closed grammar is regular. The rigorous-minded reader can prove the statement by first showing that the language generated by the closed grammar is a finite union of concatenations of languages of two types: sets consisting of a single finite sentence, and sets consisting of zero or more self-repetitions of a finite sequence. The reader can then verify that such languages are regular, by showing that a language consisting of zero or more self-repetitions of a finite sequence has only finitely many derivatives. A hint for the last step is to notice that the derivatives of the language \(s^{\star}\) of zero or more self-repetitions of a sequence \(s:=s_{1},\ldots,s_{p}\) can be injected into the finite set
  \[
    \left\{\left\{s_{i},s_{i+1},\ldots,s_{p}\right\}\cdot s^{\star} \mid 1\leq i\leq p+1\right\},
  \]
  where \(\left\{s_{i},s_{i+1},\ldots,s_{p}\right\}\) denotes the language whose only sentence is the suffix \(s_{i},s_{i+1},\ldots,s_{p}\).
\end{proof}

\section{Tokenization}\label{sec:tokenization}

There is another abstraction used by this package.

\end{document}