\documentclass[12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{grffile}
\usepackage{longtable}
\usepackage{wrapfig}
\usepackage{rotating}
\usepackage[normalem]{ulem}
\usepackage{amsmath}
\usepackage{mathrsfs}
\usepackage{textcomp}
\usepackage{amssymb}
\usepackage{capt-of}
\usepackage{hyperref}
\usepackage{amsfonts, amsthm}
\usepackage{enumitem}
\usepackage{tikz}
\author{Durand}
\date{2023--08--05}
\title{\vspace{-5ex}Algorithm used by the package REP}
\hypersetup{
 pdfauthor={Durand},
 pdftitle={Algorithm used by the package REP},
 pdfkeywords={},
 pdfsubject={},
 pdfcreator={Durand},
 pdflang={English}}
\newtheorem{lem}{Lemma}[section]
\newtheorem{coro}[lem]{Corollary}
\newtheorem{thm}{Theorem}[section]
\newtheorem{prop}{Proposition}[section]
\newtheorem{ex}{Exercise}[section]
\theoremstyle{definition}
\newtheorem*{rmk}{Remark}
\newtheorem{defn}{Definition}[section]
\newcommand{\n}{\mathbb{N}}
\newcommand{\z}{\mathbb{Z}}
\newcommand{\q}{\mathbb{Q}}
\newcommand{\im}{\operatorname{Im}}
\renewcommand{\labelenumi}{\arabic{enumi}.}
\renewcommand{\labelenumii}{(\alph{enumii})}

\begin{document}

\maketitle
\tableofcontents

\begin{abstract}
This package uses an algorithm that is not widely adopted, at least not to my knowledge, so I think I will briefly explain how it works. The essence of the algorithm, in my opinion, is the chain-rule for derivatives of languages. An optimization technique that makes a great difference to the performance of the algorithm, but is not yet implemented in this package, is \textbf{context-free memoization}.
\end{abstract}

\section{Grammars}\label{sec:grammars}

Before explaining what parsers are, we first need to explain what grammars are. In short, a grammar can be viewed as a \emph{substitution machine}. That is, a grammar consists of a set of variables, a set of constants, and rules
\begin{equation*}
  X \mapsto Y \ldots Z,
\end{equation*}
where \(X\) is a variable and \(Y \ldots Z\) are either variables or constants. The grammar then substitutes every occurrence of \(X\) by the sequence \(Y\ldots Z\) on the right-hand side. There is also a \emph{starting variable}, usually denoted by \(S\).

To sum up, we start from the sequence \(S\) and perform substitutions according to the rules. If the substituted sequence cannot be further substituted, i.e. if the sequence consists of constants only, we say that this constant-only sequence is ``generated'' by the grammar.

One can think of this substitution machine as macros in \TeX, except that it only allows macros without arguments. Or one can think of the machine as macros in programming languages like \texttt{Lisp}, again without arguments. Notice that a variable can have multiple substitution rules, and that a given sequence can be generated by different sequences of substitutions.

\section{Parsers}\label{sec:parsers}

Parsers are \emph{inverses} of grammars. More precisely, while grammars perform substitutions to generate sequences, parsers try to find out which substitutions the grammar performed in order to generate a given sequence. Since there might be multiple ways to generate a given sequence, the parser needs to find all possible ways.

In fact, the number of ways to generate a sequence can even be infinite. For example, suppose the grammar has the following two rules:
\begin{align*}
  S &\mapsto S\\
  S &\mapsto "".
\end{align*}
This grammar means that \(S\) can be substituted by \(S\) itself, and it can be substituted by the empty sequence as well.
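For instance, one possible way to generate the empty sequence with this grammar is
\begin{equation*}
  S \mapsto S \mapsto S \mapsto "",
\end{equation*}
in which \(S\) is substituted by itself twice before finally being substituted by the empty sequence.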
In this case, the grammar can only generate the empty sequence, but it can generate the empty sequence in infinitely many ways: it can substitute \(S\) by itself any number of times before finally substituting \(S\) by the empty sequence.

Moreover, even if we do not consider generations that produce empty sequences, finding all the ways in which a grammar can generate a given sequence takes, in general, \(O(n^{3})\) time, where \(O\) denotes the \textbf{big-O} notation and \(n\) is the length of the sequence to be generated. This is why efficient general parsers that work for every grammar are hard to construct.

\section{Parser Generators}\label{sec:parser-generators}

Now we come to the topic of this package: parser generators. Simply speaking, a parser generator takes a grammar as input and produces a parser for that grammar as output.

One can classify parser generators into two types: compiled and interpreted. The former means that the generator produces the source code of a parser, which then has to be compiled in order to obtain the actual parser. The latter means that the generator directly produces a parser that is run by an interpreter. The well-known \emph{tree-sitter} parser generator belongs to the first category, for example. This package is interpreted, but there is a plan to also offer a \emph{compiled mode}, to potentially make the generated parsers more performant.

\section{Augmented Backus-Naur Form}\label{sec:augm-back-naur}

There are various ways to write down grammars. This package chooses the \emph{Augmented Backus-Naur Form} (ABNF). See the \href{https://en.wikipedia.org/wiki/Augmented_Backus-Naur_Form}{Wikipedia page} for an overview, see \href{https://datatracker.ietf.org/doc/html/rfc5234}{RFC 5234} for its detailed definition, or see \href{https://www.rfc-editor.org/rfc/rfc5234.txt}{the text version}. The reason for this choice is that the \textbf{RFC documents} use this format, so there is an abundance of mature grammars to use. For example, the ABNF format itself has a grammar written in ABNF, so we can immediately use the generated parser to parse grammars written in ABNF. There is also a plan to let users choose their own favourite format, of course.

\section{Languages and derivatives}\label{sec:lang-deriv}

This section is involved and optional. One can jump directly to Section~\ref{sec:tokenization} on tokenization if the details of the algorithm are not of interest.

To understand the algorithm, we shall first talk about the derivatives of languages. For the sake of convenience, let \(\Sigma\) denote a finite set of letters, called the \emph{alphabet}; the set of all Unicode code points can serve as an example. We call a sequence of elements of \(\Sigma\) a \emph{sentence}, and we call a (possibly infinite) set of sentences a \emph{language}. For example, the set of all grammatically correct English sentences is a language, to which we may give the name \emph{English}.

Now, if \(L\) is a language and \(s\) is a sentence, we define the \textbf{derivative of \(L\) with respect to \(s\)}, denoted by \(\frac{dL}{ds}\), as the set of all sentences \(t\) such that the concatenated sentence \(s \cdot t\) belongs to \(L\). We can re-formulate this definition in more mathematical terms:
\[
  \frac{dL}{ds} := \left\{t \text{ is a sentence } \mid s \cdot t \text{ belongs to } L\right\}.
\]
As a side note, the reason for calling it the \emph{derivative} is not only the (partial) similarity of the definitions, but also that the operation of taking the derivative of a language with respect to a sentence satisfies properties similar to those of \emph{derivatives in calculus}.
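As a concrete toy example (made up purely for illustration, and not drawn from any grammar in this document), suppose that \(\Sigma\) contains the letters \(a\), \(b\) and \(c\), and consider the finite language \(L = \left\{ab,\ ac,\ b\right\}\), whose three sentences are written by juxtaposing letters. Then
\[
  \frac{dL}{da} = \left\{b,\ c\right\},\qquad
  \frac{dL}{db} = \left\{""\right\},\qquad
  \frac{dL}{dc} = \emptyset,
\]
since \(ab\) and \(ac\) are the sentences of \(L\) that begin with \(a\), the only sentence of \(L\) that begins with \(b\) is \(b\) itself, and no sentence of \(L\) begins with \(c\). Note that \(\frac{dL}{db}\) contains the empty sentence, which reflects the fact that \(b\) belongs to \(L\); this observation is the subject of the next section.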
\section{Parsing by taking derivatives}\label{sec:pars-taking-deriv}

Why consider derivatives in a document about parser generators? Because we can parse sentences by taking derivatives.

More precisely, let \(L\) be a language and let \(s\) be a sentence. If the derivative \(\frac{dL}{ds}\) of \(L\) by \(s\) (which is a set of sentences) contains the empty sentence, then by definition this means that \(s\) belongs to \(L\). This observation is the basis of parsing by taking derivatives.

\begin{quotation}
  Those who have a rigorous mind are probably frowning at the above paragraph, because we say we can parse by taking derivatives, whereas we actually only determine whether the sentence belongs to the language. This gap comes from the fact that the details of actually constructing the parse forests by taking derivatives are too complicated for me to explain clearly here. I might come up with a clear explanation in the future, though.
\end{quotation}

\section{Regular languages}\label{sec:regular-languages}

With the above definition in mind, we now say that a language \(L\) is \emph{regular} if and only if the set of derivatives of \(L\) is finite, i.e.
\[
  \left\{\frac{dL}{ds} \mid s \text{ is a sentence } \right\} \text{ is finite}.
\]
Note that this is a set of languages, namely a set of sets of sentences, and we say that it is finite if it contains finitely many sets.

For those familiar with regular expressions, a language is regular if and only if there is a regular expression such that a sentence belongs to the language if and only if it matches that regular expression. I chose to define regular languages using derivatives because, in my opinion, this makes them easier to reason about in what follows.

\section{Languages of grammars}\label{sec:languages-grammars}

If \(G\) is a grammar, we say that \emph{\(G\) generates the language \(L\)} that consists of all sentences generated by the grammar. We denote the language generated by \(G\) by \(L_G\).

\section{Derivatives of the languages of grammars}
\label{sec:deriv-lang-gramm}

If \(s\) is a sentence, we ask the question:
\begin{quotation}
  What is the derivative \(\frac{dL_{G}}{ds}\) of \(L_G\) by \(s\)?
\end{quotation}

To answer this, note that if \(s\) and \(t\) are sentences, and if \(L\) is any language, then \(\frac{dL}{d(s\cdot t)}\) is equal to \(\frac{d}{dt}\frac{dL}{ds}\), the derivative by \(t\) of the derivative by \(s\) of \(L\). This can be seen from the definition: a sentence \(r\) belongs to \(\frac{dL}{d(s\cdot t)}\) if and only if \((s\cdot t\cdot r)\) belongs to \(L\), which is exactly the condition for \((t\cdot r)\) to belong to \(\frac{dL}{ds}\), and this in turn is the condition for \(r\) to belong to \(\frac{d}{dt} \frac{dL}{ds}\).

The above observation tells us that we can obtain \(\frac{dL_G}{ds}\) by successively taking derivatives of \(L_G\) with respect to sentences consisting of one single letter. Now consider \(\frac{dL_G}{da}\), where \(a\) is a single letter. Suppose \(G\) consists of the \(n\) rules
\begin{equation}
  \label{eq:rules_of_g}
  \begin{aligned}
    X_{1}&\mapsto Y_{1}\ldots Y_{m_{1}}\\
    \vdots\\
    X_{n}&\mapsto Z_{1} \ldots Z_{m_{n}},
  \end{aligned}
\end{equation}
where the \(X_i\) are variables, \(m_i\) is an integer depending on \(i\), for \(1\leq i\leq n\), and \(Y_1, \ldots, Y_{m_1}, \ldots, Z_1, \ldots, Z_{m_n}\) are either variables or constants. It is then natural to try to determine \(\frac{dL_G}{da}\) by some kind of substitution. So for \(1\leq i\leq n\), let \(X_i\) also denote the language that is generated if we treat the variable \(X_i\) as the starting variable in the grammar.
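In other words, we abuse the notation and write
\[
  X_{i} := \left\{s \text{ is a sentence } \mid s \text{ is generated by } G \text{ if } X_{i} \text{ is taken as the starting variable}\right\}.
\]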
For instance, if \(X_1\) is the original starting variable for the grammar \(G\), then \(L_G\) is also regarded as equal to \(X_1\) under this view.

Further, we observe that if \(M_1\) and \(M_2\) are both languages, and if \(M\) is their concatenation
\begin{equation*}
  M := \left\{ s \cdot t\mid s \in M_{1},\ t\in M_{2}\right\},
\end{equation*}
then \(\frac{dM}{da}\) is equal to either
\[\frac{dM_{1}}{da}\cdot M_{2}\]
or
\[\frac{dM_{1}}{da} \cdot M_{2}\bigcup \frac{dM_{2}}{da},\]
the former if \(M_1\) does not contain the empty sentence, and the latter if it does. As an aside, this can be regarded as the \emph{product-rule} for derivatives of languages.

Consequently, the derivative \(\frac{dX_1}{da}\) is equal to the union of
\begin{align*}
  &\frac{dY_{1}}{da}\cdot Y_{2}\cdots Y_{m_{1}},\\
  &\frac{dY_{2}}{da}\cdot Y_{3}\cdots Y_{m_{1}},\\
  &\vdots\\
  &\frac{dY_{i}}{da}\cdot Y_{i+1}\cdots Y_{m_{1}},
\end{align*}
where \(i\) is the smallest integer such that the language \(Y_{i}\) does not contain the empty sentence; if no such \(i\) exists, we set \(i=m_{1}\). And \emph{mutatis mutandis} for \(\frac{dX_{2}}{da},\ldots, \frac{dX_{n}}{da}\).

Now we treat the languages \(\frac{dY_i}{da}\) as variables: if \(Y_i\) is a variable, this denotes the derivative by \(a\) of the language generated by \(Y_i\); if \(Y_i\) is a constant, then this denotes either the language consisting of only the empty sentence, or the empty set, depending upon whether \(Y_i\) is equal to \(a\) or not. The previous paragraph then defines the substitution rules for these new variables \(\frac{dY_i}{da}\). We call these rules \emph{the derived rules} of \(\frac{dY_i}{da}\). It follows that we can obtain the derivative \(\frac{dL_{G}}{da} = \frac{dX_{1}}{da}\) by \emph{substituting} these \(\frac{dY_i}{da}\) according to the derived rules.

Combining the above, we see that the derivatives of the language \(L_G\) by any sentence are generated by grammars. We call the above grammar, which generates \(\frac{dL_G}{da}\), \emph{the derived grammar}, and denote it by \(\frac{dG}{da}\). Our problem is now transformed into determining whether the successive derivatives of grammars \(\frac{dG}{da_1}, \frac{d}{da_2}\frac{dG}{da_1},\ldots\) generate the empty sentence or not. This is relatively easy to do, once we can compute the successive derivatives efficiently.

\section{Chain-rule of derivatives}\label{sec:chain-rule-deriv}

None of the above was new when the author proposed the algorithm. The core idea of the algorithm, in my opinion, is the \textbf{chain-rule} of the operation of taking derivatives of languages.

In calculus, roughly speaking, the chain-rule of derivatives refers to the formula
\begin{equation*}
  \frac{df}{dx} = \sum_{i=1}^{n} \frac{df}{dy_{i}} \cdot \frac{dy_{i}}{dx},
\end{equation*}
where \(f\) is a function in \(n\) variables \(y_1, \ldots, y_n\), which are themselves functions in the variable \(x\). If we replace the function \(f\) by a language \(L\) generated by a grammar \(G\), the variables \(y_1, \ldots , y_n\) by the variables in \(G\), \(x\) by a letter \(a\), the multiplication \(\cdot\) by the concatenation of languages, and the summation \(+\) by the union of languages, then we obtain a formula writing the derivative \(\frac{dG}{da}\) as a union of concatenations:
\begin{equation}\label{eq:chain_rule}
  \frac{dG}{da} = \bigcup_{i=1}^{n}\frac{dG}{dy_{i}} \cdot \frac{dy_{i}}{da}.
\end{equation}
The only remaining questions are: what does \(\frac{dG}{dy_{i}}\) actually mean, and is formula~\eqref{eq:chain_rule} true?
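We answer these questions in the following subsections. Before doing so, it may help to see how the constructions described so far fit together operationally. The following is a minimal, illustrative sketch in Python of recognition by successive derivatives; the grammar representation, the function names, and the naive (non-memoized) way of building derived rules are all invented here for illustration, and they are not the data structures or the code of this package.

\begin{verbatim}
# Illustrative sketch only: not the data structures or code of this package.
# A grammar is a dict mapping each variable to a list of rule right-hand
# sides; a right-hand side is a list of symbols.  A symbol is a variable
# (a key of the dict) or a constant letter (any other string).
from itertools import count

_fresh = count()  # used to give derived variables unique names


def nullable_set(rules):
    """Return the set of variables that can generate the empty sentence."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for var, bodies in rules.items():
            if var not in nullable and any(
                all(sym in nullable for sym in body) for body in bodies
            ):
                nullable.add(var)
                changed = True
    return nullable


def derive(rules, start, letter):
    """Build the derived grammar with respect to one letter.

    Returns the enlarged rule set and the derived start variable,
    following the construction of the derived rules described above.
    """
    nullable = nullable_set(rules)
    new_rules = dict(rules)  # the old variables still occur on the right
    derived = {}             # original variable -> its derived variable

    def derived_var(var):
        if var not in derived:
            d = "d({})/d({})#{}".format(var, letter, next(_fresh))
            derived[var] = d
            new_rules[d] = []
            for body in rules[var]:
                for k, sym in enumerate(body):
                    rest = body[k + 1:]
                    if sym in rules:                   # a variable
                        new_rules[d].append([derived_var(sym)] + rest)
                        if sym not in nullable:
                            break                      # cannot skip over sym
                    else:                              # a constant letter
                        if sym == letter:
                            new_rules[d].append(rest)  # its derivative is {""}
                        break                          # constants are not nullable
        return derived[var]

    return new_rules, derived_var(start)


def recognize(rules, start, sentence):
    """Decide whether the grammar generates the given sentence."""
    for letter in sentence:
        rules, start = derive(rules, start, letter)
    return start in nullable_set(rules)


# Example: S -> "a" S "b" | "" generates the sentences a^n b^n.
rules = {"S": [["a", "S", "b"], []]}
print(recognize(rules, "S", "aabb"))  # True
print(recognize(rules, "S", "ab"))    # True
print(recognize(rules, "S", "aab"))   # False
\end{verbatim}

The helper \texttt{nullable\_set} computes, by a fixed point, the set of variables that can generate the empty sentence, and \texttt{derive} builds the derived rules following the construction of Section~\ref{sec:deriv-lang-gramm}. This sketch recomputes everything from scratch at every step; it is only meant to make the preceding discussion concrete.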
\subsection{Derivatives with respect to variables}
\label{sec:deriv-wrt-var}

We want to generalize the construction in Section~\ref{sec:deriv-lang-gramm} so that we can take derivatives with respect to variables. First, for the sake of brevity, let \(\Sigma^{\star}\) denote the set of all sentences, i.e. possibly empty sequences of elements of \(\Sigma\).

We define a new alphabet \(\Sigma' := \Sigma\bigcup \left\{X_{i}\,\vert\, 1\leq i\leq n\right\}\) and a new set of variables
\[
  \left\{ Z_{i}\ \bigg\vert\ 1\leq i\leq n \right\},
\]
where the \(X_{i}\) are the variables of \(G\), as in its rules~\eqref{eq:rules_of_g}. Then for each rule \(X_{i}\mapsto Y_{1}\ldots Y_{m_{i}}\), we define \(j\) rules: for \(1\leq k\leq j\),
\begin{equation}\label{eq:closed_grammar_rules}
  Z_{i}\mapsto Z_{k} \cdot Y_{k+1}\cdots Y_{m_{i}},
\end{equation}
where \(j\) is the smallest integer such that the language \(Y_{j}\) does not contain the empty sequence \(\varepsilon\in\Sigma^{\star}\) (if no such integer exists, we set \(j=m_{i}\)). Here \(Z_{k}\) stands for the new variable corresponding to \(Y_{k}\): if \(Y_{k}\) is the variable \(X_{l}\), then \(Z_{k}\) means \(Z_{l}\); and if \(Y_{k}\) belongs to \(\Sigma\), then \(Z_{k}\) is replaced by \(Y_{k}\) itself, i.e. the right-hand side of that rule does not contain variables. The remaining symbols \(Y_{k+1},\ldots,Y_{m_{i}}\) are regarded as letters of the new alphabet \(\Sigma'\).

This defines a grammar, which we call the \emph{closed grammar} \(\overline G\). With this closed grammar, we can naturally take derivatives with respect to the original variables \(X_{i}\), as they are now letters of this closed grammar \(\overline G\).

\subsection{Proof of the chain-rule}\label{sec:proof-chain-rule}

\begin{prop}\label{prop:proof-of-chain-rule}
  \[
    \frac{dL_{G}}{da} = \bigcup_{i=1}^{n} \frac{dX_{i}}{da} \cdot \left( \frac{dL_{\overline G}}{dX_{i}} \bigcap \Sigma^{\star} \right),\quad \forall a\in\Sigma.
  \]
\end{prop}

\begin{proof}
  For any \(i\), any sequence in \(\frac{dL_{\overline G}}{dX_{i}}\) is generated from \(G\) by partial substitutions, after removing empty sequences. So the containment of the right-hand side in the left-hand side is clear. Conversely, if a sentence \(s\in\Sigma^{\star}\) begins with \(a\), then \(s\) must be obtained by substitution from some sentence in \({(\Sigma')}^{\star}\), which means that it belongs to the right-hand side. So the proof is complete.
\end{proof}

\begin{rmk}
  It follows that we can determine whether or not \(\frac{dL_{G}}{da}\) contains the empty sequence by examining \(\frac{dX_{i}}{da}\) and \(\frac{dL_{\overline G}}{dX_{i}}\) respectively. Further, since \(\frac{dX_{i}}{da} = \frac{dZ_{i}}{da}\bigcap \Sigma^{\star}\) and \(\frac{dL_{\overline G}}{dX_{i}} = \frac{dZ_{1}}{dX_{i}}\), we conclude that we only have to examine \(\frac{dZ_{i}}{dX_{j}}\), for \(1\leq i,j\leq n\).
\end{rmk}

What makes this grammar interesting is its regularity.

\subsection{The closed grammar is regular}\label{sec:clos-gramm-regul}

\begin{prop}
  The closed grammar is regular. In fact, the set
  \[
    \left\{\frac{dZ_{i}}{ds}\ \bigg\vert\ 1\leq i\leq n,\,s\in{(\Sigma')}^{\star}\right\}
  \]
  is finite.
\end{prop}

\begin{proof}
  Observe the rules~\eqref{eq:closed_grammar_rules}. From the definition, we see that the right-hand side of each rule contains at most one variable, namely some \(Z_{k}\), and that this variable can only appear as the first symbol of the right-hand side.
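  In symbols, every rule of the closed grammar has one of the two forms
  \[
    Z_{i}\mapsto Z_{k}\, w \qquad\text{or}\qquad Z_{i}\mapsto c\, w,
  \]
  where \(c\) is a letter of \(\Sigma'\) and \(w\) is a (possibly empty) sequence of letters of \(\Sigma'\).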
  This observation is sufficient to show that the closed grammar is regular. The rigorous-minded reader can prove the statement by first showing that the language generated by the closed grammar is a finite union of concatenations of languages of two types: sets consisting of a single finite sentence, and sets consisting of zero or more self-repetitions of a finite sequence. The reader can then verify that such languages are regular, by showing that a language consisting of zero or more self-repetitions of a finite sequence has only finitely many derivatives. A hint for the last step is to notice that the derivatives of the language \(s^{\star}\) of zero or more self-repetitions of a sequence \(s:=s_{1},\ldots,s_{p}\) can be injected into the finite set
  \[
    \left\{\left\{s_{i},s_{i+1},\ldots,s_{p}\right\}\cdot s^{\star} \mid 1\leq i\leq p+1\right\},
  \]
  where \(\left\{s_{i},s_{i+1},\ldots,s_{p}\right\}\) denotes the language whose only sentence is the suffix \(s_{i},s_{i+1},\ldots,s_{p}\).
\end{proof}

\section{Tokenization}\label{sec:tokenization}

There is another abstraction used by this package.

\end{document}