From b2ca9a285f09d90259aaa7c4d09bb7335211020b Mon Sep 17 00:00:00 2001
From: JSDurand
Date: Thu, 10 Aug 2023 11:14:54 +0800
Subject: Modify README

Working my way towards a user-friendly introduction to the package.
---
 algorithm.tex | 486 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 486 insertions(+)
 create mode 100644 algorithm.tex

diff --git a/algorithm.tex b/algorithm.tex
new file mode 100644
index 0000000..20fb1ac
--- /dev/null
+++ b/algorithm.tex
@@ -0,0 +1,486 @@
+\documentclass[12pt]{article}
+\usepackage[utf8]{inputenc}
+\usepackage[T1]{fontenc}
+\usepackage{graphicx}
+\usepackage{grffile}
+\usepackage{longtable}
+\usepackage{wrapfig}
+\usepackage{rotating}
+\usepackage[normalem]{ulem}
+\usepackage{amsmath}
+\usepackage{mathrsfs}
+\usepackage{textcomp}
+\usepackage{amssymb}
+\usepackage{capt-of}
+\usepackage{hyperref}
+\usepackage{amsfonts, amsthm}
+\usepackage{enumitem}
+\usepackage{tikz}
+\author{Durand}
+\date{2023-08-05}
+\title{\vspace{-5ex}Algorithm used by the package REP}
+\hypersetup{
+ pdfauthor={Durand},
+ pdftitle={Algorithm used by the package REP},
+ pdfkeywords={},
+ pdfsubject={},
+ pdfcreator={Durand},
+ pdflang={English}}
+
+\newtheorem{lem}{Lemma}[section]
+\newtheorem{coro}[lem]{Corollary}
+\newtheorem{thm}{Theorem}[section]
+\newtheorem{prop}{Proposition}[section]
+\newtheorem{ex}{Exercise}[section]
+\theoremstyle{definition}
+\newtheorem*{rmk}{Remark}
+\newtheorem{defn}{Definition}[section]
+
+\newcommand{\n}{\mathbb{N}}
+\newcommand{\z}{\mathbb{Z}}
+\newcommand{\q}{\mathbb{Q}}
+\newcommand{\im}{\operatorname{Im}}
+\renewcommand{\labelenumi}{\arabic{enumi}.}
+\renewcommand{\labelenumii}{(\alph{enumii})}
+\begin{document}
+\maketitle
+
+\tableofcontents
+
+\begin{abstract}
+  This package uses an algorithm that is not widely adopted, at least
+  not to my knowledge, so I will briefly explain how it works.
+
+  The essence of the algorithm, in my opinion, is the chain-rule for
+  derivatives of languages.  An optimization technique that makes a
+  great difference in the performance of the algorithm, though not
+  yet implemented in this package, is \textbf{context-free
+    memoization}.
+\end{abstract}
+
+\section{Grammars}\label{sec:grammars}
+
+Before explaining what parsers are, we first need to explain what
+grammars are.
+
+In short, a grammar can be viewed as a \emph{substitution machine}.
+That is, a grammar consists of a set of variables, a set of
+constants, and a set of rules of the form
+
+\begin{equation*}
+X \mapsto Y \ldots Z,
+\end{equation*}
+
+where \(X\) is a variable and \(Y \ldots Z\) are either variables or
+constants.  The grammar then substitutes every occurrence of \(X\) by
+the sequence \(Y\ldots Z\) on the right-hand side.  There is also a
+\emph{starting variable}, usually denoted by \(S\).
+
+To sum up, we start from the sequence \(S\), and perform substitutions
+according to the rules.  If the substituted sequence cannot be further
+substituted, i.e. if the sequence consists of constants only, we say
+that this constant-only sequence is ``generated'' by the grammar.
+
+One can think of this substitution machine as macros in \TeX, except
+that only macros without arguments are allowed.  Or one can think of
+the machine as macros in programming languages like \texttt{Lisp},
+again without arguments.
+
+Notice that a variable can have multiple substitution rules, and a
+given sequence can be generated by different sets of substitutions.
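+
+As a concrete illustration, consider the following toy grammar, which
+is not taken from the package and serves only to demonstrate the
+substitution process.  It has one variable \(S\), two constants
+\texttt{a} and \texttt{b}, and two rules:
+
+\begin{align*}
+  S &\mapsto \texttt{a}\,S\,\texttt{b},\\
+  S &\mapsto \texttt{ab}.
+\end{align*}
+
+Starting from \(S\), applying the first rule twice and then the
+second rule once generates the sequence \(\texttt{aaabbb}\):
+
+\begin{equation*}
+  S
+  \;\Rightarrow\; \texttt{a}\,S\,\texttt{b}
+  \;\Rightarrow\; \texttt{aa}\,S\,\texttt{bb}
+  \;\Rightarrow\; \texttt{aaabbb}.
+\end{equation*}
+
+In general, this grammar generates exactly the sequences
+\(\texttt{a}^{k}\texttt{b}^{k}\) for \(k \geq 1\).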
+
+\section{Parsers}\label{sec:parsers}
+
+Parsers are \emph{inverses} of grammars.  More precisely, while
+grammars perform substitutions to generate sequences, parsers try to
+find out which substitutions the grammar performed in order to
+generate a given sequence.  Since there might be multiple ways to
+generate a given sequence, the parser needs to find all possible
+ways.
+
+In fact, the number of ways to generate a sequence can even be
+infinite.  For example, suppose the grammar has the following two
+rules:
+
+\begin{align*}
+S &\mapsto S\\
+S &\mapsto "".
+\end{align*}
+
+These rules mean that \(S\) can be substituted by \(S\) itself, and
+that it can be substituted by the empty sequence as well.  In this
+case, the grammar can only generate the empty sequence, but it can
+generate the empty sequence in infinitely many ways: it can first
+substitute \(S\) by itself any number of times, until it finally
+substitutes \(S\) by the empty sequence.
+
+Moreover, even if we do not consider the generations that produce
+empty sequences, the time needed to find all the ways a grammar can
+generate a given sequence is in general \(O(n^{3})\), where \(O\)
+means the \textbf{big-O} notation, and \(n\) is the length of the
+sequence to be generated.  This is why efficient general parsers that
+work for every grammar are hard to construct.
+
+\section{Parser Generators}\label{sec:parser-generators}
+
+Now we come to the topic of this package: parser generators.  Simply
+speaking, a parser generator takes a grammar as input, and produces a
+parser for that grammar as output.
+
+One can classify parser generators into two types: compiled and
+interpreted.  The former means that the generator produces the source
+code of a parser, which then needs to be compiled to obtain the
+actual parser.  The latter means that the generator directly produces
+a parser that is run by an interpreter.  The well-known
+\emph{tree-sitter} parser generator belongs to the first category,
+for example.
+
+This package is interpreted.  But there is a plan to also provide a
+\emph{compiled mode}, to potentially make the generated parsers more
+performant.
+
+\section{Augmented Backus-Naur Form}\label{sec:augm-back-naur}
+
+There are various ways to write grammars down.  This package chooses
+the \emph{Augmented Backus-Naur Form}.  See
+\href{https://en.wikipedia.org/wiki/Augmented_Backus-Naur_Form}{Wikipedia}
+for an overview, and see
+\href{https://datatracker.ietf.org/doc/html/rfc5234}{RFC 5234} for its
+detailed definition, or
+\href{https://www.rfc-editor.org/rfc/rfc5234.txt}{the text version}.
+
+The reason for this choice is that the \textbf{RFC documents} use this
+format, so there is an abundance of mature grammars to use.  For
+example, the ABNF format itself has a grammar written in the ABNF
+format, so we can immediately use the generated parser to parse
+grammars written in ABNF.
+
+There is also a plan to let the users choose their own favourite
+format, of course.
+
+\section{Languages and derivatives}\label{sec:lang-deriv}
+
+This section is involved and optional.  One can jump directly to
+section~\ref{sec:tokenization} on tokenization if the details of the
+algorithm are not so interesting.
+
+To understand the algorithm, we shall first talk about the
+derivatives of languages.
+
+For the sake of convenience, let \(\Sigma\) denote a finite set of
+symbols, called an \emph{alphabet}.  The set of all Unicode code
+points is an example.  We call a sequence of elements of \(\Sigma\) a
+\emph{sentence}, and we call a (possibly infinite) set of sentences a
+\emph{language}.  For example, the set of all grammatically correct
+English sentences is a language, to which we may give the name
+\emph{English}.
+
+Now, if \(L\) is a language and \(s\) is a sentence, we define the
+\textbf{derivative of \(L\) with respect to \(s\)}, denoted by
+\(\frac{dL}{ds}\), as the set of all sentences \(t\) such that the
+concatenated sentence \(s \cdot t\) belongs to \(L\).  We can
+re-formulate this definition in more mathematical terms:
+
+\[
+  \frac{dL}{ds} := \left\{t \text{ is a sentence } \mid s \cdot t
+  \text{ belongs to } L\right\}.
+\]
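+
+As a toy example (this language is invented purely for illustration
+and has nothing to do with the package), take
+\(\Sigma = \{\texttt{a}, \texttt{b}, \texttt{c}\}\) and let
+\(L = \{\texttt{ab}, \texttt{ac}, \texttt{b}\}\).  Then
+
+\begin{equation*}
+  \frac{dL}{d\texttt{a}} = \{\texttt{b}, \texttt{c}\},
+  \qquad
+  \frac{dL}{d\texttt{b}} = \{\varepsilon\},
+  \qquad
+  \frac{dL}{d\texttt{c}} = \emptyset,
+\end{equation*}
+
+where \(\varepsilon\) denotes the empty sentence.  Indeed,
+\texttt{ab} and \texttt{ac} begin with \texttt{a} and leave behind
+\texttt{b} and \texttt{c} respectively, the sentence \texttt{b}
+begins with \texttt{b} and leaves behind the empty sentence, and no
+sentence in \(L\) begins with \texttt{c}.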
+
+As a side note, the reason to call it the \emph{derivative} is not
+only the similarity of the definitions (to some extent), but also
+that this operation of taking the derivative of a language with
+respect to a sentence satisfies properties similar to those of the
+\emph{derivatives in Calculus}.
+
+\section{Parsing by taking derivatives}\label{sec:pars-taking-deriv}
+
+Why consider derivatives in a document about parser generators?
+Because we can parse sentences by taking derivatives.
+
+More precisely, let \(L\) be a language and let \(s\) be a sentence.
+If the derivative \(\frac{dL}{ds}\) of \(L\) by \(s\) (which is a set
+of sentences) contains the empty sentence, then by definition this
+means that \(s\) belongs to \(L\).  This observation is the basis of
+parsing by taking derivatives.
+
+\begin{quotation}
+  Those who have a rigorous mind are probably frowning at the above
+  paragraph, because we say we can parse by taking derivatives,
+  whereas we actually only determine whether the sentence belongs to
+  the language.  This gap comes from the fact that the details of
+  actually constructing the parse forests by taking derivatives are
+  too complicated for me to explain clearly here.  I might come up
+  with a clear explanation in the future, though.
+\end{quotation}
+
+\section{Regular languages}\label{sec:regular-languages}
+
+With the above definition in mind, we now say that a language \(L\) is
+\emph{regular} if and only if the set of derivatives of \(L\) is
+finite, i.e.
+
+\[
+  \left\{\frac{dL}{ds} \mid s \text{ is a sentence } \right\} \text{
+  is finite}.
+\]
+
+Note that this is a set of languages, namely, a set of sets of
+sentences.  We say it is finite if it contains finitely many sets.
+
+For those familiar with regular expressions, a language is regular if
+and only if there is a regular expression such that a sentence belongs
+to the language if and only if it matches that regular expression.  I
+chose to define regular languages using derivatives because, in my
+opinion, it makes them easier to reason about in what follows.
+
+\section{Languages of grammars}\label{sec:languages-grammars}
+
+If \(G\) is a grammar, we say that \emph{\(G\) generates the language
+  \(L\)}, where \(L\) consists of all sentences generated by the
+grammar.  We denote the language generated by \(G\) by \(L_{G}\).
+
+\section{Derivatives of the languages of grammars}
+\label{sec:deriv-lang-gramm}
+
+If \(s\) is a sentence, we ask the question:
+
+\begin{quotation}
+  What is the derivative \(\frac{dL_{G}}{ds}\) of \(L_{G}\) by \(s\)?
+\end{quotation}
+
+To answer this, note that if \(s\) and \(t\) are sentences, and if
+\(L\) is any language, then \(\frac{dL}{d(s\cdot t)}\) is equal to
+\(\frac d{dt}\frac{dL}{ds}\), the derivative by \(t\) of the
+derivative by \(s\) of \(L\).  This can be seen by unfolding the
+definition: a sentence \(r\) belongs to \(\frac{dL}{d(s\cdot t)}\) if
+and only if \((s\cdot t\cdot r)\) belongs to \(L\), which is exactly
+the condition for \((t\cdot r)\) to belong to \(\frac{dL}{ds}\).  And
+this is in turn the condition for \(r\) to belong to
+\(\frac d{dt} \frac{dL}{ds}\).
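+
+Continuing the toy example from section~\ref{sec:lang-deriv} (still
+purely for illustration), with
+\(L = \{\texttt{ab}, \texttt{ac}, \texttt{b}\}\) we can compute
+
+\begin{equation*}
+  \frac{dL}{d(\texttt{a}\cdot\texttt{b})}
+  = \frac{d}{d\texttt{b}}\,\frac{dL}{d\texttt{a}}
+  = \frac{d}{d\texttt{b}}\,\{\texttt{b}, \texttt{c}\}
+  = \{\varepsilon\},
+\end{equation*}
+
+and since this derivative contains the empty sentence
+\(\varepsilon\), we recover the fact that \(\texttt{ab}\) belongs to
+\(L\), as in section~\ref{sec:pars-taking-deriv}.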
+
+The above observation tells us that we can obtain
+\(\frac{dL_{G}}{ds}\) by successively taking derivatives of
+\(L_{G}\) by sentences consisting of one single symbol.
+
+Now consider \(\frac{dL_{G}}{da}\), where \(a\) is a single symbol.
+
+Suppose \(G\) consists of \(n\) rules
+
+\begin{equation}
+  \label{eq:rules_of_g}
+  \begin{aligned}
+    X_{1}&\mapsto Y_{1}\ldots Y_{m_{1}}\\
+    \vdots\\
+    X_{n}&\mapsto W_{1} \ldots W_{m_{n}},
+  \end{aligned}
+\end{equation}
+where the \(X_{i}\) are variables, \(m_{i}\) is an integer depending
+on \(i\), for \(1\leq i\leq n\), and
+\(Y_{1}, \ldots, Y_{m_{1}}, \ldots, W_{1}, \ldots, W_{m_{n}}\) are
+either variables or constants.
+
+It is then natural to try to determine \(\frac{dL_{G}}{da}\) by some
+kind of substitution.  So for \(1\leq i\leq n\), let \(X_{i}\) also
+denote the language that is generated if we treat the variable
+\(X_{i}\) as the starting variable of the grammar.  For instance, if
+\(X_{1}\) is the original starting variable for the grammar \(G\),
+then \(L_{G}\) is regarded as equal to \(X_{1}\) under this view.
+
+Further, we observe that if \(M_{1}\) and \(M_{2}\) are both
+languages, and if \(M\) is their concatenation
+
+\begin{equation*}
+  M := \left\{ s \cdot t\mid s \in M_{1}, t\in M_{2}\right\},
+\end{equation*}
+
+then \(\frac{dM}{da}\) is equal to either
+\[\frac{dM_{1}}{da}\cdot M_{2}\] or
+\[\frac{dM_{1}}{da} \cdot M_{2}\bigcup \frac{dM_{2}}{da},\]
+where the second case arises exactly when \(M_{1}\) contains the
+empty sentence.  This can be regarded as the \emph{product-rule} of
+the derivatives of languages, as an aside.
+
+Consequently, the derivative \(\frac{dX_{1}}{da}\) is equal to the
+union of
+
+\begin{align*}
+  &\frac{dY_{1}}{da}\cdot Y_{2}\cdots Y_{m_{1}},\\
+  &\frac{dY_{2}}{da}\cdot Y_{3}\cdots Y_{m_{1}},\\
+  &\vdots\\
+  &\frac{dY_{i}}{da}\cdot Y_{i+1}\cdots Y_{m_{1}},\\
+\end{align*}
+
+where \(i\) is the smallest integer such that the language \(Y_{i}\)
+does not contain the empty sentence.  If no such \(i\) exists, then
+we define \(i=m_{1}\).  And \emph{mutatis mutandis} for
+\(\frac{dX_{2}}{da},\ldots, \frac{dX_{n}}{da}\).
+
+Now we treat the languages \(\frac{dY_{i}}{da}\) as variables: if
+\(Y_{i}\) is a variable, this denotes the derivative by \(a\) of the
+language generated by \(Y_{i}\); if \(Y_{i}\) is a constant, then
+this denotes either the language consisting of the empty sentence or
+the empty set, depending upon whether \(Y_{i}\) is equal to \(a\) or
+not.  The previous paragraph then defines the substitution rules of
+these new variables \(\frac{dY_{i}}{da}\).  We call these rules
+\emph{the derived rules} of \(\frac{dY_{i}}{da}\).
+
+It follows that we can obtain the derivative
+\(\frac{dL_{G}}{da} = \frac{dX_{1}}{da}\) by \emph{substituting}
+these \(\frac{dY_{i}}{da}\) according to the derived rules.
+
+Combining the above, we see that the derivatives of the language
+\(L_{G}\) by any sentence are generated by grammars.  We denote the
+above grammar, which generates \(\frac{dL_{G}}{da}\), as \emph{the
+  derived grammar} \(\frac{dG}{da}\).  Now our problem is transformed
+into determining whether the successive derivatives of grammars
+\(\frac{dG}{da_{1}}, \frac d{da_{2}}\frac{dG}{da_{1}},\ldots\)
+generate the empty sentence or not.  This is relatively easy to do,
+once we can compute the successive derivatives efficiently.
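+
+To see the derived rules in action, consider another toy grammar,
+again invented purely for illustration: \(S \mapsto A\,B\),
+\(A \mapsto \texttt{a}\), \(B \mapsto \texttt{b}\), whose language is
+\(L_{G} = \{\texttt{ab}\}\).  Since the language
+\(A = \{\texttt{a}\}\) does not contain the empty sentence, the rule
+for \(S\) contributes only one term, and we obtain the derived rules
+
+\begin{align*}
+  \frac{dS}{d\texttt{a}} &\mapsto \frac{dA}{d\texttt{a}}\cdot B,\\
+  \frac{dA}{d\texttt{a}} &\mapsto \varepsilon,
+\end{align*}
+
+so that
+\(\frac{dS}{d\texttt{a}} = \{\varepsilon\}\cdot B =
+\{\texttt{b}\}\).  Taking one more derivative, this time by
+\(\texttt{b}\), the product-rule applies with its second case, since
+\(\frac{dA}{d\texttt{a}}\) contains the empty sentence:
+
+\begin{equation*}
+  \frac{d}{d\texttt{b}}\,\frac{dS}{d\texttt{a}}
+  = \frac{d}{d\texttt{b}}\left(\frac{dA}{d\texttt{a}}\right)\cdot B
+  \;\bigcup\; \frac{dB}{d\texttt{b}}
+  = \emptyset\cdot B \bigcup \{\varepsilon\}
+  = \{\varepsilon\}.
+\end{equation*}
+
+The result contains the empty sentence, confirming that
+\(\texttt{ab}\) belongs to \(L_{G}\).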
+
+\section{Chain-rule of derivatives}\label{sec:chain-rule-deriv}
+
+All of the above was not new when the author proposed the algorithm.
+The core idea of the algorithm, in my opinion, is the
+\textbf{chain-rule} of the operation of taking derivatives of
+languages.
+
+In calculus, roughly speaking, the chain-rule of derivatives refers
+to the formula
+
+\begin{equation*}
+  \frac{df}{dx} =
+  \sum_{i=1}^{n} \frac{df}{dy_{i}} \cdot \frac{dy_{i}}{dx},
+\end{equation*}
+
+where \(f\) is a function in \(n\) variables
+\(y_{1}, \ldots, y_{n}\), which are in turn functions in the variable
+\(x\).
+
+If we replace the function \(f\) by a language \(L\) generated by a
+grammar \(G\), the variables \(y_{1}, \ldots, y_{n}\) by the
+variables in \(G\), \(x\) by a symbol \(a\), the multiplication
+\(\cdot\) by the concatenation of languages, and the summation \(+\)
+by the union of languages, then we obtain a formula writing the
+derivative \(\frac{dG}{da}\) as a union of concatenations:
+
+\begin{equation}\label{eq:chain_rule}
+  \frac{dG}{da} =
+  \bigcup_{i=1}^{n}\frac{dG}{dy_{i}} \cdot \frac{dy_{i}}{da}.
+\end{equation}
+
+The only remaining questions are: what does \(\frac{dG}{dy_{i}}\)
+actually mean, and is the formula~\eqref{eq:chain_rule} true?
+
+\subsection{Derivatives with respect to variables}
+\label{sec:deriv-wrt-var}
+
+We want to generalize the construction in
+section~\ref{sec:deriv-lang-gramm} so that we can take derivatives
+with respect to variables.
+
+First, for the sake of brevity, let \(\Sigma^{\star}\) denote the set
+of all sentences, i.e. possibly empty sequences of elements of
+\(\Sigma\).
+
+We define a new alphabet
+\(\Sigma' := \Sigma\bigcup \left\{X_{i}\,\vert\, 1\leq i\leq
+  n\right\}\) and a new set of variables
+\[
+  \left\{ Z_{i}\ \bigg\vert\ 1\leq i\leq n \right\},
+\]
+where the \(X_{i}\) are the variables of \(G\), as in its
+rules~\eqref{eq:rules_of_g}.
+
+Then for each rule \(X_{i}\mapsto Y_{1}\ldots Y_{m_{i}}\), we define
+\(j\) rules: for \(1\leq k\leq j\),
+\begin{equation}\label{eq:closed_grammar_rules}
+  Z_{i}\mapsto Z_{k} \cdot Y_{k+1}\cdots Y_{m_{i}},
+\end{equation}
+where \(j\) is the smallest integer such that the language \(Y_{j}\)
+does not contain the empty sequence
+\(\varepsilon\in\Sigma^{\star}\) (as before, if no such \(j\) exists,
+we define \(j=m_{i}\)).  Here \(Z_{k}\) denotes the variable
+corresponding to \(Y_{k}\): if \(Y_{k}\) is the variable \(X_{p}\),
+then \(Z_{k}\) means \(Z_{p}\), and if \(Y_{k}\) belongs to
+\(\Sigma\), then \(Z_{k}\) is replaced by \(Y_{k}\) itself, i.e. the
+right-hand side of that rule contains no variables.
+
+This defines a grammar, which we call the \emph{closed grammar}
+\(\overline G\).
+
+With this closed grammar, we can naturally take derivatives with
+respect to the original variables \(X_{i}\), as they are now symbols
+in the alphabet of this closed grammar \(\overline G\).
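+
+For instance, applying this construction mechanically to the toy
+grammar of the previous section (\(S \mapsto A\,B\),
+\(A \mapsto \texttt{a}\), \(B \mapsto \texttt{b}\)), and writing
+\(Z_{S}, Z_{A}, Z_{B}\) for the new variables (a naming chosen here
+purely for readability), we obtain the closed grammar
+
+\begin{align*}
+  Z_{S} &\mapsto Z_{A} \cdot B,\\
+  Z_{A} &\mapsto \texttt{a},\\
+  Z_{B} &\mapsto \texttt{b},
+\end{align*}
+
+over the extended alphabet
+\(\Sigma' = \{\texttt{a}, \texttt{b}, S, A, B\}\).  For the rule of
+\(S\) we have \(j=1\), since the language \(A = \{\texttt{a}\}\)
+does not contain the empty sequence, so only the single rule
+\(Z_{S}\mapsto Z_{A}\cdot B\) is produced.  Note that each
+right-hand side contains at most one variable, and that variable
+appears first; this is the key to the regularity result of
+section~\ref{sec:clos-gramm-regul} below.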
+
+\subsection{Proof of the chain-rule}\label{sec:proof-chain-rule}
+
+\begin{prop}\label{prop:proof-of-chain-rule}
+  \[
+    \frac{dL_{G}}{da} = \bigcup_{i=1}^{n} \frac{dX_{i}}{da} \cdot
+    \left( \frac{dL_{\overline G}}{dX_{i}} \bigcap
+      \Sigma^{\star} \right) ,\, \forall
+    a\in\Sigma.
+  \]
+\end{prop}
+
+\begin{proof}
+  For any \(i\), any sequence in \(\frac{dL_{\overline G}}{dX_{i}}\)
+  is generated from \(G\) by partial substitutions, together with
+  the removal of empty sequences.  So the containment of the
+  right-hand side in the left-hand side is obvious.
+
+  Conversely, if a sentence \(s\in\Sigma^{\star}\) that belongs to
+  \(L_{G}\) begins with \(a\), then \(s\) must be obtained by
+  substitution from some sentence in \({(\Sigma')}^{\star}\), which
+  means that the remainder of \(s\) after \(a\) belongs to the
+  right-hand side.  So the proof is complete.
+\end{proof}
+
+\begin{rmk}
+  It follows that we can determine whether or not
+  \(\frac{dL_{G}}{da}\) contains the empty sequence by examining
+  \(\frac{dX_{i}}{da}\) and \(\frac{dL_{\overline G}}{dX_{i}}\)
+  respectively.
+
+  Further, since
+  \(\frac{dX_{i}}{da} = \frac{dZ_{i}}{da}\bigcap\Sigma^{\star}\) and
+  \(\frac{dL_{\overline G}}{dX_{i}} = \frac{dZ_{1}}{dX_{i}}\), we
+  conclude that we only have to examine \(\frac{dZ_{i}}{dX_{j}}\),
+  for \(1\leq i,j\leq n\).
+\end{rmk}
+
+What makes this grammar interesting is its regularity.
+
+\subsection{The closed grammar is regular}\label{sec:clos-gramm-regul}
+
+\begin{prop}
+  The closed grammar is regular.  In fact, the set
+  \[
+    \left\{\frac{dZ_{i}}{ds}\ \bigg\vert\
+      1\leq i\leq n,\,s\in{(\Sigma')}^{\star}\right\}
+  \]
+  is finite.
+\end{prop}
+
+\begin{proof}
+  Observe the rules~\eqref{eq:closed_grammar_rules}.  From the
+  definition, we see that the right-hand side of a rule contains at
+  most one variable, \(Z_{k}\), and that variable must appear as the
+  first symbol of the sequence on the right-hand side.  This
+  observation is sufficient to show that the closed grammar is
+  regular.  The rigorous-minded reader can prove this statement by
+  first showing that the language generated by the closed grammar is
+  a finite union of concatenations of languages of two types: either
+  a set consisting of one single finite sentence, or a set consisting
+  of zero or more self-repetitions of a finite sequence.  Then the
+  reader can easily verify that such languages are regular, by
+  showing that the derivatives of a language consisting of zero or
+  more self-repetitions of a finite sequence are finitely many.  A
+  hint for the last step is to notice that the derivatives of the
+  language \(s^{\star}\) of zero or more self-repetitions of a
+  sequence \(s:=s_{1},\ldots,s_{p}\) can be injected into the finite
+  set
+  \[
+    \left\{\left\{s_{i},s_{i+1},\ldots,s_{p}\right\}\cdot s^{\star}
+      \mid 1\leq i\leq p+1\right\}.
+  \]
+\end{proof}
+
+\section{Tokenization}\label{sec:tokenization}
+
+There is another abstraction used by this package.
+
+
+
+\end{document}
\ No newline at end of file
--
cgit v1.2.3-18-g5258