======================================================================
Rust, Emacs, and Parsers
======================================================================

About
-----

This is a parser generator to be used as an Emacs dynamic package.
First I shall explain how to compile this package, and then I shall
briefly explain the background information and how to use this
package.

COMPILE
-------

Download compiled library
=========================

// TODO: Make the website.

One can download a compiled dynamic library from the website: .

Manual compilation
==================

First clone this package:

----------------------------------------------
| git clone https://git.jsdurand.xyz/rep.git |
----------------------------------------------

Then go into the cloned directory:

----------
| cd rep |
----------

[ Make sure the Rust tools are already installed.  If not, install
the Rust compiler tools following the instructions on the website . ]

Then run the autotools:

-------------------
| autoreconf -vif |
-------------------

// TODO: Make the two build options.

Now there are two build options: debug and release.  The debug build
is for debugging the package, so it turns off optimisations and
includes debug information.  It is not recommended for normal use.
The release build is the default, and is more efficient than the
debug build.

If one wants to use the debug build, run:

-----------------------
| ./configure --debug |
-----------------------

Otherwise just run:

---------------
| ./configure |
---------------

Finally run 'make' to build the package:

--------
| make |
--------

Now there should be a file 'rep.so' in the 'src' directory.  One can
load that file directly, or move the dynamic library to another, more
convenient location of choice.

The commands to build this package are summarised as follows:

----------------------------------------------
| git clone https://git.jsdurand.xyz/rep.git |
| cd rep                                     |
| autoreconf -vif                            |
| ./configure                                |
| make                                       |
----------------------------------------------

Build on Windows
================

Since, in my experience, the autotools do not work well on Windows,
users of Windows need another convenient way to compile the package.
So I wrote a script to automate this process: just run

// TODO: Figure out how to do this.

Test
----

The package contains an Emacs Lisp file 'src/test.el' that
demonstrates a simple testing scenario.  Simply open the file in an
Emacs instance and run the Emacs Lisp forms one by one to see if it
works. :D

======================================================================
About parser generators
======================================================================

Grammars
--------

Before explaining what parsers are, we first need to explain what
grammars are.  In short, a grammar can be viewed as a "substitution
machine".  That is, a grammar consists of a set of variables, a set
of constants, and a set of rules of the form

----------------
| X => Y ... Z |
----------------

where X is a variable and Y ... Z are either variables or constants.
The grammar substitutes occurrences of X by the sequence Y ... Z on
the right-hand side.  In addition, there is a "starting variable",
usually denoted as "S".

To sum up, we start from the sequence "S" and perform substitutions
according to the rules.  If the substituted sequence cannot be
further substituted, i.e. if the sequence consists of constants only,
we say that this constant-only sequence is "generated" by the
grammar.
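
To make the substitution machine concrete, here is a minimal sketch
in Rust; the representation is made up for this README and is not the
one used by this package:

--------------------------------------------------------------
| // A grammar as a "substitution machine".  A sketch only:  |
| // not the representation used by this package.            |
|                                                            |
| #[derive(Clone, Copy)]                                     |
| enum Sym {                                                 |
|     Var(&'static str), // a variable                       |
|     Con(char),         // a constant                       |
| }                                                          |
|                                                            |
| // A rule of the form  X => Y ... Z.                       |
| struct Rule(&'static str, Vec<Sym>);                       |
|                                                            |
| // Substitute the leftmost variable using its first rule,  |
| // until the sequence consists of constants only.  This    |
| // assumes the process terminates: a rule like  S => S     |
| // would loop forever.                                     |
| fn generate(start: &'static str, rs: &[Rule]) -> String {  |
|     let mut seq = vec![Sym::Var(start)];                   |
|     while let Some(i) = seq                                |
|         .iter()                                            |
|         .position(|s| matches!(s, Sym::Var(_)))            |
|     {                                                      |
|         let x = match seq[i] {                             |
|             Sym::Var(x) => x,                              |
|             _ => unreachable!(),                           |
|         };                                                 |
|         let rule = rs.iter().find(|r| r.0 == x).unwrap(); |
|         // Replace this occurrence of X by Y ... Z.        |
|         let mut next = seq[..i].to_vec();                  |
|         next.extend(rule.1.iter().copied());               |
|         next.extend_from_slice(&seq[i + 1..]);             |
|         seq = next;                                        |
|     }                                                      |
|     seq.iter()                                             |
|         .map(|s| match s {                                 |
|             Sym::Con(c) => *c,                             |
|             _ => unreachable!(),                           |
|         })                                                 |
|         .collect()                                         |
| }                                                          |
|                                                            |
| fn main() {                                                |
|     // S => "a" T,  T => "b"                               |
|     let rules = [                                          |
|         Rule("S", vec![Sym::Con('a'), Sym::Var("T")]),     |
|         Rule("T", vec![Sym::Con('b')]),                    |
|     ];                                                     |
|     assert_eq!(generate("S", &rules), "ab");               |
| }                                                          |
--------------------------------------------------------------

Note that 'generate' always picks the first rule of the leftmost
variable; a real grammar may provide several rules for a variable, as
discussed next.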
One can think of this substitution machine as macros in TeX, except
that it only allows macros without arguments.  Or one can think of
the machine as macros in programming languages, again without
arguments.  Notice that a variable can have multiple rules by which
to substitute it, and a given sequence can be generated by different
sets of substitutions.

Parsers
-------

Parsers are "inverses" of grammars.  More precisely, while grammars
perform substitutions to generate sequences, parsers try to find out
what substitutions the grammar performed in order to generate a given
sequence.  Since there might be multiple ways to generate a given
sequence, the parser needs to find all possible ways.

In fact, the number of ways to generate a sequence can even be
infinite.  For example, suppose the grammar has the following two
rules:

-----------
| S => S  |
| S => "" |
-----------

This grammar means that S can be substituted by S itself, and that it
can be substituted by the empty sequence as well.  In this case, the
grammar can only generate the empty sequence, but it can generate the
empty sequence in infinitely many ways: it can first substitute S by
itself any number of times, until it finally substitutes S by the
empty sequence.

Moreover, even if we do not count the generations that produce empty
sequences, finding all the ways in which a grammar can generate a
given sequence is expensive: the classical general parsing algorithms
take time O(n^3) in the worst case, where "O" means the big-O
notation and n is the length of the sequence to be parsed.  This is
why general parsers are hard to construct for every grammar.

Parser Generators
-----------------

Now we come to the topic of this package: parser generators.  Simply
speaking, a parser generator takes a grammar as input, and produces a
parser for that grammar as output.

One can classify parser generators into two types: compiled and
interpreted.  The former means that the generator produces the source
code of a parser, which needs to be compiled to obtain the actual
parser.  The latter means that the generator directly produces a
parser that is run by an interpreter.  The well-known "tree-sitter"
parser generator belongs to the first category, for example.

This package is of the interpreted type.  But there is a plan to also
offer a "compiled mode", to potentially make the generated parsers
more performant.

Augmented Backus-Naur Form
--------------------------

There are various ways to write grammars.  This package chooses the
Augmented Backus-Naur Form (ABNF).  See the Wikipedia page for an
overview: , and see RFC 5234 for its detailed definition: or .

The reason for this choice is that the RFC documents use this format,
so there is an abundance of mature grammars to use.  For example, the
ABNF format itself has a grammar in the ABNF format, so we can use
the generated parser to parse ABNF formats, immediately.
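
To give a flavour of the format, here is a small grammar in ABNF;
this particular rule set is made up for this README, while ALPHA and
CRLF are core rules defined in RFC 5234:

-----------------------------------
| greeting = "Hello, " name CRLF  |
| name     = 1*ALPHA              |
-----------------------------------

Here "greeting" and "name" are variables, the quoted strings are
sequences of constants, and "1*ALPHA" means one or more repetitions
of the core rule ALPHA, that is, one or more alphabetic characters.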
There is also a plan to let the users choose their own favourite
format, of course.

Algorithm
---------

[ This section is a little (too) involved.  Those who are not
interested in the details of the algorithm can skip directly to the
next section, "Tokenization", for example by searching for a
continuous sequence of the minus character '-'. ]

This package uses an algorithm that is not widely adopted, at least
not to my knowledge.  So I think I should briefly explain how it
works.

Languages and derivatives
=========================

To understand the algorithm, we shall first talk about the
derivatives of languages.  For the sake of convenience, let Sigma
denote a finite set of symbols, called the alphabet.  The set of all
Unicode code points can be an example.

We call a sequence of elements of Sigma a "sentence", and we call a
(possibly infinite) set of sentences a "language".  For example, the
set of all grammatically correct English sentences is a language, to
which we may give the name "English".

Now, if L is a language and s is a sentence, we define the
"derivative of L with respect to s", denoted as dL/ds, as the set of
all sentences "t" such that the concatenated sentence "s . t" belongs
to L.  We can re-formulate this definition in more mathematical
terms:

-----------------------------------------------------
| dL/ds := { t is a sentence | s . t belongs to L } |
-----------------------------------------------------

As a side note, the reason to call it the derivative is not only the
definitional similarity (to some extent), but also that this
operation of taking the derivative of a language with respect to a
sentence satisfies properties similar to those of the derivatives in
Calculus.

Parsing by taking derivatives
=============================

Why consider derivatives in a document about parser generators?
Because we can parse sentences by taking derivatives.

More precisely, let L be a language and let s be a sentence.  If the
derivative dL/ds of L by s (which is a set of sentences) contains the
empty sentence, then by definition this means that s belongs to L.
This observation is the basis of parsing by taking derivatives.

[ Those who have a rigorous mind are probably frowning at the above
paragraph, because we say we can parse by taking derivatives, whereas
we actually only determine whether the sentence belongs to the
language.  This gap comes from the fact that the details of actually
constructing the parse forests by taking derivatives are too
complicated for me to explain clearly here.  I might come up with a
clear explanation in the future, though. ]

Regular languages
=================

With the above definition in mind, we now say that a language L is
"regular" if and only if the set of derivatives of L is finite, i.e.

------------------------------------------
| { dL/ds | s is a sentence } is finite. |
------------------------------------------

Note that this is a set of languages, namely, a set of sets of
sentences.  And we say it is finite if it contains finitely many
sets.

For those familiar with regular expressions: a language is regular if
and only if there is a regular expression such that a sentence
belongs to the language if and only if it matches that regular
expression.  I chose to define regular languages using derivatives,
as that makes them easier to reason about in what follows, in my
opinion.
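
For regular languages, parsing by taking derivatives can be carried
out very concretely.  The following Rust sketch implements the
classical derivative-based matching of regular expressions, a
well-known construction due to Brzozowski; it is not the
implementation of this package, and its types and names are made up
for illustration.  Successive derivatives are taken by each symbol of
the input, and at the end we check whether the resulting language
contains the empty sentence:

--------------------------------------------------------------
| // Parsing (matching) by taking derivatives, for regular   |
| // expressions only (a classical construction, not the     |
| // implementation used by this package).                   |
|                                                            |
| #[derive(Clone)]                                           |
| enum Re {                                                  |
|     Empty,                 // the empty set: no sentence   |
|     Eps,                   // the language { "" }          |
|     Con(char),             // a single constant            |
|     Cat(Box<Re>, Box<Re>), // concatenation                |
|     Alt(Box<Re>, Box<Re>), // union                        |
|     Star(Box<Re>),         // zero or more repetitions     |
| }                                                          |
|                                                            |
| use Re::*;                                                 |
|                                                            |
| // Does the language of `re` contain the empty sentence?   |
| fn nullable(re: &Re) -> bool {                             |
|     match re {                                             |
|         Empty | Con(_) => false,                           |
|         Eps | Star(_) => true,                             |
|         Cat(a, b) => nullable(a) && nullable(b),           |
|         Alt(a, b) => nullable(a) || nullable(b),           |
|     }                                                      |
| }                                                          |
|                                                            |
| // The derivative of `re` by the single symbol c.          |
| fn deriv(re: &Re, c: char) -> Re {                         |
|     match re {                                             |
|         Empty | Eps => Empty,                              |
|         Con(d) => {                                        |
|             if *d == c { Eps } else { Empty }              |
|         }                                                  |
|         Alt(a, b) => Alt(                                  |
|             Box::new(deriv(a, c)),                         |
|             Box::new(deriv(b, c)),                         |
|         ),                                                 |
|         Cat(a, b) => {                                     |
|             // d(A . B)/dc contains (dA/dc) . B, and       |
|             // also dB/dc when A contains "".              |
|             let l = Cat(Box::new(deriv(a, c)), b.clone()); |
|             if nullable(a) {                               |
|                 Alt(Box::new(l), Box::new(deriv(b, c)))    |
|             } else {                                       |
|                 l                                          |
|             }                                              |
|         }                                                  |
|         Star(a) => Cat(                                    |
|             Box::new(deriv(a, c)),                         |
|             Box::new(Star(a.clone())),                     |
|         ),                                                 |
|     }                                                      |
| }                                                          |
|                                                            |
| // s belongs to L if and only if dL/ds contains the        |
| // empty sentence.                                         |
| fn matches(re: &Re, s: &str) -> bool {                     |
|     let mut re = re.clone();                               |
|     for c in s.chars() {                                   |
|         re = deriv(&re, c);                                |
|     }                                                      |
|     nullable(&re)                                          |
| }                                                          |
|                                                            |
| fn main() {                                                |
|     // ("ab")* matches "", "ab", "abab", ...               |
|     let ab = Cat(Box::new(Con('a')), Box::new(Con('b'))); |
|     let re = Star(Box::new(ab));                           |
|     assert!(matches(&re, "abab"));                         |
|     assert!(!matches(&re, "aba"));                         |
| }                                                          |
--------------------------------------------------------------

The function 'nullable' answers exactly the question posed above:
whether the language of an expression contains the empty sentence.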
Languages of grammars
=====================

If G is a grammar, we say that "G generates a language L", where L
consists of all sentences generated by the grammar.  Denote the
language generated by G as L_G.

Derivatives of the languages of grammars
========================================

If s is a sentence, we ask the question:

---------------------------------------
| What is the derivative of L_G by s? |
---------------------------------------

To answer this, note that if s and t are sentences, and if L is any
language, then dL/d(s . t) is equal to d( dL/ds )/dt, the derivative
by t of the derivative by s of L.  This can be seen from the
definition: a sentence r belongs to dL/d(s . t) if and only if
(s . t . r) belongs to L, which is exactly the condition for (t . r)
to belong to dL/ds.  And this is in turn the condition for r to
belong to d( dL/ds )/dt.

This observation tells us that we can obtain dL_G/ds by taking
derivatives of L_G by sentences consisting of one single symbol,
successively.

Now consider dL_G/da, for a single symbol a.  Suppose G consists of
n rules

--------------------------
| X_1 => Y_1 ... Y_{m_1} |
| ... => ...             |
| X_n => Z_1 ... Z_{m_n} |
--------------------------

where the X_i are variables, m_i is an integer depending on i, for i
from 1 to n, and Y_1, ..., Y_{m_1}, ..., Z_1, ..., Z_{m_n} are either
variables or constants.

It is then natural to try to determine dL_G/da by some kind of
substitution.  So for i from 1 to n, let X_i also denote the language
that is generated if we treat the variable X_i as the starting
variable of the grammar.  For instance, if X_1 is the original
starting variable of the grammar G, then L_G is regarded as equal to
X_1 under this view.

Further, we observe that if both M_1 and M_2 are languages, and if M
is their concatenation:

-------------------------------------------------
| M = { s . t | s belongs to M_1 and t to M_2 } |
-------------------------------------------------

then dM/da is equal to either

---------------------
| ( dM_1/da ) . M_2 |
---------------------

or

-----------------------------------------------
| the union of ( dM_1/da ) . M_2 and dM_2/da, |
-----------------------------------------------

depending on whether or not M_1 contains the empty sentence.

Consequently, the derivative dX_1/da is equal to the union of

----------------------------------
| dY_1/da . Y_2 ... Y_{m_1},     |
| ...,                           |
| dY_i/da . Y_{i+1} ... Y_{m_1}, |
----------------------------------

where i is the smallest integer such that the language Y_i does not
contain the empty sentence; if no such i exists, the union runs up to
i = m_1.  And similarly for dX_2/da, ..., dX_n/da.

Now we treat the languages dY_i/da as variables: if Y_i is a
variable, dY_i/da denotes the derivative by a of the language
generated by Y_i; if Y_i is a constant, then dY_i/da denotes either
the language consisting of the empty sentence, or the empty set,
depending upon whether Y_i is equal to a or not.  Then the previous
paragraph defines the substitution rules of these new variables
dY_i/da.  We call these rules "the derived rules" of dY_i/da.

It follows that we can obtain the derivative

---------------------
| dL_G/da = dX_1/da |
---------------------

by "substituting" these dY_i/da according to the derived rules.

Combining the above, we see that the derivatives of the language L_G
by any sentence are generated by grammars.  We denote the grammar
constructed above, which generates dL_G/da, as "the derived grammar"
dG/da.

Now our problem is transformed into determining whether the
successive derivatives of grammars dG/da_1, d( dG/da_1 )/da_2, et
cetera, generate the empty sentence.  This is relatively easy to do.
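
Before moving on to the core idea, here is a small worked example of
the derived rules; the grammar is made up for this README.  Consider
the grammar G with the two rules

------------------
| S => "a" S "b" |
| S => ""        |
------------------

which generates "", "ab", "aabb", et cetera.  For the first rule, the
derivative of the constant language "a" by a is the language
consisting of the empty sentence, and "a" itself does not contain the
empty sentence, so the union stops at i = 1 and the rule contributes
the derived rule

------------------
| dS/da => S "b" |
------------------

For the second rule, the derivative by a of the language consisting
of the empty sentence is the empty set, so it contributes nothing.
To check that "ab" is generated by G, take one more derivative, by b:
since the language S contains the empty sentence, d( S "b" )/db is
the union of ( dS/db ) . "b" and d"b"/db.  The former is the empty
set, as no rule lets S generate a sentence starting with b, while the
latter consists of the empty sentence.  Hence the result contains the
empty sentence, and "ab" indeed belongs to L_G.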
Chain-rule of derivatives
=========================

All of the above was not new when the author proposed the algorithm.
The core idea of the algorithm, in my opinion, is the "chain-rule" of
the operation of taking derivatives of languages.

In calculus, roughly speaking, the chain-rule of derivatives refers
to the formula

--------------------------------------------------------
| df/dx = df/dy_1 . dy_1/dx + ... + df/dy_n . dy_n/dx, |
--------------------------------------------------------

where f is a function in n variables y_1, ..., y_n, which are in turn
functions in the variable x.

If we replace the function f by a language L generated by a grammar
G, the variables y_1, ..., y_n by the variables in G, x by a symbol,
the multiplication "." by the concatenation of languages, and the
summation "+" by the union of languages, then we obtain a formula
writing the derivative dG/da as a union of the concatenations

----------------------
| dG/dy_i . dy_i/da, |
----------------------

for i from 1 to n.  The only remaining question is:

---------------------------
| What does dG/dy_i mean? |
---------------------------

TODO: Provide more details.

Tokenization
------------

There is another abstraction used by this package.