#+AUTHOR: Durand
#+EMAIL: durand@jsdurand.xyz
#+DATE: <2022-09-30 Ven 14:45>
#+STARTUP: fold

This document records an attempt to design a parser generator
library.

* Goals

- Modular

  The library should be modular enough that the following components
  of the parser generator can be easily substituted, either by
  external crates or even by external foreign functions.

  + Regular expression matchers

    The goal of this project is a parser generator; a regular
    expression matcher is not its core concern.  So I shall write the
    library in a way that allows other means of matching regular
    expressions, should some users wish to use them.  Since I do not
    want to use external dependencies by default, I will also write a
    small regular expression matcher library to serve as the default
    matcher.  This will allow me to try different optimization
    techniques for regular expression matchers as well.

  + Nondeterministic Finite Automata

    The project needs to manipulate nondeterministic finite automata.
    As with regular expression matchers, I will write my own library
    for dealing with NFAs, while preserving the possibility to "plug
    in" other external crates for this purpose as well.

  + Graph data type

    The internal data of the parser generator, as used by the current
    algorithm, is a directed graph.  While this is easy to represent
    with an adjacency list, I will not underestimate the potential
    optimizations offered by external libraries.  Moreover, the
    underlying data structures of regular expressions and
    nondeterministic finite automata are graphs as well, so in the
    default implementation they should share the same data structure,
    with the option to plug in external crates, should the need
    arise.

- Abstract

  One of the deficiencies of the current version of this project is
  that it does not mirror the abstract ideas of the algorithm
  closely.  This means that, when the implementation of the algorithm
  goes wrong, I cannot precisely pinpoint the cause of the erroneous
  behaviour.  The reason for the departure from the abstract ideas
  was that I thought it could lead to some performance gains.
  Apparently I was so obsessed with premature optimization that I
  ignored the problems it caused.

** Update

Now I realize that a nondeterministic finite automaton is equivalent
to a regular expression, so it makes no sense to deal with NFAs
without dealing with regular expressions.  Also, dealing with regular
expressions directly makes it easier to extract the "atomic
languages" of grammars and to present them in a readable way.  Still,
the current focus is not on matching regular expressions, but on the
structure of regular expressions.

* Details

According to the above goals, I have to write three sub-crates, or
workspace members: one for regular expression matchers, one for NFAs,
and one for graphs.  Each of these three crates is easy to write, in
my opinion: I just have to pack the old code into separate crates, to
separate concerns.  Once this is done, I can easily plug in other
crates as well, since one can simply wrap an external crate so that
it satisfies the expected interface; one possible shape of such an
interface is sketched at the end of this section.

Moreover, in order to achieve the goal of abstraction, I need modules
for "atoms" and "languages".  Atoms represent the regular languages
of derivatives of the grammar; this is perhaps the most obscure part,
in my experience.  Languages represent the derivative of the grammar
with respect to inputs.  From my experience, it is not a good idea to
calculate the semiring values after the derivatives have been
calculated.  Instead, we shall calculate the semiring values at the
same time as the derivatives of the grammar, and record the values in
a dedicated recorder.
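To make the recorder idea concrete, here is a minimal sketch of what
the semiring and recorder interfaces could look like.  All names
here, such as =Semiring=, =Recorder=, and =BoolRing=, are
hypothetical illustrations, not the crate's actual API.

#+begin_src rust
/// A semiring: two associative operations with identities.
pub trait Semiring {
    fn zero() -> Self;
    fn one() -> Self;
    fn add(&self, other: &Self) -> Self;
    fn mul(&self, other: &Self) -> Self;
}

/// The Boolean semiring, the simplest instance: it only records
/// whether the input is recognized at all.
#[derive(Clone, Copy, Debug)]
pub struct BoolRing(pub bool);

impl Semiring for BoolRing {
    fn zero() -> Self { BoolRing(false) }
    fn one() -> Self { BoolRing(true) }
    fn add(&self, other: &Self) -> Self { BoolRing(self.0 || other.0) }
    fn mul(&self, other: &Self) -> Self { BoolRing(self.0 && other.0) }
}

/// A recorder is handed one value per derivative step, so the values
/// are collected while deriving and never reconstructed afterwards.
pub trait Recorder<S: Semiring> {
    fn record(&mut self, step: usize, value: S);
}
#+end_src

A derivative-taking function would then be generic over both traits
and call =record= once per step, instead of traversing the result
again after the derivatives are complete.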
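As for the pluggable interface mentioned above: one way to keep a
component substitutable is to define its sub-crate around a trait,
implement the trait for the default type, and let thin wrappers
implement the same trait on top of external crates.  The following is
only a sketch of that approach for the graph component; =Graph= and
=AdjList= are made-up names.

#+begin_src rust
/// Vertices are identified by indices in 0..vertex_count().
pub trait Graph {
    fn vertex_count(&self) -> usize;
    /// Targets of the edges leaving `vertex`.
    fn neighbors(&self, vertex: usize) -> &[usize];
}

/// The default adjacency-list representation, which the regular
/// expression, NFA, and parser generator crates could share.
pub struct AdjList {
    edges: Vec<Vec<usize>>,
}

impl AdjList {
    pub fn new(edges: Vec<Vec<usize>>) -> Self {
        AdjList { edges }
    }
}

impl Graph for AdjList {
    fn vertex_count(&self) -> usize {
        self.edges.len()
    }
    fn neighbors(&self, vertex: usize) -> &[usize] {
        &self.edges[vertex]
    }
}
#+end_src

A wrapper type around an external graph crate would implement =Graph=
in the same way, so the rest of the library never needs to know which
representation is in use.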
* Tokenizers

In my opinion, tokenization is just a speed-up mechanism: the user
should not have to worry about the distinction between tokens and
non-terminals.  This means I want the parser generator to work at the
character, or byte, level by default.  When the user wants a
dedicated tokenizer, to gain some speed for example, the user can
supply a function that eats a string, or an array of bytes, and
returns an array of tokens.

This decision means that we do not need regular expressions by
default.  We shall add regular expressions back later, when we want
to decrease the constant factor involved in the algorithmic
complexity.  When we do so, we will need a way to automatically
deduce regular expression terminals.  In the current version, a
non-terminal whose rules involve only terminals is automatically
converted to a regular expression terminal.  I think this mechanism
can be adapted in the future.

* Grammars