From cb7bcfad4ab0041aaf3fde3185e27ee46bb37788 Mon Sep 17 00:00:00 2001
From: JSDurand
Date: Tue, 15 Nov 2022 12:01:28 +0800
Subject: Initial commit

Basic GNU standard files are added, and we now stop worrying about
monadic anamorphisms.  The current focus is on testing the
correctness of the algorithm, so I need convenient support for
manipulating, interpreting, examining, and perchance animating
nondeterministic automata.
---
 DESIGN.org | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)
 create mode 100644 DESIGN.org

(limited to 'DESIGN.org')

diff --git a/DESIGN.org b/DESIGN.org
new file mode 100644
index 0000000..18834cb
--- /dev/null
+++ b/DESIGN.org
@@ -0,0 +1,114 @@
+#+AUTHOR: Durand
+#+EMAIL: durand@jsdurand.xyz
+#+DATE: <2022-09-30 Ven 14:45>
+#+STARTUP: fold
+
+This document records an attempt to design a parser generator
+library.
+
+* Goals
+
+- Modular
+
+  The library should be modular enough that the following components
+  of the parser generator can be easily substituted by external
+  crates or even by external foreign functions.
+
+  + Regular expression matchers
+
+    The goal of this project is a parser generator; a regular
+    expression matcher is not its core concern.  So I shall write the
+    library in a way that can use other means to match regular
+    expressions, should some users wish to do so.
+
+    Since I do not want to use external dependencies by default, I
+    will also write a small regular expression matcher library to
+    serve as the default matcher.  This will allow me to try
+    different optimization techniques for regular expression matchers
+    as well.
+
+  + Nondeterministic Finite Automata
+
+    The project needs to manipulate nondeterministic finite automata.
+    As for regular expression matchers, I will write my own library
+    for dealing with NFA's, and in the meantime preserve the
+    possibility to "plug in" other external crates for this purpose
+    as well.
+
+  + Graph data type
+
+    The internal data of the parser generator, as used by the
+    current algorithm, is a directed graph.  While this is easy to
+    represent using the adjacency-list representation, I will not
+    underestimate the potential optimizations brought by external
+    libraries.
+
+    Moreover, the underlying data structures of regular expressions
+    and nondeterministic finite automata are graphs as well.  So they
+    should be able to share the same data structure in the default
+    implementation, and to plug in external crates, should the need
+    arise.
+
+- Abstract
+
+  One of the deficiencies of the current version of this project is
+  that it does not mirror the abstract ideas of the algorithm
+  closely.  This means that, when the implementation of the
+  algorithm goes wrong, I cannot precisely pinpoint the cause of the
+  erroneous behaviour.
+
+  The reason for the departure from the abstract ideas was that I
+  thought it could lead to some performance gains.  Apparently I was
+  so obsessed with premature optimizations that I ignored the
+  problems this caused.
+
+** Update
+
+Now I realize that a nondeterministic finite automaton is equivalent
+to a regular expression, so it makes no sense to deal with NFAs
+without dealing with regular expressions.  Also, dealing with regular
+expressions directly makes it easier to extract the "atomic
+languages" of grammars and to present them in a readable way.
+
+Still, the current focus is not on matching regular expressions, but
+on the structures of regular expressions.
+
+* Details
+
+According to the above goals, I have to write three sub-crates, or
+workspace members: one for regular expression matchers, one for
+NFA's, and one for graphs.
+
+Each of these three crates is easy to write, in my opinion: I just
+have to pack the old code into separate crates, to separate concerns.
+Once this is done, I can easily plug in other crates as well, since
+one can just wrap up external crates.
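The "wrap up external crates" idea amounts to programming against an interface: the generator depends only on a trait, and either the in-house default or a wrapper around an external crate implements it. A minimal sketch of this pattern follows; every name in it is a hypothetical illustration, not the project's actual API.

```rust
/// Hypothetical interface the parser generator would depend on.
/// The default in-house matcher and any wrapper around an external
/// regular expression crate would both implement this trait.
pub trait Matcher {
    /// Returns true if the whole input is in the matcher's language.
    fn accepts(&self, input: &str) -> bool;
}

/// A stand-in for the small default matcher: matches one literal.
pub struct Literal(pub &'static str);

impl Matcher for Literal {
    fn accepts(&self, input: &str) -> bool {
        self.0 == input
    }
}

/// Generic code written against the trait works with any matcher,
/// so swapping in an external crate means adding one impl block.
pub fn filter_matching<'a, M: Matcher>(m: &M, inputs: &[&'a str]) -> Vec<&'a str> {
    inputs.iter().copied().filter(|s| m.accepts(s)).collect()
}
```

With such a trait, "plugging in" an external crate or even a foreign function is a matter of writing one adapter impl, leaving the rest of the generator untouched.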
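The adjacency-list representation mentioned under the graph data type can be as simple as a vector of successor vectors. A hedged sketch of the default implementation, assuming nodes are dense indices (the actual graph crate may expose a different interface):

```rust
/// A directed graph in adjacency-list form: `edges[n]` holds the
/// successors of node `n`.  Names here are assumptions for
/// illustration, not the project's real types.
#[derive(Debug, Default)]
pub struct AdjGraph {
    edges: Vec<Vec<usize>>,
}

impl AdjGraph {
    /// Adds a fresh node and returns its index.
    pub fn add_node(&mut self) -> usize {
        self.edges.push(Vec::new());
        self.edges.len() - 1
    }

    /// Adds a directed edge from `src` to `dst`.
    /// Panics if `src` is not a node of the graph.
    pub fn add_edge(&mut self, src: usize, dst: usize) {
        self.edges[src].push(dst);
    }

    /// Returns the successors of `node`.
    pub fn successors(&self, node: usize) -> &[usize] {
        &self.edges[node]
    }
}
```

Since regular expressions and NFA's are graphs too, both crates could build on this one structure by default, while an external graph crate could be substituted behind the same interface.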
+
+Moreover, in order to achieve the goal of abstraction, I need modules
+for "atoms" and "languages".  Atoms represent the regular languages
+of derivatives of the grammar.  This is perhaps the most obscure
+part, in my experience.  Languages represent the derivatives of the
+grammar with respect to inputs.  From my experience, it is not a
+good idea to calculate the semiring values after the derivatives
+have been calculated.  Instead, we shall calculate the semiring
+values at the same time as we calculate the derivatives of the
+grammar, and record the values in a dedicated recorder.
+
+* Tokenizers
+
+In my opinion, tokenization is just a speed-up.  The user should not
+have to worry about the distinction between tokens and
+non-terminals.  This means I want the parser generator to work at
+the character, or byte, level by default.  When the user wants to
+use a dedicated tokenizer, to gain some speed for example, the user
+can supply a function that consumes a string, or an array of bytes,
+and returns an array of tokens.
+
+This decision means that we do not need regular expressions by
+default.  We shall add regular expressions back later, when we want
+to decrease the constant factor involved in the algorithmic
+complexities.  When we do so, we will need a way to automatically
+deduce regular expression terminals.  In the current version, a
+non-terminal whose rules involve only terminals will be
+automatically converted to a regular expression terminal.  I think
+this mechanism can be adapted in the future.
+
+* Grammars
+
--
cgit v1.2.3-18-g5258