authorJSDurand <mmemmew@gmail.com>2022-11-15 12:01:28 +0800
committerJSDurand <mmemmew@gmail.com>2022-11-15 12:01:28 +0800
commitcb7bcfad4ab0041aaf3fde3185e27ee46bb37788 (patch)
treea4fd99b138b72617b6c4c2b04f5d2655d0fedcc5 /DESIGN.org
Initial commit
Basic GNU standard files are added, and we now stop worrying about monadic anamorphisms. The current focus is on testing the correctness of the algorithm, so I need convenient support for manipulating, interpreting, examining, and perhaps animating nondeterministic automata.
Diffstat (limited to 'DESIGN.org')
-rw-r--r--  DESIGN.org  114
1 file changed, 114 insertions, 0 deletions
diff --git a/DESIGN.org b/DESIGN.org
new file mode 100644
index 0000000..18834cb
--- /dev/null
+++ b/DESIGN.org
@@ -0,0 +1,114 @@
+#+AUTHOR: Durand
+#+EMAIL: durand@jsdurand.xyz
+#+DATE: <2022-09-30 Fri 14:45>
+#+STARTUP: fold
+
+This document records an attempt to design a parser generator library.
+
+* Goals
+
+- Modular
+
+ The library should be modular enough that the following components
+ of the parser generator can be easily substituted by either external
+ crates or even external foreign functions.
+
+ + Regular expression matchers
+
+ The goal of this project is a parser generator; a regular
+ expression matcher is not its core concern. So I shall write the
+ library in a way that can use other means to match regular
+ expressions, should some users wish to do so.
+
+ Since I do not want to use external dependencies by default, I
+ will also write a small regular expression matcher library to
+ serve as the default matcher. This will allow me to try different
+ optimization techniques for regular expression matchers as well.
+
+ + Nondeterministic Finite Automata
+
+ The project needs to manipulate nondeterministic finite automata.
+ As with regular expression matchers, I will write my own library
+ for dealing with NFAs, while preserving the possibility to "plug
+ in" other external crates for this purpose as well.
+
+ + Graph data type
+
+ The internal data of the parser generator, as used by the current
+ algorithm, is a directed graph. While this is easy to represent
+ using the adjacency-list representation, I will not underestimate
+ the potential optimizations brought by external libraries.
+
+ Moreover, the underlying data structures of regular expressions
+ and nondeterministic finite automata are graphs as well. So they
+ should be able to share the same data structure in the default
+ implementation, and to plug in external crates, should the need
+ arise.
+
+- Abstract
+
+ One of the deficiencies of the current version of this project is
+ that it does not mirror the abstract ideas of the algorithm closely.
+ This means that, when the implementation of the algorithm goes
+ wrong, I cannot precisely pinpoint the cause of the erroneous
+ behaviour.
+
+ The reason for the departure from the abstract ideas was that I
+ thought it could lead to some performance gains. Apparently I was
+ so obsessed with premature optimization that I ignored the
+ problems it caused.
+
+** Update
+
+Now I realize that a nondeterministic finite automaton is equivalent
+to a regular expression, so it makes no sense to deal with NFAs
+without dealing with regular expressions. Also, dealing with regular
+expressions directly makes it easier to extract the "atomic languages"
+of grammars and to present them in a readable way.
+
+But still the current focus is not on matching regular expressions,
+but on the structures of regular expressions.
+
+* Details
+
+According to the above goals, I have to write three sub-crates, or
+workspace members: one for regular expression matchers, one for
+NFAs, and one for graphs.
+
+Each of these three crates is, in my opinion, easy to write: I just
+have to pack the old code into separate crates, so as to separate
+concerns. Once this is done, I can easily plug in other crates as
+well, since external crates can simply be wrapped.
+
+Moreover, in order to achieve the goal of abstraction, I need modules
+for "atoms" and "languages". Atoms represent the regular language of
+derivatives of the grammar. This is perhaps the most obscure part.
+Languages represent the derivative of the grammar with respect to
+inputs. In my experience, it is not a good idea to
+calculate the semiring values after the derivatives have been
+calculated. Instead, we shall calculate the semiring values at the
+same time as calculating the derivatives of the grammar, and record
+the values in a dedicated recorder.
+
+* Tokenizers
+
+In my opinion, tokenization is just a speed-up process. The user
+should not worry about the distinction between tokens and
+non-terminals. This means I want the parser generator to work at the
+character, or byte, level by default. When the user wants to use a
+dedicated tokenizer, to have some speed-ups for example, the user can
+supply a function that eats a string, or an array of bytes, and
+returns an array of tokens.
+
+This decision means that we don't need regular expressions by default.
+We shall add regular expressions back later, when we want to
+decrease the constant factor involved in the algorithmic complexity.
+When we do, we will need a way to automatically deduce regular
+expression terminals.  In the current version, a non-terminal whose
+rules involve only terminals will be automatically converted to a
+regular expression terminal.  I think this mechanism can be adapted
+in the future.
+
+* Grammars
+