authorJSDurand <mmemmew@gmail.com>2022-11-15 12:01:28 +0800
committerJSDurand <mmemmew@gmail.com>2022-11-15 12:01:28 +0800
commitcb7bcfad4ab0041aaf3fde3185e27ee46bb37788 (patch)
treea4fd99b138b72617b6c4c2b04f5d2655d0fedcc5 /DESIGN.org
Initial commit
Basic GNU standard files are added, and we now stop worrying about monadic anamorphisms. The current focus is on testing the correctness of the algorithm, so I need convenient support for manipulating, interpreting, examining, and perhaps animating nondeterministic automata.
Diffstat (limited to 'DESIGN.org')
-rw-r--r--  DESIGN.org  114
1 file changed, 114 insertions, 0 deletions
diff --git a/DESIGN.org b/DESIGN.org
new file mode 100644
index 0000000..18834cb
--- /dev/null
+++ b/DESIGN.org
@@ -0,0 +1,114 @@
+#+AUTHOR: Durand
+#+EMAIL: durand@jsdurand.xyz
+#+DATE: <2022-09-30 Fri 14:45>
+#+STARTUP: fold
+
+This document records an attempt to design a parser generator library.
+
+* Goals
+
+- Modular
+
+ The library should be modular enough that the following components
+ of the parser generator can be easily substituted by either external
+ crates or even external foreign functions.
+
+ + Regular expression matchers
+
+ The goal of this project is a parser generator; a regular
+ expression matcher is not its core concern. So I shall write the
+ library in a way that can use other means to match regular
+ expressions, should some users wish to do so.
+
+ Since I do not want to use external dependencies by default, I
+ will also write a small regular expression matcher library to
+ serve as the default matcher. This will allow me to try different
+ optimization techniques for regular expression matchers as well.
+
+ + Nondeterministic Finite Automata
+
+ The project needs to manipulate nondeterministic finite automata.
+ As with regular expression matchers, I will write my own library
+ for dealing with NFAs, while preserving the possibility to "plug
+ in" other external crates for this purpose as well.
+
+ + Graph data type
+
+ The internal data of the parser generator, as used by the current
+ algorithm, is a directed graph. While this is easy to represent
+ using the adjacency-list representation, I will not underestimate
+ the potential optimizations brought by external libraries.
+
+ Moreover, the underlying data structures of regular expressions
+ and nondeterministic finite automata are graphs as well. So they
+ should be able to share the same data structure in the default
+ implementation, and to plug in external crates, should the need
+ arise.
+
+- Abstract
+
+ One of the deficiencies of the current version of this project is
+ that it does not mirror the abstract ideas of the algorithm closely.
+ This means that, when the implementation of the algorithm goes
+ wrong, I cannot precisely pinpoint the cause of the erroneous
+ behaviour.
+
+ The reason for the departure from the abstract ideas was that I
+ thought it could lead to some performance gains. Apparently I was
+ so obsessed with premature optimization that I ignored the
+ problems it caused.
+
+** Update
+
+Now I realize that a nondeterministic finite automaton is equivalent
+to a regular expression, so it makes no sense to deal with NFAs
+without dealing with regular expressions. Also, dealing with regular
+expressions directly makes it easier to extract the "atomic languages"
+of grammars and to present them in a readable way.
+
+But still the current focus is not on matching regular expressions,
+but on the structures of regular expressions.
+
+* Details
+
+According to the above goals, I have to write three sub-crates, or
+workspace members: one for regular expression matchers, one for
+NFAs, and one for graphs.
+
+Each of these three crates is, in my opinion, easy to write: I just
+have to pack the old code into separate crates, so as to separate
+concerns. Once this is done, I can easily plug in other crates as
+well, since external crates can simply be wrapped.
+
+Moreover, in order to achieve the goal of abstraction, I need modules
+for "atoms" and "languages". Atoms represent the regular language of
+derivatives of the grammar. This is perhaps the most obscure part.
+Languages represent the derivative of the grammar with respect to
+inputs. In my experience, it is not a good idea to
+calculate the semiring values after the derivatives have been
+calculated. Instead, we shall calculate the semiring values at the
+same time as calculating the derivatives of the grammar, and record
+the values in a dedicated recorder.
+
+* Tokenizers
+
+In my opinion, tokenization is just a speed-up process. The user
+should not worry about the distinction between tokens and
+non-terminals. This means I want the parser generator to work at the
+character, or byte, level by default. When the user wants to use a
+dedicated tokenizer, to have some speed-ups for example, the user can
+supply a function that eats a string, or an array of bytes, and
+returns an array of tokens.
+
+This decision means that we don't need regular expressions by default.
+We shall add regular expressions back later, when we want to
+decrease the constant factor involved in the algorithmic complexity.
+When we do, we will need a way to automatically deduce regular
+expression terminals.  In the current version, a non-terminal whose
+rules involve only terminals will be automatically converted to a
+regular expression terminal.  I think this mechanism can be adapted
+in the future.
+
+* Grammars
+