#+AUTHOR: Durand
#+EMAIL: durand@jsdurand.xyz
#+DATE: <2022-09-30 Fri 14:45>
#+STARTUP: fold
This document records an attempt to design a parser generator library.
* Goals
- Modular
  The library should be modular enough that the following components
  of the parser generator can be replaced easily, either by external
  crates or even by foreign functions.
  + Regular expression matchers
    This project is a parser generator; a regular expression matcher
    is not its core goal.  So I shall write the library in a way that
    can use other means to match regular expressions, should some
    users wish to do so.
    Since I do not want to depend on external crates by default, I
    will also write a small regular expression matcher library to
    serve as the default matcher.  This will also let me try
    different optimization techniques for regular expression
    matchers.
  + Nondeterministic Finite Automata
    The project needs to manipulate nondeterministic finite automata.
    As with regular expression matchers, I will write my own library
    for dealing with NFA's, and in the meantime preserve the
    possibility to "plug in" external crates for this purpose as
    well.
  + Graph data type
    The internal data of the parser generator, as used by the current
    algorithm, is a directed graph.  While this is easy to represent
    with an adjacency list, I will not underestimate the potential
    optimizations offered by external libraries.
    Moreover, the underlying data structures of regular expressions
    and nondeterministic finite automata are graphs as well.  So they
    should be able to share the same data structure in the default
    implementation, and to plug in external crates, should the need
    arise.  A sketch of such a shared interface follows this list.
- Abstract
  One of the deficiencies of the current version of this project is
  that it does not mirror the abstract ideas of the algorithm
  closely.  This means that, when the implementation of the algorithm
  goes wrong, I cannot precisely pinpoint the cause of the erroneous
  behaviour.
  The reason for the departure from the abstract ideas was that I
  thought it could lead to some performance gains.  Apparently I was
  so obsessed with premature optimization that I ignored the problems
  this caused.
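Below is the kind of shared graph interface I have in mind.  This is
only a sketch under assumed names (=Graph=, =ALGraph=, and the method
names are all hypothetical), not a settled API: the default
adjacency-list type and wrappers around external crates would both
implement the same trait.
#+begin_src rust
  // Hypothetical sketch of a pluggable graph abstraction; the trait
  // and method names are illustrative only.
  pub trait Graph {
      /// The number of nodes in the graph.
      fn nodes_len(&self) -> usize;

      /// Iterate over the successors of a node.
      fn children_of(&self, node: usize) -> Box<dyn Iterator<Item = usize> + '_>;
  }

  /// The default adjacency-list representation.
  pub struct ALGraph {
      nodes: Vec<Vec<usize>>,
  }

  impl Graph for ALGraph {
      fn nodes_len(&self) -> usize {
          self.nodes.len()
      }

      fn children_of(&self, node: usize) -> Box<dyn Iterator<Item = usize> + '_> {
          Box::new(self.nodes[node].iter().copied())
      }
  }
#+end_src
An external graph crate could then be plugged in by wrapping its type
in a newtype that implements the same trait.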
** Update
Now I realize that a nondeterministic finite automaton is equivalent
to a regular expression, so it makes no sense to deal with NFA's
without dealing with regular expressions.  Also, dealing with regular
expressions directly makes it easier to extract the "atomic
languages" of grammars and to present them in a readable way.
Still, the current focus is not on matching regular expressions, but
on the structure of regular expressions, sketched below.
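To make the structure concrete, regular expressions can be
represented as plain syntax trees.  The following enum is a minimal
sketch under assumed names, not the actual definition the crate will
use:
#+begin_src rust
  // Hypothetical sketch: regular expressions as pure syntax trees.
  // No matching happens here; only the structure is represented.
  pub enum Regex<T> {
      /// The empty language.
      Empty,
      /// The language containing only the empty string.
      Epsilon,
      /// A single atom, such as a terminal or a character.
      Atom(T),
      /// Concatenation of several languages, in order.
      Cat(Vec<Regex<T>>),
      /// Union (alternation) of several languages.
      Or(Vec<Regex<T>>),
      /// Zero or more repetitions.
      Star(Box<Regex<T>>),
  }
#+end_src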
* Details
According to the above goals, I have to write three sub-crates, or
workspace members: for regular expression matchers, for NFA's, and
for graphs, respectively.
Each of these three crates is easy to write, in my opinion: I just
have to pack the old code into separate crates, to separate concerns.
Once this is done, I can easily plug in other crates as well, since
one can simply wrap up external crates.
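For instance, the workspace could be laid out roughly as follows; the
member names here are hypothetical:
#+begin_example
  parser-generator/   # workspace root; its Cargo.toml lists the members
  ├── graph/          # default graph data structures
  ├── nfa/            # nondeterministic finite automata
  └── matcher/        # default regular expression matcher
#+end_example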
Moreover, in order to achieve the goal of abstraction, I need modules
for "atoms" and "languages".  Atoms represent the regular language of
derivatives of the grammar; in my experience this is perhaps the most
obscure part.  Languages represent the derivatives of the grammar
with respect to inputs.  Experience also tells me that it is not a
good idea to calculate the semiring values after the derivatives have
been calculated.  Instead, we shall calculate the semiring values at
the same time as the derivatives of the grammar, and record the
values in a dedicated recorder.
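As an illustration of this arrangement, the semiring values and the
recorder could be described by traits along the following lines.
This is a sketch under assumed names, not the actual interface:
#+begin_src rust
  // Hypothetical sketch: semiring values are accumulated while the
  // derivatives are taken, instead of in a separate pass afterwards.
  pub trait Semiring {
      fn zero() -> Self;                   // additive identity
      fn one() -> Self;                    // multiplicative identity
      fn add(&self, other: &Self) -> Self; // "choice" of derivations
      fn mul(&self, other: &Self) -> Self; // "sequencing" of derivations
  }

  /// A recorder receives semiring values as the derivatives are
  /// computed, so no second pass over the results is needed.
  pub trait Recorder<S: Semiring> {
      fn record(&mut self, value: S);
  }

  /// Example: counting the number of parses with the usize semiring.
  impl Semiring for usize {
      fn zero() -> Self { 0 }
      fn one() -> Self { 1 }
      fn add(&self, other: &Self) -> Self { self + other }
      fn mul(&self, other: &Self) -> Self { self * other }
  }
#+end_src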
* Tokenizers
In my opinion, tokenization is just a speed-up: the user should not
have to worry about the distinction between tokens and non-terminals.
This means I want the parser generator to work at the character, or
byte, level by default.  When the user wants a dedicated tokenizer,
for example to gain some speed, they can supply a function that eats
a string, or an array of bytes, and returns an array of tokens.
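Concretely, the tokenizer boundary could be as simple as a function
type of the following shape; the names =Token= and =Tokenizer= are
illustrative, not the final API:
#+begin_src rust
  // Hypothetical sketch of the pluggable tokenizer boundary.  By
  // default the generator consumes bytes directly; a user-supplied
  // function can pre-chunk the input into tokens instead.
  pub struct Token {
      /// The index of the terminal this token stands for.
      pub terminal: usize,
      /// The region of the original input covered by this token.
      pub span: (usize, usize),
  }

  /// A tokenizer eats an array of bytes and returns an array of tokens.
  pub type Tokenizer = fn(&[u8]) -> Vec<Token>;
#+end_src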
This decision means that we do not need regular expressions by
default.  We shall add regular expressions back later, when we want
to decrease the constant factor in the algorithm's complexity.  When
we do so, we will need a way to automatically deduce regular
expression terminals.  In the current version, a non-terminal whose
rules only involve terminals is automatically converted to a regular
expression terminal.  I think this mechanism can be adapted in the
future.
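A sketch of the deduction, over a deliberately simplified grammar
representation (the real types will differ): a non-terminal qualifies
when every symbol on the right-hand sides of its rules is a terminal.
#+begin_src rust
  // Hypothetical, simplified grammar representation for illustration.
  pub enum Symbol {
      Terminal(usize),
      Nonterminal(usize),
  }

  pub struct Rule {
      pub rhs: Vec<Symbol>,
  }

  /// A non-terminal whose rules mention only terminals can be
  /// converted to a regular expression terminal.
  pub fn is_regex_terminal(rules: &[Rule]) -> bool {
      rules
          .iter()
          .all(|rule| rule.rhs.iter().all(|s| matches!(s, Symbol::Terminal(_))))
  }
#+end_src
For example, a non-terminal =digit= whose only rules are =digit →
"0"= through =digit → "9"= would be converted to a regular expression
terminal matching =[0-9]=.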
* Grammars