#+AUTHOR: Durand
#+EMAIL: durand@jsdurand.xyz
#+DATE: <2022-09-30 Fri 14:45>
#+STARTUP: fold
This document records an attempt to design a parser generator library.
* Goals
- Modular
  The library should be modular enough that the following components
  of the parser generator can be replaced easily, either by external
  crates or even by foreign functions.
  + Regular expression matchers
    This project is a parser generator; a regular expression matcher
    is not its core goal.  So I shall write the library in a way that
    can use other means to match regular expressions, should some
    users wish to do so.
    Since I do not want to depend on external crates by default, I
    will also write a small regular expression matcher library to
    serve as the default matcher.  This will also let me try
    different optimization techniques for regular expression
    matchers.
  + Nondeterministic Finite Automata
    The project needs to manipulate nondeterministic finite automata.
    As with regular expression matchers, I will write my own library
    for dealing with NFA's, and in the meantime preserve the
    possibility to "plug in" external crates for this purpose as
    well.
  + Graph data type
    The internal data of the parser generator, as used by the current
    algorithm, is a directed graph.  While this is easy to represent
    with an adjacency list, I will not underestimate the potential
    optimizations offered by external libraries.
    Moreover, the underlying data structures of regular expressions
    and nondeterministic finite automata are graphs as well.  So they
    should be able to share the same data structure in the default
    implementation, and to plug in external crates, should the need
    arise.  A sketch of such a shared interface follows this list.
- Abstract
  One of the deficiencies of the current version of this project is
  that it does not mirror the abstract ideas of the algorithm
  closely.  This means that, when the implementation of the algorithm
  goes wrong, I cannot precisely pinpoint the cause of the erroneous
  behaviour.
  The reason for the departure from the abstract ideas was that I
  thought it could lead to some performance gains.  Apparently I was
  so obsessed with premature optimization that I ignored the problems
  this caused.
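Below is the kind of shared graph interface I have in mind.  This is
only a sketch under assumed names (=Graph=, =ALGraph=, and the method
names are all hypothetical), not a settled API: the default
adjacency-list type and wrappers around external crates would both
implement the same trait.
#+begin_src rust
  // Hypothetical sketch of a pluggable graph abstraction; the trait
  // and method names are illustrative only.
  pub trait Graph {
      /// The number of nodes in the graph.
      fn nodes_len(&self) -> usize;

      /// Iterate over the successors of a node.
      fn children_of(&self, node: usize) -> Box<dyn Iterator<Item = usize> + '_>;
  }

  /// The default adjacency-list representation.
  pub struct ALGraph {
      nodes: Vec<Vec<usize>>,
  }

  impl Graph for ALGraph {
      fn nodes_len(&self) -> usize {
          self.nodes.len()
      }

      fn children_of(&self, node: usize) -> Box<dyn Iterator<Item = usize> + '_> {
          Box::new(self.nodes[node].iter().copied())
      }
  }
#+end_src
An external graph crate could then be plugged in by wrapping its type
in a newtype that implements the same trait.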
** Update
Now I realize that a nondeterministic finite automaton is equivalent
to a regular expression, so it makes no sense to deal with NFA's
without dealing with regular expressions.  Also, dealing with regular
expressions directly makes it easier to extract the "atomic
languages" of grammars and to present them in a readable way.
Still, the current focus is not on matching regular expressions, but
on the structure of regular expressions, sketched below.
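To make the structure concrete, regular expressions can be
represented as plain syntax trees.  The following enum is a minimal
sketch under assumed names, not the actual definition the crate will
use:
#+begin_src rust
  // Hypothetical sketch: regular expressions as pure syntax trees.
  // No matching happens here; only the structure is represented.
  pub enum Regex<T> {
      /// The empty language.
      Empty,
      /// The language containing only the empty string.
      Epsilon,
      /// A single atom, such as a terminal or a character.
      Atom(T),
      /// Concatenation of several languages, in order.
      Cat(Vec<Regex<T>>),
      /// Union (alternation) of several languages.
      Or(Vec<Regex<T>>),
      /// Zero or more repetitions.
      Star(Box<Regex<T>>),
  }
#+end_src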
* Details
According to the above goals, I have to write three sub-crates, or
workspace members: for regular expression matchers, for NFA's, and
for graphs, respectively.
Each of these three crates is easy to write, in my opinion: I just
have to pack the old code into separate crates, to separate concerns.
Once this is done, I can easily plug in other crates as well, since
one can simply wrap up external crates.
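For instance, the workspace could be laid out roughly as follows; the
member names here are hypothetical:
#+begin_example
  parser-generator/   # workspace root; its Cargo.toml lists the members
  ├── graph/          # default graph data structures
  ├── nfa/            # nondeterministic finite automata
  └── matcher/        # default regular expression matcher
#+end_example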
Moreover, in order to achieve the goal of abstraction, I need modules
for "atoms" and "languages".  Atoms represent the regular language of
derivatives of the grammar; in my experience this is perhaps the most
obscure part.  Languages represent the derivatives of the grammar
with respect to inputs.  Experience also tells me that it is not a
good idea to calculate the semiring values after the derivatives have
been calculated.  Instead, we shall calculate the semiring values at
the same time as the derivatives of the grammar, and record the
values in a dedicated recorder.
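As an illustration of this arrangement, the semiring values and the
recorder could be described by traits along the following lines.
This is a sketch under assumed names, not the actual interface:
#+begin_src rust
  // Hypothetical sketch: semiring values are accumulated while the
  // derivatives are taken, instead of in a separate pass afterwards.
  pub trait Semiring {
      fn zero() -> Self;                   // additive identity
      fn one() -> Self;                    // multiplicative identity
      fn add(&self, other: &Self) -> Self; // "choice" of derivations
      fn mul(&self, other: &Self) -> Self; // "sequencing" of derivations
  }

  /// A recorder receives semiring values as the derivatives are
  /// computed, so no second pass over the results is needed.
  pub trait Recorder<S: Semiring> {
      fn record(&mut self, value: S);
  }

  /// Example: counting the number of parses with the usize semiring.
  impl Semiring for usize {
      fn zero() -> Self { 0 }
      fn one() -> Self { 1 }
      fn add(&self, other: &Self) -> Self { self + other }
      fn mul(&self, other: &Self) -> Self { self * other }
  }
#+end_src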
* Tokenizers
In my opinion, tokenization is just a speed-up: the user should not
have to worry about the distinction between tokens and non-terminals.
This means I want the parser generator to work at the character, or
byte, level by default.  When the user wants a dedicated tokenizer,
for example to gain some speed, they can supply a function that eats
a string, or an array of bytes, and returns an array of tokens.
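Concretely, the tokenizer boundary could be as simple as a function
type of the following shape; the names =Token= and =Tokenizer= are
illustrative, not the final API:
#+begin_src rust
  // Hypothetical sketch of the pluggable tokenizer boundary.  By
  // default the generator consumes bytes directly; a user-supplied
  // function can pre-chunk the input into tokens instead.
  pub struct Token {
      /// The index of the terminal this token stands for.
      pub terminal: usize,
      /// The region of the original input covered by this token.
      pub span: (usize, usize),
  }

  /// A tokenizer eats an array of bytes and returns an array of tokens.
  pub type Tokenizer = fn(&[u8]) -> Vec<Token>;
#+end_src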
This decision means that we do not need regular expressions by
default.  We shall add regular expressions back later, when we want
to decrease the constant factor in the algorithm's complexity.  When
we do so, we will need a way to automatically deduce regular
expression terminals.  In the current version, a non-terminal whose
rules only involve terminals is automatically converted to a regular
expression terminal.  I think this mechanism can be adapted in the
future.
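A sketch of the deduction, over a deliberately simplified grammar
representation (the real types will differ): a non-terminal qualifies
when every symbol on the right-hand sides of its rules is a terminal.
#+begin_src rust
  // Hypothetical, simplified grammar representation for illustration.
  pub enum Symbol {
      Terminal(usize),
      Nonterminal(usize),
  }

  pub struct Rule {
      pub rhs: Vec<Symbol>,
  }

  /// A non-terminal whose rules mention only terminals can be
  /// converted to a regular expression terminal.
  pub fn is_regex_terminal(rules: &[Rule]) -> bool {
      rules
          .iter()
          .all(|rule| rule.rhs.iter().all(|s| matches!(s, Symbol::Terminal(_))))
  }
#+end_src
For example, a non-terminal =digit= whose only rules are =digit →
"0"= through =digit → "9"= would be converted to a regular expression
terminal matching =[0-9]=.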
* Grammars