1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
|
#+TITLE: Plan of the package
#+AUTHOR: Durand
#+DATE: <2022-11-18 Ven 19:57>
* Things to do [9/9]
- [X] Implement builders for graphs
- [X] Find sets of the first terminals for each non-terminal, in the
grammar module.
- [X] Collect various grammars for testing. [5/5]
+ [X] One simple grammar
+ [X] Notes
+ [X] Parentheses
+ [X] Pathelogical left-recursive
+ [X] Pathelogical right-recursive
# More should be added
- [X] NFA [4/4]
+ [X] Add regular expression into NFA.
+ [X] Convert a set of regular expressions into a nondeterministic
finite automaton, where non-terminals denote the indices of
regular expressions in the set to substitute.
+ [X] Convert a grammar to the language of atomic languages. [2/2]
* [X] Convert a grammar into its grammar of left-linear closures
of non-temrinals and their derivatives.
* [X] Convert nondeterministic finite automata to their null
closures.
+ [X] Test more grammars to ensure it works correctly. [5/5]
* [X] Test the conversion to left-closure as a set of regular
expressions.
* [X] Test the conversion of a set of regular expressions into a
nondeterministic finite automaton.
* [X] Test the closure of null edges, where the nullity of an edge
is defined by a function. In our specific case, an edge is null
if the grammar symbol that edge represents, either a terminal or
a non-terminal, can match an empty string.
* [X] Test the removal of empty transitions from nondeterministic
finite automata.
* [X] Test the removal of dead states, where a state is dead if
and only if no other states have an edge into that state.
- [X] Refactor [7/7]
+ [X] Implement a data type for graphs with labels on vertices and
edges, but do not need to index edges by labels, which can index
vertices by labels instead.
* [X] Test it.
+ [X] Implement a builder that borrows a graph mutably.
* [X] Test it.
+ [X] Implement a data type for graphs in which every node knows its
parents, and every node has a unique label but edges have no
labels.
* [X] Test it.
+ [X] When we pull in some regular language because of the
left-linear expansion, we need to mark that branch as coming from
left-linear expansions. This is necessary because we cannot
follow a left-linearly expanded branch when we take the derivative
of a language. We only expand left-linearly when we try to access
the atomic languages [s]^{(t)}.
+ [X] An edge of the NFA should carry a label that can be more
informative than solely a terminal or a non-terminal.
+ [X] Add a mechanism for a grammar to attach labels to edges of NFA
which are not necessarily Copy-able, and which should be stored in a
separate array, such that the labels on edges of NFA point to the
elements of the array.
+ [X] We need to record the expansions of those "virtual nodes".
That is, the virtual nodes represent atomic languages such as
[s]^{(t)} where s is a non-terminal and t is a terminal. To be more
specific, it represents the derivative of the left-linear closure
of the non-terminal s with respect to the terminal t. The
left-linear closure of s is the set of all strings (consisting of
terminals and non-terminals alike) that are derivable from s
alone, without consuming any inputs, by expanding according to the
grammar rules. This means that we need to know which
non-terminals were expanded in order to get to a state in [s]^{(t)}.
* [X] Test
* [X] Test more
- [X] Atom [3/3]
+ [X] Define the Atom trait.
+ [X] Implement virtual nodes for each derivative of each atom: The
lack of this step might be the root cause of the failure of the
previous version of this project.
+ [X] Test atoms
- [X] Implement semiring. [4/4]
+ [X] Define the trait.
+ [X] Define items and rules for the chain-rule parser, as an
item-based description.
+ [X] Implement the natural number semiring.
+ [X] Implement the free semiring.
- [X] Implement languages. [6/6]
+ [X] First define a trait with the expected behaviours.
+ [X] Then implement them as doubly labelled graphs.
+ [X] Each edge in the chain-rule machine needs to be labelled also
with a position in the forest. This perfectly solves the problem
of those "plugs"!
+ [X] Then implement finding derivatives by use of the chain rule.
+ [X] Handle Semirings
+ [X] Tests
- [X] Miscellaneous [1/1]
+ [X] Implement a function for NFA that checks if a given predicate
is satisfied for each edge label.
** Forests
This was a failed attempt to handle forests. I decided to handle
forests as a special case of semi-rings. This is not only more
systematic, but has also the potential of handling more general
semi-rings later, like the Viterbi semirings, for example.
- [X] Implement forests [4/4]
+ [X] Design a format of forests. This should be the most crucial
thing to do, in order to have a better understanding of the whole
picture. I need to have a clear understanding of the details of
the binarised shared packed parsed forests, that reflects the
regular expressions in the grammar equations.
+ [X] Define a trait with the expected behaviours.
+ [X] Implement them as parents-knowing graphs.
+ [X] Test
* Atomic Languages
This describes the behaviours of atomic languages. The atomic
language consists of the null closure of any non-terminal symbol in
the grammar, and their deriavtives by terminals and non-terminal.
* Derivative Languages
This is the main driving device of the algorithm. Basically, the
algorithm works by taking successive derivatives, according to the
input symbol. At each step, we calculate the derivative language. In
this process, we also compute some semiring value and store in a
carrier. The end result of the algorithm is the final semiring
value.
If one wants simply to determine if the input string belongs to the
grammar, one chooses the semiring to be the field with two elements,
the boolean field. If one wants to find how many ways are there to
derive a given input string, then one uses the semiring of natural
numbers instead. If one wants, moreover, to find all the possible
ways to derive a particular input string, then one can use the
free semiring on the set of terminals and non-terminals of the
grammar. Here the free semiring is the left-adjoint functor to the
forgetful functor from the category of semirings to the category of
sets.
To be more specific, the free semiring on a set is given by sets of
sequences of elements in the set. The addition of the semiring is the
set union operation, and the multiplication is taking the respective
concatenations.
** Semirings
So we need a module to define the behaviours of semirings, and provide
some common semirings implementations. Then in the main driving force
we can freely substitute different semirings, according to the
particular needs.
** Languages
Then the main part is to define the behaviour of languages. This
should be easy enough, since we already have the mechanism of graphs,
nondeterministic automata, and semirings. All we need to do is to
combine them together.
* Lexers
I got this idea from [[https://lists.gnu.org/archive/html/emacs-devel/2022-12/msg01127.html][a discussion on emacs-devel]]. The idea can be
formulated as
#+begin_quote
Potentially it allows invocation of a parser with different variants
of lexers - one mode with block tokens for the exploration of
project's structure, and another mode for indentation and error
checking purposes.
#+end_quote
I think this is the "right" use of lexers. That is to say, the parser
by default should operate on an abstract type of tokens, which are but
unsigned integers, and rely on the user application to provide a lexer
for turning the actual input, like a string, into an iterator of
tokens. The lexer can choose to do any sort of preprocessing that it
sees fit, or it can do nothing and just turn characters to unsigned
integers. In the latter case, this parser generator works on the
character level. But, when one neeeds, one can instruct the parser
generator to work on the level of preprocessed tokens easily.
This has the advantage of allowing a certain degree of flexibility in
the type of tokens accepted, while keeping the implementation simple
at the same time: the internal core only has to deal with unsigned
integers, after all.
* Library for drawing graphs
In the past, I thought that drawing graphs is only a development tool
for my package, so I can rely on such external tools as GraphViz, as
that would not be needed for my end-users.
But recently I realized that this is not a need only for the
development of the package, but also for the users as well. To be
more specific, the target users are those who want to "play" with
grammars. So if one cannot view the resulting parse forest diagrams
easily and interactively, it would be hard to "play" with grammars.
Moreover, the internal workings of the parsing machine might also be
interesting for users to inspect, whether to debug the program or to
know more under the hood. As such it is required to view the diagrams
of the internal directed acyclic graphs.
For all these reasons, I know regard the ability to draw graphs and to
interact with those graphs is essential to the user experience of this
package. As a consequence, I would like to develop a small default
library for this purpose. I follow the convention adapted by this
package here as well, which is to provide a default implementation for
the functionality, while still leaving the room for users to
substitute with other external packages, if so desired, for whatever
reason. For example, an external package may provide an unprecedented
performance for drawing graphs in Emacs. If such a package appears, I
can easily avail of that external package to draw graphs.
* Testing ground
I am in a strong need to test things out. The most important one is
to visualize each step of the derivation, in a human-friendly manner.
I need this to examine whether my atomic languages are wrongly
implemented, or my atomic languages are wrongly derived, or my
understanding of the main algorithm is plain wrong.
This is the main reason I started this rewrite of the package.
* To figure out
I need to figure out how to construct the parse forest from a
left-most null-closed derivation graph.
Reading through the original paper again, I found that the author
augmented the atomic languages with annotations of partial derivation
steps on each edge of the nondeterministic finite automaton. In
brief, this is what I tried to achieve with some obscure constructs
such as expanding reduction informations carried by the
nondeterministic finite automata, in one of the previous versions. To
sum up, I really need to carry those informations in the first place,
otherwise the parse forests cannot be reliably construced later on.
But I really do not want to adopt the approach of the original author.
I still prefer the idea of a recorder. That is to say, instead of
constructing the parse forest after the recognition, we shall
construct the parse forest simultaneously while we are recognizing.
|