#+TITLE: Plan of the package
#+AUTHOR: Durand
#+DATE: <2022-11-18 Ven 19:57>

* Things to do [9/9]

- [X] Implement builders for graphs
- [X] Find sets of the first terminals for each non-terminal, in the
  grammar module.
- [X] Collect various grammars for testing. [5/5]
  + [X] One simple grammar
  + [X] Notes
  + [X] Parentheses
  + [X] Pathological left-recursive
  + [X] Pathological right-recursive
    # More should be added
- [X] NFA [4/4]
  + [X] Add regular expression into NFA.
  + [X] Convert a set of regular expressions into a nondeterministic
    finite automaton, where non-terminals denote the indices of
    regular expressions in the set to substitute.
  + [X] Convert a grammar to the language of atomic languages. [2/2]
    * [X] Convert a grammar into its grammar of left-linear closures
      of non-terminals and their derivatives.
    * [X] Convert nondeterministic finite automata to their null
      closures.
  + [X] Test more grammars to ensure it works correctly. [5/5]
    * [X] Test the conversion to left-closure as a set of regular
      expressions.
    * [X] Test the conversion of a set of regular expressions into a
      nondeterministic finite automaton.
    * [X] Test the closure of null edges, where the nullity of an edge
      is defined by a function.  In our specific case, an edge is null
      if the grammar symbol that edge represents, either a terminal or
      a non-terminal, can match an empty string.
    * [X] Test the removal of empty transitions from nondeterministic
      finite automata.
    * [X] Test the removal of dead states, where a state is dead if
      and only if no other states have an edge into that state.
- [X] Refactor [7/7]
  + [X] Implement a data type for graphs with labels on vertices and
    edges, which does not need to index edges by labels, but which can
    index vertices by labels instead.
    * [X] Test it.
  + [X] Implement a builder that borrows a graph mutably.
    * [X] Test it.
  + [X] Implement a data type for graphs in which every node knows its
    parents, and every node has a unique label but edges have no
    labels.
    * [X] Test it.
  + [X] When we pull in some regular language because of the
    left-linear expansion, we need to mark that branch as coming from
    left-linear expansions.  This is necessary because we cannot
    follow a left-linearly expanded branch when we take the derivative
    of a language.  We only expand left-linearly when we try to access
    the atomic languages [s]^{(t)}.
  + [X] An edge of the NFA should carry a label that can be more
    informative than solely a terminal or a non-terminal.
  + [X] Add a mechanism for a grammar to attach labels to the edges of
    the NFA which are not necessarily Copy-able, and which should be
    stored in a separate array, such that the labels on the edges of
    the NFA point to the elements of that array.
  + [X] We need to record the expansions of those "virtual nodes".
    That is, the virtual nodes represent atomic languages such as
    [s]^{(t)} where s is a non-terminal and t is a terminal.  To be more
    specific, it represents the derivative of the left-linear closure
    of the non-terminal s with respect to the terminal t.  The
    left-linear closure of s is the set of all strings (consisting of
    terminals and non-terminals alike) that are derivable from s
    alone, without consuming any inputs, by expanding according to the
    grammar rules.  This means that we need to know which
    non-terminals were expanded in order to get to a state in [s]^{(t)}.
    * [X] Test
    * [X] Test more
- [X] Atom [3/3]
  + [X] Define the Atom trait.
  + [X] Implement virtual nodes for each derivative of each atom: The
    lack of this step might be the root cause of the failure of the
    previous version of this project.
  + [X] Test atoms
- [X] Implement semiring. [4/4]
  + [X] Define the trait.
  + [X] Define items and rules for the chain-rule parser, as an
    item-based description.
  + [X] Implement the natural number semiring.
  + [X] Implement the free semiring.
- [X] Implement languages. [6/6]
  + [X] First define a trait with the expected behaviours.
  + [X] Then implement them as doubly labelled graphs.
  + [X] Each edge in the chain-rule machine also needs to be labelled
    with a position in the forest.  This perfectly solves the problem
    of those "plugs"!
  + [X] Then implement finding derivatives by use of the chain rule.
  + [X] Handle Semirings
  + [X] Tests
- [X] Miscellaneous [1/1]
  + [X] Implement a function for NFA that checks if a given predicate
    is satisfied for each edge label.

** Forests

This was a failed attempt to handle forests.  I decided to handle
forests as a special case of semirings instead.  This is not only more
systematic, but also has the potential of handling more general
semirings later, such as the Viterbi semiring.

- [X] Implement forests [4/4]
  + [X] Design a format of forests.  This should be the most crucial
    thing to do, in order to have a better understanding of the whole
    picture.  I need to have a clear understanding of the details of
    the binarised shared packed parse forests that reflect the
    regular expressions in the grammar equations.
  + [X] Define a trait with the expected behaviours.
  + [X] Implement them as parents-knowing graphs.
  + [X] Test

* Atomic Languages

This describes the behaviours of atomic languages.  The atomic
languages consist of the null closures of the non-terminal symbols in
the grammar, and of their derivatives by terminals and non-terminals.
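
As a rough, purely illustrative sketch (the trait name and method
signatures below are assumptions made for this plan, not the actual
interface of this package), an atomic language might expose its null
closures and its derivatives roughly as follows:

#+begin_src rust
// A hypothetical sketch, for illustration only: the names and
// signatures are assumptions, not the actual interface of this package.
//
// Non-terminals and terminals are referred to by their indices in the
// grammar, and states are indices into an underlying automaton.
pub trait AtomicLanguageSketch {
    /// States of the null closure of the non-terminal `s`.
    fn null_closure(&self, s: usize) -> Vec<usize>;

    /// States of the derivative of the left-linear closure of `s` with
    /// respect to the terminal `t`, that is, of [s]^{(t)}; `None` if
    /// the derivative is empty.
    fn derivative_by_terminal(&self, s: usize, t: usize) -> Option<Vec<usize>>;

    /// The analogous derivative with respect to a non-terminal `d`.
    fn derivative_by_non_terminal(&self, s: usize, d: usize) -> Option<Vec<usize>>;
}
#+end_src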

* Derivative Languages

This is the main driving device of the algorithm.  Basically, the
algorithm works by taking successive derivatives, according to the
input symbol.  At each step, we calculate the derivative language.  In
this process, we also compute some semiring value and store it in a
carrier.  The end result of the algorithm is the final semiring
value.

If one wants simply to determine whether the input string belongs to
the grammar, one chooses the semiring to be the boolean semiring with
two elements.  If one wants to find how many ways there are to derive
a given input string, then one uses the semiring of natural numbers
instead.  If one wants, moreover, to find all the possible ways to
derive a particular input string, then one can use the free semiring
on the set of terminals and non-terminals of the grammar.  Here
"free" means the left adjoint to the forgetful functor from the
category of semirings to the category of sets.

To be more specific, the free semiring on a set is given by sets of
sequences of elements of that set.  The addition of the semiring is
set union, and the multiplication of two such sets is the set of
concatenations of a sequence from the first with a sequence from the
second.
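
As a sketch of exactly this description (the type and method names are
made up for illustration and are not this package's API), the free
semiring can be represented as a set of sequences of symbols, with
union as addition and pairwise concatenation as multiplication:

#+begin_src rust
use std::collections::BTreeSet;

/// A sketch following the description above: an element of the free
/// semiring is a set of sequences of symbols, where symbols are just
/// unsigned integers.  The names are illustrative only.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct FreeSketch(BTreeSet<Vec<usize>>);

impl FreeSketch {
    /// Additive identity: the empty set.
    pub fn zero() -> Self {
        FreeSketch(BTreeSet::new())
    }

    /// Multiplicative identity: the set containing only the empty
    /// sequence.
    pub fn one() -> Self {
        FreeSketch(std::iter::once(Vec::new()).collect())
    }

    /// Addition is set union.
    pub fn add(&self, other: &Self) -> Self {
        FreeSketch(self.0.union(&other.0).cloned().collect())
    }

    /// Multiplication concatenates every sequence of `self` with every
    /// sequence of `other`.
    pub fn mul(&self, other: &Self) -> Self {
        let mut result = BTreeSet::new();
        for left in &self.0 {
            for right in &other.0 {
                let mut seq = left.clone();
                seq.extend(right.iter().copied());
                result.insert(seq);
            }
        }
        FreeSketch(result)
    }
}
#+end_src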

** Semirings

So we need a module to define the behaviours of semirings and to
provide some common semiring implementations.  Then, in the main
driving part of the algorithm, we can freely substitute different
semirings, according to the particular needs.
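
A minimal sketch of what such a module might provide is given below.
The trait name is made up for illustration and is not the actual trait
of this package; the two implementations are the boolean and
natural-number semirings mentioned above.

#+begin_src rust
/// An illustrative sketch, not the package's actual trait: the
/// operations a semiring is expected to provide.
pub trait SemiringSketch {
    /// The additive identity.
    fn zero() -> Self;
    /// The multiplicative identity.
    fn one() -> Self;
    /// Addition, which should be associative and commutative.
    fn add(&self, other: &Self) -> Self;
    /// Multiplication, which should be associative and distribute
    /// over addition.
    fn mul(&self, other: &Self) -> Self;
}

/// The boolean semiring: enough for recognition.
impl SemiringSketch for bool {
    fn zero() -> Self { false }
    fn one() -> Self { true }
    fn add(&self, other: &Self) -> Self { *self || *other }
    fn mul(&self, other: &Self) -> Self { *self && *other }
}

/// The semiring of natural numbers: counts derivations.
impl SemiringSketch for u64 {
    fn zero() -> Self { 0 }
    fn one() -> Self { 1 }
    fn add(&self, other: &Self) -> Self { self + other }
    fn mul(&self, other: &Self) -> Self { self * other }
}
#+end_src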

** Languages

Then the main part is to define the behaviour of languages.  This
should be easy enough, since we already have the mechanisms of graphs,
nondeterministic automata, and semirings.  All we need to do is
combine them.
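
To make the intended combination a little more concrete, the following
is a purely illustrative shape for this part; the trait name and
methods are assumptions, not this package's API, and SemiringSketch
refers to the sketch in the previous section.

#+begin_src rust
/// Illustrative only: a language is driven token by token, takes
/// derivatives internally, and accumulates a value in a caller-chosen
/// semiring.
pub trait LanguageSketch<S /* : SemiringSketch */> {
    /// The error reported when a token cannot be consumed.
    type Error;

    /// Consume one token (an unsigned integer) by taking the
    /// derivative of the current language with respect to it.
    fn derive(&mut self, token: usize) -> Result<(), Self::Error>;

    /// The semiring value accumulated so far; after the whole input
    /// has been fed, this is the final result of the algorithm.
    fn value(&self) -> S;
}
#+end_src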

* Lexers

I got this idea from [[https://lists.gnu.org/archive/html/emacs-devel/2022-12/msg01127.html][a discussion on emacs-devel]].  The idea can be
formulated as

#+begin_quote
Potentially it allows invocation of a parser with different variants
of lexers - one mode with block tokens for the exploration of
project's structure, and another mode for indentation and error
checking purposes.
#+end_quote

I think this is the "right" use of lexers.  That is to say, the parser
by default should operate on an abstract type of tokens, which are but
unsigned integers, and rely on the user application to provide a lexer
for turning the actual input, like a string, into an iterator of
tokens.  The lexer can choose to do any sort of preprocessing that it
sees fit, or it can do nothing and just turn characters to unsigned
integers.  In the latter case, this parser generator works on the
character level.  But, when one needs to, one can easily instruct the
parser generator to work on the level of preprocessed tokens.

This has the advantage of allowing a certain degree of flexibility in
the type of tokens accepted, while keeping the implementation simple
at the same time: the internal core only has to deal with unsigned
integers, after all.
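
As an illustration of the character-level case (the function names
below are made up and are not part of this package), such a lexer can
be as trivial as mapping each character to an unsigned integer and
handing the parser an iterator of tokens:

#+begin_src rust
/// A trivial character-level "lexer": each character becomes one
/// token.  The names are illustrative, not this package's interface.
fn char_level_tokens(input: &str) -> impl Iterator<Item = usize> + '_ {
    input.chars().map(|c| c as usize)
}

fn main() {
    // The parser would receive this iterator and never look at the
    // original string; a more elaborate lexer could instead yield
    // indices of preprocessed tokens, such as block tokens, without
    // the parser noticing any difference.
    for token in char_level_tokens("abc") {
        println!("{token}");
    }
}
#+end_src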

* Library for drawing graphs

In the past, I thought that drawing graphs was only a development tool
for my package, so I could rely on external tools such as GraphViz, as
that functionality would not be needed by my end-users.

But recently I realized that this is needed not only for the
development of the package, but for the users as well.  To be more
specific, the target users are those who want to "play" with
grammars.  If one cannot view the resulting parse forest diagrams
easily and interactively, it is hard to "play" with grammars.

Moreover, the internal workings of the parsing machine might also be
interesting for users to inspect, whether to debug the program or to
learn more about what happens under the hood.  As such, the ability to
view diagrams of the internal directed acyclic graphs is required.

For all these reasons, I now regard the ability to draw graphs and to
interact with those graphs as essential to the user experience of this
package.  As a consequence, I would like to develop a small default
library for this purpose.  I follow the convention adopted by this
package here as well, which is to provide a default implementation of
the functionality, while still leaving room for users to substitute
other external packages, if so desired, for whatever reason.  For
example, an external package may provide unprecedented performance for
drawing graphs in Emacs.  If such a package appears, I can easily
avail myself of that external package to draw graphs.

* Testing ground

I have a strong need to test things out.  The most important one is
to visualize each step of the derivation in a human-friendly manner.
I need this to examine whether my atomic languages are wrongly
implemented, wrongly derived, or whether my understanding of the main
algorithm is plain wrong.

This is the main reason I started this rewrite of the package.

* To figure out

I need to figure out how to construct the parse forest from a
left-most null-closed derivation graph.

Reading through the original paper again, I found that the author
augmented the atomic languages with annotations of partial derivation
steps on each edge of the nondeterministic finite automaton.  In
brief, this is what I tried to achieve with some obscure constructs
such as expanding the reduction information carried by the
nondeterministic finite automata, in one of the previous versions.  To
sum up, I really need to carry that information in the first place,
otherwise the parse forests cannot be reliably constructed later on.

But I really do not want to adopt the approach of the original author.
I still prefer the idea of a recorder.  That is to say, instead of
constructing the parse forest after the recognition, we shall
construct the parse forest while we are recognizing.