======================================================================
                              2023-06-02
======================================================================

This is a "diary" that records my thoughts when trying to debug the
forest manipulations.  Basically the forests are wrongly "merged", and
produce weird clones.

Right now I have finished refactoring most of the code, but I still
fail to see why the forests behave that way.  I even finished a
prototype of my procedural macros, which help me reduce boilerplate
but are not strictly necessary.

Right now I think I need more refined control over the details of the
forest manipulations.  To be more precise, I need to be able to
observe, step by step, how the chain-rule machine manipulates the
forests.  In particular, this means being able to produce graphs of
half-way forests whenever I want.  I have not yet figured out exactly
how to do this.

The rough plan is to compress all steps of the graphs into an
"incremental format" that stores not just the forest information but
also how the forests were created, and allows users to re-create the
forests step by step.
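
To make this concrete for myself, here is a rough Rust sketch of what
such an incremental format might look like.  Every name in it is
invented for illustration; none of this is the package's actual API.

    // A rough sketch of the "incremental format": instead of storing
    // only the final forest, store the sequence of operations that
    // built it, so that a half-way forest can be reproduced at any step.

    /// One recorded manipulation of the forest (hypothetical variants).
    #[derive(Debug)]
    enum ForestStep {
        /// A new node was created with a label.
        AddNode { node: usize, label: String },
        /// An edge was added between two existing nodes.
        AddEdge { parent: usize, child: usize },
        /// Two nodes were merged (the operation I suspect goes wrong).
        Merge { kept: usize, absorbed: usize },
    }

    /// A forest stored as its construction history.
    struct IncrementalForest {
        steps: Vec<ForestStep>,
    }

    impl IncrementalForest {
        /// Replay only the first `n` steps, yielding the half-way
        /// forest, which can then be printed or rendered as a graph.
        fn replay(&self, n: usize) -> Vec<&ForestStep> {
            self.steps.iter().take(n).collect()
        }
    }

    fn main() {
        let forest = IncrementalForest {
            steps: vec![
                ForestStep::AddNode { node: 0, label: "start".into() },
                ForestStep::AddNode { node: 1, label: "expr".into() },
                ForestStep::AddEdge { parent: 0, child: 1 },
            ],
        };
        // Inspect the forest as it looked after the first two steps.
        for step in forest.replay(2) {
            println!("{step:?}");
        }
    }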

Simply put, this is just another feature that I would like to offer
the end-users, but one I postponed since it was not essential for
development.  Now it seems quite important for properly observing the
forests, so it is perhaps time to implement this feature first.



======================================================================
                              2023-07-22
======================================================================

Now a stable version of the package is in place, so I can start
working on some non-essential features, like the tokenization process
and the visualization feature.

Amongst these, the tokenization process is the most important, as the
users will not expect to work with tokens: they expect to use
characters.  So in some sense this package is not ready for the
end-users yet.

My thought is to offer two modes: one for operating on characters and
the other for operating on tokens directly.  This can be achieved by
letting the data type representing terminals contain more information
on how to match characters.

The idea is to attach to each terminal either a vector of "machine
codes" or a single character.  If a terminal carries a single
character, it can only match that character.

On the other hand, if a terminal corresponds to a vector of machine
codes, then matching means executing this sequence of codes to
determine whether a character matches.  We can even simplify the
format of the machine codes to a sequence of ranges, so that a
character matches only if it is contained in one of the ranges.
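
Below is a rough Rust sketch of what such a terminal data type could
look like; the names are placeholders, not the final design.

    use std::ops::RangeInclusive;

    /// How a terminal decides whether it matches an input character.
    /// The variant names are invented for this sketch.
    enum Matcher {
        /// Matches exactly this character.
        Char(char),
        /// "Machine codes" simplified to a list of ranges: the terminal
        /// matches a character iff it falls in one of the ranges.
        Ranges(Vec<RangeInclusive<char>>),
    }

    impl Matcher {
        fn matches(&self, c: char) -> bool {
            match self {
                Matcher::Char(expected) => *expected == c,
                Matcher::Ranges(ranges) => ranges.iter().any(|r| r.contains(&c)),
            }
        }
    }

    fn main() {
        let star = Matcher::Char('*');
        let word = Matcher::Ranges(vec!['a'..='z', 'A'..='Z', '0'..='9']);
        assert!(star.matches('*'));
        assert!(word.matches('x') && !word.matches(' '));
    }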

After this is finished, we can bump the version to 0.1.3 and simplify
the build process.  Hope it all goes well.



======================================================================
                              2023-07-23
======================================================================

I encountered some problems while implementing the tokenization
handling.

To better separate concerns, this package only operates at the level
of tokens, which are an abstraction over the input.  For example, if
we encounter an identifier in the source, we can replace that
identifier with a special token.  This way, the length of the input is
greatly reduced, and parsing can be sped up, since the tokenization
process, which usually involves only regular expressions, is much
faster than the rest of the parsing.

The problem is that the same input slice can be classified as
multiple different tokens.  As an example, if a grammar has a token
for matching arbitrary non-whitespace characters, and it also has a
token for matching a specific character, like '*', then both tokens
are valid at the character '*'.  This can be regarded as a token at a
deeper level appearing as multiple different tokens at a shallower
level.
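
A tiny illustration of the ambiguity, with the two grammar terminals
approximated by a range check and an equality check:

    // Two hypothetical grammar terminals: one accepting any printable,
    // non-whitespace ASCII character (an approximation of "arbitrary
    // non-whitespace characters"), and one accepting only '*'.  Both
    // accept the input character '*', so without extra rules the
    // tokenizer cannot pick a unique token.
    fn main() {
        let input = '*';
        let matches_general = ('!'..='~').contains(&input);
        let matches_star = input == '*';
        assert!(matches_general && matches_star); // two valid tokens for one character
    }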

Hence, a natural approach would be to let the tokens of the grammar
decide how to match against tokens of the input: the grammar and the
input use different token types.  This is fine as long as the user
does not come into play: the package can implement a way to match
grammar tokens against inputs.  However, if the user wants to
intervene in the tokenization process, then things become tricky.

The well-known tree-sitter parser generator (see
<https://tree-sitter.github.io/tree-sitter/> for more information)
differentiates tokens by a set of rules under which a given input can
only be classified as a unique token.  Basically, it resolves the
possibility of multiple tokens through precedence rules.  This is
consistent with its goal of providing parsers primarily for
programming-language grammars, for which the number of derivations of
a given input is at most one: it does not accept multiple derivations
of the same input.

But this is not acceptable for this package: it is explicitly desired
to have multiple derivations recognized.

Moreover, if we look back at the original motivation behind the
tokenization process, we see that tokenization is needed for only two
reasons: speed and flexibility.

If we want tokenization for the sake of speed, then tokenization is
only an implementation detail and should not impose restrictions on
the resulting derivations.  From this perspective, every tokenization
possibility should be preserved.

On the other hand, if we tokenize the input for the flexibility it
brings, then there is no reason to discard some tokenization
possibility just because we have not found a way to handle it: that
would hurt the flexibility a lot.

Consequently, from both points of view, we shall treat multiple
tokenization possibilities as if they were different grammar
derivations.  The only difference from a "regular" grammar derivation
is that they can be altered by the user, which is where the
flexibility comes from.  And this is the point: tokens sit somewhere
between being internal to the parsers and being external to them.
Or, put more simply, they should be a changeable gear in the parsers.

But things are not so easy: if we want to let users have a handle on
the tokenization, then we require extra input from them.  This is not
only an added burden on the users but also an additional performance
overhead, and I am not sure whether it is too much.

The situation is even worse: different tokenization choices can lead
to the input being segmented differently.  For example, one choice
may break the input into the segments (Token(1), Span(0, 1)),
(Token(2), Span(1, 2)), (Token(3), Span(2, 3)), whereas another
choice may break the input into these segments instead: (Token(1),
Span(0, 1)), (Token(4), Span(1, 3)).
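
Written out as data (with a made-up Span type just to record the
example), the two choices would be:

    /// Hypothetical span of input positions, only for this example.
    #[derive(Debug, Clone, Copy)]
    struct Span(usize, usize);

    fn main() {
        // First choice: three tokens, each covering one position.
        let choice_a = vec![(1, Span(0, 1)), (2, Span(1, 2)), (3, Span(2, 3))];
        // Second choice: the same input split into only two tokens.
        let choice_b = vec![(1, Span(0, 1)), (4, Span(1, 3))];
        // Both cover positions 0..3, but the segment boundaries differ,
        // which is what the chain-rule machine cannot easily track.
        println!("{choice_a:?}");
        println!("{choice_b:?}");
    }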

This is very hard for the chain-rule machine to track, as it operates
at the level of tokens.  In fact, after more thought, the chain-rule
machine is not the right place to handle these complexities.  After
all, this process has nothing to do with the chain-rule machine.  It
should be the responsibility of the parser generator application that
feeds the machine appropriate input tokens.

Consequently, the tokenization process is best handled by "splitting
the input sequence" into multiple sequences of input tokens when
multiple token variations are desired.  Naturally, each variation
should be coupled with its own derivation forest, provided that
variation is valid according to the grammar.

I think this is a rather plausible approach: multiple possibilities of
tokens are handled, and if only one variation is needed, no additional
overhead is imposed.
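
Here is a rough sketch of this splitting approach, with Token, Forest,
and parse as stand-ins rather than the package's real types:

    // Every tokenization variation becomes its own token sequence, is
    // parsed independently, and only the variations accepted by the
    // grammar keep a derivation forest.

    #[derive(Debug, Clone, Copy)]
    struct Token(u32);

    #[derive(Debug)]
    struct Forest; // placeholder for a derivation forest

    /// Stand-in for running the chain-rule machine over one sequence.
    fn parse(tokens: &[Token]) -> Option<Forest> {
        // Pretend that only non-empty sequences are valid.
        if tokens.is_empty() { None } else { Some(Forest) }
    }

    fn main() {
        let variations: Vec<Vec<Token>> = vec![
            vec![Token(1), Token(2), Token(3)],
            vec![Token(1), Token(4)],
        ];
        // Each valid variation is coupled with its own forest.
        let forests: Vec<(Vec<Token>, Forest)> = variations
            .into_iter()
            .filter_map(|toks| parse(&toks).map(|forest| (toks, forest)))
            .collect();
        println!("{} variation(s) accepted", forests.len());
    }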

One point of caution is that we must be able to "share states" across
variations.  This means the directed acyclic graph used to recognize
the input sequence should be maximally shared.

In my rough plan, this should be accomplished after we implement the
"context-free memoization" for the chain-rule machines.  This
memoization technique is really the central selling-point of the paper
proposing this chain-rule machine parsing algorithm, at least
according to its author.

Currently, several steps remain in order to implement the memoization
technique.  The part concerning the directed acyclic graph is rather
simple: we just have to break the graph into several chained pieces,
each of which represents a segment of the whole graph.  The reason
for this segmentation is that it lets us memoize the segments more
easily.
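
To fix the idea for myself, here is a speculative sketch of the
segmented graph with a memo table over segments.  The representation,
including the choice of keying segments by their position range, is
invented for this sketch and almost certainly not what the final code
will look like.

    use std::collections::HashMap;

    /// One piece of the recognizing DAG, covering a contiguous range
    /// of input positions (everything here is a placeholder).
    #[derive(Clone, Debug)]
    struct Segment {
        start: usize,
        end: usize,
        edges: Vec<(usize, usize)>, // edges local to this segment
    }

    /// The whole DAG as a chain of segments, plus a memo table keyed
    /// (for this sketch) simply by the segment's position range.
    struct ChainedDag {
        segments: Vec<Segment>,
        memo: HashMap<(usize, usize), Segment>,
    }

    impl ChainedDag {
        /// Reuse a previously computed segment with the same key if
        /// one exists; otherwise compute it with `build` and remember it.
        fn segment(&mut self, start: usize, end: usize,
                   build: impl FnOnce() -> Segment) -> Segment {
            self.memo.entry((start, end)).or_insert_with(build).clone()
        }
    }

    fn main() {
        let mut dag = ChainedDag { segments: Vec::new(), memo: HashMap::new() };
        let piece = dag.segment(0, 3, || Segment { start: 0, end: 3, edges: vec![(0, 1)] });
        dag.segments.push(piece);
        println!("{} segment(s)", dag.segments.len());
    }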

What is more challenging is the memoization of the derivation
forests.  This part differs somewhat from what the author proposed in
the paper.  The author suggested recording every derivation sequence
in the graph and memoizing the transformations on the sequences.  My
plan is instead to memoize the directed acyclic graph and the
transformations on the derivation forests.

So I need to figure out the memoization of two things: directed
acyclic graphs and transformations of derivation forests.

But before that, the first thing to do is to implement a "default
tokenizer".  The user should of course have the facility to control
the workings of this tokenizer, perhaps by additional specialization
in the grammar files.

Further, this control should take the state of the derivation forest
into account: the user could demand, for example, that a particular
tokenization variation be rejected if, at a given token, the
derivation forest does not meet a certain criterion.  This could of
course be done after parsing, but an earlier cut-off can sometimes be
appealing.
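
One possible shape for that control, sketched with placeholder types
(Forest and Token here are not the package's actual types): a
predicate supplied to the future default tokenizer and consulted at
each token of a variation.

    struct Forest; // placeholder for the derivation forest state

    #[derive(Clone, Copy)]
    struct Token(u32); // placeholder for an input token

    /// Decide whether to keep exploring a tokenization variation after
    /// seeing `token`, given the forest built so far.
    fn keep_variation<F>(filter: F, forest: &Forest, token: Token) -> bool
    where
        F: Fn(&Forest, Token) -> bool,
    {
        filter(forest, token)
    }

    fn main() {
        // Example criterion: cut off any variation in which token 4 appears.
        let filter = |_forest: &Forest, token: Token| token.0 != 4;
        println!("{}", keep_variation(filter, &Forest, Token(4))); // prints "false"
    }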