path: root/chain/src/item/thoughts-on-reducer.org
#+TITLE: Thoughts on the reducer
#+AUTHOR: Durand
#+DATE: <2023-06-05 Mon 22:08>

This module implements a data type for storing reduction
information.

* Questions

What is reduction information, and why is it necessary to store it?

* Nullable-closure process

In the process of running the chain-rule machine, we collapse edges
in the following sense: if there is an edge from node a to node b and
another from node b to node c, and if the edge from node a to node b
is "nullable", that is, if it corresponds to a position in the atomic
language that the atomic language deems nullable, then we also add an
edge from node a to node c.
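As a sketch, this collapsing step can be written as a small
fixed-point computation.  The graph representation and all names here
are illustrative assumptions, not the actual types of this crate:

#+begin_src rust
use std::collections::{HashMap, HashSet};

/// Hypothetical node identifier; the real crate's types differ.
type Node = usize;

/// Adjacency lists per node; the bool marks an edge as "nullable".
#[derive(Default)]
struct Graph {
    edges: HashMap<Node, Vec<(Node, bool)>>,
}

impl Graph {
    fn add_edge(&mut self, from: Node, to: Node, nullable: bool) {
        self.edges.entry(from).or_default().push((to, nullable));
    }

    fn has_edge(&self, from: Node, to: Node) -> bool {
        self.edges
            .get(&from)
            .map_or(false, |outs| outs.iter().any(|&(t, _)| t == to))
    }

    /// Whenever a nullable edge a -> b and any edge b -> c exist,
    /// also add a -> c, repeating until nothing new appears.
    fn nullable_closure(&mut self) {
        loop {
            let mut fresh: HashSet<(Node, Node, bool)> = HashSet::new();
            for (&a, outs) in &self.edges {
                for &(b, nullable) in outs {
                    if !nullable {
                        continue;
                    }
                    for &(c, cn) in self.edges.get(&b).into_iter().flatten() {
                        if !self.has_edge(a, c) {
                            // The new edge inherits the nullability of
                            // b -> c, so chains of nullable edges
                            // collapse fully.
                            fresh.insert((a, c, cn));
                        }
                    }
                }
            }
            if fresh.is_empty() {
                break;
            }
            for (a, c, n) in fresh {
                self.add_edge(a, c, n);
            }
        }
    }
}

fn main() {
    let mut g = Graph::default();
    g.add_edge(0, 1, true); // a -> b, nullable
    g.add_edge(1, 2, false); // b -> c
    g.nullable_closure();
    assert!(g.has_edge(0, 2)); // the collapsed edge a -> c
    println!("closure ok");
}
#+end_src

Letting the new edge inherit the nullability of the edge from b to c
is an assumption of this sketch, made so that chains of nullable edges
collapse in finitely many rounds.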

The purpose of forming this nullable closure is to save the
chain-rule machine the time of traversing the entire machine later on
to find the nullable closure.  A side-effect of this process,
however, is that reductions are no longer explicitly marked.

To explain this in detail, we first investigate what reduction is and
what it means in the chain-rule machine, and then explain the problem.

* Three types of actions

We can imagine a "traditional parser generator" as a stack machine:
there are three types of actions associated with it, chosen according
to the current state and the current input token.  The first is
expansion: we expand from a non-terminal according to some rule.  The
second is a normal push: we simply continue parsing according to the
current rule.  The final one is reduction: the current expansion from
a non-terminal has terminated, and we go back to the previous level
of expansion.

* Relation to the chain-rule machine

For our chain-rule machine, expansion means creating a new node that
points at the current node, extending the current path by one.  A
normal push means creating a new node that points to the target of an
edge going out from the current node, where that edge was *not*
created by the process of forming nullable closures.  And reduction
means creating a new node that points to the target of an edge going
out from the current node, where that edge *was* created by the
process of forming nullable closures.
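Under these definitions, the three actions can be sketched as an
enumeration, with a reduction being exactly a step along a
closure-created edge.  The names ~EdgeOrigin~, ~Action~ and
~classify~ are invented for illustration:

#+begin_src rust
/// Provenance of an edge in the chain-rule machine.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum EdgeOrigin {
    Original,       // present before forming nullable closures
    ClosureCreated, // added by the nullable-closure process
}

/// The three action kinds of the stack-machine picture.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Action {
    Expansion,  // new node pointing at the current node
    NormalPush, // follow an original edge
    Reduction,  // follow a closure-created edge
}

/// Classify a step: `expanding` is true when we start a new rule;
/// otherwise the action depends only on the followed edge's origin.
fn classify(expanding: bool, origin: EdgeOrigin) -> Action {
    if expanding {
        Action::Expansion
    } else {
        match origin {
            EdgeOrigin::Original => Action::NormalPush,
            EdgeOrigin::ClosureCreated => Action::Reduction,
        }
    }
}

fn main() {
    assert_eq!(classify(true, EdgeOrigin::Original), Action::Expansion);
    assert_eq!(classify(false, EdgeOrigin::Original), Action::NormalPush);
    assert_eq!(classify(false, EdgeOrigin::ClosureCreated), Action::Reduction);
    println!("classified");
}
#+end_src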

* Problem

As can be seen from the previous section, the distinction between a
normal push and a reduction in the chain-rule machine is simply
whether or not the followed edge was created by the process of
forming nullable closures.  For the chain-rule machine itself this
does not matter: it functions well either way.  For the formation of
the derivation forest, however, it is a problem: we cannot
immediately read off from the chain-rule machine whether or not to
perform reductions.

* Solution

Since we cannot read off the reduction information directly from the
chain-rule machine, we have to store that information somehow.

** Digression: past experiences

When developing this part, I did not have a clear picture of the
situation in mind: I was experimenting with ways to construct the
derivation forest from the chain-rule machine, as the paper
describing the algorithm does not construct the derivation forest
directly: it constructs an alternative format.  As a consequence, I
spent quite some time figuring out how to construct derivation
forests correctly.

During the experiments, I tried various ideas, including "filling in"
the reductions after we have parsed the entire input.  This seemed
like a good idea on the surface, as I appeared to be able to "read
off" the reductions from the resulting partially complete derivation
forests.

As it turned out, this was actually a bad idea.  In fact, I can now
prove that it will not work, by exhibiting a specific grammar that
definitely causes this approach to fail.  This led me to believe that
the only way to fill this gap is to store the needed reduction
information.

** Storing reduction information

Now we want to store the reduction information, so we need to be clear
about what we are storing.

In the derivation forest, a reduction happens when we reach the
rightmost descendant of a subtree, which marks the end of an
expansion from a non-terminal.  Moreover, our derivation forests
usually contain half-open nodes, meaning nodes that are not yet
completed and can keep growing, until a reduction happens and we
"close" the half-open node.  Thus, what we really need is the list of
nodes to close.

This information is readily available when we form nullable closures:
the skipped nodes are exactly the nodes we need to close later on.
More precisely, each skip involves two nodes, the top node and the
bottom node, and we want to store the middle node that was skipped
between them.

This naturally led to the structure of a nondeterministic finite
automaton: when we want to close nodes, we start from the bottom-most
node and query the nodes upward by the top node, until we reach the
top node or until there are no more nodes to visit.
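A minimal sketch of this idea, with invented names and plain hash
maps in place of the crate's actual graph types: each (bottom, top)
pair remembers its skipped middle nodes, and closing walks upward
from the bottom node by the top-node label:

#+begin_src rust
use std::collections::{HashMap, HashSet};

/// Hypothetical node identifier.
type Node = usize;

/// Store of reduction information: for each bottom node, a map from
/// top-node labels to the skipped middle nodes.  Several middle
/// nodes may share a (bottom, top) pair, hence the nondeterminism.
#[derive(Default)]
struct Reducer {
    skipped: HashMap<Node, HashMap<Node, HashSet<Node>>>,
}

impl Reducer {
    /// Record that `middle` was skipped between `top` and `bottom`.
    fn record(&mut self, bottom: Node, top: Node, middle: Node) {
        self.skipped
            .entry(bottom)
            .or_default()
            .entry(top)
            .or_default()
            .insert(middle);
    }

    /// Collect the nodes to close: start from `bottom`, repeatedly
    /// query upward by the label `top`, and stop at `top` or when no
    /// stored node remains.
    fn nodes_to_close(&self, bottom: Node, top: Node) -> Vec<Node> {
        let mut result = Vec::new();
        let mut frontier = vec![bottom];
        let mut seen = HashSet::new();
        while let Some(node) = frontier.pop() {
            if node == top || !seen.insert(node) {
                continue;
            }
            if let Some(mids) =
                self.skipped.get(&node).and_then(|by| by.get(&top))
            {
                for &m in mids {
                    result.push(m);
                    frontier.push(m); // keep walking upward
                }
            }
        }
        result
    }
}

fn main() {
    let mut r = Reducer::default();
    // top node 0, with middles 2 and 1 skipped above bottom 3
    r.record(3, 0, 2);
    r.record(2, 0, 1);
    let close = r.nodes_to_close(3, 0);
    assert!(close.contains(&2) && close.contains(&1));
    println!("{} nodes to close", close.len());
}
#+end_src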

The special characteristic of this structure is that we need to label
both the vertices and the edges.  Since we already have a structure
that labels edges, we can simply extend that structure by adding an
array of vertex labels, and possibly a map from labels to nodes, to
make sure the vertex labels are unique and can be queried quickly.
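A sketch of such an extension, assuming a simple adjacency-list
representation rather than the crate's actual edge-labelled
structure: the vertex labels live in an array indexed by vertex
number, and a reverse map enforces uniqueness and gives fast lookup:

#+begin_src rust
use std::collections::HashMap;

/// Edge-labelled graph extended with unique vertex labels.  All
/// names here are illustrative.
struct LabelledGraph<VL: Clone + Eq + std::hash::Hash, EL> {
    /// Existing edge-labelled part: per-vertex outgoing (label, target).
    edges: Vec<Vec<(EL, usize)>>,
    /// New: label of each vertex, indexed by vertex number.
    vertex_labels: Vec<VL>,
    /// New: unique label -> vertex, for fast queries.
    label_to_vertex: HashMap<VL, usize>,
}

impl<VL: Clone + Eq + std::hash::Hash, EL> LabelledGraph<VL, EL> {
    fn new() -> Self {
        LabelledGraph {
            edges: Vec::new(),
            vertex_labels: Vec::new(),
            label_to_vertex: HashMap::new(),
        }
    }

    /// Add a vertex; fails if the label is already taken, keeping
    /// vertex labels unique.
    fn add_vertex(&mut self, label: VL) -> Result<usize, String> {
        if self.label_to_vertex.contains_key(&label) {
            return Err("duplicate vertex label".to_string());
        }
        let index = self.edges.len();
        self.edges.push(Vec::new());
        self.vertex_labels.push(label.clone());
        self.label_to_vertex.insert(label, index);
        Ok(index)
    }

    /// Label of a vertex, by index.
    fn label_of(&self, vertex: usize) -> &VL {
        &self.vertex_labels[vertex]
    }

    /// Vertex carrying a label, if any.
    fn vertex_by_label(&self, label: &VL) -> Option<usize> {
        self.label_to_vertex.get(label).copied()
    }
}

fn main() {
    let mut g: LabelledGraph<&str, u8> = LabelledGraph::new();
    let a = g.add_vertex("a").unwrap();
    assert_eq!(g.vertex_by_label(&"a"), Some(a));
    assert_eq!(g.label_of(a), &"a");
    assert!(g.add_vertex("a").is_err()); // labels stay unique
    println!("label lookup ok");
}
#+end_src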

One thing to note is that we actually only need the ability to query
children by their labels; we do not need to query labels by the
target.  We need experiments to see whether this will be a bottleneck
that needs to be optimized away.