forest/src/design.org


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141

#+TITLE: Format of a binarised shared packed parse forests
#+AUTHOR: Durand
#+DATE: <2023-01-05 Jeu 23:55>
#+STARTUP: content

* Introduction

I want to design a format for shared packed parse forests.
They are needed for the chain-rule machine because the graphs need to
be labelled (indirectly) by fragments of forests so that we can
correctly and efficiently compute the semi-ring values along the
chain-rule machine operations.  Moreover, what we are really
interested is the parse forests, so we will need to deal with this
topic sooner or later, why not deal with it now?  ;P

* Principles

- The number of children of nodes is bounded by a constant, in theory.
  So it is only bounded by the computer hardware (and the grammar):
  this representation is closer to the grammar rules.
- Labels of nodes are grammar slots, that is positions within the
  rules.
- Edges have no labels: they are simply not needed.
- We need to have two lists of "plugs" that we can plug other forests
  in.  For this we also need to know if a label is labelled on a node
  that is a plug.  This implies the use of hashmaps I guess.
- When there are alternations in the forest, we use a "packed node" to
  represent this alternation.  Packed nodes are not labelled.  They
  just serve the role of putting nodes together.

* Some thoughts

We do not need to attach forest fragments to nodes of nondeterministic
finite automata: we just attach to them lists of grammar slots, which
represent a sequence of "nulling out" nullable non-terminals, until
the main non-terminal appears as the last element of the list (or even
a range of slots to be even more efficient).  A normal node would have
a range with one element.

That is it!  We do not need to worry about the forests now as we would
know how to combine the forest segments when we are performing the
chain-rule operations: we know exactly when we are expanding
non-terminals.

Some caveats:

- Every node in the forest needs to know its parents.  This is needed
  for "reductions".  That is, after we have parsed the entire
  expansion for a non-terminal, we need to go back to where that
  non-terminal was and continue from there.
- We need to record the expansions of those "virtual nodes".  That is,
  the virtual nodes represent atomic languages such as [s]^{(t)} where s
  is a non-terminal and t is a terminal.  To be more specific, it
  represents the derivative of the left-linear closure of the
  non-terminal s with respect to the terminal t.  The left-linear
  closure of s is the set of all strings (consisting of terminals and
  non-terminals alike) that are derivable from s alone, without
  consuming any inputs, by expanding according to the grammar rules.
  This means that we need to know if which non-terminals were expanded
  in order to get to a state in [s]^{(t)}.
- Each edge in the chain-rule machine needs to be labelled also with a
  position in the forest.  This perfectly solves the problem of those
  "plugs"!

* Formal Definition

We need to figure out the definition of the forest.  That it to say,
what will the resulting forest look like for a grammar.

Moreover, we need to know how to correspond each step in the
chain-rule machine with the definition of the forest.

After these two steps, we can easily proceed with the construction of
the chain-rule machine.

** Axioms

There are two *axioms* from which the definition of the forests
follows.

1. Each node has a unique label, given (principally) by three
   components:
   1) input range start
   2) input range end
   3) grammar label
2. If different subtrees share the same node label as the root, a
   "clone" of the root should be made, and there should be a
   "representative" of all clones, which is treated as a normal node
   with that label, and all clones are its children.

The first axiom ensures that the size of the forest is bounded by the
cube of \(n\), \(n\) is the length of the input sequence.  And the
second axiom specifies what to do when subtrees share a parent.

Since the clones are introduced only when subtrees share a parent,
their number cannot exceed the number of nodes of a forest without
clones.  This shows that adding clones does not affect the growth rate
of the size of forests.

A remark: the /clones/ are called /packed nodes/ in the traditional
terminology, but the terminology of a clone makes more sense to me.

** Correspondence with chain-rule machine

Every edge in the chain-rule machine corresponds to an edge in the
forest; denote this correspondence by \(f\).

The chain-rule operation can be described as follows:

1. Start with an edge \(e\) with children \(d_1, \ldots, d_n\) in the
   chain-rule machine.
2. Prepare a new forest fragment as follows.
   1) For every child \(g\) of \(e\) in the atomic language, if \(g\)
      is the terminal that appears as the current input \(t\), let the
      new fragment be defined as the node moved by \(g\), alone.
   2) If \(g\) is a non-terminal and its first-set contains \(t\),
      then for every left-linearly closed child of \(g\) that is
      labelled \(t\), denoted as \(h\), then let the fragment be
      defined as the node moved by \(g\), with the pre-recorded
      left-linear expansion information from \(g\) to \(h\) appended
      as children.
3. For \(i=1,\ldots,n\), consider the edge \(f(d_i)\) in the forest.
   There are three cases:
   1) If the next edge is empty, that means we have found something
      totally new to add to the forest, so we just add that.
   2) If the next edge after \(f(e)\) contains the new fragment as a
      sub-tree already, this means the current branch has already been
      explored before, and we have nothing else to do.
   3) If the next edge is non-empty and does not match the new
      fragment, this means we have a conflict with a previously found
      branch.  We solve it simply by following the second axiom above:
      we make the node that is the source of \(f(e)\) a clone, and add
      the new fragment under a new clone.  Note that if the node is
      already a clone, we just make a clone under the parent of that
      clone.
4. Update the correspondence \(f\) by updating \(f(g)\) and \(f(h)\)
   accordingly.

In this process we always follow the first axiom by ensuring that the
labels of the forest are unique to each node, except for those clones,
which are only created by following the second axiom.