forest/src/design.org


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66

#+TITLE: Format of a binarised shared packed parse forests
#+AUTHOR: Durand
#+DATE: <2023-01-05 Jeu 23:55>
#+STARTUP: content

* Introduction

I want to design a format for shared packed parse forests.
They are needed for the chain-rule machine because the graphs need to
be labelled (indirectly) by fragments of forests so that we can
correctly and efficiently compute the semi-ring values along the
chain-rule machine operations.  Moreover, what we are really
interested is the parse forests, so we will need to deal with this
topic sooner or later, why not deal with it now?  ;P

* Principles

- The number of children of nodes is bounded by a constant, in theory.
  So it is only bounded by the computer hardware (and the grammar):
  this representation is closer to the grammar rules.
- Labels of nodes are grammar slots, that is positions within the
  rules.
- Edges have no labels: they are simply not needed.
- We need to have two lists of "plugs" that we can plug other forests
  in.  For this we also need to know if a label is labelled on a node
  that is a plug.  This implies the use of hashmaps I guess.
- When there are alternations in the forest, we use a "packed node" to
  represent this alternation.  Packed nodes are not labelled.  They
  just serve the role of putting nodes together.

* Some thoughts [1/3]

We do not need to attach forest fragments to nodes of nondeterministic
finite automata: we just attach to them lists of grammar slots, which
represent a sequence of "nulling out" nullable non-terminals, until
the main non-terminal appears as the last element of the list (or even
a range of slots to be even more efficient).  A normal node would have
a range with one element.

That is it!  We do not need to worry about the forests now as we would
know how to combine the forest segments when we are performing the
chain-rule operations: we know exactly when we are expanding
non-terminals.

Some caveats:

- [X] Every node in the forest needs to know its parents.  This is
  needed for "reductions".  That is, after we have parsed the entire
  expansion for a non-terminal, we need to go back to where that
  non-terminal was and continue from there.
- [ ] We need to record the expansions of those "virtual nodes".  That
  is, the virtual nodes represent atomic languages such as [s]^{(t)}
  where s is a non-terminal and t is a terminal.  To be more specific,
  it represents the derivative of the left-linear closure of the
  non-terminal s with respect to the terminal t.  The left-linear
  closure of s is the set of all strings (consisting of terminals and
  non-terminals alike) that are derivable from s alone, without
  consuming any inputs, by expanding according to the grammar rules.
  This means that we need to know if which non-terminals were expanded
  in order to get to a state in [s]^{(t)}.
- [ ] Each edge in the chain-rule machine needs to be labelled also
  with a position in the forest.  This perfectly solves the problem of
  those "plugs"!