summaryrefslogtreecommitdiff
path: root/forest/src/design.org
diff options
context:
space:
mode:
authorJSDurand <mmemmew@gmail.com>2023-01-06 23:42:28 +0800
committerJSDurand <mmemmew@gmail.com>2023-01-06 23:42:28 +0800
commitf27d604d93ce583d4404e1874664e08382ea2f00 (patch)
tree6fa8df26af954e94f3604ffabde4961ee8108c41 /forest/src/design.org
parent7dd4935230e303aef8d295d992239d59d95b32d7 (diff)
Save before system restart.
I am about to re-start my system, so I save before any crashes happen.
Diffstat (limited to 'forest/src/design.org')
-rw-r--r--forest/src/design.org66
1 files changed, 66 insertions, 0 deletions
diff --git a/forest/src/design.org b/forest/src/design.org
new file mode 100644
index 0000000..771ca4b
--- /dev/null
+++ b/forest/src/design.org
@@ -0,0 +1,66 @@
+#+TITLE: Format of a binarised shared packed parse forests
+#+AUTHOR: Durand
+#+DATE: <2023-01-05 Jeu 23:55>
+#+STARTUP: content
+
+* Introduction
+
+I want to design a format for shared packed parse forests.
+They are needed for the chain-rule machine because the graphs need to
+be labelled (indirectly) by fragments of forests so that we can
+correctly and efficiently compute the semi-ring values along the
+chain-rule machine operations. Moreover, what we are really
+interested is the parse forests, so we will need to deal with this
+topic sooner or later, why not deal with it now? ;P
+
+* Principles
+
+- The number of children of nodes is bounded by a constant, in theory.
+ So it is only bounded by the computer hardware (and the grammar):
+ this representation is closer to the grammar rules.
+- Labels of nodes are grammar slots, that is positions within the
+ rules.
+- Edges have no labels: they are simply not needed.
+- We need to have two lists of "plugs" that we can plug other forests
+ in. For this we also need to know if a label is labelled on a node
+ that is a plug. This implies the use of hashmaps I guess.
+- When there are alternations in the forest, we use a "packed node" to
+ represent this alternation. Packed nodes are not labelled. They
+ just serve the role of putting nodes together.
+
+* Some thoughts [1/3]
+
+We do not need to attach forest fragments to nodes of nondeterministic
+finite automata: we just attach to them lists of grammar slots, which
+represent a sequence of "nulling out" nullable non-terminals, until
+the main non-terminal appears as the last element of the list (or even
+a range of slots to be even more efficient). A normal node would have
+a range with one element.
+
+That is it! We do not need to worry about the forests now as we would
+know how to combine the forest segments when we are performing the
+chain-rule operations: we know exactly when we are expanding
+non-terminals.
+
+Some caveats:
+
+- [X] Every node in the forest needs to know its parents. This is
+ needed for "reductions". That is, after we have parsed the entire
+ expansion for a non-terminal, we need to go back to where that
+ non-terminal was and continue from there.
+- [ ] We need to record the expansions of those "virtual nodes". That
+ is, the virtual nodes represent atomic languages such as [s]^{(t)}
+ where s is a non-terminal and t is a terminal. To be more specific,
+ it represents the derivative of the left-linear closure of the
+ non-terminal s with respect to the terminal t. The left-linear
+ closure of s is the set of all strings (consisting of terminals and
+ non-terminals alike) that are derivable from s alone, without
+ consuming any inputs, by expanding according to the grammar rules.
+ This means that we need to know if which non-terminals were expanded
+ in order to get to a state in [s]^{(t)}.
+- [ ] Each edge in the chain-rule machine needs to be labelled also
+ with a position in the forest. This perfectly solves the problem of
+ those "plugs"!
+
+
+