summaryrefslogtreecommitdiff
path: root/forest/src/design.org
diff options
context:
space:
mode:
authorJSDurand <mmemmew@gmail.com>2023-01-11 23:47:26 +0800
committerJSDurand <mmemmew@gmail.com>2023-01-11 23:47:26 +0800
commit1a3d346f413325ed37848a6b2526e8e729269833 (patch)
treeab8812f8094d096c68aee53cf516e986cc9a273a /forest/src/design.org
parentf27d604d93ce583d4404e1874664e08382ea2f00 (diff)
Record left-linear expansion and forest format
Now the grammar will record the left-linear expansions when generating the nondeterministic finite automaton frmo its rules, and will record whether an edge in the nondeterministic finite automaton comes from a left-linear expansion. The latter is needed because while performing a chain-rule derivation, we do not need the left-linear expanded derivations in the "first layer". This might well have been the root cause of the bad performance of the previous version of this package. Also I have figured out how to properly generate and handle parse forests while manipulating the "chain-rule machine".
Diffstat (limited to 'forest/src/design.org')
-rw-r--r--forest/src/design.org95
1 files changed, 85 insertions, 10 deletions
diff --git a/forest/src/design.org b/forest/src/design.org
index 771ca4b..09db113 100644
--- a/forest/src/design.org
+++ b/forest/src/design.org
@@ -28,7 +28,7 @@ topic sooner or later, why not deal with it now? ;P
represent this alternation. Packed nodes are not labelled. They
just serve the role of putting nodes together.
-* Some thoughts [1/3]
+* Some thoughts
We do not need to attach forest fragments to nodes of nondeterministic
finite automata: we just attach to them lists of grammar slots, which
@@ -44,23 +44,98 @@ non-terminals.
Some caveats:
-- [X] Every node in the forest needs to know its parents. This is
- needed for "reductions". That is, after we have parsed the entire
+- Every node in the forest needs to know its parents. This is needed
+ for "reductions". That is, after we have parsed the entire
expansion for a non-terminal, we need to go back to where that
non-terminal was and continue from there.
-- [ ] We need to record the expansions of those "virtual nodes". That
- is, the virtual nodes represent atomic languages such as [s]^{(t)}
- where s is a non-terminal and t is a terminal. To be more specific,
- it represents the derivative of the left-linear closure of the
+- We need to record the expansions of those "virtual nodes". That is,
+ the virtual nodes represent atomic languages such as [s]^{(t)} where s
+ is a non-terminal and t is a terminal. To be more specific, it
+ represents the derivative of the left-linear closure of the
non-terminal s with respect to the terminal t. The left-linear
closure of s is the set of all strings (consisting of terminals and
non-terminals alike) that are derivable from s alone, without
consuming any inputs, by expanding according to the grammar rules.
This means that we need to know if which non-terminals were expanded
in order to get to a state in [s]^{(t)}.
-- [ ] Each edge in the chain-rule machine needs to be labelled also
- with a position in the forest. This perfectly solves the problem of
- those "plugs"!
+- Each edge in the chain-rule machine needs to be labelled also with a
+ position in the forest. This perfectly solves the problem of those
+ "plugs"!
+* Formal Definition
+We need to figure out the definition of the forest. That it to say,
+what will the resulting forest look like for a grammar.
+Moreover, we need to know how to correspond each step in the
+chain-rule machine with the definition of the forest.
+
+After these two steps, we can easily proceed with the construction of
+the chain-rule machine.
+
+** Axioms
+
+There are two *axioms* from which the definition of the forests
+follows.
+
+1. Each node has a unique label, given (principally) by three
+ components:
+ 1) input range start
+ 2) input range end
+ 3) grammar label
+2. If different subtrees share the same node label as the root, a
+ "clone" of the root should be made, and there should be a
+ "representative" of all clones, which is treated as a normal node
+ with that label, and all clones are its children.
+
+The first axiom ensures that the size of the forest is bounded by the
+cube of \(n\), \(n\) is the length of the input sequence. And the
+second axiom specifies what to do when subtrees share a parent.
+
+Since the clones are introduced only when subtrees share a parent,
+their number cannot exceed the number of nodes of a forest without
+clones. This shows that adding clones does not affect the growth rate
+of the size of forests.
+
+A remark: the /clones/ are called /packed nodes/ in the traditional
+terminology, but the terminology of a clone makes more sense to me.
+
+** Correspondence with chain-rule machine
+
+Every edge in the chain-rule machine corresponds to an edge in the
+forest; denote this correspondence by \(f\).
+
+The chain-rule operation can be described as follows:
+
+1. Start with an edge \(e\) with children \(d_1, \ldots, d_n\) in the
+ chain-rule machine.
+2. Prepare a new forest fragment as follows.
+ 1) For every child \(g\) of \(e\) in the atomic language, if \(g\)
+ is the terminal that appears as the current input \(t\), let the
+ new fragment be defined as the node moved by \(g\), alone.
+ 2) If \(g\) is a non-terminal and its first-set contains \(t\),
+ then for every left-linearly closed child of \(g\) that is
+ labelled \(t\), denoted as \(h\), then let the fragment be
+ defined as the node moved by \(g\), with the pre-recorded
+ left-linear expansion information from \(g\) to \(h\) appended
+ as children.
+3. For \(i=1,\ldots,n\), consider the edge \(f(d_i)\) in the forest.
+ There are three cases:
+ 1) If the next edge is empty, that means we have found something
+ totally new to add to the forest, so we just add that.
+ 2) If the next edge after \(f(e)\) contains the new fragment as a
+ sub-tree already, this means the current branch has already been
+ explored before, and we have nothing else to do.
+ 3) If the next edge is non-empty and does not match the new
+ fragment, this means we have a conflict with a previously found
+ branch. We solve it simply by following the second axiom above:
+ we make the node that is the source of \(f(e)\) a clone, and add
+ the new fragment under a new clone. Note that if the node is
+ already a clone, we just make a clone under the parent of that
+ clone.
+4. Update the correspondence \(f\) by updating \(f(g)\) and \(f(h)\)
+ accordingly.
+
+In this process we always follow the first axiom by ensuring that the
+labels of the forest are unique to each node, except for those clones,
+which are only created by following the second axiom.