#+TITLE: Format of binarised shared packed parse forests
#+AUTHOR: Durand
#+DATE: <2023-01-05 Thu 23:55>
#+STARTUP: content

* Introduction

I want to design a format for shared packed parse forests. They are needed for the chain-rule machine because the graphs need to be labelled (indirectly) by fragments of forests, so that we can correctly and efficiently compute the semi-ring values along the chain-rule machine operations.

Moreover, what we are really interested in is the parse forests, so we will need to deal with this topic sooner or later; why not deal with it now? ;P

* Principles

- In theory, the number of children of a node is bounded by a constant; in practice it is only bounded by the computer hardware (and the grammar). This representation is closer to the grammar rules.
- Labels of nodes are grammar slots, that is, positions within the rules.
- Edges have no labels: they are simply not needed.
- We need to have two lists of "plugs" into which we can plug other forests. For this we also need to know whether a label is attached to a node that is a plug. This implies the use of hash maps, I guess.
- When there are alternations in the forest, we use a "packed node" to represent the alternation. Packed nodes are not labelled; they just serve the role of putting nodes together.

* Some thoughts

We do not need to attach forest fragments to nodes of nondeterministic finite automata: we just attach to them lists of grammar slots, which represent a sequence of "nulling out" nullable non-terminals, until the main non-terminal appears as the last element of the list (or even a range of slots, to be even more efficient). A normal node would have a range with one element. That is it! We do not need to worry about the forests now, as we will know how to combine the forest segments when performing the chain-rule operations: we know exactly when we are expanding non-terminals.

Some caveats (a rough data-structure sketch follows the list):

- Every node in the forest needs to know its parents. This is needed for "reductions". That is, after we have parsed the entire expansion of a non-terminal, we need to go back to where that non-terminal was and continue from there.
- We need to record the expansions of those "virtual nodes". That is, the virtual nodes represent atomic languages such as \([s]^{(t)}\), where \(s\) is a non-terminal and \(t\) is a terminal. To be more specific, such a node represents the derivative of the left-linear closure of the non-terminal \(s\) with respect to the terminal \(t\). The left-linear closure of \(s\) is the set of all strings (consisting of terminals and non-terminals alike) that are derivable from \(s\) alone, without consuming any input, by expanding according to the grammar rules. This means that we need to know which non-terminals were expanded in order to get to a state in \([s]^{(t)}\).
- Each edge in the chain-rule machine also needs to be labelled with a position in the forest. This perfectly solves the problem of those "plugs"!
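The points above suggest a node layout roughly like the following. This is only a minimal sketch, assuming a Rust implementation; every name in it (=GrammarSlot=, =plugs_in=, =plug_labels=, and so on) is hypothetical and reflects only the principles and caveats listed so far, not the formal definition given later.

#+begin_src rust
use std::collections::{HashMap, HashSet};

/// A position within a grammar rule (a "grammar slot").  The concrete
/// representation here, a rule index plus an offset, is an assumption.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct GrammarSlot {
    pub rule: usize,
    pub offset: usize,
}

/// Index of a node in the forest.
pub type NodeIndex = usize;

pub struct Node {
    /// `None` for a packed node (packed nodes carry no label);
    /// otherwise a range of grammar slots, of length one for a normal
    /// node and longer when nullable non-terminals have been nulled out.
    pub label: Option<(GrammarSlot, GrammarSlot)>,
    /// Every node knows its parents, so that "reductions" can go back
    /// to where a non-terminal was expanded.
    pub parents: Vec<NodeIndex>,
    pub children: Vec<NodeIndex>,
}

pub struct Forest {
    pub nodes: Vec<Node>,
    /// The two lists of "plugs" where other forests can be plugged in.
    pub plugs_in: Vec<NodeIndex>,
    pub plugs_out: Vec<NodeIndex>,
    /// Answers whether a given label sits on a node that is a plug.
    pub plug_labels: HashMap<GrammarSlot, HashSet<NodeIndex>>,
}
#+end_src

Whether the plugs end up as plain vectors or as maps keyed by label is left open; the hash map above is only there to make the "is this label on a plug?" test mentioned in the principles cheap.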
* Formal Definition

We need to figure out the definition of the forest. That is to say, what will the resulting forest look like for a given grammar? Moreover, we need to know how each step in the chain-rule machine corresponds to the definition of the forest. After these two steps, we can easily proceed with the construction of the chain-rule machine.

** Axioms

There are two *axioms* from which the definition of the forests follows.

1. Each node has a unique label, given (principally) by three components:
   1) input range start
   2) input range end
   3) grammar label
2. If different subtrees share the same node label at the root, a "clone" of the root should be made for each of them, and there should be a "representative" of all the clones, which is treated as a normal node with that label; all clones are its children.

The first axiom ensures that the size of the forest is bounded by the cube of \(n\), where \(n\) is the length of the input sequence. The second axiom specifies what to do when subtrees share a parent. Since clones are introduced only when subtrees share a parent, their number cannot exceed the number of nodes of a forest without clones. This shows that adding clones does not affect the growth rate of the size of forests.

A remark: the /clones/ are called /packed nodes/ in the traditional terminology, but the terminology of clones makes more sense to me.

** Correspondence with the chain-rule machine

Every edge in the chain-rule machine corresponds to an edge in the forest; denote this correspondence by \(f\). The chain-rule operation can be described as follows:

1. Start with an edge \(e\) with children \(d_1, \ldots, d_n\) in the chain-rule machine.
2. Prepare a new forest fragment as follows.
   1) For every child \(g\) of \(e\) in the atomic language, if \(g\) is the terminal that appears as the current input \(t\), let the new fragment be defined as the node moved by \(g\), alone.
   2) If \(g\) is a non-terminal whose first-set contains \(t\), then for every left-linearly closed child of \(g\) that is labelled \(t\), denoted \(h\), let the fragment be defined as the node moved by \(g\), with the pre-recorded left-linear expansion information from \(g\) to \(h\) appended as children.
3. For \(i = 1, \ldots, n\), consider the edge \(f(d_i)\) in the forest. There are three cases (see the code sketch at the end of this section):
   1) If the next edge is empty, we have found something totally new to add to the forest, so we just add it.
   2) If the next edge after \(f(e)\) already contains the new fragment as a sub-tree, the current branch has already been explored before, and we have nothing else to do.
   3) If the next edge is non-empty and does not match the new fragment, we have a conflict with a previously found branch. We solve it simply by following the second axiom above: we make the node that is the source of \(f(e)\) a clone, and add the new fragment under a new clone. Note that if the node is already a clone, we just make a clone under the parent of that clone.
4. Update the correspondence \(f\) by updating \(f(g)\) and \(f(h)\) accordingly.

In this process we always follow the first axiom by ensuring that labels are unique to each node in the forest, except for the clones, which are only created by following the second axiom.
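To make the three-way case analysis of step 3 concrete, here is a minimal sketch in Rust. It is not the implementation: the label layout follows the first axiom, but the =ForestOps= trait and every method on it (=edge_after=, =contains_subtree=, =demote_to_clone=, and so on) are hypothetical names invented for this illustration.

#+begin_src rust
/// A node label as in the first axiom: input range start, input range
/// end, and a grammar label.  The concrete field types are assumptions.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct Label {
    pub start: usize,
    pub end: usize,
    pub grammar_label: usize,
}

pub type NodeIndex = usize;

/// Operations the forest is assumed to provide; all names are
/// illustrative only.
pub trait ForestOps {
    /// The node reached by the next forest edge after `node`, if any.
    fn edge_after(&self, node: NodeIndex) -> Option<NodeIndex>;
    /// Whether `fragment` already occurs as a sub-tree under `node`.
    fn contains_subtree(&self, node: NodeIndex, fragment: NodeIndex) -> bool;
    /// Whether `node` is a clone.
    fn is_clone(&self, node: NodeIndex) -> bool;
    /// The representative above a clone, i.e. its parent.
    fn representative_of(&self, clone: NodeIndex) -> NodeIndex;
    /// Make `node` a clone and return the representative created above
    /// it, which now carries the original label.
    fn demote_to_clone(&mut self, node: NodeIndex) -> NodeIndex;
    /// Create a fresh clone under `representative`.
    fn new_clone(&mut self, representative: NodeIndex) -> NodeIndex;
    /// Add an edge from `parent` to `child`.
    fn add_edge(&mut self, parent: NodeIndex, child: NodeIndex);
}

/// Step 3 of the chain-rule operation: attach a freshly prepared
/// `fragment` under `parent`, the source node of f(e).
pub fn attach_fragment<F: ForestOps>(
    forest: &mut F,
    parent: NodeIndex,
    fragment: NodeIndex,
) {
    match forest.edge_after(parent) {
        // Case 1: the next edge is empty, so the fragment is genuinely
        // new and we simply add it.
        None => forest.add_edge(parent, fragment),
        // Case 2: the fragment is already a sub-tree there, so this
        // branch has been explored before and nothing needs to be done.
        Some(next) if forest.contains_subtree(next, fragment) => {}
        // Case 3: a conflicting branch exists.  Follow the second
        // axiom: if `parent` is already a clone, reuse its
        // representative; otherwise demote `parent` to a clone, which
        // yields a representative.  Then hang the new fragment under a
        // fresh clone of that representative.
        Some(_) => {
            let representative = if forest.is_clone(parent) {
                forest.representative_of(parent)
            } else {
                forest.demote_to_clone(parent)
            };
            let clone = forest.new_clone(representative);
            forest.add_edge(clone, fragment);
        }
    }
}
#+end_src

Whether the check in case 2 compares whole sub-trees or just labels, and whether the update of \(f\) in step 4 happens inside this routine or in its caller, is deliberately left open; the sketch is only meant to pin down the case analysis and the clone handling demanded by the second axiom.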