#ifndef BSR_H #define BSR_H /* This file implements the "Binary Subtree Representation" of a derivation forest. See the following paper for the miraculous details: Derivation representation using binary subtree sets by Elizabeth Scott, Adrian Johnstone and L.Thomas van Binsbergen. */ /* Note that there is an implementation of this GLL algorithm in the following GitHub repository: https://github.com/goccmack/gogll But the implementation there uses array look-up to determine if a BSR has already been added to a set. I guess the reason is that this is not the bottleneck of performance in its applications. But I am not satisfied with that solution. I want to take advantage of hash-table's almost constant-time look-up feature to implement this. */ /* A BSR set has two types. The first type is of the form (X := alpha, i, j, k), where X := alpha is a grammar rule. This is equivalent with a 5-tuple (X, a, i, j, k), where X is the integer corresponding to X and a is the index of the rule X := alpha. The second type is of the form (X := beta . gamma, i, j, k), where beta is a prefix of the rule X := beta gamma. So this is equivalent with a 6-tuple (X, a, b, i, j, k), where a is the index of the rule X := beta gamma and b is the length of beta. The simplest way to implement this is to have two "sets", one for each type. Each set is an array of arrays, where the root array's indices correspond to non-terminals, and the leaf arrays correspond to rules of that non-terminal. Then the rest is implemented through hash-tables, whose keys are tuples of integers. For the first type, the keys are the (i, k) pairs and the values are those j. For the second, the keys are the (b, i, k) tupels and the values are those j. The reason to use i and k as keys is that they represent the left and the right extent of the BSR respectively, so we index them in order to find them easily. In addition, this will be useful if we want to extract SPPF out of a BSR set. I don't know if there are better implementations. If I think of better ways in the future, I will re-implement the BSR sets. */ /* Note that we also need to implement "process descriptors" for the GLL parser. Since a process descriptor is of the form (X := beta . gamma, i, j, k), where X := beta gamma is a rule and i is the length of beta, we can represent a process descriptor as (X, a, i, j, k), i.e. as a first-type BSR set. But the meaning of the "i" is different. However, we shall notice that there is a natural "grading" on the process descriptors. The k-th grading consists of the descriptors with the last element in the tuple equal to k. In the clustered nonterminal parsing algorithm, the process descriptors that are added as a consequence of dealing with a process descriptor of grading k must be of grade greater than or equal to k. This means that if all process descriptors of grade k have been processed and the descriptors awaiting to be processed all have grades > k, then we won't see any process descriptors of grade <= k again. So we can actually keep a list of process descriptors that serve the role of the set U_k in the afore-mentionned paper. This list corresponds to the grading of the descriptors. This way we don't need to store the k in the descriptor. Therefore a process descriptor is represented as (X, a, i, j). Again we store an array of arrays corresponding to X and a. Thus the hash table simply has keys i and values j. The main benefit of this is that after we find that all process descriptors in R_k have been processed, we can free those descriptors in U_k. I think this is a good way to lower the space requirements of this algorithm. */ int place(int x); #endif