src/bsr.h


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107

#ifndef BSR_H
#define BSR_H

/* This file implements the "Binary Subtree Representation" of a
   derivation forest.  See the following paper for the miraculous
   details:

   Derivation representation using binary subtree sets
   
   by
   
   Elizabeth Scott, Adrian Johnstone and L.Thomas van Binsbergen. */

/* Note that there is an implementation of this GLL algorithm in the
   following GitHub repository:

   https://github.com/goccmack/gogll

   But the implementation there uses array look-up to determine if a
   BSR has already been added to a set.  I guess the reason is that
   this is not the bottleneck of performance in its applications.  But
   I am not satisfied with that solution.  I want to take advantage of
   hash-table's almost constant-time look-up feature to implement
   this. */

/* A BSR set has two types.

   The first type is of the form

   (X := alpha, i, j, k),

   where X := alpha is a grammar rule.  This is equivalent with a
   5-tuple

   (X, a, i, j, k),

   where X is the integer corresponding to X and a is the index of the
   rule X := alpha.

   The second type is of the form

   (X := beta . gamma, i, j, k),

   where beta is a prefix of the rule X := beta gamma.  So this is
   equivalent with a 6-tuple

   (X, a, b, i, j, k),

   where a is the index of the rule X := beta gamma and b is the
   length of beta.

   The simplest way to implement this is to have two "sets", one for
   each type.  Each set is an array of arrays, where the root array's
   indices correspond to non-terminals, and the leaf arrays correspond
   to rules of that non-terminal.  Then the rest is implemented
   through hash-tables, whose keys are tuples of integers.  For the
   first type, the keys are the (i, k) pairs and the values are those
   j.  For the second, the keys are the (b, i, k) tupels and the
   values are those j.  The reason to use i and k as keys is that they
   represent the left and the right extent of the BSR respectively, so
   we index them in order to find them easily.  In addition, this will
   be useful if we want to extract SPPF out of a BSR set.

   I don't know if there are better implementations.  If I think of
   better ways in the future, I will re-implement the BSR sets. */

/* Note that we also need to implement "process descriptors" for the
   GLL parser.  Since a process descriptor is of the form

   (X := beta . gamma, i, j, k),

   where X := beta gamma is a rule and i is the length of beta, we can
   represent a process descriptor as

   (X, a, i, j, k),

   i.e. as a first-type BSR set.  But the meaning of the "i" is
   different.

   However, we shall notice that there is a natural "grading" on the
   process descriptors.  The k-th grading consists of the descriptors
   with the last element in the tuple equal to k.  In the clustered
   nonterminal parsing algorithm, the process descriptors that are
   added as a consequence of dealing with a process descriptor of
   grading k must be of grade greater than or equal to k.  This means
   that if all process descriptors of grade k have been processed and
   the descriptors awaiting to be processed all have grades > k, then
   we won't see any process descriptors of grade <= k again.  So we
   can actually keep a list of process descriptors that serve the role
   of the set U_k in the afore-mentionned paper.  This list
   corresponds to the grading of the descriptors.  This way we don't
   need to store the k in the descriptor.  Therefore a process
   descriptor is represented as

   (X, a, i, j).

   Again we store an array of arrays corresponding to X and a.  Thus
   the hash table simply has keys i and values j.

   The main benefit of this is that after we find that all process
   descriptors in R_k have been processed, we can free those
   descriptors in U_k.  I think this is a good way to lower the space
   requirements of this algorithm. */

int place(int x);

#endif