contained in one of the ranges.  After this is finished, we can bump
the version to 0.1.3 and simplify the build process.  Hope it all
goes well.



======================================================================
	2023-07-23
======================================================================

I have encountered a problem in implementing the tokenization
handling.

To better separate concerns, this package only operates at the level
of tokens, which are an abstraction over inputs.  For example, if we
encounter an identifier in the source, we can replace that identifier
by a special token.  This way, the length of the input is greatly
reduced, and the parsing can be sped up, as the tokenization process,
which usually involves only regular expressions, is much faster than
the rest of the parsing.

The problem is that the same input slice can be classified as
multiple different tokens.  As an example, if a grammar has a token
matching arbitrary non-whitespace letters, and also a token matching
a specific character, like '*', then both tokens are valid at the
character '*'.  This can be regarded as a token at a deeper level
appearing as multiple different tokens at a shallower level.

Hence, a natural approach would be to let the tokens of the grammar
decide how to match against tokens of the input: the grammar and the
input use different token types.  This is fine as long as the user
does not come into play: the package can implement a way to match
grammar tokens against inputs.  However, if the user wants to
intervene in the tokenization process, then things become tricky.

The well-known tree-sitter parser generator (see
<https://tree-sitter.github.io/tree-sitter/> for more information)
differentiates tokens by a set of rules under which a given input can
be classified only as a single, unique token.  Basically, it resolves
the possibility of multiple tokens by precedence rules.  This is
consistent with its goal of providing parsers for grammars of
primarily programming languages, for which the number of derivations
of a given input is at most one: it does not accept multiple
derivations of the same input.

But this is not acceptable for this package: it is explicitly desired
to have multiple derivations recognized.

Moreover, if we look back at the original motivation behind the
tokenization process, we see that tokenization is only needed for two
reasons: speed and flexibility.

If we want tokenization for the sake of speed, then tokenization is
only an implementation detail and should not impose restrictions on
the resulting derivations.  From this perspective, every tokenization
possibility should be preserved.

On the other hand, if we tokenize the input on account of the
flexibility thus brought about, then there is no reason to discard a
tokenization possibility just because we have not found a way to
handle it: that hurts the flexibility a lot.

Consequently, from both points of view, we shall treat multiple
tokenization possibilities as if they were different grammar
derivations.  The only difference from a "regular" grammar derivation
is that they can be altered by the user, from which the flexibility
follows.  And this is the point: tokens sit on the boundary between
what is internal to the parsers and what is external to them.  Or,
put more simply, they should be a changeable gear in the parsers.
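As a concrete illustration of this ambiguity, below is a minimal
sketch in Rust.  The names GrammarToken, Word, Literal, and classify
are hypothetical stand-ins invented for this sketch, not the actual
types of this package; the point is only that a single slice such as
"*" can be accepted by more than one grammar token.

    /// Tokens as they appear in the grammar definition.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum GrammarToken {
        /// Matches any non-empty run of non-whitespace characters.
        Word,
        /// Matches exactly one specific character.
        Literal(char),
    }

    impl GrammarToken {
        /// Decide whether this grammar token accepts the input slice.
        fn matches(&self, input: &str) -> bool {
            match self {
                GrammarToken::Word => {
                    !input.is_empty()
                        && input.chars().all(|c| !c.is_whitespace())
                }
                GrammarToken::Literal(c) => {
                    let mut chars = input.chars();
                    chars.next() == Some(*c) && chars.next().is_none()
                }
            }
        }
    }

    /// Return every grammar token that accepts the slice.  The result
    /// can contain more than one element: exactly the ambiguity
    /// described above.
    fn classify(tokens: &[GrammarToken], input: &str) -> Vec<GrammarToken> {
        tokens.iter().copied().filter(|t| t.matches(input)).collect()
    }

    fn main() {
        let tokens = [GrammarToken::Word, GrammarToken::Literal('*')];
        // "*" is accepted by both Word and Literal('*').
        println!("{:?}", classify(&tokens, "*"));
        // "abc" is accepted only by Word.
        println!("{:?}", classify(&tokens, "abc"));
    }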
But things are not so easy: if we want to let users have a handle on
the tokenization, then we require extra input from the users.  This
is not only an added burden on the users, but also an additional
performance overhead, and I am not sure whether that overhead is
acceptable.

The situation is even worse: different tokenization choices can lead
to different segments of the input being tokenized.  For example, one
choice may break the input into the segments (Token(1), Span(0, 1)),
(Token(2), Span(1, 2)), (Token(3), Span(2, 3)), whereas another
choice may break the input into these segments instead: (Token(1),
Span(0, 1)), (Token(4), Span(1, 3)).

This is very hard for the chain-rule machine to track, as it operates
at the level of tokens.  In fact, after more thought, the chain-rule
machine is not the right place to handle these complexities.  After
all, this process has nothing to do with the chain-rule machine.  It
should be the responsibility of the application parser generator that
feeds the machine with appropriate input tokens.

Consequently, the tokenization process is best handled by "splitting
the input sequence" into multiple sequences of input tokens, if
multiple variations of tokens are desired.  Naturally, each variation
should be coupled with its own derivation forest, if that variation
is valid according to the grammar.

I think this is a rather plausible approach: multiple possibilities
of tokens are handled, and if only one variation is needed, no
additional overhead is imposed.

One point of caution is that we shall be able to "share states"
across variations.  This means the directed acyclic graph that is
used to recognize the input sequence should be maximally shared.

In my rough plan, this should be accomplished after we implement the
"context-free memoization" for the chain-rule machines.  This
memoization technique is really the central selling point of the
paper proposing this chain-rule machine parsing algorithm, at least
according to its author.

Currently, several steps are required to implement the memoization
technique.  The part on the directed acyclic graph is rather simple:
we simply have to break the graph into several chained pieces, each
of which represents a segment of the whole graph.  The reason for
this segmentation is that this way we can memoize the segments more
easily.

What is more challenging is the memoization of the derivation
forests.  This part differs somewhat from what the author proposed in
the paper.  The author suggested recording every derivation sequence
in the graph and memoizing the transformations on the sequences.  My
plan is instead to memoize the directed acyclic graph and the
transformations on the derivation forests.

So I need to figure out the memoization of two things: directed
acyclic graphs and transformations of derivation forests.

But before that, the first thing to do is to implement a "default
tokenizer".  The user should of course have the facility to control
the working of this tokenizer, perhaps through additional
specifications in the grammar files.

Further, this control should take the derivation forest state into
account: the user can demand, for example, that a particular
tokenization variation be rejected if, at a given token, the
derivation forest does not meet a certain criterion.  This process
can of course be done after parsing, but an earlier cut-off should
sometimes be appealing.
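To fix the shape of the data, here is a small sketch in Rust of the
"splitting" idea together with such an early cut-off.  Token, Span,
tokenize, and filter_variations are hypothetical stand-ins invented
for this sketch, not the real types of this package, and the
hard-coded variations simply mirror the example given earlier in this
entry; in the real design the acceptance criterion would inspect the
derivation forest at each token rather than the tokens themselves.

    /// Hypothetical stand-in for the package's token type.
    #[derive(Debug, Clone, Copy)]
    struct Token(usize);

    /// Half-open range of input positions.
    #[derive(Debug, Clone, Copy)]
    struct Span(usize, usize);

    /// One way of breaking the input into (token, span) segments.
    type Variation = Vec<(Token, Span)>;

    /// Split the input into every tokenization variation.  The two
    /// hard-coded variations mirror the example above.
    fn tokenize(_input: &str) -> Vec<Variation> {
        vec![
            vec![
                (Token(1), Span(0, 1)),
                (Token(2), Span(1, 2)),
                (Token(3), Span(2, 3)),
            ],
            vec![(Token(1), Span(0, 1)), (Token(4), Span(1, 3))],
        ]
    }

    /// Early cut-off: keep only the variations accepted by a
    /// user-supplied criterion.
    fn filter_variations<F>(variations: Vec<Variation>, accept: F) -> Vec<Variation>
    where
        F: Fn(&Variation) -> bool,
    {
        variations.into_iter().filter(|v| accept(v)).collect()
    }

    fn main() {
        // As an illustration, reject any variation containing a token
        // that spans more than one input position.
        let kept = filter_variations(tokenize("abc"), |v| {
            v.iter().all(|(_, Span(start, end))| end - start == 1)
        });
        println!("{:?}", kept);
    }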