======================================================================
2023-06-02
======================================================================

This is a "diary" that records my thoughts while trying to debug the
forest manipulations. Basically the forests are wrongly "merged" and
produce weird clones. Right now I have finished refactoring most of
the code, but I still fail to see why the forests behave that way. I
even finished a prototype of my procedural macros, which help me
reduce boilerplate but are not strictly necessary.

Right now I think I need finer control over the details of the forest
manipulations. To be more precise, I need to be able to observe, step
by step, how the chain-rule machine manipulates the forests. In
particular, this means being able to produce graphs of half-way
forests whenever I want. I have not yet figured out how to do this
exactly. The rough plan is to compress all steps of the graphs into an
"incremental format" that stores not just the forest data but also how
the forests were created, and that allows users to re-create the
forests step by step. Simply put, this is just another feature that I
would like to offer end-users, but I postponed it since it was not
essential for development. Now it seems quite important for properly
observing the forests, so it is perhaps time to implement this feature
first.

======================================================================
2023-07-22
======================================================================

Now a stable version of the package is in place, so I can start
working on some non-essential features, like the tokenization process
and the visualization feature. Amongst these, the tokenization process
is the most important, as users will not expect to work with tokens:
they expect to use characters. So in some sense this package is not
ready for end-users yet.

My thought is to offer two modes: one for operating on characters and
the other for operating on tokens directly. This can be achieved by
letting the data type representing terminals contain more information
on how to match characters. My idea is to attach either a vector of
"machine codes" or a single character to each terminal. If a terminal
is attached to a single character, it can only match that character.
On the other hand, if a terminal corresponds to a vector of machine
codes, then matching means executing this sequence of codes to
determine whether a character matches. We can even simplify the format
of the machine codes to a sequence of ranges, so that a character
matches only if it is contained in one of the ranges (a rough sketch
of this is at the end of this entry). After this is finished, we can
bump the version to 0.1.3 and simplify the build process. Hope it all
goes well.
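Below is a minimal sketch of how such a terminal representation could
look. The names (Terminal, matches) are hypothetical and not the
package's actual API; this only illustrates the single-character
versus range-list idea described above.

    use std::ops::RangeInclusive;

    /// How a terminal decides whether it matches an input character.
    /// Sketch only; the names are hypothetical.
    enum Terminal {
        /// Matches exactly this one character.
        Char(char),
        /// The simplified "machine code" format: a list of ranges; a
        /// character matches if it falls inside any of them.
        Ranges(Vec<RangeInclusive<char>>),
    }

    impl Terminal {
        fn matches(&self, c: char) -> bool {
            match self {
                Terminal::Char(expected) => *expected == c,
                Terminal::Ranges(ranges) => ranges.iter().any(|r| r.contains(&c)),
            }
        }
    }

    fn main() {
        let star = Terminal::Char('*');
        let word = Terminal::Ranges(vec!['a'..='z', 'A'..='Z']);
        assert!(star.matches('*'));
        assert!(word.matches('x') && !word.matches('*'));
    }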
======================================================================
2023-07-23
======================================================================

I encountered some problems in implementing the tokenization handling.
To better separate concerns, this package only operates at the level
of tokens, which are an abstraction over the input. For example, if we
encounter an identifier in the source, we can replace that identifier
by a special token. This way the length of the input is greatly
reduced, and the parsing process can be sped up, as the tokenization
process, which usually only involves regular expressions, is much
faster than the rest of the parsing.

The problem is that the same input slice can be classified into
multiple different tokens. As an example, if a grammar has a token for
matching arbitrary non-whitespace characters, and it also has a token
for matching a specific character, like '*', then both tokens are
valid at the character '*'. This can be regarded as a token at a
deeper level appearing as multiple different tokens at a more shallow
level. Hence, a natural approach would be to let the tokens of the
grammar decide how to match against tokens of the input: the grammar
and the input use different token types. This is fine as long as the
user does not come into play: the package can implement a way to match
grammar tokens against inputs. However, if the user wants to intervene
in the tokenization process, then things become tricky.

The well-known tree-sitter parser generator (see its documentation for
more information) disambiguates tokens by a set of rules, under which
a given input can only be classified as a unique token. Basically, it
resolves the possibility of multiple tokens by precedence rules. This
is consistent with its goal of providing parsers for grammars of
primarily programming languages, for which the number of derivations
of a given input is at most one: it does not accept multiple
derivations of the same input. But this is not acceptable for this
package: it is explicitly desired to have multiple derivations
recognized.

Moreover, if we look back at the original motivation behind the
tokenization process, we see that tokenization is only needed for two
reasons: speed and flexibility. If we want tokenization for speed,
then tokenization is only an implementation detail and should not
impose restrictions on the resulting derivations. From this
perspective, every tokenization possibility should be preserved. On
the other hand, if we tokenize the input for the flexibility it brings
about, then there is no reason to discard a tokenization possibility
just because we have not found a way to handle it: that hurts the
flexibility a lot. Consequently, from both points of view, we should
treat multiple tokenization possibilities as if they were different
grammar derivations. The only difference from a "regular" grammar
derivation is that they can be altered by the user, from which the
flexibility follows. And this is the point: tokens play a role that
sits between being internal to the parsers and being external to them.
Or, put more simply, they should be a changeable gear in the parsers.

But things are not so easy: if we want to let users have a handle on
the tokenization, then we require extra input from the users. This is
not only an added burden on the users; it is also an additional
performance overhead, and I am not sure whether it is too much. The
situation is even worse: different tokenization choices can lead to
different segments of the input being tokenized. For example, one
choice may break the input into the segments (Token(1), Span(0, 1)),
(Token(2), Span(1, 2)), (Token(3), Span(2, 3)), whereas another choice
may break the input into these segments instead: (Token(1), Span(0, 1)),
(Token(4), Span(1, 3)). This is very hard for the chain-rule machine
to track, as it operates at the level of tokens. In fact, after more
thought, the chain-rule machine is not the right place to handle these
complexities. After all, this process has nothing to do with the
chain-rule machine. It should be the responsibility of the parser
generator application that feeds the machine with appropriate input
tokens.
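To make the segmentation problem concrete, here is a small sketch of
the two variations above, using hypothetical Token and Span types
rather than the package's actual ones: the two choices do not even
agree on where the segments end.

    /// Hypothetical stand-ins for real token and span types.
    #[derive(Debug, Clone, Copy)]
    struct Token(usize);

    #[derive(Debug, Clone, Copy)]
    struct Span(usize, usize);

    fn main() {
        // One tokenization choice: three single-character segments.
        let variation_a = vec![
            (Token(1), Span(0, 1)),
            (Token(2), Span(1, 2)),
            (Token(3), Span(2, 3)),
        ];

        // Another choice: the last two characters form one token, so
        // the segment boundaries themselves differ between choices.
        let variation_b = vec![(Token(1), Span(0, 1)), (Token(4), Span(1, 3))];

        // Both cover the same input, but a machine that consumes one
        // flat token stream cannot represent both at the same time.
        println!("{:?}", variation_a);
        println!("{:?}", variation_b);
    }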
Consequently, the tokenization process is best handled by "splitting
the input sequence" into multiple sequences of input tokens whenever
multiple token variations are desired. Each variation should then be
coupled with its own derivation forest, provided that variation is
valid according to the grammar. I think this is a rather plausible
approach: multiple token possibilities are handled, and if only one
variation is needed, no additional overhead is imposed.

One point of caution is that we should be able to "share states"
across variations. This means the directed acyclic graph that is used
to recognize the input sequence should be maximally shared. In my
rough plan, this should be accomplished after we implement the
"context-free memoization" for the chain-rule machines. This
memoization technique is really the central selling point of the paper
proposing this chain-rule machine parsing algorithm, at least
according to its author.

Currently, several steps are needed to implement the memoization
technique. The part on the directed acyclic graph is rather simple: we
simply break the graph into several chained pieces, each of which
represents a segment of the whole graph. The reason for this
segmentation is that it lets us memoize the segments more easily. What
is more challenging is the memoization of the derivation forests. This
part differs somewhat from what the author proposed in the paper. The
author suggested recording every derivation sequence in the graph and
memoizing the transformations on those sequences. My plan is instead
to memoize the directed acyclic graph and the transformations on the
derivation forests. So I need to figure out the memoization of two
things: directed acyclic graphs and transformations of derivation
forests.

But before that, the first thing to do is to implement a "default
tokenizer". The user should of course have the facility to control the
workings of this tokenizer, perhaps by additional specifications in
the grammar files. Further, this control should take the derivation
forest state into account: the user can demand, for example, that a
particular tokenization variation be rejected if, at a given token,
the derivation forest does not meet a certain criterion. This
filtering can of course be done after parsing, but an earlier cut-off
should sometimes be appealing.
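As a rough illustration of what that forest-aware control could look
like, here is a sketch of a hypothetical filtering hook; none of these
names (Forest, DefaultTokenizer, keep_variation) exist in the package,
and the real interface may end up quite different. The idea is simply
that the default tokenizer consults a user-supplied predicate at each
token and cuts a variation off early when the predicate rejects it.

    /// Hypothetical stand-ins for the real types, for illustration only.
    struct Forest;       // the derivation forest built so far
    struct Token(usize); // an input token

    /// A user-supplied predicate: keep or reject a tokenization
    /// variation based on the forest state at the current token.
    type VariationFilter = Box<dyn FnMut(&Forest, &Token) -> bool>;

    struct DefaultTokenizer {
        filter: Option<VariationFilter>,
    }

    impl DefaultTokenizer {
        /// Called whenever a variation reaches `token`; returning false
        /// cuts that variation off early instead of waiting until after
        /// parsing to discard it.
        fn keep_variation(&mut self, forest: &Forest, token: &Token) -> bool {
            match &mut self.filter {
                Some(f) => f(forest, token),
                None => true, // no filter installed: keep every variation
            }
        }
    }

    fn main() {
        // Example: a filter that (trivially) rejects every variation.
        let mut tokenizer = DefaultTokenizer {
            filter: Some(Box::new(|_forest, _token| false)),
        };
        assert!(!tokenizer.keep_variation(&Forest, &Token(0)));
    }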