author    JSDurand <mmemmew@gmail.com>  2023-07-30 11:37:19 +0800
committer JSDurand <mmemmew@gmail.com>  2023-07-30 11:37:19 +0800
commit    a774b37fc9cfaa9c40c33201dcad9f2d71e32e52 (patch)
tree      ae5affe2a94261bbcf9e1dfa14e70362dd3c9a99 /DIARY
parent    23df5ce7d69f715aca7e7a53d03a048830e99b69 (diff)
DIARY: Add more thoughts
Diffstat (limited to 'DIARY')
-rw-r--r--  DIARY  132
1 file changed, 132 insertions(+), 0 deletions(-)
diff --git a/DIARY b/DIARY
index afef8e2..7c882d0 100644
--- a/DIARY
+++ b/DIARY
@@ -60,3 +60,135 @@ contained in one of the ranges.
After this is finished, we can bump the version to 0.1.3 and simplify
the build process. Hope it all goes well.
+
+
+
+======================================================================
+ 2023-07-23
+======================================================================
+
+I encountered some problems in implementing the tokenization handling.
+
+To better separate concerns, this package only operates at the level
+of tokens, which are an abstraction over inputs. For example, if we
+encounter an identifier in the source, we can replace that identifier
+by a special token. This way, the length of the input is greatly
+reduced, and the parsing process can be sped up as the tokenization
+process, which usually only involves regular expressions, is much
+faster than the rest of the parsing.
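+
+To fix ideas, here is a rough sketch, in Rust, of what "replacing an
+identifier by a special token" could look like. This is not the
+actual code of this package; the Token type and the tokenize function
+are invented purely for illustration.
+
+    // Hypothetical token type abstracting over the input.
+    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
+    enum Token {
+        Identifier, // any identifier collapses into this single token
+        Star,       // the character '*'
+        Whitespace,
+    }
+
+    // A naive tokenizer for illustration only: many characters of an
+    // identifier become one token, so the parser sees a shorter input.
+    fn tokenize(input: &str) -> Vec<Token> {
+        let mut tokens = Vec::new();
+        let mut chars = input.chars().peekable();
+        while let Some(c) = chars.next() {
+            if c.is_alphabetic() {
+                while matches!(chars.peek(), Some(d) if d.is_alphanumeric()) {
+                    chars.next();
+                }
+                tokens.push(Token::Identifier);
+            } else if c == '*' {
+                tokens.push(Token::Star);
+            } else if c.is_whitespace() {
+                tokens.push(Token::Whitespace);
+            }
+        }
+        tokens
+    }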
+
+The problem is that the same input slice can be classified into
+multiple different tokens. As an example, if a grammar has a token
+for matching arbitrary non-whitespace characters, and if it also has
+a token for matching a specific character, like '*', then the two
+tokens are both valid at the character '*'. This can be regarded as
+a token at a deeper level appearing as multiple different tokens at a
+shallower level.
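+
+To spell the situation out, here is a small sketch; the names
+GrammarToken and candidates are invented for this note.
+
+    // Hypothetical grammar-level tokens.
+    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
+    enum GrammarToken {
+        AnyNonWhitespace, // matches every non-whitespace character
+        Star,             // matches exactly '*'
+    }
+
+    // Return every grammar token that accepts the given character.
+    fn candidates(c: char) -> Vec<GrammarToken> {
+        let mut result = Vec::new();
+        if !c.is_whitespace() {
+            result.push(GrammarToken::AnyNonWhitespace);
+        }
+        if c == '*' {
+            result.push(GrammarToken::Star);
+        }
+        result
+    }
+
+Here candidates('*') returns both AnyNonWhitespace and Star: one
+character, two possible classifications.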
+
+Hence, a natural approach would be to let the tokens of the grammar
+decide how to match against tokens of the input: the grammar and the
+input use different token types. This is fine as long as the user
+does not come into play: the package can implement a way to match
+grammar tokens against inputs. However, if the user wants to
+intervene in the tokenization process, then things become tricky.
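+
+One way to express "the grammar tokens decide how to match against
+input tokens" is a trait on the grammar side. The following is only a
+sketch under the assumption of two distinct token types; none of
+these names exist in the package.
+
+    // Hypothetical: the grammar's tokens and the input's tokens differ.
+    trait MatchesInput<I> {
+        fn matches(&self, input: &I) -> bool;
+    }
+
+    enum GrammarToken {
+        AnyNonWhitespace,
+        Star,
+    }
+
+    // The input token carries the character it was made from.
+    struct InputToken {
+        ch: char,
+    }
+
+    impl MatchesInput<InputToken> for GrammarToken {
+        fn matches(&self, input: &InputToken) -> bool {
+            match self {
+                GrammarToken::AnyNonWhitespace => !input.ch.is_whitespace(),
+                GrammarToken::Star => input.ch == '*',
+            }
+        }
+    }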
+
+The well-known tree-sitter parser generator (see
+<https://tree-sitter.github.io/tree-sitter/> for more information)
+differentiates the tokens by a set of rules, by which a certain input
+can only be classified as a unique token. Essentially it resolves the
+ambiguity among multiple candidate tokens by precedence rules. This is
+consistent with its goal of providing parsers for grammars of
+primarily programming languages, for which the number of derivations
+of a given input is at most one: it does not accept multiple
+derivations of the same input.
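+
+For comparison, a precedence-based resolution in that spirit might
+look like the sketch below. This is of course not tree-sitter's
+actual code or API; it only illustrates how precedence rules force a
+unique classification.
+
+    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
+    enum GrammarToken {
+        AnyNonWhitespace,
+        Star,
+    }
+
+    fn precedence(token: GrammarToken) -> u32 {
+        match token {
+            GrammarToken::AnyNonWhitespace => 0, // generic fallback
+            GrammarToken::Star => 1,             // the specific token wins
+        }
+    }
+
+    // Among all candidate tokens, keep only the highest-precedence one,
+    // so every input slice receives a unique classification.
+    fn resolve(mut candidates: Vec<GrammarToken>) -> Option<GrammarToken> {
+        candidates.sort_by_key(|t| precedence(*t));
+        candidates.pop()
+    }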
+
+But this is not acceptable for this package: it is explicitly desired
+to have multiple derivations recognized.
+
+Moreover, if we look back at the original motivation behind the
+tokenization process, we see that the tokenization is only needed
+for two reasons: speed and flexibility.
+
+If we want the tokenization because of speed, it means that the
+tokenization is only an implementation detail, and should not impose
+restrictions on the resulting derivations. From this perspective,
+every tokenization possibility should be preserved.
+
+On the other hand, if we tokenize input on account of the flexibility
+thus brought about, then there is no reason to discard some
+tokenization possibility just because we have not found a way to
+handle it: that hurts the flexibility a lot.
+
+Consequently, from both points of view, we shall treat multiple
+tokenization possibilities as if they were different grammar
+derivations. The only difference from a "regular" grammar derivation
+is that they can be altered by the user, from which the flexibility
+follows. And this is the point: tokens sit on the boundary between
+what is internal to the parsers and what is external to them. Or put
+more simply: they should be a changeable gear in the parsers.
+
+But things are not so easy: if we want to let users have a handle on
+the tokenization, then we require extra input from the users. This is
+not only an added burden to the users, but also an additional
+performance overhead that I am not sure is acceptable.
+
+The situation is even worse: different tokenization choices can lead
+to different segments of the input being tokenized. For example, a
+choice may break the input into segments (Token(1), Span(0, 1)),
+(Token(2), Span(1, 2)), (Token(3), Span(2, 3)), whereas another choice
+may break the input into these segments instead: (Token(1), Span(0,
+1)), (Token(4), Span(1, 3)).
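+
+Written out as data, with Token and Span as stand-ins for whatever
+types the package ends up using, the two choices are:
+
+    #[derive(Debug, Clone, Copy)]
+    struct Token(usize);
+
+    #[derive(Debug, Clone, Copy)]
+    struct Span(usize, usize);
+
+    fn main() {
+        // First choice: three tokens covering positions 0..3.
+        let choice_a = vec![
+            (Token(1), Span(0, 1)),
+            (Token(2), Span(1, 2)),
+            (Token(3), Span(2, 3)),
+        ];
+
+        // Second choice: the same positions, but the last two segments
+        // are merged into a single token spanning 1..3.
+        let choice_b = vec![(Token(1), Span(0, 1)), (Token(4), Span(1, 3))];
+
+        // The two variations do not even agree on how many tokens there
+        // are, which is what makes them hard to track in the machine.
+        assert_ne!(choice_a.len(), choice_b.len());
+    }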
+
+This is very hard for the chain-rule machine to track, as it operates
+at the level of tokens. In fact, after more thought, the chain-rule
+machine is not the right place to handle these complexities. After
+all, this process has nothing to do with the chain-rule machine. It
+should be the responsibility of the application parser generator that
+feeds the machine with appropriate input tokens.
+
+Consequently, the tokenization process should best be handled by
+"splitting the input sequence" into multiple sequences of input
+tokens, if multiple variations of tokens are desired. Evidently,
+each variation should be coupled with its own derivation forest, if
+that variation is valid according to the grammar.
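+
+A minimal sketch of this "splitting" idea, leaving the real parsing
+out and using invented stand-ins for the token, span and forest types:
+
+    #[derive(Debug, Clone, Copy)]
+    struct Token(usize);
+    #[derive(Debug, Clone, Copy)]
+    struct Span(usize, usize);
+
+    // Stand-in for the real derivation forest type.
+    struct Forest;
+
+    // Stand-in for the chain-rule machine: returns a forest if the
+    // token sequence is valid according to the grammar.
+    fn parse(_tokens: &[(Token, Span)]) -> Option<Forest> {
+        Some(Forest)
+    }
+
+    // Each tokenization variation becomes its own token sequence, and
+    // each accepted sequence is coupled with its own derivation forest.
+    fn parse_all_variations(
+        variations: Vec<Vec<(Token, Span)>>,
+    ) -> Vec<(Vec<(Token, Span)>, Forest)> {
+        variations
+            .into_iter()
+            .filter_map(|tokens| parse(&tokens).map(|forest| (tokens, forest)))
+            .collect()
+    }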
+
+I think this is a rather plausible approach: multiple possibilities of
+tokens are handled, and if only one variation is needed, no additional
+overhead is imposed.
+
+One point of caution is that we shall be able to "share states" across
+variations. This means the directed acyclic graph that is used to
+recognize the input sequence should be maximally shared.
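+
+One common way to obtain such sharing is to intern the graph nodes,
+so that structurally equal nodes are stored only once and every
+variation refers to the same copy. A rough sketch, not the package's
+actual data structure:
+
+    use std::collections::HashMap;
+
+    // Hypothetical interned DAG: equal nodes get the same index, so
+    // different tokenization variations share their common parts.
+    #[derive(Debug, Clone, PartialEq, Eq, Hash)]
+    struct Node {
+        label: usize,
+        children: Vec<usize>, // indices of already-interned nodes
+    }
+
+    #[derive(Default)]
+    struct Dag {
+        nodes: Vec<Node>,
+        index: HashMap<Node, usize>,
+    }
+
+    impl Dag {
+        // Return the index of an equal existing node, or insert a new one.
+        fn intern(&mut self, node: Node) -> usize {
+            if let Some(&i) = self.index.get(&node) {
+                return i;
+            }
+            let i = self.nodes.len();
+            self.nodes.push(node.clone());
+            self.index.insert(node, i);
+            i
+        }
+    }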
+
+In my rough plan, this should be accomplished after we implement the
+"context-free memoization" for the chain-rule machines. This
+memoization technique is really the central selling-point of the paper
+proposing this chain-rule machine parsing algorithm, at least
+according to its author.
+
+Currently, there are several steps needed to implement the memoization
+technique. The part on the directed acyclic graph is rather simple:
+we simply have to break the graph into several chained pieces each of
+which represents a segment of the whole graph. The reason for this
+segmentation is that this way we can memoize the segments more easily.
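+
+The segmentation can be pictured as a chain of smaller graphs, each
+covering a contiguous range of input positions. The names below are
+invented; this is only a picture of the intended shape.
+
+    // Each segment covers a contiguous range of the input, so it can
+    // be memoized and reused independently of the rest of the graph.
+    struct Segment {
+        start: usize, // first input position covered by this segment
+        end: usize,   // one past the last position covered
+        // edges and nodes local to this segment would live here
+    }
+
+    struct ChainedDag {
+        segments: Vec<Segment>,
+    }
+
+    impl ChainedDag {
+        // Find the segment that covers a given input position, if any.
+        fn segment_at(&self, pos: usize) -> Option<&Segment> {
+            self.segments.iter().find(|s| s.start <= pos && pos < s.end)
+        }
+    }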
+
+What is more challenging is the memoization of the derivation forests.
+This part is kind of different from what the author proposed in the
+paper. The author suggested recording every derivation sequence in
+the graph, and memoize the transformations on the sequences. My plan
+is instead to memoize the directed acyclic graph and the
+transformation on the derivation forests.
+
+So I need to figure out the memoization of two things: directed
+acyclic graphs and transformations of derivation forests.
+
+But before that, the first thing to do is to implement a "default
+tokenizer". The user should of course have the facility to control
+the workings of this tokenizer, perhaps through additional
+specifications in the grammar files.
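+
+The shape of such a default tokenizer hook might roughly be as
+follows; all of these names are hypothetical and only record how I
+currently imagine the interface.
+
+    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
+    struct Token(usize);
+    #[derive(Debug, Clone, Copy)]
+    struct Span(usize, usize);
+
+    // The hook receives every candidate classification for a span and
+    // may drop or reorder candidates before the tokens are emitted.
+    trait TokenizerHook {
+        fn filter(&mut self, span: Span, candidates: Vec<Token>) -> Vec<Token>;
+    }
+
+    // A hook that keeps everything: the behaviour of the default
+    // tokenizer when the user does not intervene.
+    struct KeepAll;
+
+    impl TokenizerHook for KeepAll {
+        fn filter(&mut self, _span: Span, candidates: Vec<Token>) -> Vec<Token> {
+            candidates
+        }
+    }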
+
+Further, this control should take the derivation forest state into
+account: the user can demand, for example, to reject a particular
+tokenization variation if, at a given token, the derivation forest
+does not meet a certain criterion. This process can of course be done
+after parsing, but an earlier cut-off can sometimes be appealing.
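+
+Such an early cut-off could take the form of a predicate that also
+sees the derivation forest built so far. Again a sketch with invented
+names, just to record the idea:
+
+    #[derive(Debug, Clone, Copy)]
+    struct Token(usize);
+
+    // Stand-in for the real derivation forest type.
+    struct Forest;
+
+    trait VariationFilter {
+        // Return false to abandon this tokenization variation at this
+        // token, instead of waiting until parsing has finished.
+        fn accept(&mut self, token: Token, forest: &Forest) -> bool;
+    }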