#+TITLE: Plan of the package #+AUTHOR: Durand #+DATE: <2022-11-18 Ven 19:57> * Things to do [6/10] - [X] Implement builders for graphs - [X] Find sets of the first terminals for each non-terminal, in the grammar module. - [-] Collect various grammars for testing. [5/6] + [X] One simple grammar + [X] Notes + [X] Parentheses + [X] Pathelogical left-recursive + [X] Pathelogical right-recursive + [ ] Pathelogically ambiguous # More should be added - [X] NFA [4/4] + [X] Add regular expression into NFA. + [X] Convert a set of regular expressions into a nondeterministic finite automaton, where non-terminals denote the indices of regular expressions in the set to substitute. + [X] Convert a grammar to the language of atomic languages. [2/2] * [X] Convert a grammar into its grammar of left-linear closures of non-temrinals and their derivatives. * [X] Convert nondeterministic finite automata to their null closures. + [X] Test more grammars to ensure it works correctly. [5/5] * [X] Test the conversion to left-closure as a set of regular expressions. * [X] Test the conversion of a set of regular expressions into a nondeterministic finite automaton. * [X] Test the closure of null edges, where the nullity of an edge is defined by a function. In our specific case, an edge is null if the grammar symbol that edge represents, either a terminal or a non-terminal, can match an empty string. * [X] Test the removal of empty transitions from nondeterministic finite automata. * [X] Test the removal of dead states, where a state is dead if and only if no other states have an edge into that state. - [X] Refactor [7/7] + [X] Implement a data type for graphs with labels on vertices and edges, but do not need to index edges by labels, which can index vertices by labels instead. * [X] Test it. + [X] Implement a builder that borrows a graph mutably. * [X] Test it. + [X] Implement a data type for graphs in which every node knows its parents, and every node has a unique label but edges have no labels. * [X] Test it. + [X] When we pull in some regular language because of the left-linear expansion, we need to mark that branch as coming from left-linear expansions. This is necessary because we cannot follow a left-linearly expanded branch when we take the derivative of a language. We only expand left-linearly when we try to access the atomic languages [s]^{(t)}. + [X] An edge of the NFA should carry a label that can be more informative than solely a terminal or a non-terminal. + [X] Add a mechanism for a grammar to attach labels to edges of NFA which are not necessarily Copy-able, and which should be stored in a separate array, such that the labels on edges of NFA point to the elements of the array. + [X] We need to record the expansions of those "virtual nodes". That is, the virtual nodes represent atomic languages such as [s]^{(t)} where s is a non-terminal and t is a terminal. To be more specific, it represents the derivative of the left-linear closure of the non-terminal s with respect to the terminal t. The left-linear closure of s is the set of all strings (consisting of terminals and non-terminals alike) that are derivable from s alone, without consuming any inputs, by expanding according to the grammar rules. This means that we need to know which non-terminals were expanded in order to get to a state in [s]^{(t)}. * [X] Test * [X] Test more - [X] Atom [3/3] + [X] Define the Atom trait. + [X] Implement virtual nodes for each derivative of each atom: The lack of this step might be the root cause of the failure of the previous version of this project. + [X] Test atoms - [-] Implement languages. [1/4] + [X] First define a trait with the expected behaviours. + [ ] Then implement them as doubly labelled graphs. + [ ] Then implement finding derivatives by use of the chain rule. + [ ] Each edge in the chain-rule machine needs to be labelled also with a position in the forest. This perfectly solves the problem of those "plugs"! - [-] Implement forests [2/3] + [X] Design a format of forests. This should be the most crucial thing to do, in order to have a better understanding of the whole picture. I need to have a clear understanding of the details of the binarised shared packed parsed forests, that reflects the regular expressions in the grammar equations. + [X] Define a trait with the expected behaviours. + [-] Implement them as parents-knowing graphs. - [ ] Implement semiring. [0/5] + [ ] Define the trait. + [ ] Implement the boolean semiring. + [ ] Implement the natural number semiring. + [ ] Implement the free semiring. + [ ] Compute and record a semiring value when computing derivatives. - [X] Miscellaneous [1/1] + [X] Implement a function for NFA that checks if a given predicate is satisfied for each edge label. * Atomic Languages This describes the behaviours of atomic languages. The atomic language consists of the null closure of any non-terminal symbol in the grammar, and their deriavtives by terminals and non-terminal. * Derivative Languages This is the main driving device of the algorithm. Basically, the algorithm works by taking successive derivatives, according to the input symbol. At each step, we calculate the derivative language. In this process, we also compute some semiring value and store in a carrier. The end result of the algorithm is the final semiring value. If one wants simply to determine if the input string belongs to the grammar, one chooses the semiring to be the field with two elements, the boolean field. If one wants to find how many ways are there to derive a given input string, then one uses the semiring of natural numbers instead. If one wants, moreover, to find all the possible ways to derive a particular input string, then one can use the free semiring on the set of terminals and non-terminals of the grammar. Here the free semiring is the left-adjoint functor to the forgetful functor from the category of semirings to the category of sets. To be more specific, the free semiring on a set is given by sets of sequences of elements in the set. The addition of the semiring is the set union operation, and the multiplication is taking the respective concatenations. ** Semirings So we need a module to define the behaviours of semirings, and provide some common semirings implementations. Then in the main driving force we can freely substitute different semirings, according to the particular needs. ** Languages Then the main part is to define the behaviour of languages. This should be easy enough, since we already have the mechanism of graphs, nondeterministic automata, and semirings. All we need to do is to combine them together. * Lexers I got this idea from [[https://lists.gnu.org/archive/html/emacs-devel/2022-12/msg01127.html][a discussion on emacs-devel]]. The idea can be formulated as #+begin_quote Potentially it allows invocation of a parser with different variants of lexers - one mode with block tokens for the exploration of project's structure, and another mode for indentation and error checking purposes. #+end_quote I think this is the "right" use of lexers. That is to say, the parser by default should operate on an abstract type of tokens, which are but unsigned integers, and rely on the user application to provide a lexer for turning the actual input, like a string, into an iterator of tokens. The lexer can choose to do any sort of preprocessing that it sees fit, or it can do nothing and just turn characters to unsigned integers. In the latter case, this parser generator works on the character level. But, when one neeeds, one can instruct the parser generator to work on the level of preprocessed tokens easily. This has the advantage of allowing a certain degree of flexibility in the type of tokens accepted, while keeping the implementation simple at the same time: the internal core only has to deal with unsigned integers, after all. * Library for drawing graphs In the past, I thought that drawing graphs is only a development tool for my package, so I can rely on such external tools as GraphViz, as that would not be needed for my end-users. But recently I realized that this is not a need only for the development of the package, but also for the users as well. To be more specific, the target users are those who want to "play" with grammars. So if one cannot view the resulting parse forest diagrams easily and interactively, it would be hard to "play" with grammars. Moreover, the internal workings of the parsing machine might also be interesting for users to inspect, whether to debug the program or to know more under the hood. As such it is required to view the diagrams of the internal directed acyclic graphs. For all these reasons, I know regard the ability to draw graphs and to interact with those graphs is essential to the user experience of this package. As a consequence, I would like to develop a small default library for this purpose. I follow the convention adapted by this package here as well, which is to provide a default implementation for the functionality, while still leaving the room for users to substitute with other external packages, if so desired, for whatever reason. For example, an external package may provide an unprecedented performance for drawing graphs in Emacs. If such a package appears, I can easily avail of that external package to draw graphs. * Testing ground I am in a strong need to test things out. The most important one is to visualize each step of the derivation, in a human-friendly manner. I need this to examine whether my atomic languages are wrongly implemented, or my atomic languages are wrongly derived, or my understanding of the main algorithm is plain wrong. This is the main reason I started this rewrite of the package. * To figure out I need to figure out how to construct the parse forest from a left-most null-closed derivation graph. Reading through the original paper again, I found that the author augmented the atomic languages with annotations of partial derivation steps on each edge of the nondeterministic finite automaton. In brief, this is what I tried to achieve with some obscure constructs such as expanding reduction informations carried by the nondeterministic finite automata, in one of the previous versions. To sum up, I really need to carry those informations in the first place, otherwise the parse forests cannot be reliably construced later on. But I really do not want to adopt the approach of the original author. I still prefer the idea of a recorder. That is to say, instead of constructing the parse forest after the recognition, we shall construct the parse forest simultaneously while we are recognizing.