diff options
Diffstat (limited to 'README')
-rw-r--r-- | README | 207 |
1 files changed, 207 insertions, 0 deletions
@@ -206,9 +206,216 @@ format, of course. Algorithm --------- +[ This section is a little (too) involved. Those who are not + interested in the details of the algorithm can skip to the next + section "Tokenization" directly, by searching for a continuous + sequence of the minus character '-' to jump to the next section, for + example. ] + This package uses an algorithm that is not widely adapted, at least not to my knowledge. So I think I will briefly explain how it works. +Languages and derivatives +========================= + +To understand the algorithm, we shall first talk about the derivatives +of languages. + +For the sake of convenience, let Sigma denote a finite set of +alphabets. The set of all Unicode code points can be an example. We +call a sequence of elements in Sigma a "sentence", and we call a +(possibly infinite) set of sentences a "language". For example, the +set of all grammatically correct English sentences is a language, to +which we may give the name "English". + +Now, if L is a language and s is a sentence, we define the "derivative +of L with respect to s", denoted as dL/ds, as the set of all sentences +"t" such that the concatenated sentence "s . t" belongs to L. We can +re-formulate this definition in more mathematical terms: + + ----------------------------------------------------- + | dL/ds := { t is a sentence | s . t belongs to L } | + ----------------------------------------------------- + +As a side note, the reason to call it the derivative is not only +definitional similarity (to some extent), but also that this operation +of taking the derivative of a language with respect to a sentence +satisfies some similar properties as the "derivatives in Calculus". + +Parsing by taking derivatives +============================= + +Why consider derivatives in a document about parser generators? +Because we can parse sentences by taking derivatives. + +More precisely, let L be a language and let s be a sentence. If the +derivative dL/ds of L by s (which is a set of sentences) contains the +empty sentence, then by definition this means that s belongs to L. +This observation is the basis of parsing by taking derivatives. + +[ Those who have a rigorous mind are probably frowning at the above + paragraph because we say we can parse by taking derivatives, whereas + we actually only determine if the sentence belongs to the language. + This gap comes from that the details of actually constructing the + parse forests by taking derivatives are too complicated for me to + explain clearly here. I might come up with a clear explanation in + the future, though. ] + +Regular languages +================= + +With the above definition in mind, we now say that a language L is +"regular" if and only if the set of derivatives of L is finite, i.e. + + ------------------------------------------ + | { dL/ds | s is a sentence } is finite. | + ------------------------------------------ + +Note that this is a set of languages, namely, a set of sets of +sentences. And we say it is finite if it contains finitely many sets. + +For those familiar with regular expressions, a language is regular if +and only if there is a regular expression such that a sentence belongs +to the language if and only if it matches that regular expression. I +chose to define regular languages using derivatives as it is easier to +reason about them in the following, in my opinion. + +Languages of grammars +===================== + +If G is a grammar, we say that "G generates a language L", where L +consists of all sentences generated by the grammar. Denote the +language generated by G as L_G. + +Derivatives of the languages of grammars +======================================== + +If s is a sentence, We ask the question: + + --------------------------------------- + | What is the derivative of L_G by s? | + --------------------------------------- + +To answer this, note that if s and t are sentences, and if L is any +language, then dL/d(s . t) is equal to d( dL/ds )/dt, the derivative +by t of the derivative by s of L. This can be seen by definition: a +sentence r belongs to dL/d(s . t) if and only if (s . t . r) belongs +to L, which is exactly the condition for (t . r) to belong to dL/ds. +And this is the condition for r to belong to d( dL/ds )/dt. + +The above observation tells us that we can obtain dL_G/ds by taking +derivatives of L_G by sentences consisting of one single alphabet, +successively. + +Now consider dL_G/da. + +Suppose G consists of n rules + + -------------------------- + | X_1 => Y_1 ... Y_{m_1} | + | ... => ... | + | X_n => Z_1 ... Z_{m_n} | + -------------------------- + +where X_i are variables, m_i is an integer depending on i, for i from +1 to n, and Y_1, ..., Y_{m_1}, ..., Z_1, ..., Z_{m_n} are either +variables or constants. + +It is then natural to try to determine dL_G/da by some kind of +substitution. So for i from 1 to n, let X_i also denote the language +that is generated if we treat the variable X_i as the starting +variable in the grammar. For instance, if X_1 is the original +starting variable for the grammar G, then L_G is also regarded as +equal to X_1 under this view. + +Further, we observe that if both M_1 and M_2 are languages, and if M +is their concatenation: + + ------------------------------------------------- + | M = { s . t | s belongs to M_1 and t to M_2 } | + ------------------------------------------------- + +Then dM/da is equal to either + + --------------------- + | ( dM_1/da ) . M_2 | + --------------------- + +or + + ----------------------------------------------- + | the union of ( dM_1/da ) . M_2 and dM_2/da, | + ----------------------------------------------- + +depending on whether or not M_1 contains the empty sentence. + +Consequently, the derivative dX_1/da is equal to the union of + + ---------------------------------- + | dY_1/da . Y_2 ... Y_{m_1}, | + | ..., | + | dY_i/da . Y_{i+1} ... Y_{m_1}, | + ---------------------------------- + +where i is the unique integer such that the language Y_i does not +contain the empty sentence. If no such i exists, then i = m_1. And +similarly for dX_2/da, ..., dX_n/da. + +Now we treat the languages dY_i/da as variables: if Y_i is a variable, +this denotes the derivative by a of the language generated by Y_i; if +Y_i is a constant, then this denotes either the language consisting of +the empty sentence, or the empty set, depending upon whether Y_i is +equal to a or not. Then the previous paragraph defines the +substitution rules of these new variables dY_i/da. We call these +rules as "the derived rules" of dY_i/da. + +It follows that we can obtain the derivative + + --------------------- + | dL_G/da = dX_1/da | + --------------------- + +by "substituting" these dY_i/da according to the derived rules. + +Combining the above, we see that the derivatives of the language L_G +by any sentence are generated by grammars. We denote the above +grammar, which generates dL_G/da, as "the derived grammar" dG/da. Now +our problem is transformed to determine if the successive derivatives +of grammars dG/da_1, d( dG/da_1 )/da_2, et cetera, generates the empty +sentence. This is relatively easy to do. + +Chain-rule of derivatives +========================= + +All of the above was not new when the author proposed the algorithm. +The core idea of the algorithm, in my opinion, is the "chain-rule" of +the operation of taking derivatives of languages. + +In calculus, roughly speaking, the chain-rule of derivatives refers to +the formula + + -------------------------------------------------------- + | df/dx = df/dy_1 . dy_1/dx + ... + df/dy_n . dy_n/dx, | + -------------------------------------------------------- + +where f is a function in n variables y_1, ..., y_n, which are also +functions in the variable x. + +If we replace the function f by a language L generated by a grammar G, +the variables y_1, ..., y_n by the variables in G, x by an alphabet, +the multiplication "." by the concatenation of languages, and the +summation "+" by the union of languages, then we obtain a formula +writing the derivative dG/da as a union of the concatenations + + ---------------------- + | dG/dy_i . dy_i/da, | + ---------------------- + +for i from 1 to n. The only remaining question is: + + --------------------------- + | What does dG/dy_i mean? | + --------------------------- TODO: Provide more details. |