summaryrefslogtreecommitdiff
path: root/README
diff options
context:
space:
mode:
authorJSDurand <mmemmew@gmail.com>2023-08-10 11:14:54 +0800
committerJSDurand <mmemmew@gmail.com>2023-08-10 11:14:54 +0800
commitb2ca9a285f09d90259aaa7c4d09bb7335211020b (patch)
treece18a3a79a6456b6d584accda888fe5f12add1f0 /README
parentb8a2d05a3c0d835556d5ddbd44e4a1e201302af5 (diff)
Modify README
Working my way towards a user-friendly introduction to the package.
Diffstat (limited to 'README')
-rw-r--r--README207
1 files changed, 207 insertions, 0 deletions
diff --git a/README b/README
index 0da3d7a..1f7a05d 100644
--- a/README
+++ b/README
@@ -206,9 +206,216 @@ format, of course.
Algorithm
---------
+[ This section is a little (too) involved. Those who are not
+ interested in the details of the algorithm can skip to the next
+ section "Tokenization" directly, by searching for a continuous
+ sequence of the minus character '-' to jump to the next section, for
+ example. ]
+
This package uses an algorithm that is not widely adapted, at least
not to my knowledge. So I think I will briefly explain how it works.
+Languages and derivatives
+=========================
+
+To understand the algorithm, we shall first talk about the derivatives
+of languages.
+
+For the sake of convenience, let Sigma denote a finite set of
+alphabets. The set of all Unicode code points can be an example. We
+call a sequence of elements in Sigma a "sentence", and we call a
+(possibly infinite) set of sentences a "language". For example, the
+set of all grammatically correct English sentences is a language, to
+which we may give the name "English".
+
+Now, if L is a language and s is a sentence, we define the "derivative
+of L with respect to s", denoted as dL/ds, as the set of all sentences
+"t" such that the concatenated sentence "s . t" belongs to L. We can
+re-formulate this definition in more mathematical terms:
+
+ -----------------------------------------------------
+ | dL/ds := { t is a sentence | s . t belongs to L } |
+ -----------------------------------------------------
+
+As a side note, the reason to call it the derivative is not only
+definitional similarity (to some extent), but also that this operation
+of taking the derivative of a language with respect to a sentence
+satisfies some similar properties as the "derivatives in Calculus".
+
+Parsing by taking derivatives
+=============================
+
+Why consider derivatives in a document about parser generators?
+Because we can parse sentences by taking derivatives.
+
+More precisely, let L be a language and let s be a sentence. If the
+derivative dL/ds of L by s (which is a set of sentences) contains the
+empty sentence, then by definition this means that s belongs to L.
+This observation is the basis of parsing by taking derivatives.
+
+[ Those who have a rigorous mind are probably frowning at the above
+ paragraph because we say we can parse by taking derivatives, whereas
+ we actually only determine if the sentence belongs to the language.
+ This gap comes from that the details of actually constructing the
+ parse forests by taking derivatives are too complicated for me to
+ explain clearly here. I might come up with a clear explanation in
+ the future, though. ]
+
+Regular languages
+=================
+
+With the above definition in mind, we now say that a language L is
+"regular" if and only if the set of derivatives of L is finite, i.e.
+
+ ------------------------------------------
+ | { dL/ds | s is a sentence } is finite. |
+ ------------------------------------------
+
+Note that this is a set of languages, namely, a set of sets of
+sentences. And we say it is finite if it contains finitely many sets.
+
+For those familiar with regular expressions, a language is regular if
+and only if there is a regular expression such that a sentence belongs
+to the language if and only if it matches that regular expression. I
+chose to define regular languages using derivatives as it is easier to
+reason about them in the following, in my opinion.
+
+Languages of grammars
+=====================
+
+If G is a grammar, we say that "G generates a language L", where L
+consists of all sentences generated by the grammar. Denote the
+language generated by G as L_G.
+
+Derivatives of the languages of grammars
+========================================
+
+If s is a sentence, We ask the question:
+
+ ---------------------------------------
+ | What is the derivative of L_G by s? |
+ ---------------------------------------
+
+To answer this, note that if s and t are sentences, and if L is any
+language, then dL/d(s . t) is equal to d( dL/ds )/dt, the derivative
+by t of the derivative by s of L. This can be seen by definition: a
+sentence r belongs to dL/d(s . t) if and only if (s . t . r) belongs
+to L, which is exactly the condition for (t . r) to belong to dL/ds.
+And this is the condition for r to belong to d( dL/ds )/dt.
+
+The above observation tells us that we can obtain dL_G/ds by taking
+derivatives of L_G by sentences consisting of one single alphabet,
+successively.
+
+Now consider dL_G/da.
+
+Suppose G consists of n rules
+
+ --------------------------
+ | X_1 => Y_1 ... Y_{m_1} |
+ | ... => ... |
+ | X_n => Z_1 ... Z_{m_n} |
+ --------------------------
+
+where X_i are variables, m_i is an integer depending on i, for i from
+1 to n, and Y_1, ..., Y_{m_1}, ..., Z_1, ..., Z_{m_n} are either
+variables or constants.
+
+It is then natural to try to determine dL_G/da by some kind of
+substitution. So for i from 1 to n, let X_i also denote the language
+that is generated if we treat the variable X_i as the starting
+variable in the grammar. For instance, if X_1 is the original
+starting variable for the grammar G, then L_G is also regarded as
+equal to X_1 under this view.
+
+Further, we observe that if both M_1 and M_2 are languages, and if M
+is their concatenation:
+
+ -------------------------------------------------
+ | M = { s . t | s belongs to M_1 and t to M_2 } |
+ -------------------------------------------------
+
+Then dM/da is equal to either
+
+ ---------------------
+ | ( dM_1/da ) . M_2 |
+ ---------------------
+
+or
+
+ -----------------------------------------------
+ | the union of ( dM_1/da ) . M_2 and dM_2/da, |
+ -----------------------------------------------
+
+depending on whether or not M_1 contains the empty sentence.
+
+Consequently, the derivative dX_1/da is equal to the union of
+
+ ----------------------------------
+ | dY_1/da . Y_2 ... Y_{m_1}, |
+ | ..., |
+ | dY_i/da . Y_{i+1} ... Y_{m_1}, |
+ ----------------------------------
+
+where i is the unique integer such that the language Y_i does not
+contain the empty sentence. If no such i exists, then i = m_1. And
+similarly for dX_2/da, ..., dX_n/da.
+
+Now we treat the languages dY_i/da as variables: if Y_i is a variable,
+this denotes the derivative by a of the language generated by Y_i; if
+Y_i is a constant, then this denotes either the language consisting of
+the empty sentence, or the empty set, depending upon whether Y_i is
+equal to a or not. Then the previous paragraph defines the
+substitution rules of these new variables dY_i/da. We call these
+rules as "the derived rules" of dY_i/da.
+
+It follows that we can obtain the derivative
+
+ ---------------------
+ | dL_G/da = dX_1/da |
+ ---------------------
+
+by "substituting" these dY_i/da according to the derived rules.
+
+Combining the above, we see that the derivatives of the language L_G
+by any sentence are generated by grammars. We denote the above
+grammar, which generates dL_G/da, as "the derived grammar" dG/da. Now
+our problem is transformed to determine if the successive derivatives
+of grammars dG/da_1, d( dG/da_1 )/da_2, et cetera, generates the empty
+sentence. This is relatively easy to do.
+
+Chain-rule of derivatives
+=========================
+
+All of the above was not new when the author proposed the algorithm.
+The core idea of the algorithm, in my opinion, is the "chain-rule" of
+the operation of taking derivatives of languages.
+
+In calculus, roughly speaking, the chain-rule of derivatives refers to
+the formula
+
+ --------------------------------------------------------
+ | df/dx = df/dy_1 . dy_1/dx + ... + df/dy_n . dy_n/dx, |
+ --------------------------------------------------------
+
+where f is a function in n variables y_1, ..., y_n, which are also
+functions in the variable x.
+
+If we replace the function f by a language L generated by a grammar G,
+the variables y_1, ..., y_n by the variables in G, x by an alphabet,
+the multiplication "." by the concatenation of languages, and the
+summation "+" by the union of languages, then we obtain a formula
+writing the derivative dG/da as a union of the concatenations
+
+ ----------------------
+ | dG/dy_i . dy_i/da, |
+ ----------------------
+
+for i from 1 to n. The only remaining question is:
+
+ ---------------------------
+ | What does dG/dy_i mean? |
+ ---------------------------
TODO: Provide more details.