This is the first part in a five-part series that will cover implementing an earley parser for linguistic utterances.
- Part 1 - Why?, and some theory
- Part 2 - The Earley recognizer
- Part 3 - Felling the Earley parse forest
- Part 4 - Feature-structures and unification
- Part 5 - Wrapping up
There are already many great explanations of the Earley parse algorithm on the internet.1 There are also some great books on feature-structure based linguistic grammars, which I highly recommend.2 However, as far as I know nobody has put the two together into a beginner-friendly tutorial that explains both how to use these tools together, and perhaps more importantly why someone would want to. This series will be trying to explain this through an iterative approach, starting with the simplest possible solution, and expanding that solution only when necessary. The goal is that, when we finish, we’ll have the simplest possible solution to our goal, with a codebase that has every line justified.
The exact thing we’re going to be building is a symbolic, linguistic grammar model. Those adjectives describe the somewhat awkward space the thing we’re going to be building is in.
The other spot we’re stuck between are the now-ubiquitous linguistic models that are trained on data — usually based on neural networks. These models, such as BERT, take mountains of data — gigabytes of text — and train a model to do some basic linguistic task, such as predicting missing words. This gives the model an understanding of how the language works. Other engineers can then take this pre-trained language model, “fine tune” it on a smaller (but still substantial) amount of data that more closely matches their specific task, and stick another network on the back that spits out whatever they need.
However, symbolic linguistic models are still alive and kicking in this space, both in the emerging field of neural-symbolic computing, and in very mature, wide-coverage linguistic models such as the English Resource Grammar3 and the grammar matrix4. These symbolic grammars are still used because they have some distinct advantages of trained models: they are predictable5, they are analyzable, and they don’t require large amounts of training data. Our grammar will be built on these principles, and while it will not have as wide of coverage as the ERG (which has >99% coverage of the New York Times corpus!), it will reach levels of linguistic understanding that a trained model would need mountains of data to reach. On the other hand, our grammar will show the downsides of this method as well: while a neural network can make a “best guess” at a grammar construction its never seen before, that’s near-impossible for a symbolic model. If we’ve only implemented simple sentences with one subject and one object (transitive), our model will choke the first time it sees “Mary gave Sue the book”, which has one subject and two objects, “Sue” and “the book” (ditransitive). If we then only implement this ditransitive pattern, it will choke on “Mary gave the book to Sue”, where “Sue” is marked with the preposition “to” (dative alternation). Symbolic grammar engineering is a cycle of implementing patterns and finding more patterns your model doesn’t yet handle. Eventually, you have to stop. For our purposes, a simple, low-coverage model will be enough to illustrate the important concepts.
For my specific use case, I’m working on a game, based around the player learning an alien language that I’ve created. (It’s called Themengi, and you can check it out here.) I want to parse player input, both imperative commands and dialogue, and respond appropriately. This language does not exist; there is no training corpus, so a trained model is not possible. On the other hand, this is a language, not an obfuscated version of SQL, so I don’t want to simply apply programming-language techniques. I want to be able to handle ambiguity, case, and other linguistic features. In my search, I couldn’t find a framework that did quite what I needed, so I decided to write my own. This post series will follow my progress working on this framework.
Along the way, I’ll be illustrating the concepts with the running example of creating a tiny English grammar. I’m going to start from first principles, assuming only a little knowledge of grade-school English grammar terms (noun, verb, etc.). We’ll start with a simple system of rules, write a recognizer, extend that into a parser, and annotate that parser with linguistic information. Then we’ll have some fun at the end, maybe we’ll write a tiny grammar for Japanese and do some symbolic translation or write a mini rules-based digital assistant demo. The possibilities are endless, we could write a grammar for Klingon and make the first automated Japanese -> Klingon translation! Who knows!
So, with that out of the way, let’s get going!
Starting from zero
So, we’re modeling English. Big language, lot of words. Let’s start smaller and look at a few sentences.
1). Mary fell. 2). Mary kissed Sue.
If you’re familiar with English grammar learned in school, you know the typical division of an English sentence is into a subject, a verb, and an optional object. We can see that here, with example 1 lacking an object (the verb
fell is intransitive), and example 2 having one (the verb
kissed is transitive).
So how can we model a grammar for these two sentences? Well, specifically, our goal in modeling a grammar is to produce a model that:
- A, accepts all the sentences we deem grammatical
- B, rejects all the sentences we deem ungrammatical
- (C, later, we will produce a set of analyses for grammatical sentences. More in Part 3.)
For this set of sentences, that is actually very easy — just list every case!
Jokes aside, this actually highlights an important point: if your grammar describes a finite set of strings, then you don’t really need a parser!
However, human languages are not so restrictive as to have a finite number of sentences. Let’s add two more examples:
3). Takeshi said that Mary kissed Sue. 4). Robert thought that Takeshi said that Mary kissed Sue.
This illustrates one of many ways that English (and, we believe, maybe sans Pirahã, every natural language) enables infinite recursion. You can just keep stacking nested “X said that…” onto each other until your tongue falls out. There are many other sources of infinite recursion as well, such as the genitive construction (I am your father’s brother’s nephew’s cousin’s…), or even using simple conjunctions (I went to the store and bought milk, eggs, more milk, more eggs…). It’s trivially easy to find grammatical structures that allow for this infinite recursion. So, if we hope to model a reasonable subset of English, the method we choose has to be able to deal with this unbounded recursion. No lists of strings, and definitely no regexes.
So what method can we use? We need something that can describe a recursive structure. The usual tool for this in grammars, of the linguistic as well as programming language variety, is using rewrite rules.
All a rewrite rule does is specify how to replace one set of things with something else. If you’ve done basic arithmetic, you’ve used rewrite rules. Consider these two rules. They state symbolically what happens when you add or multiply something with zero.
A + 0 = A A * 0 = 0
Here’s an example we can try them out on:
x = 1 * (1 * (0 + 0))
This is just a complicated way to write 0, but how would you solve this via substitution? Well, starting from the right:
- A + 0 = A, so x = 1 * (1 * (0))
- A * 0 = 0, so x = 1 * (0)
- A * 0 = 0, again, so x = 0
So, rewrite rules handled that recursive structure, no problem. I wonder if we can apply that to our linguistic situation…
Rules for days
Before we start, a note on… notation. I’m going to switch how I annotate rules from the arithmetic notation (
A + 0 = A) to the rewrite-rule notation used in Backus-Naur Form (BNF) and more generally in grammar engineering, which looks like this (yes, the arrow is pointing the other way).
A -> A + 0
We also generally don’t use operators like (+) either, as they may occur as symbols in the strings we want to model, so to represent a rule rewriting the string ‘B B’ into ‘A’, we would use:
A -> B B
The reasons for the arrow and it’s counter-intuitive direction should become more clear later (it relates to trees, their labels, and their generation). For now, just think
=, and the sides are flipped from how they usually are.
So, what’s our corpus again?
1). Mary fell. 2). Mary kissed Sue. 3). Takeshi said Mary kissed Sue. 4). Robert thought that Takeshi said that Mary kissed Sue.
Well, what’s a starting assumption we can make? One thing you might notice is that any noun can stand in for another:
Mary fell. Sue fell. Takeshi fell. Robert fell.
In fact, more generally, parts of speech seem to act similarly. There is always a noun before the verb, for example. So let’s rewrite all the words into their parts of speech.
N -> Mary N -> Sue N -> Takeshi N -> Robert V -> fell V -> kissed V -> said V -> thought Comp -> that // complementizer
A complementizer is the part of speech for a word, like “that” or “which”, which is used to turn a sentence into the object of a verb. Like in “Robert thought that…”.
If we rewrite our corpus with these rules, we get:
1). N V. 2). N V N. 3). N V Comp N V N. 4). N V Comp N V Comp N V N.
You might see a pattern there already! Let’s introduce a rewrite to a new symbol, S (for sentence):
S -> N V S -> N V N S -> N V Comp N V N S -> N V Comp N V Comp N V N
Hmm, right, we want a rule that describes the recursion. Well, if we look at our last 3 rules, we can make these groups, where each successive example adds
N V Comp to the start of the last one:
S -> N V N S -> N V Comp (N V N) S -> N V Comp (N V Comp (N V N))
Since what is in the brackets is a full grammatical sentence according to our rules, we can just rewrite it to another S!
S -> N V N S -> N V Comp S S -> N V Comp S
The last two are now the same, so we only need one rule to represent a verb with a sentence as an object! Our complete rules now look like this:
S -> N V S -> N V N S -> N V Comp S N -> Mary N -> Sue N -> Takeshi N -> Robert V -> fell V -> kissed V -> said V -> thought Comp -> that
Let’s test this grammar out by generating some sentences. To do this, let’s start with our S symbol, and run it backwards through our rewrite rules (this is one reason the arrows face right, by the way). So,
Let’s start with
We have multiple rules that can rewrite
We keep rewriting the results of that rule. The first element is an
The next element is a
Next we have a
Next, we have another
We have an
We have a
|We’ve now rewritten all the symbols into italicized terminals, and we’re finished generating.|
Hey! That’s not in our corpus – but it probably should be. Remember, we showed that our corpus describes a grammar that can generate an infinite amount of sentences. Since it is highly impractical to collect an infinite number of sentences for our corpus, it’s a good thing that our model is generating sentences outside the corpus, as long as they are consistent with the corpus examples and with our own linguistic judgements. And
Takeshi said that Mary fell seems like a good sentence to me.
Let’s try again.
Let’s start with
We again have multiple rules that can rewrite
We have an
We have a
We finish with
Note on notation: the
*here prefixing the sentence indicates that the sentence is ungrammatical.
That’s not quite right. Our grammar is clearly also generating sentences that should not be in our corpus – sentences that are not consistent with the other examples (or, more intuitively, English-as-she-is-spoke). Another way to say it is that our grammar is over-generating.
The problem here is that the rule
S -> N V describes an intransitive verb (a verb that doesn’t need an object), and
said is a clausal verb, meaning it requires a sentence (clause) as an object. Likewise, if we had tried to use
kissed, a transitive verb, it would have produced the ungrammatical sentence
* Robert kissed.
Let’s subdivide the verbs into some new classes.
// Split S into three separate rules: S -> N IV // an intransitive rule, taking an IV S -> N TV N // a transitive rule, taking a TV S -> N CV Comp S // a clausal rule, taking a CV N -> Mary N -> Sue N -> Takeshi N -> Robert IV -> fell // intransitive verb TV -> kissed // transitive verb CV -> said // clausal verb CV -> thought // clausal verb Comp -> that
Now, if we try to generate from the rule
S -> N CV Comp S, no pesky intransitive verbs can slip into that
CV slot. Likewise, if we pick
S -> N IV, only the intransitive verb
kissed can fill that slot, preventing any
* Robert kissed funny-business.
This grammar is pretty good! It can recognize all the sentences in our corpus, it can reject some ungrammatical ones like
* Robert kissed, and it even handles some basic recursion! It’s ready to start implementing in code. For that, however, we will have to wait for Part 2, where we’ll get to see the magic of the Earley algorithm, and reject those ungrammatical sentences with extreme computational prejudice.
The standout book here is Syntactic Theory: A Formal Introduction, which covers the theory behind a symbolic grammar for a subset of English in exhaustive detail, using the HDPSG formalism. The linguistic aspects of this series will draw heavily from this book. I also like Implementing Typed Feature Structure Grammars, which goes into more detail on the LKB system that is used by the English Resource Grammar, and implements the formalism described in Syntactic Theory.↩︎
One area where this gets especially problematic is when models internalize bias in their training data, and then repeat that bias. For example, a language model trained to predict missing words might learn that it’s more often correct to predict male pronouns in engineering-related texts, and therefore learn its embedding should associate maleness more closely with engineering than femaleness. If the model is then fine-tuned to, for example, suggest careers to high school seniors based on a bio written about them by a guidance counselor, it may have a bias towards referring women to non-engineering jobs. While there is some research in the space of de-biasing neural networks, the problem is far from solved.↩︎
- Previous entry: Getting a full PDF from a DRM-encumbered online textbook