A symbolic linguistics framework in Rust — Part 1: Why?, rewrite rules, and recursion
This is the first part in a five-part series that will cover implementing an Earley parser for linguistic utterances.
- Part 1 - Why?, and some theory
- Part 2 - The Earley recognizer
- Part 3 - Felling the Earley parse forest
- Part 4 - Feature-structures and unification
- Part 5 - Wrapping up
Why write this series?
There are already many great explanations of the Earley parsing algorithm, which is the algorithm we'll be using.1 There are also some great books on feature-structure based linguistic grammars, which I highly recommend.2 However, as far as I know nobody has put the two together into a beginner-friendly, implementation-oriented tutorial that explains what these tools are, how to use them together, and perhaps most importantly why someone would want to. This series will try to explain this through an iterative approach, starting with the simplest possible solution and expanding that solution only when necessary. The goal is that, when we finish, we'll have the simplest possible solution to our problem, with a codebase where every line is justified.
What are we making, exactly?
The exact thing we're going to be building is a symbolic, linguistic grammar model. Those adjectives describe the somewhat awkward space this project occupies. We'll be using symbolic, rule-based techniques to parse language, but we need to handle a wider variety of structures than most rule-based parsers do. And we'll be parsing natural language, but in a rule-based way, not in a learned way like a neural network. This combination has some serious advantages, and is worth examining in detail.
One spot we're stuck between is symbolic grammar models for computer languages. These are the parsers that parse JavaScript, Python, JSON, C, the Markdown I'm writing this file in, and the HTML you're reading it in. In this space, symbolic models rule the day. However, these models are generally not suitable for linguistic use. For one, they generally use parsing techniques, such as LALR, whose restrictions make them unsuitable for natural languages. For example, LALR cannot handle an ambiguous grammar. These grammars also lack an efficient representation of linguistic information, such as case.
The other spot we're stuck between is the now-ubiquitous family of linguistic models that are trained on data — usually based on neural networks. These models, such as BERT, take mountains of data — gigabytes of text — and train a model to do some basic linguistic task, such as predicting missing words. This gives the model an understanding of how the language works. Other engineers can then take this pre-trained language model, "fine-tune" it on a smaller (but still substantial) amount of data that more closely matches their specific task, and stick another network on the back that spits out whatever they need.
However, symbolic linguistic models are still alive and kicking in this space, both in the emerging field of neural-symbolic computing, and in very mature, wide-coverage linguistic models such as the English Resource Grammar3 and the grammar matrix4. These symbolic grammars are still used because they have some distinct advantages over trained models: they are predictable5, they are analyzable, and they don't require large amounts of training data. Our grammar will be built on these principles, and while it will not have as wide a coverage as the ERG (which has >99% coverage of the New York Times corpus!), it will reach levels of linguistic understanding that a trained model would need mountains of data to reach. On the other hand, our grammar will show the downsides of this method as well: while a neural network can make a "best guess" at a grammar construction it's never seen before, that's near-impossible for a symbolic model. If we've only implemented simple sentences with one subject and one object (transitive), our model will choke the first time it sees "Mary gave Sue the book", which has one subject and two objects, "Sue" and "the book" (ditransitive). If we then only implement this ditransitive pattern, it will choke on "Mary gave the book to Sue", where "Sue" is marked with the preposition "to" (dative alternation). Symbolic grammar engineering is a cycle of implementing patterns and finding more patterns your model doesn't yet handle. Eventually, you have to stop. For our purposes, a simple, low-coverage model will be enough to illustrate the important concepts.
Where are we going?
For my specific use case, I'm working on a game, based around the player learning an alien language that I've created. (It's called Themengi, and you can check it out here.) I want to parse player input, both imperative commands and dialogue, and respond appropriately. This language does not exist; there is no training corpus, so a trained model is not possible. On the other hand, this is a language, not an obfuscated version of SQL, so I don't want to simply apply programming-language techniques. I want to be able to handle ambiguity, case, and other linguistic features. In my search, I couldn't find a framework that did quite what I needed, so I decided to write my own. This post series will follow my progress working on this framework.
Along the way, I'll be illustrating the concepts with the running example of creating a tiny English grammar. I'm going to start from first principles, assuming only a little knowledge of grade-school English grammar terms (noun, verb, etc.). We'll start with a simple system of rules, write a recognizer, extend that into a parser, and annotate that parser with linguistic information. Then we'll have some fun at the end: maybe we'll write a tiny grammar for Japanese and do some symbolic translation, or write a mini rules-based digital assistant demo. The possibilities are endless; we could write a grammar for Klingon and make the first automated Japanese -> Klingon translation! Who knows!
So, with that out of the way, let's get going!
Starting from zero
So, we're modeling English. Big language, lot of words. Let's start smaller and look at a few sentences.
1). Mary fell.
2). Mary kissed Sue.
If you're familiar with the English grammar taught in school, you know the typical division of an English sentence into a subject, a verb, and an optional object. We can see that here, with example 1 lacking an object (the verb fell is intransitive), and example 2 having one (the verb kissed is transitive).
So how can we model a grammar for these two sentences? Well, specifically, our goal in modeling a grammar is to produce a model that:
- A, accepts all the sentences we deem grammatical
- B, rejects all the sentences we deem ungrammatical
- (C, later, we will produce a set of analyses for grammatical sentences. More in Part 3.)
For this set of sentences, that is actually very easy — just list every case!
// Recognize our (currently finite!) language by just listing every member.
fn is_grammatical(sentence: &str) -> bool {
    sentence == "Mary fell" || sentence == "Mary kissed Sue"
}
Jokes aside, this actually highlights an important point: if your grammar describes a finite set of strings, then you don't really need a parser!
However, human languages are not so restrictive as to have a finite number of sentences. Let's add two more examples:
3). Takeshi said that Mary kissed Sue.
4). Robert thought that Takeshi said that Mary kissed Sue.
This illustrates one of many ways that English (and, we believe, maybe sans Pirahã, every natural language) enables infinite recursion. You can just keep stacking nested "X said that..." onto each other until your tongue falls out. There are many other sources of infinite recursion as well, such as the genitive construction (I am your father's brother's nephew's cousin's...), or even using simple conjunctions (I went to the store and bought milk, eggs, more milk, more eggs...). It's trivially easy to find grammatical structures that allow for this infinite recursion. So, if we hope to model a reasonable subset of English, the method we choose has to be able to deal with this unbounded recursion. No lists of strings, and definitely no regexes.
So what method can we use? We need something that can describe a recursive structure. The usual tool for this in grammars, of the linguistic as well as programming language variety, is using rewrite rules.
All a rewrite rule does is specify how to replace one set of things with something else. If you've done basic arithmetic, you've used rewrite rules. Consider these two rules. They state symbolically what happens when you add or multiply something with zero.
A + 0 = A
A * 0 = 0
Here's an example we can try them out on:
x = 1 * (1 * (0 + 0))
This is just a complicated way to write 0, but how would you solve this via substitution? Well, starting from the right:
- A + 0 = A, so x = 1 * (1 * (0))
- A * 0 = 0, so x = 1 * (0)
- A * 0 = 0, again, so x = 0
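If it helps to see this in code, here's a minimal Rust sketch of those two rules applied as a recursive rewrite. The Expr type and simplify function are purely illustrative names of my own; they just make the substitution steps above mechanical:
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// Apply "A + 0 = A" and "A * 0 = 0" recursively, innermost expressions first.
fn simplify(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => match (simplify(*a), simplify(*b)) {
            (a, Expr::Num(0)) => a,                        // A + 0 = A
            (a, b) => Expr::Add(Box::new(a), Box::new(b)),
        },
        Expr::Mul(a, b) => match (simplify(*a), simplify(*b)) {
            (_, Expr::Num(0)) => Expr::Num(0),             // A * 0 = 0
            (a, b) => Expr::Mul(Box::new(a), Box::new(b)),
        },
        other => other,
    }
}

fn main() {
    // x = 1 * (1 * (0 + 0))
    let x = Expr::Mul(
        Box::new(Expr::Num(1)),
        Box::new(Expr::Mul(
            Box::new(Expr::Num(1)),
            Box::new(Expr::Add(Box::new(Expr::Num(0)), Box::new(Expr::Num(0)))),
        )),
    );
    assert_eq!(simplify(x), Expr::Num(0)); // all three rewrites fire, leaving 0
}
Like our by-hand substitution, simplify works from the innermost expression outwards until only 0 is left.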
So, rewrite rules handled that recursive structure, no problem. I wonder if we can apply that to our linguistic situation...
Rules for days
Before we start, a note on... notation. I'm going to switch how I annotate rules from the arithmetic notation (A + 0 = A) to the rewrite-rule notation used in Backus-Naur Form (BNF) and more generally in grammar engineering, which looks like this (yes, the arrow is pointing the other way):
A -> A + 0
We also generally don't use operators like (+) either, as they may occur as symbols in the strings we want to model, so to represent a rule rewriting the string 'B B' into 'A', we would use:
A -> B B
The reasons for the arrow and its counter-intuitive direction should become clearer later (it relates to trees, their labels, and their generation). For now, just think of -> as =, with the sides flipped from how they usually are.
So, what's our corpus again?
1). Mary fell.
2). Mary kissed Sue.
3). Takeshi said that Mary kissed Sue.
4). Robert thought that Takeshi said that Mary kissed Sue.
Well, what's a starting assumption we can make? One thing you might notice is that any noun can stand in for another:
Mary fell.
Sue fell.
Takeshi fell.
Robert fell.
In fact, more generally, parts of speech seem to act similarly. There is always a noun before the verb, for example. So let's rewrite all the words into their parts of speech.
All terminals (words) in our grammar will be lower-case, even proper nouns, to distinguish them from non-terminals like N, V, Comp, and so on.
N -> mary
N -> sue
N -> takeshi
N -> robert
V -> fell
V -> kissed
V -> said
V -> thought
Comp -> that // complementizer
A complementizer is the part of speech for words like "that" or "which", which are used to turn a sentence into the object of a verb, as in "Robert thought that...".
If we rewrite our corpus with these rules, we get:
1). N V.
2). N V N.
3). N V Comp N V N.
4). N V Comp N V Comp N V N.
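In code, this word-level rewriting is nothing more than a lookup table. Here's a quick Rust sketch, where pos_lexicon and tag are hypothetical names of my own, purely for illustration:
use std::collections::HashMap;

// The word -> part-of-speech rewrites above, as a lookup table.
fn pos_lexicon() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("mary", "N"), ("sue", "N"), ("takeshi", "N"), ("robert", "N"),
        ("fell", "V"), ("kissed", "V"), ("said", "V"), ("thought", "V"),
        ("that", "Comp"),
    ])
}

// Rewrite each word of a lower-cased, punctuation-free sentence into its part of speech.
fn tag(sentence: &str) -> Vec<&'static str> {
    let lexicon = pos_lexicon();
    sentence
        .split_whitespace()
        .map(|word| *lexicon.get(word).expect("word not in lexicon"))
        .collect()
}

fn main() {
    // Example 3 from the corpus:
    println!("{:?}", tag("takeshi said that mary kissed sue"));
    // => ["N", "V", "Comp", "N", "V", "N"]
}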
You might see a pattern there already! Let's introduce a rewrite to a new symbol, S (for sentence):
S -> N V
S -> N V N
S -> N V Comp N V N
S -> N V Comp N V Comp N V N
Hmm, right, we want a rule that describes the recursion. Well, if we look at our last 3 rules, we can make these groups, where each successive example adds N V Comp to the start of the last one:
S -> N V N
S -> N V Comp (N V N)
S -> N V Comp (N V Comp (N V N))
Since what is in the brackets is a full grammatical sentence according to our rules, we can just rewrite it to another S!
S -> N V N
S -> N V Comp S
S -> N V Comp S
The last two are now the same, so we only need one rule to represent a verb with a sentence as an object! Our complete rules now look like this:
S -> N V
S -> N V N
S -> N V Comp S
N -> mary
N -> sue
N -> takeshi
N -> robert
V -> fell
V -> kissed
V -> said
V -> thought
Comp -> that
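Since we'll eventually want these rules in code, here's a sketch of one way they could be written down as plain Rust data. The Rule struct and grammar function are my own illustrative stand-ins; the real representation will come when we start implementing in Part 2:
// One rewrite rule: a left-hand symbol and the sequence it rewrites to.
#[derive(Debug, Clone)]
struct Rule {
    lhs: &'static str,
    rhs: Vec<&'static str>,
}

fn grammar() -> Vec<Rule> {
    let rule = |lhs, rhs: &[&'static str]| Rule { lhs, rhs: rhs.to_vec() };
    vec![
        rule("S", &["N", "V"]),
        rule("S", &["N", "V", "N"]),
        rule("S", &["N", "V", "Comp", "S"]),
        // Lexicon: lower-case right-hand sides are terminals (words).
        rule("N", &["mary"]), rule("N", &["sue"]),
        rule("N", &["takeshi"]), rule("N", &["robert"]),
        rule("V", &["fell"]), rule("V", &["kissed"]),
        rule("V", &["said"]), rule("V", &["thought"]),
        rule("Comp", &["that"]),
    ]
}

fn main() {
    // Print the rules back out in the arrow notation above.
    for rule in grammar() {
        println!("{} -> {}", rule.lhs, rule.rhs.join(" "));
    }
}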
Let's test this grammar out by generating some sentences. To do this, let's start with our S symbol, and run it backwards through our rewrite rules (this is one reason the arrows face right, by the way).
Let's start with S.
S
We have multiple rules that can rewrite S. Let's pick S -> N V Comp S.
N V Comp S
We keep rewriting the results of that rule. The first element is an N. Let's pick takeshi. It doesn't have an associated rewrite rule (it's a terminal), so I'll italicize it to indicate it's done.
takeshi V Comp S
The next element is a V. Let's pick said. It's also a terminal, so I'll italicize it again.
takeshi said Comp S
Next we have a Comp. Our only option is that, so we'll rewrite it to that.
takeshi said that S
Next, we have another S, but this time it's the object. Let's pick S -> N V this time.
takeshi said that N V
We have an N again. Let's pick mary.
takeshi said that mary V
We have a V again. Let's pick fell.
takeshi said that mary fell
We've now rewritten all the symbols into italicized terminals, and we're finished generating.
Hey! That's not in our corpus -- but it probably should be. Remember, we showed that our corpus describes a grammar that can generate an infinite number of sentences. Since it is highly impractical to collect an infinite number of sentences for our corpus, it's a good thing that our model is generating sentences outside the corpus, as long as they are consistent with the corpus examples and with our own linguistic judgements. And Takeshi said that Mary fell seems like a good sentence to me.
Let's try again. We'll start with S:
S
We again have multiple rules that can rewrite S. Let's pick S -> N V this time.
N V
We have an N. Let's pick robert.
robert V
We have a V. Let's pick said.
* robert said
We finish with * robert said.
Oops.
Note on notation: the * here prefixing the sentence indicates that the sentence is ungrammatical.
That's not quite right. Our grammar is clearly also generating sentences that should not be in our corpus -- sentences that are not consistent with the other examples (or, more intuitively, English-as-she-is-spoke). Another way to say it is that our grammar is over-generating.
The problem here is that the rule S -> N V describes an intransitive verb (a verb that doesn't need an object), and said is a clausal verb, meaning it requires a sentence (clause) as an object. Likewise, if we had tried to use kissed, a transitive verb, it would have produced the ungrammatical sentence * Robert kissed.
Let's subdivide the verbs into some new classes.
// Split S into three separate rules:
S -> N IV // an intransitive rule, taking an IV
S -> N TV N // a transitive rule, taking a TV
S -> N CV Comp S // a clausal rule, taking a CV
N -> mary
N -> sue
N -> takeshi
N -> robert
IV -> fell // intransitive verb
TV -> kissed // transitive verb
CV -> said // clausal verb
CV -> thought // clausal verb
Comp -> that
Now, if we try to generate from the rule S -> N CV Comp S, no pesky intransitive verbs can slip into that CV slot. Likewise, if we pick S -> N IV, only the intransitive verb fell can fill that IV slot, preventing any * Robert kissed funny-business.
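To see this fix in action, here's a minimal sketch that mechanizes the "run the rules backwards" generation we did by hand, reusing the same illustrative Rule idea as before. The generate function and the explicit choice list are my own stand-ins; the point is that picking the first S rule and the fourth N rule (the pattern that gave us * robert said earlier) now yields a grammatical sentence, because only fell can fill the IV slot:
#[derive(Debug, Clone)]
struct Rule {
    lhs: &'static str,
    rhs: Vec<&'static str>,
}

fn grammar() -> Vec<Rule> {
    let rule = |lhs, rhs: &[&'static str]| Rule { lhs, rhs: rhs.to_vec() };
    vec![
        rule("S", &["N", "IV"]),
        rule("S", &["N", "TV", "N"]),
        rule("S", &["N", "CV", "Comp", "S"]),
        rule("N", &["mary"]), rule("N", &["sue"]),
        rule("N", &["takeshi"]), rule("N", &["robert"]),
        rule("IV", &["fell"]),
        rule("TV", &["kissed"]),
        rule("CV", &["said"]), rule("CV", &["thought"]),
        rule("Comp", &["that"]),
    ]
}

// Expand `symbol` by picking one of its rules (the next number in `choices`,
// defaulting to the first rule), then recursively expanding the result.
// Symbols with no rules are terminals: they are emitted as words.
fn generate(
    symbol: &'static str,
    rules: &[Rule],
    choices: &mut std::vec::IntoIter<usize>,
) -> Vec<&'static str> {
    let options: Vec<&Rule> = rules.iter().filter(|r| r.lhs == symbol).collect();
    if options.is_empty() {
        return vec![symbol]; // a terminal: just emit the word itself
    }
    let rule = options[choices.next().unwrap_or(0) % options.len()];
    let mut words = Vec::new();
    for &sym in &rule.rhs {
        words.extend(generate(sym, rules, choices));
    }
    words
}

fn main() {
    let rules = grammar();
    // 0: first S rule (now S -> N IV); 3: fourth N rule (robert);
    // remaining picks default to the first (and only) IV rule, fell.
    let mut choices = vec![0, 3].into_iter();
    println!("{}", generate("S", &rules, &mut choices).join(" ")); // robert fell
}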
This grammar is pretty good! It can recognize all the sentences in our corpus, it can reject some ungrammatical ones like * Robert kissed, and it even handles some basic recursion! It's ready to be implemented in code. For that, however, we will have to wait for Part 2, where we'll get to see the magic of the Earley algorithm, and reject those ungrammatical sentences with extreme computational prejudice.
1. I found this tutorial especially useful, as well as the obligatory Wikipedia article.
2. The standout book here is Syntactic Theory: A Formal Introduction, which covers the theory behind a symbolic grammar for a subset of English in exhaustive detail, using the HPSG formalism. The linguistic aspects of this series will draw heavily from this book. I also like Implementing Typed Feature Structure Grammars, which goes into more detail on the LKB system that is used by the English Resource Grammar, and implements the formalism described in Syntactic Theory.
3. A wide-coverage symbolic grammar of English, available at www.delph-in.net/erg/.
4. A library of modules for symbolic grammar engineering, plus a collection of grammars for mostly low-resource languages built with that library. matrix.ling.washington.edu/index.html.
5. One area where this gets especially problematic is when models internalize bias in their training data, and then repeat that bias. For example, a language model trained to predict missing words might learn that it's more often correct to predict male pronouns in engineering-related texts, and therefore learn that its embedding should associate maleness more closely with engineering than femaleness. If the model is then fine-tuned to, for example, suggest careers to high school seniors based on a bio written about them by a guidance counselor, it may have a bias towards referring women to non-engineering jobs. While there is some research in the space of de-biasing neural networks, the problem is far from solved.