VGEL•ME

tl;dr: The scoping on this series was bad. It should be a book. I released the code under an MIT license on GitHub as treebender. There's a tutorial there. I'm planning on posting some non-series intermediate Rust posts next. Thanks for reading the blog, I appreciate it.

As readers of this blog may be aware, I have been writing a series about implementing an Earley parser for linguistic utterances. This parser is being used for my in-development game Themengi, a game about learning an alien language, but I was also writing the series in part to advocate for symbolic parsing as better, in certain situations, than the dominant statistical / neural approach.

I published the first post, "Why?, and some theory", in April. Astute readers may notice that it is now October, coincidentally exactly 6 months after I published that first post. Considering that my original goal was to write 5 posts, even if the second post were done and ready to publish today, that would not be an encouraging pace.

Continue reading...

This is the first part in a five-part series that will cover implementing an Earley parser for linguistic utterances.

  • Part 1 - Why?, and some theory
  • Part 2 - The Earley recognizer
  • Part 3 - Felling the Earley parse forest
  • Part 4 - Feature-structures and unification
  • Part 5 - Wrapping up

Why write this series?

There are already many great explanations of the Earley parse algorithm, which is the parsing algorithm that we'll be using.1 There are also some great books on feature-structure based linguistic grammars, which I highly recommend.2 However, as far as I know nobody has put the two together into a beginner-friendly, implementation-oriented tutorial that explains what these tools are, how to use them together, and perhaps most importantly why someone would want to. This series will try to explain all of that through an iterative approach, starting with the simplest possible solution and expanding it only when necessary. The goal is that, when we finish, we'll have the simplest solution that meets our needs, with a codebase in which every line is justified.

What are we making, exactly?

The exact thing we're going to be building is a symbolic, linguistic grammar model. Those adjectives describe the somewhat awkward space this project sits in. We'll be using symbolic, rule-based techniques to parse language, but we need to handle a wider variety of structures than most rule-based parsers. And we'll be parsing natural language, but with rules, not with a learned model like a neural network. This combination has some serious advantages, and is worth examining in detail.

One spot we're stuck between is symbolic computer-language grammar models. These are the parsers that parse JavaScript, Python, JSON, C, the Markdown I'm writing this file in, and the HTML you're reading it in. In this space, symbolic models rule the day. However, these models are generally not suitable for linguistic use. For one, they tend to use parsing techniques, such as LALR, that have restrictions that make them unsuitable for parsing natural languages. For example, LALR cannot handle an ambiguous grammar. These grammars also lack an efficient representation of linguistic information, such as case.
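To make "an ambiguous grammar" concrete, consider the textbook example (my own illustration, not from the post): "I saw the man with the telescope". The same string has two legitimate structures, depending on whether the telescope phrase attaches to the verb or to the noun, sketched here as nested tuples in Python:

```python
# "I saw the man with the telescope" -- the prepositional phrase can attach
# to the verb (I used the telescope to see him)...
verb_attachment = (
    "S", ("NP", "I"),
         ("VP", ("V", "saw"),
                ("NP", "the man"),
                ("PP", "with the telescope")),
)

# ...or to the noun (the man I saw is holding a telescope).
noun_attachment = (
    "S", ("NP", "I"),
         ("VP", ("V", "saw"),
                ("NP", ("NP", "the man"),
                       ("PP", "with the telescope"))),
)
```

A deterministic technique like LALR has to commit to exactly one of these (or reject the grammar outright because of conflicts); a linguistic parser needs to be able to keep both around.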

The other spot we're stuck between is the now-ubiquitous linguistic models that are trained on data — usually based on neural networks. These models, such as BERT, take mountains of data — gigabytes of text — and train a model to do some basic linguistic task, such as predicting missing words. This gives the model an understanding of how the language works. Other engineers can then take this pre-trained language model, "fine-tune" it on a smaller (but still substantial) amount of data that more closely matches their specific task, and stick another network on the back that spits out whatever they need.

However, symbolic linguistic models are still alive and kicking in this space, both in the emerging field of neural-symbolic computing, and in very mature, wide-coverage linguistic models such as the English Resource Grammar3 and the grammar matrix4. These symbolic grammars are still used because they have some distinct advantages over trained models: they are predictable5, they are analyzable, and they don't require large amounts of training data. Our grammar will be built on these principles, and while it will not have coverage as wide as the ERG's (which covers >99% of the New York Times corpus!), it will reach levels of linguistic understanding that a trained model would need mountains of data to reach.

On the other hand, our grammar will show the downsides of this method as well: while a neural network can make a "best guess" at a grammar construction it's never seen before, that's near-impossible for a symbolic model. If we've only implemented simple sentences with one subject and one object (transitive), our model will choke the first time it sees "Mary gave Sue the book", which has one subject and two objects, "Sue" and "the book" (ditransitive). If we then only implement this ditransitive pattern, it will choke on "Mary gave the book to Sue", where "Sue" is marked with the preposition "to" (dative alternation). Symbolic grammar engineering is a cycle of implementing patterns and finding more patterns your model doesn't yet handle; eventually, you have to stop. For our purposes, a simple, low-coverage model will be enough to illustrate the important concepts.
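To make that cycle concrete, here's a rough sketch of the rule-adding treadmill, written as plain Python data purely for illustration (a hypothetical format, not necessarily how this series or treebender actually represents rules):

```python
# A toy context-free grammar as plain data: each symbol maps to the
# right-hand sides it can expand into. Symbols without an entry
# ("mary", "gave", "to", ...) are treated as literal words.
grammar = {
    "S":  [("NP", "VP")],
    "NP": [("mary",), ("sue",), ("the", "book")],
    "V":  [("saw",), ("gave",)],
    "VP": [
        ("V", "NP"),               # transitive: "Mary saw Sue"
        # Without the next rule, "Mary gave Sue the book" has no parse:
        ("V", "NP", "NP"),         # ditransitive: "Mary gave Sue the book"
        # ...and without this one, "Mary gave the book to Sue" still fails:
        ("V", "NP", "to", "NP"),   # dative alternation
    ],
}
```

Every new construction you run into means another hand-written rule, which is exactly the engineering cycle (and the stopping problem) described above.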

Where are we going?

For my specific use case, I'm working on a game based around the player learning an alien language that I've created. (It's called Themengi, and you can check it out here.) I want to parse player input, both imperative commands and dialogue, and respond appropriately. This language doesn't exist outside the game; there is no training corpus, so a trained model is not possible. On the other hand, this is a language, not an obfuscated version of SQL, so I don't want to simply apply programming-language techniques. I want to be able to handle ambiguity, case, and other linguistic features. In my search, I couldn't find a framework that did quite what I needed, so I decided to write my own. This post series will follow my progress working on this framework.

Along the way, I'll be illustrating the concepts with the running example of creating a tiny English grammar. I'm going to start from first principles, assuming only a little knowledge of grade-school English grammar terms (noun, verb, etc.). We'll start with a simple system of rules, write a recognizer, extend that into a parser, and annotate that parser with linguistic information. Then we'll have some fun at the end: maybe we'll write a tiny grammar for Japanese and do some symbolic translation, or write a mini rules-based digital assistant demo. The possibilities are endless; we could write a grammar for Klingon and make the first automated Japanese -> Klingon translation! Who knows!
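As a preview of where the first half of that plan goes, here's a deliberately naive sketch of "a system of rules plus a recognizer", written in Python purely for illustration. It is not treebender, and it cuts corners (no empty rules, no parse forest, no linguistic features), but it shows the predict/scan/complete shape of an Earley-style recognizer:

```python
def earley_recognize(grammar, start, words):
    """Return True if `words` can be derived from `start` under `grammar`.

    `grammar` maps each nonterminal to a list of right-hand sides (tuples);
    any symbol without an entry is a terminal, i.e. a literal word. A chart
    state is (lhs, rhs, dot, origin): a rule, how far through it we are,
    and the input position where we started using it.
    """
    chart = [set() for _ in range(len(words) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs) and rhs[dot] in grammar:
                # Predict: a nonterminal comes next, so start its rules here.
                for expansion in grammar[rhs[dot]]:
                    state = (rhs[dot], expansion, 0, i)
                    if state not in chart[i]:
                        chart[i].add(state)
                        agenda.append(state)
            elif dot < len(rhs):
                # Scan: a literal word comes next; advance into the next
                # chart column only if it matches the actual input.
                if i < len(words) and words[i] == rhs[dot]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Complete: this rule is finished, so advance every state
                # back at `origin` that was waiting on its left-hand side.
                for (l2, r2, d2, o2) in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        state = (l2, r2, d2 + 1, o2)
                        if state not in chart[i]:
                            chart[i].add(state)
                            agenda.append(state)

    return any(lhs == start and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in chart[-1])


# A tiny grammar is enough to try it out:
toy = {
    "S":  [("NP", "VP")],
    "NP": [("mary",), ("sue",)],
    "VP": [("saw", "NP")],
}
print(earley_recognize(toy, "S", "mary saw sue".split()))  # True
print(earley_recognize(toy, "S", "sue saw".split()))       # False
```

The series builds the real thing up properly over Parts 2 through 4.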

So, with that out of the way, let's get going!

Continue reading...

Xv6 is a fairly popular clone of Version 6 UNIX. It was made by MIT as a base for students to work off of, as the original V6 UNIX was showing its age and only ran on the outdated PDP-11 architecture. Xv6 runs natively on x86 and supports modern features like SMP, while still being only 15k lines of easily-grokked C.

As a simple example, we're going to add the pwd command to xv6. pwd prints the shell's current working directory. To do this, we'll need to write the getcwd system call, and also write the pwd userspace program.

Continue reading...

Note to start off: this entire article is written with Python 3.x. It may or may not work with 2.x. You can also access this article as an IPython notebook here.

Python is an amazingly introspective and hackable language, with a ton of cool features like metaclasses. One sadly underappreciated feature is the ability not only to inspect and disassemble, but to actually programmatically modify the bytecode of Python functions from inside the script. While this sounds somewhat esoteric, I recently used it for an optimization, and decided to write a simple starter article on how to do it. (I'd like to warn that I'm not an expert in Python internals. If you see an error in this article, please let me know so I can fix it.)
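To give a taste of what that looks like, here's a minimal sketch (mine, not lifted from the article) that inspects a function's code object with dis and then builds a patched copy of it. It assumes CPython 3.8+, since that's when CodeType.replace() was added; it only swaps a constant, which is the gentlest version of the trick, but the same replace() call also accepts co_code for rewriting the instructions themselves:

```python
import dis
import types

def greet():
    return "Hello, world!"

# Disassemble the function to see the bytecode CPython actually runs.
dis.dis(greet)

# The code object carries the raw bytecode plus its constants, names, etc.
code = greet.__code__
print(code.co_consts)

# Build a patched code object with one constant swapped, then wrap it in a
# brand-new function object.
patched_code = code.replace(co_consts=tuple(
    "Goodbye, world!" if c == "Hello, world!" else c
    for c in code.co_consts
))
patched_greet = types.FunctionType(patched_code, greet.__globals__)

print(greet())          # Hello, world!
print(patched_greet())  # Goodbye, world!
```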

Continue reading...

It seems I migrate my blog more than I update it, but I've done it once again. I got fed up with Heroku, and wanted a VPS for other projects anyway, so I bit the bullet, bought a VPS from Linode, and rewrote everything. I'm now using the amazing fabric library to statically generate my site and upload it to a VPS running nginx. I was even able to reuse most of my old code! Not that many people read this thing, but for the few who do, loading times shouldn't be > 10 seconds anymore (Heroku's free tier is seriously terrible for low-traffic sites). Besides that, everything should be basically the same, except that I also improved the RSS feed so it actually validates.

Continue reading...

I've had a Kinect for a while. I wrote a few interesting scripts with it, then forgot about it. Between then and now, I went through several OS upgrades that wiped away my changes. I decided I wanted to mess with it again, and had to reinstall OpenNI and the Python extensions. This is a somewhat tricky process, so here's how it's done.

Continue reading...

There's a lot of outdated and just plain wrong information on the internet about packaging Jython scripts. When I wanted to start a project in Jython, I figured I should find a simple, easy way to package an app as an executable (i.e., double-clickable) .jar. Turns out, it's a lot easier than it seems.

Continue reading...

I've migrated my blog to my own custom platform. I won't write about why I did it, since I don't have a specific technical reason why Blogger failed me; I simply wanted a bit more control over the platform, wanted a personal site on my own server anyway, and had a bit of an itch to scratch. Instead, I'll go over some of my technical and design choices and why I made them.

Continue reading...

After learning Haskell a few months ago, I've had disappointingly few opportunities to use it. For a simple data crunch, a Python REPL starts up faster and lets me work quicker, and many of my other projects require a specific language. So when I was browsing Reddit and saw /u/SWEAR_WORD_SEARCH, I thought it would be a fun, quick project to crack these word searches in Haskell. (Quick warning: if you didn't guess from the SWEAR_WORD part, this article contains some profane word searches ;-)

Continue reading...

I use XFCE in Ubuntu, and for a while used the default network applet built into the status bar. Eventually I wanted to get rid of it, since it occasionally had problems connecting to networks (I use a laptop and thus switch wifi networks often). Simply removing the status bar made XFCE replace it with nm-applet, which is much better. However, I was shocked to find over 500 duplicates of one network (network-name, network-name 1, network-name 2, ..., network-name 538)! Luckily, it's quite easy to delete these and other unused networks without manually clicking each name and then deleting it in nm-connection-editor.

Standard disclaimer: I'm not responsible if you fuck up your system blah blah blah.

Continue reading...

Since some people aren't getting it, the beginning of this article is a joke. What editor you use isn't important! As long as it's ed.

## Getting started with Ed - the standard editor

Many modern programmers might laugh at ed. "What a quaint editor!" It doesn't have code completion! I can't load it with thousands of shitty plugins! Its name is only 2 letters! It's not written in hipster-script! Well, they're wrong. Ed is the standard text EDitor. It's on every Unix machine, ever! It doesn't waste your time with useless effects like cursors or showing you the current contents of the file. It's efficient! It's better! It's... ed!

Continue reading...

FreeTTS is quite a handy text-to-speech synthesizer that works in Java. It's also a bitch to install and use.

Continue reading...

Since proggit doesn't allow self-posts, I can stick my thoughts here and link to them instead.

Continue reading...