vgel.me

- projects & stuff

permalink for Getting a full PDF from a DRM-encumbered online textbook

Getting a full PDF from a DRM-encumbered online textbook

I recently started a calculus course that uses an online textbook. Buying this textbook online was mandatory, not for the content, but to get an electronic access code for homework assignments. While I had the option of additionally buying a physical copy of the book, I don't like the idea of textbook publishers trying to squeeze the used books market with scummy tactics like this. On top of this, unless I paid extra, I will lose access to this book at some point in the future. That is unacceptable to me. So... I'm going to crack it.

(Yes, I probably could have just torrented a PDF copy. But that's no fun!)

The DRM on this textbook is pretty intense. Of course, there isn't a "download PDF" option. There is a printing option, but it's limited to 10 pages at a time, and prints the pages out with a large watermark in the center, along with licensing info (my name, number, and a "do not scan, copy, duplicate, distribute, or exercise any freedom with the material" notice) in the margins. Fun!

First, we need to download all the pages. Due to the download limit, this is going to take forever... right? Nope! A little Clojure and java.awt.Robot has our mouse pointer whizzing around the screen by itself.

(ns scraper.core)

(import '(java.awt Robot)
        '(java.awt.event KeyEvent InputEvent))

(use '[clojure.string :only (join)])

(defn char-to-key [c]
    (KeyEvent/getExtendedKeyCodeForChar (int c)))

(defn type-key [r c]
    (.keyPress r c)
    (.delay r 100)
    (.keyRelease r c))

(defn type-string [r s]
    (doseq [c (seq s)]
        (let [upcase? (Character/isUpperCase c)]
              (if upcase? (.keyPress r KeyEvent/VK_SHIFT))
              (.delay r 100)
              (type-key r (char-to-key c))
              (if upcase? (.keyRelease r KeyEvent/VK_SHIFT))
              (.delay r 100)))
    (.waitForIdle r))

(defn mouse-down [r] (.mousePress r InputEvent/BUTTON1_DOWN_MASK))
(defn mouse-up [r] (.mouseRelease r InputEvent/BUTTON1_DOWN_MASK))

(defn click-mouse [r f]
    (when-not (nil? f)
        (f r))
    (doto r
        (mouse-down)
        (.delay 100)
        (mouse-up)
        (.waitForIdle)))

(defn to-next-page [r] (.mouseMove r 727 752))

(defn to-print-button [r] (.mouseMove r 758 154))
(defn to-page-range [r] (.mouseMove r 630 406))
(defn to-page-range-box [r] (.mouseMove r 713 401))

(defn type-range [r l u]
    (type-string r (format "%s-%s" (str l) (str u))))

(defn to-modal-print [r] (.mouseMove r 624 527))
(defn to-save-button [r] (.mouseMove r 285 166))

(defn rename-file [r new-name]
    (doto r
        (.delay 1500)
        (type-key KeyEvent/VK_BACK_SPACE)
        (type-key KeyEvent/VK_BACK_SPACE) ; just to be sure :^)
        (.delay 1500)
        (type-string new-name)
        (.delay 1500)))

(defn to-modal-save [r] (.mouseMove r 1008 734))

(defn open-print-menu [r]
    (doto r
        (click-mouse to-print-button)
        (.delay 3000)
        (click-mouse to-modal-print)
        (.delay 1500)))

(defn save-file-as [r new-name extra-wait]
    (Thread/sleep extra-wait)
    (doto r
        (click-mouse to-save-button)
        (.delay 1500)
        (rename-file new-name)
        (.delay 1500)
        (click-mouse to-modal-save)))

(defn print-pages-single [r start end prefix]
    (doseq [n (range start end)]
        (doto r
            (open-print-menu)
            (.delay 1500)
            (save-file-as (format "%s%04d" prefix n) 1000)
            (.delay 1500)
            (click-mouse to-next-page)
            (.delay 4500))))

(defn print-pages-range [r start end prefix]
    (doseq [[range-start range-end] (map (juxt first last) (partition 10 10 [] (range start end)))]
        (let [range-name (format "%s%d-%s%d" prefix range-start prefix range-end)
              range-file-name (format "%s%04d_%s%04d" prefix range-start prefix range-end)]
            (doto r
                (click-mouse to-print-button)
                (.delay 3000)
                (click-mouse to-page-range)
                (.delay 1500)
                (click-mouse to-page-range-box)
                (.delay 1500)
                (type-string range-name)
                (.delay 1500)
                (click-mouse to-modal-print)
                (.delay 3000)
                (save-file-as range-file-name 10000)
                (.delay 4500)))))

(defn print-pages [r page-ranges]
    (doseq [s page-ranges]
        (let [[start end prefix single?] s]
            ((if single?
                print-pages-single
                print-pages-range) start end prefix))))

(let [r (Robot.)]
    (println "Starting in 1 second...")
    (doto r
        (print-pages-single 0 1 "0000_cover")
        (print-pages-single 1 30 "0000_prologue")
        (print-pages-range 1 1171 "")
        (print-pages-range 1 147 "A")
        (print-pages-range 1 11 "R")
        (.waitForIdle)))

My Clojure was pretty rusty, so the code is far from pretty, and I got around timing problems by adding more sleeps... but with some trial-and-error, it worked pretty well. Several coffee/tea/tinder breaks later, broken up by restarting the scraping process where it broke for some reason, and all the pages are living on my hard drive. Nice! Er, except the ones that didn't get captured due to timing issues. A bit of Python magic found which pages weren't grabbed correctly, though, and I was able to rerun the scraper on just those ranges to clean up the remnants. Overall, the process took around a day, which while not ideal wasn't too bad. I experimented with taking regional screenshots to actually detect if UI elements were ready instead of just guessing, but in reality if I was doing this to more books and wanted it to be robust I would look into cracking the .swf itself.

Now, we need to get a page into image form, so we can play around with it in GIMP. Once we get a process worked out, we can automate it with ImageMagick and process all 1500-odd pages of the book. Getting this image is easy: pick a page and run:

convert -density 300 -quality 100 $PDF -crop 1846x2306+208+322 out.png

to turn it into a high-quality PNG file.

Luckily page R11 was totally white, so converting it to a PNG yielded a clean, isolated copy of the watermark.

Dealing with the margins will be easy, we can just crop them out, so lets focus on the watermark. I made everything but the watermark itself transparent in GIMP, and after that removing it from the original image is as simple as overlaying the cleaned-up version on the page and setting the watermark's layer mode to divide.

Now we need to repeat our earlier PDF->PNG conversion for all the files. This wasn't much harder with a dash of GNU parallel (an incredibly handy tool):

parallel convert -density 300 -quality 100 {} '~/book-imgs/{/.}-%d.png' ::: pdfs/*.pdf

Now, the fun part - automating the image-munging process! With the watermark image in watermark.png and a test page image named test.png, we can easily replicate our GIMP process in ImageMagick:

convert test.png watermark.png -compose Divide_Src -composite out.png

Ahh, shell is a wonderful thing.

And we have a nice, clean page... which I'd love to show you, but, copyrights. So just image a pristine (well, there's a few artifacts) textbook page, with no ugly watermarks. Ahhhh.

Now, lets do this 1500 times! Time for reach for parallel again:

parallel convert {} -background white -fill white -draw \"rectangle 160,163 1799,215\" -draw \"rectangle 160,2725 2342,2834\" -flatten watermark.png -compose Divide_Src -composite \"/home/jon/book-imgs-proc/{/}\" ::: ~/book-imgs/*.png

(I could have used mogrify and done it in-place, but I wanted to keep a backup. The first parallel command took quite a while to run.)

This command seems a bit confusing, so I'll break it into its constituent pieces.

The first part is -background white ... -flatten, which fills the transparent edges with white. I wanted the images to have an 8.5x11 ratio (because I'm a silly American), and it turns out they already were in that ratio - if I included the transparent part. No cropping required!

The next part is -fill white -draw \"rectangle 160,163 1799,215\" -draw \"rectangle 160,2725 2342,2834\". Since we aren't cropping out the licensing text, I'm instead simply covering it up with some filled-white rectangles. The coordinates took a bit of tweaking, but it worked out pretty well.

Finally after we -flatten, and all the operations have happened on the source image, we can load the watermark image and divide by it as before: watermark.png -compose Divide_Src -composite \"/home/jon/book-imgs-proc/{/}\". To make everything work, I had to manually move the watermark up in GIMP to get it to align. Not sure why, but after I did this everything processed basically perfectly (it's still not exactly aligned, so there's some very thin gray lines, but it's good enough for me).

Now, we can use convert again (ImageMagick is so useful) to join the PNG images into a single PDF:

convert ~/book-images-proc/*.png book.pdf

Well actually, convert likes to use... lots of RAM, so I actually ran:

nice -n 19 convert -limit area 1GiB -limit memory 1GiB -limit map 1GiB ~/book-imgs-proc/*.png book.pdf

...and went out with a friend, then came back and read The C Programming Language for a bit, then browsed Hacker News for a bit... it ended up running for almost 3 hours but it chewed through all the pages eventually. The resulting PDF was more than 700 MiB.

Now we can use basically any PDF OCR tool to make the text searchable. If I had the motivation I could probably scrape the original text from the book to get it perfect, but I don't care that much, so OCR it is.

I already have Ruby and the Tesseract OCR engine installed, so I just grabbed the one-script pdfocr tool from its Github repo. One extra command installation it needed for some reason and...

./pdfocr -t -i book.pdf -o book-ocr.pdf

...another hour or so of waiting later and my PDF was done! Basically 100% searchable and cleanly formatted. Over buying a "lifetime of edition" code I saved at least $120, so I'm pretty happy with this project overall.

permalink for Adding the pwd command to xv6

Adding the pwd command to xv6

Xv6 is a fairly popular clone of Version 6 UNIX. It was made by MIT as a base for students to work off of, as the original V6 UNIX was showing it's age, and only ran on the outdated PDP11 architecture. Xv6 runs natively on x86, and supports modern features like SMP, while still being only 15k lines of easily-grokked C.

As a simple example, we're going to add the pwd command to xv6. pwd prints the shell's current working directory. To do this, we'll need to write the getcwd system call, and also write the pwd userspace program.

First, to set up the system call we need to add a bit of boilerplate.

Firstly, assign the syscall an id in syscall.h:

...
#define SYS_close  21
#define SYS_getcwd 22 // <-- add this

Then, to syscall.c, add definitions for the system call:

...
extern int sys_uptime(void);
extern int sys_getcwd(void); // <-- add this

...
[SYS_close]   sys_close,
[SYS_getcwd]  sys_getcwd, // <-- add this

We need to define the getcwd function for user code as well. Add the header definition to user.h:

...
int uptime(void);
int getcwd(void*, int); // <-- add this

To define the actual function, we need to use assembly (to issue the interrupt to switch to kernel mode). Luckily, that's done for us, defined as the SYSCALL macro. Add a definition for getcwd to usys.S:

...
SYSCALL(uptime)
SYSCALL(getcwd) // <-- add this

So how does all this boilerplate work? getcwd, when called by user code, will use the definition in user.h. This will be linked to the assembly function generated by the SYSCALL macro. This function moves the SYS_getcwd constant into %eax and then issues the interrupt, switching to kernel mode. The kernel looks up %eax (22, SYS_getcwd) in the syscalls array, and sees the entry we added, which maps SYS_getcwd to the sys_getcwd function... which we need to write!

There are several files with system call implementations in the kernel. The best place for this is probably sysfile.c, as this file contains other file-related interrupts like sys_read and sys_exec.

This system call is a bit of code. I'm going to dump it all here, but don't worry, I'll walk through it piece by piece :). Prepared?

int
name_of_inode(struct inode *ip, struct inode *parent, char buf[DIRSIZ]) {
    uint off;
    struct dirent de;
    for (off = 0; off < parent->size; off += sizeof(de)) {
        if (readi(parent, (char*)&de, off, sizeof(de)) != sizeof(de))
            panic("couldn't read dir entry");
        if (de.inum == ip->inum) {
            safestrcpy(buf, de.name, DIRSIZ);
            return 0;
        }
    }
    return -1;
}

int
name_for_inode(char* buf, int n, struct inode *ip) {
    int path_offset;
    struct inode *parent;
    char node_name[DIRSIZ];
    if (ip->inum == namei("/")->inum) { //namei is inefficient but iget isn't exported for some reason
        buf[0] = '/';
        return 1;
    } else if (ip->type == T_DIR) {
        parent = dirlookup(ip, "..", 0);
        ilock(parent);
        if (name_of_inode(ip, parent, node_name)) {
            panic("could not find name of inode in parent!");
        }
        path_offset = name_for_inode(buf, n, parent);
        safestrcpy(buf + path_offset, node_name, n - path_offset);
        path_offset += strlen(node_name);
        if (path_offset == n - 1) {
            buf[path_offset] = '\0';
            return n;
        } else {
            buf[path_offset++] = '/';
        }
        iput(parent); //free
        return path_offset;
    } else if (ip->type == T_DEV || ip->type == T_FILE) {
        panic("process cwd is a device node / file, not a directory!");
    } else {
        panic("unknown inode type");
    }
}

int
sys_getcwd(void)
{
    char *p;
    int n;
    if(argint(1, &n) < 0 || argptr(0, &p, n) < 0)
        return -1;
    return name_for_inode(p, n, proc->cwd);
}

That wasn't so bad! Lets look at a small piece first.

int
sys_getcwd(void)
{
    char *p;
    int n;
    if(argint(1, &n) < 0 || argptr(0, &p, n) < 0)
        return -1;
    return name_for_inode(p, n, proc->cwd);
}

This function is pretty simple. argint and argptr just take arguments from the stack - the order is reversed because argptr needs the length of the buffer. proc is a global variable containing the current process, and cwd is the inode for the processes current directory. So we already have the current directory as proc->cwd, what's all the other code for?

In Xv6 (and most Unix file systems), the inode, short for 'index node' (though this may be a backronym), is a number pointing to a specific block on the disk that holds information about the file. On xv6, the inode struct stores only a few pieces of data. You can see them for yourself in file.h if you're interested. However, the inode is a layer of abstraction below the concept of the filesystem hierarchy and file names. The filesystem hierarchy is created with special inodes - type T_DIR - which contain a series of dirent structures. Each dirent is a tuple of a string name and an inode which references the file associated with that name. inodes themselves have no concept of file names, and can in fact be contained in multiple parent directories with multiple different file names. For example, one directory "foo" could have a dirent ("bar", inode 20), and another directory "baz" could have a dirent ("buzz", inode 20). What file name is inode 20? Is it /foo/bar, or is it /baz/buzz? The OS doesn't have to decide, since it never has a need to transform inodes to file names, so it doesn't.

We, however, have to decide. Luckily we have an out - if we had to start with a file this would be much trickier, but proc->cwd is always a directory, and thankfully create always makes the .. entry pointing to a directory parent. We will simply write a recursive function:

func psuedocode_func(inode):
    if inode is root:
        return "/"
    else:
        name := ""
        for dirent in inode.parent:
            if dirent.inode == inode:
                name = dirent.name
                break
        if not name:
            panic("inode not in parent")
        return psuedocode_func(inode.parent) + name + "/"

Of course, when we implement this in C and add error handling, it will get much uglier ;). Lets start with the implementation of that for loop. In the actual C code, this is the separate function name_of_inode:

int
name_of_inode(struct inode *ip, struct inode *parent, char buf[DIRSIZ]) {
    uint off;
    struct dirent de;
    for (off = 0; off < parent->size; off += sizeof(de)) {
        if (readi(parent, (char*)&de, off, sizeof(de)) != sizeof(de))
            panic("couldn't read dir entry");
        if (de.inum == ip->inum) {
            safestrcpy(buf, de.name, DIRSIZ);
            return 0;
        }
    }
    return -1;
}

Lets start with the header:

int
name_of_inode(struct inode *ip, struct inode *parent, char buf[DIRSIZ]) {

Since this is C, we take the string as a buffer, and return an int to show success/failure. DIRSIZ is a constant (14), the max file name length.

uint off;
struct dirent de;

I predeclared the variables, as this is the convention in the rest of the xv6 source. I did do the brackets differently, but eh.

for (off = 0; off < parent->size; off += sizeof(de)) {
    if (readi(parent, (char*)&de, off, sizeof(de)) != sizeof(de))
        panic("couldn't read dir entry");

Directories in the xv6 filesystem are really files whose contents are a just a series of dirent structures. Dirent is just a structure of a ushort inode id and a name. That means all this loop does is loop over every directory entry in the directory, loading it into de with readi.

if (de.inum == ip->inum) {
    safestrcpy(buf, de.name, DIRSIZ);
    return 0;
}

We check de to see if it's referring to our sought-after inode, ip. If it is, we use safestrcpy from string.h to copy it over in the buffer and do a standard C successful return.

If we don't find the inode, we return -1. This shouldn't happen unless the filesystem is pretty broken (.. points to the wrong parent).

Pretty simple, overall! Now lets move to the big function.

int
name_for_inode(char* buf, int n, struct inode *ip) {
    int path_offset;
    struct inode *parent;
    char node_name[DIRSIZ];
    if (ip->inum == namei("/")->inum) { //namei is inefficient but iget isn't exported for some reason
        buf[0] = '/';
        return 1;
    } else if (ip->type == T_DIR) {
        parent = dirlookup(ip, "..", 0);
        ilock(parent);
        if (name_of_inode(ip, parent, node_name)) {
            panic("could not find name of inode in parent!");
        }
        path_offset = name_for_inode(buf, n, parent);
        safestrcpy(buf + path_offset, node_name, n - path_offset);
        path_offset += strlen(node_name);
        if (path_offset == n - 1) {
            buf[path_offset] = '\0';
            return n;
        } else {
            buf[path_offset++] = '/';
        }
        iput(parent); //free
        return path_offset;
    } else if (ip->type == T_DEV || ip->type == T_FILE) {
        panic("process cwd is a device node / file, not a directory!");
    } else {
        panic("unknown inode type");
    }
}

The header and beginning are similar, so I'll skip over them.

if (ip->inum == namei("/")->inum) { //namei is inefficient but iget isn't exported for some reason
    buf[0] = '/';
    return 1;
}

This is the inode is root line in the psuedocode. As the comment states, iget is a private function, so we have to use the slightly less efficient namei, which is basically a wrapper which can turn a full path into an inode. It has a fast path for "/", so the parsing code doesn't make it much slower thankfully. Anyway, if the node is the root, we set buf to "/", just as in the psuedocode.

} else if (ip->type == T_DIR) {

This line is rather self-explanatory. The inode types are defined in stat.h as T_DIR, T_FILE, and T_DEV - a device node (the things in /dev/ on most Unix systems).

parent = dirlookup(ip, "..", 0);
ilock(parent);

We grab the parent reference with dirlookup. The . and .. entries could feasibly not exist (they're just normal directories, the kernel makes them in when the directory is created). I didn't explicitly deal with this case, but if the entry does not exist it will return null, and passing null to ilock will cause a panic, which is really the only thing to do in this case anyways. ilock makes sure the inode is loaded from disk.

Next, we call our name_of_inode function that we already discussed, and do that wonderful C string manipulation. Finally, we release the inode with iput, and return the length of the path, either for the benefit of the next name_for_inode down, or for the userspace caller.

The other checks are simply to make sure something extremely weird hasn't happened.

The system call is now, amazingly, done! The getcwd function will work fine from userspace code! Lets write pwd to test it:

#include "types.h"
#include "user.h"

#define MAX_PATH 512

int
main(int argc, char *argv[])
{
    char path[MAX_PATH];
    getcwd(path, MAX_PATH);
    printf(0, "%s\n", path);
    exit();
}

Yep, that's it! This is all extremely self explanatory, hello world even, so the only notes I'll leave are that printf is defined a little oddly (the first argument is which stream to print to, 0 being stdout and 1 being stderr, and that you have to use exit(), not return 0; to end your program. If you use return 0;, everything will work, but the kernel will complain a bit and print an ugly error message.

The only thing left to do is add pwd to the makefile.

...
        _zombie\
        _pwd\ # <-- add this

Binaries compile to _[c file name], so this just sets up the pwd binary.

...
        printf.c umalloc.c\
        pwd.c\ # <-- add this

Also add pwd.c as an extra file, which I think is only used for distribution, but you might as well.

That's it! You can now run make qemu (assuming you haven't already jumped the gun and tested it), and if you make a few levels of directories and cd into them, pwd will work (keep in mind there is no $PATH, so if you're in /foo/bar/baz, you need to run ../../../pwd for the shell to find it).

Happy hacking! Maybe try implementing grep?

permalink for Patching Function Bytecode in Python

Patching Function Bytecode in Python

Note to start off: this entire article is written with Python 3.x. It may, or may not work with 2.x. You can also access this article as an iPython notebook here.

Python is an amazingly introspective and hackable language, with a ton of cool features like metaclasses. One sadly unappreciated feature is the ability to not only inspect and disassemble, but actually programmatically modify the bytecode of Python functions from inside the script. While this sounds somewhat esoteric, I recently used it for an optimization, and decided to write a simple starter article on how to do it. (I'd like to warn that I'm not an expert in Python internals. If you see an error in this article please let me know so I can fix it).

To start off, let's declare a simple function we can play around with.

def func():
    a = 1
    b = 2
    c = a + b
    return c

I tried to avoid using any advanced Python features, just to keep the bytecode simple. However, syntax sugar like tuple unpacking and combined operator/assignment really doesn't add much complexity to the bytecode. Anyways, let's take a look at this function with Python's built-in dissassembler (things like this are why I love Python):

>>> import dis
>>> dis.dis(func)

  2           0 LOAD_CONST               1 (1)
              3 STORE_FAST               0 (a)

  3           6 LOAD_CONST               2 (2)
              9 STORE_FAST               1 (b)

  4          12 LOAD_FAST                0 (a)
             15 LOAD_FAST                1 (b)
             18 BINARY_ADD
             19 STORE_FAST               2 (c)

  5          22 LOAD_FAST                2 (c)
             25 RETURN_VALUE

That's actually not that complicated! The numbers on the far left are line numbers in the function. The number right before the opcode name is the byte offset that opcode starts at (Python opcodes are either 1 or 3 bytes each, depending on whether they take an argument [but can be extended 3 more bytes with EXTENDED_ARG]). The LOAD/STORE opcodes operate on the Python stack. STORE_FAST is the fast-path for storing to local variables. That means that the first 4 opcodes are simply storing 1 to a and 2 to b. Next, we load a and b again, and use the BINARY_ADD opcode to add them. That is STORE_FAST'd to c, and finally c is loaded and returned. Simple!

Now, lets take a look at the REAL opcodes. dis just prints them prettily for us, but behind the scenes these opcodes are really just bytes.

>>> print(func.__code__.co_code)

b'd\x01\x00}\x00\x00d\x02\x00}\x01\x00|\x00\x00|\x01\x00\x17}\x02\x00|\x02\x00S'

That's what the bytecode really looks like. It's a bytes object, so print is treating it like a string. Lets look at the raw numbers instead.

>>> print(list(func.__code__.co_code))

[100, 1, 0, 125, 0, 0, 100, 2, 0, 125, 1, 0, 124, 0, 0, 124, 1, 0, 23, 125, 2, 0, 124, 2, 0, 83]

That's a lot better! In fact, looking at it we see it's very similar to the dis output from earlier. Look at the first three bytes: 100, 1, and 0. If you recall, the first line of dis output was LOAD_CONST 1. And if we import another module...

>>> import opcode
>>> opcode.opmap['LOAD_CONST']
100

100! I'll skip to the chase and tell you that the next two bytes are interpreted as first_byte + 256 * second_byte, and that is used as the arg to LOAD_CONST. This arg is an index into the co_consts object:

>>> func.__code__.co_consts
(None, 1, 2)

That means all those first 3 bytes are doing is loading the first constant. The next 3 bytes, 125, 0, 0, unsuprisingly are the STORE_FAST. We already know the argument formula, and 0 + 256 * 0 is 0. This is used in co_varnames:

>>> func.__code__.co_varnames
('a', 'b', 'c')

The next 6 bytes repeat the same process, just for b. 124/LOAD_FAST also takes a co_varnames index, and this is executed twice for a and b. Now we get to the more interesting part, doing the addition. This is the first non-argument opcode we've seen. Remember earlier I said opcodes could be either 1 or 3 bytes? Python handles this in a very simple way - if the opcode is < 90, it is 1 byte. Otherwise it's 3 bytes. Addition is opcode 23, so it is 1 byte and has no arguments (the actual addition arguments were already loaded with LOAD_FAST). While addition could take arguments and load them itself, since there are several kind of loads, it was most likely easier for the Python developers to split that work into 3 opcodes, than to have several versions of BINARY_ADD and INPLACE_ADD (a += b) that load their arguments in different ways.

After the BINARY_ADD/23 byte, we have another STORE_FAST, LOAD_FAST, and finally another 1-byte opcode RETURN_VALUE/83. Pretty simple overall!

Of course we aren't going to stop there! Lets do something interesting with this function. We'll patch it so that instead of setting c = a + b, it subtracts.

First, lets write a function that takes a list of opcodes and an opcode, and returns the first index of that opcode.

import opcode
def find_opcode_index(opcodes, op):
    i = 0
    while i < len(opcodes):
        if opcodes[i] == op:
            return i
        if opcodes[i] < opcode.HAVE_ARGUMENT: #90, as mentioned earlier
            i += 1
        else:
            i += 3
    return -1

>>> find_opcode_index(list(func.__code__.co_code), opcode.opmap['BINARY_ADD'])
18

6 3-width opcodes in, as we would expect. Now, we simply have to replace this opcode with BINARY_SUBTRACT, right? Nope - you can't mutate __code__'s co_code, it's read-only. We have to reconstruct the entire code object.

>>> help(func.__code__)

Help on code object:

class code(object)
 |  code(argcount, kwonlyargcount, nlocals, stacksize, flags, codestring,
 |        constants, names, varnames, filename, name, firstlineno,
 |        lnotab[, freevars[, cellvars]])
 ...

When "Not for the faint of heart." is quoted in the documentation, you know you're doing something interesting ;). Let's get to making this thing!

fco = func.__code__ # we'll be using this a lot, so let's make it shorter
func_code = list(fco.co_code)
add_index = find_opcode_index(func_code, opcode.opmap['BINARY_ADD']) #18, as calculated earlier
if add_index >= 0: #fix iPython weirdness with re-running cells, don't worry about this
    func_code[add_index] = opcode.opmap['BINARY_SUBTRACT'] # actually replace the opcode
# the fun part starts here
func.__code__ = type(fco)(
    fco.co_argcount,
    fco.co_kwonlyargcount,
    fco.co_nlocals,
    fco.co_stacksize,
    fco.co_flags,
    bytes(func_code),
    fco.co_consts,
    fco.co_names,
    fco.co_varnames,
    fco.co_filename,
    fco.co_name,
    fco.co_firstlineno,
    fco.co_lnotab,
    fco.co_freevars,
    fco.co_cellvars
) # I think this type is a record for the most __init__ arguments in the Python stdlib. Luckily we're just copying them all over

(An interesting side note: Python actually type-checks assignments to __code__. It will throw an exception if you do something silly, like assign None)

>>> func()
-1

We did it! func() now sets c = a - b, which is 1 - 2, which is what we got! Now we're done and can go home!

Just kidding. Lets do something more interesting!

The problem that initially led me to look into Python bytecode was actually an optimization problem. For reasons that are outside the scope of this article, I had to optimize a function that in a tight inner loop needed to use a passed-in math operator. That's to say, the function was passed, say, +, and inside the loop it had to apply + to several million pairs of numbers. To do this, at the beginning of the function it assigned a local variable to the corresponding operator.foo function, such as operator.add:

>>> import operator
>>> operator.add(1, 2)
3

and inside the loop, it called this function on all the pairs of numbers. This was fine, and is the Pythonic way to approach this problem. However, again for reasons outside the scope of this article, the function had to run in under 20 seconds, and it was taking a few seconds more than that. I had stripped out every other micro-optimization possible, and I couldn't rewrite the function in any other language (again, reasons out of scope...). By patching the function, I was able to reduce the runtime to 18 seconds. Lets work out how to do this!

To clarify the problem, I had a function that at its core looked something like this (imagine the code in this function inside several nested loops, one with an n of several hundred thousand, and with more complex numbers being added than just i ;-) ):

def slow_func(op=operator.add):
    c = 0
    for i in range(100):
        c = op(i, c)
    return c

Also, for simplicity we'll assume that the function is only called with one op ever (in the inspiring program, that operator was named on the command line). That means we don't need to generate multiple versions of the function, we can just patch the main one. It would be relatively simple to generate a version for each operator and store them in a dict, it's just not really related to the main task of patching the function. Lets get to it!

>>> import dis
>>> dis.dis(slow_func)

  2           0 LOAD_CONST               1 (0)
              3 STORE_FAST               1 (c)

  3           6 SETUP_LOOP              35 (to 44)
              9 LOAD_GLOBAL              0 (range)
             12 LOAD_CONST               2 (100)
             15 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             18 GET_ITER
        >>   19 FOR_ITER                21 (to 43)
             22 STORE_FAST               2 (i)

  4          25 LOAD_FAST                0 (op)
             28 LOAD_FAST                2 (i)
             31 LOAD_FAST                1 (c)
             34 CALL_FUNCTION            2 (2 positional, 0 keyword pair)
             37 STORE_FAST               1 (c)
             40 JUMP_ABSOLUTE           19
        >>   43 POP_BLOCK

  5     >>   44 LOAD_FAST                1 (c)
             47 RETURN_VALUE

That's a little more complicated than our first function, but it's not that scary. I'll skip an opcode-by-opcode rundown this time, and just say that the for loop starts at SETUP_LOOP, the body starts at FOR_ITER, and it ends/repeats at POP_BLOCK. Also, op can be loaded with LOAD_FAST as arguments are treated as local variables in Python, but range has to be loaded with LOAD_GLOBAL. Other than that, nothing is very different than our earlier, simpler function.

Lets redefine our function now in a form that's easier to manipulate. This isn't strictly necessary, we could easily just patch out the op call, it just makes things a bit clearer, and more importantly puts a big marker in the function that things aren't as straightforward as they seem. We don't want to confuse any maintenance programmers (well, any more then they already will be...)

def slow_func_tbp():
    c = 0
    for i in range(100):
        c = MARKER_FUNCTION_FOR_PATCHING(c, i)
    return c

We removed the argument since it won't be used, and replaced the call to it with a non-existing function we will be able to easily look for. The new bytecode looks like this:

>>> dis.dis(slow_func_tbp)

  2           0 LOAD_CONST               1 (0)
              3 STORE_FAST               0 (c)

  3           6 SETUP_LOOP              35 (to 44)
              9 LOAD_GLOBAL              0 (range)
             12 LOAD_CONST               2 (100)
             15 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             18 GET_ITER
        >>   19 FOR_ITER                21 (to 43)
             22 STORE_FAST               1 (i)

  4          25 LOAD_GLOBAL              1 (MARKER_FUNCTION_FOR_PATCHING)
             28 LOAD_FAST                0 (c)
             31 LOAD_FAST                1 (i)
             34 CALL_FUNCTION            2 (2 positional, 0 keyword pair)
             37 STORE_FAST               0 (c)
             40 JUMP_ABSOLUTE           19
        >>   43 POP_BLOCK

  5     >>   44 LOAD_FAST                0 (c)
             47 RETURN_VALUE
~~~


Our strategy then will be to look for `LOAD_GLOBAL` with an argument equal to
the index of `MARKER_FUNCTION_FOR_PATCHING` in `co_names`. We'll replace that
with a `NOP` (opcode 9), keep the next two `LOAD_FAST`, replace the
`CALL_FUNCTION` with the appropriate inplace opcode, `NOP` out the two argument
bytes since inplace opcodes take no arguments, and we're done. Simple! Don't
worry, if you don't understand the code will make everything clear.

First, let's redefine our earlier opcode-finding function to find all indices of
an opcode.


~~~~{.python}
import opcode
def find_all_opcode_indexes(opcodes, op):
    i = 0
    while i < len(opcodes):
        if opcodes[i] == op:
            yield i
        if opcodes[i] < opcode.HAVE_ARGUMENT: #90, as mentioned earlier
            i += 1
        else:
            i += 3

Now we can start working on the actual function:

def patch_function(func, inplace_opcode, marker_name='MARKER_FUNCTION_FOR_PATCHING'):
    func_code = list(func.__code__.co_code)
    marker_coname_idx = func.__code__.co_names.index(marker_name) # co_names is a tuple of nonlocal name references
    load_global_marker_idx = 0
    for idx in find_all_opcode_indexes(func_code, opcode.opmap['LOAD_GLOBAL']):
        if func_code[idx + 1] + 256 * func_code[idx + 2] == marker_coname_idx:
            load_global_marker_idx = idx
            break
    print(load_global_marker_idx, func_code)
    ... # to be finished

>>> patch_function(slow_func_tbp, 0) # inplace opcode doesn't matter yet
25 [100, 1, 0, 125, 0, 0, 120, 35, 0, 116, 0, 0, 100, 2, 0, 131, 1, 0, 68, 93, 21, 0, 125, 1, 0, 116, 1, 0, 124, 0, 0, 124, 1, 0, 131, 2, 0, 125, 0, 0, 113, 19, 0, 87, 124, 0, 0, 83]

We can see that the function found index 25. The opcode there is LOAD_GLOBAL/116, and the argument formula calculates to 1, which is the correct co_names index (0 is range). Lets finish the function!

def patch_function(func, inplace_opcode, marker_name='MARKER_FUNCTION_FOR_PATCHING'):
    func_code = list(func.__code__.co_code)
    marker_coname_idx = func.__code__.co_names.index(marker_name) # co_names is a tuple of nonlocal name references
    load_global_marker_idx = 0
    for idx in find_all_opcode_indexes(func_code, opcode.opmap['LOAD_GLOBAL']):
        if func_code[idx + 1] + 256 * func_code[idx + 2] == marker_coname_idx:
            load_global_marker_idx = idx
            break
    # Do the actual patching
    cur_op = load_global_marker_idx
    func_code[cur_op + 0] = opcode.opmap['NOP'] # NOP out the LOAD_GLOBAL and its argument bytes
    func_code[cur_op + 1] = opcode.opmap['NOP']
    func_code[cur_op + 2] = opcode.opmap['NOP']
    cur_op += 3 # Skip over written NOPs
    cur_op += 6 # Skip over two LOAD_FAST opcodes we want to keep
    func_code[cur_op + 0] = opcode.opmap[inplace_opcode] # Replace CALL_FUNCTION with +=, -=, etc opcode
    func_code[cur_op + 1] = opcode.opmap['NOP'] # NOP out arguments because inplace opcodes don't take them
    func_code[cur_op + 2] = opcode.opmap['NOP']
    # Patching finished! That was easy. Now we just build a new code object like before.
    fco = func.__code__
    func.__code__ = type(fco)(
        fco.co_argcount,
        fco.co_kwonlyargcount,
        fco.co_nlocals,
        fco.co_stacksize,
        fco.co_flags,
        bytes(func_code),
        fco.co_consts,
        fco.co_names,
        fco.co_varnames,
        fco.co_filename,
        fco.co_name,
        fco.co_firstlineno,
        fco.co_lnotab,
        fco.co_freevars,
        fco.co_cellvars
    )
    # We're done!

That's all we need to do! Not as complicated as it seemed at first, huh? Lets patch slow_func_tbp and see if it works:

>>> patch_function(slow_func_tbp, 'INPLACE_ADD')
>>> slow_func_tbp()
4950

We can verify this by running the original slow_func:

>>> slow_func()
4950

They match! Lets look at the disassembly:

>>> dis.dis(slow_func_tbp)

  2           0 LOAD_CONST               1 (0)
              3 STORE_FAST               0 (c)

  3           6 SETUP_LOOP              35 (to 44)
              9 LOAD_GLOBAL              0 (range)
             12 LOAD_CONST               2 (100)
             15 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             18 GET_ITER
        >>   19 FOR_ITER                21 (to 43)
             22 STORE_FAST               1 (i)

  4          25 NOP
             26 NOP
             27 NOP
             28 LOAD_FAST                0 (c)
             31 LOAD_FAST                1 (i)
             34 INPLACE_ADD
             35 NOP
             36 NOP
             37 STORE_FAST               0 (c)
             40 JUMP_ABSOLUTE           19
        >>   43 POP_BLOCK

  5     >>   44 LOAD_FAST                0 (c)
             47 RETURN_VALUE

You can see our 3 NOPs, the INPLACE_ADD, and 2 more NOPs where the argument bytes used to be. And since we made the patcher a function, we can patch other functions as well!

Hopefully this served as a good introduction to patching Python bytecode. It's really not as scary or arcane as it seems at first sight (if you want scary/arcane, try parsing Java class files... shudder). I hope you have happy hacking with this technique, and the standard disclaimer applies: ONLY USE THIS IN PRODUCTION CODE IF YOU KNOW WHAT YOU'RE DOING!!! SEEING THIS WILL MAKE YOUR FELLOW PROGRAMMERS EXTREMELY UNHAPPY!!! As they say, always program like the person maintaining the code after you is a violent psychopath, and knows where you live.

Exercises for the reader:

If you find any errors in this article, please contact me. If you are reading this as an iPython notebook, my information can be found at jonathon- vogel.com/contact. Thanks for reading!

permalink for I've migrated my blog... again!

I've migrated my blog... again!

It seems I migrate my blog more than I update it, but I have once again. I got fed up with Heroku, and wanted a VPS for other projects anyways, so I bit the bullet, bought a VPS from Linode, and rewrote everything. I'm now using the amazing fabric library to statically generate my site and upload it to a VPS running nginx. I was even able to reuse most of my old code! Not that many people read this thing, but for the few who do, loading times shouldn't be > 10 seconds anymore (Heroku's free tier is seriously terrible for low-traffic sites). Besides that, everything should be basically the same, except that I also improved the RSS feed to actually validate.

permalink for Setting up PyOpenNI in Linux

Setting up PyOpenNI in Linux

I've had a Kinect for a while. I wrote a few interesting scripts with it, then forgot about it for a while. In between that time and now, I went through several OS upgrades that wiped away my changes. I decided I want to mess with it again, and had to reinstall OpenNI and the python extensions. This is a somewhat tricky process, so here's how it's done.

Step 1: Installing OpenNI, NITE, and the drivers

These are the drivers and middleware that will do all the heavy lifting for you. I grabbed them from the simple-openni project, which is a Kinect library for Processing. I don't care about processing, but they also conveniently bundle the libraries for you. Just grab the zipfile, and execute sudo ./install.sh in the OpenNI-Bin-Dev-Linux-[platform]-[version] folder, the NITE-Bin-Dev-Linux-[platform]-[version] folder, and the kinect/Sensor-Bin-Linux-[platform]-[version] folder.

Step 2: Add the library exports to your shell

The installation process generates a file with the location of the library files. You need to source this file for the installing of PyOpenNI. Simply run

source [folder you extracted openni in]/OpenNIDevEnvironment

However if it's not there, simply run

sudo updatedb && locate OpenNIDevEnvironment

Which should find the current location for your to source, though updatedb will take a few minutes to run generally.

Step 3: Build PyOpenNI

These instructions are on the Github, however I'll copy them here.

git clone https://github.com/jmendeth/PyOpenNI.git
mkdir PyOpenNI-build
cd PyOpenNI-build
cmake ../PyOpenNI
make
<copy to your python modules folder>

Step 4: Blacklist gspca_kinect

If you're using a newer Linux kernel (I'm using one of the 3.10 series), there is an existing module that only allows you to use the kinect as a generic webcam. This conflicts with OpenNI, so you need to blacklist it.

You can temporarily blacklist the module by running

sudo rmmod gspca_kinect

And permanently blacklist it by adding blacklist gspca_kinect to your blacklist file (google how to do it for your distribution).

Step 5: Not really a step

You can use OpenNI through Python now. But here's a bit of extra support code for controlling the motor and LED, which PyOpenNI doesn't do (requires pyusb):

import usb

LED_OFF              = 0
LED_GREEN            = 1
LED_RED              = 2
LED_YELLOW           = 3
LED_BLINK_YELLOW     = 4
LED_BLINK_GREEN      = 5
LED_BLINK_RED_YELLOW = 6

_dev = usb.core.find(idVendor=0x045e, idProduct=0x02B0) #The kinect motor device

def set_led_color(color):
    _dev.ctrl_transfer(0x40, 0x06, color, 0x0, [])

def set_motor_angle(degree):
    _dev.ctrl_transfer(0x40, 0x31, degree, 0x0, [])

def set_motor_up():
    set_motor_angle(60)

def set_motor_level():
    set_motor_angle(0)

def set_motor_down():
    set_motor_angle(-60)

Thanks for reading, hope this helps you!

permalink for Packaging Jython scripts in a single executable Jarfile

Packaging Jython scripts in a single executable Jarfile

There's a lot of outdated and just plain wrong information on the internet about packaging Jython scripts. After I wanted to start a project in Jython, I figured I should find a simple, easy way to package your app as an executable (ie, double-clickable) .jar. Turns out, it's a lot easier than it seems.

The only caveat to this process seems to be to use a relatively-recent Jython version. Some poking around suggests >2.5.1 should work, I used 2.5.3.

First, download jython-standalone, and copy it to your project directory as jython.jar. Next, create the file manifest.txt with the following content:

    Main-Class: org.python.util.JarRunner

Now, create __run__.py with the following content:

    import app

And create the folder app. You can place any additional startup code in either __run.py__ or app/__init__.py, along with placing your other code files in app. Now, code the same as you would any other Python project (along with all that Jython goodness). When you want to build your project, create a shell script with this:

#!/bin/sh
rm output.jar 2>/dev/null
cp jython.jar output.jar
zip -r output.jar app
zip output.jar __run__.py
jar ufm output.jar manifest.txt
rm jartmp*.tmp 2>/dev/null
echo "Done building!"

Which will package your app as output.jar. Now, you can run java -jar output.jar, or double-click output.jar. Enjoy!

permalink for I've migrated my blog!

I've migrated my blog!

I've migrated my blog to my own custom platform. I won't write about why I did it, since I don't have a specific technical reason Blogger failed me. I simply wanted a bit more control over the platform, wanted a personal site on my own server anyways, and had a bit of an itch to scratch. But instead, I'll go over some of my technical and design choices and why I made them.

Language / platform: Python 2 and Flask. Really, it came down to experience: while it's not the hippest, most async, most baddass platform, it's solid and works. Why Python 2? Again, just familiarity. Python 3 isn't a huge hurdle to me, but 2 is simply warmer and fuzzier.

Storage: The file system. All my posts live as un-rendered markdown in a subdirectory. To make a post, I create it, commit, and redeploy my heroku app. Any images go in static/post-name/, though that's just a convention.

Post format: Markdown. I spend a lot of time on Reddit, so I'm used to it. It also renders really fast, which is nice since I didn't want to deal with caching and just re-render the post on each request. I might set up some simple caching at some point (since I can just have a cached/ directory that is cleaned on each re-deploy), but for now I can live without it since my blog is low-traffic.

I'm using Pygments for highlighting, which I'm very happy with. I set up a little system where I leave an HTML comment denoting the language before a markdown code block, since I dislike blog systems that guess at the language and end up highlighting shell output as Java or something. I'm also using the wonderful Solarized theme from PyPy. I dislike Solarized in my editor (it's too blue!), but for reading code it's so nice.

I also designed the site myself, which is probably obvious from how hideous it is, but it has some charm to me in its bare-bones look. I'll probably mess around with it more later, and try and get Jinja to generate nicer-looking code, but overall I'm pretty happy with it, for my first real shot at web design. Credits to Subtle Patterns for the nice background.

As for comments, I've taken a lead from prog21 and "outsourced" them to other sites. If I feel that I want feedback on a post, I post it on Reddit. Otherwise, I let the internet stumble upon it if they need it.

Overall, I'm happy with the exercise. My blog has gone from an inscrutable Google system to ~100 lines of Python (including the home and contact pages), and a few Jinja templates that are basically plain HTML.

permalink for Cracking Word Searches with Haskell

Cracking Word Searches with Haskell

After learning Haskell a few months ago, I've had disappointingly few opportunities to use it. For a simple data-crunch a python repl starts up faster and lets me work quicker, and many of my other projects require a specific language. So when I was browsing reddit and saw /u/SWEAR_WORD_SEARCH, I thought it would be a fun, quick project to crack these in Haskell. (quick warning: if you didn't guess from the SWEAR_WORD part, this article contains some profane word searches ;-) The format of the word searches posted by this user are pretty simple. Here's an example:

N Z C G S M Z C M Q A C B O S W R E D V
X O K P G H N R E X T C V P E Z G W F V
W C D T N C P I Y F M E R J S Z W L E S
T R L K Z A S T M Z J V O N W F C Z X K
E S N N O H R W Y J I S Z A S M Y I S P
K T M W J R A D F W F M G W L I G O Q Q
C L D O L W G A M U V K G N I K C U F S
V X L K U O G H H I E Y Y Z C I S F B H
G J W B V U O R F H P E Y M N I I T O Q
G P V S N Y Y M R N H S A C J F J X H Z
T G J M F B R L E M H N R C S R I J M M
K V S A J V O N A W J E A F U U D W L B
P B H R H H Y A D B D W D Y G S W H A A
I Q I T M K I A Q I Y M W E A T U M J M
P P X W A T C H B H Y T Y Z D R S U Y U
V Q J N T L X L F V S D P G P A E Q E K
S Z N A E C Y N E J K H T W M T S P Z M
G Q M S E X T L R G Y G U B Y I J I A B
J C U D P E Z P M D W F Z G I N Z L L Y
P Y G X F W H P N T X R S I F G C C L T

Find the words: FRUSTRATING INCREDIBLY FUCKING WATCH SMART There's two parts here: the word search itself, and the word list we need to find. We could parse this with a combinator library like Parsec, but it's simple enough that lines should be enough. Indeed, after a bit of hacking in ghci and refactoring, I got

parsePuzzle :: String -> ([[Char]], [String])
parsePuzzle p = (parseLetters $ lines p, parseWords $ lines p)
    where dup f a = f a a
          parseLine = map head . words
          parseLetters = map parseLine . dup (take . (subtract 2) . length)
          parseWords = words . last

While the point-free style didn't work perfectly on parseLetters (I'm not a fan of Forth-like argument manipulation in Haskell, such as flip and dup), I think it's better than the alternative with names. Now we need to locate the words in the 2d letter matrix. This is surprisingly easy, using Data.List.transpose. We need to search through rows, columns, and diagonals. Rows are easy, they're simply the fst of the tuple we get from parsePuzzle. Columns are similar, simply use transpose on the rows. We need a little trick for the diagonals, though, however it's not as hard as it initially seems. First, we need a helper function mapIndex. This is similar to the function map, which you are hopefully familiar with, however it also provides the function with the index of the current item.

mapIndex :: (a -> Int -> b) -> [a] -> [b]
mapIndex f l = map (uncurry f) $ zip l [0..length l - 1]

(uncurry has signature (a -> b -> c) -> (a, b) -> c, and is used to unwrap the tuple from zip into two arguments) This might not seem particularly useful, but in fact we're only a step away from finding the diagonals now. By passing flip drop to mapIndex, and transposing the result, we get

ghci> putStrLn $ unlines p
NZCGSMZCMQACBOSWREDV
XOKPGHNREXTCVPEZGWFV
WCDTNCPIYFMERJSZWLES
TRLKZASTMZJVONWFCZXK
ESNNOHRWYJISZASMYISP
KTMWJRADFWFMGWLIGOQQ
CLDOLWGAMUVKGNIKCUFS
VXLKUOGHHIEYYZCISFBH
GJWBVUORFHPEYMNIITOQ
GPVSNYYMRNHSACJFJXHZ
TGJMFBRLEMHNRCSRIJMM
KVSAJVONAWJEAFUUDWLB
PBHRHHYADBDWDYGSWHAA
IQITMKIAQIYMWEATUMJM
PPXWATCHBHYTYZDRSUYU
VQJNTLXLFVSDPGPAEQEK
SZNAECYNEJKHTWMTSPZM
GQMSEXTLRGYGUBYIJIAB
JCUDPEZPMDWFZGINZLLY
PYGXFWHPNTXRSIFGCCLT

ghci> putStrLn $ unlines $ transpose $ mapIndex (flip drop) p
NODKORGHFNHEDEDASILT
ZKTZHAAHHHNAYAREPAY
CPNARDMIPSRFGTSQZB
GGCSWFUEEACUSUUEM
SHPTYWVYYCSUWMYK
MNIMJFKYMJRDHJU
ZRYZIMGZNFIWAM
CEFJSGNCIJJLA
MXMVZWIIIXMB
QTEOALKSTHM
ACRNSICFOZ
CVJWMGUBQ
BPSFYOFH
OEZCIQS
SZWZSQ
WGLXP
RWEK
EFS
DV
V

... the left to right diagonals! We can get the left to right diagonals easily as well, by using (map reverse p) instead of p in the right of the pipeline. Finally, to cover the bottom diagonals as well we need to reverse p before we process it, these are the prime versions. Now, we should easily be able to find if a word exists in the word search.

findWord :: String -> [[Char]] -> String
findWord s p = head $ filter (checkIn s) (rows p ++ cols p ++ dial p ++ diar p ++ dial' p ++ diar' p)
    where checkIn s line = s `isInfixOf` line || reverse s `isInfixOf` line
          rows p = p
          cols p = transpose p
          dial p = transpose $ mapIndex (flip drop) p
          diar p = transpose $ mapIndex (flip drop) (map reverse p)
          dial' p = transpose $ mapIndex (flip drop) (reverse p)
          diar' p = transpose $ mapIndex (flip drop) (map reverse (reverse p))




ghci> findWord "FUCKING" (fst $ parsePuzzle puzzle)
"CLDOLWGAMUVKGNIKCUFS"

ghci> take 7 $ drop 1 $ reverse "CLDOLWGAMUVKGNIKCUFS"
"FUCKING"

That's all well and good - we can see if a word appeared in the word search! But we really need to know what letter positions the word appears in, so we can do something special when printing those positions out. There's a few ways we could do this: a zipper, printing as we go, etc. However, the simplest solution is to simply keep position data with the characters. It will be preserved through the transformations, and at the end we can simply check to see if the characters we're printing are in those positions. First, let's add a type synonym.

type Letter = (Char, Int, Int)

and modify parsePuzzle to generate Letters instead of Chars:

parsePuzzle :: String -> ([[Letter]], [String])
parsePuzzle p = (parseLetters $ lines p, parseWords $ lines p)
    where dup f a = f a a
          parseLine chars y = mapIndex (\[c] x -> (c, x, y)) $ words chars
          parseLetters = mapIndex parseLine . dup (take . (subtract 2) . length)
          parseWords = words . last

Sadly the point-free style doesn't work for parseLine anymore, but luckily we had to make only one change to parseLetters, changing map to our mapIndex function from earlier. Finally, we need to change findWord to use Letter as well. This is pretty simple:

findWord :: String -> [[Letter]] -> [Letter]
findWord s p = head $ filter (checkIn s) (rows p ++ cols p ++ dial p ++ diar p ++ dial' p ++ diar' p)
        where lts = map (\(c, _, _) -> c)
              checkIn s line = s `isInfixOf` lts line || reverse s `isInfixOf` lts line
              rows p = p
              cols p = transpose p
              dial p = transpose $ mapIndex (flip drop) p
              diar p = transpose $ mapIndex (flip drop) (map reverse p)
              dial' p = transpose $ mapIndex (flip drop) (reverse p)
              diar' p = transpose $ mapIndex (flip drop) (map reverse (reverse p))



ghci> findWord "FUCKING" (fst $ parsePuzzle puzzle)
[('C',0,6),('L',1,6),('D',2,6),('O',3,6),('L',4,6),('W',5,6),('G',6,6),('A',7,6),('M',8,6),('U',9,6),('V',10,6),('K',11,6),('G',12,6),('N',13,6),('I',14,6),('K',15,6),('C',16,6),('U',17,6),('F',18,6),('S',19,6)]

It might not seem like much, but in fact, we're incredibly close to finished! First, let's collect the positions for every word into one list:

ghci> let (p, w) = parsePuzzle puzzle
ghci> nub $ foldl (\pos word -> findWord word p ++ pos) [] w
[('G',3,0),('P',3,1),('T',3,2),('K',3,3),('N',3,4),('W',3,5),('O',3,6),('K',3,7),('B',3,8),('S',3,9),('M',3,10),('A',3,11),('R',3,12),('T',3,13),('W',3,14),('N',3,15),('A',3,16),('S',3,17),('D',3,18),('X',3,19),('P',0,14),('P',1,14),('X',2,14),('A',4,14),('T',5,14),('C',6,14),('H',7,14),('B',8,14),('H',9,14),('Y',10,14),('T',11,14),('Y',12,14),('Z',13,14),('D',14,14),('R',15,14),('S',16,14),('U',17,14),('Y',18,14),('U',19,14),('C',0,6),('L',1,6),('D',2,6),('L',4,6),('W',5,6),('G',6,6),('A',7,6),('M',8,6),('U',9,6),('V',10,6),('K',11,6),('G',12,6),('N',13,6),('I',14,6),('K',15,6),('C',16,6),('U',17,6),('F',18,6),('S',19,6),('W',15,0),('Z',15,1),('Z',15,2),('F',15,3),('M',15,4),('I',15,5),('I',15,7),('I',15,8),('F',15,9),('R',15,10),('U',15,11),('S',15,12),('T',15,13),('A',15,15),('T',15,16),('I',15,17),('N',15,18),('G',15,19)]

(nub, from Data.List, removes duplicates from a list) Finally, we need to print out the puzzle. First, let's focus on just printing it without any decorations for where words are.

mapM_ (\ row -> do
         mapM_ (\ (c, _, _) -> putChar c) row
         putStrLn ""
      ) p

Expanding it to highlight the rows is easy enough as well.

main = do
    let (p, w) = parsePuzzle puzzle
        positions = nub $ foldl (\pos word -> findWord word p ++ pos) [] w
    mapM_ (\ row -> do
             mapM_ (\ l@(c, _, _) ->
                      if l `notElem` positions
                      then putChar c
                      else putStr ("\x1b[32m" ++ [c] ++ "\x1b[0m")
                   ) row
             putStrLn ""
          ) p

While using raw escape codes to color the output is a bit ugly, and doesn't work on Windows, it's better than pulling in a whole library just for this. This works OK:

Running the program in a terminal. It highlights the entire line the word is on, not just the word.

However, you might see a problem. Usually when solving a word search, you would only highlight the word, not the entire line. To fix this, we need to change findWord. Right now, findWord returns the entire line. This is a relic from before we returned positions. Now, since we return positions, there's enough context in the word itself. The current code of findWord looks like this:

findWord :: String -> [[Letter]] -> [Letter]
findWord s p = head $ filter (checkIn s) (rows p ++ cols p ++ dial p ++ diar p ++ dial' p ++ diar' p)
        where lts = map (\(c, _, _) -> c)
              checkIn s line = s `isInfixOf` lts line || reverse s `isInfixOf` lts line
              rows p = p
              cols p = transpose p
              dial p = transpose $ mapIndex (flip drop) p
              dial' p = transpose $ mapIndex (flip drop) (reverse p)
              diar p = transpose $ mapIndex (flip drop) (map reverse p)
              diar' p = transpose $ mapIndex (flip drop) (map reverse (reverse p))

There's a few places we could change the code to make this work. However, I'm going to add an additional step after we extract the filtered list's head. This is a bit inefficient, since it means we have to locate s/reverse s twice, however it keeps the code cleaner and simpler. To start out, we need a function, similar to isInfixOf, but that returns the position of the infix list. Sadly this doesn't exist in the standard library, but happily it exists on the Haskell wiki. Now that we have this function, we can modify findWord to work like we want.

findWord :: String -> [[Letter]] -> [Letter]
findWord s p = onlyString s $ head $ filter (checkIn s) (rows p ++ cols p ++ dial p ++ diar p ++ dial' p ++ diar' p)
    where lts = map (\(c, _, _) -> c)
          checkIn s line = s `isInfixOf` lts line || reverse s `isInfixOf` lts line
          indexOfR s line = fromJust $ findSublistIndex s (lts line) `orElse` findSublistIndex (reverse s) (lts line)
          onlyString s line = take (length s) $ drop (indexOfR s line) $ line
          rows p = p
          cols p = transpose p
          dial p = transpose $ mapIndex (flip drop) p
          dial' p = transpose $ mapIndex (flip drop) (reverse p)
          diar p = transpose $ mapIndex (flip drop) (map reverse p)
          diar' p = transpose $ mapIndex (flip drop) (map reverse (reverse p)

We added the functions onlyString, which returns only the range of letters in line corresponding to s, and indexOfR, which returns the index of either s or reverse s in line. It uses the function Data.Generics.Aliases.orElse, which returns the first or second Maybe, depending on which is not Nothing. We then unwrap that with fromJust, since we know that at least one will not be Nothing. This works a lot better, only highlighting the actual words:

Only highlighting the words now!

This about wraps everything up. Here's the full code to the project:

import Data.List (transpose, findIndex, isInfixOf, isPrefixOf, tails, nub)
import Data.Maybe (fromJust)
import Data.Generics.Aliases (orElse)

type Letter = (Char, Int, Int)

main = do
    let (p, w) = parsePuzzle puzzle
        positions = nub $ foldl (\pos word -> findWord word p ++ pos) [] w
    mapM_ (\ row -> do
             mapM_ (\ l@(c, _, _) ->
                      if l `notElem` positions
                      then putChar c
                      else putStr ("\x1b[32m" ++ [c] ++ "\x1b[0m")
                   ) row
             putStrLn ""
          ) p

findSublistIndex :: Eq a => [a] -> [a] -> Maybe Int
findSublistIndex xss xs = findIndex (isPrefixOf xss) $ tails xs

mapIndex :: (a -> Int -> b) -> [a] -> [b]
mapIndex f l = map (uncurry f) $ zip l [0..length l - 1]

findWord :: String -> [[Letter]] -> [Letter]
findWord s p = onlyString s $ head $ filter (checkIn s) (rows p ++ cols p ++ dial p ++ diar p ++ dial' p ++ diar' p)
    where lts = map (\(c, _, _) -> c)
          checkIn s line = s `isInfixOf` lts line || reverse s `isInfixOf` lts line
          indexOfR s line = fromJust $ findSublistIndex s (lts line) `orElse` findSublistIndex (reverse s) (lts line)
          onlyString s line = take (length s) $ drop (indexOfR s line) $ line
          rows p = p
          cols p = transpose p
          dial p = transpose $ mapIndex (flip drop) p
          dial' p = transpose $ mapIndex (flip drop) (reverse p)
          diar p = transpose $ mapIndex (flip drop) (map reverse p)
          diar' p = transpose $ mapIndex (flip drop) (map reverse (reverse p))

parsePuzzle :: String -> ([[Letter]], [String])
parsePuzzle p = (parseLetters $ lines p, parseWords $ lines p)
    where dup f a = f a a
          parseLine chars y = mapIndex (\[c] x -> (c, x, y)) $ words chars
          parseLetters = mapIndex parseLine . dup (take . (subtract 2) . length)
          parseWords = words . last

puzzle = "\
\N Z C G S M Z C M Q A C B O S W R E D V\n\
\X O K P G H N R E X T C V P E Z G W F V\n\
\W C D T N C P I Y F M E R J S Z W L E S\n\
\T R L K Z A S T M Z J V O N W F C Z X K\n\
\E S N N O H R W Y J I S Z A S M Y I S P\n\
\K T M W J R A D F W F M G W L I G O Q Q\n\
\C L D O L W G A M U V K G N I K C U F S\n\
\V X L K U O G H H I E Y Y Z C I S F B H\n\
\G J W B V U O R F H P E Y M N I I T O Q\n\
\G P V S N Y Y M R N H S A C J F J X H Z\n\
\T G J M F B R L E M H N R C S R I J M M\n\
\K V S A J V O N A W J E A F U U D W L B\n\
\P B H R H H Y A D B D W D Y G S W H A A\n\
\I Q I T M K I A Q I Y M W E A T U M J M\n\
\P P X W A T C H B H Y T Y Z D R S U Y U\n\
\V Q J N T L X L F V S D P G P A E Q E K\n\
\S Z N A E C Y N E J K H T W M T S P Z M\n\
\G Q M S E X T L R G Y G U B Y I J I A B\n\
\J C U D P E Z P M D W F Z G I N Z L L Y\n\
\P Y G X F W H P N T X R S I F G C C L T\n\
\Find the words:\n\
\FRUSTRATING INCREDIBLY FUCKING WATCH SMART"

Exercises for the reader:

permalink for Fixing network list duplication with nm-applet on Linux

Fixing network list duplication with nm-applet on Linux

I use XFCE in Ubuntu, and for a while used the default network applet built into the status bar. Eventually I wanted to get rid of it, since it had problems connecting to networks occasionally (I use a laptop and thus switch wifi networks often). Simply removing the status bar made XFCE replace it with nm-applet, which is much better. However, I was shocked to find over 500 network duplicates for one network (network-name, network-name 1, network-name 2, ..., network-name 538)! However, it's quite easy to delete this and other unused networks without manually clicking the name, then delete in nm-connection-editor.

Standard disclaimer: I'm not responsible if you fuck up your system blah blah blah

All the networks that show up when you hover over a name are stored in /etc/NetworkManager/system-connections/.

    ls -l /etc/NetworkManager/system-connections/
    total Too Damn Many
    -rw------- 1 root root 332 Oct 11 17:45 Bay_Breeze
    -rw------- 1 root root 333 Jul 27 00:17 [CENSORED, since it has my location] (repeated about 500 times)

To delete the duplicates, simply do sudo rm /etc/NetworkManager/system-connections/NAME\ * (replace NAME with the name of your duplicated network, obviously). If NAME has spaces in it, make sure to prefix the spaces with backslashes (e.g., My Network -> My\ Network). If you have two networks that aren't duplicates but have the same start, then one has a space followed (eg. My Network and My Network Repeater), temporarily move the second to your home directory or other safe place, then move it back afterwards. However, even if you do accidentally delete some networks you wanted to keep, the only thing that will happen is you'll have to type the password in again.

Explanation of sudo rm /etc/NetworkManager/system-connections/NAME\ *: * in a shell command expands to all possibilities. For example, in a directory with files file1, file2, file3, apple, file* would expand to file1 file2 file3, but not apple. In the same way, NAME\ * expands to all connections of NAME<space>thing, in this case thing being numbers. rm deletes every file passed to it, so the command will delete every duplicate network, but not the original network.

permalink for Ed: The standard EDitor

Ed: The standard EDitor

Since some people aren't getting it, the beginning of this article is a joke. What editor you use isn't important! As long as it's ed.

Getting started with Ed - the standard editor

Many modern programmers might laugh at ed. "What a quaint editor"! It doesn't have code completion! I can't load it with thousands of shitty plugins! It's name is only 2 letters! It's not written in hipster-script! Well, they're wrong. Ed is the standard text EDitor. It's on every Unix machine, ever! It doesn't waste your time with useless effects like cursors or showing you the current contents of the file. It's efficient! It's better! It's... ed!

Getting into Ed

Ed can seem intimidating at first, so I suggest you don't just start a blank file. Instead, create a file with an inferior editor - I refuse to use vi, and Emacs is banned from my machines, so I used echo.

echo 'ed-test\nHello world\n\tHi\nend.' > ed-test

You can cat the contents if you'd like to get a bearing on the file layout. This is important because unlike inferior editors, ed will not waste your time by showing you file contents. You have to ask for them. Otherwise, ed assumes you know what you're doing. You can now start ed. Run

ed ed-test

and you will be presented with the pinnacle of EDitor UI -

$ ed ed-test
29

Before doing anything else, press ^C. This will ensure you are in command mode.

$ ed ed-test
29
^C
?

For starters, you should attach training wheels. Like I have said, ed will not waste your time printing unneeded characters. In fact, I suggest typing a few things. Ed is in command mode at the moment, so you will not change the file. The only response you will get is ? (unless you stumble upon one of ed's single-letter commands). ? is ed telling you it doesn't understand your command. Lets make this a little more friendly. Input the commands H<newline> and P<newline>. H toggles on verbose error help messages, and P will give you a visual indicator (a * character) for when you are in command mode.

$ ed ed-test
29
^C
?
H
Interrupt
P
*

Now if you mess around, you should get better error messages, such as invalid address or invalid command suffix. This is very wasteful, so turn it off as you become more familiar with ed.

Viewing the File

Lets print the file! But first, we need an explanation of ed commands. Ed commands are single letters, however you generally will not type single letters in command mode. Instead, commands usually take 3 things: an address(range) first, then the command name, then a suffix. Both the address and suffix are optional, and the command will usually fill in useful defaults.

For example, to print the current buffer (the copy of the file in memory), type ,n

*,n
1 ed-test
2 Hello world
3  Hi
4 end.

This prints line numbers too. (,l only prints lines, though it's uglier. I recommend always using n). In this command, , was the address (range from first line to last line), and n was the command (print lines). There are many other addresses, and I won't list them all here. man ed has a list if you are curious, but some useful ones are:

You may have noticed the -- command in my command log. An address expression by itself acts as a command that moves the current line to the value of the expression and prints it.

Editing the file

Now, let's edit text! Editing text in ed is not done in command mode, but text (input) mode. There are several commands to put you in text mode, and ^C will bring you out. a is the simplest, so let's start with that. a accepts a line (not a range), and it will append text after that line.

*,n
1 ed-test
2 Hello world
3  Hi
4 end.
*2a
Appending!
^C
?
Interrupt
*,n
1 ed-test
2 Hello world
3 Appending!
4  Hi
5 end.

As you can see, we've added a line! However, our changes aren't saved yet. If you open another terminal, you will see the file is unchanged. We'll cover saving later.

Another text command is c (change). It accepts a range, deletes that range, and then appends where it used to be.

*,n
1 ed-test
2 Hello world
3 Appending!
4 Test1
5 Test2
6 Test3
7 No more testing!
8  Hi
9 end.
*4,6c
Removed tests.
^C
?
Interrupt
*,n
1 ed-test
2 Hello world
3 Appending!
4 Removed tests.
5 No more testing!
6  Hi
7 end.

c removed 4,6 and then started at 4.

d is much like c, but it doesn't allow input. It just deletes the lines, and sets the current line to either after the section if possible, or before the section if the section was at the end of the file.

*,n
1 ed-test
2 Hello world
3 Appending!
4 Removed tests.
5 No more testing!
6  Hi
7 end.
*5,6d
*,n
1 ed-test
2 Hello world
3 Appending!
4 Removed tests.
5 end.
*.
end.

Another very useful command is u (undo). It's what it sounds like. After running 5,6d, I ran u:

*u
*,n
1 ed-test
2 Hello world
3 Appending!
4 Removed tests.
5 No more testing!
6  Hi
7 end.

Like it never even happened! To redo, run u again. u only keeps one piece of history.

y and x can be used to yank and place. y takes a range to copy into the copy buffer, and x takes a line to place it at.

*,n
1 ed-test
2 Hello world
3 Appending!
4 Removed tests.
5 No more testing!
6  Hi
7 end.
*2,4y
*5x
*,n
1 ed-test
2 Hello world
3 Appending!
4 Removed tests.
5 No more testing!
6 Hello world
7 Appending!
8 Removed tests.
9  Hi
10 end.

Other bits

Ed has many other very powerful commands, such as G for editing lines matching a regular expression (and address modes that accept regular expressions). Read the man page to find out about them, sadly this tutorial doesn't have space for all the power of ed.

Saving a file is simple, simply type w. You can specify a range to w to only save part of a file. You can also suffix w with a file name to save to the non-default file (the default is either the file used in the original $ ed file command, or the first file saved to if ed was invoked without a file). w WILL CLOBBER EXISTING FILES WITHOUT ASKING! Use W to append to a file instead. If you're saving and quitting, wq can combine the two operations.

That's ed, the standard EDitor! More efficient than vi, has a decent text editor unlike Emacs! I suggest reading the man pages (man ed) to see all of ed's wonderful commands and useful shortcuts, like running shell commands within ed and using regular expressions (ed is the namesake of sed).

Good luck and godspeed with ED, the standard EDitor!

permalink for Installing and using FreeTTS in Java on Linux

Installing and using FreeTTS in Java on Linux

FreeTTS is a quite handy text-to-speech synthesizer that works in Java. It's also a bitch to install and use.

The first thing you'll want to do is download FreeTTS from it's site, http://sourceforge.net/projects/freetts/files/. This will get you a zip that looks like a jar but isn't. Extract the FreeTTS folder out somewhere (I did it to ~/Downloads/freetts).

Now the first thing you need to do is rip out the jsapi.jar file. For some reason, this is buried in what is probably the most annoying license file known to man, which you have to page through with less, but if you hit enter at the end instead of y (because you weren't reading it, of course) you get kicked out and have to start over. However, Linux has a very handy program called yes that will spam y for you.

First run chmod +x <freettsfolder>/lib/jsapi.sh to make it an executable script. Then run yes | <freettsfolder>/lib/jsapi.sh and page through with enter. When you get to the license agreement yes should spam in a y before you hit enter and the script will extract it's payload.

Now jsapi is in whatever directory your shell is in. Move it into <freettsfolder>/lib so it's with it's friends. Since I'm an Eclipse user, I then added this to the build path for a project (right click -> properties -> Java Build Path -> libraries tab -> add external jars) and created a new class.

You'll need to import com.sun.freetts.*, then set a system property with

System.setProperty("freetts.voices", "com.sun.speech.freetts.en.us.cmu_us_kal.KevinVoiceDirectory");

to use the Kevin voice. Presumably you can set other voices, which you could probably find by decompiling the cmu_us_kal jar that was in lib, but the Kevin voice is not bad.

Finally, you can speak a sentence with the following:

Voice kevin = VoiceManager.getInstance().getVoice("kevin16");
kevin.allocate();
kevin.speak("You may be wondering why I have gathered you here tonight...");

And you should get glorious machine-generated sound pouring from your speakers. I did not run into the Linux/java sound bug that I found mentioned on StackOverflow while trying to get this to work, but the newest version of FreeTTS is more than a year younger than those issues so I figure it was fixed.

Happy sound saying!

permalink for So, started a blog

So, started a blog

Since proggit doesn't allow self posts I can stick my thoughts here and link them instead.