Saturday, September 4, 2010

Lexponential: Little Brother

Cory Doctorow is great: he puts his books online under a Creative Commons license, so they're especially easy to run through the Lexponential software. A few days ago I ran Little Brother through Lexi, and here are a few interesting bits:
  • There were approximately 107,000 words in the novel, and just under 10,000 different words.
  • 5,054 of those different words were used only once
  • 90% of the words in the novel come from a list only 2,437 items long
  • 70% of the words in the novel come from a list only 352 items long
  • 50% of the words in the novel come from a list only 77 items long
  • the 12 most often used words in the novel make up 25% of all the words it contains
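
For anyone who wants to try numbers like these on another book, here is a rough sketch of how they could be computed. This is not the actual Lexi code; the function name `coverage_stats` and the choices to lowercase everything and leave punctuation attached to words are my own assumptions, so the exact figures will shift depending on how you clean the text.

```python
from collections import Counter

def coverage_stats(text, targets=(0.25, 0.5, 0.7, 0.9)):
    """Hypothetical helper: word totals plus how long a list of the most
    common words has to be to cover each target share of the text."""
    # Assumption: words are whatever sits between whitespace, lowercased.
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)

    # Word frequencies from most common to least common.
    ranked = sorted(counts.values(), reverse=True)

    list_sizes = {}
    running = 0
    remaining = sorted(targets)
    for size, count in enumerate(ranked, start=1):
        running += count
        # Record the list length at which each target share is first reached.
        while remaining and running >= remaining[0] * total:
            list_sizes[remaining.pop(0)] = size

    return {
        "total_words": total,
        "different_words": len(counts),
        "used_once": sum(1 for c in ranked if c == 1),
        "list_sizes": list_sizes,
    }
```

Calling `coverage_stats(open("book.txt", encoding="utf-8").read())` on a plain-text copy of a book (the file name is just a placeholder) gives back the same kinds of figures as the list above.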

Wednesday, September 1, 2010


When I started the job of learning Arabic I asked one of my instructors if there was a way to differentiate between the more and less useful words we were picking up.

He said no. I said he was wrong, and since then I've been trying to find a way to prove it.

There was a linguist named George Kingsley Zipf who discovered a pattern in language. Zipf counted how many times each word was used in large samples of text and found a power law (later word counts of the Brown Corpus show the same thing), which means the most common word - in English, for example - is used about twice as much as the second most common word, which is used about twice as much as the fourth, which is about twice as common as the eighth, and so on.

For English, this means that the 135 most common words constitute HALF of all words written in the language (spoken language is different, and even more dramatic).
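
The halving pattern in Zipf's law boils down to "expected count is roughly the top word's count divided by rank." Here is a tiny sketch of that idealized rule; the function name `zipf_expected_count` and the starting count of 1000 are made up for illustration, and real texts only follow the rule approximately.

```python
def zipf_expected_count(top_count, rank):
    """Idealized Zipf's law: a word's count is the top count divided by its rank."""
    return top_count / rank

# Each doubling of the rank halves the expected count.
for rank in (1, 2, 4, 8):
    print(rank, zipf_expected_count(1000, rank))
```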

I'm kind of ashamed it took me so long to do this.

I finally got together with a programmer friend, and we hashed out a plan for some software that would do the kind of word counts we needed to see the pattern.

It works with any sample you feed in, as long as the words are separated by spaces (I've heard Mandarin doesn't space between words, so no luck there). This means that for English, and Russian, and Spanish, and Arabic, and Farsi, and Hebrew, and whatever else you want to feed in, the program will tell you how many times each individual word is used.
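
The core of the idea fits in a few lines. This is only a sketch of the approach, not the program my friend and I wrote; `count_words` is a name I made up, and it assumes the text is already a string with words separated by whitespace.

```python
from collections import Counter

def count_words(text):
    """Count how many times each space-separated word appears."""
    # str.split() with no arguments splits on any run of whitespace,
    # so it works for any language that puts spaces between words.
    return Counter(text.split())

counts = count_words("el gato y el perro y el pez")
# Show the two most frequent words with their counts.
for word, n in counts.most_common(2):
    print(word, n)
```

Nothing in it is specific to English or Spanish: hand it Russian or Arabic text and it counts those words the same way.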

Which means that when you're learning a new language - Mr. Language-Instructor-Man - not only are some words more important than others, now we can know exactly how much more important.