Wednesday, September 1, 2010

Lexponential

When I started the job of learning Arabic I asked one of my instructors if there was a way to differentiate between the more and less useful words we were picking up.

He said no. I said he was wrong, and since then I've been trying to find a way to prove it.

There was a linguist named George Kingsley Zipf who discovered a pattern in language. Zipf counted how many times each word in a text was used and found a power law (the pattern is now famously visible in the Brown Corpus): the most common word - in English, for example - is used twice as much as the second most common word, which is used twice as much as the fourth, which is twice as common as the eighth, and so on.
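To make that concrete, here's a tiny sketch of the arithmetic, assuming an idealized Zipf distribution where the word at rank r shows up in proportion to 1/r (the count for the top word is just a made-up number for illustration):

from math import isclose

C = 1000  # hypothetical count for the most common word

# Under f(r) = C / r, doubling the rank halves the frequency.
for rank in [1, 2, 4, 8]:
    print(f"rank {rank}: ~{C / rank:.0f} occurrences")

# rank 1: ~1000 occurrences
# rank 2: ~500 occurrences
# rank 4: ~250 occurrences
# rank 8: ~125 occurrences
assert isclose(C / 2, 2 * (C / 4))  # rank 2 is twice as common as rank 4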

For English, this means that the 135 most common words constitute HALF of all words written in the language (spoken language is different, and the skew is even more dramatic).

I'm kind of ashamed it took me so long to do this.

I finally got together with a programmer friend, and we hashed out a plan for some software that would do the kind of word counts we needed to see the pattern.

It works with any sample you feed in, as long as the words are separated by spaces (I've heard Mandarin doesn't put spaces between words, so no luck there). This means that for English, and Russian, and Spanish, and Arabic, and Farsi, and Hebrew, and whatever else you want to feed in, the program will tell you how many times each individual word is used.
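For anyone curious, the core of such a counter is short enough to sketch here. This isn't our actual code, just a minimal Python illustration, and the filename is a placeholder:

from collections import Counter

def count_words(path):
    """Count how often each space-separated word appears in a text file."""
    with open(path, encoding="utf-8") as f:
        # Lowercase so "The" and "the" count as one word; split on whitespace.
        words = f.read().lower().split()
    return Counter(words)

if __name__ == "__main__":
    counts = count_words("sample.txt")  # placeholder filename
    total = sum(counts.values())
    # Print the most common words with their share of the whole text.
    for word, n in counts.most_common(20):
        print(f"{word}\t{n}\t{n / total:.1%}")

Because it splits on whitespace and reads UTF-8, the same few lines work on Arabic or Hebrew text just as well as English.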

Which means that when you're learning a new language - Mr. Language-Instructor-Man - not only are some words more important than others, but now we can know exactly how much more important.
