American National Corpus

Many Unix users are familiar with the system-wide English dictionary, known as /usr/dict/words. Nearly every word in /usr/dict/words has been registered by domain squatters, and most variants of the words are already in the brains of every automated password cracker on the planet.

This wordlist is a handy source for research on frequency tables and cryptanalysis in security work, and for wonder
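As a rough sketch of the frequency-table idea, here's one way to count how often each letter shows up in a wordlist. The function names `letter_frequencies` and `load_words` are made up for illustration, and the default path is an assumption: many modern systems keep the file at /usr/share/dict/words instead, so adjust as needed.

```python
from collections import Counter
from pathlib import Path


def letter_frequencies(words):
    """Return a dict mapping each letter to its relative frequency across words."""
    counts = Counter()
    for word in words:
        # Count only alphabetic characters, lowercased, so punctuation
        # such as the apostrophe in "Apple's" is ignored.
        counts.update(ch for ch in word.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders letters from most to least frequent.
    return {letter: n / total for letter, n in counts.most_common()}


def load_words(path="/usr/dict/words"):
    """Read whitespace-separated words; the default path is an assumption."""
    return Path(path).read_text(encoding="utf-8", errors="replace").split()


if __name__ == "__main__":
    # Small in-memory sample so the sketch runs without the dictionary file.
    sample = ["apple", "banana"]
    for letter, freq in letter_frequencies(sample).items():
        print(f"{letter} {freq:.3f}")
```

To run it against the real dictionary, pass `load_words()` (or `load_words("/usr/share/dict/words")`) to `letter_frequencies` instead of the sample list.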

I’m unclear as to the origins of the dictionary (it possibly came from the spell or ispell utilities years ago), but I do know that it’s riddled with inaccuracies. Basically, it would fail a spelling bee.

Today I found two projects to create a better English wordlist, and I can only imagine what will happen when domain squatters find it. The plan is to have a list of a hundred million words, both spoken and written.

There’s the American National Corpus and the British National Corpus. Both contain enough words to keep password cracking software and domain squatters busy for years.

At last count, the ANC held 18 million words. I guess they have a long way to go.

