Saturday, January 22, 2011

Google Books

by Richard Crews
Google has undertaken to electronically scan everything ever written in any language. So far its project, Google Books, has gathered two trillion words from 15 million books. That represents about 12% of all the books published in any language since the Gutenberg Bible in the 1450s.

Although copyright disputes have limited Google's ability to make some recent works available in readable form, the company has produced a giant database of words and word patterns for analysis. Some interesting findings are:

The English language now has a vocabulary of one million words.

More than 50% of these--after excluding proper names--appear in no dictionary. (Yep, there are over 500,000 words in use in English that have never been picked up by lexicographers.)

Historical periods of repression or censorship of certain authors, ideas, and fields are readily trackable--and some new ones previously unknown to historians have been discovered.

[Science, 17 Dec 2010, page 1600]