Only 10 Words Make Up 25% Of The English Language
Ten words, "the," "be," "to," "of," "and," "a," "in," "that," "have," and "I," listed here in order of frequency, make up around 25% of the recorded English language, according to an ambitious project at Oxford University.
The project, called the Oxford English Corpus, is a growing database of examples from 21st-century English, ranging from literature and scientific journals to emails. The Corpus contains more than two billion instances of words, called "tokens."
"A type is a unique string of letters, regardless of how often it is used. A token is a single occurrence of a type. The sentence 'the cat sat on the mat' contains six tokens but five types, because there are two occurrences of the type 'the,'" Professor Patrick Hanks, former editor of English dictionaries at the Oxford University Press, told Business Insider.
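Hanks's distinction is easy to check by hand, and a few lines of Python make it concrete. This is only a minimal sketch: the function name `tokens_and_types` and the simple rule for splitting words (lowercase letters and apostrophes) are assumptions for illustration, not how the Oxford English Corpus processes text.

```python
import re

def tokens_and_types(text):
    # Tokens: every occurrence of a word, lowercased so "The" and "the" match.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Types: the distinct words, however often each one occurs.
    types = set(tokens)
    return tokens, types

tokens, types = tokens_and_types("the cat sat on the mat")
print(len(tokens), "tokens,", len(types), "types")  # 6 tokens, 5 types
```

The set drops the second "the," which is why the sentence has one fewer type than it has tokens.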
The Corpus hit two billion tokens in 2010. Lexicographers then deduced the ten words that appear the most. But a Harvard professor named George Kingsley Zipf had already predicted the result back in 1935.
"The weak version of Zipf's Law says that words are not evenly distributed across texts; instead, there are a few words that are very common and a very large number of words that are very rare. And there is a neat curve linking the two extremes. Useful words such as 'useful' and 'curve' are quite low on the curve; boring words like 'thing,' 'go,' 'say,' 'give' and 'take' are quite high on the curve," Hanks said.
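The curve Hanks describes can be sketched by ranking words from most to least frequent. A commonly cited stronger form of Zipf's Law, which goes beyond what Hanks says here, holds that a word's frequency is roughly the top word's frequency divided by its rank. The Python below is a rough illustration under that assumption; the function name `rank_frequency` and the toy sentence are made up for the example, and a text this short is far too small to show the law properly.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count, zipf_estimate) rows, most frequent first."""
    counts = Counter(tokens).most_common()
    top_count = counts[0][1]
    rows = []
    for rank, (word, count) in enumerate(counts, start=1):
        # Idealized estimate (an assumption): the top count divided by the rank.
        rows.append((rank, word, count, top_count / rank))
    return rows

# Toy text, invented for illustration only.
sample = "the cat sat on the mat and the dog sat on the rug".split()
rows = rank_frequency(sample)
for row in rows:
    print(row)
```

On a large corpus, the actual counts and the estimates in the last column tend to track each other closely for the most common words, which is the "neat curve" in practice.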
Hanks doesn't mean "neat" as an outdated form of "cool," either. He means orderly, organized, statistically beautiful.
The ten aforementioned words make up about 25% of our language. Going further, the top 100 words make up about 50% of our language, while the top 50,000 words make up 95%. To account for the last 5%, we need a vocabulary of more than a million words.
To test the theory, I counted the number of times each of the ten words appears in this article: 98 out of 391 words. Thus, "the," "be," "to," "of," "and," "a," "in," "that," "have," and "I" make up about 25.06% of this article. Right on the money.
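The same check can be run on any text. The sketch below is a rough approximation, not the Corpus's method: it assumes "be" and "have" stand for all their inflected forms (so "was" counts as "be"), and the `LEMMAS` table, the `top_ten_share` function, and the sample sentence are all invented for illustration. It also leaves contractions such as "doesn't" as single words and does not treat "an" as "a."

```python
import re

# Assumed mapping from inflected forms to the dictionary forms "be" and "have".
LEMMAS = {
    "is": "be", "am": "be", "are": "be", "was": "be", "were": "be",
    "been": "be", "being": "be",
    "has": "have", "had": "have", "having": "have",
}
TOP_TEN = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "i"}

def top_ten_share(text):
    """Return (hits, total): top-ten occurrences and the overall word count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for t in tokens if LEMMAS.get(t, t) in TOP_TEN)
    return hits, len(tokens)

hits, total = top_ten_share("The cat sat on the mat, and I was in the hall.")
print(hits, "of", total, "words")

# The article's own tally: 98 of 391 words.
print(round(100 * 98 / 391, 2))  # 25.06
```

A single short sentence can land well above or below 25%; the figure only settles down over longer stretches of text.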
If we consider "content words" (words with tangible meaning) instead of "function words," the top ten list changes to include: "time," "person," "year," "way," "day," "thing," "man," "world," "life," and "hand."