In addition, we use the following sources:
- Published SAT and GRE exams. We look for words that have similar characteristics to words in published exams.
- The BNC (British National Corpus) and Google Web Corpus.
- The Living Word Vocabulary, an extensive 30-year study on words known by grade level.
- The Academic Word List (AWL).
- Published fiction and non-fiction book lists for the Common Core.
From various corpora we extract a difficulty metric based on frequency (how many times a word appears in print), dispersion (how likely it is that a word appears across different subject matters), and range (how many subsections of the corpus the word shows up in). This metric acts as the base for determining difficulty. When we have to choose between two words and can't include both, we prefer the one with the larger dispersion. For example, psychosis and morsel have the same frequency in the BNC, but morsel has twice the dispersion. We then modulate this data with difficulty metrics from The Living Word Vocabulary study.
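To make the three measures concrete, here is a minimal sketch of how frequency, range, and dispersion could be computed over a corpus split into equal-sized subsections. It is an illustration, not our production method: the dispersion formula used is Juilland's D, chosen here as one common option, and the function names (`word_stats`, `prefer`) and the toy corpus are invented for the example.

```python
from collections import Counter
from math import sqrt
from statistics import mean, pstdev

# Made-up toy corpus: four equal-sized "subsections" (not real BNC data).
TOY_SECTIONS = [
    ["psychosis", "psychosis", "psychosis", "psychosis", "morsel"],
    ["morsel", "soup", "clinic", "ward", "bread"],
    ["morsel", "tale", "king", "feast", "crumb"],
    ["morsel", "bird", "nest", "seed", "twig"],
]

def word_stats(sections, word):
    """Return (frequency, range, dispersion) for `word`.

    Dispersion is Juilland's D (an assumption for this sketch), which
    expects sections of roughly equal size: 1.0 means perfectly even
    spread, 0.0 means every occurrence sits in a single section.
    """
    counts = [Counter(tokens)[word] for tokens in sections]
    frequency = sum(counts)                       # total occurrences
    word_range = sum(1 for c in counts if c > 0)  # sections containing the word
    if frequency == 0 or len(counts) < 2:
        return frequency, word_range, 0.0
    variation = pstdev(counts) / mean(counts)     # coefficient of variation
    dispersion = 1 - variation / sqrt(len(counts) - 1)
    return frequency, word_range, dispersion

def prefer(sections, word_a, word_b):
    """Pick the word with the larger dispersion."""
    return max((word_a, word_b), key=lambda w: word_stats(sections, w)[2])

if __name__ == "__main__":
    for w in ("psychosis", "morsel"):
        print(w, word_stats(TOY_SECTIONS, w))
    print("preferred:", prefer(TOY_SECTIONS, "psychosis", "morsel"))
```

In this toy corpus both words occur four times, but morsel appears in every subsection while psychosis is confined to one, so morsel is the word the sketch prefers.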
It's relatively easy to create very large word lists, but difficult to create small, relevant ones.
A note to our SAT and GRE test takers: these exams have no complete master list. Guides such as 500 Words You Should Know for the GRE and The 100 Most Common SAT Words are just marketing gimmicks employed by test prep companies.