We put great effort into making our words relevant. The Membean word database is built through a rigorous process that draws on multiple corpora, including The College Board Vocabulary Study that created the Breland Corpus.
In addition, we use the following sources:
- Published SAT and GRE exams. We look for words that share characteristics with the words in published exams.
- The BNC (British National Corpus) and Google Web Corpus
- The Living Word Vocabulary, an extensive 30-year study on words known by grade level
- The Academic Word List (AWL)
- Published fiction and non-fiction book lists for the Common Core
Membean and Grade Levels
Word lists do not follow grade levels. From various corpora we extract a difficulty metric that is based on the following:
- Frequency (how many times a word appears in print)
- Dispersion (how likely it is that a word appears across different subject matters)
- Range (how many subsections of the corpus the word shows up in)
These three measures form the basis for determining difficulty.
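As a rough illustration of the three measures, the sketch below computes them for a single word from its counts in each section of a corpus. It is not Membean's actual method: the function name `word_metrics`, the input format, and the counts are invented for the example, and Juilland's D (which assumes equally sized sections) stands in for dispersion because the text does not name a formula.

```python
from math import sqrt
from statistics import mean, pstdev


def word_metrics(counts_by_section):
    """Return (frequency, range, dispersion) for one word.

    counts_by_section holds the word's occurrences in each equally
    sized section of a corpus (a hypothetical input format).
    """
    n = len(counts_by_section)
    frequency = sum(counts_by_section)                        # total occurrences
    word_range = sum(1 for c in counts_by_section if c > 0)   # sections containing the word
    if frequency == 0 or n < 2:
        return frequency, word_range, 0.0
    # Juilland's D (an assumed choice of formula):
    # 1 - coefficient of variation / sqrt(n - 1).
    # Close to 1 means evenly spread; close to 0 means clustered.
    cv = pstdev(counts_by_section) / mean(counts_by_section)
    dispersion = 1 - cv / sqrt(n - 1)
    return frequency, word_range, dispersion


# Invented counts: both words occur 20 times in a four-section corpus.
even = word_metrics([5, 5, 5, 5])        # spread evenly across sections
clustered = word_metrics([20, 0, 0, 0])  # confined to one section

print(even)
print(clustered)
```

Both example words have the same frequency, but the evenly spread one has a range of 4 and a dispersion of 1.0, while the clustered one has a range of 1 and a dispersion near 0.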
When we have to choose between two words and can't include both, we prefer the one with greater dispersion. For example, psychosis and morsel have the same frequency in the BNC, but morsel has twice the dispersion.
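The preference described above amounts to a simple tie-break, sketched here. The function name `pick_word` and the numbers are placeholders chosen only to mirror the example (equal frequency, morsel with twice the dispersion); they are not real BNC figures.

```python
def pick_word(candidates):
    """Return the candidate with the greatest dispersion."""
    return max(candidates, key=lambda w: w["dispersion"])


# Placeholder values, not real BNC data.
candidates = [
    {"word": "psychosis", "frequency": 100, "dispersion": 0.4},
    {"word": "morsel", "frequency": 100, "dispersion": 0.8},
]

chosen = pick_word(candidates)
print(chosen["word"])  # morsel
```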
We then adjust these measures using difficulty metrics from The Living Word Vocabulary study. Creating a very large word list is relatively easy; creating a small but relevant one is much harder.