How we create language content

All of Oxford Languages’ content aims to describe, rather than prescribe, the way languages are used by people around the world. We take an evidence-based approach to language content creation, looking at real examples of the ways words are used in context to provide an accurate picture of a language.


To gather this evidence, we have a world-class language research programme that utilizes big data technologies to continuously monitor language development in real time.


Our corpora – massive collections of spoken and written language data – track and record the very latest language developments across an enormous variety of publications, covering everything from specialist journals to newspapers to social media posts.


We have large corpora in English, Arabic, and Indonesian, with many other languages in development. These corpora enable the lexicographers and language technologists who create our dictionaries, datasets, and language resources to identify new and emerging words in context and to spot trends and patterns in usage, spelling, regional varieties, and more.


Another important source in our language research programme is language users themselves. We run reading programmes and appeals, workshops and classroom sessions, talking directly to a wide range of communities to record and reflect their language accurately and meaningfully.


This is especially important for languages that are not well represented online and so cannot yet be accurately tracked through our corpus technology. Working with speakers to document the varieties and dialects of their language enables us to create new resources and build new corpora, ensuring that these communities benefit from digital access and representation.


Our expert team of lexicographers source all of our descriptive sentence examples from our vast language databases to provide accurate and meaningful descriptions of words in use. The team analyses the corpus data to select examples that support a word in the correct grammatical and semantic context without distracting from the essential information the definition conveys.


We do our best to eliminate sentence examples that repeat factually incorrect, prejudiced, or offensive statements from the source. We are always grateful when readers inform us of cases that do not meet our rigorous quality standards – whether due to human error or changing cultural sensitivities – so that we can review and update our content.


All of our content, sourced via corpora and community initiatives alike, is the result of continuous research and review as we seek to document and describe new language developments as they unfold, providing the world’s most trusted language content.