Mark Davies
Professor of Linguistics
Brigham Young University


I am a professor of Linguistics at Brigham Young University in Provo, Utah, USA.  My primary areas of research are corpus linguistics, language change and genre-based variation, the design and optimization of linguistic databases, and frequency and collocational analyses (all for English, Spanish, and Portuguese).

Please feel free to take a look at my CV, or try out some of the corpora that I've created, which are probably the most widely used corpora in existence. 


I received a B.A. from Brigham Young University in 1986 with a double major in Linguistics and Spanish, which was followed by an M.A. in Spanish Linguistics from BYU in 1989. I then received a PhD from the University of Texas at Austin in 1992, with a specialization in "Ibero-Romance Linguistics".

Research (see CV)

While I was a professor of Spanish at Illinois State University (1992-2003), most of my publications dealt with historical and genre-based variation in Spanish and Portuguese syntax. Since coming to BYU in 2003, however, my research has dealt primarily with general issues in corpus design, creation, and use (especially with regard to English), as well as word frequency. Overall, I have published six books and more than seventy articles, and I have given numerous (invited) presentations at international conferences. One indication of the impact of this research is that I was invited to write the introductory chapter for the recent Cambridge Handbook of English Corpus Linguistics (2015).


In the 2015-16 academic year, I was given the Karl G. Maeser Research and Creative Arts Award at BYU, which recognizes achievements in research. This award is given each year to only two or three people from the 1,500+ full-time faculty members at BYU, and it had not been given to anyone else in the College of Humanities since 2007. I have also received several awards from the College of Humanities at BYU (approx 200 faculty members), including the Barker lectureship, a two-year College Professorship, and two terms (two years + three years) as a Fellow for the Humanities Center.


I have received six large federal grants to create and analyze corpora. These include four from the National Endowment for the Humanities: 2001-02 (to create a large corpus of historical Spanish), 2004-2006 (to create a large corpus of historical Portuguese, with Michael Ferreira), 2009-2011 (to create a large corpus of historical English), and 2015-2017 (to enlarge the Spanish and Portuguese corpora). The two grants from the National Science Foundation were in 2002-2004 (to examine genre-based variation in Spanish, with Douglas Biber) and 2013-2016 (to examine "web-genres", with Douglas Biber and Jesse Egbert). In addition to these six US-based grants, I have had a large subcontract for a grant from the UK Arts and Humanities Research Council (2014-2016; to create the architecture and web interface for large semantically-tagged corpora). I am also a co-PI for a grant from the Korea Research Foundation (2014-2017, with Jong-Bok Kim) to examine three related syntactic constructions in English from a corpus-based perspective. See below for more information on these projects.


In Fall 2015 I'm teaching Corpus Linguistics (LING 485R) and ELang 495, dealing with "A Corpus-based Approach to the History of American English". Other recent classes with publicly-available websites include Historical and Comparative Linguistics (LING 450) and English Grammar (ELang 325).

Billion word extensions to the Spanish and Portuguese corpora

In early 2015 I was awarded (see p37) a two-year grant from the US National Endowment for the Humanities to create much larger, updated versions of the Corpus del Español and the Corpus do Português. The Corpus del Español will be 100 times as large as before (two billion words, compared to 20 million words for the 1900s) and the Corpus do Português will be 50 times as large as before (one billion words, compared to 20 million words for the 1900s). In addition, each corpus will allow users to see the frequency by country, as is already possible for English with the GloWbE corpus.

Hansard Corpus (British Parliament)

Part of the SAMUELS project and funded by the AHRC (UK). This corpus contains 1.6 billion words in 7.6 million speeches in the British Parliament from 1803-2005. A unique feature of the corpus is that it is semantically tagged, which allows for powerful meaning-based searches. In addition, users can create "virtual corpora" by speaker, time period, House of Parliament, and party in power, and compare across these corpora. The end result is a corpus that will be of value not only to linguists (as the largest structured corpus of historical British English from the 1800s-1900s) but also to historians, political scientists, and others.

Wikipedia Corpus

In early 2015 we released a new corpus based on 1.9 billion words in 4.4 million articles from Wikipedia. You can quickly and easily create "virtual corpora" from the 4.4 million web pages (e.g. electrical engineering, investments, or basketball), and then search just that corpus, or create keyword lists based on that virtual corpus. If you want to create a customized corpus for a particular topic, but don't want to have the hassle of collecting all of the texts yourself, this should be a very useful corpus.

Downloadable full-text corpus data

You can now also download all of the texts for our three largest corpora to your computer: COCA (180,000 texts; 440 million words of text), COHA (385 million words; 115,000 texts), or GloWbE (1.8 million texts, 1.8 billion words of text).  With this data on your own computer, you can do many things that would be difficult or impossible via the regular web interface, such as sentiment analysis, topic modeling, named entity recognition, advanced regex searches, creating treebanks, and creating your own word frequency, collocates, and n-grams lists.
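As a rough illustration of the last item in that list, here is a minimal Python sketch of building word frequency, n-gram, and collocate counts from plain text. It is only a sketch under stated assumptions: the tokenizer, the function names (`tokenize`, `ngram_counts`, `collocates`), and the sample sentence are all hypothetical, and the code makes no claim about the actual file format of the downloadable COCA, COHA, or GloWbE data.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens to the left or right of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


# Hypothetical sample text standing in for one downloaded corpus file.
sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)

word_freq = Counter(tokens)                    # word frequency list
bigrams = ngram_counts(tokens, 2)              # 2-grams
near_cat = collocates(tokens, "cat", span=2)   # collocates of "cat"

print(word_freq.most_common(3))
print(bigrams.most_common(3))
print(near_cat.most_common(3))
```

For a real corpus you would read the tokens from the downloaded files instead of a sample string, and you would probably want a proper tokenizer and lemmatizer rather than the simple regular expression assumed here.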

Our Academic Vocabulary List of English improves substantially on the AWL created by Coxhead (2000). Most of this data is also integrated into the WordAndPhrase (Academic) site, so that you can see a wealth of information about each word. See the Applied Linguistics article.

GloWbE: Corpus of Global Web-Based English

1.9 billion word corpus from 1.8 million web pages in 20 different English-speaking countries. In addition to being very large (20 times as big as the BNC), this corpus also allows you to carry out powerful searches to compare the English dialects and see the frequency of words, phrases, grammatical constructions, and meaning in these twenty different countries.

Even more so than the standard COCA interface, the WordAndPhrase website is designed to provide information on nearly everything that you might want to know about words and phrases and their usage on one screen and with one search. Best of all, you can enter entire texts and see detailed information about each word in the text, and see related phrases from COCA.

Google Books Corpus

This improves greatly on the standard n-grams interface from Google Books. It allows users to actually use the frequency data (rather than just see it in a picture), to search by wildcard, lemma, part of speech, and synonyms, to find collocates, and to compare data in different historical periods.

Corpus of Historical American English (COHA)

400 million word corpus of historical American English, 1810-2009. The corpus is 100 times as large as any other structured corpus of historical English, and it is well-balanced by genre in each decade. As a result, it allows researchers to examine a wide range of changes in English with much more accuracy and detail than with any other available corpus. (Funded by the US National Endowment for the Humanities)

English word frequency, collocates, and n-grams

Based on COCA and other corpora, the data provides a very accurate listing of the top 100,000 words in English (including frequency by genre), the frequency of 4,300,000+ collocate pairs, and the frequency of all n-grams (1, 2, 3, 4-grams) in the corpus.

Frequency dictionary of American English

The dictionary contains the top 5000 words (lemmas) in American English, based on the data from the Corpus of Contemporary American English (COCA). The dictionary gives the top collocates for each of the 5000 words, which gives a very good idea of the overall meaning of each word. (Co-authored with Dee Gardner (BYU), and published by Routledge.)

Corpus of Contemporary American English (COCA)

This 450+ million word corpus is the only large and balanced corpus of American English. It is used by more than 40,000 individual users each month, which makes it perhaps the most widely-used online corpus currently available. Because of its design, it is also perhaps the only large corpus of English that can be used to look at ongoing changes in the language.

Frequency dictionary of Portuguese

The dictionary is based on the 20 million words from the 1900s portion of the 45 million word Corpus do Português. It is the first frequency dictionary of Portuguese that is based on a large corpus from several different genres. (Co-authored with Prof. Ana Preto-Bay of the Department of Spanish and Portuguese at BYU, and published by Routledge.)

Corpus do Português

45 million word corpus of Portuguese (1300s-1900s).  The corpus allows users to find the frequency, distribution, and use of words, phrases, and grammatical constructions in different historical periods, as well as in the genres and dialects of Modern Portuguese. (Created in conjunction with Michael Ferreira of Georgetown University, and funded by the US National Endowment for the Humanities)

Frequency dictionary of Spanish

This is the first major frequency dictionary of Spanish that has been published in English since 1964. It is based on the 20 million words from the 1900s portion of the 100 million word Corpus del Español, and it includes many features not found in any previous dictionary of Spanish. (Published by Routledge.)

Register variation in Spanish

This project used large corpora of many different registers of Spanish as the basis for a "Multi-dimensional analysis of register variation in Spanish". (Carried out in conjunction with Douglas Biber of NAU, and funded by the US National Science Foundation)

Corpus del Español

100 million word corpus of Spanish (1200s-1900s).  The corpus allows users to find the frequency, distribution, and use of words, phrases, and grammatical constructions in different historical periods, as well as in the genres of Modern Spanish. (Funded by the US National Endowment for the Humanities)

Corpus of LDS General Conference talks

24 million words in more than 10,000 talks from 1851 to the current time. Allows users to track the frequency of words and phrases by decade, find collocates (nearby words) to see changes in meaning, compare between different historical periods, and more.

Legal consultancies

Corpora and the law

I have twice served as an "expert witness" in cases related to trade names in Spanish and English. The cases involved lawsuits (totaling more than 500 million dollars) against a large multinational firm, which was accused of selling a dangerous product in Latin America. Corpus-based data that I provided showed, however, that the company's trade name had become generic (like xerox or kleenex), so that people who used another company's product mistakenly called it by the generic name of the product in question. Partially as a result of the corpus data that I presented, the lawsuits against this company were dismissed.

In addition, linguistic data from my corpora have been used in a number of court cases, including a case from the US Supreme Court in February 2011 (see overview, amicus brief, Atlantic, Language Log). They have also been used as the basis for some articles in law reviews, such as this and this. There's also an interesting overview of some of these cases (as well as my work with corpora) in a recent Deseret News article.

Note that because I have created the corpora that are used for these and similar cases, I can retrieve data from the corpora in ways that might not be possible for other linguist expert witnesses.


In order to create large corpora and place them online, I have acquired experience in a number of different technologies. These include database organization and optimization (mainly with SQL Server, including advanced SQL queries), web-database integration (ActiveX Data Objects), server-side scripting (mainly Active Server Pages, via VBScript), client-side programming (mainly DHTML / JavaScript), basic file and text manipulation (regular expressions, batch files, etc.), VB.NET (for processing billions of words of data), and several different corpus and text-related tools (like AntConc and TextPad). I also maintain the hardware and software for my Windows 2012 servers, including the administration of Internet Information Services (IIS).


Beyond life at the university, my interests include comparative religion, world cultures, history, Mormon studies, languages of the world, and the relationship between technology and culture, including the Internet.  And of course I enjoy spending time with my family -- Kathy, Spencer (and now Holly too, and soon the little one as well :-), Joseph, and Adam.