Mark Davies
Professor, Corpus Linguistics
Brigham Young University |
|
Education |
I received a B.A. from Brigham Young University in 1986 with a double major in Linguistics and Spanish, followed by an M.A. in Spanish Linguistics from BYU in 1989. I then received a Ph.D. from the University of Texas at Austin in 1992, with a specialization in "Ibero-Romance Linguistics". |
| Research |
As a professor of Spanish at Illinois State University (1992-2003), most of my publications dealt with historical and genre-based variation in Spanish and Portuguese syntax. Since coming to BYU in 2003, however, my research has dealt primarily with general issues in corpus design, creation, and use (especially with regard to English), as well as word frequency. Overall, I have published four books and more than sixty articles, given numerous (invited) presentations at international conferences, and received five large federal grants (three NEH and two NSF) to develop and use corpora. |
|
Teaching |
In Winter 2013 I taught Corpus Linguistics (Ling 485) and Historical and Comparative Linguistics (Ling 450). Other recent classes include English Grammar (ELang 325), Empirical Methods in English Linguistics (ELang 273), and ELang 495, dealing with "A Corpus-based Approach to the History of American English". |
|
1.9 billion word corpus from 1.8 million web pages in 20 different English-speaking countries. In addition to being very large (20 times as big as the BNC), this corpus also allows you to carry out powerful searches to compare the English dialects and see the frequency of words, phrases, grammatical constructions, and meanings in these twenty different countries. |
|
|
www.wordandphrase.info |
Even more so than the standard COCA interface, this new website is designed to provide information on nearly everything that you might want to know about words and phrases and their usage. You can enter entire texts and see detailed information about each word (all on one screen): definitions, overall frequency, frequency by genre, 20-30 collocates, 200 concordance lines, synonyms, and WordNet entries. You can also have it suggest alternate phrases for any phrase in your text, based on COCA data. Finally, you can browse through and search a huge frequency dictionary of English, and see detailed information on each word. |
|
Google Books Corpus |
This improves greatly on the standard n-grams interface from Google Books. It allows users to actually use the frequency data (rather than just see it in a picture), to search by wildcard, lemma, part of speech, and synonyms, to find collocates, and to compare data in different historical periods. The corpus is currently based on 155 billion words of American English from 1810-2009, but I have applied for a grant to apply this interface and architecture to the other Google Books datasets as well, including British English, Spanish, French, and German. |
|
Corpus of Historical American English (COHA) |
400 million word corpus of historical American English, 1810-2009. The corpus is 100 times as large as any other structured corpus of historical English, and it is balanced in each decade across fiction, popular magazines, newspapers, and academic writing. As a result, it allows researchers to examine a wide range of changes in English with much more accuracy and detail than with any other available corpus. (Funded by the US National Endowment for the Humanities) |
|
Based on the Corpus of Contemporary American English (COCA), the data provides a very accurate listing of the top 500,000 word forms in American English, the top 60,000 lemmas (including frequency in each of the five main genres and 40+ sub-genres), the frequency of 4,800,000+ collocate pairs, and the frequency of all 150+ million 3-grams (three-word strings) in the corpus. |
|
|
The dictionary contains the top 5000 words (lemmas) in American English, based on the data from the Corpus of Contemporary American English (COCA). The dictionary gives the top collocates for each of the 5000 words, which gives a very good idea of the overall meaning of each word. (Co-authored with Dee Gardner (BYU), and published by Routledge.) |
|
|
The Corpus of Contemporary American English (COCA) is a 450+ million word corpus, and it is the only large and balanced corpus of American English. It is used by more than 40,000 individual users each month, which makes it perhaps the most widely used online corpus currently available. Because of its design, it is also perhaps the only large corpus of English that can be used to look at ongoing changes in the language. |
|
|
Frequency dictionary of Portuguese (2007) |
The dictionary is based on the 20 million words from the 1900s portion of the 45 million word Corpus do Português. It is the first frequency dictionary of Portuguese that is based on a large corpus from several different genres. (Co-authored with Prof. Ana Preto-Bay of the Department of Spanish and Portuguese at BYU, and published by Routledge.) |
|
Corpus do Português (2006) |
45 million word corpus of Portuguese (1300s-1900s). The corpus allows users to find the frequency, distribution, and use of words, phrases, and grammatical constructions in different historical periods, as well as in the genres and dialects of Modern Portuguese. (Created in conjunction with Michael Ferreira of Georgetown University, and funded by the US National Endowment for the Humanities) |
|
Frequency dictionary of Spanish (2005) |
This is the first major frequency dictionary of Spanish that has been published in English since 1964. It is based on the 20 million words from the 1900s portion of the 100 million word Corpus del Español, and it includes many features not found in any previous dictionary of Spanish. (Published by Routledge.) |
|
Register variation in Spanish (2004) |
Used large corpora of many different registers of Spanish as the basis for a "Multi-dimensional analysis of register variation in Spanish". (Carried out in conjunction with Douglas Biber of NAU, and funded by the US National Science Foundation) |
|
Corpus del Español (2002) |
100 million word corpus of Spanish (1200s-1900s). The corpus allows users to find the frequency, distribution, and use of words, phrases, and grammatical constructions in different historical periods, as well as in the genres of Modern Spanish. (Funded by the US National Endowment for the Humanities) |
| Corpus of LDS General Conference talks |
24 million words in more than 10,000 talks from 1851 to the current time. Allows users to track the frequency of words and phrases by decade, find collocates (nearby words) to see changes in meaning, compare between different historical periods, and more. |
| Other projects |
If you're interested in multilingual corpora, you might try a few that I've created: the Polyglot Bible (Gospel of Luke in 30 languages) and the Latin-Old Spanish-Modern Spanish Bible (entire text). |
|
Legal consultancies: corpora and the law |
In 2004 and again in 2008 I served as an "expert witness" in cases related to trade names in Spanish and English. The cases involved lawsuits (totaling more than 500 million dollars) against a large multinational firm, which was accused of selling a dangerous product in Latin America. Corpus-based data that I provided showed, however, that the company's trade name had become generic (like xerox or kleenex), so that people who used another company's product mistakenly called it by the generic name of the product in question. Partially as a result of the corpus data that I presented, the lawsuits against this company were dismissed.
In addition, linguistic data from my corpora have been used in a number of court cases, including a case from the US Supreme Court in February 2011 (see overview, amicus brief, Atlantic, Language Log). They have also been used as the basis for some articles in law reviews, such as this and this. There's also an interesting overview of some of these cases (as well as my work with corpora) in a recent Deseret News article. Note that because I have created the corpora that are used for these and similar cases, I can retrieve data from the corpora in ways that might not be possible for other linguist expert witnesses. |
|
Technology |
In order to create large corpora and place them online, I have acquired experience in a number of different technologies. These include database organization and optimization (mainly with SQL Server, including advanced SQL queries), web-database integration (ActiveX Data Objects), server-side scripting (mainly Active Server Pages, via VBScript), client-side programming (mainly DHTML / JavaScript), basic file and text manipulation (regular expressions, batch files, etc.), and several different corpus and text-related tools (like WordSmith and TextPad). I also maintain the hardware and software for my three Windows 2003 Servers, including the administration of Internet Information Services. |
|
Interests |
Beyond life at the university, my interests include comparative religion, world cultures, history, languages of the world, and the relationship between technology and culture, including the Internet. And of course I enjoy spending time with my family -- Kathy, Spencer, Joseph, and Adam. |
|