|
Mark Davies Professor, Corpus Linguistics Brigham Young University |
EDUCATION |
I received a B.A. from Brigham Young University in 1986 with a double major in Linguistics and Spanish, followed by an M.A. in Spanish Linguistics from BYU in 1989. I then received a Ph.D. from the University of Texas at Austin in 1992, with a specialization in "Ibero-Romance Linguistics". |
|
| PUBLICATIONS AND PRESENTATIONS |
For the first ten years or so of my career, most of my publications dealt with historical and genre-based variation in Spanish and Portuguese syntax. Since that time, however, they have increasingly dealt with general issues in corpus design, creation, and use, especially with regard to English. In the future, they will probably deal increasingly with genre-based variation and historical change in English. Overall, I have had more than fifty articles published (or accepted for publication) in these areas (including four books), as well as numerous presentations at international conferences. [SEE VITA] |
|
|
TEACHING |
In Fall 2009 I'm teaching two sections of Historical and Comparative Linguistics (Ling 450). Other recent classes are Corpus Linguistics (Ling 485), Empirical Methods in English Linguistics (ELang 273), and English Grammar (ELang 325). In Winter 2010 I'll be teaching ELang 495 for the first time; it deals with "A Corpus-based Approach to the History of American English". |
|
|
|
||
|
CORPUS OF HISTORICAL AMERICAN ENGLISH (COHA) |
In March 2009 I received a large grant from the National Endowment for the Humanities to create a 300 million word Corpus of Historical American English (early 1800s - present time). The corpus will be balanced in each decade (and therefore overall, as well) between fiction, popular magazines, newspapers, and academic texts. The 300 million word COHA (1800s-2000s) will nicely complement the nearly 400 million word COCA (1990s-2000s). Most importantly, it will allow researchers to examine a wide range of changes in American English with much more accuracy and detail than with any other available corpus. An online beta version of the corpus should be available in Summer 2010, and the final version will be available in December 2010. |
|
|
FREQUENCY DICTIONARY OF AMERICAN ENGLISH |
This will be published by Routledge in Jan-Feb 2010, and will be co-authored with Prof. Dee Gardner of the Department of Linguistics and English Language at BYU. It will contain the top 5000 words (lemmas) in American English, based on the data from the Corpus of Contemporary American English (COCA). Rather than providing one single sample sentence for each word, this dictionary will give the top 15-25 collocates (grouped by part of speech) for each of the 5000 words, which should give a much better idea of the overall meaning of each word. |
|
|
In 2008 I placed online a 400+ million word corpus of American English. This is the only large-scale corpus of American English, and it is in fact the largest (and hopefully most useful) structured corpus of any language freely available on the web. The corpus contains twenty million words for each year from 1990 to the present, with four million words per year in each of the five genres: spoken, fiction, popular magazines, newspapers, and academic. Best of all, the corpus will be continually updated -- 20 million words each year -- from this point on. |
||
|
I created this dictionary in conjunction with Prof. Ana Preto-Bay from the Department of Spanish and Portuguese at BYU. The dictionary is based on the 20 million words from the 1900s portion of the 45 million word Corpus do Português. It is the first frequency dictionary of Portuguese that is based on a large corpus from several different genres, and it has a format quite similar to the Frequency Dictionary of Spanish, discussed below. It was published by Routledge in late 2007. |
||
|
CORPUS DO PORTUGUÊS (2006) |
In April 2004 I was awarded a two year grant from the National Endowment for the Humanities to create a 45 million word corpus of historical Portuguese, in conjunction with Prof. Michael Ferreira of Georgetown University. Completed in 2006, this corpus allows users to compare the frequency, distribution, and use of words, phrases, and grammatical constructions between different historical periods, registers, and dialects of Portuguese. The corpus underwent a major upgrade in 2008, to give it the same architecture, interface, and features as the other corpora that I've placed online. |
|
|
This frequency dictionary of Spanish was published in late 2005, and was the first major frequency dictionary of Spanish published in English since 1964. It is based on the 20 million words from the 1900s portion of the 100 million word Corpus del Español, and it includes many features not found in any previous dictionary of Spanish. [MORE INFORMATION] |
||
|
In July 2002 I was awarded a two year grant from the National Science Foundation to research the "Multi-dimensional analysis of register variation in Spanish". Together with my Co-PI, Prof. Douglas Biber of Northern Arizona University, I used large corpora of many different registers of Spanish from the 1600s-1900s to explore the syntactic variation between these registers. |
||
|
CORPUS DEL ESPAÑOL |
In April 2001 I was awarded a 16 month grant from the National Endowment for the Humanities to develop a 100 million word searchable corpus of historical and modern Spanish texts on the web. Unlike other large corpora of Spanish, my Corpus del Español allows users to perform advanced searches based on part of speech, lemma, collocates, synonyms, and frequency in different time periods and genres. The corpus underwent a major upgrade in late 2007, to give it the same architecture, interface, and features as the other corpora that I've placed online. |
|
| OTHER PROJECTS |
In the past year, I've also created a 400+ million word corpus of transcripts of spoken American English (2000-present), as well as a 100 million word corpus from an American magazine, 1920s-present. For reasons of copyright, however, these are not currently available to others. If you're interested in multilingual corpora, you might try a few that I've created: the Polyglot Bible (Gospel of Luke in 30 languages) and the Latin-Old Spanish-Modern Spanish Bible (entire text). |
|
| LEGAL CONSULTANCIES |
In 2004 and again in 2008 I served as an "expert witness" in cases related to trade names in Spanish and English. The cases involved lawsuits against a large multinational firm, which was accused of selling a dangerous product in Latin America. Corpus-based data that I provided showed, however, that the company's trade name had become genericized (like Xerox or Kleenex), so that people used another company's product but called it by the genericized name of the product in question. Partially as a result of the corpus data that I presented, the lawsuits against this company were dismissed. |
|
|
||
|
TECHNOLOGIES |
In order to create large corpora and place them online, I have acquired experience in a number of different technologies. These include database organization and optimization (mainly with SQL Server, including advanced SQL queries), web-database integration (ActiveX Data Objects), server-side scripting (mainly Active Server Pages, via VBScript), client-side programming (mainly DHTML / JavaScript), basic file and text manipulation (regular expressions, batch files, etc.), and several different corpus and text-related tools (like WordSmith and TextPad). I also maintain the hardware and software for my three Windows 2003 Servers, including the administration of Internet Information Services. |
|
|
INTERESTS |
Beyond life at the university, my interests include comparative religion, world cultures, history, languages of the world, and the implications of technology, including the Internet. And of course I enjoy spending time with my family -- Kathy, Spencer, Joseph, and Adam. |