Mark Davies
Professor, Corpus Linguistics
Brigham Young University

 

 INTRODUCTION


I am a professor of Corpus Linguistics in the Department of Linguistics and English Language at Brigham Young University in Provo, Utah. From 1992 to 2003, I was a professor of Spanish Linguistics at Illinois State University.

My primary areas of research and activity are:
--    Corpus and computational linguistics
--    Design and optimization of linguistic databases
--    Web scripting and web-database integration
--    Historical linguistics and syntactic variation
--    English, Spanish, and Portuguese

Please feel free to try out some of the searchable online corpora that I've created, including the following (more information):
    Corpus of Contemporary American English (COCA): 360+ million words (US, 1990-2007)
    British National Corpus*: 100 million words (UK, 1980s-1993)
    TIME Corpus: 100 million words (US, 1923-present)
    OED Corpus of Historical English: 37 million words (Old Eng-1900s)
    Corpus del Español: 100 million words (1200s-1900s)
    Corpus do Português: 45 million words (1300s-1900s)
        * my architecture and interface

EDUCATION

 

I received a B.A. from Brigham Young University in 1986 with a double major in Linguistics and Spanish, followed by an M.A. in Spanish Linguistics from BYU in 1989. I then received a Ph.D. from the University of Texas at Austin in 1992, with a specialization in "Ibero-Romance Linguistics".

PUBLICATIONS AND PRESENTATIONS
[SEE VITA]
 

For the first ten years or so of my career, most of my publications dealt primarily with historical and genre-based variation in Spanish and Portuguese syntax. Since that time, however, they have increasingly dealt with general issues in corpus design, creation, and use, especially with regard to English. Overall, I have had nearly forty articles published in these areas, and I have given numerous presentations at international conferences.

TEACHING

  In Summer 2008 I'm teaching English Grammar (ELang 325). Other recent classes include Empirical Methods in English Linguistics and Corpus Linguistics.


CORPUS OF CONTEMPORARY AMERICAN ENGLISH (COCA) (2008)
 

In 2008 I placed online a 360+ million word corpus of American English. This is the only large-scale corpus of American English, and it is in fact the largest (and hopefully most useful) structured corpus of any language freely available on the web. The corpus contains twenty million words for each year from 1990 to the present, divided evenly among five genres (spoken, fiction, popular magazines, newspapers, and academic) at four million words per genre per year. Best of all, the corpus will be continually updated -- 20 million words each year -- from this point on.
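
The balanced design described above can be checked with a little arithmetic. What follows is a minimal Python sketch, not the code used to build the corpus: the names GENRES, WORDS_PER_GENRE_PER_YEAR, words_per_year, and corpus_size are hypothetical, and the per-genre figure is the approximate target stated above rather than an exact count.

```python
# Illustrative only: approximate design targets taken from the description
# above, not actual word counts from the corpus.
GENRES = ["spoken", "fiction", "popular magazines", "newspapers", "academic"]
WORDS_PER_GENRE_PER_YEAR = 4_000_000


def words_per_year() -> int:
    """Target number of words added for a single year across all genres."""
    return len(GENRES) * WORDS_PER_GENRE_PER_YEAR


def corpus_size(first_year: int, last_year: int) -> int:
    """Target corpus size for an inclusive range of years."""
    years = last_year - first_year + 1
    return years * words_per_year()


if __name__ == "__main__":
    print(words_per_year())
    print(corpus_size(1990, 2007))
```

For the years 1990-2007 this works out to 18 years at 20 million words each, or 360 million words, which is consistent with the "360+ million" figure.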

FREQUENCY DICTIONARY OF PORTUGUESE (2007)

 

I created this dictionary in conjunction with Prof. Ana Preto-Bay from the Department of Spanish and Portuguese at BYU. The dictionary is based on the 20 million words from the 1900s portion of the 45 million word Corpus do Português. It is the first frequency dictionary of Portuguese that is based on a large corpus from several different genres, and it has a format quite similar to the Frequency Dictionary of Spanish, discussed below. It was published by Routledge in late 2007.

FREQUENCY DICTIONARY OF SPANISH (2005)

 

This frequency dictionary of Spanish was published in late 2005, and was the first major frequency dictionary of Spanish published in English since 1964.  It was based on more than 20 million words from many different registers, and includes many features not found in any previous dictionary of Spanish. [MORE INFORMATION]

CORPUS DO PORTUGUÊS (2004-06)

 

In April 2004 I was awarded a two-year grant from the National Endowment for the Humanities to create a corpus of historical Portuguese, in conjunction with Prof. Michael Ferreira of Georgetown University. This corpus allows users to compare the frequency, distribution, and use of words, phrases, and grammatical constructions between different historical periods, registers, and dialects of Portuguese.

REGISTER VARIATION IN SPANISH (2002-04)

 

In July 2002 I was awarded a two-year grant from the National Science Foundation to research the "Multi-dimensional analysis of register variation in Spanish". Working as Co-PI with Prof. Douglas Biber of Northern Arizona University, I used large corpora of many different registers of Spanish from the 1600s-1900s to explore the syntactic variation between these registers.

CORPUS DEL ESPAÑOL (2001-02, MAJOR NEW RELEASE 2007)

 

In April 2001 I was awarded a 16-month grant from the National Endowment for the Humanities to develop a 100 million word searchable corpus of historical and modern Spanish texts on the web. Unlike other large corpora of Spanish, my Corpus del Español allows users to perform advanced searches based on part of speech, lemma, synonyms, and word and clause frequency.

OTHER PROJECTS

 

In the past year, I've also created a 400+ million word corpus of transcripts of spoken American English (2000-present), as well as a 100 million word corpus from an American magazine, 1920s-present. For reasons of copyright, however, these are not currently available to others. If you're interested in multilingual corpora, you might try a few that I've created: the Polyglot Bible (Gospel of Luke in 30 languages) and the Latin-Old Spanish-Modern Spanish Bible (entire text). 


TECHNOLOGIES

 

To create large corpora and place them online, I have acquired experience in a number of different technologies. These include database organization and optimization (mainly with SQL Server, including advanced SQL queries), web-database integration (ActiveX Data Objects), server-side scripting (mainly Active Server Pages, via VBScript), client-side programming (mainly DHTML / JavaScript), basic file and text manipulation (regular expressions, batch files, etc.), and several different corpus and text-related tools (like WordSmith and TextPad). I also maintain the hardware and software for my three Windows Server 2003 machines, including the administration of Internet Information Services.

INTERESTS

 

Beyond life at the university, my interests include comparative religion, world cultures, world history (especially ancient and medieval), languages of the world, and the implications of technology, including the Internet.  And of course I enjoy spending time with my family -- my wife Kathy, and "my three sons" -- Spencer, Joseph, and Adam.

EMAIL

 

