HISTORICAL I
 

Text archives

  1. Literature Online

  2. Literature Online (LION)

  3. Early English Books Online (Michigan)

  4. American Periodicals (1741-1900)

  5. New York Times (1851-present)

  6. Oldie radio scripts

  7. Movie scripts

  8. Google Books

  9. Project Gutenberg

  10. TIME Magazine

  11. OED

Part of speech (noun, verb)
Lemmatization (forms of go)
Show frequency by period
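
The three features above (part of speech, lemmatization, frequency by period) are what a plain text archive usually lacks. A minimal sketch of the last two, assuming a toy corpus of (year, word) pairs and a hand-written lemma table; `LEMMAS`, `per_million_by_decade` and the sample data are hypothetical and do not come from any of the archives listed.

```python
from collections import Counter

# Hypothetical lemma table: maps inflected forms to their lemma.
LEMMAS = {"go": "go", "goes": "go", "going": "go", "went": "go", "gone": "go"}


def per_million_by_decade(tokens, lemma):
    """tokens: iterable of (year, word) pairs.

    Returns {decade: occurrences of `lemma` per million words in that decade}.
    """
    totals = Counter()  # all words per decade
    hits = Counter()    # words whose lemma matches, per decade
    for year, word in tokens:
        decade = (year // 10) * 10
        totals[decade] += 1
        if LEMMAS.get(word.lower()) == lemma:
            hits[decade] += 1
    # Normalize so decades of different sizes can be compared.
    return {d: hits[d] * 1_000_000 / totals[d] for d in sorted(totals)}


# Toy data, invented for illustration only.
sample = [
    (1815, "went"), (1815, "home"),
    (1903, "Going"), (1903, "goes"), (1903, "to"), (1903, "the"), (1903, "town"),
]
freqs = per_million_by_decade(sample, "go")
```

Normalizing per million words matters because the amount of text per decade differs widely; raw counts alone would mostly track corpus size.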

Small corpora

Large vs small corpora (d2e Helsinki 2015)

Recent change: COCA (changes page), NOW (PPT)

My corpora

  1. English

    1. OED Corpus (37m words, Old English - present)

    2. TIME Corpus (100m words, 1923-2006)

    3. General Conference (25 million words, 1851-2010)
       

    4. Early English Books Online [EEBO]: 755 million words, 1470s-1690s

    5. Corpus of Historical American English [COHA]: 400m words, ~1810-2009

    6. Corpus of Contemporary American English [COCA]: 560m words, 1990-present

    7. News on the Web [NOW]: 5.8+ billion words, 2012-yesterday
       

  2. Spanish

    1. CORDE

    2. Corpus del Español
       

  3. Portuguese

    1. Corpus do Português


Martin Hilpert's work with motion charts for COHA (general motion charts)


What can you do with a real corpus?

  1. Overall frequency
     

    1. Words and phrases

      1. COHA, TIME (main page)

      2. Spanish: soldado, casto

      3. Gen Conf: (main page)

      4. Problem: spelling and lemmatization (notwithstanding, seem, haver)
         

    2. Morphemes/roots

      1. -aholic, -gate in TIME

      2. [fazer], [haver] in Portuguese
         

    3. Syntactic constructions

      1. end up Ving

      2. who/whom

      3. going to / will (COHA)

      4. preposition stranding

      5. accustomed to V/V-ing

      6. split infinitives

      7. subjunctive (if I was/were)

      8. modals of obligation: should / must / ought to / need to / have to

      9. problem: part of speech tagging

Advanced syntax:

  1. relative pronouns ([nn*] [cst*]|[ddq*]|[pnqs*] he [v*] / [nn*] -- he [v*])
     

  2. pre/post verbal negation (and "do support") with have
    (older: [p*] [have] [x*] [a*]|[d*] [nn*] / newer: [p*] [do] [x*] [have] [a*]|[d*] [nn*])
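
The two negation patterns above can be sketched as a match over tagged tokens. This is an illustration only, assuming that `[have]` means "any form of the lemma have", that `[p*]` means "any tag starting with p", and that `|` separates alternatives within a slot; the tuple layout, the helper names and the CLAWS-style tags in the sample sentences are hypothetical.

```python
def matches(token, slot):
    """token is (word, lemma, pos); slot is a list of alternatives,
    each either ('lemma', value) or ('pos', tag_prefix)."""
    _word, lemma, pos = token
    for kind, value in slot:
        if kind == "lemma" and lemma == value:
            return True
        if kind == "pos" and pos.startswith(value):
            return True
    return False


def find(tokens, pattern):
    """Return start indexes where consecutive tokens match every slot in order."""
    n = len(pattern)
    return [
        i for i in range(len(tokens) - n + 1)
        if all(matches(tokens[i + j], pattern[j]) for j in range(n))
    ]


# older: [p*] [have] [x*] [a*]|[d*] [nn*]
OLDER = [[("pos", "p")], [("lemma", "have")], [("pos", "x")],
         [("pos", "a"), ("pos", "d")], [("pos", "nn")]]
# newer: [p*] [do] [x*] [have] [a*]|[d*] [nn*]
NEWER = [[("pos", "p")], [("lemma", "do")], [("pos", "x")], [("lemma", "have")],
         [("pos", "a"), ("pos", "d")], [("pos", "nn")]]

# Invented example sentences with assumed CLAWS-style tags.
older_sent = [("I", "i", "ppis1"), ("have", "have", "vh0"), ("not", "not", "xx"),
              ("the", "the", "at"), ("time", "time", "nn1")]
newer_sent = [("I", "i", "ppis1"), ("do", "do", "vd0"), ("not", "not", "xx"),
              ("have", "have", "vhi"), ("the", "the", "at"), ("time", "time", "nn1")]

older_hits = find(older_sent, OLDER)
newer_hits = find(newer_sent, NEWER)
```

Counting the hits for each pattern by decade would then show when "do support" overtakes the older order, subject to the tagging problems noted earlier.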

Collocates (changes in meaning)

  1. Chip

  2. Engine

  3. Wife

  4. Crisis
     

  1. Comparison by time period

    1. Verbs in 1930s-40s vs 1990s-2000s

    2. *heart* in 1800s vs 1900s (COHA)

    3. [ Class looks for others in 1800s / 1900s: OED ]

    4. Phrasal verbs

    5. Feminine -ess nouns

    6. ADJ with woman in OED, TIME

    7. ADJ with mujer in Spanish (1800s/1900s)

    8. ADJ with mulher in Portuguese (1800s/1900s)
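
Comparing collocates across two periods, as in the items above, can be sketched as follows. This is a minimal illustration, assuming each period is a flat list of lowercase tokens; the function names, the add-one smoothing and the two toy "periods" are hypothetical choices, not the method used by any of the corpora.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count words within `window` tokens on either side of each `node`."""
    counts = Counter()
    for i, word in enumerate(tokens):
        if word == node:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts


def compare(period_a, period_b, node, window=4):
    """Rank collocates of `node` by how much more often they occur
    in period_a than in period_b (most distinctive of period_a first)."""
    a = collocates(period_a, node, window)
    b = collocates(period_b, node, window)
    # Add-one smoothing so a collocate missing from one period
    # does not cause a division by zero.
    ratio = {w: (a[w] + 1) / (b[w] + 1) for w in set(a) | set(b)}
    return sorted(ratio, key=lambda w: (-ratio[w], w))


# Toy data, invented for illustration only.
early = "the potato chip and wood chip lay on the floor".split()
late = "the computer chip and memory chip power the computer".split()

ranked = compare(late, early, "chip", window=1)
```

With real data the two periods differ in size, so the counts would need to be normalized (for example per million words) before taking the ratio.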

Limitations

  1. Time-delay (whom, like)

  2. Balance (Spanish, COHA)

  3. Spelling variation: notwithstanding, peas, up (others??) // Pt lemmatization


Activity

  1. Find frequency of synonyms

  2. Historical syntax


Google Books (compare: COHA, GB, GB-BYU)

COCA

Using Google Books (Science)