HISTORICAL I
 

Text archives

  1. Literature Online (LION)

  2. Early English Books Online (Michigan)

  3. American Periodicals (1741-1900)

  4. New York Times (1851-present)

  5. Oldie radio scripts

  6. Movie scripts

  7. Google Books

  8. Project Gutenberg

  9. TIME Magazine

  10. OED

Part of speech (noun, verb)
Lemmatization (forms of go)
Show frequency by period
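
A minimal sketch of what "lemmatization" plus "frequency by period" amounts to: count every form of one lemma and normalize per million words in each decade. The function name, the (year, word) input format, and the sample tokens are all invented for illustration; no real archive or corpus interface is being described.

```python
from collections import Counter

# Forms grouped under the lemma "go" (hand-listed for this sketch).
GO_FORMS = {"go", "goes", "going", "went", "gone"}

def per_million_by_decade(tokens, forms):
    """tokens: iterable of (year, word) pairs (a hypothetical input format).

    Returns {decade: hits per million words}, counting any word in `forms`.
    """
    totals = Counter()  # all tokens seen in each decade
    hits = Counter()    # tokens in each decade that belong to the lemma
    for year, word in tokens:
        decade = year // 10 * 10
        totals[decade] += 1
        if word.lower() in forms:
            hits[decade] += 1
    return {d: hits[d] * 1_000_000 / totals[d] for d in sorted(totals)}

# Toy data, not drawn from any real corpus.
sample = [
    (1851, "went"), (1851, "home"), (1855, "he"), (1859, "goes"),
    (1923, "Going"), (1925, "to"), (1927, "town"), (1929, "now"),
]
result = per_million_by_decade(sample, GO_FORMS)
```

Normalizing by the size of each decade matters because historical collections are rarely the same size in every period; raw counts would mostly reflect how much text survives.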

Small corpora

Large vs small corpora (d2e Helsinki 2015)

My corpora

  1. English

    1. OED Corpus (37m words, Old English - present)

    2. TIME Corpus (100m words, 1923-2006)

    3. General Conference (25 million words, 1851-2010)
       

    4. EEBO

    5. Corpus of Historical American English [COHA] (400m words, ~1810-2009)

    6. Corpus of Contemporary American English [COCA] (425m words, 1990-present)
       

  2. Spanish

    1. CORDE

    2. Corpus del Español
       

  3. Portuguese

    1. Corpus do Português


Martin Hilpert's work with motion charts for COHA (general motion charts)


What can you do with a real corpus?

  1. Overall frequency
     

    1. Words and phrases

      1. COHA, TIME (main page)

      2. Spanish: soldado, casto

      3. Gen Conf: (main page)

      4. Problem: spelling and lemmatization (notwithstanding, seem, haver)
         

    2. Morphemes/roots

      1. -aholic, -gate in TIME

      2. [fazer], [haver] in Portuguese
         

    3. Syntactic constructions

      1. end up V-ing

      2. who/whom

      3. going to / will (COHA)

      4. preposition stranding

      5. accustomed to V/V-ing

      6. split infinitives

      7. subjunctive (if I was/were)

      8. modals of obligation: should / must / ought to / need to / have to

      9. problem: part of speech tagging
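
As a rough illustration of why construction searches depend on tagging, here is a sketch that looks for "end up V-ing" in a tagged token list. It assumes a hypothetical input of (word, lemma, tag) tuples with lowercase CLAWS7-style tags (vvg for an -ing form of a lexical verb); the function name and the sample sentence are invented.

```python
def find_end_up_ving(tagged):
    """Return 'end up V-ing' matches from a list of (word, lemma, tag) tuples.

    Assumed, not taken from any corpus interface: tags are lowercase
    CLAWS7-style labels, and verb tags start with "v".
    """
    matches = []
    for i in range(len(tagged) - 2):
        (w1, l1, p1), (w2, _, _), (w3, _, p3) = tagged[i:i + 3]
        is_verb_end = l1 == "end" and p1.startswith("v")
        if is_verb_end and w2.lower() == "up" and p3 == "vvg":
            matches.append(" ".join((w1, w2, w3)))
    return matches

# Toy sentence: the verb "ended" should match, the noun "end" should not.
sample = [
    ("They", "they", "ppis2"), ("ended", "end", "vvd"), ("up", "up", "rp"),
    ("paying", "pay", "vvg"), ("in", "in", "ii"), ("the", "the", "at"),
    ("end", "end", "nn1"), ("up", "up", "rp"), ("there", "there", "rl"),
]
found = find_end_up_ving(sample)
```

The noun "end" is excluded only because of its tag, so any tagging error (a verb tagged as a noun, or the reverse) turns directly into a missed or spurious hit.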

Advanced syntax:

  1. relative pronouns ([nn*] [cst*]|[ddq*]|[pnqs*] he [v*] / [nn*] -- he [v*])
     

  2. pre/post verbal negation (and "do support") with have
    (older: [p*] [have] [x*] [a*]|[d*] [nn*] / newer: [p*] [do] [x*] [have] [a*]|[d*] [nn*])

Collocates (changes in meaning)

  1. Chip

  2. Engine

  3. Wife

  4. Crisis
     

  1. Comparison by time period

    1. Verbs in 1930s-40s vs 1990s-2000s

    2. *heart* in 1800s vs 1900s (COHA)

    3. [ Class looks for others in 1800s / 1900s: OED ]

    4. Phrasal verbs

    5. Feminine -ess nouns

    6. ADJ with woman in OED, TIME

    7. ADJ with mujer in Spanish (1800s/1900s)

    8. ADJ with mulher in Portuguese (1800s/1900s)
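
The period comparisons above can be sketched as two steps: collect the adjectives that immediately precede a node word in each period, then list the ones found in one period but not the other. Everything here is an assumption for illustration: the (word, tag) input format, the lowercase "jj" adjective tag, the function names, and the toy data, which is not taken from any corpus.

```python
from collections import Counter

def adj_collocates(tagged, node):
    """Count adjectives (tag "jj") directly before `node` in (word, tag) pairs."""
    counts = Counter()
    for (w1, p1), (w2, _) in zip(tagged, tagged[1:]):
        if p1 == "jj" and w2.lower() == node:
            counts[w1.lower()] += 1
    return counts

def distinctive(period_a, period_b):
    """Collocates attested in period_a but absent from period_b, most frequent first."""
    return [w for w, _ in period_a.most_common() if w not in period_b]

# Invented toy data standing in for two time slices.
early = [("a", "at"), ("virtuous", "jj"), ("woman", "nn"),
         ("a", "at"), ("good", "jj"), ("woman", "nn")]
late = [("a", "at"), ("professional", "jj"), ("woman", "nn"),
        ("a", "at"), ("good", "jj"), ("woman", "nn")]

early_adj = adj_collocates(early, "woman")
late_adj = adj_collocates(late, "woman")
only_early = distinctive(early_adj, late_adj)
only_late = distinctive(late_adj, early_adj)
```

A simple presence/absence comparison like this is only a starting point; with real data one would normally compare relative frequencies, since a collocate that is rare in one period is more informative than one that is merely missing from a small sample.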

Limitations

  1. Time-delay (whom, like)

  2. Balance (Spanish, COHA)

  3. Spelling variation: notwithstanding, peas, up (others??) // Pt lemmatization


Activity

  1. Find frequency of synonyms

  2. Historical syntax


Google Books

COCA