Nice overview

 

Date

Geography

Size

Content

Notes

SEU ("index" cards)
(Survey of English Usage)

1959-
(1953-87)

UK

1m
(200 x 5000w)

1/2 written, 1/2 spoken

Randolph Quirk
Spoken portion issued in 1980 as the London-Lund Corpus (LLC)

First generation (most in ICAME Collection) (also available from class website)

Brown

1961

US

1m
(500 x 2000w)

All written
75% informative
25% imaginative

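Initial reception: indifference/hostility
Informative: news, religion, popular lore, biography, miscellaneous, learned
Imaginative: general fiction, mystery, SF, adventure/Western, romance, humor
Problems: small size, little regional differentiation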

 

Brown has (on average) 1/450th as many tokens as COCA. Brown counts, where shown, follow each word:
~ rank 10,000 (~1,600 tokens each in COCA): turbulent 4, exaggerated 13, wooded 5, tidy 1, smoky 5, shimmering 3 (average = 5)
~ rank 30,000 (~150 tokens each in COCA): decorous 1, kitschy, faithless, pulsating 3
~ rank 40,000 (~70 tokens each in COCA): scorned, mistrustful, changeless, untalented
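
The scaling behind these figures can be sketched in a few lines of Python. This is only an illustration: it uses the 1/450 ratio and the approximate COCA counts quoted above, and the function and variable names are made up for the example. It assumes a word's relative frequency is the same in both corpora, which real data will not match exactly.

```python
# Rough sketch: scale COCA token counts down to a Brown-sized corpus.
# The ratio and the COCA counts are the approximate figures from the notes;
# the names below are hypothetical, chosen for this example only.

COCA_TO_BROWN_RATIO = 450  # Brown has about 1/450th as many tokens as COCA

# Approximate COCA token counts for a word near each frequency rank
coca_tokens_by_rank = {10_000: 1_600, 30_000: 150, 40_000: 70}


def expected_brown_tokens(coca_tokens, ratio=COCA_TO_BROWN_RATIO):
    """Expected tokens in Brown, assuming equal relative frequency in both corpora."""
    return coca_tokens / ratio


for rank, tokens in coca_tokens_by_rank.items():
    expected = expected_brown_tokens(tokens)
    print(f"rank ~{rank}: about {expected:.2f} expected tokens in Brown")
```

Under that assumption, a word near rank 10,000 is expected only a handful of times in Brown, and words near ranks 30,000 and 40,000 are expected less than once, so many of them will not appear at all.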

LOB

1961
(1970-78)

UK

1m
(500 x 2000w)

Approx same

Approx same

FLOB
FROWN

1991

UK (FLOB) / US (FROWN)

1m each

Approx same

Approx same

Australia
New Zealand
India

1978 India
1986 Aus/NZ

India
NZ/Aus

1m each

Approx same

Approx same (minus Western, SF, romance in Kolhapur)

London-Lund

 

 

500k
(100 x 5000 w)

 

From the SEU
Lots of markup (tone, pause, stress, simultaneous speech)

Second generation "mega corpora"

Cobuild/BoE
(more)

1980s>
 

70% UK
20% US
10% other

7.3m 1982
>500m 2006

25% spoken

Monitor corpus has morphed into Bank of English
Joint commercial/academic
 

BNC (search)

1991-95

UK

100m

Spreadsheet
(download)
90% written
  75% informative
  25% imaginative
10% spoken

Help from British gov't

Spoken:
A) 2000 hours conversation
B) controlled context

ANC

c2000 >

US

~11m

3.2m spoken
8.3m written

 

ICE

1990 >

Many countries
(e.g. UK)

1m each

Overview
600k spoken
400k written

 

COCA 1990-present US 450 million    
Sketch Engine

 


Language acquisition

CHILDES (c1985-; ~20m)
MICASE (Michigan Corpus of Academic Spoken English)
ICLE (International Corpus of Learner English) (1990s>; 2m; 19 countries; 500w essays)
 


Lexicographical

-- Need 500m words to cover rare words (e.g. surcingle)
-- American Heritage Intermediate Corpus; 1969; 5m; kids 7-15
-- Longman Dictionary of Contemporary English (1980s): on tape (sample page)
-- OED: 2.4m quotations
    (The Meaning of Everything; The Professor and the Madman)
-- Many full-text databases (e.g. LitOnline, ASP)
 


Other
-- European Corpus Initiative
 


Spoken

-- Lancaster/IBM Spoken English Corpus (SEC) 1984-87; 52k; UK
-- CANCODE (Cambridge and Nottingham Corpus of Discourse in English); 5m
-- Wellington Corpus of Spoken New Zealand English (1988-93); 1m
-- COLT: Corpus of London Teenage Language

Linguistic Data Consortium: membership-based; collects corpora used by programmers, e.g. for speech recognition; transcribed orthographically and phonetically, with time stamps. Examples:

 


Historical (1500s-1900s)
-- Helsinki (1984-91); 1.5m (OE-1700s)
-- ARCHER
-- Fairly complete listing
 
 speech, historical, web, parallel