Corpus

Date

Geography

Size

Content

Notes

SEU

1959-
(1953-87)

UK

1m
(200 x 5000w)

1/2 written, 1/2 spoken

Randolph Quirk
Spoken half released in 1980 as the London-Lund Corpus (LLC)

First generation (most in ICAME Collection)

Brown

1961

US

1m
(500 x 2000w)

All written
75% informative
25% imaginative

Met with indifference/hostility at the time
Informative: news, religion, popular lore, biography, misc, learned
Imaginative: general fiction, mystery, SF, adventure/Western, romance, humor
Problems: size, little regional differences

LOB

1961
(1970-78)

UK

1m
(500 x 2000w)

Approx same

Approx same

FLOB
FROWN

1991 (FLOB)
1992 (FROWN)

UK/US

1m each

Approx same

Approx same

Australia
New Zealand
India

1978 India
1986 Aus/NZ

India
NZ/Aus

1m each

Approx same

Approx same (-Western, SF, romance in Kolhapur)

London-Lund

 

 

500k
(100 x 5000w)

 

From the SEU
Lots of prosodic markup (tone, pause, stress, simultaneous speech)

Second generation "mega corpora"

Cobuild/BoE

1980s>
 

70% UK
20% US
10% other

7.3m 1982
>500m 2006

25% spoken

Monitor corpus has morphed into Bank of English
Joint commercial/academic
 

ICE

1990>

Many countries
(e.g. UK)

1m each

600k spoken
400k written

 

BNC

1991-95

UK

100m

90% written
  75% informative
  25% imaginative
10% spoken

Help from British gov't

Spoken:
A) 2000 hours of conversation
B) controlled (context-governed) settings

ANC

c2000>

US

~11m

3.2m spoken
8.3m written

 

 
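The sizes in the table follow from the sampling designs (number of samples x words per sample) and the stated percentage splits. A minimal Python sketch of that arithmetic, using only figures from the table; the function and variable names are illustrative and not part of any corpus toolkit:

```python
# Sampling designs from the table: (number of text samples, words per sample).
DESIGNS = {
    "SEU": (200, 5000),
    "Brown": (500, 2000),
    "LOB": (500, 2000),
    "London-Lund": (100, 5000),
}


def corpus_size(samples: int, words_per_sample: int) -> int:
    """Total words implied by a fixed-length sampling design."""
    return samples * words_per_sample


def split(total_words: int, shares: dict) -> dict:
    """Split a word total into parts; shares are whole-number percentages."""
    return {name: total_words * pct // 100 for name, pct in shares.items()}


sizes = {name: corpus_size(*design) for name, design in DESIGNS.items()}

# BNC: 100m words, 90% written / 10% spoken;
# the written part is 75% informative / 25% imaginative.
bnc = split(100_000_000, {"written": 90, "spoken": 10})
bnc_written = split(bnc["written"], {"informative": 75, "imaginative": 25})

for name, size in sizes.items():
    print(f"{name}: {size:,} words")
print("BNC:", bnc)
print("BNC written:", bnc_written)
```

The same two helpers cover the other rows, e.g. ICE at 600k spoken + 400k written = 1m per country.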

Language acquisition

CHILDES (c1985-; ~20m)
MICASE (Michigan Corpus of Academic Spoken English)
ICLE (International Corpus of Learner English; 1990s>, 2m; 19 countries, 500w essays)

Lexicographical

-- Need 500m words (surcingle)
-- American Heritage Intermediate Corpus; 1969; 5m; kids 7-15
-- Longman Dictionary of Contemporary English (1980s): on tape
-- OED: 2.4m quotations
    (The Meaning of Everything; The Professor and the Madman)
-- Many full-text databases (e.g. LitOnline, ASP)
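
Why size matters for lexicography: a rare word such as "surcingle" may not turn up at all in a 1m-word corpus. A rough Python sketch of the expected number of hits at different corpus sizes; the rate used below is a made-up placeholder, not a measured frequency for any word, and the function name is illustrative:

```python
def expected_hits(rate_per_million: float, corpus_words: int) -> float:
    """Expected occurrences of a word, given its rate per million words."""
    return rate_per_million * corpus_words / 1_000_000


# ASSUMPTION: hypothetical rate for a rare word, in occurrences per million words.
ASSUMED_RATE = 0.02

CORPORA = [
    ("1m (Brown-sized)", 1_000_000),
    ("100m (BNC-sized)", 100_000_000),
    ("500m", 500_000_000),
]

for label, words in CORPORA:
    hits = expected_hits(ASSUMED_RATE, words)
    print(f"{label}: about {hits:g} expected occurrences")
```

Under that assumed rate, a first-generation corpus would usually contain no examples at all, while a 500m-word corpus would give a lexicographer a handful of citations to work from.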

Other
-- European Corpus Initiative

Spoken

-- Lancaster/IBM Spoken English Corpus (SEC) 1984-87; 52k; UK
-- Wellington Corpus of Spoken NZ English (1988-93); 1m
-- COLT: Corpus of London Teenage Language

Linguistic Data Consortium (LDC): membership-based; collects and distributes corpora; used by programmers, e.g. for speech recognition; speech is transcribed orthographically and phonetically, with time stamps.

 

Historical (1500s-1900s)
-- Helsinki (1984-91); 1.5m (OE-1700s)