Using the web as corpus
1. Size of the web
Simple ratios:
With linguistic data:
| (A) Freq in BNC | (C) Freq in Google |
| (B) Size of BNC | (D) Size of Google |
If a word is equally frequent in both, A / B = C / D, so:
Formula: D = (B x C) / A
Words with frequency of 80 in the BNC:
annealing, appraise, archivists, asthmatic, attractor, backswing,
bedsit, blameless, boatman, bogwood, botanists, buggered, buggery,
burgundy, calmness, caramel, castigated, clawing, clench, clings,
coachman, colectomy, collectivist, compensations, congressman,
conjunctions, conquerors, contaminants, contemplates, contrive,
controversially, countervailing, cruisers, deathbed, decays,
diplomatically, domineering, dominoes, eccentricities, empties,
eugenics, exigencies, ferociously, forested, frontline, garnished,
glade, gnawed, gradation, grieved, gulping, handheld, headroom, heathen,
heaviness, hideously, holocaust, icily, impressionistic, inimitable,
innuendo, irradiated, joyfully, kayak, keg, landowning, largesse,
latched, lexis, luminosity, lumped, luv, lysosomal, malfunction, meaty,
medreses, memorably, mesenchyme, mettle, misinterpretation, national,
neoplasia, netball, neutrophil, nibble, obliges, overseer, participates,
pate, pelmet, perfused, personification, po, polluter, predilection,
promontory, purist, quays, rabies, racehorse, raindrops, reciprocated,
redoubtable, refunds, resurrect, reversals, ri, sacrosanct, sepsis,
sheathed, shopfloor, slackened, slats, slimmer, smithy, snubbed, sporty,
staid, steers, strangling, suns, swallows, sympathize, tans, tarpaulin,
unbelief, unfailing, ungainly, unselfish, unsubstantiated, untied,
variceal, warlike, weaned, whoop, wicketkeeper, woodworm
Example: garnished
| (A) 80 | (C) 1,750,000 |
| (B) 100,000,000 | (D) 2,187,500,000,000 (~2.2 trillion) |
Formula: D = (B x C) / A
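The arithmetic in the garnished example can be checked with a minimal Python sketch. The function name `estimate_web_size` and its parameter names are assumptions made for illustration; only the formula and the four figures come from the notes.

```python
def estimate_web_size(freq_in_bnc: int, size_of_bnc: int, freq_in_google: int) -> float:
    """Estimate (D), the size of the search engine index in words, as (B x C) / A."""
    if freq_in_bnc <= 0:
        # The ratio is undefined for words that never occur in the BNC.
        raise ValueError("the word must occur at least once in the BNC")
    return size_of_bnc * freq_in_google / freq_in_bnc


# Figures for "garnished": A = 80, B = 100,000,000, C = 1,750,000.
estimate = estimate_web_size(
    freq_in_bnc=80,
    size_of_bnc=100_000_000,
    freq_in_google=1_750_000,
)
print(f"{estimate:,.0f}")  # 2,187,500,000,000
```

The result depends entirely on the assumption that the word is as frequent on the web as in the BNC, which is why the notes go on to ask about register and dialect.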
2. Which register?
Too many occurrences in the BNC register = estimate of Web size too low
Too few occurrences in the BNC register = estimate of Web size too high
furtive
| Register | (A) Freq in BNC register | (C) Freq in Google | (B) Size of BNC register | (D) Estimated size of Google |
|---|---|---|---|---|
| SPOKEN | 1 | 1,990,000 | 10,335,000 | 20,566,650,000,000 |
| FICTION | 98 | 1,990,000 | 16,195,000 | 328,857,653,061 |
| NEWSPAPER | 8 | 1,990,000 | 10,638,000 | 2,646,202,500,000 |
| ACADEMIC | 5 | 1,990,000 | 15,430,000 | 6,141,140,000,000 |
| OTHER NF | 7 | 1,990,000 | 16,634,000 | 4,728,808,571,429 |
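To see how strongly the choice of register moves the estimate, the furtive figures above can be run through the same (B x C) / A formula for each register. This is only a sketch: the names `BNC_REGISTERS`, `GOOGLE_FREQ` and `estimates_by_register` are illustrative assumptions, while the numbers are those given in the notes.

```python
# (register, frequency of "furtive" in that BNC register, size of the register in words)
BNC_REGISTERS = [
    ("SPOKEN", 1, 10_335_000),
    ("FICTION", 98, 16_195_000),
    ("NEWSPAPER", 8, 10_638_000),
    ("ACADEMIC", 5, 15_430_000),
    ("OTHER NF", 7, 16_634_000),
]
GOOGLE_FREQ = 1_990_000  # frequency of "furtive" in Google, as given in the notes


def estimates_by_register(registers, google_freq):
    """Apply (B x C) / A to each register and round to whole words."""
    return {
        name: round(size * google_freq / freq)
        for name, freq, size in registers
    }


estimates = estimates_by_register(BNC_REGISTERS, GOOGLE_FREQ)
for name, estimate in estimates.items():
    print(f"{name:<10} {estimate:>22,}")

# Ratio between the largest and smallest estimate across registers.
spread = max(estimates.values()) / min(estimates.values())
print(f"largest / smallest estimate: {spread:.1f}x")
```

The single spoken token gives by far the largest estimate and the 98 fiction tokens the smallest, which is the "too few" versus "too many" effect described above.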
Try some very colloquial phrases -- like "so not", "he's all worried", etc.
3. What dialect (geographical)?
British English
Australian English
Country domains work for UK, AU, NZ, but what about US? .EDU? .US? .COM?
4. What types of queries?
Possibly for frequency of words and phrases, but what about:
- Part of speech
- Lemma
- Concordances
- Collocates
5. Meaningful comparisons
Other "non-corpus corpora"
-- To calculate per million:
FREQ*(1,000,000/SIZE)
A. General Conference (via library.lds.org; Advanced / Magazine / Conference reports)
- 65,000 words per conference (e.g. April 2005)
- About 4,500,000 words total (1971-2005; 65,000 x 70)
- But the search engine shows only the number of talks containing the word, so multiple occurrences within a given talk are not counted
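The gap between "number of talks" and "number of tokens" can be illustrated with a small Python sketch. The three texts below are hypothetical toy examples invented for illustration, not real conference talks, and `count_tokens` is an assumed helper name.

```python
import re


def count_tokens(word: str, text: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical toy "talks", invented for illustration only.
talks = [
    "Grace is a gift. By grace we are changed, and grace abides.",
    "We speak today of faith and of works.",
    "His grace is sufficient.",
]

# What a corpus query reports: every occurrence of the word.
token_hits = sum(count_tokens("grace", talk) for talk in talks)
# What the search engine reports: talks containing the word at least once.
talk_hits = sum(1 for talk in talks if count_tokens("grace", talk) > 0)

print(talk_hits, token_hits)  # 2 4
```

A count of talks is therefore a lower bound on the token frequency, so a per-million figure computed from it will understate the true rate.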
| grace | BNC-Sermons (S_sermon) | General Conference |
|---|---|---|
| Tokens | 42 | 249 |
| Size | 82,000 | 4,500,000 |
| Per million | 510 | 55 |
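The per-million normalization, FREQ*(1,000,000/SIZE), can be applied to the grace figures with a short Python sketch. The function name `per_million` is an assumption for illustration; the token counts and sizes are those in the table, which reports the results rounded (510 and 55).

```python
def per_million(freq: int, size: int) -> float:
    """Normalize a raw frequency to occurrences per million words."""
    if size <= 0:
        raise ValueError("corpus size must be positive")
    return freq * (1_000_000 / size)


bnc_sermons = per_million(42, 82_000)
general_conference = per_million(249, 4_500_000)

print(f"BNC-Sermons:        {bnc_sermons:.1f}")         # 512.2
print(f"General Conference: {general_conference:.1f}")  # 55.3
```

Normalizing both counts to the same base is what makes an 82,000-word sample comparable with a 4,500,000-word one.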
B. Newspaper corpus (New York Times; BYU list)
Look for "new words" in the Oxford English Dictionary
| TIME PERIOD | Millions of words |
|---|---|
| 1850-1899 | 39 |
| 1900-1949 | 127 |
| 1950-1999 | 158 |
| TOTAL | 323 |