Using the web as corpus

1. Size of the web

Simple ratios:

3 : 9 = 4 : ??

With linguistic data:

(A) Freq in BNC      (C) Freq in Google
(B) Size of BNC      (D) Size of Google (unknown)

Assumption: the word has the same relative frequency in both, so A / B = C / D

Formula: D = (B x C) / A

Words with frequency of 80 in the BNC:

annealing, appraise, archivists, asthmatic, attractor, backswing, bedsit, blameless, boatman, bogwood, botanists, buggered, buggery, burgundy, calmness, caramel, castigated, clawing, clench, clings, coachman, colectomy, collectivist, compensations, congressman, conjunctions, conquerors, contaminants, contemplates, contrive, controversially, countervailing, cruisers, deathbed, decays, diplomatically, domineering, dominoes, eccentricities, empties, eugenics, exigencies, ferociously, forested, frontline, garnished, glade, gnawed, gradation, grieved, gulping, handheld, headroom, heathen, heaviness, hideously, holocaust, icily, impressionistic, inimitable, innuendo, irradiated, joyfully, kayak, keg, landowning, largesse, latched, lexis, luminosity, lumped, luv, lysosomal, malfunction, meaty, medreses, memorably, mesenchyme, mettle, misinterpretation, national, neoplasia, netball, neutrophil, nibble, obliges, overseer, participates, pate, pelmet, perfused, personification, po, polluter, predilection, promontory, purist, quays, rabies, racehorse, raindrops, reciprocated, redoubtable, refunds, resurrect, reversals, ri, sacrosanct, sepsis, sheathed, shopfloor, slackened, slats, slimmer, smithy, snubbed, sporty, staid, steers, strangling, suns, swallows, sympathize, tans, tarpaulin, unbelief, unfailing, ungainly, unselfish, unsubstantiated, untied, variceal, warlike, weaned, whoop, wicketkeeper, woodworm

Example: garnished

(A) 80               (C) 1,750,000
(B) 100,000,000      (D) 2,187,500,000,000 (~2.2 trillion)

Formula: D = (B x C) / A = (100,000,000 x 1,750,000) / 80
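
The same arithmetic can be sketched in Python. This is only an illustration: the function name estimate_web_size is hypothetical, and it assumes the word is equally frequent (per word of text) in the BNC and on the Web.

```python
def estimate_web_size(freq_bnc, size_bnc, freq_google):
    """Return D = (B x C) / A, the estimated number of words indexed by Google."""
    if freq_bnc <= 0:
        raise ValueError("freq_bnc must be positive")
    return size_bnc * freq_google / freq_bnc


# Figures for "garnished" taken from the example above.
estimate = estimate_web_size(freq_bnc=80, size_bnc=100_000_000, freq_google=1_750_000)
print(f"{estimate:,.0f}")  # 2,187,500,000,000
```

Running the function over several of the frequency-80 words and comparing the results would show how stable the estimate is.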

2. Which register?

Word too frequent in the chosen BNC register = estimate of Web size is too low
Word too rare in the chosen BNC register = estimate of Web size is too high

Example: furtive

REGISTER     (A) Freq in register   (B) Size of register   (C) Freq in Google   (D) Estimated size of Google
SPOKEN       1                      10,335,000             1,990,000            20,566,650,000,000
FICTION      98                     16,195,000             1,990,000            328,857,653,061
NEWSPAPER    8                      10,638,000             1,990,000            2,646,202,500,000
ACADEMIC     5                      15,430,000             1,990,000            6,141,140,000,000
OTHER NF     7                      16,634,000             1,990,000            4,728,808,571,429
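
The per-register estimates for "furtive" can be reproduced with a short Python sketch. The names BNC_REGISTERS and estimates_by_register are hypothetical; the figures are the ones listed above.

```python
# (register, frequency of "furtive" in that register, size of the register in words)
BNC_REGISTERS = [
    ("SPOKEN", 1, 10_335_000),
    ("FICTION", 98, 16_195_000),
    ("NEWSPAPER", 8, 10_638_000),
    ("ACADEMIC", 5, 15_430_000),
    ("OTHER NF", 7, 16_634_000),
]
GOOGLE_FREQ = 1_990_000  # Google frequency of "furtive"


def estimates_by_register(registers, freq_google):
    """Apply D = (B x C) / A to each register separately."""
    return {name: round(size * freq_google / freq) for name, freq, size in registers}


estimates = estimates_by_register(BNC_REGISTERS, GOOGLE_FREQ)
for name, est in estimates.items():
    print(f"{name:<10} {est:>22,}")

# Ratio between the highest and lowest estimate across registers.
spread = max(estimates.values()) / min(estimates.values())
print(f"highest / lowest = {spread:.1f}")
```

The highest estimate (SPOKEN) is more than sixty times the lowest (FICTION), which suggests how strongly the choice of register drives the result.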

Try some very colloquial phrases, such as "so not" or "he's all worried".

3. What dialect (geographical)?

British English
Australian English

Can limit searches to .UK, .AU, .NZ domains, but what about US English? .EDU? .US? .COM?

4. What types of queries?

Possibly for frequency of words and phrases, but what about:

  • Part of speech

  • Lemma

  • Concordances

  • Collocates

5. Meaningful comparisons

  • Can't just give raw frequency for one feature in Corpus1 vs raw frequency in Corpus2

  • Have to do one of the following:

    • "Normalize" frequency (e.g. per million words) in the two corpora, OR

    • Compare two features within each corpus (e.g. "at hospital" vs. "at the hospital")
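
The second option (comparing two features inside each corpus) can be sketched as follows. The counts are placeholders invented for illustration, NOT real corpus data, and the function name share_of_first is hypothetical.

```python
def share_of_first(count_first, count_second):
    """Proportion of the first variant among all occurrences of both variants."""
    total = count_first + count_second
    if total == 0:
        raise ValueError("neither variant occurs in the corpus")
    return count_first / total


# Placeholder counts (NOT real data): a small corpus and a much larger one.
corpus1 = {"at hospital": 30, "at the hospital": 70}
corpus2 = {"at hospital": 2_000, "at the hospital": 98_000}

share1 = share_of_first(corpus1["at hospital"], corpus1["at the hospital"])
share2 = share_of_first(corpus2["at hospital"], corpus2["at the hospital"])

# The raw counts are not comparable across corpora, but the shares are.
print(f"Corpus1: {share1:.0%} 'at hospital'")
print(f"Corpus2: {share2:.0%} 'at hospital'")
```

Because each share is computed inside a single corpus, the size of the corpus never has to be known, which is convenient for the Web.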


Other "non-corpus corpora"

-- To calculate per million: FREQ*(1,000,000/SIZE)

A. General Conference (via library.lds.org; Advanced / Magazine / Conference reports)

  • 65,000 words per conference (e.g. April 2005)

  • About 4,500,000 words total (1971-2005; 65,000 x 70)

  • But search engine just shows number of talks; problem with multiple occurrences in given talk
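
The gap between "number of talks" and "number of occurrences" can be illustrated with a small Python sketch. The three texts are toy examples written for this illustration, not real talks, and the function name is hypothetical.

```python
import re


def token_and_document_counts(word, documents):
    """Return (total occurrences of word, number of documents containing it)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    per_document = [len(pattern.findall(doc)) for doc in documents]
    tokens = sum(per_document)
    documents_with_word = sum(1 for n in per_document if n > 0)
    return tokens, documents_with_word


# Toy texts (NOT real talks).
toy_talks = [
    "Grace is a gift. By grace we are changed, and grace endures.",
    "This talk mentions grace once.",
    "This talk does not use the word at all.",
]

tokens, talks = token_and_document_counts("grace", toy_talks)
print(tokens, talks)  # 4 2
```

A search engine that reports only the second number undercounts whenever a word is repeated within a talk.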

grace          BNC-Sermons (S_sermon)   General Conference
Tokens         42                       249
Size           82,000                   4,500,000
Per million    510                      55
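
The per-million figures follow from FREQ*(1,000,000/SIZE). A minimal Python sketch, with the hypothetical function name per_million and the token counts and sizes from the table above:

```python
def per_million(freq, size):
    """Normalize a raw frequency to occurrences per million words."""
    if size <= 0:
        raise ValueError("size must be positive")
    return freq * (1_000_000 / size)


bnc_sermons = per_million(42, 82_000)
general_conference = per_million(249, 4_500_000)

print(f"BNC-Sermons:        {bnc_sermons:.1f} per million")
print(f"General Conference: {general_conference:.1f} per million")
```

The results agree with the rounded values in the table (about 510 and 55), so "grace" is roughly nine times as frequent per million words in the BNC sermons.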

B. Newspaper corpus (New York Times; BYU list)

Look for "new words" in the Oxford English Dictionary

TIME PERIOD    Millions of words
1850-1899      39
1900-1949      127
1950-1999      158
TOTAL          323