CORPORA - UNSTRUCTURED

(Additional page)


Kilgarriff, Adam and Gregory Grefenstette (2003) “Introduction to the Special Issue on the Web as Corpus”. Computational Linguistics 29(3):333-347

1. Is the Web a corpus? Explain why or why not.
2. Aren't "mega-corpora" like the BNC big enough? Why use anything larger? (VIEW: Top 3000 [vvi] in NEWS)
3a. How big is the Web, in terms of number of words?
3b. How much has it grown over the past few years?
3c. How is it distributed across languages?
4a. How do the authors respond to the objection that the Web isn't really representative of "Language X"?
4b. How do the authors respond to the objection that the Web has too many errors? (Common errors)


SketchEngine (was BootCat)

Linguist's Search Engine

WebCorp

Query Google

KwicFinder

Grab-A-Site and Google results


Google's Digital Library Initiative
Update (1/20/2007)
New Yorker article
Content tagging of web

CORPORA discussion of Google counts (#1 #2)
Comparing counts on different Google servers (#1, #2)

Creating corpora from text archives


SIZE OF THE WEB (Spanish):

phrase         CdE  Google 04          Size 04  Google 06          Size 06  Google 08          Size 08
han dicho      355    412,000   23,211,267,606  2,460,000  138,591,549,296  5,970,000  336,338,028,169
pusieron       606    467,000   15,412,541,254  1,660,000   54,785,478,548  6,490,000  214,191,419,142
hasta que se   410    490,000   23,902,439,024  1,750,000   85,365,853,659  1,030,000   50,243,902,439
son más        666    941,000   28,258,258,258  3,210,000   96,396,396,396  7,310,000  219,519,519,520
Average                         22,696,126,536             93,784,819,475            205,073,217,318
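
The Size figures in the table appear to scale each Google hit count by the phrase's frequency in the reference corpus (the CdE column): Size = Google hits / CdE frequency x corpus size. The page does not state the corpus size. The 20,000,000 words used below is an assumption inferred from the table, since it is the value that reproduces the Size columns. The function and variable names are illustrative only. A minimal sketch of the 2004 column:

```python
# Assumed size (in words) of the reference corpus behind the CdE column.
# This value is NOT stated on the page; it is inferred because it
# reproduces the Size columns of the table.
REFERENCE_CORPUS_WORDS = 20_000_000


def estimate_web_size(corpus_freq, web_hits, corpus_words=REFERENCE_CORPUS_WORDS):
    """Extrapolate the number of words on the Web from one phrase.

    If a phrase occurs corpus_freq times in a corpus of corpus_words
    words and gets web_hits hits on the Web, the Web is estimated to
    hold web_hits / corpus_freq times as many words as the corpus.
    """
    return round(web_hits * corpus_words / corpus_freq)


# phrase -> (CdE frequency, Google 2004 hits), copied from the table
rows_2004 = {
    "han dicho": (355, 412_000),
    "pusieron": (606, 467_000),
    "hasta que se": (410, 490_000),
    "son más": (666, 941_000),
}

estimates_2004 = {
    phrase: estimate_web_size(freq, hits)
    for phrase, (freq, hits) in rows_2004.items()
}

# The final row of the table matches the mean of the per-phrase
# estimates, computed here from the unrounded values.
average_2004 = round(
    sum(hits * REFERENCE_CORPUS_WORDS / freq for freq, hits in rows_2004.values())
    / len(rows_2004)
)

for phrase, size in estimates_2004.items():
    print(f"{phrase:<13}{size:>17,}")
print(f"{'Average':<13}{average_2004:>17,}")
```

Passing the 2006 or 2008 hit counts to the same function gives the other two Size columns. The estimates for a single year differ considerably from phrase to phrase, which is presumably why the table averages over several phrases rather than relying on one.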