CORPORA - UNSTRUCTURED (Additional page)
Kilgarriff, Adam and Gregory Grefenstette (2003) “Web as corpus”.
Computational Linguistics 29:1-15
1. Is the Web a corpus? Explain why or why not.
2. Aren't "mega-corpora" like the BNC big enough? Why use anything larger?
(VIEW: Top 3000 [vvi] in NEWS)
3a. How big is the Web, in terms of number of words?
3b. How much has it grown over the past few years?
3c. How is it distributed across languages?
4a. How do the authors respond to the objection that the Web isn't really
representative of "Language X"?
4b. How do the authors respond to the objection that the Web has too many
errors? (Common errors)
SketchEngine (was BootCat)
Linguist's Search Engine
WebCorp
Query Google
KwicFinder
Grab-A-Site and Google results
Google's Digital Library Initiative
Update (1/20/2007)
New Yorker article
Content tagging of web
CORPORA discussion of Google counts (#1, #2)
Comparing counts on different Google servers (#1, #2)
Creating corpora from text archives
SIZE OF THE WEB (Spanish):
| phrase | CdE | Google 04 | Size 04 | Google 06 | Size 06 | Google 08 | Size 08 |
| han dicho | 355 | 412,000 | 23,211,267,606 | 2,460,000 | 138,591,549,296 | 5,970,000 | 336,338,028,169 |
| pusieron | 606 | 467,000 | 15,412,541,254 | 1,660,000 | 54,785,478,548 | 6,490,000 | 214,191,419,142 |
| hasta que se | 410 | 490,000 | 23,902,439,024 | 1,750,000 | 85,365,853,659 | 1,030,000 | 50,243,902,439 |
| son más | 666 | 941,000 | 28,258,258,258 | 3,210,000 | 96,396,396,396 | 7,310,000 | 219,519,519,520 |
| | | | 22,696,126,536 | | 93,784,819,475 | | 205,073,217,318 |
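The "Size" columns appear to follow a simple proportion: (Google hits / CdE frequency) x the number of words in the reference corpus. The page does not state the size of that reference corpus; the figures in the table are consistent with 20 million words, so the sketch below treats that value as an assumption. The unlabeled last row matches the mean of the four per-phrase estimates for each year. The function and variable names are illustrative only.

```python
# Assumption: the reference sample behind the "CdE" column holds 20 million
# words. The page does not say so; the table's figures are consistent with it.
REFERENCE_CORPUS_WORDS = 20_000_000

# (phrase, CdE frequency, Google hits in 2004, 2006, 2008), copied from the table.
ROWS = [
    ("han dicho", 355, 412_000, 2_460_000, 5_970_000),
    ("pusieron", 606, 467_000, 1_660_000, 6_490_000),
    ("hasta que se", 410, 490_000, 1_750_000, 1_030_000),
    ("son más", 666, 941_000, 3_210_000, 7_310_000),
]
YEARS = (2004, 2006, 2008)


def divide_round_half_up(numerator: int, denominator: int) -> int:
    """Integer division rounded to the nearest whole number, halves rounded up."""
    return (2 * numerator + denominator) // (2 * denominator)


def estimate_web_size(corpus_freq: int, web_hits: int,
                      corpus_words: int = REFERENCE_CORPUS_WORDS) -> int:
    """Scale the reference corpus by the ratio of web hits to corpus hits."""
    return divide_round_half_up(web_hits * corpus_words, corpus_freq)


def mean_estimate(year_index: int) -> int:
    """Average the rounded per-phrase estimates for one year (0 = 2004)."""
    estimates = [estimate_web_size(row[1], row[2 + year_index]) for row in ROWS]
    return divide_round_half_up(sum(estimates), len(estimates))


if __name__ == "__main__":
    for index, year in enumerate(YEARS):
        print(year, f"{mean_estimate(index):,}")
```

Note that the estimate for "hasta que se" falls between 2006 and 2008 while the other three rise, which is a reminder that search-engine hit counts are unstable and that any single phrase gives only a rough estimate.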