PROJECT 3: WEB AS CORPUS
Send the assignment to me, with "LING 485: Project 3" in the subject line
(including the colon and the right capitalization).
Also please name your Word file in the following way: Project
+ underscore + uncapitalized lastname, e.g.
There are three different possibilities
for this project (A, B, and C). The first one (A) isn't very tech-y; the
other two are a bit more technical. But if you do B or C
correctly, you'll get 10/10 on the project (not a very common
A. DEFAULT OPTION:
Do ONE of the
following. Note that #1-2 (and possibly #3, unless you really spend
a long time on it) are very easy compared to everything
else, and the maximum possible score on these would probably
be B+ (kind of like diving -- easier dive = lower possible
Pick a number of words that are often misspelled, that vary in spelling,
or that have correct/incorrect usages (see
2). How do the differences vary across
different domains (e.g. .edu, .com, .gov, .uk, .ca, etc.)? Take a brief sampling
of the types of pages (company websites, individuals' own homepages, etc.) that
have the "aberrant" usages/spellings.
Choose a "non-standard" grammatical construction and see how it varies
across domains or websites. Use site:domain in the search field, e.g.
"so not happy" site:.edu or "had went" site:reddit.com. Don't just search
in one domain / website, but rather compare between different domains /
websites.
Examples (feel free to choose some for another language, if you want):
I'm like so, so not interested, who to talk with / with whom to talk, had went / had saw
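If you are comparing several constructions across several domains, it can help to generate the query strings rather than typing each one. Here is a minimal sketch in Python; the function name build_site_queries and the particular phrases and sites are just illustrations, so swap in your own. It only builds the strings, and you still paste them into the search field yourself.

```python
def build_site_queries(phrases, sites):
    """Return one exact-phrase query per (phrase, site) pair."""
    queries = []
    for phrase in phrases:
        for site in sites:
            # Quotes force an exact-phrase match; site: restricts the domain.
            queries.append(f'"{phrase}" site:{site}')
    return queries


# Illustrative values only: pick your own constructions and domains.
phrases = ["so not happy", "had went"]
sites = [".edu", ".gov", "reddit.com"]

for query in build_site_queries(phrases, sites):
    print(query)
```

Running every phrase against every site gives you a full grid, which makes it easier to lay the hit counts out in a table afterwards.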
3. Semantics / collocates
Pick a relatively uncommon word (most such words will still have tens of
thousands of hits on the Web). Maybe use
(and then frequency list and then starting range, e.g. 13,980 -- and don't take words with hyphens). Sample the first 70-80 occurrences
from the Google results and see
what words collocate most frequently. How does this compare to COCA or the BNC?
Explain exactly how you extracted the collocates from Google.
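One possible way to get collocates from the Google results is to copy the 70-80 result snippets into a list and count the words that fall within a few words of your node word. The sketch below assumes exactly that: the snippets are pasted in by hand, the window is four words on each side, and the small stopword list and English-only tokenizer are my own simplifications. None of this is a required method, and whatever you do, you still have to describe it in your write-up.

```python
import re
from collections import Counter

# A deliberately tiny stopword list (an assumption; extend it as needed).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "was"}


def collocates(snippets, node, window=4, stopwords=STOPWORDS):
    """Count words occurring within `window` tokens of `node`."""
    counts = Counter()
    node = node.lower()
    for text in snippets:
        # Simple English-only tokenizer: lowercase letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, token in enumerate(tokens):
            if token != node:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i and tokens[j] not in stopwords:
                    counts[tokens[j]] += 1
    return counts


# Made-up snippets, for illustration only.
snippets = [
    "The quagmire of debt deepened",
    "a political quagmire deepened further",
]
for word, freq in collocates(snippets, "quagmire", window=3).most_common(5):
    print(word, freq)
```

Raw counts like these are only a starting point; COCA and the BNC can also rank collocates by association scores such as Mutual Information, so keep that in mind when you compare.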
Choose a feature (lexical, grammatical, or whatever) that you know
or suspect differs from one country to
another. Use Google Settings / Advanced Search and then "Region".
Note that you absolutely have to account for the overall size of the
"corpus" for that country (i.e. the number of web pages), by using a
"Control word" (as we did in class on Thursday). If you don't do
that, I'll return it to you ungraded. Here's an
example for a single word. I'm expecting you to do 4-5 related
Control word: put
per 1,000 put
= 218% more than UK
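The arithmetic behind the control word can be sketched as follows. The hit counts here are invented purely to show the calculation (they are not real Google figures), and I am assuming that "X% more than" means the difference between the two normalized rates divided by the lower one.

```python
def per_thousand(feature_hits, control_hits):
    """Hits for the feature per 1,000 hits of the control word."""
    return 1000 * feature_hits / control_hits


def percent_more(rate_a, rate_b):
    """How much larger rate_a is than rate_b, as a percentage of rate_b."""
    return 100 * (rate_a - rate_b) / rate_b


# Invented counts for illustration only: (feature hits, control-word hits).
hits = {
    "US": (60_000, 20_000_000),
    "UK": (10_000, 10_000_000),
}

rates = {country: per_thousand(f, c) for country, (f, c) in hits.items()}
for country, rate in rates.items():
    print(f"{country}: {rate:.2f} per 1,000 control-word hits")

print(f"US is {percent_more(rates['US'], rates['UK']):.0f}% more than UK")
```

The point of dividing by the control word is that the raw US count (60,000 vs. 10,000 in this made-up case) overstates the difference whenever the US "corpus" is simply bigger.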
OPTION B. Use
to create a corpus for a topic of your choice. Send me a one
paragraph description of how it worked out, as well as a
list of the URLs. Also, ZIP and store the actual web pages
on the cloud (e.g. Google Drive or Dropbox) and send me a link.
OPTION C. (Only
try this one if you are already really familiar with regular
expressions.) Use regular expressions and a web scraper (e.g.
get 10,000 links from Google for a particular topic or
country or language. Then download the actual pages. Send me
an email with an attachment listing all 10,000 URLs (a
simple text file is fine), and one paragraph telling me how
you got the data. Then ZIP all of the web pages and put them
on the cloud (e.g. Google Drive or Dropbox) and send me a link.
Google search (from Google News). For a language other than
English, replace [of] and [en] below, and adjust [US] for
the desired countries. For a range of dates, run it several
times, with different values for the part in green (e.g.
9/20/2015, 9/21/2015, 9/22/2015, etc).
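If you need a long run of dates, you can generate them instead of typing each one. A minimal sketch, assuming the M/D/YYYY format shown in the examples above (no leading zeros); the function name date_strings is my own:

```python
from datetime import date, timedelta


def date_strings(start, end):
    """Return every date from start to end (inclusive) as M/D/YYYY."""
    days = (end - start).days
    return [
        f"{d.month}/{d.day}/{d.year}"
        for d in (start + timedelta(days=n) for n in range(days + 1))
    ]


for s in date_strings(date(2015, 9, 20), date(2015, 9, 22)):
    print(s)
```

Each string can then be substituted for the date portion of the search before you run it again.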
regular expressions to extract the URLs. For example,
which will put all of the URLs at the beginning of the line.
Then extract lines beginning with @@ and save these as a
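The same two-step idea (mark each URL with @@ at the start of a line, then keep only the @@ lines) can be sketched in Python. The pattern below is an assumption on my part: it looks for plain href="http..." attributes, and the saved results pages may wrap links differently, in which case you would need to adjust it.

```python
import re

# Assumed pattern: an href attribute holding an absolute http(s) URL.
HREF = re.compile(r'href="(https?://[^"]+)"')


def mark_urls(html):
    """Move each URL onto its own line, prefixed with @@."""
    return HREF.sub(lambda m: "\n@@" + m.group(1) + "\n", html)


def extract_marked(text):
    """Keep only the @@ lines, strip the marker, and drop duplicates."""
    urls = [line[2:] for line in text.splitlines() if line.startswith("@@")]
    return list(dict.fromkeys(urls))  # preserves first-seen order


# A made-up fragment standing in for a saved results page.
page = ('<a href="http://example.com/a">A</a> text '
        '<a href="https://example.org/b?x=1">B</a> '
        '<a href="http://example.com/a">A again</a>')

for url in extract_marked(mark_urls(page)):
    print(url)
```

Writing the resulting list to a text file, one URL per line, gives you the file you need for the download step.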
Create a batch file (e.g. getData.bat), where the only thing
in the batch file is the following (replace the part in
yellow with the file containing the URLs, and the part in
green with the destination folder for the downloaded pages):
"c:\Program Files\WinHTTrack\"httrack -qgC2%%P0s0b0u1j0%%s%%u%%I0p3DaK0T20R1H3%%kA25000%%f#f
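If you are not on Windows or can't get HTTrack running, a rough alternative is to download the pages with a short Python script. This is only a sketch under my own assumptions, not a replacement for the batch file above: it fetches each page once with the standard library, names the files by a hash of the URL, and does not follow links, retry, or honor any of the HTTrack options. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def fetch_url(url, timeout=20):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def filename_for(url):
    """Build a safe, stable file name from the URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:16] + ".html"


def download_all(urls, dest_dir, fetch=fetch_url):
    """Save each URL under dest_dir; return (saved paths, failures)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        try:
            data = fetch(url)
        except (OSError, ValueError) as err:
            # Dead links and timeouts are common; record them and move on.
            failed.append((url, str(err)))
            continue
        path = dest / filename_for(url)
        path.write_bytes(data)
        saved.append(path)
    return saved, failed
```

To use it, read your URL file into a list (one URL per line) and call download_all(urls, "pages"), where "pages" is whatever destination folder you choose. With 10,000 URLs, expect a fair number of failures, so check the second return value.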