PROJECT 3: WEB AS CORPUS
 


Please send the assignment to me, with LING 485: Project 3 in the subject line (including the colon and the right capitalization).

Also please name your Word file in the following way: project number + underscore + lowercase last name, e.g. 3_snodgrass


There are three different possibilities for this project (A, B, and C). The first one (A) isn't very tech-y; the other two are a bit more technical. But if you do B or C correctly, you'll get 10/10 on the project (not a very common thing).


A. DEFAULT OPTION: Do ONE of the following. Note that #1-2 (and possibly #3, unless you really spend a long time on it) are very easy compared to everything else, and the maximum possible score on these would probably be B+ (kind of like diving -- easier dive = lower possible score).

1. Orthography
Pick a number of words that are often misspelled, that show variation in spelling, or that have correct/incorrect usage (see 1 and 2).  How do the differences vary across different domains (e.g. .edu, .com, .gov, .uk, .ca, etc)? Take a brief sampling of the types of pages (company websites, individuals' own homepages, etc) that have the "aberrant" usages/spellings.

2. Grammar
Choose a "non-standard" grammatical construction and see how it varies across domains or websites. Use site:domain in the search field, e.g. "so not happy" site:.edu or "had went" site:reddit.com. Don't just search in one domain / website, but rather compare between different domains / websites.

     Examples (feel free to choose some for another language, if you want)
     I'm like so, so not interested; who to talk with / with whom to talk; had went / had saw

3. Semantics / collocates
Pick a relatively uncommon word (most of these will still have tens of thousands of hits on the Web). You might use WordandPhrase.Info: go to the frequency list and pick a starting range, e.g. 13,980 -- and don't take words with hyphens. Sample the first 70-80 occurrences from the Google results and see which words collocate most frequently. How does this compare to COCA or the BNC? Explain exactly how you extracted the collocates from Google (one possible counting approach is sketched below).
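
Here's one low-tech way to do that counting. Save the text of the results you sampled (e.g. the snippets) into a plain-text file and count the words that appear within a few words of your node word. A minimal Python sketch, where the file name (snippets.txt), the node word, and the window size are all just placeholder assumptions:

import re
from collections import Counter

NODE = "serendipity"   # hypothetical node word; substitute your own
WINDOW = 4             # words on each side that count as collocates

# snippets.txt is assumed to hold the text copied from the results pages
with open("snippets.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

collocates = Counter()
for i, w in enumerate(words):
    if w == NODE:
        span = words[max(0, i - WINDOW):i] + words[i + 1:i + 1 + WINDOW]
        collocates.update(span)

for word, freq in collocates.most_common(20):
    print(word, freq)

Ranking the top 20 by raw frequency is crude (no stopword filtering, no mutual information), but it's enough for a comparison against COCA or the BNC.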

4. Dialect
(Note that options #1-3 above don't require you to limit the search to a particular dialect. This one does.)
Choose a feature (lexical, grammatical, or whatever) for which you know or suspect there is a difference from one country to another. Use Google Settings / Advanced Search and then "Region". Note that you absolutely have to account for the overall size of the "corpus" for each country (i.e. the number of web pages) by using a "control word" (as we did in class). If you don't do that, I'll return it to you ungraded. Here's an example for a single word (with a small normalization sketch after the table); I'm expecting you to do 4-5 related words.

 

              candy           Control word: put     ratio     per 1,000 "put"
American      850,000,000     4,410,000,000         .1927     192.7
British       41,600,000      472,000,000           .0881     88.1

192.7 / 88.1 ≈ 2.19, i.e. the American rate is roughly 219% of the British rate.
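
If you'd rather not do that arithmetic by hand, here is a small Python sketch that performs the same normalization, using the candy/put figures copied from the table above (not fresh searches):

# Normalize raw Google hit counts by the control word "put"
counts = {
    "American": {"candy": 850_000_000, "put": 4_410_000_000},
    "British":  {"candy": 41_600_000,  "put": 472_000_000},
}

rates = {}
for region, c in counts.items():
    rates[region] = c["candy"] / c["put"] * 1000   # per 1,000 "put"
    print(f'{region}: {rates[region]:.1f} per 1,000 "put"')

print(f'US/UK ratio: {rates["American"] / rates["British"]:.2f}')   # about 2.19

For your 4-5 related words, just add more entries next to "candy" in each dictionary.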

 


OPTION B.
(Note: one of the people in the class emailed me today (Tues, Sep 25) and said that this is working. You're kind of on your own, though, if there are any difficulties.)
Use BootCat to create a corpus for a topic of your choice. Send me a one-paragraph description of how it worked out, as well as a list of the URLs. Also, ZIP the actual web pages, store the archive in the cloud (e.g. Google Drive or Dropbox), and send me a link.


OPTION C. (Only try this one if you are already really familiar with regular expressions.) Using regular expressions and a web scraper (e.g. HTTrack or the program by Spencer Davies (Go)), get 10,000 links from Google for a particular topic, country, or language. Then download the actual pages. Send me an email with an attachment listing all 10,000 URLs (a simple text file is fine) and one paragraph telling me how you got the data. Then ZIP all of the web pages, put the archive in the cloud (e.g. Google Drive or Dropbox), and send me a link.

1. Sample Google search (from Google News). For a language other than English, replace [of] and [en] below, and adjust [US] for the desired country. For a range of dates, run it several times with different values for the cd_min and cd_max dates (e.g. 9/20/2015, 9/21/2015, 9/22/2015, etc.); a short sketch that generates these URLs follows the example.
https://www.google.com/search?q=of&num=100&lr=lang_en&tbs=cdr:1,cd_min:9/20/2015,cd_max:9/20/2015,sbd:1&tbm=nws&cr=countryUS
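
If you'd rather not edit the dates by hand for every day, here is a minimal Python sketch that just prints one URL per day (it only builds the URLs; whether Google tolerates many automated requests is a separate question):

from datetime import date, timedelta

# One Google News search URL per day, following the template above.
# Adjust q=of, lang_en, and countryUS for other languages/countries.
start, end = date(2015, 9, 20), date(2015, 9, 22)
template = ("https://www.google.com/search?q=of&num=100&lr=lang_en"
            "&tbs=cdr:1,cd_min:{d},cd_max:{d},sbd:1&tbm=nws&cr=countryUS")

day = start
while day <= end:
    d = f"{day.month}/{day.day}/{day.year}"   # Google expects M/D/YYYY here
    print(template.format(d=d))
    day += timedelta(days=1)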

2. Use regular expressions to extract the URLs. For example, replace href="http with \n@@http, which will put each URL at the beginning of its own line. Then extract the lines beginning with @@ and save them as a text file. (A scripted version of the same idea is sketched below.)
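
The same extraction can be done in one pass with a script instead of editor find-and-replace. A Python sketch, assuming you saved each results page as an .html file in the current folder and want the list in urls.txt (both names are just assumptions):

import glob
import re

# Pull every href="http..." target out of the saved result pages,
# drop duplicates, and write one URL per line.
urls = []
for page in glob.glob("*.html"):
    with open(page, encoding="utf-8", errors="ignore") as f:
        urls += re.findall(r'href="(https?://[^"]+)"', f.read())

unique = list(dict.fromkeys(urls))   # keeps first-seen order
with open("urls.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(unique))

print(len(unique), "unique URLs saved to urls.txt")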

To get the web pages, use either 3a or 3b:

3a. Spencer Davies (Go). Make sure there are just two files in a new folder: the URL file from #2 and the .EXE file. Then click on the .EXE file.

3b. HTTrack: Create a batch file (e.g. getData.bat) whose only content is the following single line. Replace "c:\movies.txt" with the file containing the URLs, and "c:\movies\" with the destination folder for the downloaded pages:

"c:\Program Files\WinHTTrack\"httrack -qgC2%%P0s0b0u1j0%%s%%u%%I0p3DaK0T20R1H3%%kA25000%%f#f -%%L "c:\movies.txt" -O1 "c:\movies\"