CORPORA - UNSTRUCTURED

(Mon, Sep 16) Text archives

Get list of texts/pages

Looping through pages and entries (RegEx's or DOM)
Download: HTTrack, Spencer Davies (Go), requests (Python), wget (Linux), etc
Process pages: Python libraries, regular expressions; programming
(Web: clean pages: JusText, BeautifulSoup e.g. CNN)
Search data
Cambridge article: limitations
Simplest: list of pages: movies
Literature Online (e.g. "dreary")
Oxford English Dictionary
New York Times: PDF >> TXT (for COHA; % good words)
Factiva
Google News (AU; 1hr) (day): 1 >> 2
Licensing issues
Copyright issues
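
The "process pages" and "clean pages" steps above can be sketched with nothing but the Python standard library. This is only a minimal stand-in, not the JusText or BeautifulSoup API: it drops tags plus script/style content and keeps the visible text, and it does not attempt the boilerplate removal (menus, ads) that a dedicated cleaner does. The names TextExtractor and html_to_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For a real news page (e.g. CNN) the output would still contain navigation and footer text, which is why a boilerplate remover is normally run as well.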

1. Simple: list of URLs
-- c:\movies1\list.txt
-- HTTrack (c:\movies.bat)
-- Spencer's programs
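
A rough Python equivalent of the "list of URLs" approach, assuming a plain text file with one URL per line. The function names, the numbered output filenames, and the 30-second timeout are assumptions for this sketch, not what HTTrack or the class programs actually do.

```python
import urllib.request
from pathlib import Path


def read_url_list(text):
    """Return the non-empty lines of a URL list, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def download_all(urls, out_dir, fetcher=fetch):
    """Save each URL as 00001.html, 00002.html, ...; skip pages that fail."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls, start=1):
        path = out / f"{i:05d}.html"
        try:
            path.write_bytes(fetcher(url))
        except OSError as err:  # urllib's URLError is a subclass of OSError
            print(f"skipped {url}: {err}")
            continue
        saved.append(path)
    return saved
```

Typical use would be download_all(read_url_list(Path("list.txt").read_text()), "pages"). The fetcher argument is there so the loop can be tried out without touching the network.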

2. More messy, but just use RegExs
-- Google News (AU; 1hr) (day)
-- 1 >> 2
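
For the "messy, but just use RegExs" case, one possible sketch is to pull absolute links and their visible text out of a saved results page. The markup handled here is hypothetical; the real Google News HTML is different and changes often, so the pattern would need adjusting against an actual saved page.

```python
import re

# An <a> tag with an absolute http(s) href; group 1 = URL, group 2 = inner HTML.
LINK_RE = re.compile(
    r'<a\s[^>]*href="(https?://[^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(html):
    """Return (url, title) pairs for every absolute link that has visible text."""
    links = []
    for url, inner in LINK_RE.findall(html):
        # Remove nested tags, then collapse runs of whitespace in the title.
        title = " ".join(TAG_RE.sub("", inner).split())
        if title:
            links.append((url, title))
    return links
```

Relative links (href="/...") are skipped on purpose, since they are usually site navigation rather than articles.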

3. Program to extract information
-- DAVIES 2017A: VB.NET IMDB to get info from pages
-- But problem with getting actual file:
-- IMDB, Open Subtitles (ID: 4154756); Download

Supreme Court (year)
c:\a_web\ling485\for_class\sc.txt : 1 >> 2
.*(http://caselaw.findlaw.com/us-supreme-court/[[:digit:]]+/[[:digit:]]+\.html).*">([^<]*)<.*\n.*">([^<]+)<.*\n.*">([^<]+)<.*
\n@@\1@\2@\3@\4
See D2011; VB.NET = news1
Academic Search Premier (metadata to db)
Lexis Nexis: D2014: VB.NET 2005 \coca\1-lex-nex\new (db: coca_spok..x)
Lit Online
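
The Supreme Court search-and-replace above can be reproduced in Python with re.sub. Two things differ from the pattern as written in the notes: Python's re module does not support POSIX classes such as [[:digit:]], so \d is used instead, and the dots in the host name are escaped. The three-line sample markup is invented for the illustration; the real FindLaw listing may be laid out differently.

```python
import re

# Same structure as the pattern in the notes: a URL and link text on line 1,
# then one ">text<" field on each of the next two lines.
PATTERN = re.compile(
    r'.*(http://caselaw\.findlaw\.com/us-supreme-court/\d+/\d+\.html).*">([^<]*)<.*\n'
    r'.*">([^<]+)<.*\n'
    r'.*">([^<]+)<.*'
)
REPLACEMENT = r"\n@@\1@\2@\3@\4"


def flatten_cases(listing):
    """Collapse each three-line entry into one @-delimited record."""
    return PATTERN.sub(REPLACEMENT, listing)


# Hypothetical markup, only for trying the pattern out.
SAMPLE = (
    '<td><a href="http://caselaw.findlaw.com/us-supreme-court/347/483.html">'
    'Brown v. Board of Education</a></td>\n'
    '<td class="date">May 17, 1954</td>\n'
    '<td class="cite">347 U.S. 483</td>'
)
```

Each record then starts with @@ on its own line, so the output can be split on "@@" and on "@" to load the fields into a database.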

(Wed, Sep 18)

5 Advantages of the Web:

size: "simple models, lots of data"; cf. Google translate
authentic: no idea their material would be used for linguistic analysis; e.g. very informal blogs
up-to-date: "linguistic reflections of contemporary culture" (examples?); neologisms (vs. COCA, etc)
affordability: compile and process multi-million word corpus overnight
diversity / languages

10 Limitations / disadvantages of web

wildcards
part of speech (cf. [vv*] [p*] into [vvg*])
semantic: collocates
semantic: synonyms, user-defined lists, etc
genre?
date?
problems with different servers
multi-word: "not really have the" (16,500 vs 300); "he might be taken for a" (1,540,000 vs 160)
stemming, punctuation, diacritics (público, publico, publicó)
limit of 1000 hits; most popular/entertainment sites first
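
The diacritics problem in the list above is easy to demonstrate: once accents are stripped, which is assumed here to be roughly what a search engine's folding does, the three distinct Spanish forms become one string. The helper name strip_diacritics is made up for this sketch.

```python
import unicodedata


def strip_diacritics(word):
    """Decompose accented characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


forms = ["público", "publico", "publicó"]
folded = {w: strip_diacritics(w) for w in forms}
# All three forms now map to the same unaccented string.
```

The same folding also turns ñ into n, so any distinction carried only by a diacritic is lost to the search.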

15 Biber web

What's the problem?
How got data: relationship to GloWbE
Where got raters?
How categorized: issues (inter-rater, hybrid)
CORE corpus

(Wed, Sep 26) Ways of accessing web data:

Creating GloWbE: "and to the"

Creating NOW: Google News (US; 1hr) (day)

Creating iWeb: (PPT)

Amazon's Alexa
Search "of" for each of 200,000 websites (Google and Bing; five computers; problem with # searches)
Download 30 million web pages
Minimum # words, pages per website
BYU corpus architecture
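
The "minimum # words, pages per website" step might look something like the filter below. Both thresholds and the reading of the note (a minimum word count per page, then a minimum number of surviving pages per site) are assumptions for the sketch; the actual iWeb cutoffs are not given in these notes.

```python
from collections import defaultdict


def select_sites(pages, min_words_per_page=100, min_pages_per_site=3):
    """pages: iterable of (site, text). Return {site: [word counts of kept pages]}.

    Thresholds are placeholder values, not the ones used for iWeb.
    """
    by_site = defaultdict(list)
    for site, text in pages:
        n_words = len(text.split())
        if n_words >= min_words_per_page:  # drop pages that are too short
            by_site[site].append(n_words)
    # Keep only sites that still have enough pages after the first filter.
    return {
        site: counts
        for site, counts in by_site.items()
        if len(counts) >= min_pages_per_site
    }
```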

Method	Example	Advantages	Disadvantages
Online corpora	iWeb, GloWbE, NOW, Sketch Engine	Easy	Limited to textual corpora they've created
Google, Bing, etc		No special knowledge	Very limited searches
(Search engine): serially	BootCat	Create large corpora; "off the shelf"	Limited to just what the program can do; depends on whether search engines allow it
(Bing): real-time	WebCorp the only * is that he [is|was] [they|he|she] ed [me|him|her|us|them] into ing	The web itself	Sometimes flaky; slow
RegEx / programming	Tools: JusText, HTTrack, Spencer Davies (Go)
Lists of URLs	TV/movies	Fairly easy	Have to somehow create the list of URLs
Google: manual	"and is the"	Limited knowledge needed; maybe regex's	Small amount of data; probably still need regex's, etc
Google: script	GloWbE ("and is the") NOW (1 hr) Bing	Unlimited amount of data; full customizability	Need to know how to program (Python, VB.NET, wget, etc)
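
For the scripted-search row, a script usually has to turn a bracketed pattern like the [is|was] alternations in the WebCorp example into the individual strings that are actually sent to the search engine. A minimal sketch of that expansion follows; expand_query is a hypothetical helper, not part of WebCorp or any search API, and it does not handle the * wildcard.

```python
import itertools
import re


def expand_query(pattern):
    """Expand every [a|b|c] group into all concrete query strings."""
    # Splitting with a capturing group keeps the bracketed pieces in the list.
    parts = re.split(r"(\[[^\]]*\])", pattern)
    choices = []
    for part in parts:
        if part.startswith("[") and part.endswith("]"):
            choices.append(part[1:-1].split("|"))
        else:
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]
```

The number of queries is the product of the group sizes, so a pattern with several groups multiplies quickly, which is where limits on the number of searches start to matter.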