CORPORA - UNSTRUCTURED
But first
(Mon, Sep 16) Text archives
- Get list of texts/pages
- Looping through pages and entries (RegEx's or DOM)
- Download: HTTrack, Spencer Davies (Go), requests (Python), wget (Linux), etc
- Process pages: Python libraries, regular expressions; programming
- (Web: clean pages: JusText, BeautifulSoup; e.g. CNN)
- Search data
- Cambridge article: limitations
- Simplest: list of pages: movies
- Literature Online (e.g. "dreary")
- Oxford English Dictionary
- New York Times: PDF >> TXT (for COHA; % good words)
- Factiva
- Google News (AU; 1hr) (day); -- 1 >> 2
- Licensing issues
- Copyright issues
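The download and processing steps listed above (get a list of pages, loop through them, download, clean) can be sketched in Python. This is only an illustration using the standard library: `urllib.request` stands in for requests/wget/HTTrack, and a small `html.parser` subclass stands in for a real cleaner such as JusText or BeautifulSoup. The names `TextExtractor`, `html_to_text`, `fetch`, and `build_corpus` are hypothetical and not from the course materials.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Very rough cleaning: drop tags, keep text (no boilerplate removal)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch(url):
    """Download one page. Assumes UTF-8, which real pages do not guarantee."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def build_corpus(urls, fetcher=fetch):
    """Loop through a list of URLs and return {url: cleaned text}."""
    return {url: html_to_text(fetcher(url)) for url in urls}
```

Passing the fetcher as a parameter makes it possible to try the loop on saved pages before hitting a live site. Unlike JusText, this sketch does not remove navigation menus or other boilerplate.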
1. Simple: list of URLs -- c:\movies1\list.txt -- HTTrack (c:\movies.bat) -- Spencer's programs
2. More messy, but just use RegExs -- Google News (AU; 1hr) (day) -- 1 >> 2
3. Program to extract information -- DAVIES 2017A: VB.NET IMDB to get info from pages -- But problem with getting actual file: -- IMDB, Open Subtitles (ID: 4154756); Download
- Supreme Court (year) c:\a_web\ling485\for_class\sc.txt: 1 >> 2
  .*(http://caselaw.findlaw.com/us-supreme-court/[[:digit:]]+/[[:digit:]]+\.html).*">([^<]*)<.*\n.*">([^<]+)<.*\n.*">([^<]+)<.* \n@@\1@\2@\3@\4
- See D2011; VB.NET = news1
- Academic Search Premier (metadata to db)
- Lexis Nexis: D2014: VB.NET 2005 \coca\1-lex-nex\new (db: coca_spok..x)
- Lit Online
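The Supreme Court search-and-replace above can be approximated with Python's `re` module. This is a sketch under several assumptions: the POSIX class `[[:digit:]]` is rewritten as `\d` (Python's `re` does not support POSIX classes), the dots in the host name are escaped, the stray space before the final `\n` is dropped, and the three-line HTML in `SAMPLE` is invented for illustration (it is not real FindLaw markup or a real case). The function names are hypothetical.

```python
import re

# Python adaptation of the editor-style pattern in the notes.
# Four groups: URL, then one ">text<" capture on each of three lines.
PATTERN = re.compile(
    r'.*(http://caselaw\.findlaw\.com/us-supreme-court/\d+/\d+\.html).*">([^<]*)<.*\n'
    r'.*">([^<]+)<.*\n'
    r'.*">([^<]+)<.*\n'
)

# Invented sample input; the real page layout may differ.
SAMPLE = (
    '<td><a href="http://caselaw.findlaw.com/us-supreme-court/123/456.html">Doe v. Roe</a></td>\n'
    '<td class="date">1 Jan 2000</td>\n'
    '<td class="cite">123 U.S. 456</td>\n'
)


def flatten(html):
    """Mirror the notes: replace each 3-line entry with @@url@f1@f2@f3."""
    return PATTERN.sub(r"@@\1@\2@\3@\4", html)


def extract_cases(html):
    """More direct alternative: return one 4-tuple per matched entry."""
    return PATTERN.findall(html)


if __name__ == "__main__":
    print(flatten(SAMPLE))
    for case in extract_cases(SAMPLE):
        print(case)
```

The `@@` and `@` markers work as record and field delimiters for a later step, which is presumably what "1 >> 2" refers to. `findall` skips the delimiters and is safer if a field could itself contain `@`.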
(Wed, Sep 18)
5 Advantages of the Web:
- size: "simple models, lots of data"; cf. Google translate
- authentic: writers had no idea their material would be used for linguistic analysis; e.g. very informal blogs
- up-to-date: "linguistic reflections of contemporary culture" (examples?); neologisms (vs. COCA, etc)
- affordability: compile and process a multi-million word corpus overnight
- diversity / languages
10 Limitations / disadvantages of web
- wildcards
- part of speech (cf. [vv*] [p*] into [vvg*])
- semantic: collocates
- semantic: synonyms, user-defined lists, etc
- genre?
- date?
- problems with different servers
- multi-word: not really have the (16,500 vs 300); he might be taken for a (1,540,000 vs 160)
- stemming, punctuation, diacritics (público, publico, publicó)
- limit of 1000 hits; most popular/entertainment sites first
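The diacritics point above can be made concrete with a short Python sketch. It assumes a search engine folds accents roughly the way Unicode decomposition plus mark-stripping does; the actual behavior of any given engine is not documented here. `strip_diacritics` is a hypothetical helper name.

```python
import unicodedata


def strip_diacritics(word):
    """Decompose accented letters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Three different Spanish word forms from the notes.
forms = ["público", "publico", "publicó"]

# After accent folding they become indistinguishable.
collapsed = {strip_diacritics(w) for w in forms}
print(collapsed)
```

Once accents are folded, the noun/adjective "público", the present-tense "publico", and the preterite "publicó" all end up as the same search string, so their frequencies cannot be separated.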
15 Biber web
- What's the problem?
- How got data: relationship to GloWbE
- Where got raters?
- How categorized: issues (inter-rater, hybrid)
- CORE corpus
(Wed, Sep 26) Ways of accessing web data:
Creating GloWbE: "and to the"
Creating NOW: Google News (US; 1hr) (day)
Creating iWeb: (PPT)
- Amazon's Alexa
- of for each of 200,000 websites (Google and Bing; five computers; problem with # searches)
- Download 30 million web pages
- Minimum # words, pages per website
- BYU corpus architecture
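The "minimum # words, pages per website" step above might look something like the following Python sketch. The notes do not give the actual thresholds or say exactly how they were applied, so this assumes one plausible reading: drop pages that are too short, then drop websites left with too few pages. The function name, parameters, and the example threshold values are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def filter_websites(pages, min_words, min_pages):
    """Keep websites that have at least `min_pages` pages of `min_words`+ words.

    `pages` is an iterable of (url, text) pairs. Returns {website: [urls]}.
    """
    by_site = defaultdict(list)
    for url, text in pages:
        # Crude word count: whitespace-separated tokens.
        if len(text.split()) >= min_words:
            by_site[urlsplit(url).netloc].append(url)
    return {site: urls for site, urls in by_site.items() if len(urls) >= min_pages}


if __name__ == "__main__":
    # Toy data with placeholder thresholds, not the values used for iWeb.
    pages = [
        ("http://a.example/1", "one two three four"),
        ("http://a.example/2", "one two three four five"),
        ("http://b.example/1", "too short"),
        ("http://b.example/2", "one two three four"),
    ]
    print(filter_websites(pages, min_words=4, min_pages=2))
```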
Method | Example | Advantages | Disadvantages |
Online corpora | iWeb, GloWbE, NOW, Sketch Engine | Easy | Limited to textual corpora they've created |
Google, Bing, etc | | No special knowledge | Very limited searches |
(Search engine): serially | BootCat | Create large corpora; "off the shelf" | Limited to just what the program can do; whether search engines allow it |
(Bing): real-time | WebCorp: the only * is that he [is|was] [they|he|she] *ed [me|him|her|us|them] into *ing | The web itself | Sometimes flaky; slow |
RegEx / programming | Tools: JusText, HTTrack, Spencer Davies (Go) | | |
Lists of URLs | TV/movies | Fairly easy | Have to somehow create the list of URLs |
Google: manual | "and is the" | Limited knowledge needed; maybe regex's | Small amount of data; probably still need regex's, etc |
Google: script | GloWbE ("and is the"), NOW (1 hr), Bing | Unlimited amount of data; full customizability | Need to know how to program (Python, VB.NET, wget, etc) |
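For the "RegEx / programming" row, a wildcard query like the WebCorp example `[they|he|she] *ed [me|him|her|us|them] into *ing` can be approximated with a regular expression once the text has been downloaded. This Python sketch rests on my own reading of the query syntax (`*` as "any run of word characters", `[a|b]` as alternatives); it is not WebCorp's documented behavior, and `find_into_ing` is a hypothetical name.

```python
import re

# Approximate translation of: [they|he|she] *ed [me|him|her|us|them] into *ing
INTO_ING = re.compile(
    r"\b(they|he|she) (\w+ed) (me|him|her|us|them) into (\w+ing)\b",
    re.IGNORECASE,
)


def find_into_ing(text):
    """Return every full match of the pattern in `text`, in order."""
    return [m.group(0) for m in INTO_ING.finditer(text)]


if __name__ == "__main__":
    sample = (
        "She talked him into buying a car. "
        "They pressured us into signing. He went home."
    )
    for hit in find_into_ing(sample):
        print(hit)
```

Matching on spelling alone will also pick up non-verbs that happen to end in "-ed" or "-ing", which is the part-of-speech limitation noted earlier for web searches.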