CORPORA - UNSTRUCTURED

But first



(Mon, Sep 16)
Text archives

  • Get list of texts/pages
    • Looping through pages and entries (RegEx's or DOM)
    • Download: HTTrack, Spencer Davies (Go), requests (Python), wget (Linux), etc
    • Process pages: Python libraries, regular expressions; programming
    • (Web: clean pages: JusText, BeautifulSoup e.g. CNN)
    • Search data
       
    • Cambridge article: limitations
       
    • Simplest: list of pages: movies
    • Literature Online (e.g. "dreary")
    • Oxford English Dictionary
    • New York Times: PDF >> TXT (for COHA; % good words)
    • Factiva
    • Google News (AU; 1hr) (day) ; -- 1 >> 2
       
    • Licensing issues
    • Copyright issues
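
The "process pages" and "clean pages" steps near the top of this list can be sketched in a few lines. This is only a rough illustration using Python's standard library: it strips tags and script/style content, and is far cruder than JusText or BeautifulSoup, which also deal with menus, ads and other boilerplate. The names `TextExtractor` and `html_to_text` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from a page, ignoring <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the page text as one string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
    print(html_to_text(page))
```

Paragraph breaks are lost here; a real pipeline would probably keep them.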

    1. Simple: list of URLs
    -- c:\movies1\list.txt
    -- HTTrack (c:\movies.bat)
    -- Spencer's programs

    2. More messy, but just use RegExs
    -- Google News (AU; 1hr) (day)
    -- 1 >> 2

    3. Program to extract information
    -- DAVIES 2017A: VB.NET IMDB to get info from pages
    -- But problem with getting actual file:
    -- IMDB, Open Subtitles (ID: 4154756); Download
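
Approach 1 above (a simple list of URLs) might look roughly like this in Python. It is a sketch, not the class scripts: it assumes the list file holds one URL per line, and the names `url_to_filename`, `fetch` and `download_all`, the `delay` parameter and the file paths in the usage line are all hypothetical.

```python
import re
import time
import urllib.request
from pathlib import Path


def url_to_filename(url):
    """Turn a URL into a flat, file-system-safe name."""
    name = re.sub(r"^https?://", "", url)
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(list_file, out_dir, fetcher=fetch, delay=1.0):
    """Save every URL in list_file (one per line) as a file in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for line in Path(list_file).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        target = out_dir / url_to_filename(url)
        target.write_bytes(fetcher(url))
        saved.append(target)
        if delay:
            time.sleep(delay)  # pause between requests to be polite
    return saved


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    download_all("list.txt", "pages")
```

The `fetcher` argument is there so the loop can be tried out without touching the network. A failed download raises an exception and stops the run; a longer job would want to catch and log errors instead.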
     

    • Supreme Court (year)
      c:\a_web\ling485\for_class\sc.txt : 1 >> 2
      .*(http://caselaw.findlaw.com/us-supreme-court/[[:digit:]]+/[[:digit:]]+\.html).*">([^<]*)<.*\n.*">([^<]+)<.*\n.*">([^<]+)<.*
      \n@@\1@\2@\3@\4
    • See D2011; VB.NET = news1
    • Academic Search Premier (metadata to db)
    • Lexis Nexis: D2014: VB.NET 2005 \coca\1-lex-nex\new (db: coca_spok..x)
    • Lit Online

    (Wed, Sep 18)

    5 Advantages of the Web:

    • size: "simple models, lots of data"; cf. Google translate
    • authentic: authors had no idea their material would be used for linguistic analysis; e.g. very informal blogs
    • up-to-date: "linguistic reflections of contemporary culture" (examples?); neologisms (vs. COCA, etc)
    • affordability: compile and process multi-million word corpus overnight
    • diversity / languages

    10 Limitations / disadvantages of web

    • wildcards
    • part of speech (cf. [vv*] [p*] into [vvg*])
    • semantic: collocates
    • semantic: synonyms, user-defined lists, etc
    • genre?
    • date?
       
    • problems with different servers
    • multi-word: "not really have the" (16,500 vs 300); "he might be taken for a" (1,540,000 vs 160)
    • stemming, punctuation, diacritics (público, publico, publicó)
    • limit of 1000 hits; most popular/entertainment sites first
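
To see why the diacritics item matters: if a search engine ignores accents (an assumption here; engines differ and change over time), the three Spanish forms público 'public', publico 'I publish' and publicó 's/he published' all become the same search string. A minimal sketch of that kind of folding, with the made-up name `fold_diacritics`:

```python
import unicodedata


def fold_diacritics(text):
    """Remove accent marks by decomposing and dropping combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ["público", "publico", "publicó"]:
        print(word, "->", fold_diacritics(word))
```

All three words come out as `publico`, so hit counts for any one of them cannot be separated from the other two.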

    15 Biber web

    • What's the problem?
    • How got data: relationship to GloWbE
    • Where got raters?
    • How categorized: issues (inter-rater, hybrid)
    • CORE corpus

    (Wed, Sep 26) Ways of accessing web data:

    Creating GloWbE: "and to the"

    Creating NOW: Google News (US; 1hr) (day)

    Creating iWeb: (PPT)

    • Amazon's Alexa
    • of for each of 200,000 websites (Google and Bing; five computers; problem with # searches)
    • Download 30 million web pages
    • Minimum # words, pages per website
    • BYU corpus architecture
       
    Method: Example / Advantages / Disadvantages

    • Online corpora
      -- Example: iWeb, GloWbE, NOW, Sketch Engine
      -- Advantages: Easy
      -- Disadvantages: Limited to textual corpora they've created
    • Google, Bing, etc
      -- Advantages: No special knowledge
      -- Disadvantages: Very limited searches
    • (Search engine): serially
      -- Example: BootCat
      -- Advantages: Create large corpora; "off the shelf"
      -- Disadvantages: Limited to just what the program can do; whether search engines allow it
    • (Bing): real-time
      -- Example: WebCorp
           the only * is that he [is|was]
           [they|he|she] *ed [me|him|her|us|them] into *ing
      -- Advantages: The web itself
      -- Disadvantages: Sometimes flaky; slow
    • RegEx / programming
      -- Tools: JusText, HTTrack, Spencer Davies (Go)
    • Lists of URLs
      -- Example: TV/movies
      -- Advantages: Fairly easy
      -- Disadvantages: Have to somehow create the list of URLs
    • Google: manual
      -- Example: "and is the"
      -- Advantages: Limited knowledge needed; maybe regex's
      -- Disadvantages: Small amount of data; probably still need regex's, etc
    • Google: script
      -- Example: GloWbE ("and is the"); NOW (1 hr); Bing
      -- Advantages: Unlimited amount of data; full customizability
      -- Disadvantages: Need to know how to program (Python, VB.NET, wget, etc)
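
The two WebCorp-style queries above use `*` as a wildcard and `[a|b]` for alternatives. Once pages have been downloaded, the same kind of search can be run locally by converting the query to a regular expression. The sketch below reflects one reading of that syntax (a bare `*` is any single word, `*ed` is any word ending in -ed); it is not WebCorp's own implementation, and `query_to_regex` is a made-up name.

```python
import re


def query_to_regex(query):
    """Convert a wildcard query into a compiled, case-insensitive regex."""
    parts = []
    for token in query.split():
        if token.startswith("[") and token.endswith("]"):
            # [is|was] -> (?:is|was)
            options = token[1:-1].split("|")
            parts.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
        elif token == "*":
            # A bare * stands for exactly one word.
            parts.append(r"\w+")
        else:
            # *ed -> \w*ed ; literal pieces are escaped.
            parts.append(r"\w*".join(re.escape(p) for p in token.split("*")))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


if __name__ == "__main__":
    pattern = query_to_regex("[they|he|she] *ed [me|him|her|us|them] into *ing")
    match = pattern.search("She talked him into buying a car.")
    print(match.group(0) if match else None)
```

Because this matches on word shape only, `*ed` will also pick up non-verbs such as "bed"; the part-of-speech limitation noted earlier still applies.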