WORDLISTS AND KEYWORDS
 

Dee Gardner: Exploring Vocabulary

1. COCA/BNC interface

  • All sections
  • Limit by section (more later)
  • Limit by substring

2. www.wordfrequency.info

  • 5000 lemma list
  • Comparisons to ANC/BNC; notes on compiling 100k list (standard issues for any frequency list)
  • uses for this type of data (natural language processing, materials development)

3. www.ngrams.info

  • natural language processing (predictive software)
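
A minimal sketch of how n-gram counts can drive predictive software: count which words follow each word, then suggest the most frequent followers. The function names and the toy token list are illustrative assumptions; they are not taken from the www.ngrams.info data, which would supply real counts.

```python
from collections import Counter, defaultdict

def build_bigram_model(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    followers = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        followers[first][second] += 1
    return followers

def predict_next(model, word, k=3):
    """Return up to k of the most frequent followers of `word`."""
    return [w for w, _ in model.get(word, Counter()).most_common(k)]

# Toy data (hypothetical); a real system would load bigram counts from a large corpus.
toy_tokens = "of course i will of course you can of the".split()
model = build_bigram_model(toy_tokens)
print(predict_next(model, "of"))
```

With a corpus as small as this toy list, most words have no followers at all, which is the "garbage in, garbage out" problem in miniature.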

  • Intro

  • Previous

  • Importance of corpus (garbage in, garbage out)

  • Size

    • Brown corpus (1 million words): too small

    • In Brown, only about 4,000 lemmas occur 20 times or more (vs. 40k in the BNC, 120k in COCA)

    • 83 of the 1,000 most frequent adjectives in COCA occur five times or less in Brown, including such common words as fun, offensive, medium, tender, teenage, coastal, scary, organizational, terrific, sexy, cute, innovative, risky, shiny, viable, hazardous, conceptual, and affordable (all of which occur 5,000 times or more in COCA).

    • Of the top 2,000 adjectives in COCA, 425 occur five times or less in Brown, and this rises to 2,053 of the top 5,000 and 5,106 of the top 10,000 (all of which occur 120 times or more in COCA)

  • Tokenization:

    • MWUs: of course, as far as, give up

    • compromise: multiple words, share PoS tags (e.g. CS31, CS32, CS33)

    • Hyphenated: never-to-be-forgotten, run-of-the-mill

    • Contractions: ca n't, she 'll, Fred 's: why?

    • Real variation: awhile, a while

    • Chinese

    • polysemy: bank, bat ("500,000 trained monkeys")
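
The contraction and multi-word-unit decisions above can be sketched as two small tokenization steps. This is an illustration under stated assumptions, not the tokenizer used for COCA or the BNC: the clitic list and the MWU set are hypothetical and far from complete, and joining MWUs with an underscore is a simplification of the shared-tag compromise mentioned above.

```python
import re

# Assumed clitic list, for illustration only.
CLITIC = re.compile(r"(?i)^(.+?)(n't|'ll|'s|'re|'ve|'d|'m)$")

# Assumed MWU list, taken from the examples in these notes.
MWUS = {("of", "course"), ("as", "far", "as"), ("give", "up")}

def split_contractions(text):
    """Split on whitespace, then split clitics off: can't -> ca + n't."""
    out = []
    for word in text.split():
        m = CLITIC.match(word)
        if m:
            out.extend([m.group(1), m.group(2)])
        else:
            out.append(word)
    return out

def merge_mwus(tokens, mwus=MWUS):
    """Greedily join the longest matching multi-word unit at each position."""
    max_len = max(len(m) for m in mwus)
    out, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 1, -1):
            chunk = tuple(t.lower() for t in tokens[i:i + n])
            if len(chunk) == n and chunk in mwus:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No MWU starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out

print(merge_mwus(split_contractions("Of course she'll give up")))
```

Splitting the clitics lets each piece carry its own part-of-speech tag (she = pronoun, 'll = modal), which is one answer to the "why?" above.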

  • Type/token: color / colour, nonsmoker / non-smoker, no / noo / noooooo

  • Type / tokens

  • Type / token ratio (problem / solution)

    # texts      types       tokens    TTR (%)
          1        720        2,786       25.8
         10      3,917       29,928       13.1
        100     12,514      250,000        5.0
      1,000     35,673    2,855,324        1.2
     10,000     72,597   24,479,439        0.3
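
The TTR figures above are types divided by tokens, expressed as a percentage. The sketch below recomputes them from the table and adds one common remedy for the size problem, a standardized TTR averaged over fixed-size chunks. The notes do not say which solution is intended, so `standardized_ttr` and its `chunk_size` default are assumptions for illustration.

```python
def type_token_ratio(tokens):
    """Types / tokens as a percentage; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return 100 * len(set(tokens)) / len(tokens)

def standardized_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks of chunk_size tokens (assumed remedy)."""
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        # Text shorter than one chunk: fall back to the plain ratio.
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

# (number of texts, types, tokens) copied from the table above.
rows = [(1, 720, 2786), (10, 3917, 29928), (100, 12514, 250000),
        (1000, 35673, 2855324), (10000, 72597, 24479439)]
table_ttr = [round(100 * types / tokens, 1) for _, types, tokens in rows]
print(table_ttr)
```

Because every chunk has the same length, the standardized figure can be compared across corpora of different sizes, which the raw ratio cannot.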


     
  • Range and dispersion ("dispersion")

    • range: how many "buckets" have at least one drop (one token)

    • dispersion: how evenly filled are the buckets (0.01 - 0.99)

    • Juilland's D; cf. other methods

    • Routledge frequency dictionaries: score = raw frequency * dispersion
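
A minimal sketch of the bucket idea, assuming the corpus is divided into equal-sized parts and that dispersion is measured with Juilland's D in its usual form, D = 1 - V / sqrt(n - 1), where V is the standard deviation of the per-part frequencies divided by their mean. The function names and sample counts are illustrative, not taken from any particular dictionary.

```python
import math

def corpus_range(part_freqs):
    """Number of parts ("buckets") containing at least one token."""
    return sum(1 for f in part_freqs if f > 0)

def juilland_d(part_freqs):
    """Juilland's D for frequencies in n equal-sized parts (assumption: equal sizes)."""
    n = len(part_freqs)
    if n < 2:
        return 0.0
    mean = sum(part_freqs) / n
    if mean == 0:
        return 0.0
    # Population standard deviation of the per-part frequencies.
    sd = math.sqrt(sum((f - mean) ** 2 for f in part_freqs) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)

def adjusted_score(part_freqs):
    """Raw frequency * dispersion, as in the score described above."""
    return sum(part_freqs) * juilland_d(part_freqs)

even = [10, 10, 10, 10]    # same word count in every part
clumped = [40, 0, 0, 0]    # all tokens in a single part
print(corpus_range(even), juilland_d(even), adjusted_score(even))
print(corpus_range(clumped), juilland_d(clumped), adjusted_score(clumped))
```

Both words have a raw frequency of 40, but the clumped word scores near zero, so it drops out of a list ranked by frequency * dispersion.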

  • Zipfian distribution: intro (actual data)
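
A quick way to inspect a Zipfian distribution in any word list is to rank words by frequency and compare each frequency with the idealized prediction that frequency is roughly proportional to 1 / rank. The sketch below assumes that simple form of Zipf's law; the toy token list is hypothetical, and real corpora only approximate the curve.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) tuples, most frequent first."""
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

def zipf_expected(top_freq, rank):
    """Idealized Zipf prediction: frequency of the top word divided by rank."""
    return top_freq / rank

toy_tokens = "the the the the of of a".split()
table = rank_frequency(toy_tokens)
for rank, word, freq in table:
    print(rank, word, freq, zipf_expected(table[0][2], rank))
```

On real data, plotting log(rank) against log(frequency) gives a roughly straight line, with a handful of very frequent words and a long tail of words that occur once or twice.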


4. BNC wordlists


Solutions for customized / personal corpora

WordAndPhrase (enter text)

5. Range (more) (and issue of word families "break, mean"; google "paul nation range program")

6. AntConc (c:\download\)

7. WordSmith

8. TextStat (c:\download\textStat)


Keywords

9. WordClouds.com (example from Wordle)

10. AntConc

11. WordAndPhrase.info

12. Virtual corpora (COCA: Astronomy, Wiki: Biology, GC: Speaker)

Comparisons

  • Genesis vs Bible
  • Book of Mormon vs Bible
  • Alma vs Book of Mormon
  • General Conference vs BNC
  • Specific General Authorities vs General Conference
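
Comparisons like these rest on a keyness statistic: a word is a keyword if it is much more frequent in the target text than in the reference corpus. The notes do not say which statistic is intended; the sketch below assumes log-likelihood, one commonly used option, and the toy word counts are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness.

    a, b: frequency of the word in the target and reference corpus
    c, d: total tokens in the target and reference corpus
    """
    e1 = c * (a + b) / (c + d)   # expected frequency in target
    e2 = d * (a + b) / (c + d)   # expected frequency in reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(target, reference, top=10):
    """Words relatively more frequent in target, ranked by log-likelihood."""
    c, d = sum(target.values()), sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep only words overused in the target
            scored.append((log_likelihood(a, b, c, d), word))
    return [w for _, w in sorted(scored, reverse=True)[:top]]

# Toy counts (hypothetical): a small "astronomy" text vs. a general reference.
target = Counter("star planet star orbit the the".split())
reference = Counter("the the the the of of star a".split())
print(keywords(target, reference))
```

The choice of reference corpus matters as much as the statistic: Genesis compared with the whole Bible yields different keywords than Genesis compared with the BNC would.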