WORDLISTS AND KEYWORDS
Dee Gardner: Exploring Vocabulary
1. COCA/BNC interface
- All sections
- Limit by section (more later)
- Limit by substring
2. www.wordfrequency.info
- 5000 lemma list
- Comparisons to ANC/BNC; notes on compiling 100k list (standard issues for any frequency list)
- uses for this type of data (natural language processing, materials development)
3. www.ngrams.info
- natural language processing (predictive software)
- Intro
- What's the purpose of frequency data?
- Routledge Frequency Dictionaries
  - Samples: English, Spanish, Portuguese, Russian, German, Chinese, Korean, French, Arabic
  - Also: Persian, Turkish, Dutch, Japanese, Czech
- Previous
- Importance of corpus (garbage in, garbage out)
- Size
  - Brown corpus (1 million words): too small
  - About 4,000 lemmas occur 20 times or more in Brown (vs 40k in BNC, 120k in COCA)
  - 83 of the 1,000 most frequent adjectives in COCA occur five times or fewer in Brown, including such common words as fun, offensive, medium, tender, teenage, coastal, scary, organizational, terrific, sexy, cute, innovative, risky, shiny, viable, hazardous, conceptual, and affordable (all of which occur 5,000 times or more in COCA).
  - Of the top 2,000 adjectives in COCA, 425 occur five times or fewer in Brown; this rises to 2,053 of the top 5,000 and 5,106 of the top 10,000 (all of which occur 120 times or more in COCA).
- Tokenization:
  - MWUs: of course, as far as, give up
    - compromise: multiple words, share PoS tags (e.g. CS31, CS32, CS33)
  - Hyphenated: never-to-be-forgotten, run-of-the-mill
  - Contractions: ca n't, she 'll, Fred 's: why?
  - Real variation: awhile, a while
  - Chinese
  - polysemy: bank, bat ("500,000 trained monkeys")
  - Type/token: color / colour, nonsmoker / non-smoker, no / noo / noooooo
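The contraction splits listed above (ca n't, she 'll, Fred 's) can be sketched in a few lines. One common motivation for splitting is that each piece can then carry its own part-of-speech tag, though the notes leave the "why?" open. This is a minimal sketch, not the tokenizer used for COCA or the BNC: the clitic inventory, the function names `split_clitics` and `tokenize`, and the handling of straight apostrophes only are all assumptions made for illustration.

```python
import re

# Assumed clitic inventory for illustration; real tokenizers use longer lists.
CLITIC = re.compile(r"(n't|'(?:ll|s|re|ve|d|m))$", re.IGNORECASE)


def split_clitics(word):
    """Split one trailing clitic off a word, e.g. "can't" -> ["ca", "n't"]."""
    match = CLITIC.search(word)
    if match and match.start() > 0:
        return [word[:match.start()], word[match.start():]]
    return [word]


def tokenize(text):
    """Split on whitespace, strip edge punctuation, then split clitics."""
    tokens = []
    for raw in text.split():
        word = raw.strip(".,;:!?\"()")
        if word:
            tokens.extend(split_clitics(word))
    return tokens


print(tokenize("She'll say Fred's dog can't swim."))
# ['She', "'ll", 'say', 'Fred', "'s", 'dog', 'ca', "n't", 'swim']
```

Hyphenated forms and multiword units would need separate rules; this sketch leaves them untouched.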
- Type / tokens
- Type / token ratio (problem / solution)

# texts | types  | tokens     | TTR (%)
1       | 720    | 2,786      | 25.8
10      | 3,917  | 29,928     | 13.1
100     | 12,514 | 250,000    | 5.0
1,000   | 35,673 | 2,855,324  | 1.2
10,000  | 72,597 | 24,479,439 | 0.3
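The table shows the problem: TTR falls steadily as the sample grows, so raw TTR values are not comparable across texts of different sizes. A minimal sketch of the calculation follows. The notes do not say which solution is meant; the chunk-averaged ("standardized") TTR here is one common workaround and is an assumption, as are the function names and the default chunk size.

```python
def type_token_ratio(tokens):
    """Return TTR as a percentage: distinct types / running tokens * 100."""
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


def standardized_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks of chunk_size tokens.

    Falls back to plain TTR when the text is shorter than one chunk.
    """
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)


# First row of the table: 720 types in 2,786 tokens
print(round(100.0 * 720 / 2786, 1))  # 25.8

# Toy sentence: 8 types in 13 tokens
tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(round(type_token_ratio(tokens), 1))  # 61.5
```

Because every chunk has the same length, the standardized figure stays comparable whether the corpus holds one text or ten thousand.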
- Range and dispersion ("dispersion")
  - range: how many "buckets" have at least one drop (one token)
  - dispersion: how evenly filled are the buckets (0.01 - 0.99)
  - Juilland's D; cf. other methods
  - Routledge frequency dictionaries: score = raw frequency * dispersion
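The bucket picture can be made concrete with a short calculation. This sketch assumes the corpus is divided into n equal-sized parts and uses the usual textbook form of Juilland's D, D = 1 - V / sqrt(n - 1), where V is the population standard deviation of the per-part counts divided by their mean. Whether the Routledge dictionaries compute dispersion in exactly this way is not stated in the notes; the function names are hypothetical.

```python
import math


def word_range(counts):
    """Range: number of corpus parts (buckets) with at least one token."""
    return sum(1 for c in counts if c > 0)


def juilland_d(counts):
    """Juilland's D for one word, given its counts in n equal-sized parts."""
    n = len(counts)
    total = sum(counts)
    if n < 2 or total == 0:
        return 0.0
    mean = total / n
    # Population standard deviation of the per-part counts
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / n)
    return 1.0 - (sd / mean) / math.sqrt(n - 1)


def adjusted_score(counts):
    """Score = raw frequency * dispersion, as in the formula above."""
    return sum(counts) * juilland_d(counts)


even = [10, 10, 10, 10]    # same frequency, spread evenly
clumped = [40, 0, 0, 0]    # same frequency, all in one part
for counts in (even, clumped):
    print(word_range(counts), round(juilland_d(counts), 2),
          round(adjusted_score(counts), 1))
```

Both toy words have a raw frequency of 40, but the evenly spread one keeps its full score while the clumped one drops to roughly zero, which is the point of weighting frequency by dispersion.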
- Zipfian distribution: intro (actual data)
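In a Zipfian distribution, a word's frequency is roughly inversely proportional to its rank, so rank * frequency stays roughly constant down the list. A minimal sketch for checking this on any token list follows; the function name `rank_frequency` is hypothetical, and the toy data is constructed to fit the idealized pattern exactly, which real corpus data will only approximate.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


# Constructed toy data: frequencies 12, 6, 4, 3 for ranks 1 to 4
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
for rank, word, freq in rank_frequency(tokens):
    print(rank, word, freq, rank * freq)
```

On real data the last column drifts rather than staying fixed, and the long tail of words that occur only once or twice is where small corpora run out of evidence.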
4. BNC wordlists
Solutions for customized / personal corpora
WordAndPhrase (enter text)
5. Range (more) (and issue of word families "break, mean"; google "paul nation range program")
6. AntConc (c:\download\)
7. WordSmith
8. TextStat (c:\download\textStat)
Keywords
9. WordClouds.com (example from Wordle)
10. AntConc
11. WordAndPhrase.info
12. Virtual corpora (COCA: Astronomy, Wiki: Biology, GC: Speaker)
Comparisons
- Genesis vs Bible
- Book of Mormon vs Bible
- Alma vs Book of Mormon
- General Conference vs BNC
- Specific General Authorities vs General Conference
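Each comparison above pairs a target corpus with a larger reference corpus and asks which words are unusually frequent in the target. The notes do not name a keyness statistic; log-likelihood is one widely used choice and is sketched here as an assumption. The function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one word.

    a, b: frequency in the target and reference corpus
    c, d: total tokens in the target and reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in target
    e2 = d * (a + b) / (c + d)  # expected frequency in reference
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll


def keywords(target_tokens, reference_tokens, top_n=10):
    """Words relatively more frequent in the target, ranked by keyness."""
    c, d = len(target_tokens), len(reference_tokens)
    if c == 0 or d == 0:
        return []
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


target = "star star the planet".split()
reference = "the the the dog".split()
for word, score in keywords(target, reference):
    print(word, round(score, 2))
```

With real corpora the target and reference differ greatly in size, which is why the statistic works from expected frequencies rather than raw counts.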