Optional: 3-4 page paper (which includes charts and/or figures).

Send it to me (mark_davies@byu.edu) as a Word document by Friday, Dec 13, 8 AM -- sorry, no exceptions.

Pick a syntactic construction (or an issue in morphology) from the list below and then comment on variation and change with the construction.

For the phenomenon you've chosen, look at the following types of variation from the BYU corpora:

1. Genre variation: COCA and BNC
2. Dialectal variation: COCA/BNC and GloWbE
3. Historical: COHA, TIME, Google Books, and COCA (1990-2012)
(4. Lexical: what words did it start with, or change with the most)

If you want to see an overview of the phenomenon, look in our class textbook (Biber et al.). The [CHAPTER] column in the table below shows which chapter in the book discusses the construction.

Examples:

  • END up V-ING (he'll end up paying too much)
  • Help / help to (I helped John (to) clean the car)
  • HAVE as +/- main verb (have you the time?, do you have the time?)
CHAPTER TOPIC
3 Preposition stranding (He's the one I was talking to)
4 Pronouns and gender (he or she)
5 Phrasal verbs: frequency of "separated" verbs: look (up) the word (up)
6 Can / may (can / may I use the phone?)
6 Passive (was studied)
6 Get passive (got run over (vs. was run over))
6 Progressive (is watching)
6 Perfect (HAVE seen)
6 Combinations of perfect, passive, progressive (has been watching, was being considered, etc)
6 Future (going to verb / will verb)
6 Will / shall (I will / shall consider five factors...)
6 Semi-modals (need to, have to, ought to, etc.)
6 Modals: frequency of different modals
7 Comparatives (sillier / more silly)
7 Go / come ADJ (go crazy, come clean)
7 Get ADJ(ed)
8 Contraction (they simply cannot / can't do it)
8 No / not negation (I don't have any reason / I have no reason)
9 Frequency of [nn*] [nn*] (the breakfast cereal ad campaign)
10 +/- that (I guess (that) they're not coming)
10 begin / start + INF / V-ING (started watching / to watch)
  Like: and she's like "I'm not going out with him"
  so not ADJ: I'm so not going out with her / he's so not the kind of guy I like

Using Excel / ratios

To create a chart, first copy the frequency data from your web browser to Excel. If you are comparing two constructions (e.g. will/shall, or have proven/proved), then the chart should show the RATIO of the two constructions. Please don't just give the frequency of A and then B, and expect me to create a chart in my mind showing the ratio. To calculate ratios, do the following:

    Column A  
Row 1 Feature A 30 <- Cell A1
Row 2 Feature B 70 <- Cell A2
Row 3 % feature B =(A2)/(A1+A2)  

After creating the ratio formula in one cell, just copy and paste it to the other cells in that row. Then highlight the cells in that "ratio" row and choose "Insert" and then "Column" or "Line" (whichever kind of chart you want) to create the chart.
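The same ratio can also be computed outside Excel. Here is a minimal Python sketch using the illustrative counts from the table above (30 and 70). The function name share_of_b is just a label chosen for this example.

```python
def share_of_b(count_a, count_b):
    """Return feature B's share of all tokens (A + B), as a fraction."""
    total = count_a + count_b
    if total == 0:
        raise ValueError("both counts are zero; the ratio is undefined")
    return count_b / total

# Counts from the table above: Feature A = 30, Feature B = 70.
# This mirrors the Excel formula =(A2)/(A1+A2).
print(f"{share_of_b(30, 70):.0%}")  # prints 70%
```

To fill a whole "ratio" row, call the function once per column (one genre, decade, or dialect at a time), just as you would paste the Excel formula across the row.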
 


With any feature, there will be some difference between genres, time periods, or dialects. The question is whether this difference is statistically significant. To determine this, you can use chi-square. (I should mention that there are some problems with using chi-square with the types of large numbers that you get with these corpora. But we'll ignore that for the time being.)

Example #1: With +/- "to" in the construction "help someone (to) verb", the following is the data from COCA (American) and the BNC (British):

  American British
+to 2230 1581
-to 16220 3122
% -to 88% 66%

Plugging the numbers in the four yellow cells (the +to and -to counts) into the chi-square calculator, we get a "p-value" of 0 (that is, too small for the calculator to display), which is below .05. So yes, the difference is significant.
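If you would like to check the calculator's result yourself, the following is a rough Python sketch of a Pearson chi-square test for a 2 x 2 table, applied to the four counts above. It assumes no continuity correction (the online calculator may or may not apply one), and the function name chi_square_2x2 is made up for this example.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: +to, -to; columns: American (COCA), British (BNC).
stat, p = chi_square_2x2(2230, 1581, 16220, 3122)
print(f"chi-square = {stat:.1f}, p = {p:.3g}")
print("significant at .05" if p < 0.05 else "not significant at .05")
```

The p-value here is not literally zero, only far too small to show in a few decimal places, which is why the calculator reports it as 0.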

Example #2: With "going to VERB" vs. "will VERB" in the five genres of COCA, we get:

  SPOK FIC MAG NEWS ACAD
will [v*] 155791 67245 144578 182891 104482
going to [v*] 209335 46999 26512 41795 6113
% going to 57% 41% 14% 19% 6%

Plugging the numbers in the ten yellow cells (the will and going to counts) into the chi-square calculator, we get a "p-value" of 0 (again too small to display), which is again below .05. So yes, the differences are again significant.
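With more than two columns the calculation is the same, except that the degrees of freedom go up: a 2 x 5 table has (2 - 1) x (5 - 1) = 4. Below is a hedged Python sketch for the genre data above. The helper names are invented for illustration, and the p-value shortcut used here is valid only when the degrees of freedom are even.

```python
import math

def chi_square_2xk(row1, row2):
    """Pearson chi-square for a 2 x k table given as two equal-length lists."""
    col_totals = [x + y for x, y in zip(row1, row2)]
    n = sum(col_totals)
    stat = 0.0
    for row in (row1, row2):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    df = len(row1) - 1  # (2 - 1) * (k - 1)
    return stat, df

def p_value_even_df(stat, df):
    """Upper-tail chi-square probability; this closed form holds only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form needs an even number of degrees of freedom")
    half = stat / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k! step by step
        total += term
    return math.exp(-half) * total

# Columns: SPOK, FIC, MAG, NEWS, ACAD (COCA).
will = [155791, 67245, 144578, 182891, 104482]
going_to = [209335, 46999, 26512, 41795, 6113]

stat, df = chi_square_2xk(will, going_to)
p = p_value_even_df(stat, df)
print(f"chi-square = {stat:.0f}, df = {df}, p = {p:.3g}")
```

With counts this large the statistic is enormous, so the p-value underflows to 0 in ordinary floating-point arithmetic. This is one symptom of the large-number problem mentioned earlier.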

Example #3: With "accustomed to [vvi]" (accustomed to watch) vs. "accustomed to [vvg]" (accustomed to watching) in the different decades of the TIME Corpus, we get:

  1920s 1930s 1940s 1950s 1960s 1970s 1980s 1990s 2000s
V 36 64 23 10 6 11 5 5 2
V-ing 17 31 38 48 32 44 30 28 8
%V-ing 32% 33% 62% 83% 84% 80% 86% 85% 80%

 

If you plug in the numbers from the yellow cells (the V and V-ing counts) into the chi-square calculator, we once more get a "p-value" of 0, which is again significant. This makes sense, because there is a big increase in V-ing from the 1930s to the 1950s. But if we include only the numbers from the 1950s-2000s, then the p-value increases to .98, which is not below .05, and therefore the difference is not statistically significant.
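The comparison between all nine decades and just the 1950s-2000s can be sketched in Python as follows. This is only an illustration under the same assumptions as a plain Pearson chi-square test. The function names are hypothetical, and the p-value is computed from standard closed-form sums for the chi-square distribution.

```python
import math

def chi_square(rows):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df

def chi_square_p(stat, df):
    """Upper-tail probability of the chi-square distribution for integer df."""
    half = stat / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum of x^(k-1) / (2k-1)!!
    term, total = 1.0, 0.0
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= stat / (2 * k - 1)
        total += term
    tail = math.sqrt(2 * stat / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + tail

# TIME Corpus counts, 1920s through 2000s.
v = [36, 64, 23, 10, 6, 11, 5, 5, 2]
v_ing = [17, 31, 38, 48, 32, 44, 30, 28, 8]

stat_all, df_all = chi_square([v, v_ing])
p_all = chi_square_p(stat_all, df_all)

# Keep only the 1950s-2000s (the last six decades).
stat_late, df_late = chi_square([v[3:], v_ing[3:]])
p_late = chi_square_p(stat_late, df_late)

print(f"all decades: df = {df_all}, p = {p_all:.3g}")
print(f"1950s-2000s: df = {df_late}, p = {p_late:.2f}")
```

Dropping the three earliest decades removes exactly the period in which the shift happened, so what remains is nearly flat and the test no longer finds a difference.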
 


SPREADSHEET