CORPUS DESIGN AND CONSTRUCTION

Note: you are only responsible for the Hunston reading for the quiz. I'll be using the Kennedy material for class discussion, but you won't be reading it.


Hunston, p.25-37

1. Size: is there an ideal size for a corpus? What determines what size a corpus should be?
2. Content: what is the importance of the following two factors: what is available, and what the corpus is going to be used for?
3. Representativeness: explain the meaning of this statement from Hunston: "being representative inevitably involves knowing what the character of the 'whole' is".
4. Permanence: what does this mean? Why is it important?
5. Discuss briefly the different ways in which a corpus might be used (p33-37), and how this relates to corpus design.


Kennedy, p.60-70; 70-82

1. What is the difference between a "static" and a "dynamic" corpus?  When is each most appropriate, and what are the advantages and disadvantages of each type?
2. What is the relationship between "representativity" and being able to make generalizations from the corpus?
3. What are some practical ways that one can decide the percentage of the different text types? (p72)
4. Atkins mentions several extralinguistic variables that one might want to use in designing a corpus (p74, par 2). Which of these seem the most important to you? (text origin, participants, medium, style, setting, factuality, topic, date, authorship (nationality, gender, age), intended audience)
5. Consider the pros and cons of each of the following approaches to selecting texts (p63): 1) elitist 2) random 3) currency 4) typicalness 5) availability 6) demographic
6. What are some concrete examples of adequately "sampling" the data? (p74-75)
7. How do you define a "native speaker" of a particular variety of a language?
8. What are some dangers inherent in using newspapers to create a corpus?
9. Give some concrete examples of what size of corpus is best for certain types of linguistic analyses.
10. How big should individual samples be?
11. Give some concrete examples that support the idea that "the optimal design of a corpus is highly dependent on the purpose for which it is intended to be used. The compiler of a corpus should, if possible, have a clear idea of what kinds of analyses are likely to be undertaken..."
12. Think of some concrete cases where it might be difficult to categorize texts.
13. Give specific examples of how "the practical problems associated with handling large amounts of text should never be underestimated".
14. Name a couple of general principles in terms of storing and keeping records.
15. What are the three main ways of getting text into a corpus? Which is probably most common nowadays?
16. What are some special problems associated with collecting and using spoken language?
17. Name a couple of general principles in terms of getting permission to use texts and transcripts.
18. Take a five-minute look at the principles of the Fair Use Law.  Think of 2-3 specific ways this may affect your collection of texts.


Links:

Brown design
BNC design (Excel)
Helsinki corpus

Corpus del Español
Corpus do Português
Corpus of Historical English

Fair Use Law
Google Print

BN: "sample chapter"

  • Questions on COCA?
  • Any suggestions on COHA?

Tools:

JusText
Grab-A-Site (Class site)
HTML to ASCII (HTMASC)
PDF to Word to ASCII conversion
VB.NET and databases
(Number of files)