CORPUS DESIGN AND CONSTRUCTION
Note: you are only responsible for the
Hunston reading for the quiz. I'll be using the Kennedy material for
class discussion, but you won't be reading it.
1. Size: is there an ideal size for a corpus?
What determines what size a corpus should be?
2. Content: what is the importance of the following two factors: what is
available, and what the corpus is going to be used for?
3. Representativeness: explain the meaning of this statement from Hunston: "being
representative inevitably involves knowing what the character of the whole is".
4. Permanence: what does this mean? Why is it important?
5. Discuss briefly the different ways in which a corpus might be used (p33-37),
and how this relates to corpus design.
1. What is the difference between a "static"
and a "dynamic" corpus? When is each most appropriate, and what are the
advantages and disadvantages of each type?
2. What is the relationship between "representativity" and being able to make
generalizations from the corpus?
3. What are some practical ways that one can decide the percentage of the
different text types? (p72)
4. Atkins mentions several extralinguistic variables that one might want to use
in designing a corpus (p74, par 2). Which of these seem the most important to
you? (text origin, participants, medium, style, setting,
factuality, topic, date, authorship (nationality, gender, age), intended
5. Consider the pros and cons to each of the following approaches to selecting
texts (p63): 1) elitist 2) random 3) currency 4) typicalness 5) availability 6)
6. What are some concrete examples of adequately "sampling" the data? (p74-75)
7. How do you define a "native speaker" of a particular variety of a language?
8. What are some dangers inherent in using newspapers to create a corpus?
9. Give some concrete examples of what size of corpus is best for certain types
of linguistic analyses.
10. How big should individual samples be?
11. Give some concrete examples that support the idea that "the optimal design
of a corpus is highly dependent on the purpose for which it is intended to be
used. The compiler of a corpus should, if possible, have a clear idea of what
kinds of analyses are likely to be undertaken..."
12. Think of some concrete cases where it might be difficult to categorize texts
13. Give specific examples of how "the practical problems associated with
handling large amounts of text should never be underestimated".
14. Name a couple of general principles in terms of storing and keeping records
15. What are the three main ways of getting text into a corpus? Which is
probably most common nowadays?
16. What are some special problems associated with collecting and using spoken
17. Name a couple of general principles in terms of getting permission to use
texts and transcripts
18. Take a five-minute look at the principles of the Fair Use Law. Think of 2-3
specific ways this may affect your collection of texts.
BNC design (Excel)
Corpus del Español
Corpus do Português
Corpus of Historical English
Fair Use Law
BN: "sample chapter"
- Questions on COCA?
- Any suggestions on COHA?
HTML to ASCII (HTMASC)
PDF to Word to ASCII conversion
VB.NET and databases
(Number of files)
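
For the HTML-to-ASCII step listed above, here is a minimal sketch of the general idea in Python. It is not HTMASC itself, and the names TextExtractor and html_to_text and the sample string are made up for illustration. It assumes the pages are reasonably well formed, and it keeps accented characters rather than forcing everything to strict ASCII, since dropping them would damage Spanish or Portuguese text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping scripts and styles."""

    SKIP = {"script", "style"}
    # Tags that normally separate words or lines when a page is rendered.
    BLOCK = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Corpus &amp; design</p><script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

A real collection run would loop this over every downloaded file and keep a record of each source, which ties back to the questions on storing and keeping records.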