PROJECT: DESIGNING A CORPUS
Imagine that BYU has asked you to create a corpus of a particular language, or a
specialized corpus (by register, historical period, topic, etc.) of a given
language. Share with the class how you would take into account the following
issues and features, and explain why you made the decisions you did. (Note: these
considerations roughly follow the outline of the article.)
1. Is it an archive, an electronic text library, a corpus, or a sub-corpus?
2. What types of written and/or spoken texts will be in the corpus?
3. More specifically, briefly discuss the following characteristics of your
corpus: mode, text origin, constitution, medium, style, topic, date, and
author(s).
4. How will you distribute it to others?
5. What types of annotation will there be (tagging, text identification, etc.)?
6. More specifically, what information about each text will be included in the
header, index, or source files?
7. Will it be grammatically tagged? Why or why not?
8. How will you handle the following types of text features: (for written texts)
non-ASCII characters, quotations, lists, headings, proper names, and pagination;
(for spoken texts) speaker change, syntax, accent/dialect, interruptions, pauses,
and inaudible segments?
9. What are some copyright problems that you might face? How will you deal with
these?
10. How representative will your corpus be of the entire population (i.e., all
possible texts)? What steps will you take to create a representative corpus?
11. Who will be the main users of your corpus? What types of information will
they likely be looking for?