PROJECT: DESIGNING A CORPUS
 

Imagine that BYU has asked you to create a corpus of a particular language, or a specialized corpus (by register, historical period, topic, etc.) of a given language. Share with the class how you would address the following issues and features, and explain why you made each decision. (Note: these considerations roughly follow the outline of the article.)

1. Is it an archive, an electronic text library, a corpus, or a sub-corpus?
2. What types of written and/or spoken texts will be in the corpus?
3. More specifically, briefly discuss the following characteristics of your corpus: mode, text origin, constitution, medium, style, topic, date, and author(s).
4. How will you distribute it to others?
5. What types of annotation will there be (tagging, text identification, etc.)?
6. More specifically, what information about each text will be included in the header, index, or source files?
7. Will it be grammatically tagged? Why or why not?
8. How will you handle the following text features? For written texts: non-ASCII characters, quotations, lists, headings, proper names, and pagination. For spoken texts: speaker change, syntax, accent/dialect, interruptions, pauses, and inaudible segments.
9. What copyright problems might you face? How will you deal with them?
10. How representative will your corpus be of the entire population (i.e., all possible texts)? What steps will you take to make the corpus representative?
11. Who will be the main users of your corpus? What types of information will they likely be looking for?