| RELATIONAL DATABASES Davies, Mark (2005) "The advantage of using relational databases for large corpora: speed, advanced queries, and unlimited annotation". International Journal of Corpus Linguistics 10: 301-28 (Via Corpora page) 1. To get a basic overview of SQL ("Standard Query Language"), look at Sections 1-3 of the following site. At the bottom of the page there are a few activities to see if you've captured the basic idea. http://sqlcourse.com/select.html By the way, here's some more links, if you want them: http://www.w3schools.com/sql/sql_select.asp http://www.1keydata.com/sql/sql.html 2. Go to the following page and run some SQL queries against a simple BNC database that I've created. The password is the same that we've used to access the corpora. In order to run queries that make sense, you'll need to know something about the two tables we'll be using this next week. The first deals with just single words and their frequency. Here's the structure of the table. W1 refers to the word form, C1 is part of speech ("category"), and L1 is lemma. ONE WORD [TABLE NAME = Y]
Let's try an actual query. Here's one that finds the five most common VVI (infinitival form of the verb) and their frequency for verbs that start with [in]. You'd type this into the [SQL] box on the web form (just copy and paste this):
(make sure that you also select [ONE WORD]) The following words are displayed:
There is also a "seven word sequence" table. This contains the main word in the [w4] slot, with three "slots" to the left (w1, w2, and w3, along with their part of speech tags c1, c2, c3) and the three words to the right (w5, w6, w7). If you want to find the most common noun following [sad], then you'd have [sad] as the [w4] and [nn%] as the [c5] MULTIPLE WORD (TABLE NAME = X]
This gets a bit more tricky. Suppose that you want to find all cases of [heart of] followed by a noun. Input the following query, and make sure you select the [MULTIPLE WORD] table:
This gives us 878 individual occurrences, and we could count up all of the different nouns in the [w6] slot to see the most frequent noun after [heart of]. Luckily, though, we can have the database do this for us, via the following command:
This groups together the different words in the [w6] slot, and shows us how frequent each one is.
By the end of class on Wednesday, you should be able to create SQL queries to find the following: SINGLE WORD TABLE:
MULTIPLE WORD TABLE:
COUNT THE FOLLOWING:
ADDITIONAL PRACTICE: SINGLE WORD TABLE 1. The three most frequent forms of the lemma drink 2. The five most frequent lexical verbs that are a "base" form of the verb (VVI, VVB), by individual entry (e.g. see-VVI and see-VVB are separate entries) 3. The five most frequent forms of the verbal lemmas (smack, smash, hit, crash) that end in -ed 4. The five most frequent nouns (NN*), grouped by lemma (i.e. all forms of man grouped together) 5. The number of words starting with dis-, which occur more than 100 times each ADDITIONAL PRACTICE: MULTIPLE WORD (NGRAM) TABLE 6. The number of occurrences of get out 7. The five most frequent words occurring after the phrase feel the __ 8. The five most common nouns after the phrase had always __ 9. The five most common adjectives after any form of the lemma feel (feel, feels, feeling, felt) 10. The number of different combinations of #9, which occur at least 100 times each TABLE JOINS 1. BNC: Genres (bnc.dbo.sources table)
2. Corpus del Espaņol: synonyms 3. Corpus del Espaņol: user-defined lists 4. BNC: WordNet info 5. BNC: Comparing words via multiple queries
|