RELATIONAL DATABASES
 

Davies, Mark (2005) "The advantage of using relational databases for large corpora: speed, advanced queries, and unlimited annotation". International Journal of Corpus Linguistics 10: 301-28 (Via Corpora page)

1. To get a basic overview of SQL ("Standard Query Language"), look at Sections 1-3 of the following site.  At the bottom of the page there are a few activities to see if you've captured the basic idea. 

http://sqlcourse.com/select.html

By the way, here's some more links, if you want them:

http://www.w3schools.com/sql/sql_select.asp

http://www.1keydata.com/sql/sql.html

http://sqlzoo.net/

2. Go to the following page and run some SQL queries against a simple BNC database that I've created.  The password is the same that we've used to access the corpora.

http://view.byu.edu/sql/


In order to run queries that make sense, you'll need to know something about the two tables we'll be using this next week.  The first deals with just single words and their frequency.  Here's the structure of the table.  W1 refers to the word form, C1 is part of speech ("category"), and L1 is lemma.

ONE WORD [TABLE NAME = Y]

COLUMN NAME freq w1 L1 c1
EXPLANATION frequency word lemma part of speech
EXAMPLE 1938 jumped jump VVD

Let's try an actual query.  Here's one that finds the five most common VVI (infinitival form of the verb) and their frequency for verbs that start with [in].  You'd type this into the [SQL] box on the web form (just copy and paste this):

select top 30 freq,w1 from y where
w1 like 'in%' and
c1 = 'vvi'
order by freq desc

(make sure that you also select [ONE WORD])

The following words are displayed:

4747 included
2715 insisted
1648 involved
1615 introduced
1607 indicated
. . . .

BASIC SYNTAX:

select (top __) [column names]
from [table name]
where [conditions]
          = <>
          like, not like
          in, not in

order by [column names asc/desc]

COUNTING

select count (*)
from [table name]
where [conditions]
 


There is also a "seven word sequence" table.  This contains the main word in the [w4] slot, with three "slots" to the left (w1, w2, and w3, along with their part of speech tags c1, c2, c3) and the three words to the right (w5, w6, w7).  If you want to find the most common noun following [sad], then you'd have [sad] as the [w4] and [nn%] as the [c5]

MULTIPLE WORD (TABLE NAME = X]

COLUMN NAME ID t ....(w1) w3 c3 w4 c4 w5 c5 ...(w7)
EXPLANATION   text   word3 POS3 word4 pos4 word5 POS5  
EXAMPLE 2345 A01   I PNP like VVB chocolate NN1  
  2346 A01   like VVB chocolate NN1 kisses NN2  
  2347 A01   chocolate NN1 kisses NN2 on PRP  

This gets a bit more tricky.  Suppose that you want to find all cases of [heart of] followed by a noun.  Input the following query, and make sure you select the [MULTIPLE WORD] table:

select *
from x where
w4 = 'heart' and w5 = 'of' and c6 like 'n%'

This gives us 878 individual occurrences, and we could count up all of the different nouns in the [w6] slot to see the most frequent noun after [heart of].  Luckily, though, we can have the database do this for us, via the following command:

select count(*),w6
from x where
w4 = 'heart' and w5 = 'of' and c6 like 'nn%'
group by w6
order by count(*) desc

This groups together the different words in the [w6] slot, and shows us how frequent each one is.

BASIC SYNTAX FOR GROUPING: (HELP 1, HELP 2, HELP 3)

select (top __) [column names]
from [table name]
where [conditions]
group by [column names]
having [conditions]
order by [column names asc/desc]


By the end of class on Wednesday, you should be able to create SQL queries to find the following:

    SINGLE WORD TABLE:

1) singular nouns (NN1)

2) words starting with 'des'

3) word ending in [able] that are adjectives (AJ0]

4) members of the lemma [jump]

5) members of the lemma [hit] that are verbs

    MULTIPLE WORD TABLE:

6) words (any part of speech) that are followed by [reason]

7) prepositions (in the first word slot) followed by [the], followed by [way]

8) nouns following [ridiculous]

9) adjectives preceding [time]

10) nouns within two slots to the right of verbal forms of [miss]

    COUNT THE FOLLOWING:

11) How many nouns ending in [-tion] occur 100 times or more?

12) How many past participles [VVN] occur after [have,has,had] 50 times or more?


ADDITIONAL PRACTICE: SINGLE WORD TABLE

1. The three most frequent forms of the lemma drink

2. The five most frequent lexical verbs that are a "base" form of the verb (VVI, VVB), by individual entry (e.g. see-VVI and see-VVB are separate entries)

3. The five most frequent forms of the verbal lemmas (smack, smash, hit, crash) that end in -ed

4. The five most frequent nouns (NN*), grouped by lemma (i.e. all forms of man grouped together)

5. The number of words starting with dis-, which occur more than 100 times each


ADDITIONAL PRACTICE: MULTIPLE WORD (NGRAM) TABLE

6. The number of occurrences of get out

7. The five most frequent words occurring after the phrase feel the __

8. The five most common nouns after the phrase had always __

9. The five most common adjectives after any form of the lemma feel (feel, feels, feeling, felt)

10. The number of different combinations of #9, which occur at least 100 times each


TABLE JOINS

1. BNC: Genres

(bnc.dbo.sources table)

ID t gennum genre keywords gender style date
  text num genre number genre name        
1
A00
66
W_non_ac_medicine
Health; Sex
mixed
med
1985-1994
2
A01
66
W_non_ac_medicine
AIDS; disease; sex
mixed
low
1985-1994
3
A02
51
W_institut_doc
Health; Disease; Finance
mixed
med
1985-1994
4
A03
67
W_pop_lore
human rights; prisons; politics
mixed
med
1985-1994
5
A04
60
W_ac_humanities_arts
art criticism; art history
mixed
high
1985-1994

select count(*),x.w6
from x,bnc.dbo.sources as sources where
x.w4 in ('break','breaks','broken','breaking','broke') and
x.w5 = 'the' and
x.c6 like 'n%' and
sources.genre like '%fic%' and
x.t = sources.t
group by x.w6
order by count(*) desc

2. Corpus del Espaņol: synonyms

3. Corpus del Espaņol: user-defined lists

4. BNC: WordNet info

5. BNC: Comparing words via multiple queries


 

BNC part of speech codes