A Corpus of 21st Century Scots Texts
Collocation
This is an experimental collocation tool. In corpus linguistics, a collocation is a series of
words or terms that co-occur more often than would be expected by chance.
Hopefully it can be used to find and verify binomials. It turns out that I had misunderstood the
term binomial: I assumed it meant any two words that frequently occur together, like dinna and
ken or sair and fecht. But the first definition on Wikipedia is that they are a
sequence of two or more words or phrases in the same grammatical category, having some semantic
relationship and joined by some syntactic device, for example rock and roll or
bits and pieces.
Anyhoo, this tool should just about work for the task.
List words in a two word radius of canny
canny occurs 62 times in the corpus.
Total number of unique words within a two word radius of canny = 120
Locus is the original search term,
Occurrences is the number of times the word is found near the locus in the corpus,
Authors is the number of different authors who have used these two words together
(so you know if it's just one person who really likes a specific phrase),
Normalised is how often we have found this combination of two
words per 100,000 words, so we can compare it to chance occurrences of the word
(I'm not sure if that makes sense).
If the pairing only occurs once, I've ignored it, as a single occurrence shows no pattern
(although I have my doubts about this reasoning).
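The columns described above can be sketched roughly as follows. This is only an illustration, not the code behind this tool: the function name `collocates`, the `(author, tokens)` input layout and the choice of denominator are all assumptions. The denominator used here is the total number of words collected in the windows around the locus, which happens to agree with the simmit figures further down (2 occurrences in a 21 word target corpus gives 9523.8).

```python
from collections import defaultdict

def collocates(texts, locus, radius=2, per=100_000):
    """Build collocation rows for `locus`.

    `texts` is a hypothetical layout: an iterable of (author, tokens) pairs,
    where tokens is a list of lower-cased words.
    """
    occurrences = defaultdict(int)  # neighbour word -> times seen near the locus
    authors = defaultdict(set)      # neighbour word -> authors who used the pairing
    window_total = 0                # number of words collected around the locus

    for author, tokens in texts:
        for i, token in enumerate(tokens):
            if token != locus:
                continue
            lo = max(0, i - radius)
            hi = min(len(tokens), i + radius + 1)
            for j in range(lo, hi):
                if j == i:
                    continue  # skip the locus itself
                occurrences[tokens[j]] += 1
                authors[tokens[j]].add(author)
                window_total += 1

    rows = []
    for word, count in occurrences.items():
        if count < 2:
            continue  # pairings seen only once are ignored
        # Assumed normalisation: occurrences per 100,000 words of window text.
        normalised = round(count / window_total * per, 1)
        rows.append((locus, word, count, len(authors[word]), normalised))

    # Most frequent pairings first, ties broken alphabetically.
    rows.sort(key=lambda row: (-row[2], row[1]))
    return rows
```

Called with a handful of tokenised texts, it returns tuples in the same column order as the table: locus, word, occurrences, authors, normalised.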
Locus | Word | Occurrences | Authors | Normalised |
---|---|---|---|---|
canny | ah | 10 | 4 | 3968.3 |
canny | a | 8 | 8 | 3174.6 |
canny | the | 7 | 6 | 2777.8 |
canny | and | 6 | 6 | 2381 |
canny | but | 5 | 5 | 1984.1 |
canny | be | 4 | 4 | 1587.3 |
canny | tae | 4 | 4 | 1587.3 |
canny | nae | 4 | 4 | 1587.3 |
canny | we | 4 | 3 | 1587.3 |
canny | ca | 4 | 4 | 1587.3 |
canny | ma | 3 | 2 | 1190.5 |
canny | understaun | 3 | 1 | 1190.5 |
canny | ye | 3 | 2 | 1190.5 |
canny | it | 3 | 3 | 1190.5 |
canny | wi | 3 | 3 | 1190.5 |
canny | get | 3 | 3 | 1190.5 |
canny | douglas | 2 | 1 | 793.7 |
canny | man | 2 | 2 | 793.7 |
canny | why | 2 | 1 | 793.7 |
canny | o | 2 | 2 | 793.7 |
canny | her | 2 | 2 | 793.7 |
canny | fowk | 2 | 2 | 793.7 |
canny | go | 2 | 2 | 793.7 |
canny | wis | 2 | 2 | 793.7 |
canny | believe | 2 | 2 | 793.7 |
canny | an | 2 | 2 | 793.7 |
canny | like | 2 | 2 | 793.7 |
Please bear in mind that this corpus is less than a million words in size. If
your favourite binomials don't occur, it's likely to be because our sample size
isn't big enough rather than because they are invalid.
List words in a five word radius of simmit
In English there is the phrase 'in sickness and in health', where
'sickness' and 'health' form the binomial, but they are outwith a radius of two
words from each other. The two word radius list above would miss them, so we
need to look further afield. Below is a table of words in a five word radius
of the locus to find similar binomials in this Scots corpus.
simmit occurs 4 times in the corpus.
Total number of words in this target corpus = 21
Total number of unique words within a five word radius of simmit = 16
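To make the point about the radius concrete, here is a small sketch using the English phrase quoted above. The helper `within_radius` is hypothetical and simply collects the words at most `radius` positions either side of the locus; it is not taken from this tool.

```python
def within_radius(tokens, locus, radius):
    """Return the words found at most `radius` positions from any occurrence of `locus`."""
    found = []
    for i, token in enumerate(tokens):
        if token == locus:
            lo = max(0, i - radius)
            # Words before the locus, then words after it; the locus itself is left out.
            found.extend(tokens[lo:i] + tokens[i + 1:i + radius + 1])
    return found

phrase = "in sickness and in health".split()
print("health" in within_radius(phrase, "sickness", 2))  # False: 'health' is three words away
print("health" in within_radius(phrase, "sickness", 5))  # True
```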
Locus | Word | Occurrences | Authors | Normalised |
---|---|---|---|---|
simmit | his | 2 | 2 | 9523.8 |
The Twitter side of the corpus is ignored at this point.
The regex search terms are a bit more complicated there because of hashtags and handles.
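As a rough idea of what the extra complication might look like, the sketch below builds a pattern that matches a search term as a whole word while skipping it when it is part of a hashtag or handle. The function `word_pattern` and the sample tweet are made up for illustration, and this is a guess at the kind of problem involved rather than the regex this tool would actually need.

```python
import re

def word_pattern(word):
    """Match `word` as a whole word, but not inside a #hashtag or @handle (hypothetical helper)."""
    # (?<![#@\w]) : not preceded by '#', '@' or another word character
    # (?!\w)      : not followed by another word character
    return re.compile(rf"(?<![#@\w]){re.escape(word)}(?!\w)", re.IGNORECASE)

tweet = "Ah canny believe it #canny @canny_lass"  # invented example text
print(word_pattern("canny").findall(tweet))  # ['canny']
```

Only the plain word is matched; the hashtag and the handle are left alone, so they would not inflate the occurrence counts.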