A Corpus of 21st Century Scots Texts



This is an experimental collocation tool. In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance.
Hopefully it can be used to find and verify binomials. It turns out that I had misunderstood the term binomial: I assumed it meant any two words that frequently occur together, like dinna and ken or sair and fecht. But the first definition on Wikipedia is a sequence of two or more words or phrases in the same grammatical category, having some semantic relationship and joined by some syntactic device, for example rock and roll or bits and pieces.

Anyhoo, this tool should just about work for the task.

Enter a word to try this on, for example 'fash' as in 'dinna fash'.

List words in a two word radius of canny

canny occurs 48 times in the corpus.

Total number of unique words within a two word radius of canny = 98

Locus is the original search term. Word is the word found near it. Occurrences is the number of times the word is found near the locus in the corpus. Authors is the number of different authors who have used these two words together (so you know if it's just one writer who really likes a specific phrase). Normalised is how often this combination of two words was found per 100,000 words, so that it can be compared with chance occurrences of the word (I'm not sure if that makes sense).

If the pairing only occurs once, I've ignored it as there is clearly no pattern (although I have my doubts about this reasoning).
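The columns described above can be sketched in Python. This is only a rough illustration, not the code behind this page: the function name `collocates`, the `(author, tokens)` input shape, the three-line toy corpus and the choice of the total corpus word count as the denominator for Normalised are all assumptions.

```python
from collections import Counter, defaultdict

def collocates(texts, locus, radius=2, per=100_000):
    """texts is a list of (author, tokens) pairs (a hypothetical input shape)."""
    pair_counts = Counter()           # word -> times seen near the locus
    pair_authors = defaultdict(set)   # word -> authors who used the pairing
    total_words = 0
    locus_count = 0
    for author, tokens in texts:
        total_words += len(tokens)
        for i, tok in enumerate(tokens):
            if tok != locus:
                continue
            locus_count += 1
            lo = max(0, i - radius)
            hi = min(len(tokens), i + radius + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[tokens[j]] += 1
                    pair_authors[tokens[j]].add(author)
    rows = []
    for word, n in pair_counts.most_common():
        if n < 2:
            continue  # pairings seen only once are ignored
        # Assumption: normalise against the total number of words in the corpus.
        rows.append((locus, word, n, len(pair_authors[word]), n * per / total_words))
    return locus_count, len(pair_counts), rows

# Toy corpus, invented purely for illustration.
texts = [
    ("author_a", "ye maun ca canny wi that".split()),
    ("author_b", "ca canny noo and ca canny the morn".split()),
]
locus_count, unique_words, rows = collocates(texts, "canny")
for row in rows:
    print(row)
```

Each printed row follows the table layout: Locus, Word, Occurrences, Authors, Normalised. On a corpus this tiny the normalised figures are meaninglessly large; they only become comparable on something closer to the real corpus size.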

Locus | Word | Occurrences | Authors | Normalised
Please bear in mind that this corpus is less than a million words in size. If your favourite binomials don't occur, it's likely to be because our sample size isn't big enough rather than because they are invalid.

List words in a five word radius of simmit

In English there is the phrase 'in sickness and in health', where 'sickness' and 'health' form the binomial, but they are outwith a radius of two words from each other. The two word radius list above would miss them, so we need to look further afield. Below is a table of words in a five word radius of the locus to find similar binomials in this Scots corpus.

simmit occurs 2 times in the corpus.
Total number of words in this target corpus = 9
Total number of unique words within a five word radius of simmit = 6
Locus | Word | Occurrences | Authors | Normalised
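The reason for the wider radius can be checked with a small Python sketch using the English example. The helper `within_radius` is a made-up name for illustration and is not taken from this tool.

```python
def within_radius(tokens, locus, word, radius):
    """True if `word` occurs within `radius` tokens of any occurrence of `locus`."""
    positions = [i for i, t in enumerate(tokens) if t == locus]
    return any(
        tokens[j] == word
        for i in positions
        for j in range(max(0, i - radius), min(len(tokens), i + radius + 1))
        if j != i
    )

tokens = "in sickness and in health".split()
near_2 = within_radius(tokens, "sickness", "health", 2)  # 'health' is three tokens away
near_5 = within_radius(tokens, "sickness", "health", 5)
print(near_2, near_5)
```

'health' sits three tokens after 'sickness', so the two word radius misses the pairing and the five word radius catches it.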

The Twitter side of the corpus is ignored at this point.

The regex search terms are a bit more complicated there because of hashtags and handles.
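As a rough idea of the kind of complication involved, here is a hypothetical Python pattern that lets a search term match as a plain word or as a hashtag, without matching inside a longer word or a handle. The function name `search_pattern` and the example tweet are invented, and the real search terms used for the corpus may look quite different.

```python
import re

def search_pattern(term):
    # Optional leading '#' or '@', then the literal term,
    # not preceded or followed by further word characters.
    return re.compile(
        r"(?<![\w#@])[#@]?" + re.escape(term) + r"(?!\w)",
        re.IGNORECASE,
    )

# Invented example text, not taken from the corpus.
tweet = "Dinna fash yersel, @fash_fan! #fash is trending, unfashed."
matches = search_pattern("fash").findall(tweet)
print(matches)
```

Here the plain word and the hashtag are found, while the handle '@fash_fan' and the longer word 'unfashed' are skipped.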