It is hoped that this website will help in determining the spelling and usage of Scots words from how they are actually written today, rather than by resurrecting archaic terms from other Scots dictionaries and transplanting them into inappropriate dialects.
Texts have been acquired and scraped from websites, journals, Twitter and books, with no regard to copyright, and tagged into six different Scots dialects: Central, Doric, Ulster Scots, Shetland, Orkney and Southern.
On the letter pages, words are listed in alphabetical order, with the number of occurrences in each dialect and the number of authors recorded as having used the word. Words used by fewer than two authors, or with fewer than five occurrences, are ignored.
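The two thresholds can be sketched in Python as follows. This is only an illustration and not the site's own code, which is written in Perl. The function name `filter_words`, the `(word, dialect, author)` input format and the sample data are all assumptions made for the example.

```python
from collections import Counter, defaultdict

MIN_AUTHORS = 2      # words used by fewer than two authors are ignored
MIN_OCCURRENCES = 5  # words with fewer than five occurrences are ignored


def filter_words(tokens):
    """Tally (word, dialect, author) tokens and apply the two thresholds.

    Returns {word: (per-dialect Counter, number of authors)} in
    alphabetical order, keeping only words that pass both thresholds.
    """
    per_dialect = defaultdict(Counter)  # word -> dialect -> occurrences
    authors = defaultdict(set)          # word -> set of authors who used it
    for word, dialect, author in tokens:
        per_dialect[word][dialect] += 1
        authors[word].add(author)

    kept = {}
    for word in sorted(per_dialect):
        total = sum(per_dialect[word].values())
        if len(authors[word]) >= MIN_AUTHORS and total >= MIN_OCCURRENCES:
            kept[word] = (per_dialect[word], len(authors[word]))
    return kept


# Made-up sample data, purely for illustration (hypothetical authors).
sample = (
    [("nearby", "Central", "Author A")] * 3
    + [("nearby", "Doric", "Author B")] * 2
    + [("naurby", "Central", "Author C")] * 6
)
result = filter_words(sample)
```

In this made-up sample, "nearby" is kept because it has five occurrences from two authors, while "naurby" is dropped because all six of its occurrences come from a single author.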
For the purposes of this corpus, Contemporary Scots refers to the Scots language spoken and written in the 21st century, as distinct from Modern Scots, which covers 1700 to the present day. This stems from the mistaken assumption that Modern Scots would include words for touchscreen smartphones and fax machines rather than timorous beasties and races of puddins.
Scots language enthusiasts gathered to try to fix the Scots Wikipedia, deleting thousands of low-quality pages and creating better-written new ones. But this effort was undermined by a poor understanding of spellings: pages would be well written in one dialect, then edited by speakers of other dialects, or by non-speakers who sought to fix words that didn't look Scottish enough. For example, a user called CanadianToast changed my spelling of "nearby" to "naurby".
There are many online language resources catering to Scots: the DSL dictionary, which doesn't list "naurby" as a word, and the SCOTS corpus of modern Scots, which has no occurrences of "naurby". In fact the only occurrences of the word seem to be on Twitter in the Californian Scots dialect, which seems an inappropriate choice of dialect for Wikipedia. How would CanadianToast, being thousands of miles away from Scotland (and California), be expected to know that no Scotsman has ever used that spelling, and that "nearby" is a perfectly fine Scots word used by Scots writers for centuries?
Perhaps CanadianToast has a Scottish friend they could ask, but this bears the risk that their friend is dreadful at spelling, doesn't know which dialect uses which spelling, or holds a minority view about the nature of spelling and usage in Scots. Who decides if a word is Scots or not? Clearly, a plurality of writers.
This is the sort of situation that this corpus of contemporary Scots texts hopes to resolve.
The creator of this website is not an academic and, before August 2020, didn't have much interest in word frequencies. Yet here we are with a database of several hundred texts and 600,000 words, including more than 50,000 unique words, from over a hundred authors.
It is hoped that this use of the texts is an acceptable way to route round copyright laws, since the texts themselves aren't public facing, merely the individual words and snippets.
<corpus>
<article id="000124">
<date>01-11-2002</date> (only the year is detected programmatically)
<author>
<person name="Fairnie, Robert"/>
</author>
<leid>Scots</leid> (from: English, Scots, Gaelic)
<byleid>Lallans</byleid> (from: Lallans, Doric, Orkney, Shetland, Ulster, Southern)
<subleid>Glesga</subleid> (from: Glasgow, Northern, Borders, etc if more nuance is detectable)
<genre>newspaper</genre> (from: prose, poetry, government, newspaper, weans, blog, twitter, academic)
<source>digital</source> (from: digital, print)
<publisher>Self-published</publisher>
<title>Scots Tung WITTINS Nummber 108</title>
<text>
Ipsum lori
</text>
</article>
</corpus>
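As a rough sketch of how a record like the one above could be read, here is a short Python example using the standard library's `xml.etree.ElementTree`. It is not the site's own code (which is written in Perl); the function name `read_articles`, the choice of fields and the way the year is picked out of the date are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Cut-down copy of the record above, without the bracketed notes.
RECORD = """<corpus>
<article id="000124">
<date>01-11-2002</date>
<author>
<person name="Fairnie, Robert"/>
</author>
<leid>Scots</leid>
<byleid>Lallans</byleid>
<genre>newspaper</genre>
<title>Scots Tung WITTINS Nummber 108</title>
<text>
Ipsum lori
</text>
</article>
</corpus>"""


def read_articles(xml_string):
    """Yield one dict of metadata per <article> element."""
    root = ET.fromstring(xml_string)
    for article in root.iter("article"):
        date = article.findtext("date", default="")
        # Assumption: the year is the first run of four digits in the date.
        year = re.search(r"\d{4}", date)
        yield {
            "id": article.get("id"),
            "year": int(year.group()) if year else None,
            "authors": [p.get("name") for p in article.iter("person")],
            "byleid": article.findtext("byleid"),
            "genre": article.findtext("genre"),
            "text": (article.findtext("text") or "").strip(),
        }


articles = list(read_articles(RECORD))
```

Each resulting dict carries the dialect (`byleid`), the authors and the year alongside the text, which is the information needed to count a word per dialect and per author.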
These text and CSV files were then parsed using custom Perl scripts that just count words and churn out the various data.
A CSV file of the word frequencies for the entire corpus can be downloaded here (right-click and choose "Save as").
This file can be opened in any modern spreadsheet software. Be wary: it is more than 50,000 rows long, so even simple sort operations can take a while.
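If a spreadsheet struggles with the sorting, a few lines of Python can do the same job. The sketch below is only a suggestion: the column names (`word`, `total`, `authors`) and the sample rows are made up for the example, and the real file's layout may well differ, so check its header row first.

```python
import csv
import io

# Stand-in for the downloaded file. The column names and the numbers
# are invented for illustration and are not taken from the real CSV.
SAMPLE_CSV = """word,total,authors
bairn,120,35
nearby,48,17
wean,95,28
"""


def top_words(csv_file, n=10, count_column="total"):
    """Return the n rows with the highest value in count_column."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: int(row[count_column]), reverse=True)
    return rows[:n]


top = top_words(io.StringIO(SAMPLE_CSV), n=2)
for row in top:
    print(row["word"], row["total"])
```

With the real download, an opened file handle would be passed in place of the `io.StringIO` stand-in, and `count_column` set to whatever the frequency column is actually called.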