
Intro

This is the Corpus of Contemporary Scots Texts, featuring word frequency statistics from texts written in Scots over the last twenty years.

It is hoped that this website will help in determining appropriate spellings and usage of words, as a better alternative to resurrecting archaic terms from other Scots dictionaries and transplanting them into inappropriate dialects.

Texts have been acquired and scraped from websites, journals, Twitter and books, with no regard to copyright, and tagged into six different Scots dialects: Central, Doric, Ulster Scots, Shetland, Orkney and Southern.

On the letter pages, words are listed in alphabetical order, with the number of occurrences in each dialect and the number of authors recorded as having used the word. Words used by fewer than two authors, or occurring fewer than five times, are ignored.
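
The counting and filtering described above can be sketched roughly as follows. This is an illustrative Python sketch, not the site's actual code (which is written in Perl). The input shape (author, dialect, word), the function name `summarise` and the constant names are all hypothetical, and it assumes the five-occurrence threshold applies to the total across all dialects.

```python
from collections import Counter, defaultdict

# Thresholds taken from the description above; the names are hypothetical.
MIN_AUTHORS = 2
MIN_OCCURRENCES = 5


def summarise(tokens):
    """Build per-word dialect counts and author counts.

    `tokens` is an iterable of (author, dialect, word) triples, one per
    word occurrence. This input shape is an assumption for illustration.
    """
    per_dialect = defaultdict(Counter)  # word -> Counter(dialect -> count)
    authors = defaultdict(set)          # word -> set of author names

    for author, dialect, word in tokens:
        per_dialect[word][dialect] += 1
        authors[word].add(author)

    rows = {}
    for word in sorted(per_dialect):    # alphabetical, as on the letter pages
        total = sum(per_dialect[word].values())
        # Ignore words with too few authors or too few occurrences.
        if len(authors[word]) < MIN_AUTHORS or total < MIN_OCCURRENCES:
            continue
        rows[word] = {
            "dialects": dict(per_dialect[word]),
            "authors": len(authors[word]),
            "total": total,
        }
    return rows
```

A word used six times by a single author would be dropped here, as would a word used once each by two authors.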

For the purposes of this corpus, Contemporary Scots refers to the Scots language spoken and written in the 21st century, as distinct from Modern Scots, which covers 1700 to the present day. This stems from the mistaken assumption that Modern Scots would include words for touchscreen smartphones and fax machines rather than timorous beasties and races of puddins.

About

In August 2020 it was noted that much of the Scots Wikipedia was written by non-natives and non-speakers of Scots: bored teenagers seeking dopamine hits from churning out thousands of poor-quality Wikipedia pages, misusing terms from poorly chosen online dictionaries. The Scots language Wikipedia was considered a joke by native Scots, but was used by international organisations as a corpus of language usage, resulting in gross cultural vandalism.

Scots language enthusiasts gathered to try to fix the Scots Wikipedia, deleting thousands of low-quality pages and creating better-written new pages, but this effort was undermined by poor understanding of spellings. Pages would be well written in one dialect, then edited by speakers of other dialects, or by non-speakers who sought to fix words that didn't look Scottish enough. For example, a user called CanadianToast changed my spelling of "nearby" to "naurby".

There are many online language resources catering to Scots: the DSL dictionary, which doesn't list "naurby" as a word, and the SCOTS corpus of modern Scots, which has no occurrences of "naurby". In fact the only occurrences of the word seem to be on Twitter in the Californian Scots dialect, which seems to be an inappropriate choice of dialect for Wikipedia. How would CanadianToast, being thousands of miles away from Scotland (and California), be expected to know that no Scotsman has ever used that spelling, and that "nearby" is a perfectly fine Scots word used by Scots writers for centuries?

Perhaps CanadianToast has a Scotsman friend they could ask, but this bears the risk that their friend is dreadful at spelling, doesn't know which dialect uses which spelling, or has a minority view about the nature of spelling and usage in Scots. Who decides if a word is Scots or not? Clearly, a plurality of writers.

This is the sort of situation that this corpus of contemporary Scots texts hopes to resolve.

The creator of this website is not an academic and before August 2020 didn't have much interest in word frequencies, yet here we are with a database of several hundred texts and 600,000 words, including more than 50,000 unique words, from over a hundred authors.

It is hoped that this use of the texts is an acceptable way to route round copyright laws, since the texts themselves aren't public facing, merely the individual words and snippets.

Technical

The texts are stored as *.txt files that have been marked up with XML. Twitter text was collected using Vicinitas to generate Excel files, which were converted to CSV and edited to remove English language messages. Dozens of books written in Scots have been purchased with the intention of scanning them and including the text, but it's a boring process and usually only the first dozen or so pages are done.

<corpus>
<article id="000124">
<date>01-11-2002</date> (only the year is detected programmatically)
<author>
<person name="Fairnie, Robert"/>
</author>
<leid>Scots</leid> (from: English, Scots, Gaelic)
<byleid>Lallans</byleid> (from: Lallans, Doric, Orkney, Shetland, Ulster, Southern)
<subleid>Glesga</subleid> (from: Glasgow, Northern, Borders, etc if more nuance is detectable)
<genre>newspaper</genre> (from: prose, poetry, government, newspaper, weans, blog, twitter, academic)
<source>digital</source> (from: digital, print)
<publisher>Self-published</publisher>
<title>Scots Tung WITTINS Nummber 108</title>
<text>
Ipsum lori
</text>
</article>
</corpus>
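
As a rough illustration of how a file in this format might be read, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It is not the site's own code (which is Perl). The function name `read_articles` and the dictionary keys are hypothetical, the sample omits the bracketed notes shown above, and pulling the year out as the first run of four digits is an assumption about how "only the year is detected" works.

```python
import re
import xml.etree.ElementTree as ET

# The sample record from above, without the bracketed annotations.
SAMPLE = """<corpus>
<article id="000124">
<date>01-11-2002</date>
<author>
<person name="Fairnie, Robert"/>
</author>
<leid>Scots</leid>
<byleid>Lallans</byleid>
<subleid>Glesga</subleid>
<genre>newspaper</genre>
<source>digital</source>
<publisher>Self-published</publisher>
<title>Scots Tung WITTINS Nummber 108</title>
<text>
Ipsum lori
</text>
</article>
</corpus>"""


def read_articles(xml_string):
    """Yield one dict of metadata and text per <article> element."""
    root = ET.fromstring(xml_string)
    for article in root.iter("article"):
        date = article.findtext("date", default="")
        # Assumption: the year is the first run of four digits in <date>.
        match = re.search(r"\d{4}", date)
        yield {
            "id": article.get("id"),
            "year": int(match.group()) if match else None,
            "authors": [p.get("name") for p in article.findall("author/person")],
            "dialect": article.findtext("byleid"),
            "genre": article.findtext("genre"),
            "title": article.findtext("title"),
            "text": (article.findtext("text") or "").strip(),
        }


for record in read_articles(SAMPLE):
    print(record["id"], record["year"], record["dialect"], record["authors"])
```

Keeping the metadata in a plain dict per article makes it easy to group word counts by dialect, genre or author afterwards.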

These text and CSV files were then parsed using custom Perl scripts that just count words and churn out the various data.

A CSV file of the word frequencies for the entire corpus can be downloaded here (right-click and choose "save as").

This file can be opened in any modern spreadsheet software, but be wary: it is more than 50,000 rows long, and even simple sort operations can take a while.
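
If a spreadsheet struggles with the file, a short script may be quicker. The Python sketch below picks out the most frequent words without sorting the whole file. The column names `word` and `total` are assumptions made for illustration, so check the header row of the downloaded file and adjust them to match.

```python
import csv
import heapq
import io


def top_words(csv_file, n=10, word_col="word", count_col="total"):
    """Return the n most frequent (word, count) pairs from an open CSV file.

    The column names are hypothetical; pass the real ones from the header.
    """
    reader = csv.DictReader(csv_file)
    pairs = ((row[word_col], int(row[count_col])) for row in reader)
    # nlargest keeps only n rows in memory rather than sorting everything.
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


# A tiny in-memory stand-in for the downloaded file.
sample = io.StringIO("word,total\nbairn,7\nthe,120\nwis,45\n")
print(top_words(sample, n=2))
```

For the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object in place of `sample`.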

Contact

If you are a Scots author and wish to donate texts, ping me a note on Twitter @illandancient. Suggestions, improvements and corrections are also gratefully received. Furthermore, some kind of validation from academia would provide far more dopamine than a dozen Wikipedia pages.