A Corpus of 21st Century Scots Texts


How to use the Corpus

Enter a word to find all of its occurrences in the corpus, for example sonsie


[Isogloss map of where in Scotland the word wean is used]

There are isogloss maps that display where in Scotland and Northern Ireland words are used most frequently, and there is also a word comparison page for comparing different words or spellings, which will show different dialects and genres of text.

About

This is a corpus of 21st century Scots texts, featuring word frequency statistics from texts written in Scots over the last twenty years.

It is hoped that this website will be useful in determining the appropriate spellings and usage of words in various Scots dialects.

Texts have been acquired and scraped from websites, journals, social media and books, and tagged up into the different regional dialects of Scots.

On the letter pages (see the link bar at the top) words are listed in alphabetical order, with the frequency of occurrences in each dialect and how many authors are recorded as having used each word. Words used by fewer than two authors, or with fewer than five occurrences in total, are ignored (a sketch of this filter is below). Clicking through each word leads to a list of its occurrences in the corpus.
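For illustration, a minimal perl sketch of how such a filter could work, assuming a hash of word counts and a per-word set of author ids (the variable names are illustrative, not the site's actual code):

my %count;    # word => total number of occurrences
my %authors;  # word => hash of author ids that used it
# ... both populated while reading the corpus ...
for my $word (sort keys %count) {
    next if $count{$word} < 5;                # fewer than five occurrences
    next if keys %{ $authors{$word} } < 2;    # fewer than two authors
    printf "%s\t%d\n", $word, $count{$word};
}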

History

In August 2020 it was noted that much of the Scots Wikipedia was written by non-natives and non-speakers of Scots: bored teenagers seeking dopamine hits from churning out thousands of poor quality Wikipedia pages, misusing terms chosen from online dictionaries. The Scots language Wikipedia was considered a joke by native Scots speakers, but was used by international organisations as a corpus of language usage, resulting in some degree of cultural vandalism.

Scots language enthusiasts gathered to try to fix the Scots Wikipedia, deleting thousands of low quality pages and creating better written new pages, but this effort was undermined by a poor understanding of spellings: pages would be well written in one dialect, then edited by speakers of other dialects, or by non-speakers who sought to fix words that didn't look Scottish enough.

The creator of this website is not an academic, and before August 2020 didn't have much interest in word frequencies, yet here we are with a database of several hundred texts, some 1,900,000 words and more than 95,000 unique words from nearly four hundred writers.

It is hoped that the use of the texts is compatible with copyright law, since the full texts aren't public facing; only individual words, data and short extracts are shown.

Technical

The texts are stored in three UTF-8 .xml files that have been marked up with xml tags for metadata: one file for social media, one for text with page numbers and one for everything else. Twitter text was collected using Vicinitas to generate Excel files, which were converted to csv, edited to remove English-language messages and then turned into the xml format (a sketch of this conversion is below). Dozens of books written in Scots have been purchased and scanned.
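As a rough illustration of that last step, a perl sketch that turns one such csv into article entries. The column order and file name are assumptions, and a real converter would also escape &, < and > in the text:

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 });
open my $fh, '<:encoding(UTF-8)', 'tweets.csv' or die $!;
$csv->getline($fh);    # skip the header row
my $id = 1;
while (my $row = $csv->getline($fh)) {
    my ($date, $user, $text) = @$row;    # assumed column order
    printf "<article id=\"%06d\">\n", $id++;
    print "<date>$date</date>\n";
    print "<author>\n<person name=\"$user\"/>\n</author>\n";
    print "<genre>twitter</genre>\n";
    print "<text>\n$text\n</text>\n";
    print "</article>\n";
}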

Around forty new books written in Scots are published each year, perhaps as much as a million words of new text. I maintain a list of these contemporary books on a google sheet here.

The following xml markup is used for each piece of text:

<corpus>
  <article id="000124">
    <date>01-11-2002</date> (only the year is detected programmatically)
    <author>
      <person name="Fairnie, Robert"/>
    </author>
    <leid>Scots</leid> (from: English, Scots, Gaelic)
    <byleid>Central</byleid> (from: Central, Doric, Orkney, Shetland, Ulster, Southern)
    <subleid>Glesga</subleid> (from: Glasgow, Northern, Borders, etc if more nuance is detectable)
    <genre>newspaper</genre> (from: prose, poetry, government, newspaper, weans, blog, twitter, academic)
    <source>digital</source> (from: digital, print)
    <publisher>Self-published</publisher>
    <title>Scots Tung WITTINS Nummber 108</title>
    <page no="215"> (optional field for documents with page numbers)
      <text>
      Ipsum lori
      </text>
    </page> (optional)
  </article>
</corpus>

These text and csv files were then parsed using custom perl and php scripts that count words and churn out the various data. A minimal sketch of such a word counter is below.
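This sketch uses the XML::Twig module to walk the xml format shown above and tally words from each <text> element; it is my guess at the approach, not the site's actual script:

use strict;
use warnings;
use XML::Twig;

my %count;
XML::Twig->new(
    twig_handlers => {
        text => sub {                     # fires on each <text> element
            my ($twig, $elt) = @_;
            my $content = $elt->text;
            $count{ lc $1 }++ while $content =~ /(\p{L}+)/g;
            $twig->purge;                 # free memory as we go
        },
    },
)->parsefile('corpus.xml');

# print words by descending frequency
print "$_\t$count{$_}\n"
    for sort { $count{$b} <=> $count{$a} } keys %count;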

A csv file of the word frequencies for the entire corpus can be downloaded here (right-click and choose "save as").

This file can be opened in any modern spreadsheet software, but be wary: it is around 100,000 rows long, and even simple sort operations can take a while. The behaviour of accented characters in the csv file when opened in Microsoft Excel isn't fully understood.
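One known wrinkle: Excel tends to assume a UTF-8 csv is in a legacy encoding unless the file starts with a byte order mark. A minimal perl sketch that prepends one (the file names are assumptions):

use strict;
use warnings;

open my $in,  '<:encoding(UTF-8)', 'frequencies.csv'     or die $!;
open my $out, '>:encoding(UTF-8)', 'frequencies-bom.csv' or die $!;
print $out "\x{FEFF}";             # UTF-8 byte order mark
print $out $_ while <$in>;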

Contact

If you are a Scots author and wish to donate texts, ping me a note on twitter @illandancient or email illandancient@googlemail.com.
Suggestions, improvements and corrections are also gratefully received. Occasionally edits to the scripts can mess up the pages; please send me a note on twitter if this happens.

Development log

To Do list

* If the search function comes across an author's name, bring up work by that author
* Change the fonts so it all looks nice

Log

* Heard about the Scots wikipedia thing (2020-08-26)
* Contacted publishers (2020-08-28)
* Contacted academics (2020-09-20)
* Started pulling together texts for corpus (2020-09-27)
* Created perl corpus utilities (2020-10-07)
* Started website (2020-11-01)
* Corpus size 363,186 (2020-11-22)
* Created corpus statistics page
* Corpus size 493,275 (2020-12-16)
* Corpus size 521,418 (2020-12-30)
* Corpus size 610,000 (2021-01-11)
* Created spelling comparison page (2021-01-16)
* Experimental collocation page (2021-02-01)
* Corpus size 701,812 (2021-02-02)
* Extended concordance distance to 30 characters (2021-02-07)
* Made header links into php include code block (2021-03-05)
* Completed page number routines (2021-03-06)
* Started development log (2021-03-07)
* Corpus size 801,924 words (2021-03-08)
* Tried to implement a most idiosyncratic words page for each author, by comparing the normalised word frequency between corpus and author, but it took 11.03 seconds per page for all the lookups. Instead I've just set it to calculate the normalised occurrences for each word so you can cross-reference with the comparison page at your leisure; the slow lookup page is here authora.php, you'll have to enter the author id yourself. A sketch of the frequency comparison is at the end of this log. (2021-03-08)
* Page for each twitter author, with creepy bits about identifying who they respond to most frequently in Scots tweets. The plan was to create a network diagram, but actually people don't tweet to other people in the corpus that often. (2021-03-09)
* Investigated what happens with the iacute letter in source material, and changed many scripts and pages to handle it, although I'm not entirely sure it's satisfactory: the word "spírit" gets split into "sp" and "rit". The more Gaelic-influenced varieties of Scots (Ulster) occasionally use Gaelic accented letters with the fada acute accent, and the more Scandinavian varieties of Scots (Shetland) occasionally use the umlaut two-dots accent, but it is only a few authors in each dialect who use these letters. A sketch of the tokenisation problem is at the end of this log.
* Created experimental letter frequency page which counts each character instead of just letters, so I don't miss any accents or unexpected characters (2021-03-16)
* Created experimental Levenshtein distance page. The Levenshtein distance is basically the number of edits it takes to change one word into another. The page creates a new wordlist from the corpus, then rattles through and finds the Levenshtein distance between a search term and every word in the corpus, then ranks them and presents the top 15. It's not the best for finding similar or related words (for example 'window' doesn't bring up 'wundae'), but it's okay for finding typos. A sketch of the distance calculation is at the end of this log.
* Corpus size 1,007,053 words (2021-04-05)
* Created experimental accented character search page, to help verify the use of accented characters. I'm not sure to what extent accented characters can be passed via url. The php script will filter out any non-accented characters A-Z and a-z. I think that emojis are already filtered out of tweets.
* Limited twitter concordance to five word occurrences per twitter user, and started including backlinks to original tweets (2021-05-01)
* Removed tracking cookies from pages; I stopped caring who uses the website. (2021-05-11)
* Started adding a second level of dialect classes to give finer granularity to regions
* Got all excited that the NI Assembly had voted for 'simultaneous translation in Ulster-Scots' and that the corpus could be used for Machine Translation training data, but it turned out to be mis-reporting: it was just spoken interpretation from Ulster-Scots into English, rather than a huge effort of translating Hansard. (2021-06-15)
* Started creating a bilingual Excel document for Machine Translation training purposes and discovered that it isn't easy and is very slow to do manually. (2021-06-28)
* Corpus size 1,359,305 words (2021-07-05)
* Created experimental isogloss map of regions to visualise where words and spellings are used (2021-08-30)
* Corpus size 1,586,019 words (2021-09-02)
* Added experimental usage over time graph that does the same thing as the google n-gram viewer, but with less data to work with (2021-09-02)
* Created experimental newly coined words page which looks for words that have become more popular in the last few years (2021-10-28)
* Created experimental n-gram frequency page which tracks the frequencies of 3-grams, 4-grams etc in various Scots dialects (2021-10-30). A sketch of the n-gram counting is at the end of this log.
* Created year page that gives a breakdown of various statistics for each selected year (2021-11-07)
* Corpus size 1,871,489 words (2021-11-14)
* Corpus size 2,165,508 words, 321 writers (2022-02-27)
* Created experimental Lemmatiser that takes an input text and does a lookup against a lexicon file to identify all the lemmas; also created a housekeeping page that lists words to be added to the lexicon (2022-02-27). It's a first draft, version 1 sort of thing, to identify which words are perhaps homographs of two different lemmas. The next step would be to create a similar lookup table for parts of speech, or perhaps a single combined lexicon, then create some kind of intelligence that can make educated guesses about which way homographs go, and then churn out a more accurate lemmatiser / part of speech tagger. A sketch of the lexicon lookup is at the end of this log.
* Created a general corpus housekeeping page that lists articles which might have missing information (2022-02-27)
* Started a new statistics page that lists the number of articles in each sub-dialect and sub-genre, in a kind of Sankey Diagram table format.
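Sketches

The idiosyncratic words calculation mentioned in the log (2021-03-08), as a minimal perl sketch: normalise to occurrences per million words for both the author and the whole corpus, then rank by the ratio. The hash and variable names are illustrative, not the actual script:

my %author_count;    # word => count within one author's texts
my %corpus_count;    # word => count across the whole corpus
my ($author_total, $corpus_total);    # total word counts, populated elsewhere

my %score;
for my $word (keys %author_count) {
    next unless $corpus_count{$word};
    my $author_norm = 1_000_000 * $author_count{$word} / $author_total;
    my $corpus_norm = 1_000_000 * $corpus_count{$word} / $corpus_total;
    $score{$word} = $author_norm / $corpus_norm;    # >1 means the author over-uses it
}
my @top = (sort { $score{$b} <=> $score{$a} } keys %score)[0 .. 14];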
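The iacute problem comes from tokenising with an ASCII-only letter class, which breaks a word at any accented letter; matching Unicode letters instead keeps the word whole. A small demonstration:

use strict;
use warnings;
use utf8;                                        # the source contains an accented í
binmode STDOUT, ':encoding(UTF-8)';

my @ascii   = ("spírit" =~ /([a-zA-Z]+)/g);      # gives ("sp", "rit")
my @unicode = ("spírit" =~ /(\p{L}+)/g);         # gives ("spírit")
print "@ascii\n";      # sp rit
print "@unicode\n";    # spírit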
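The Levenshtein distance page in sketch form: a standard dynamic programming implementation, then ranking a wordlist by distance from the search term. The search term and wordlist here are placeholder data:

use strict;
use warnings;
use List::Util qw(min);

sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);        # distances from the empty prefix
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i-1, 1) eq substr($t, $j-1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,          # deletion
                           $cur[$j-1] + 1,         # insertion
                           $prev[$j-1] + $cost);   # substitution
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my ($term, @wordlist) = ('windae', 'window', 'wundae', 'whit');   # placeholder data
my %dist = map { $_ => levenshtein($term, $_) } @wordlist;
print "$_\t$dist{$_}\n" for sort { $dist{$a} <=> $dist{$b} } keys %dist;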
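The n-gram frequency counting (2021-10-30) in sketch form, assuming word n-grams, i.e. a sliding window of n words joined with spaces:

use strict;
use warnings;

sub ngram_counts {
    my ($n, @words) = @_;
    my %freq;
    for my $i (0 .. $#words - $n + 1) {
        $freq{ join ' ', @words[$i .. $i + $n - 1] }++;
    }
    return %freq;
}

# placeholder text: "the wee dug" occurs twice as a 3-gram
my @words = split ' ', "the wee dug and the wee cat and the wee dug";
my %trigrams = ngram_counts(3, @words);
print "$_\t$trigrams{$_}\n"
    for sort { $trigrams{$b} <=> $trigrams{$a} } keys %trigrams;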
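And the lemmatiser lookup (2022-02-27) in sketch form, assuming a lexicon file of tab-separated word form and lemma pairs; the file name, format and sample tokens are assumptions. Homographs simply collect more than one lemma:

use strict;
use warnings;

my %lemmas_of;
open my $lex, '<:encoding(UTF-8)', 'lexicon.txt' or die $!;
while (<$lex>) {
    chomp;
    my ($form, $lemma) = split /\t/;
    push @{ $lemmas_of{$form} }, $lemma;    # homographs get more than one lemma
}

for my $word (qw(hoose wis haud)) {         # placeholder tokens
    my $lemmas = $lemmas_of{$word};
    if    (!$lemmas)      { print "$word\t(not in lexicon)\n" }
    elsif (@$lemmas == 1) { print "$word\t$lemmas->[0]\n" }
    else                  { print "$word\t(homograph: @$lemmas)\n" }
}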