A Corpus of 21st Century Scots Texts

How to use the Corpus

Enter a word on the search page to find all of its occurrences in the corpus, for example sonsie.


There is also a word comparison page for comparing different words or spellings, which shows how usage varies across dialects and genres of text.

About

This is a corpus of 21st century Scots texts, featuring word frequency statistics from texts written in Scots over the last twenty years.

It is hoped that this website will be useful in determining the appropriate spellings and usage of words in various Scots dialects.

Texts have been acquired and scraped from websites, journals, Twitter and books, and tagged into six different Scots dialects: Central, Doric, Ulster Scots, Shetland, Orkney and Southern.

On the letter pages, words are listed in alphabetical order, with the frequency of occurrences in each dialect and the number of authors recorded as having used the word. Words used by fewer than two authors, or occurring fewer than five times, are ignored (this filter is sketched below). Clicking through on a word leads to a list of its occurrences in the corpus.
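
A minimal sketch of that filter in Perl, assuming hypothetical %count and %authors hashes built during the word-counting pass (not the actual site code):

#!/usr/bin/perl
use strict;
use warnings;

# A word is listed on the letter pages only if it occurs at least five
# times and was used by at least two authors. %count and %authors are
# hypothetical structures filled in while parsing the corpus.
my %count;    # word => total occurrences in the corpus
my %authors;  # word => { author id => 1 }

sub is_listed {
    my ($word) = @_;
    my $n_authors = scalar keys %{ $authors{$word} // {} };
    return ( $count{$word} // 0 ) >= 5 && $n_authors >= 2;
}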

History

In August 2020 it was noted that much of the Scots Wikipedia had been written by non-natives and non-speakers of Scots: bored teenagers seeking dopamine hits from churning out thousands of poor quality Wikipedia pages, misusing terms picked from online dictionaries. The Scots language Wikipedia was considered a joke by native Scots speakers, but was used by international organisations as a corpus of language usage, resulting in some degree of cultural vandalism.

Scots language enthusiasts gathered to try to fix the Scots Wikipedia, deleting thousands of low quality pages and creating better-written new ones, but this effort was undermined by a poor shared understanding of spellings: pages would be well written in one dialect, then edited by speakers of other dialects, or by non-speakers, who sought to fix words that didn't look Scottish enough.

The creator of this website is not an academic, and before August 2020 didn't have much interest in word frequencies, yet here we are with a database of several hundred texts: over a million words, and more than 60,000 unique words, from almost two hundred authors.

It is hoped that this use of the texts is acceptable under copyright law, since the full texts aren't public facing, merely individual words, data and extracts.

Technical

The texts are stored in three UTF-8 .xml files marked up with XML tags for metadata: one file for Twitter, one for texts with page numbers and one for general text. Twitter text was collected using Vicinitas to generate Excel files, which were converted to csv, edited to remove English language messages and then turned into the XML format (sketched below). Dozens of books written in Scots have been purchased with the intention of scanning them and including the text.
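
A minimal sketch of that csv-to-XML step, assuming hypothetical file and column names (the real Vicinitas export layout and conversion script differ):

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# Sketch only: turn a csv of tweets into the corpus XML format.
# 'tweets.csv' and its date/author/text columns are assumptions.
my $csv = Text::CSV->new({ binary => 1 });
open my $fh, '<:encoding(utf-8)', 'tweets.csv' or die $!;
$csv->getline($fh);    # skip the header row
binmode STDOUT, ':encoding(utf-8)';
print "<corpus>\n";
my $id = 1;
while ( my $row = $csv->getline($fh) ) {
    my ( $date, $author, $text ) = @$row;
    # a real script would also escape &, < and > in the values
    printf "<article id=\"%06d\">\n", $id++;
    print "<date>$date</date>\n";
    print "<author><person name=\"$author\"/></author>\n";
    print "<leid>Scots</leid>\n<genre>twitter</genre>\n<source>digital</source>\n";
    print "<text>\n$text\n</text>\n</article>\n";
}
print "</corpus>\n";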

Around a dozen new books written in Scots are published each year, which might amount to as much as a million words a year. I maintain a list of these contemporary books on Wikipedia here.

The following XML markup is used for each piece of text:

<corpus>
  <article id="000124">
    <date>01-11-2002</date> <!-- only the year is detected programmatically -->
    <author>
      <person name="Fairnie, Robert"/>
    </author>
    <leid>Scots</leid> <!-- from: English, Scots, Gaelic -->
    <byleid>Lallans</byleid> <!-- from: Lallans, Doric, Orkney, Shetland, Ulster, Southern -->
    <subleid>Glesga</subleid> <!-- from: Glasgow, Northern, Borders, etc. if more nuance is detectable -->
    <genre>newspaper</genre> <!-- from: prose, poetry, government, newspaper, weans, blog, twitter, academic -->
    <source>digital</source> <!-- from: digital, print -->
    <publisher>Self-published</publisher>
    <title>Scots Tung WITTINS Nummber 108</title>
    <page no='215'> <!-- optional element for documents with page numbers -->
      <text>
Ipsum lori
      </text>
    </page> <!-- optional -->
  </article>
</corpus>

These text and csv files are parsed by custom Perl scripts that count words and churn out the various data; the webpages carry out similar processes using PHP scripts.
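
A minimal sketch of the word-counting pass, assuming a plain regex over the <text> elements is close enough for illustration (the real scripts also track the author, dialect and genre metadata, and 'corpus.xml' is a hypothetical filename):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(utf-8)';

# Sketch only: slurp one corpus XML file, pull out the <text> blocks
# and tally word frequencies across the whole file.
open my $fh, '<:encoding(utf-8)', 'corpus.xml' or die $!;
my $xml = do { local $/; <$fh> };
close $fh;

my %freq;
while ( $xml =~ m{<text>(.*?)</text>}sg ) {
    my $text = lc $1;
    $freq{$_}++ for $text =~ /([a-zíü'-]+)/g;
}
printf "%s,%d\n", $_, $freq{$_}
    for sort { $freq{$b} <=> $freq{$a} } keys %freq;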

A csv file of the word frequencies for the entire corpus can be downloaded here (right click and choose 'Save as').

This file can be opened in any modern spreadsheet software, but be aware that it is more than 50,000 rows long; even simple sort operations can take a while. The behaviour of accented characters in the csv file in Microsoft Excel isn't fully understood (a possible workaround is sketched below).
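
If Excel is misreading the accented characters, one common workaround is to prepend a UTF-8 byte order mark, which usually persuades Excel to decode the file as UTF-8. A sketch with hypothetical filenames:

#!/usr/bin/perl
use strict;
use warnings;

# Copy the csv byte-for-byte with a UTF-8 BOM prepended, so that
# Excel detects the encoding. Filenames here are hypothetical.
open my $in,  '<:raw', 'frequencies.csv'     or die $!;
open my $out, '>:raw', 'frequencies-bom.csv' or die $!;
print $out "\xEF\xBB\xBF";      # UTF-8 byte order mark
print $out $_ while <$in>;
close $in;
close $out;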

Contact

If you are a Scots author and wish to donate texts, ping me a note on Twitter @illandancient. Suggestions, improvements and corrections are also gratefully received. Occasionally edits to the scripts can mess up the pages; please send me a note on Twitter if this happens.

Development log

To Do list

* Identify hapax legomena for each author (twitter and conventional)
* Page for each article with unique words and corpus hapax legomena
* Identify the most idiosyncratic words for each leid
* Add ISBN numbers for each book
* 'Word Sketch' pages
* Do something with the sub-dialect information
* Finish putting page information in all books
* Create a fourth xml file containing code-switching text, where the Scots text is surrounded by an <extract> tag; do they have to be numbered?
* If the search function comes across an author's name, bring up work by that author
* Year page listing all the texts from a given year, with a genre/dialect matrix for that year
* On the concordance page, some kind of occurrences-per-year chart if there are more than, say, 20 occurrences of the word, and also the dialects in each year

Log

* Heard about the Scots wikipedia thing (2020-08-26)
* Contacted publishers (2020-08-28)
* Contacted academics (2020-09-20)
* Started pulling together texts for corpus (2020-09-27)
* Created perl corpus utilities (2020-10-07)
* Started website (2020-11-01)
* Corpus size 363,186 (2020-11-22)
* Created corpus statistics page
* Corpus size 493,275 (2020-12-16)
* Corpus size 521,418 (2020-12-30)
* Corpus size 610,000 (2021-01-11)
* Created spelling comparison page (2021-01-16)
* Experimental collocation page (2021-02-01)
* Corpus size 701,812 (2021-02-02)
* Extended concordance distance to 30 characters (2021-02-07)
* Made header links into php include code block (2021-03-05)
* Completed page number routines (2021-03-06)
* Started development log (2021-03-07)
* Corpus size 801,924 (2021-03-08)
* Tried to implement most idiosyncratic words for each author by comparing the normalised word frequency between the corpus and the author, but it took 11.03 seconds per page for all the lookups. Instead I've just set it to calculate the normalised occurrences for each word, so you can cross-reference with the comparison page at your leisure. The slow lookup page is here authora.php; you'll have to enter the author id yourself. (2021-03-08)
* Page for each twitter author with creepy bits about identifying who they respond to most frequently in Scots tweets. The plan was to create a network diagram, but actually people don't tweet to other people in the corpus that often. (2021-03-09)
* Investigated what happens with the í (i-acute) letter in source material, and changed many scripts and pages to handle it, although I'm not entirely sure it's satisfactory: the word "spírit" was getting split into "sp" and "rit" (a sketch of the tokenising fix is after this log). The more Gaelic-influenced varieties of Scots (Ulster) occasionally use Gaelic accented letters with the fada acute accent, and the more Scandinavian varieties of Scots (Shetland) occasionally use the umlaut two-dot accent, but it is only a few authors in each dialect who use these letters.
* Created experimental letter frequency page which counts each character instead of just letters, so I don't miss any accents or unexpected characters (2021-03-16)
* Created experimental Levenshtein distance page. The Levenshtein distance is basically the number of single-character edits it takes to change one word into another (see the sketch after this log). The page builds a wordlist from the corpus, rattles through finding the Levenshtein distance between the search term and every word in the corpus, then ranks them and presents the top 15. It's not the best for finding similar or related words, but it's okay for finding typos. For example 'window' doesn't bring up 'wundae'.
* Corpus size 1,007,053 (2021-04-05)
* Created experimental accented character search page, to help verify the use of accented characters. I'm not sure to what extent accented characters can be passed via URL. The PHP script filters out the unaccented characters A-Z and a-z. I think that emojis are already filtered out of tweets.
* Limited twitter concordance to five word occurrences per twitter user, and started including backlinks to original tweets (2021-05-01)
* Removed tracking cookies from pages; I stopped caring who uses the website. (2021-05-11)
* Started adding a second level of dialect classes to give finer granularity to regions
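
A sketch of the accented-letter tokenising fix mentioned in the log above, using a made-up example line: a plain [a-z] word match splits "spírit" in two, so the character class has to admit the accented vowels (or use \p{L} to match any letter):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(utf-8)';

# With a plain [a-z] class the accented vowel breaks the word in two.
my $line  = "the spírit o the leid";
my @naive = $line =~ /([a-z]+)/gi;                  # spírit -> "sp", "rit"
my @fixed = $line =~ /([a-zàáèéìíòóùúäëïöü]+)/gi;   # spírit stays whole
print "naive: @naive\nfixed: @fixed\n";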
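And a sketch of the Levenshtein distance calculation from the log above, as the textbook dynamic-programming version (the page applies this between the search term and every word in the corpus wordlist):

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Row-by-row dynamic programming: @prev holds distances for the
# previous row of the edit matrix, @cur the row being built.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i-1, 1) eq substr($t, $j-1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,        # deletion
                $cur[$j-1] + 1,       # insertion
                $prev[$j-1] + $cost,  # substitution
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print levenshtein("window", "wundae"), "\n";  # prints 3, which is why 'wundae' ranks poorly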

Ideas for 2.0

* Location co-ordinates for every article
* Tags instead of genre, so children's poetry could be 'weans' and 'poetry' at the same time
* URLs and ISBN numbers for everything
* Use page number and code-switch extract tagging for all texts
* Handle multi-author anthologies better