There is also a word comparison page for comparing different words or spellings, which breaks the results down by dialect and genre of text.
This is a corpus of 21st Century Scots Texts, featuring word frequency statistics from texts written in Scots over the last twenty years.
It is hoped that this website will be useful in determining the appropriate spellings and usage of words in various Scots dialects.
Texts have been acquired and scraped from websites, journals, Twitter and books, with no regard to copyright, and tagged as one of six Scots dialects: Central, Doric, Ulster Scots, Shetland, Orkney and Southern.
On the letter pages, words are listed in alphabetical order, with the frequency of occurrences in each dialect and how many authors are recorded as having used the word. Words used by fewer than two authors or with fewer than five occurrences are ignored. Clicking through each word will lead to a list of occurrences of that word in the corpus.
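A minimal sketch of the filtering rule above, written in Python rather than the site's own perl or php. The function name, the input shape (one word/author pair per occurrence) and the sample data are all assumptions made for illustration, and the rule is read here as dropping a word if it fails either threshold.

```python
from collections import defaultdict

MIN_AUTHORS = 2       # words used by fewer than two authors are ignored
MIN_OCCURRENCES = 5   # words with fewer than five occurrences are ignored

def listable_words(occurrences):
    """occurrences: iterable of (word, author) pairs, one pair per occurrence.

    Returns {word: (occurrence_count, author_count)} in alphabetical order,
    keeping only words that meet both thresholds.
    """
    counts = defaultdict(int)
    authors = defaultdict(set)
    for word, author in occurrences:
        counts[word] += 1
        authors[word].add(author)
    kept = {
        word: (counts[word], len(authors[word]))
        for word in counts
        if counts[word] >= MIN_OCCURRENCES and len(authors[word]) >= MIN_AUTHORS
    }
    return dict(sorted(kept.items()))

# Made-up sample data: words and author ids are placeholders.
sample = (
    [("aboot", "author_a")] * 3 + [("aboot", "author_b")] * 2   # 5 uses, 2 authors
    + [("abuin", "author_a")] * 6                               # 6 uses, 1 author
    + [("ae", "author_a"), ("ae", "author_b")]                  # 2 uses, 2 authors
)
print(listable_words(sample))  # {'aboot': (5, 2)}
```

Only the first word survives: the second has a single author and the third has too few occurrences.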
In August 2020 it was noted that much of the Scots Wikipedia was written by non-natives and non-speakers of Scots: bored teenagers seeking dopamine hits from churning out thousands of poor-quality Wikipedia pages, misusing terms chosen from online dictionaries. The Scots language Wikipedia was considered a joke by native Scots, but used by international organisations as a corpus of language usage, resulting in some degree of cultural vandalism.
Scots language enthusiasts gathered to try to fix the Scots Wikipedia, deleting thousands of low-quality pages and creating better-written new pages, but this effort was undermined by poor understanding of spellings. Pages would be well written in one dialect, then edited by speakers of other dialects or by non-speakers who sought to fix words that didn't look Scottish enough.
The creator of this website is not an academic and before August 2020 didn't have much interest in word frequencies, yet here we are with a database of several hundred texts, 900,000 words, more than 54,000 unique words from over a hundred authors.
It is hoped that this use of the texts is acceptable under copyright law, since the full texts aren't public facing, merely the individual words, data and extracts.
The texts are stored in three utf-8 .xml files that have been marked up with xml tags for metadata: one file for Twitter, one for text with page numbers and one for general text. Twitter text was collected using Vicinitas to generate Excel files, which were converted to csv, edited to remove English language messages and then turned into the xml format. Dozens of books written in Scots have been purchased with the intention of scanning them and including the text.
Around a dozen new books written in Scots are published each year; it might even amount to as many as a million words. I maintain a list of these contemporary books on Wikipedia here.
The following xml markup is used for each piece of text:
<date>01-11-2002</date> (only the year is detected programmatically)
<person name="Fairnie, Robert"/>
<leid>Scots</leid> (from: English, Scots, Gaelic)
<byleid>Lallans</byleid> (from: Lallans, Doric, Orkney, Shetland, Ulster, Southern)
<subleid>Glesga</subleid> (from: Glasgow, Northern, Borders, etc if more nuance is detectable)
<genre>newspaper</genre> (from: prose, poetry, government, newspaper, weans, blog, twitter, academic)
<source>digital</source> (from: digital, print)
<title>Scots Tung WITTINS Nummber 108</title>
<page no='215'> (optional field for documents with page numbers)
These text and csv files were then parsed using custom perl scripts that count words and churn out the various data. The webpages carry out similar processes using php scripts.
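The real parsing is done in perl and php, which aren't shown here. As a rough Python sketch of the same idea, the following pulls the metadata out of one marked-up record and counts its words. The record layout (metadata tags followed by the text), the function name and the body text of the sample are assumptions for illustration only. The `<page>` tag is stripped rather than parsed, and the text is normalised to NFC so that accented letters such as the í in "spírit" stay inside the word.

```python
import re
import unicodedata
from collections import Counter

# Made-up record: the tags follow the markup listed above, the body text is invented.
SAMPLE = """<date>01-11-2002</date>
<person name="Fairnie, Robert"/>
<leid>Scots</leid>
<byleid>Lallans</byleid>
<genre>newspaper</genre>
<source>digital</source>
<title>Scots Tung WITTINS Nummber 108</title>
<page no='215'>
A wee bit o text, an a wee bit mair text wi spírit."""

SIMPLE_TAGS = ("leid", "byleid", "subleid", "genre", "source", "title")

def parse_record(raw):
    """Return (metadata dict, Counter of lowercased words) for one record."""
    raw = unicodedata.normalize("NFC", raw)  # compose accents into single letters
    meta = {}
    for tag in SIMPLE_TAGS:
        found = re.search(rf"<{tag}>(.*?)</{tag}>", raw, re.S)
        if found:
            meta[tag] = found.group(1).strip()
    date = re.search(r"<date>(.*?)</date>", raw, re.S)
    if date:
        year = re.search(r"\d{4}", date.group(1))  # only the year is used
        if year:
            meta["year"] = int(year.group())
    person = re.search(r'<person name="([^"]*)"\s*/>', raw)
    if person:
        meta["author"] = person.group(1)
    # Drop whole metadata elements first, then any leftover tags such as <page>.
    body = re.sub(r"<(date|leid|byleid|subleid|genre|source|title)>.*?</\1>",
                  " ", raw, flags=re.S)
    body = re.sub(r"<[^>]*>", " ", body)
    # Words are runs of Unicode letters, optionally joined by apostrophes.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", body.lower())
    return meta, Counter(words)

meta, counts = parse_record(SAMPLE)
print(meta["year"], meta["author"], meta["byleid"])
print(counts.most_common(3))
```

Regular expressions are used instead of an xml parser because the `<page no='215'>` tag as listed has no closing tag, so a strict parser might reject the files.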
A csv file of the word frequencies for the entire corpus can be downloaded here (do a right click save as).
This file can be opened in any modern spreadsheet software, but be wary: it is more than 50,000 rows long, and even simple sort operations can take a while. The behaviour of accented characters in the csv file and Microsoft Excel isn't fully understood.
If you are a Scots author and wish to donate texts, ping me a note on Twitter @illandancient. Also gratefully received are suggestions, improvements and
corrections. Occasionally edits to the scripts can mess up the pages; please send me a note on Twitter if this happens.
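One common cause of mangled accents, which may or may not be what is happening with this file, is that Excel tends to read a csv without a byte order mark using a legacy code page rather than utf-8. A possible workaround is to re-save the download with a utf-8 byte order mark before opening it. The sketch below does that in Python; the function name and file names are hypothetical, and nothing is assumed about the columns.

```python
import csv

def resave_with_bom(src_path, dst_path):
    """Copy a utf-8 csv file, writing a byte order mark at the start of the copy.

    Reading with "utf-8-sig" accepts input with or without an existing mark,
    so running this twice does not stack up two marks.
    """
    with open(src_path, newline="", encoding="utf-8-sig") as src, \
         open(dst_path, "w", newline="", encoding="utf-8-sig") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row)

# Example call (hypothetical file names):
# resave_with_bom("corpus_frequencies.csv", "corpus_frequencies_excel.csv")
```

Other spreadsheet software usually asks for the encoding when importing, in which case choosing utf-8 there should be enough.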
To Do list
* Identify hapax legomena for each author (Twitter and conventional)
* Page for each article with unique words and corpus hapax legomena
* Identify most idiosyncratic words for each leid
* Add ISBN numbers for each book
* 'Word Sketch' pages
* Do something with the sub-dialect information
* Finish putting page information in all books
* Create a fourth xml file that contains code-switching text, where the Scots text is surrounded by an <extract> tag. Do they have to be numbered?
* If the search function comes across an author's name, have it bring up work by that author
Development log
* Heard about the Scots Wikipedia thing (2020-08-26)
* Contacted publishers (2020-08-28)
* Contacted academics (2020-09-20)
* Started pulling together texts for corpus (2020-09-27)
* Created perl corpus utilities (2020-10-07)
* Started website (2020-11-01)
* Corpus size 363,186 (2020-11-22)
* Created corpus statistics page
* Corpus size 493,275 (2020-12-16)
* Corpus size 521,418 (2020-12-30)
* Corpus size 610,000 (2021-01-11)
* Created spelling comparison page (2021-01-16)
* Experimental collocation page (2021-02-01)
* Corpus size 701,812 (2021-02-02)
* Extended concordance distance to 30 characters (2021-02-07)
* Made header links into php include code block (2021-03-05)
* Completed page number routines (2021-03-06)
* Started development log (2021-03-07)
* Corpus size 801,924 (2021-03-08)
* Tried to implement most idiosyncratic words for each author by comparing the
normalised word frequency between corpus and author, but it took 11.03 seconds
per page for all the lookups. Instead I've just set it to calculate the normalised
occurrences for each word so you can cross-reference with the comparison
page at your leisure. The slow lookup page is here:
authora.php, you'll have to enter the author id yourself. (2021-03-08)
* Page for each Twitter author with creepy bits about identifying who they respond
to most frequently in Scots tweets. The plan was to create a network diagram,
but actually people don't tweet to other people in the corpus that often. (2021-03-09)
* Investigated what happens with the iacute letter in source material, changed many
scripts and pages to handle this, although I'm not entirely sure it's satisfactory.
The word "spírit" gets split into "sp" and "rit". The more Gaelic-influenced varieties
of Scots (Ulster) occasionally use some Gaelic accented letters with the fada acute accent,
and the more Scandinavian varieties of Scots (Shetland) occasionally use the umlaut two
dots accent, but it is only a few authors in each dialect who use these letters.
* Created experimental letter frequency page which counts each
character instead of just letters, so I don't miss any accents or unexpected characters
* Created experimental Levenshtein distance page. The Levenshtein
distance is basically the number of edits it takes to change one word to another. This page
creates a new wordlist from the corpus, then rattles through and finds the Levenshtein
distance between a search term and every word in the corpus, then ranks them and presents
the top 15. It's not the best for finding similar or related words, but it's okay for finding
typos. For example 'window' doesn't bring up 'wundae'.
* Corpus size 1,007,053 (2021-04-05)
* Created experimental accented character search page, to help
verify the use of accented characters. I'm not sure to what extent accented characters can
be passed via url. The php script will filter out any non-accented characters A-Z and a-z.
I think that emojis are already filtered out of tweets.
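The Levenshtein lookup described in the log above can be sketched in Python as follows. This is not the site's php code: the function names and the tiny wordlist are assumptions for illustration, and ties are broken alphabetically, which the real page may not do.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def closest_words(term, wordlist, top=15):
    """Rank every word in wordlist by distance to term and return the nearest."""
    ranked = sorted(wordlist, key=lambda word: (levenshtein(term, word), word))
    return ranked[:top]

# Made-up wordlist standing in for the corpus wordlist.
words = ["wundae", "widow", "windae", "window"]
print(closest_words("window", words, top=2))  # ['window', 'widow']
print(levenshtein("window", "wundae"))        # 3
```

The distance from 'window' to 'wundae' is 3, so in a wordlist of tens of thousands of words there are plenty of unrelated spellings at distance 1 or 2 that outrank it, which fits the observation that the page is better at finding typos than related words.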
Ideas for 2.0
* Location co-ordinates for every article
* Tags instead of genre, so children's poetry could be 'weans' and 'poetry' at the same time
* Dundee and Caithness as separate dialects
* URLs and ISBN numbers for everything
* Use page number and code-switch extract tagging for all texts
* Handle multi-author anthologies better