1. A structured frequency list of words from the general vocabulary. The source used was one of the lists in Maegaard & Ruus (1986), covering the domain of general interest magazines published over five years (1970-74)[1]. The corpus from which the list was derived consists of excerpts dealing with subjects as diverse as agriculture, electronics, photography and gardening, all of them treated at a not too specialised level. The list contains approximately 6800 words, which cover around 82% of the corpus. It is divided into three frequency classes (stored in three separate files), as shown in the following table (figures have been rounded up):
class interval (number of words) | relative frequency of each class (%) | relative cumulative frequency (%)
1-189                            | 48                                   | 48
190-1732                         | 21                                   | 69
1733-6780                        | 13                                   | 82
2. A word list covering words that belong to the closed part-of-speech classes, namely determiners, pronouns, prepositions, conjunctions and non-productive adverbs. The list was constructed automatically by extracting the relevant word forms from the electronic version of the Danish spelling dictionary (Retskrivningsordbogen 1986).
3. A list of loan words from the technical domain of computer science, in which a large number of English terms are used. This is a manually constructed, distributed sample of 95 words extracted from a printed dictionary of computer terms (Politikens Dataleksikon 1986). The dictionary includes both Danish and English words. To obtain a sample of 100 terms[2] from a dictionary of approximately 500 pages, we took the first English term on every fifth page. Since the chosen page did not always contain an English term, this first pass yielded fewer than 100 terms, so we repeated the procedure, taking a term from every 17th page and then from every 20th page. In the end we had a sample of 100 terms, which we ran through the Word spelling checker to catch spelling errors. For five of the terms the spelling checker suggested an alternative spelling. To make sure that we did not introduce unconventional spellings into the sample, we removed those five forms, leaving the final 95.
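The page-sampling procedure described in item 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: the sampling was done by hand from a printed dictionary, so the `pages` mapping, the function name `sample_terms`, and the rule of skipping terms that were already chosen are all hypothetical.

```python
def sample_terms(pages, steps=(5, 17, 20), target=100):
    """Pick the first not-yet-chosen English term on every step-th page.

    pages:  dict mapping page number -> list of English terms on that page,
            in printed order (hypothetical representation of the dictionary).
    steps:  page intervals tried in turn (5, then 17, then 20, as in the text).
    target: stop as soon as this many terms have been collected.
    """
    sample = []
    seen = set()
    last_page = max(pages) if pages else 0
    for step in steps:
        for page_no in range(step, last_page + 1, step):
            if len(sample) >= target:
                return sample
            # Assumption: a term already in the sample is skipped, and the
            # next English term on the same page is taken instead.
            for term in pages.get(page_no, []):
                if term not in seen:
                    seen.add(term)
                    sample.append(term)
                    break
    return sample
```

Pages without any English term simply contribute nothing in a given pass, which is why the later passes with different intervals are needed to reach the target size.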
We then ran a frequency count on the resulting corpus and stored the result in the format <word> <tab> <frequency>, in increasing order.[3]
The list consists of 16530 words; low-frequency words account for about 50% of them.
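A frequency count in the <word> <tab> <frequency> format can be sketched as below. This is not the script mentioned in footnote [3], which is not reproduced here; it is a minimal illustration that assumes whitespace tokenisation, lowercasing, and that "increasing order" means increasing frequency (ties broken alphabetically). The function name `frequency_lines` is hypothetical.

```python
from collections import Counter


def frequency_lines(text):
    """Return 'word<TAB>frequency' lines sorted by increasing frequency."""
    # Assumption: tokens are whitespace-separated and case is ignored.
    counts = Counter(text.lower().split())
    # Sort by frequency first, then alphabetically to make ties deterministic.
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [f"{word}\t{freq}" for word, freq in ordered]


if __name__ == "__main__":
    for line in frequency_lines("la casa la strada La casa la"):
        print(line)
```

A real corpus would also need punctuation stripped and a decision on how to treat apostrophes and compounds, which this sketch leaves out.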
In addition, two lists containing only proper names have been constructed: one with the names of all Italian provincial capitals (capoluogo di provincia), 99 cities in total, and another with first names, 593 names in total.
ISSCO has also obtained a much larger list, built from a bigger corpus, from the Istituto di Linguistica Computazionale of the CNR in Pisa. ISSCO has an agreement to use this list, but not to redistribute it to third parties.
[2] We use the word "term" to indicate that the list contains simple words as well as compounds.
[3] We are grateful to Graham Russell for providing us with the script to do the frequency count.