
2. Overview of the TEMAA test materials


The following lists are all available as 8-bit ASCII DOS text at CST, Copenhagen. The Italian list, however, as indicated in Section 2.2, is in the ISO8859_1 standard (also known as Latin1); this differs slightly from the DOS encoding in that accented characters are not represented in the same way. Windows has adopted the ISO standard.
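The difference between the two encodings can be illustrated with a minimal Python sketch. The text above does not say which DOS code page the lists use; code page 850 is assumed here purely for illustration, and `dos_to_latin1` is a hypothetical helper name.

```python
def dos_to_latin1(data: bytes, dos_codepage: str = "cp850") -> bytes:
    """Re-encode DOS text as ISO 8859-1 (Latin1).

    The default code page is an assumption; the actual DOS lists
    may have been stored under a different one.
    """
    text = data.decode(dos_codepage)
    # Raises UnicodeEncodeError for characters that Latin1 lacks
    # (for example DOS box-drawing characters).
    return text.encode("iso-8859-1")


# The same word has different bytes for the accented letter.
dos_bytes = "città".encode("cp850")        # ends in byte 0x85
latin1_bytes = dos_to_latin1(dos_bytes)    # ends in byte 0xE0
```

A tool that reads the DOS bytes as if they were Latin1 would therefore see a different character in place of the accented letter, which is why the encoding of each list has to be known before it is used.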

2.1. Danish word lists

For Danish, the following lists have been collected:

1. A structured frequency list of words from the general vocabulary. The source used was one of the lists in Maegaard & Ruus (1986), covering the domain of general interest magazines published over five years (1970-74)[1]. The corpus from which the list has been derived consists of excerpts dealing with subjects as different as agriculture, electronics, photography, gardening, etc., all of them treated on a not too specialised level. The size is approximately 6800 words, covering around 82% of the corpus. The list is divided into three frequency classes (and three separate files), as shown in the following table (figures have been rounded up):

class interval        relative frequency     relative cumulative
(number of words)     of each class (%)      frequency (%)

1-189                 48                     48
190-1732              21                     69
1733-6780             13                     82

2. A word list covering words belonging to the closed part-of-speech classes, namely determiners, pronouns, prepositions, conjunctions and non-productive adverbs. The list was constructed automatically by extracting the relevant word forms from the electronic version of the Danish spelling dictionary (Retskrivningsordbogen 1986).

3. A list of loan words from the technical domain of computer science, in which a large number of English terms are used. This is a manually-constructed distributed sample of 95 words extracted from a printed dictionary of computer terms (Politikens Dataleksikon 1986). The dictionary includes both Danish and English words. To obtain a sample of 100 terms[2] from a dictionary of approximately 500 pages, we took the first English term on every fifth page. Since we could not always find an English term on the page chosen, we had to repeat the procedure by taking a term from every 17th page, and again from every 20th page. In the end, we had a sample of 100 terms, which we ran through the Word spelling checker to remove spelling errors. For five of the terms chosen, the spelling checker suggested an alternative spelling. To make sure that we did not introduce unconventional spellings in the sample, we removed those five forms.
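The page-based sampling described in item 3 was done by hand, but the procedure can be sketched in Python. This is only an illustration under stated assumptions: the dictionary is modelled as a hypothetical list of pages, each a list of `(term, language)` pairs, and each pass is assumed to start at the page whose number equals the step. The text does not specify the starting page, and the function names are invented for this sketch.

```python
from typing import Optional


def first_english_term(page: list[tuple[str, str]]) -> Optional[str]:
    """Return the first term on the page tagged "en", or None."""
    for term, language in page:
        if language == "en":
            return term
    return None


def systematic_sample(pages, steps=(5, 17, 20), target=100):
    """Take the first English term on every step-th page.

    One pass is made per step size, in order, until `target`
    distinct terms have been collected or the passes run out.
    """
    sample = []
    for step in steps:
        # Index step - 1 is page number `step` (assumed start).
        for index in range(step - 1, len(pages), step):
            term = first_english_term(pages[index])
            if term is not None and term not in sample:
                sample.append(term)
                if len(sample) == target:
                    return sample
    return sample
```

The later passes with different step sizes play the role of the repeated procedure in the text: they fill the gaps left by pages on which no English term was found.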

2.2. Italian word lists

The Italian list was constructed at ISSCO from Ansa news wire bulletins dated from January 1995 to the end of April 1995. The corpus was filtered to obtain accented characters in the ISO8859_1 character set; to avoid noise in the frequency counts from proper names, acronyms and so on, capitalized and uppercase words were deleted.

We then ran a frequency count on the resulting corpus and stored the result in the format <word> <tab> <frequency>, in increasing order.[3]
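The filtering and counting steps can be sketched as follows. This is not the script mentioned in footnote [3], which is not reproduced here; it is a minimal Python illustration that assumes whitespace tokenisation, stripping of ASCII punctuation, and sorting by increasing frequency with ties broken alphabetically. The function names are hypothetical.

```python
from collections import Counter
import string


def frequency_list(text: str) -> list[tuple[str, int]]:
    """Count lower-case-initial words, sorted by increasing frequency."""
    counts = Counter()
    for token in text.split():
        word = token.strip(string.punctuation)
        # Skip empty tokens and any word starting with a capital,
        # which covers both capitalized and all-uppercase words.
        if not word or word[0].isupper():
            continue
        counts[word] += 1
    # Assumed ordering: frequency first, then the word itself.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


def format_lines(entries: list[tuple[str, int]]) -> list[str]:
    """Render entries in the <word> <tab> <frequency> format."""
    return [f"{word}\t{count}" for word, count in entries]


entries = frequency_list(
    "La città di Roma è una città. ANSA dice che la città è grande."
)
lines = format_lines(entries)
```

Note that dropping every capitalized token also removes ordinary words at the start of a sentence (here "La"), so their counts are slightly underestimated; this is the price of the simple filter described above.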

The list contains 16530 words; low-frequency words account for about 50%.

In addition, two lists containing only proper names have been constructed: one with the names of all Italian provincial capitals (capoluogo di provincia), for a total of 99 cities, and another with first names, for a total of 593 names.

ISSCO has also obtained from the Istituto di Linguistica Computazionale of the CNR in Pisa a much larger list, constructed from a bigger corpus. ISSCO has an agreement to use this list, but not to redistribute it to third parties.

2.3. English word lists

The English base lists are (i) a list of 288 closed class words drawn from the Alvey Grammar 3rd release (lexicon file d.le), (ii) a list of 9532 general open class words derived from the British National Corpus ([BNC]) via word/part of speech frequency lists compiled by Adam Kilgarriff of the University of Brighton, kindly made available by anonymous ftp [AK], and (iii) a list of 32,250 technical words drawn from the European Corpus Initiative CDROM (ECI).

[1] The following magazines were used: Haven, Folkeskolen, Arbejdsgiveren, Sygeplejersken, Ingeniørens Ugeblad, Dansk Jagt.

[2] We use the word "term" to indicate that the list contains simple words as well as compounds.

[3] We are grateful to Graham Russell for providing us the script to do the frequency count.

