2. The typology of errors

2.1. R-related errors
2.2. Errors of suffixation (not involving the letter "r")
2.3. Silent letters
2.4. Consonant doubling
2.5. Letter substitution
2.6. Compounding
2.7. Loan words
2.8. Other categories

Keeping in mind this fundamental distinction between tractable (from a TEMAA perspective) errors, and intractable ones, we shall now base our description on the typology given in Andersen et al. (1992), where spelling errors are divided into the following basic types:

r-related errors

errors of suffixation (not involving the letter "r")

silent letters

consonant doubling

letter substitution

compounding errors

errors in loan words

syllable omission and syllable repetition

other error types (apostrophe, capitalisation, etc.)

In the following sections, we shall provide examples for each category. Most of them are quoted in the reports as naturally occurring examples. The ones that have been constructed are flagged with a "#". In each example, the misspelled word is followed by the corresponding correct word in square brackets. The misspelled word is flagged with a "*" if it is not a legal word, or if it is a legal word that the quoted context shows to be wrong.

For each error type, we shall also indicate what corruption rule could be used to generate the error from the correct form. In what follows, we use a notation where a corruption rule is a context-sensitive rule consisting of a left-hand side expressing a pattern to be matched, the symbol ">", a right-hand side expressing a sequence to be substituted for the pattern matching the left-hand side, and optional constraints enclosed in curly braces[4].

The following symbols are used in the rules:

C stands for one consonant

V stands for one vowel

& stands for one letter (either vowel or consonant)

- stands for one or more letters

. stands for beginning or end of a word

The constraints in curly braces can express either equality ("=") or inequality ("!=") between letters.

For example, the rule that deletes the final e in a word ending with re will be:

rule: -re. > r

A rule where additional constraints are specified is the following:

-re& > r& {&!=r}

The rule states that the sequence re must be replaced by r if it is followed by a letter which is different from r.
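As an illustration, the two rules above can be read as regular-expression substitutions. The following is a minimal Python sketch under our own assumptions, not the project's implementation (which, as noted in footnote 4, used Perl substitute statements). The function names and the letter inventory `LETTER` are hypothetical; the leading "-" is rendered as a lookbehind for one preceding letter, and "." as the end-of-word anchor `$`.

```python
import re

# Assumed letter inventory: the lower-case Danish alphabet.
LETTER = "a-zæøå"

def delete_final_e(word):
    """-re. > r : drop the e of a word-final 're' preceded by a letter."""
    return re.sub(rf"(?<=[{LETTER}])re$", "r", word)

def delete_e_before_non_r(word):
    """-re& > r& {&!=r} : drop the e of 're' when the next letter is not r."""
    # (?=[letter]) encodes "&"; (?!r) encodes the constraint {&!=r}.
    return re.sub(rf"(?<=[{LETTER}])re(?=[{LETTER}])(?!r)", "r", word, count=1)

print(delete_final_e("flere"))           # fler
print(delete_e_before_non_r("værelse"))  # værlse
print(delete_e_before_non_r("bærer"))    # bærer (constraint blocks the rule)
```

Using lookarounds keeps the context letters out of the replaced span, so the right-hand side only has to state what changes.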

A question to be asked is whether any of the corruption rules described in the following sections overlap with one or more of the six operations already available to generate errors in ASCC. With the exception of letter doubling and singling, these operations are based either on a random process, or are constrained by the position of the letters on the keyboard. Most of the rules listed below, instead, are motivated by the interaction between phonetics and orthography, and predict the addition or deletion of one or more letters in certain well-specified combinations of letters. Therefore, from a general point of view, the two sets of rules are fundamentally different. Thus, although in practice some of the manipulations foreseen by our language-specific rules may also be achieved by random deletion or addition, to make sure that all the errors we are interested in testing are indeed generated, it seems reasonable to implement the rules below as separate operations.

Another general observation regards rule interaction. All the error types implemented in ASCC are single error types. In reality, however, several misspelling errors can occur simultaneously in the same word (e.g. *"selfølelig" for "selvfølgelig", En: of course). To account for this, ASCC should be able to apply more than one rule to the same input word.
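The multiple-error case can be sketched by chaining rules, so that each rule operates on the output of the previous one. The example below is only an assumption about how such chaining might look, not a description of ASCC; `RULES` and `apply_all` are hypothetical names. It uses the rules -lv > l and -lg > l from the silent-letter sections to derive the misspelling quoted above.

```python
import re

# (pattern, replacement) pairs; the lookbehind encodes the leading "-"
# (at least one letter before the matched sequence).
RULES = [
    (r"(?<=\w)lv", "l"),  # -lv > l
    (r"(?<=\w)lg", "l"),  # -lg > l
]

def apply_all(word, rules):
    """Apply each rule once, in order, to the result of the previous rule."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word, count=1)
    return word

print(apply_all("selvfølgelig", RULES))  # selfølelig
```

Applying a single rule from the list yields the single-error forms; the combined form only appears when both are applied.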

Finally, we must note that some of the rules given below overlap with each other, and have therefore been collapsed in the implementation.

2.1. R-related errors

In Danish, the letter r is a very frequent source of spelling errors. Therefore, we shall deal with it separately, and use Löb (1983) as our model. Löb's observations are based on the systematic analysis of the results obtained by 910 secondary school Danish pupils in their graduation spelling test in 1979. In addition to being very comprehensive, this work is formalised in a way that makes it easily adaptable to our purposes. Furthermore, Löb claims that the 910 pupils chosen for the enquiry are representative of the year group. He provides frequencies of occurrence of the various error types, which are of interest to TEMAA, for instance for weighting the replacement adequacy results of a spelling checker. Only a few figures will be mentioned here. First of all, r-related errors seem to constitute the largest error group, varying from 30% to 44% of the total number of errors depending on the region. Secondly, one particular error type (replacing rer with re) constitutes 39% of the r-related mistakes made in the enquiry. The second largest group of mistakes (due to a replacement of erer by er) is responsible for 8% of the errors.

Löb distinguishes five types of spelling errors involving the letter r, namely:

errors based on standard Danish pronunciation

errors related to pronunciation in regional variants of Danish

visual errors

technical errors due to wrong dictionary look-up

other errors

The last category includes the kind of idiosyncratic errors that defy automatic treatment. The penultimate one is a group of errors strictly connected to the type of examination the students were submitted to, and is therefore not generally relevant. Hence, we shall focus on the first three types.

2.1.1. Errors based on standard Danish pronunciation

This is by far the largest group. We list below the typical examples provided by Löb without referring to the specific phonological cause of the error, since this is not directly relevant to our purposes. Some of the examples quoted by Löb as errors concerning r, however, are not listed here, but in connection with other letters, as they are caused by the wrong deletion of a consonant following an r. We have also left out examples of doubling and singling, which are treated under doubling and singling of consonants in general. Thus, most of the errors left in this section consist in the addition or deletion of the letters r and e, often occurring together in some combination in nominal and verbal endings that are difficult to tell apart phonetically.

English translations are given in parentheses: note that where two translations are indicated, the first one refers to the misspelled word, and the second to the correct one.

Examples Corruption rule

Adding/deleting e

(1.1) klør [kløer] (itches/claws) -Ver > Vr

* vidre [videre] (further) -Cere > Cre

* sneer [sner] (snows) -r > er

(1.2) # * værlse [værelse] (room) -re& > r& {&!=r}

* fler [flere] (more) -re. > r

* længer [længere] (longer)

(-re is either final or followed by a letter different from r)

-r > re

Adding/deleting er

(1.3) * kontrollørne [kontrollørerne] -rer > r

(the conductors)

-r > rer

Adding/deleting r

(1.4) bære [bærer] (carry/carries) -rer > re

flimre [flimrer] (flicker/flickers)

* kørerplan [køreplan] (time schedule)

-re > rer

(1.5) * kontrolløerne [kontrollørerne] -VrV > VV {V != V}

(the conductors)

* Panduo [Panduro] (an author's name)

* byrer [byer] (cities) -VV > VrV {V != V}

* muserum [museum] (museum)

(1.6) * hierakisk [hierarkisk] (hierarchic) &VrC > VC {& != r}

* vudere [vurdere] (assess)

* nomal [normal] (normal)

* fasterlavn [fastelavn] (carnival) &VC > VrC {& != r}

* absorlut [absolut] (absolute)

(the constraint makes sure the rules are different from those in 1.4)

Reversing r and e

(1.7) * flimer [flimre] (flicker) -&re > &er {&!=e}

*tuer [ture] (walks)

* byre [byer] (cities) -er > re

Reversing r and e; adding/deleting r

(1.8) * kuperrene [kupeerne] -Ver <-> Vrre

(the compartments)

#* vier [virre] -rre > er

(shake one's head)

Replacing r with g/j and vice versa

(1.9) * kontroløgerne [kontrollørerne] -rer > ger

(the conductors)

-rer > jer

(1.10) * børerne [bøgerne] -ger > rer

(the books)

-jer > rer

Replacing ar(r) with ej/eg and vice versa

(1.11) * paret [peget] -eget > aret

(pointed)

parret [peget] -eget > arret

(the couple/pointed)

-aret > eget

-arret > eget

(1.12) * naret [nejet] -ejet > aret

(curtsied)

narret [nejet] -ejet > arret

(fooled/curtsied)

-aret > ejet

-arret > ejet

Replacing år(r) with øj and vice versa

(1.13) fåret [føjet] -øjet > året

(the sheep/submitted)

*fårret [føjet] -øjet > årret

(submitted)

-året > øjet

-årret > øjet

Replacing rd with rde/rre/er and vice versa

(1.14) * færde [færd] -rd > rde

(journey)

-rde > rd

(1.15) * færre [færd] -rd > rre

(journey)

-rre > rd

(1.16) *fæer [færd] -rd > er

(journey)

-er > rd

Vowel replacement

(1.17) * prast [præst] ræ& > ra& {& != r}

(vicar)

* skrætte [skratte] ra& > ræ& {& != r}

(rattle)

(1.18) * rotebil [rutebil] ru > ro

(coach)

* prublemer [problemer] ro > ru

(problems)

(1.19) * fårmiddags [formiddags] or > år

(yesterday morning)

år > or

(1.20) * dokter [doktor] or > er

(doctor)

* bibliotekorne [bibliotekerne] er > or

(the libraries)
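The inequality constraints used in this section map naturally onto a negative lookahead with a backreference. The sketch below illustrates this for the rule -VrV > VV {V != V} from (1.5). It is our own illustration, not the implemented rule: the vowel inventory `VOWELS` and the function name are assumptions. Since a rule may match at several positions in a word, the function yields one misspelling per match rather than only the first.

```python
import re

# Assumed vowel inventory for the class V.
VOWELS = "aeiouyæøå"

# -VrV > VV {V != V}: (?!\1) rejects a second vowel identical to the first.
# The second vowel is only looked at, not consumed, so matches may overlap.
R_BETWEEN_VOWELS = re.compile(rf"(?<=\w)([{VOWELS}])r(?!\1)(?=[{VOWELS}])")

def delete_r_between_vowels(word):
    """Yield one misspelling for every position where the rule matches."""
    for m in R_BETWEEN_VOWELS.finditer(word):
        # m.end() - 1 is the index of the r to be dropped.
        yield word[:m.end() - 1] + word[m.end():]

print(list(delete_r_between_vowels("kontrollørerne")))  # ['kontrolløerne']
print(list(delete_r_between_vowels("Panduro")))         # ['Panduo']
```

For "rarere" the rule fires on "are" but not on "ere", because the two vowels around the second r are identical.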

2.1.2. Errors based on regional variant pronunciation

Only one type of error due to the influence of a regional variant is mentioned in Löb. Therefore, it does not seem possible on the basis of the material available to distinguish in TEMAA between classes of Danish users on the basis of regional variation. No rules are thus added for this error type.

Examples

dukke [dukker] (doll/dolls)

bleger [blege] (bleaches/pale)

However, Löb shows that there is considerable variation in the frequency of occurrence of the various error types depending on the region. Therefore, differentiation here could be achieved by weighting different errors accordingly.

2.1.3. Visual errors

Examples Corruption rule

(2.1) * passagerne [passagererne] -VrVr > Vr {V = V}

(the passengers)

(2.2) rare [rarere] -rVrV > rV {V = V}

(nice/nicer)

(2.3) * vinduererne [vinduerne] Vr > VrVr

(the windows)

rV > rVrV

(2.4) * starks [straks] Vr > rV {V != e}

(soon)
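Equality constraints such as {V = V} in (2.1) and (2.2) can be expressed with a plain backreference. The following is a minimal sketch under the same caveats as elsewhere in this appendix: the function names and the vowel inventory are assumptions, and the code is not the implemented version of the rules.

```python
import re

# Assumed vowel inventory for the class V.
VOWELS = "aeiouyæøå"

def collapse_vrvr(word):
    """-VrVr > Vr {V = V}: the backreference forces both vowels to be equal."""
    return re.sub(rf"(?<=\w)([{VOWELS}])r\1r", r"\1r", word, count=1)

def collapse_rvrv(word):
    """-rVrV > rV {V = V}."""
    return re.sub(rf"(?<=\w)r([{VOWELS}])r\1", r"r\1", word, count=1)

print(collapse_vrvr("passagererne"))  # passagerne
print(collapse_rvrv("rarere"))        # rare
```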

2.2. Errors of suffixation (not involving the letter "r")

Like r-related errors, the errors in this group also have a phonological explanation: they arise because different suffixes are pronounced in the same way. A number of typical examples are listed below.

The past participle -et suffix for verbs is often pronounced with a soft d // which sounds very much like the d in the past tense -ede suffix. In the first of the two examples given below, the participial form has been replaced by a past tense form, thus resulting in a false negative, whereas in the second example, the misspelled word is not a legal form:

(3.1) har jeg *beskæftigede [beskæftiget] -et. > ede

mig med

(I have occupied myself with)

(3.2) pengene bliver *betragted [betragtet] som -et. > ed

(money is considered as)

Sometimes the past participle -t suffix (which is required with certain verbs) is confused with the more common -et suffix:

(3.3) en lærer havde *slæbet [slæbt] ham Ct. > Cet

(a teacher had dragged him)
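The three suffixation rules are all anchored at the end of the word, so they can be sketched as substitutions ending in `$`. The table and function below are hypothetical names chosen for this illustration, and the consonant inventory is an assumption.

```python
import re

# Assumed consonant inventory for the class C.
CONSONANTS = "bcdfghjklmnpqrstvwxz"

# Rule label -> (pattern, replacement); "$" encodes the final ".".
SUFFIX_RULES = {
    "-et. > ede": (r"(?<=\w)et$", "ede"),
    "-et. > ed": (r"(?<=\w)et$", "ed"),
    "Ct. > Cet": (rf"([{CONSONANTS}])t$", r"\1et"),
}

def corrupt_suffix(word, rule_label):
    """Apply one named suffix rule; the word is returned unchanged if it does not match."""
    pattern, replacement = SUFFIX_RULES[rule_label]
    return re.sub(pattern, replacement, word)

print(corrupt_suffix("beskæftiget", "-et. > ede"))  # beskæftigede
print(corrupt_suffix("betragtet", "-et. > ed"))     # betragted
print(corrupt_suffix("slæbt", "Ct. > Cet"))         # slæbet
```

Note that the first rule produces a legal past tense form, which is why the resulting error cannot be detected by checking the word in isolation.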

2.3. Silent letters

Apart from r, the following letters may be silent in Danish: d, e, g, h, t, v. Therefore, they frequently cause spelling errors. This is especially the case with instances of homophony or near homophony:

men også en *hvis [vis] tilfredsstillelse

(but also a certain satisfaction)

en generation *vis [hvis] forældre arbejder hårdt

(a generation whose parents work hard)

hvorfor har nogle meget *sværdt [svært] ved

(why is it quite difficult for some)

Note that the first two cases above cannot be treated by a spelling checker as the two forms ("vis" and "hvis", En: certain/whose) are homophones, whereas the misspelled word occurring in the third example (*"sværdt") is not part of the Danish vocabulary although it has probably been formed by association with the word "sværd" (En: sword).

However, most of the instances of errors in this class are not attributable to any similarity with the pronunciation of other words. The bulk of errors are due to voiced consonants (with the exception of b) in the final position of a syllable where they appear after l, m, n or r. In such a context, these consonants are either silent or greatly weakened, and are therefore often wrongly omitted in writing. It also happens that, because of the existence of silent letters in other words, the writer wrongly adds letters that are not part of the written word and are not even pronounced:

*hindanden [hinanden]

(each other)

det sociale *samværds [samværs] lille ABC

(lit: social gathering's little ABC)

de ser meget *veltilpadse [veltilpasse] ud

(they look quite satisfied)

The presence or absence of a silent letter is due to a number of factors, such as whether the word has a glottal stop ("hund/hun", En: dog/she), and if so on which letter ("find/fin", En: find/fine), whether the word comes from Latin or not ("inkludere/indtage", En: include/consume) and so on. However, these factors cannot be taken into account here, as the automatic generation of errors possible in ASCC is based on a simple pattern-matching mechanism. Therefore, some of the rules given below are not restricted enough and will also create errors that are unlikely to occur in real text. However, this seems unavoidable if we want to generate all the error types we are interested in.

We shall now discuss each of the possibly silent letters in turn.

silent h

Examples Corruption rule

(4.1) men også en *hvis [vis] tilfredsstillelse .v- > hv

(but also a certain satisfaction)

(4.2) en generation *vis [hvis] forældre arbejder hv- > v

hårdt

(a generation whose parents work hard)

# *erverv [erhverv] (occupation)

The two examples above are naturally occurring ones. However, confusion between words beginning with hj and j may also be foreseen:

(4.3) # *jelm [hjelm] (helmet) hj- > j

(4.4) # *hjern [jern] (iron) j- > hj

# *børnejem [børnehjem] (children's home)

silent d

Examples Corruption rule

(5.1) de er til *gengæl [gengæld] godt udrustet -ld > l

(they are, on the other hand, well equipped)

Bryld slår ret *volsomt [voldsomt] ned på...

(Bryld criticises rather violently)

(5.2) *sansynligvis [sandsynligvis] -nd > n

(probably)

regler *inskærpes [indskærpes] ikke længere

(rules are not stressed any longer)

(5.3) *fær [færd] (journey) -rd > r

går [gård] (yesterday/yard)

(5.4) # *kontrold [kontrol] (control) -l > ld

(5.5) *hindanden [hinanden] -n > nd

(each other)

tjene til ferien og *lommepengende [lommepengene]

(earn for holidays and pocket money)

(5.6) det sociale *samværds [samværs] lille ABC -rC > rdC

(lit: social gathering's little ABC)

*gjordt [gjort] -r. > rd

(done) (r must not be followed by a vowel)

(5.7) # *best [bedst] Vds > Vs

(best)

(5.8) # *påvidst [påvist] Vs > Vds

(proved)

(5.9) # *hvit [hvidt] &dt > &t {& != lnr}

(white)

(5.10) # *konsonandt [konsonant] &t > &dt {& != lnr}

(consonant)
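The over-generation mentioned in the introduction to this section is easy to observe with the silent d rules. The sketch below, which is only an illustration with a hypothetical helper name, applies the rule -n > nd from (5.5) at every position where it matches and returns one candidate per position. Only the first candidate corresponds to the attested error.

```python
import re

def single_applications(word, pattern, replacement):
    """Return every string obtained by applying the rule at exactly one position."""
    return [
        word[:m.start()] + m.expand(replacement) + word[m.end():]
        for m in re.finditer(pattern, word)
    ]

# -n > nd, with the lookbehind encoding the leading "-".
print(single_applications("hinanden", r"(?<=\w)n", "nd"))
# ['hindanden', 'hinandden', 'hinandend']
```

Without phonological information such as the position of the glottal stop, a pattern-matching mechanism has no basis for preferring the first candidate over the other two.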

silent e

Examples Corruption rule

(6.1) # bar [bare] (carried/only) -&e. > & {& != r}

(6.2) *tydlig [tydelig] (clear) -el > l

The letter e is also silent when it occurs before m, n, and t (pronounced //). However, in these cases the letter combination which would result from deleting the e would deviate too much from the rules of Danish orthography (e.g. #* "gamml" for "gammel", En: old). Therefore, since no relevant spelling error examples are quoted in the literature we have had access to, we do not set up rules for these cases.

silent g

Examples Corruption rule

(7.1) *selfølelig [selvfølgelig] -lg > l

(of course)

(7.2) # *spurte [spurgte] -rg > r

(asked)

(7.3) # *kule [kugle] -gl > l

(ball)

Wrong addition of a g seems more unlikely than deletion of the same consonant (we have found no recorded example).

The letter g may also be silent in word-final position independently of the preceding letter. The g in the -ig suffix is a case in point:

(7.4) sjældent forlader artiklen, før den er

*færdilæst [færdiglæst] -&ig > &i

(lit: seldom leaves the article, before

it is read to the end)

Note that the position of the omitted g is final with respect to the word "færdig", which, in the example above, is part of a compound expression.

silent t

Examples Corruption rule

(8.1) hvis mennesket *forsat [fortsat] skal kunne -rts > rs

(if people still must be able to)

man føler sig *nød [nødt] til det

(you feel forced to it)

The fact that the t is silent in the last example is due to an idiosyncrasy. The case cannot, therefore, be treated by a general rule.

silent v

Examples Corruption rule

(9.1) *selfølelig [selvfølgelig] -lv > l

(of course)

2.4. Consonant doubling

In Danish, a consonant is doubled if the vowel immediately preceding it is short and if the consonant occurs in a stressed syllable followed by a vowel. Consonant doubling frequently causes spelling errors:

*fatigdom [fattigdom]

(poverty)

lige nu *sider vi og spiser [sidder]

(right now we are sitting and eating)

*uddanelse [uddannelse]

(education)

et *visent blad hos blomsterne [vissent]

(a dead leaf among the flowers)

The opposite can also be observed, where a consonant may be incorrectly doubled:

det øverste billede på venstre *sidde [side]

(the highest picture on the left side)

*erfarringer [erfaringer]

(experiences)

*væssenlig [væsentlig]

(important)

opfattelse af harmoni som *værrende [værende]

(understanding of harmony as being)

Within this category, it would seem that there is a great deal of confusion between words which are homophones or near homophones. For example the verb "sidde" (En: sit) is often erroneously associated with the noun "side" (En: side), in spite of the fact that the two words are not semantically close.

The factors that determine consonant doubling in Danish are specific to the grammar of the language, and consonant doubling in other languages obeys different rules. Nevertheless, it appears that spelling mistakes due to the omission of a consonant where a consonant should be doubled, or to the wrong doubling of a consonant, constitute a common source of error in general. Therefore, in TEMAA consonant doubling and singling are treated as a general type of typing error, without making reference to the specific language. They are handled by the first and the second of the six operations listed in the introduction to this Appendix.
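Because doubling and singling are treated without reference to the language, they need no context pattern at all. The following sketch shows one way such language-independent operations could be written. It is not the ASCC implementation, whose details are not given here; the function names and the consonant inventory are assumptions.

```python
# Assumed consonant inventory.
CONSONANTS = set("bcdfghjklmnpqrstvwxz")

def single_variants(word):
    """One variant per doubled consonant, with one of the two letters dropped."""
    return [
        word[:i] + word[i + 1:]
        for i in range(len(word) - 1)
        if word[i] == word[i + 1] and word[i] in CONSONANTS
    ]

def double_variants(word):
    """One variant per single (not already doubled) consonant, with that letter doubled."""
    variants = []
    for i, ch in enumerate(word):
        prev_same = i > 0 and word[i - 1] == ch
        next_same = i + 1 < len(word) and word[i + 1] == ch
        if ch in CONSONANTS and not prev_same and not next_same:
            variants.append(word[:i] + ch + word[i:])
    return variants

print(single_variants("uddannelse"))  # ['udannelse', 'uddanelse']
print("sidde" in double_variants("side"))  # True
```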

2.5. Letter substitution

This category deals with errors deriving from the substitution of one or more letters in the spelling of a word. In particular, there are certain pairs of vowels in which one is often substituted for the other, but consonant substitution can occur as well.

The following instances of vowel substitution are very typical. Some Danish vowels are so close to each other in terms of sound quality that the potential for error is high. This is especially the case with the pairs e/æ, a/æ, and y/ø (note that relevant examples have already been quoted in the section on r-related errors).

Examples Corruption rule

(10.1) * nysgarrigt [nysgerrigt] e > a

(curious)

*kommendere [kommandere] a > e

(command)

(10.2) * værden [verden] (world) e > æ

* sprædt [spredt] (spread)

#* portret [portræt] (portrait) æ > e

(10.3) Brians opdragelse har haft y > ø

*betødning [betydning] for

(Brian's education has had importance for)

ø > y

As for consonant substitution, the distinction voiced/non-voiced (i.e. b/p, d/t and g/k) is only active in Danish at the beginning of a word. Therefore, such consonants constitute a frequent source of error when they occur in non-word initial position.

(10.4) * trykhed [tryghed] -g > k

(security)

elever som ikke vil *magge [makke] ret -k > g

(pupils who will not behave)

(10.5) Brian kan *skruppe [skrubbe] af -b > p

(Brian can bugger off)

-p > b

(10.6) mit første * indtryg [indtryk] -k > g

(my first impression)

-g > k

(10.7) *sympol [symbol] (symbol) -b > p

*etaplere [etablere] (establish)

-p > b

Because of identical pronunciation in mid-word and word-final position g and j are also often confused:

(10.8) på deres børns *vejne [vegne] -g > j

(on their children's behalf)

-j > g

2.6. Compounding

In Danish, compounds are written as one word. However, the splitting up of compounds is an increasingly common error phenomenon.

*forbrugs goder [forbrugsgoder] (consumer goods)

i *et hvert [ethvert] hjem (in every home)

*familie idee [familieide] (family idea)

*der ved [derved] ikke sagt (lit: by that not said)

*hvor imod [hvorimod] Klaus Rifbjergs digt (whereas Klaus Rifbjerg's poem)

*oprørs tendenser [oprørstendenser] (rebellion tendencies)

*leve vilkår [levevilkår] (conditions of life)

*overenskomst situationen [overenskomstsituationen] (situation of agreement)

The splitting up of a compound word into two or more of its components is a specific case of space insertion, an error which can occur in connection with simplex words, too[5]. However, it seems that spelling checkers are generally unable to handle space insertion, as they check each word in turn and report on each of them separately. Therefore, compounding errors will not be considered any further here.

2.7. Loan words

The spelling of many loan words deviates radically from the rules of Danish orthography. This makes them a frequent source of error. Certain letters and letter combinations are especially problematic when they occur in loan words.

Examples Corruption rule

(11.1) # *akselerere [accelerere] (accelerate) ce > se

# *annonsere [annoncere] (announce)

*nyanser [nyancer] (nuances)

(11.2) # *kamouflere [camouflere] (camouflage) cV > kV {V = a,o,u}

# *kreme [creme] (cream) cC > kC

(11.3) # *disipel [discipel] (disciple) sc > s

# *sene [scene] (scene)

(11.4) # *sjance [chance] (chance) ch > sj

# *sjock [chock] (shock)

(11.5) # *bensin [benzin] (petrol) z > s

# *bisar [bizar] (bizarre)

(11.6) # *sylofon [xylofon] (xylophone) x > s

(11.7) # *ekseptional [exceptional] (exceptional) xc > ks

(11.8) # *conjak [cognak] (cognac) gn > nj

(11.9) # *salong [salon] (lounge) n > ng

(11.10) # *restaurang [restaurant] (restaurant) nt > ng

(11.11) # *sjenere / *jenere [genere] (annoy) g > sj

(11.12) # *djuice [juice] (juice) j > dj

(11.13) # *vanilje [vanille] (vanilla) ll > lj

(11.14) *niveu [niveau] (level) eau > eu

(11.15) *diskution [diskussion] (discussion) ss > t
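Since the loan word rules are short, context-free or nearly so, they lend themselves to a table-driven treatment. The sketch below covers four of the rules listed above. It is an illustration under our own assumptions: the table `LOAN_RULES`, the function name and the consonant inventory are hypothetical. A set constraint such as {V = a,o,u} becomes a character class in a lookahead.

```python
import re

# Assumed consonant inventory for the class C.
CONSONANTS = "bcdfghjklmnpqrstvwxz"

# (rule label, pattern, replacement)
LOAN_RULES = [
    ("ce > se", r"ce", "se"),
    ("cV > kV {V = a,o,u}", r"c(?=[aou])", "k"),
    ("cC > kC", rf"c(?=[{CONSONANTS}])", "k"),
    ("z > s", r"z", "s"),
]

def loan_word_errors(word):
    """Map each matching rule label to the word with its first match replaced."""
    errors = {}
    for label, pattern, replacement in LOAN_RULES:
        corrupted, count = re.subn(pattern, replacement, word, count=1)
        if count:
            errors[label] = corrupted
    return errors

print(loan_word_errors("camouflere"))  # {'cV > kV {V = a,o,u}': 'kamouflere'}
print(loan_word_errors("annoncere"))   # {'ce > se': 'annonsere'}
```

Keeping the rule label alongside each result makes it possible to report which rule produced a given test item.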

In addition to these, there is a group of loan words for which the Danish spelling was officially changed in 1986 (Retskrivningsordbogen 1986, p.497-506). The problem here is that the spelling has not been consistently changed from a foreign one into one that obeys Danish orthography. Sometimes the spelling that is now sanctioned has become more foreign (e.g. "kampere" -> "campere", En: camp), and this may cause some confusion. It is not clear how the errors deriving from misspelling of these words should be treated in ASCC. In many cases, the error is of a systematic nature, and would be covered by one of the rules given above (e.g. "campere" -> *"kampere"). In others, however, it is rather idiosyncratic, e.g. "chartek" -> *"charteque" (En: file), so that it could not be derived by a rule. However, a list of these words comprising their earlier and current forms could easily be constructed manually.

2.8. Other categories

2.8.1. Apostrophe

Danish only uses an apostrophe in the following contexts:

genitive marking of words ending in -s, e.g. Hans'.

inflection of words with a stem ending in silent consonant, e.g. pommes frites'ene.

inflection of acronyms, e.g. FDF'er.

derivations of numbers, e.g. 6'er.

Influence from English causes the following errors.

Examples Corruption rule

(12.1) på de *unge's [unges] vegne &s. > &'s

(on the youth's behalf)

*Høy's [Høys] tekst

(Høy's text)

Vi modtager gerne *check's [checks]

(We accept cheques)

(12.2) *Brians's [Brians'] væremåde s'. > s's

(Brian's manner)

2.8.2. Capitalisation

This group is dominated by errors of capitalisation in connection with proper names. The pronouns in Danish that need capitalisation (I, De, Dem, Deres) can also cause trouble:

Examples Corruption rule

I dagens *danmark [Danmark] ?

(In today's Denmark)

den overflod som *i har savnet [I]

(the abundance you have lacked)

*i var unge i en tid hvor [I]

(you were young in a time in which)

The rule here is simply to replace a capital letter with a small letter.

2.8.3. Syllable omission

Syllable omission is defined as the omission of one out of two consecutive, identical syllables. The following examples are from Löb (1983):

Examples Corruption rule

*alkoholdig [alkoholholdig] ?

(containing alcohol)

*meddelse [meddelelse]

(message)

It is not clear how to express this as a corruption rule, as we cannot express the concept of syllable. The closest approximation would probably be to check for combinations of two or three letters that are repeated after each other.
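The approximation suggested above can be sketched with a backreference that looks for a chunk of two or three letters immediately followed by itself. This is only a rough stand-in for the notion of syllable, and the function name is hypothetical.

```python
import re

def omit_repeated_chunk(word):
    """Collapse the first run of two identical adjacent 2- or 3-letter chunks."""
    return re.sub(r"(\w{2,3})\1", r"\1", word, count=1)

print(omit_repeated_chunk("alkoholholdig"))  # alkoholdig
print(omit_repeated_chunk("meddelelse"))     # meddelse
```

The pattern has no knowledge of syllable boundaries, so it will also fire on repeated letter sequences that are not syllables.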

2.8.4. Syllable repetition

Syllable repetition is the opposite of syllable omission.

Examples Corruption rule

*rarerere [rarere] ?

(nicer)

*størrere [større]

(bigger)

Again, it is not easy to express this in our notation. The closest approximation will again be to repeat combinations of two or three letters.
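A sketch of this approximation is given below: every chunk of two or three adjacent letters is repeated once, yielding a set of candidate misspellings. The function name and the `sizes` parameter are assumptions made for the illustration. As expected of such a crude approximation, most of the candidates are implausible, but the two attested errors are among them.

```python
def repeat_chunks(word, sizes=(2, 3)):
    """Return all variants in which one chunk of the given sizes is written twice."""
    variants = set()
    for size in sizes:
        for start in range(len(word) - size + 1):
            chunk = word[start:start + size]
            variants.add(word[:start] + chunk + chunk + word[start + size:])
    return variants

print("rarerere" in repeat_chunks("rarere"))  # True
print("størrere" in repeat_chunks("større"))  # True
```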

[4] The notation informally described here has been designed by Gurli Rohde at Center for Sprogteknologi. The rules have been translated into Perl substitute statements in the implementation.

[5] Simple space insertion is, however, often caused by a simple typing error, whereas the insertion of a space to split the components of a compound in many cases is an intentional spelling error.