; verbose (-v: yes -v-: no)
-v -
; keep intermediary files (-x: yes -x-: no)
-x -
; flex rules (input file, binary format)
;-b (not specified, not doing actions with already created flex rules.)
; Word list (Optional if -b is specified. Otherwise N/A) (-I filename)
;-I (N/A)
; Output, lemmas of words in input (-I option)
;-O (N/A)
; word/lemma list
-i wfl-cs.txt.ph
; extra file name affix
-e ziggurat
; suffix only (-s: yes -s-: no)
-s -
; make rules with infixes less prevalent (-A: yes -A-: no)
-A -
; columns (1 or F or W=word, 2 or B or L=lemma, 3 or T=tags, 0 or O=other)
-n FBO
; max recursion depth when attempting to create candidate rule
-Q 1
; flex rules (output, binary format, can be left unspecified)
;-o (Not specified, autogenerated)
; temp dir (including separator at end!)
-j tmp/
; penalties to decide which rule survives (4 or 6 floating point numbers: R=>R;W=>R;R=>W;W=>W[;R=>N/A;W=>N/A], where R=#right cases, W=#wrong cases, N/A=#not applicable cases, previous success state=>success state after rule application)
-D 0.014974;-0.603475;0.774494;0.070238;0.018401;0.174589;
; compute parms (-p: yes -p-: no)
-p
; expected optimal pruning threshold (only effective in combination with -XW)
-C -1
; tree penalty (-XC: constant -XD: more support is better -XE: higher entropy is better -XW: fewer pattern characters other than wildcards is better)
-X C
; current parameters (-P filename)
-P parms.txt
; best parameters (-B filename)
-B best_ziggurat.txt
; start training with minimal fraction of training pairs (-Ln: 0.0 < n <= 1.0)
-L 0.108331
; end training with maximal fraction of training pairs (-Hn: 0.0 < n <= 1.0)
-H 1.000000
; number of differently sized fractions of training data (natural number)
-K 20
; number of iterations of training with same fraction of training data when fraction is minimal (positive number)
-N 100.000000
; number of iterations of training with same fraction of training data when fraction is maximal (positive number)
-M 10.000000
; competition function (deprecated)
;-f (N/A)
; redo training after homographs for next round are removed (-R: yes -R-: no)
;-R - (N/A)
; max. pruning threshold to evaluate
-c 5
; test with the training data (-T: yes -T-: no)
-T
; test with data not used for training (-t: yes -t-: no)
-t
; create flexrules using full training set (-F: yes -F-: no)
-F
; Number of clusters found in word/lemma list: 21487
; Number of lines found in word/lemma list: 184620
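; Note on the six -D penalties above: they weight the transitions between success
; states (right, wrong, not applicable) when candidate rules compete. The Python
; sketch below is illustrative only; the linear scoring form and variable names are
; assumptions, not the trainer's actual implementation. Only the weight values are
; taken from the -D setting in this file.

# Illustrative sketch: a weighted sum over success-state transitions
# (R=>R, W=>R, R=>W, W=>W, R=>N/A, W=>N/A) used to rank a candidate rule.
# The weights are the -D values from this parameter file; the scoring
# function itself is an assumption, not necessarily what the trainer does.
WEIGHTS = {
    "R=>R":   0.014974,   # right before, still right after
    "W=>R":  -0.603475,   # wrong before, right after (improvement)
    "R=>W":   0.774494,   # right before, wrong after (regression)
    "W=>W":   0.070238,   # wrong before, still wrong after
    "R=>N/A": 0.018401,   # right before, rule no longer applicable after
    "W=>N/A": 0.174589,   # wrong before, rule no longer applicable after
}

def penalty(transition_counts):
    """Lower is better under this (assumed) linear penalty model."""
    return sum(WEIGHTS[t] * transition_counts.get(t, 0) for t in WEIGHTS)

# Hypothetical candidate rule: fixes 40 wrong lemmas but breaks 3 right ones.
print(penalty({"W=>R": 40, "R=>W": 3, "R=>R": 120, "W=>W": 7}))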
; Evaluation:
; -----------
; Lemmatization results for all data in the training set.
; For pruning threshold 0 there may be no errors (diff%).
; prun. thrshld.          0              1              2              3              4              5
; rules            10825.000000    5819.000000    1694.000000     972.000000     717.000000     570.000000
; rules%              18.065754       9.711282       2.827103       1.622163       1.196595       0.951268
; same%               91.775701      85.719292      81.738985      80.165220      79.243992      78.636515
; ambi1%               3.998665       3.916889       2.701936       2.359813       2.253004       2.099466
; ambi2%               3.998665       3.573097       2.137850       1.797397       1.652203       1.572096
; ambi3%               0.226969       0.128505       0.028371       0.000000       0.000000       0.000000
; diff%                0.000000       6.662216      13.392857      15.677570      16.850801      17.691923
; same%stdev           0.000000       0.000000       0.000000       0.000000       0.000000       0.000000
; ambi1%stdev          0.000000       0.000000       0.000000       0.000000       0.000000       0.000000
; ambi2%stdev          0.000000       0.000000       0.000000       0.000000       0.000000       0.000000
; ambi3%stdev          0.000000       0.000000       0.000000       0.000000       0.000000       0.000000
; diff%stdev           0.000000       0.000000       0.000000       0.000000       0.000000       0.000000
;
;Evaluation of prediction of ambiguity (whether a word has more than one possible lemma)
;---------------------------------------------------------------------------------------
; amb.rules%           8.224299       8.117490       5.657543       4.864820       4.629506       4.407543
; false_amb%           0.000000       2.069426       1.969292       1.812417       1.735648       1.645527
; false_not_amb%       0.000000       2.176235       4.536048       5.171896       5.330441       5.462283
; true_amb%            8.224299       6.048064       3.688251       3.052403       2.893858       2.762016
; true_not_amb%       91.775701      89.706275      89.806409      89.963284      90.040053      90.130174
; precision            1.000000       0.593709       0.483589       0.457136       0.454641       0.456300
; recall               1.000000       0.735390       0.448458       0.371144       0.351867       0.335836
; Evaluation:
; -----------
; Lemmatization results for data that is not part of the training data.
; prun. thrshld.          0              1              2              3              4              5
; rules            10671.444444    5747.000000    1678.555556     953.777778     711.222222     565.222222
; rules%              18.079397       9.736479       2.843783       1.615876       1.204942       0.957591
; same%               77.617687      78.251149      78.437461      78.201466      77.965470      77.617687
; ambi1%               2.943734       2.509005       2.235747       1.900385       1.813439       1.801019
; ambi2%               1.937647       1.887964       1.689231       1.478077       1.403552       1.217240
; ambi3%               0.024842       0.037262       0.012421       0.012421       0.000000       0.000000
; diff%               17.476090      17.314619      17.625140      18.407651      18.817538      19.364054
; same%stdev           2.073212       2.106211       2.181531       2.103388       2.281293       2.221874
; ambi1%stdev          0.597689       0.486210       0.602384       0.697649       0.589163       0.537629
; ambi2%stdev          0.334158       0.554779       0.500750       0.537387       0.490869       0.361299
; ambi3%stdev          0.050571       0.057972       0.036271       0.036271       0.000000       0.000000
; diff%stdev           1.949122       1.903882       2.075545       1.971523       2.085303       2.172228
;
;Evaluation of prediction of ambiguity (whether a word has more than one possible lemma)
;---------------------------------------------------------------------------------------
; amb.rules%           6.297354       5.477580       4.757173       3.987082       3.825612       3.688983
; false_amb%           0.322941       0.298100       0.260837       0.136629       0.161471       0.173891
; false_not_amb%       0.919140       1.080611       1.105453       1.142715       1.179978       1.179978
; true_amb%            0.484412       0.322941       0.298100       0.260837       0.223575       0.223575
; true_not_amb%        9.688237       9.713079       9.750342       9.874550       9.849708       9.837287
; precision            0.428571       0.351351       0.363636       0.488372       0.409091       0.391304
; recall               0.345133       0.230088       0.212389       0.185841       0.159292       0.159292
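; Note on the precision and recall rows: the reported values are consistent with
; recall = true_amb / (true_amb + false_not_amb) and with a precision variant that
; counts each false positive twice, precision = true_amb / (true_amb + 2*false_amb).
; The short Python check below is illustrative (not part of the trainer) and
; reproduces, up to rounding, the held-out values at pruning threshold 0.

# Illustrative check: recompute precision/recall for the held-out data,
# pruning threshold 0, from the percentages reported above.
true_amb      = 0.484412   # correctly predicted ambiguous
false_amb     = 0.322941   # predicted ambiguous, actually unambiguous
false_not_amb = 0.919140   # predicted unambiguous, actually ambiguous

recall = true_amb / (true_amb + false_not_amb)
# The reported precision matches a variant that weights false positives twice.
precision = true_amb / (true_amb + 2 * false_amb)

print(f"precision = {precision:.6f}")   # ~0.428571, cf. the precision row above
print(f"recall    = {recall:.6f}")      # ~0.345133, cf. the recall row above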
;
; Power law relating the number of rules in the decision tree to the number of examples in the training data
;----------------------------------------------------------------------------------------------------------
; #rules =   0.860*N^0.857   0.471*N^0.855   0.148*N^0.847   0.113*N^0.822   0.095*N^0.810   0.091*N^0.792
; Postscriptum
; The number of rules can be estimated from the number of training examples by
; a power law. See the last line in the table above, which is based on 7
; different samples from the total available training data mass varying in size
; from 1.54 % to 98.56 %.
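; A quick way to use the fitted power law: plug the number of training examples N
; into #rules = a*N^b, with one (a, b) pair per pruning threshold (the six columns
; of the #rules row above correspond to thresholds 0-5). The sketch below is
; illustrative; the coefficients are copied from this report, and the example
; value of N = 50,000 is hypothetical.

# Illustrative use of the fitted power law #rules = a * N**b, where N is the
# number of examples in the training data (see the report above).
COEFFS = {          # pruning threshold: (a, b), copied from the #rules row above
    0: (0.860, 0.857),
    1: (0.471, 0.855),
    2: (0.148, 0.847),
    3: (0.113, 0.822),
    4: (0.095, 0.810),
    5: (0.091, 0.792),
}

def estimated_rules(n_examples, pruning_threshold=0):
    a, b = COEFFS[pruning_threshold]
    return a * n_examples ** b

# Hypothetical example: expected rule-set size for 50,000 training examples.
for thr in COEFFS:
    print(thr, round(estimated_rules(50_000, thr)))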