Evaluation

Test Scenarios

Four test scenarios were used. In test 0, the Stanford POS Tagger was applied off the shelf, i.e., unmodified and trained on contemporary English, to both EEPFECF1 and ECF2NCF.

In test 1, fivefold cross-validation was conducted on ECF2NCF. This scenario tested how well the tagger performs on historical language that differs from, but is still relatively close to, Present-Day English.
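The fivefold setup described above can be sketched as follows. This is a minimal illustration, not the actual evaluation code; the function name `five_fold_cross_validation` and the callback `train_and_eval` (assumed to train a tagger and return its accuracy) are hypothetical.

```python
from statistics import mean

def five_fold_cross_validation(sentences, train_and_eval, k=5):
    """Split a list of tagged sentences into k folds; each fold serves
    once as the test set while the remaining folds form the training set.
    `train_and_eval(train, test)` is assumed to return a tagging accuracy."""
    folds = [sentences[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        accuracies.append(train_and_eval(train, test))
    return mean(accuracies)
```

Averaging over the five folds gives a single accuracy figure per scenario, so that results on ECF2NCF and EEPFECF1 remain comparable.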

In test 2, ECF2NCF was used as the training set and EEPFECF1 as the test set. This scenario was chosen to isolate the influence of language change on tagging accuracy, since the training and test sets cover different historical periods.

In test 3, fivefold cross-validation was conducted on EEPFECF1, with ECF2NCF as additional training material. This scenario was chosen to measure the effect of in-domain training material and was expected to yield the best results for EEPFECF1.

Scenario   Training set                          Test set
Test 0     Pre-trained on contemporary English   EEPFECF1, ECF2NCF
Test 1     ECF2NCF                               ECF2NCF
Test 2     ECF2NCF                               EEPFECF1
Test 3     ECF2NCF + EEPFECF1                    EEPFECF1

Results

Results of Training on the Gold Standard

In a first step, the effect of training on the self-provided gold standard was tested, and the results were compared to those obtained with the pre-trained tagger.

For ECF2NCF, the pre-trained tagger already performs well. Training on the gold standard reaches almost the same tagging accuracy, even though the self-provided training material is much smaller than that used by the Stanford NLP Group. While the sparseness of training material does have a negative effect on dictionary coverage (cf. the higher percentage of unknown words), this effect is almost counterbalanced by the positive effects of in-domain training (cf. the slightly higher tagging accuracy for both known and unknown words).
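The distinction between known and unknown words used throughout these results can be made concrete with a small sketch. The function below is a hypothetical illustration, assuming the training lexicon is available as a set of word forms and that gold and predicted taggings are parallel lists of (word, tag) pairs.

```python
def split_accuracy(gold, predicted, lexicon):
    """Compute tagging accuracy separately for words that appear in the
    training lexicon (known) and those that do not (unknown)."""
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for (word, gold_tag), (_, pred_tag) in zip(gold, predicted):
        key = "known" if word in lexicon else "unknown"
        counts[key][1] += 1
        if pred_tag == gold_tag:
            counts[key][0] += 1
    # Return None for a category with no tokens to avoid division by zero.
    return {k: (c / t if t else None) for k, (c, t) in counts.items()}
```

Reporting the two figures separately is what makes the trade-off visible: a smaller training set lowers dictionary coverage (more unknown words), while in-domain material raises accuracy within each category.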

For EEPFECF1, the results of the pre-trained tagger are much worse. However, tests 2 and 3 show the positive effect of in-domain training material. The sequence of results illustrates how tagging accuracy increases with the various training sets; the improvement is most striking for unknown words.

Results of Using a Modified Tagger

In a second step, the modified tagger was tested and the results were compared to those of the unmodified tagger. The following charts show the best combination of modifications for each test scenario.

In test 1, the unmodified tagger already performs well. However, using the self-provided clusters both for generating features and for selecting candidate tags, together with extended deterministic tag expansion, leads to a small improvement.

In test 2, considerable improvement is achieved by using all of the described modifications except the lowercasing word function. Most notable is the improvement in tagging accuracy on unknown words.

In test 3, using the self-provided clusters for generating features and for selecting candidate tags, as well as extended deterministic tag expansion, leads to a considerable improvement in tagging accuracy on unknown words.
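The cluster-based candidate selection mentioned in tests 1 through 3 can be sketched as follows. This is a simplified stand-in for the actual modification, not the Stanford tagger's implementation; the helper names `build_cluster_tag_map` and `candidate_tags` are hypothetical, and the sketch assumes a precomputed word-to-cluster mapping.

```python
from collections import defaultdict

def build_cluster_tag_map(tagged_corpus, word_to_cluster):
    """Map each cluster to the set of tags its member words take in the
    training corpus, so that an unknown word can inherit the tag
    candidates of its cluster."""
    cluster_tags = defaultdict(set)
    for word, tag in tagged_corpus:
        if word in word_to_cluster:
            cluster_tags[word_to_cluster[word]].add(tag)
    return cluster_tags

def candidate_tags(word, word_to_cluster, cluster_tags, all_tags):
    """For an unknown word, restrict candidate tags to those observed
    for its cluster; fall back to the full tagset if the word has no
    cluster or its cluster was never seen in training."""
    cluster = word_to_cluster.get(word)
    if cluster is not None and cluster in cluster_tags:
        return cluster_tags[cluster]
    return all_tags
```

Restricting the candidate set in this way is one plausible reason the modifications help most on unknown words: spelling variants that share a cluster with attested forms are no longer scored against the full tagset.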