Selecting Candidate Tags for Unknown Words

Background

The POS tagger calculates the most likely sequence of tags for a given sentence, which is a very costly operation. Therefore, not all the possible tags are considered for each word, but only a certain selection of candidate tags. For known words, the candidate tags are chosen from a tag dictionary which has been generated during training. For unknown words, the selection of candidate tags relies on the definition of open and closed class tags. Only open class tags are considered as candidate tags.

Open class tags represent word classes like noun, verb, or adjective, which include a great variety of different words and are still productive. Closed class tags, on the other hand, represent word classes like preposition, modal auxiliary, or conjunction, which include only a limited number of words.

For standard Present Day English, it makes sense to exclude closed class tags from the candidate tags for unknown words, because it is reasonable to assume that all word forms belonging to these classes are already known from the training corpus. However, when dealing with historical language, this mechanism does more harm than good. Archaic word forms and spelling variants lead to a large number of unknown words that actually belong to these supposedly non-productive word classes. Here are just some examples:

Archaic word form    Present Day English equivalent    Possible closed class tag(s)
eyther               either                            CC, DT
euerie               every                             DT
yt                   that                              DT, WDT
culde                could                             MD
mee                  me                                PRP
theyr                their                             PRP$
vp                   up                                RP
whiche               which                             WDT

Note that the tagger cannot tag these words correctly, because only open class tags are considered as candidate tags for them.
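
The baseline selection described in this section can be sketched in a few lines of Python. This is only an illustration, not the tagger's actual code: the tag sets, the tiny dictionary, and the function name candidate_tags are all hypothetical.

```python
# Hypothetical, heavily simplified data; not taken from a real tagger model.
OPEN_CLASS_TAGS = {"NN", "NNS", "NNP", "VB", "VBD", "VBZ", "JJ", "RB"}

# Tag dictionary as it might be generated from a training corpus.
TAG_DICT = {
    "could": {"MD"},
    "up": {"RP", "IN", "RB"},
    "run": {"VB", "NN"},
}


def candidate_tags(word, tag_dict=TAG_DICT, open_tags=OPEN_CLASS_TAGS):
    """Return the candidate tags for a word under the baseline scheme."""
    known = tag_dict.get(word.lower())
    if known:
        # Known word: use the tags observed during training.
        return set(known)
    # Unknown word: only open class tags are considered.
    return set(open_tags)


print(candidate_tags("could"))          # the known form gets its dictionary tags
print("MD" in candidate_tags("culde"))  # the archaic form can never receive MD
```

In this toy setup, "culde" is unknown, so the correct tag MD is not even among its candidates, whatever the context suggests.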

Approach

The most straightforward solution for this problem would be to change the definition of closed class tags and define tags like CC, DT, MD etc. to be open class, too. This can be done easily by modifying the configuration of the Stanford POS Tagger. However, if there are too many open class tags, calculating the most likely sequence of tags can become too costly for sentences with a lot of unknown words. So I tested two different approaches.
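
To give a rough feeling for the cost argument, the following sketch counts how many distinct tag sequences exist for a sentence, given the number of candidate tags per word. It is an illustration only: the tagger does not enumerate all sequences, and the sentence length and tag counts used here are made-up numbers.

```python
from math import prod


def search_space_size(candidates_per_word):
    """Number of distinct tag sequences for the given per-word candidate counts."""
    return prod(candidates_per_word)


# Hypothetical six-word sentence: three known words with 2 candidate tags
# each, and three unknown words that receive every open class tag.
few_open_tags = [2, 2, 2, 12, 12, 12]   # 12 open class tags (assumed)
many_open_tags = [2, 2, 2, 25, 25, 25]  # 25 open class tags (assumed)

print(search_space_size(few_open_tags))
print(search_space_size(many_open_tags))
```

Since the count is a product, every additional open class tag multiplies the space once per unknown word, which is why sentences with many unknown words are affected most.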

First, the clusters were used to select candidate tags for unknown words. If a word is unknown, it is looked up in the clusters. If it is found in a cluster, all other words in the same cluster are looked up in the tag dictionary and, if they are known, their possible tags are added to the candidate tags.
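
A minimal sketch of the cluster lookup might look as follows. The cluster contents, the dictionary, and all names are hypothetical. The sketch also makes one assumption about the description above: tags found via the cluster are added to the open class tags rather than replacing them.

```python
# Hypothetical data for illustration only.
OPEN_CLASS_TAGS = {"NN", "NNS", "NNP", "VB", "VBD", "JJ", "RB"}
TAG_DICT = {"could": {"MD"}, "their": {"PRP$"}, "up": {"RP", "IN", "RB"}}
CLUSTERS = [
    {"culde", "coulde", "could"},
    {"theyr", "theyre", "their"},
]


def build_cluster_index(clusters):
    """Map every word to the cluster (set of words) that contains it."""
    return {word: cluster for cluster in clusters for word in cluster}


def candidate_tags_with_clusters(word, tag_dict, cluster_index, open_tags):
    """Select candidate tags, consulting the clusters for unknown words."""
    word = word.lower()
    if word in tag_dict:
        return set(tag_dict[word])
    # Assumption: cluster-derived tags extend the open class tags.
    candidates = set(open_tags)
    for neighbour in cluster_index.get(word, ()):
        # Cluster members missing from the tag dictionary contribute nothing.
        candidates |= tag_dict.get(neighbour, set())
    return candidates


index = build_cluster_index(CLUSTERS)
tags = candidate_tags_with_clusters("culde", TAG_DICT, index, OPEN_CLASS_TAGS)
print("MD" in tags)
```

With this selection, "culde" inherits MD from its known cluster neighbour "could", while a word that appears in no cluster still receives only the open class tags.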

Second, an external tag dictionary was used. If a word is not found in the tag dictionary generated during training, a second dictionary is consulted. Only if the word is not found there either are all open class tags chosen as candidate tags.
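
The fallback order of the second approach can be sketched like this. Both dictionaries and the function name are hypothetical and serve only to show the order of the lookups.

```python
# Hypothetical data for illustration only.
OPEN_CLASS_TAGS = {"NN", "NNS", "NNP", "VB", "VBD", "JJ", "RB"}
TRAINING_DICT = {"could": {"MD"}}
EXTERNAL_DICT = {"every": {"DT"}, "which": {"WDT"}}


def candidate_tags_with_fallback(word, training_dict, external_dict, open_tags):
    """Try the training dictionary, then the external one, then open class tags."""
    word = word.lower()
    for dictionary in (training_dict, external_dict):
        tags = dictionary.get(word)
        if tags:
            return set(tags)
    # Word is in neither dictionary: fall back to all open class tags.
    return set(open_tags)


for w in ("could", "every", "euerie"):
    print(w, sorted(candidate_tags_with_fallback(
        w, TRAINING_DICT, EXTERNAL_DICT, OPEN_CLASS_TAGS)))
```

In this toy example the external dictionary resolves "every" but not the archaic spelling "euerie", which still ends up with the open class tags only.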

Results

Using the clusters for selecting candidate tags improves the tagging accuracy considerably. The external dictionary also shows a small positive effect, but since it was compiled from contemporary English, this holds only for ECF2NCF. On the older part of the gold standard, EEPFECF1, the external dictionary even has a negative effect.

Furthermore, it turns out to be difficult to combine both approaches. In only one test scenario does using both the clusters and the dictionary have a positive effect. In the other test scenarios, using the clusters alone is the best choice.