Deterministic Tag Expansion

Background

The POS tagger calculates the most likely sequence of tags for a given sentence, which is a very costly operation. Therefore, not all the possible tags are considered for each word, but only a certain selection of candidate tags. For known words, the candidate tags are chosen from a tag dictionary which has been generated during training.
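
As a rough illustration of how such a tag dictionary could be derived, the following minimal sketch collects, for every word in a tagged training corpus, the set of tags it was seen with. The function names `build_tag_dictionary` and `candidate_tags` and the two toy training sentences are assumptions made for this example; they are not taken from the Stanford POS Tagger.

```python
from collections import defaultdict


def build_tag_dictionary(tagged_sentences):
    """Map each word to the set of tags it was seen with in training."""
    tag_dict = defaultdict(set)
    for sentence in tagged_sentences:
        for token in sentence.split():
            # Split on the last underscore so words containing "_" survive.
            word, _, tag = token.rpartition("_")
            tag_dict[word].add(tag)
    return dict(tag_dict)


def candidate_tags(word, tag_dict):
    """Return the candidate tags of a known word, or None if it is unknown."""
    return tag_dict.get(word)


# Toy training data in the word_TAG notation (hypothetical, for illustration).
training = [
    "the_DT watch_NN was_VBD pledged_JJ ._.",
    "they_PRP watch_VBP the_DT pictures_NNS ._.",
]
tag_dict = build_tag_dictionary(training)
print(sorted(candidate_tags("watch", tag_dict)))    # ['NN', 'VBP']
print(sorted(candidate_tags("pledged", tag_dict)))  # ['JJ']
```

In this toy corpus, pledged was only seen as JJ, so VBN would never be considered as a candidate for it.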

However, there is a risk that words have not been seen with all their possible tags during training, especially if training data are sparse (as in our case). To reduce that risk in certain well-defined cases, the Stanford POS Tagger applies a mechanism called deterministic tag expansion.

Deterministic tag expansion is based on the idea that in some cases it is reasonable to assume that a word which has been seen with tag x might also have tag y, even if it has not been seen with it in the training data. Therefore, if a word is listed with x but not with y in the tag dictionary, y is automatically added to the candidate tags.

The Stanford Tagger implements deterministic tag expansion in two cases: VB is expanded by VBP and vice versa, and VBD is expanded by VBN and vice versa. This reflects the linguistic fact that VB and VBP as well as VBD and VBN share the same form for regular verbs.
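
The two standard cases can be sketched as follows. This is a simplified illustration and not the Stanford Tagger's actual code; the names `EXPANSION_PAIRS` and `expand_tags` are assumptions made for this example.

```python
# Symmetric tag pairs named in the text: VB <-> VBP and VBD <-> VBN.
EXPANSION_PAIRS = [("VB", "VBP"), ("VBD", "VBN")]


def expand_tags(tags):
    """Return the candidate tags plus the partner of every paired tag."""
    expanded = set(tags)
    for first, second in EXPANSION_PAIRS:
        # If either member of a pair is present, make sure both are.
        if first in tags or second in tags:
            expanded.update((first, second))
    return expanded


print(sorted(expand_tags({"VB"})))         # ['VB', 'VBP']
print(sorted(expand_tags({"JJ", "VBN"})))  # ['JJ', 'VBD', 'VBN']
print(sorted(expand_tags({"NN"})))         # ['NN']
```

Tags outside the two pairs, such as NN, pass through unchanged.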

In this project, deterministic tag expansion was extended to further cases. Participles in -ing are usually tagged VBG, but they can also be used as JJ or NN. Likewise, -ed participles, which are usually tagged VBN, can also be used as JJ. When training data are sparse, there is a considerable risk that a specific participle form occurs only once in the training set. If it is tagged JJ or NN in that instance, the information that -ing forms are usually VBG and -ed forms are usually VBN is lost to the tagger. The idea was therefore to provide this information via deterministic tag expansion.

Approach

Deterministic tag expansion was modified to add the tag VBG to all words listed as JJ or NN that end in -ing or -yng, and to add the tag VBN to all words listed as JJ that end in -ed or -’d.

The reverse direction was not adopted. It did not seem reasonable to add the tags JJ and NN to all words with the tag VBG, or to add JJ to all words with the tag VBN, and experiments showed that doing so would have a negative effect on tagging accuracy.
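
The suffix-based extension might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the project's actual implementation: the names `SUFFIX_RULES` and `expand_by_suffix` are hypothetical, and the tag added for -ed forms is taken to be the participle tag VBN, in line with the Background section. The rules are deliberately one-directional, so a word listed only as VBG or VBN gains no new tags.

```python
# Each rule: (tags that trigger it, word endings, tag to add).
# Assumption: -ed / -'d forms receive the participle tag VBN.
SUFFIX_RULES = [
    ({"JJ", "NN"}, ("ing", "yng"), "VBG"),
    ({"JJ"}, ("ed", "'d", "\u2019d"), "VBN"),
]


def expand_by_suffix(word, tags):
    """Add VBG or VBN to the candidate tags when tag and ending both match."""
    expanded = set(tags)
    lower = word.lower()
    for triggers, suffixes, new_tag in SUFFIX_RULES:
        # str.endswith accepts a tuple and matches any of its members.
        if triggers & set(tags) and lower.endswith(suffixes):
            expanded.add(new_tag)
    return expanded


print(sorted(expand_by_suffix("pledged", {"JJ"})))    # ['JJ', 'VBN']
print(sorted(expand_by_suffix("striking", {"JJ"})))   # ['JJ', 'VBG']
print(sorted(expand_by_suffix("striking", {"VBG"})))  # ['VBG']
```

A pure suffix test like this one also fires for words such as red; whether such cases were filtered in the project is not stated here.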

Result

Extended deterministic tag expansion produces mixed results. In some cases, errors such as pledged_JJ in the following sentence were avoided (where a tag was wrong, the correct tag follows in square brackets):

In_IN a_DT year_NN from_IN the_DT time_NN when_WRB the_DT Moonstone_NN[NNP] was_VBD pledged_JJ[VBN] ,_, the_DT Indians_NNP[NNPS] will_MD be_VB on_IN the_DT watch_NN for_IN their_PRP$ third_JJ chance_NN ._.

On the other hand, in some cases new errors such as striking_VBG in the following sentence were introduced (the correct tag follows in square brackets):

The_DT divine_NN rapidly_RB drew_VBD striking_VBG[JJ] and_CC fearful_JJ pictures_NNS of_IN these_DT rustic_JJ crimes_NNS ._.

Overall, however, extended deterministic tag expansion has a small positive effect on tagging accuracy.