Using Wordfunctions for Normalization

Lowercasing and normalization of the verbal ending -'d and the underscore

Background

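One of the most frequent tagging errors attributable to language change is the false assignment of the proper noun tag NNP to capitalized common nouns, as in the following example (where a token carries two fused tags, as in Booke_NNPNN, the first is the tag assigned by the tagger and the second is the correct tag):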

This_DT Booke_NNPNN (_( as_CS were_VBD both_DTPDT the_DT rest_NN )_) was_VBD nothing_NN els_VBZRB but_RBIN a_DT deceipt_NN of_IN the_DT Inchaunteresse_NNPNN ,_, to_TO drawe_VB thether_RB one_CD of_IN those_DT Brethren_NNS ;_: where_WRB (_( in_IN stead_NN of_IN releeuing_VBG their_PRP$ Sister_NNPNN )_) they_PRP might_MD inthrall_VB themselues_NNSPRP ._.

Another source of error is the frequent use of the verbal ending -'d (instead of -ed):

I_PRP am_VBP divided_VBN from_IN Mankind_NNPNN ,_, a_DT Solitaire_NN ,_, one_CDNN banish’d_NNVBN from_IN humane_JJ Society_NN ._.

In the EEPF corpus, the letters n and m are sometimes replaced by an underscore. In the following example, the tagger fails to recognize the word "women" because its final n appears as an underscore (wome_).

For_CC by_IN those_DT gifts_NNS of_IN Nature_NN and_CC Fortune_NN (_( being_VBG in_IN all_DT places_NNS acceptable_JJ )_) he_PRP creepes_VBZ ,_, nay_NNRB (_( to_TO say_VB truely_RB )_) he_PRP flies_VBZ so_RB into_IN the_DT fauour_NN of_IN poore_FWJJ sillie_FWJJ wome__FWNNS ,_, that_CS I_PRP would_MD be_VB too_RB much_RB ashamed_JJ to_TO confesse_VB ,_, if_CS I_PRP had_VBD not_RB reuenge_VBNN in_IN my_PRP$ hande_NN ,_, as_RB well_RB as_CS shame_NN in_IN my_PRP$ cheekes_NNS ._.

All of these issues can be addressed by applying wordfunctions to the text prior to POS tagging.

Approach

The Stanford POS Tagger allows the use of wordfunctions by embedding them into the configuration file. To address the false assignment of NNP, a simple lowercasing function was used. A second wordfunction replaced the verbal ending -'d with -ed (or with -ld in the case of would, should, and could) and the underscore with the letter n.
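The two transformations described above can be sketched in a few lines. This is only an illustration of the string rewriting, not the wordfunctions actually used: in the tagger itself a wordfunction would be a Java class referenced from the configuration file, and the function names, the handling of typographic apostrophes, and the rule that a bare underscore token is left untouched are all assumptions made for this sketch.

```python
# Illustrative sketch only; names and edge-case rules are assumptions,
# not the wordfunctions used in the experiments.

# Contracted modals whose -'d stands for -ld rather than -ed.
MODALS = ("wou'd", "shou'd", "cou'd")


def lowercase(word: str) -> str:
    """Lowercasing function: removes capitalization before tagging."""
    return word.lower()


def normalize_d_and_underscore(word: str) -> str:
    """Rewrite the ending -'d and replace underscores with the letter n."""
    # Assumption: treat typographic apostrophes like the plain one.
    w = word.replace("\u2019", "'").replace("\u2018", "'")

    if w.lower() in MODALS:
        # wou'd -> would, shou'd -> should, cou'd -> could
        w = w[:-2] + "ld"
    elif w.lower().endswith("'d") and len(w) > 2:
        # banish'd -> banished
        w = w[:-2] + "ed"

    # Assumption: only rewrite underscores inside real words, so that a
    # token consisting solely of "_" is passed through unchanged.
    if any(ch.isalpha() for ch in w):
        w = w.replace("_", "n")
    return w


if __name__ == "__main__":
    tokens = ["one", "banish\u2019d", "from", "humane", "Society",
              "poore", "sillie", "wome_", "wou'd"]
    print([normalize_d_and_underscore(t) for t in tokens])
    print([lowercase(t) for t in ["This", "Booke", "Sister"]])
```

A rule this simple would also rewrite pronoun contractions such as he'd to heed; the text does not say whether the original function guarded against such cases. Replacing every underscore with n follows the description above, even though the underscore can also stand for m in the corpus.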

Results

Lowercasing does indeed reduce the number of false NNP assignments, but it also reduces the recognition rate of genuine proper nouns. Overall, applying the lowercasing function has a negative effect on tagging accuracy.

Applying the normalization function for -'d and the underscore showed a minor positive effect in some test scenarios.