Language Change as a Challenge

Objective

In the field of Natural Language Processing, part-of-speech (POS) tagging is generally considered a solved problem. State-of-the-art taggers such as the Stanford POS Tagger report tagging accuracies of around 97% for English.

However, these accuracy rates do not apply to historical texts. Language change leads to a high variance in spelling, morphology, syntax, and vocabulary. Thus, POS tagging historical texts poses a great challenge to off-the-shelf taggers.

In this project, the goal was to POS tag a corpus of English prose fiction from the 16th to the 19th century using the Stanford POS Tagger. The following example sentences highlight some of the problems that arise when the Stanford POS Tagger is used off the shelf, i.e. unmodified and trained on contemporary English. Where the tagger's output is wrong, the token is written as word_assigned tag→correct tag:

The_DT Servant_FW→NN convey’d_FW→VBD the_DT Letter_NNP→NN into_IN the_DT hand_NN of_IN Alcipus_NNP ,_, and_CC return’d_NN→VBD back_RB to_TO tell_VB us_PRP they_PRP were_VBD coming_VBG ._.

What_WP ,_, wil_NN→MD you_PRP not_RB take_VB my_PRP$ word_NN qd_FW→VBD Iarrat_FW→NNP ?_. sir_FW→NN qd_FW→VBD the_DT Catchpole_NNP→NN ,_, if_CS t_NN→PRP were_VBD for_IN any_DT matter_NN in_IN hel_NN ,_, I_PRP would_MD take_VB your_PRP$ word_NN as_IN→RB soone_NN→RB as_CS any_DT diuell_JJ→NN s_NN→POS in_IN that_DT place_NN ,_, but_CC seeing_VBG it_PRP is_VBZ for_IN a_DT matter_NN on_IN earth_NN ,_, I_PRP would_MD gladly_RB haue_VB a_DT surety_NN ._.

For_IN→CC mary_JJ→NNP s_NN→POS aunte_FW→NN hylde_FW→VBD on_IN the_DT yonge_NN→JJ duke_NN s_NN→POS party_NN and_CC afterwarde_NN→RB Murdered_VBN→VBD hyr_FW→PRP$ selfe_FW→NN when_WRB that_CS she_PRP knewe_VBP→VBD that_CS the_DT olde_NN→JJ duke_NN was_VBD conueyed_VBN out_IN of_IN pryso__NN by_IN ye_PRP→DT Iayler_FW→NN therof_FW→RB as_CS more_RBR playnly_RB here_RB after_IN→RB foloweth_NN→VBZ ._.

Approach

Creating a Gold Standard

To improve tagging accuracy on our corpus, the tagger was given a reasonable quantity of manually tagged in-domain training material.

Modifying the Tagger

For further improvement, several modifications were applied to the Stanford POS Tagger to optimize it for use on historical language.

Evaluation

Finally, the modified tagger was tested in several test scenarios.
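
As an illustration of how such a test can be scored, the following Python sketch computes token-level tagging accuracy by comparing tagger output against a manually tagged version of the same sentence, both in the word_TAG format used in the examples above. The function names parse_tagged and token_accuracy are hypothetical, and token-level accuracy is assumed here as the metric; the project's actual evaluation setup may differ.

```python
def parse_tagged(sentence, sep="_"):
    """Split a 'word_TAG word_TAG ...' string into (word, tag) pairs.

    The separator is searched from the right, so a word that itself
    contains the separator (e.g. 'pryso__NN') is still split correctly.
    """
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def token_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    pred_pairs = parse_tagged(predicted)
    gold_pairs = parse_tagged(gold)
    if len(pred_pairs) != len(gold_pairs):
        raise ValueError("predicted and gold sentences differ in token count")
    if not gold_pairs:
        raise ValueError("cannot compute accuracy on an empty sentence")
    correct = sum(p[1] == g[1] for p, g in zip(pred_pairs, gold_pairs))
    return correct / len(gold_pairs)


# First five tokens of the first example sentence:
# off-the-shelf tagger output versus the manually corrected tags.
predicted = "The_DT Servant_FW convey’d_FW the_DT Letter_NNP"
gold = "The_DT Servant_NN convey’d_VBD the_DT Letter_NN"
print(token_accuracy(predicted, gold))  # 2 of 5 tags match
```

On this fragment only the two determiners are tagged correctly, which shows how quickly accuracy drops once historical spellings and capitalization conventions appear.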