Modifying the Tagger

Modifying the Stanford POS Tagger for Use on Historical English

The Stanford POS Tagger is based on a log-linear Maximum Entropy Markov Model (MEMM), improved by the use of bidirectional dependency networks (Toutanova et al. 2003). It includes feature templates for tag and word combinations, as well as templates for word shape features and distributional similarity features.
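To make the idea of a log-linear model over feature templates more concrete, here is a minimal sketch in Python. It is not the Stanford tagger's implementation: the function names (word_shape, extract_features, tag_distribution), the four feature templates, and the toy weights are all assumptions made for illustration, and the sketch conditions only on the previous tag rather than using bidirectional dependencies.

```python
import math
import re


def word_shape(word):
    """Map a token to a coarse shape (hypothetical template, for illustration)."""
    shape = re.sub(r"[A-Z]", "X", word)   # uppercase letters -> X
    shape = re.sub(r"[a-z]", "x", shape)  # lowercase letters -> x
    shape = re.sub(r"[0-9]", "d", shape)  # digits -> d (done last, since 'd' is lowercase)
    # Collapse runs of the same shape character, e.g. 'Xxxx' -> 'Xx'.
    return re.sub(r"(.)\1+", r"\1", shape)


def extract_features(words, i, prev_tag):
    """Instantiate a few illustrative feature templates for position i."""
    word = words[i]
    return [
        f"word={word.lower()}",
        f"shape={word_shape(word)}",
        f"suffix2={word[-2:].lower()}",
        f"prev_tag={prev_tag}",
    ]


def tag_distribution(features, weights, tags):
    """Log-linear local model: P(tag | features) is a softmax over summed weights."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in features) for t in tags}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


# Toy weights, invented for this example only.
toy_weights = {("shape=d", "CD"): 2.0, ("prev_tag=DT", "NN"): 1.0}
toy_tags = ["CD", "NN", "VB"]

features = extract_features(["the", "1642"], 1, prev_tag="DT")
dist = tag_distribution(features, toy_weights, toy_tags)
best_tag = max(dist, key=dist.get)
```

In a trained tagger the weights would be estimated from annotated data. The sketch only shows how features that fire for a token contribute additively to each tag's score before normalization.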

For this project, the feature sets described in Toutanova et al. (2003) and Manning (2011) were used. Several modifications were made to the tagger to improve tagging accuracy on our historical corpus. For each modification, I explain the background and basic ideas, the approach, and the results in more detail.

Using Wordfunctions for Normalization

Clustering

Selecting Candidate Tags for Unknown Words

Deterministic Tag Expansion

Arabic and Roman Number Detector