Modifying the Tagger

The Stanford POS Tagger is based on a log-linear Maximum Entropy Markov Model (MEMM), improved by the use of bidirectional dependency networks (Toutanova et al. 2003). It includes feature templates for various tag and word combinations, as well as templates for various word shape features and distributional similarity features.

For this project, the feature sets described in Toutanova et al. (2003) and Manning (2011) were used. Several modifications were made to the tagger in order to improve tagging accuracy on our historical corpus. For each modification, I describe the background and basic idea, the approach, and the results.

Using Wordfunctions for Normalization

Lowercasing and normalization of the verbal ending -’d and the underscore
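
As a rough illustration of what such a word function might do, here is a minimal Python sketch. It is not the tagger's own code. The function name normalize is hypothetical, and both concrete rewrite rules are assumptions made for the example: that word-final -’d is rewritten to -ed, and that underscores are simply removed.

```python
import re

# Assumption: the verbal ending -'d (straight or curly apostrophe) maps to -ed.
_APOSTROPHE_D = re.compile(r"['\u2019\u2018]d$")


def normalize(token: str) -> str:
    """Hypothetical word function: lowercase, drop underscores, rewrite -'d."""
    lowered = token.lower()
    # Assumption: underscores carry no information for tagging and are removed.
    without_underscore = lowered.replace("_", "")
    return _APOSTROPHE_D.sub("ed", without_underscore)


if __name__ == "__main__":
    for word in ["Lov'd", "call\u2019d", "Would", "_walk'd"]:
        print(word, "->", normalize(word))
```

A rule this simple also rewrites contractions such as "I'd", so a real implementation would presumably need an exception list or a check on the stem.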

Clustering

Using Brown clusters for generating distributional similarity features

Selecting Candidate Tags for Unknown Words

Using clusters and an external dictionary for selecting candidate tags for unknown words

Deterministic Tag Expansion

Extending deterministic tag expansion to VBG, VBN, and JJ

Arabic and Roman Number Detector

Adding a new feature template for detecting Arabic and Roman numbers
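
A detector of this kind can be sketched with two regular expressions. The following Python example is only an illustration under stated assumptions, not the feature template added to the tagger: the function name number_shape, the feature values ARABIC, ROMAN, and NONE, and the exact patterns are all hypothetical. The Roman pattern accepts only standard subtractive notation, so historical spellings such as "iiij" are not covered.

```python
import re

# Arabic numbers: ASCII digits, optionally grouped by "." or "," (e.g. 1,642 or 3.5).
_ARABIC = re.compile(r"[0-9]+(?:[.,][0-9]+)*")

# Roman numerals in standard subtractive notation, 1 to 3999.
_ROMAN = re.compile(
    r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)


def number_shape(token: str) -> str:
    """Return a hypothetical feature value: 'ARABIC', 'ROMAN', or 'NONE'."""
    if not token:
        return "NONE"  # the Roman pattern would otherwise match the empty string
    if _ARABIC.fullmatch(token):
        return "ARABIC"
    if _ROMAN.fullmatch(token.upper()):
        return "ROMAN"
    return "NONE"


if __name__ == "__main__":
    for word in ["1642", "XIV", "xiv", "IIII", "word", "mix"]:
        print(word, "->", number_shape(word))
```

Ordinary words such as "I" or "mix" are also well-formed Roman numerals, which is one reason to expose the result as a feature for the model to weigh in context rather than as a deterministic tagging rule.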