Modifying the Tagger
The Stanford POS Tagger is based on a log-linear Maximum Entropy Markov Model (MEMM), extended with bidirectional dependency networks (Toutanova et al. 2003). It provides feature templates for tag and word combinations, as well as templates for word shape features and distributional similarity features.
For this project, the feature sets described in Toutanova et al. (2003) and Manning (2011) were used. Several modifications were made to the tagger to improve tagging accuracy on our historical corpus. For each modification, I will describe the background and basic idea, the approach, and the results.
Using Wordfunctions for Normalization
Lowercasing and normalization of the verbal ending -’d and the underscore
Clustering
Using Brown clusters for generating distributional similarity features
Selecting Candidate Tags for Unknown Words
Using clusters and an external dictionary for selecting candidate tags for unknown words
Deterministic Tag Expansion
Extending deterministic tag expansion to VBG, VBN, and JJ
Arabic and Roman Number Detector
Adding a new feature template for detecting Arabic and Roman numbers
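To give a concrete impression of the last modification in the list above, the following is a minimal sketch of how a token could be classified as an Arabic or Roman number. It is written in Python for brevity and is not the tagger's actual implementation; the function name `number_shape`, the two regular expressions, and the returned labels are all assumptions made for illustration. In particular, the sketch only accepts Roman numerals in standard subtractive notation and does not cover historical spelling variants.

```python
import re

# Hypothetical patterns, chosen for illustration only.
# Arabic: digits, optionally grouped by "." or "," (e.g. "1642", "1,000").
ARABIC_RE = re.compile(r"\d+(?:[.,]\d+)*")

# Roman: standard subtractive notation up to 4999; the lookahead
# rejects the empty string, which the rest of the pattern would accept.
ROMAN_RE = re.compile(
    r"(?=[IVXLCDM])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})",
    re.IGNORECASE,
)


def number_shape(token: str) -> str:
    """Return "ARABIC", "ROMAN", or "NONE" for a single token."""
    if ARABIC_RE.fullmatch(token):
        return "ARABIC"
    if ROMAN_RE.fullmatch(token):
        return "ROMAN"
    return "NONE"


if __name__ == "__main__":
    for tok in ["1642", "XIV", "xiv", "word", "mix"]:
        print(tok, number_shape(tok))
```

Note that ordinary words such as "I" or "mix" also match the Roman pattern. This is one reason to expose the detector's output as a feature that the model weighs against the surrounding context, rather than as a hard tagging rule.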