Arabic and Roman Number Detector

Adding a new feature template for detecting Arabic and Roman numbers

Background

When the tagger is trained on the younger part of the gold standard (ECF2NCF) and tested on the older part (EEPFECF1), tagging accuracy on numbers is strikingly poor. As the following example shows, numbers are hardly ever recognized and tagged correctly as CD:

LONDON_NNNNP ,_, Printed_VBN for_IN R._NNP Bentley_NNP and_CC S._NNP Magnes_NNP ,_, at_IN the_DT Post-house_NN in_IN Russel-street_NNNNP in_IN Covent-Garden_NNP ,_, 1685_NNPCD ._.

The reason is that ECF2NCF contains only one instance of an Arabic number. Although the Stanford POS Tagger includes suitable feature templates for digits, a single training instance is not enough for useful features to be generated from them.

However, although ECF2NCF contains only a single Arabic number, it contains several Roman numbers. The idea was therefore to create a new feature template that applies to Roman and Arabic numbers alike, increasing the number of relevant training instances available in ECF2NCF.

Approach

The Number Detector was implemented as a feature template for rare words. It generates features whenever a word matches one of two regular expressions: \\.?\\d+\\.? for Arabic numbers, or [IVXLCDM]+\\.? for Roman numbers.
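
The matching step described above can be illustrated with a minimal sketch. It is not the actual feature template code. The class and method names are hypothetical, and it assumes that the doubled backslashes in the two expressions are Java string escaping and that a word must match an expression in full rather than merely contain a match.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; names are not taken from the tagger's source.
class NumberDetectorSketch {
    // The two expressions quoted in the text, used as Java string literals.
    private static final Pattern ARABIC = Pattern.compile("\\.?\\d+\\.?");
    private static final Pattern ROMAN = Pattern.compile("[IVXLCDM]+\\.?");

    // Assumption: the whole word has to match, not just a substring of it.
    static boolean isArabic(String word) {
        return ARABIC.matcher(word).matches();
    }

    static boolean isRoman(String word) {
        return ROMAN.matcher(word).matches();
    }

    // True if either expression matches the word in full.
    static boolean isNumber(String word) {
        return isArabic(word) || isRoman(word);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"1685", "XIV.", "LONDON", "I", "MIX"}) {
            System.out.println(w + " -> " + isNumber(w));
        }
    }
}
```

Under these assumptions, the Roman expression accepts any sequence of the letters I, V, X, L, C, D and M, so ordinary uppercase words such as "I" or "MIX" match as well, whereas lowercase forms such as "xiv" do not.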

Result

With the Number Detector enabled, the recognition of numbers improves very slightly. In the example sentence above, the number is now tagged correctly:

LONDON_NNNNP ,_, Printed_VBN for_IN R._NNP Bentley_NNP and_CC S._NNP Magnes_NNP ,_, at_IN the_DT Post-house_NN in_IN Russel-street_NNNNP in_IN Covent-Garden_NNP ,_, 1685_CD ._.

At the same time, applying an additional feature template also has unintended side effects, some of which introduce new errors.