Clustering

Background

The major challenge in POS tagging historical English is spelling variation, which leads to an unusually high percentage of unknown words. In the following example, the tagger fails to recognize the words suche, faithfull, friendes, and togither as variants of the known words such, faithful, friends, and together. Instead, it treats them as unknown words.

It_PRP grieues_VBZ me_PRP that_CS suche_NN/JJ faithfull_RB/JJ friendes_JJ/NNS For_CC/IN aye_JJ/RB togither_NN/RB may_MD not_RB dwell_VB ._.

One way to deal with this problem is to use external knowledge about distributional similarity. The basic idea is to run an unsupervised clustering algorithm on a large corpus of unannotated text. The algorithm groups together word forms that have a similar distribution across the corpus, i.e., that are used in similar contexts. The tagger then uses each word's cluster membership to generate additional features.
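The following minimal sketch illustrates why cluster membership helps with spelling variants. It is not the Stanford POS Tagger's feature extractor: the mapping WORD_TO_CLUSTER, the cluster IDs, and the function cluster_features are hypothetical names invented for this illustration.

```python
# Hypothetical word-to-cluster mapping; the cluster IDs are made up.
WORD_TO_CLUSTER = {
    "such": "c17", "suche": "c17",
    "faithful": "c42", "faithfull": "c42",
    "friends": "c88", "friendes": "c88",
    "together": "c05", "togither": "c05",
}


def cluster_features(tokens, i, window=1):
    """Return cluster-ID features for tokens[i] and its neighbours."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            # Words without a cluster fall back to a shared placeholder.
            cluster = WORD_TO_CLUSTER.get(tokens[j].lower(), "UNKNOWN")
        else:
            # Positions outside the sentence get a padding value.
            cluster = "PAD"
        feats[f"cluster[{offset}]"] = cluster
    return feats


historical = "that suche faithfull friendes".split()
modern = "that such faithful friends".split()

# The variant and the modern spelling produce the same cluster features.
print(cluster_features(historical, 1))
print(cluster_features(modern, 1))
```

Because a variant such as suche shares a cluster with such, a tagger that has never seen suche in its training data can still receive the same cluster feature that it learned for such.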

The Stanford POS Tagger already includes feature templates for distributional similarity, and pre-computed clusters can be downloaded from the Stanford website. However, these clusters were extracted from contemporary English sources. Therefore, a Brown clustering was conducted on our historical corpus in order to provide the Stanford POS Tagger with clusters of historical English.

Approach

Brown clustering (Brown et al. 1992) works on the assumption that similar words tend to be used in similar contexts or, more specifically, to have similar neighbouring words. Brown et al. give the example of the words Thursday and Friday: both might appear after words like on or last, and both might be followed by words like morning or night. From that, it can be assumed that Thursday and Friday are similar words and should be put in the same cluster. The objective of Brown clustering is to partition the vocabulary into a fixed number of clusters such that the words with the most similar contexts end up in the same cluster.
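The neighbouring-word assumption can be illustrated with a small sketch. It only compares neighbour counts with cosine similarity and is not the Brown algorithm itself; the toy sentences and the helper names context_vector and cosine are assumptions made for this illustration.

```python
from collections import Counter
from math import sqrt


def context_vector(word, sentences):
    """Count the left (L:) and right (R:) neighbours of `word`."""
    counts = Counter()
    for sentence in sentences:
        # Boundary markers guarantee that every word has two neighbours.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i, tok in enumerate(tokens):
            if tok == word:
                counts["L:" + tokens[i - 1]] += 1
                counts["R:" + tokens[i + 1]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented toy corpus, loosely following the Thursday/Friday example.
TOY_CORPUS = [
    "we met on Thursday morning",
    "we met on Friday morning",
    "she left last Thursday night",
    "she left last Friday night",
    "we met the old captain",
]

thursday = context_vector("Thursday", TOY_CORPUS)
friday = context_vector("Friday", TOY_CORPUS)
captain = context_vector("captain", TOY_CORPUS)

print(cosine(thursday, friday))   # identical neighbours in this toy corpus
print(cosine(thursday, captain))  # no shared neighbours
```

In this toy corpus, Thursday and Friday have exactly the same neighbours, so a clustering based on neighbouring words would be expected to group them together, while captain shares no neighbours with either.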

For this project, a flat version of the Brown algorithm was applied to our historical text corpus. In total, 91,586 words were grouped into 640 clusters.
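Brown et al. (1992) score a partition of the vocabulary by the average mutual information between the clusters of adjacent words. The sketch below computes that score for a given flat partition. It does not search for the best partition, and it is not the implementation used in this project; the toy token sequence, the two partitions, and the function name average_mutual_information are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def average_mutual_information(tokens, word_to_cluster):
    """Average mutual information (in bits) between adjacent clusters."""
    # Count cluster bigrams over all adjacent token pairs.
    bigrams = Counter(
        (word_to_cluster[a], word_to_cluster[b])
        for a, b in zip(tokens, tokens[1:])
    )
    total = sum(bigrams.values())

    # Marginal counts for the left and right position of a bigram.
    left, right = Counter(), Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n

    ami = 0.0
    for (c1, c2), n in bigrams.items():
        p12 = n / total
        ami += p12 * log2(p12 / ((left[c1] / total) * (right[c2] / total)))
    return ami


# Invented toy sequence of tokens.
tokens = (
    "on Thursday morning on Friday night "
    "last Thursday night last Friday morning"
).split()

# A partition that groups words with similar neighbours ...
good = {"on": "P", "last": "P", "Thursday": "D", "Friday": "D",
        "morning": "T", "night": "T"}
# ... and an arbitrary partition with the same number of clusters.
arbitrary = {"on": "A", "last": "B", "Thursday": "C", "Friday": "A",
             "morning": "B", "night": "C"}

print(average_mutual_information(tokens, good))
print(average_mutual_information(tokens, arbitrary))
```

The partition that puts Thursday and Friday into one cluster scores higher than the arbitrary one, because knowing the cluster of a word then tells more about the cluster of the next word.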

Results

The resulting 640 word clusters indeed proved helpful for handling spelling variants. When the Stanford POS Tagger is provided with these clusters instead of the default ones, tagging accuracy on unknown words improves considerably.

To give an impression, here are the clusters that contain the words from the example above (each abridged after 100 words):

suche: such such-and-such sike sutche sech zuch sitch siccan somuch sich sutch sic actaeons somuche whaten noscitur cales

faithfull: vigilant lowly ferocious predominant wofull heauie philosophical bewitching soueraigne devout chast stubborn godly surly crafty bloudy curteous blunt sympathetic submissive mightie sprightly courtly rebellious disconsolate diuine malignant genial masculine capricious artless hardy luxurious despairing considerate meek candid judicious fervent loyal skilful sordid sorrowfull hospitable sentimental zealous watchful brutal vigorous chearful sanguine chaste trusty hopeful sensitive dutiful vicious disinterested rigid discreet homely braue benevolent dignified confidential charitable malicious refined childish timid respectful princely manly hearty courteous valiant haughty pious ardent stately graceful bloody vertuous virtuous sincere modest lively passionate gracious gallant friendly affectionate delicate simple gentle generous tender noble humble …

friendes: accomplices seruauntes grandchildren contemporaries persecutors finances dependants benefactors accusers correspondents successors landes dependents neighbors tutors employers freendes freinds lordships colleagues subiectes frendes nephews kinsmen hornes enimies confederates vassals opponents inferiors pursuers enemyes parentes husbandes predecessors wittes mates clients adversaries maiesty maisters forefathers auditors parishioners nieces equals aunts superiors devotions betters guardians rivals sakes fellow-creatures fellowes patients uncles favourites subiects hearers mistresses associates sonnes soules wiues customers foes admirers relatives seruants comrades sermons creditors acquaintances countrymen kindred brains ancestors followers cousins brethren readers mothers cloaths wits shoes husbands quarters masters prayers neighbours lodgings relations fathers clothes companions enemies parents friends …

togither: vnespied hungerly alowde drows unreconciled forthwardes pointblank forrards seawards scathless a-fishing unbless’d encaged cozily lengthwise a-breast hard-by wythall unfilled above-ground peccaui ankle-deep d’espagne above-stairs thickens cap-a-pee downeward togyder skelter hitherward scotfree togethers uncomplainingly broadcast diagonally home-along vertically consecutively a-pace unpunish’d vpwards alofte lengthways undetected 1780 a-walking head-foremost a-shore a-horseback agape below-stairs eastwards underfoot amidships togyther rent-free a-hunting knee-deep scot-free togeather vpward thitherward anonymously out-right aground froid unscathed undiscover’d pell-mell vnreuenged amaine vnpunished hand-in-hand unrewarded horizontally outwards hereabout bareheaded forrard therefrom northwards unpunished barefoot in-doors abrode awry arm-in-arm unmolested abroade unperceived apace indoors unnoticed overhead unobserved aloft asunder abroad together …
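Listings like the ones above could be produced from a word-to-cluster mapping along the following lines. This is only a sketch: the input format (one word and one cluster ID per line, separated by a tab), the cluster IDs, and the function names load_clusters and cluster_listing are assumptions, and the members are shown in file order rather than in the order used above.

```python
def load_clusters(lines):
    """Parse 'word<TAB>cluster_id' lines (an assumed format) into a dict."""
    word_to_cluster = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, cluster_id = line.split("\t")
        word_to_cluster[word] = cluster_id
    return word_to_cluster


def cluster_listing(word, word_to_cluster, limit=100):
    """Format the other members of `word`'s cluster, abridged after `limit`."""
    cluster_id = word_to_cluster[word]
    members = [
        w for w, c in word_to_cluster.items()
        if c == cluster_id and w != word
    ]
    suffix = " …" if len(members) > limit else ""
    return f"{word}: {' '.join(members[:limit])}{suffix}"


# Made-up sample lines; the cluster IDs are hypothetical.
SAMPLE_LINES = [
    "such\t17",
    "sutche\t17",
    "suche\t17",
    "friends\t88",
    "sich\t17",
]

mapping = load_clusters(SAMPLE_LINES)
print(cluster_listing("suche", mapping))
print(cluster_listing("suche", mapping, limit=2))
```

With limit=2, the listing is cut off after two members and marked with an ellipsis, in the same way as the abridged clusters above.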