Creating a Goldstandard
Dataset and Samples
The dataset for this project consists of three corpora of English prose fiction: Early English Prose Fiction (EEPF), containing texts with publication dates ranging from 1508 to 1700; Eighteenth Century Fiction (ECF; 1705-1780); and Nineteenth Century Fiction (NCF; 1782-1903). For the goldstandard, a random sample of sentences was drawn by selecting one sentence from each text, plus one additional sentence per 1,500 sentences in that text.
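The sampling scheme above can be sketched as follows; this is a minimal reconstruction under the assumption that the per-text sample size is one sentence plus one per full 1,500 sentences, drawn uniformly at random (the function name and seed handling are illustrative, not from the original):

```python
import random

def sample_sentences(sentences, interval=1500, seed=0):
    """Hypothetical sketch of the per-text sampling scheme.

    Draws one sentence plus one additional sentence per full
    `interval` sentences, uniformly at random without replacement.
    """
    rng = random.Random(seed)
    n = 1 + len(sentences) // interval
    # A very short text still contributes at most all of its sentences.
    n = min(n, len(sentences))
    return rng.sample(sentences, n)
```

Under this reading, a text of 3,000 sentences contributes three sample sentences, and a text shorter than 1,500 sentences contributes exactly one.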
The resulting sample sentences were grouped into two sets: ECF2NCF, consisting of sample sentences from texts by authors born after 1700, and EEPFECF1, consisting of sample sentences from texts by authors born before 1700.
| Sample set | Number of sentences | Number of tokens | Birth dates of authors |
|---|---|---|---|
| ECF2NCF | 1,474 | 39,737 | 1700-1869 |
| EEPFECF1 | 408 | 20,216 | 1460-1699 |
Tagging
Before sampling, the corpora were tokenized using the Stanford PTB Tokenizer, and split into sentences using a custom sentence segmentation algorithm. For the sample sentences, tokenization and sentence segmentation were subsequently corrected by hand.
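PTB-style tokenization separates punctuation and contraction clitics into their own tokens. The sketch below is a deliberately minimal, regex-based approximation of those conventions, not the Stanford PTB Tokenizer itself (which handles many more cases, such as quote normalization and abbreviations):

```python
import re

def ptb_like_tokenize(text):
    """Minimal approximation of PTB tokenization conventions.

    Not the Stanford tokenizer: only splits off common punctuation
    and separates frequent English contraction clitics.
    """
    # Surround punctuation marks with spaces so they become tokens.
    text = re.sub(r'([,;:.!?()"])', r' \1 ', text)
    # Split clitics such as "n't", "'s", "'ll" from their host word.
    text = re.sub(r"(\w)('s|n't|'re|'ll|'ve|'d|'m)\b", r"\1 \2", text)
    return text.split()
```

For example, `ptb_like_tokenize("She didn't go.")` yields `["She", "did", "n't", "go", "."]`, matching the PTB convention of treating the negative clitic as a separate token.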
Finally, the sample sentences were manually annotated with POS tags, using a slightly adapted version of the Penn Treebank (PTB) tagset (Santorini 1990).
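A manually annotated sentence can be represented as token-tag pairs, with each tag drawn from the PTB tagset. The example sentence, its representation as a list of pairs, and the tag subset below are illustrative assumptions; the actual storage format of the goldstandard is not specified in the text:

```python
# Hypothetical example of one annotated sample sentence, with tags
# from the Penn Treebank tagset (Santorini 1990).
tagged = [
    ("It", "PRP"), ("was", "VBD"), ("a", "DT"),
    ("dark", "JJ"), ("night", "NN"), (".", "."),
]

# Small illustrative subset of the PTB tagset, for a sanity check only.
PTB_SUBSET = {"PRP", "VBD", "DT", "JJ", "NN", "."}

def check_tags(sentence, tagset):
    """Return the token-tag pairs whose tag is not in the given tagset."""
    return [(tok, tag) for tok, tag in sentence if tag not in tagset]
```

A check of this kind is useful during manual annotation to catch tags that fall outside the (adapted) tagset.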