Creating a Gold Standard

Dataset and Samples

The dataset for this project consists of three corpora of English prose fiction: Early English Prose Fiction (EEPF), containing texts with publication dates ranging from 1508 to 1700; Eighteenth Century Fiction (ECF; 1705-1780); and Nineteenth Century Fiction (NCF; 1782-1903). For the gold standard, a random sample of sentences was selected by picking one sentence per text, plus one additional sentence for every 1,500 sentences the text contains.
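The sampling rule (one sentence per text plus one per 1,500 sentences) could be sketched as follows; the function name, seeding, and use of uniform random sampling without replacement are illustrative assumptions, since the exact procedure is not specified here:

```python
import random

def sample_sentences(sentences, rate=1500, seed=0):
    """Pick one sentence from a text, plus one additional sentence
    for every full `rate` sentences the text contains.

    `seed` is fixed only to make the sketch reproducible; the
    original sampling procedure may have differed in detail."""
    rng = random.Random(seed)
    n = 1 + len(sentences) // rate          # 1 base pick + 1 per 1,500 sentences
    n = min(n, len(sentences))              # never ask for more than exist
    return rng.sample(sentences, n)         # uniform, without replacement

text = [f"Sentence {i}." for i in range(3200)]
picked = sample_sentences(text)
print(len(picked))  # 3 for a 3,200-sentence text: 1 + 3200 // 1500
```

Applied per text, this yields sample sizes roughly proportional to text length, which matches the corpus-level counts reported below.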

The resulting sample sentences were grouped in two sets: ECF2NCF, consisting of sample sentences from texts of authors born after 1700, and EEPFECF1, consisting of sample sentences from texts of authors born before 1700.

Sample set   Number of sentences   Number of tokens   Birth dates of authors
ECF2NCF      1,474                 39,737             1700-1869
EEPFECF1     408                   20,216             1460-1699

Tagging

Before sampling, the corpora were tokenized using the Stanford PTB Tokenizer, and split into sentences using a custom sentence segmentation algorithm. For the sample sentences, tokenization and sentence segmentation were subsequently corrected by hand.

Finally, the sample sentences were manually annotated with POS tags, using a slightly adapted version of the Penn Treebank (PTB) tagset (Santorini 1990).