Creating a Goldstandard
Dataset and Samples
The dataset for this project consists of three corpora of English prose fiction: Early English Prose Fiction (EEPF), containing texts with publication dates ranging from 1508 to 1700; Eighteenth Century Fiction (ECF; 1705-1780); and Nineteenth Century Fiction (NCF; 1782-1903). For the goldstandard, a random sample of sentences was drawn by selecting one sentence from each text, plus one additional sentence per 1,500 sentences in that text.
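The sampling scheme above can be sketched as follows; this is a minimal reconstruction under the assumption that the per-text sample size is one sentence plus one per full 1,500 sentences, drawn uniformly at random (the function name and seed handling are illustrative, not from the original):

```python
import random

def sample_sentences(sentences, interval=1500, seed=0):
    """Hypothetical sketch of the per-text sampling scheme.

    Draws one sentence plus one additional sentence per full
    `interval` sentences, uniformly at random without replacement.
    """
    rng = random.Random(seed)
    n = 1 + len(sentences) // interval
    # A very short text still contributes at most all of its sentences.
    n = min(n, len(sentences))
    return rng.sample(sentences, n)
```

Under this reading, a text of 3,000 sentences contributes three sample sentences, and a text shorter than 1,500 sentences contributes exactly one.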
The resulting sample sentences were grouped into two sets: ECF2NCF, consisting of sample sentences from texts by authors born after 1700, and EEPFECF1, consisting of sample sentences from texts by authors born before 1700.
| Sample set | Number of sentences | Number of tokens | Birth dates of authors |
|---|---|---|---|
| ECF2NCF | 1,474 | 39,737 | 1700-1869 |
| EEPFECF1 | 408 | 20,216 | 1460-1699 |
Tagging
Before sampling, the corpora were tokenized using the Stanford PTB Tokenizer, and split into sentences using a custom sentence segmentation algorithm. For the sample sentences, tokenization and sentence segmentation were subsequently corrected by hand.
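PTB-style tokenization separates punctuation and contraction clitics into their own tokens. The sketch below is a deliberately minimal, regex-based approximation of those conventions, not the Stanford PTB Tokenizer itself (which handles many more cases, such as quote normalization and abbreviations):

```python
import re

def ptb_like_tokenize(text):
    """Minimal approximation of PTB tokenization conventions.

    Not the Stanford tokenizer: only splits off common punctuation
    and separates frequent English contraction clitics.
    """
    # Surround punctuation marks with spaces so they become tokens.
    text = re.sub(r'([,;:.!?()"])', r' \1 ', text)
    # Split clitics such as "n't", "'s", "'ll" from their host word.
    text = re.sub(r"(\w)('s|n't|'re|'ll|'ve|'d|'m)\b", r"\1 \2", text)
    return text.split()
```

For example, `ptb_like_tokenize("She didn't go.")` yields `["She", "did", "n't", "go", "."]`, matching the PTB convention of treating the negative clitic as a separate token.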
Finally, the sample sentences were manually annotated with POS tags, using a slightly adapted version of the Penn Treebank (PTB) tagset (Santorini 1990).
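A manually annotated sentence can be represented as token-tag pairs, with each tag drawn from the PTB tagset. The example sentence, its representation as a list of pairs, and the tag subset below are illustrative assumptions; the actual storage format of the goldstandard is not specified in the text:

```python
# Hypothetical example of one annotated sample sentence, with tags
# from the Penn Treebank tagset (Santorini 1990).
tagged = [
    ("It", "PRP"), ("was", "VBD"), ("a", "DT"),
    ("dark", "JJ"), ("night", "NN"), (".", "."),
]

# Small illustrative subset of the PTB tagset, for a sanity check only.
PTB_SUBSET = {"PRP", "VBD", "DT", "JJ", "NN", "."}

def check_tags(sentence, tagset):
    """Return the token-tag pairs whose tag is not in the given tagset."""
    return [(tok, tag) for tok, tag in sentence if tag not in tagset]
```

A check of this kind is useful during manual annotation to catch tags that fall outside the (adapted) tagset.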