Corpus selection

A crucial part of digital history lies in selecting corpuses. A corpus can be constructed specifically to avoid bias (see for example Blaxhill’s paper: Blaxhill builds a corpus to be defensible, i.e. unbiased and complete enough, so that he can stand behind the computer analysis done on it). In the case of a newspaper archive, our claim that we don’t need to do this rests on comprehensiveness: we have so much data that biases due to, say, newspapermen’s opinions will cancel out. Even if one paper doesn’t print a news item, another will. That assumption is problematic in itself, and on top of that we don’t have ‘all’ the data that such a claim would require. Using a total corpus may be appealing from a computer science point of view, but we risk having the signals we’re interested in drowned out by stronger signals from whatever more newspaper editors were interested in printing (say, football scores).

We (or rather a sister project, Translantis) are currently looking at how to automate the creation of sub-corpuses, so that you can limit the texts you search to those relevant for a given analysis. This is a crucial step towards being able to do critical source analysis.
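
To make the idea of a sub-corpus a bit more concrete, here is a minimal sketch in Python. It is not how Translantis does it; the `Article` class, the `build_subcorpus` function and the three sample articles are all invented for illustration. It simply assumes that each article has a paper name, a year and a text, and selects articles by keyword and period.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical minimal record; a real archive would carry far more metadata.
    paper: str
    year: int
    text: str


def build_subcorpus(articles, include_terms, exclude_terms=(), years=None):
    """Keep articles that mention at least one include term, mention no
    exclude term, and fall inside the optional (first_year, last_year) range."""
    include = [term.lower() for term in include_terms]
    exclude = [term.lower() for term in exclude_terms]
    selected = []
    for article in articles:
        text = article.text.lower()
        if years is not None and not (years[0] <= article.year <= years[1]):
            continue
        if not any(term in text for term in include):
            continue
        if any(term in text for term in exclude):
            continue
        selected.append(article)
    return selected


# Invented sample data, only to show the selection at work.
articles = [
    Article("Paper A", 1925, "Football scores: the home side won 3-1."),
    Article("Paper B", 1925, "Emigration to America rose again this year."),
    Article("Paper C", 1950, "America and football: a touring side visits."),
]

subcorpus = build_subcorpus(
    articles, include_terms=["america"], exclude_terms=["football"]
)
print([article.paper for article in subcorpus])  # prints ['Paper B']
```

Plain substring matching like this is crude, and every choice of include and exclude terms is itself a source of bias. That is exactly why the selection criteria need to be explicit and documented if the resulting sub-corpus is to be defensible.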