What is the black box here? Perhaps an unfortunate use of the term, because in each field it brings different baggage with it. But I was referring to the whole system (i.e. the tools used to search historical newspaper archives). A search query goes in, and you’re not sure what’s processed to turn that search query into the result that comes out, perhaps as a list, perhaps as a visualization. A historian can understand the search algorithm, sure, this is a known sub-component if you will, but the assumptions behind each step – even the quality of the archives that go in – are unquantifiable unknowns. This selection of sources has often been compared to the way people make choices in archives according to a logic that is never known to the researcher, but here there’s an extra layer: you don’t know which archives are being included in a given search.
There’s an extra step of translation between search query and computer processing of digital archives, which can’t be accessed by anyone else, including the researcher. No one knows if these methods will produce results that we can use, so, unlike with many things, we have to invest the time and effort into developing them before we know if they’re any good. Having to invest so much time and effort naturally inclines us to want to continue the work and prove that it is useful. Otherwise, it’s very difficult to motivate yourself to work further on a project that will benefit no one and never see the light of day.
And it’s true that the digital is giving us unprecedented access to methods, and so we’re asked to constantly explain our methods, unlike in papers that use ‘traditional’ methods. But before we complain that the standards for defending truth claims are higher for digital humanists, consider that people have to defend these methods because they’re new and they’re problematic.
I was worried here about the fact that people will generalize from their experience with digital things in general to their use in history. Even historians are thus liable to fall prey, I fear, to lazy assumptions that digital tools are as well founded as, say, Google’s searches. One of the things lost in digitizing archives is the physical feeling of being in an archive, of being somewhere out of the ordinary. Instead, you can do a search from your armchair at home, between episodes of The Wire. The physical cues belonging to an archive or a library are lost and cannot be recreated digitally, because the very advantage of digital is being able to access archives from anywhere.
Consider: no one going into an archive would ever claim to have read all of the boxes in that archive. Yet people will claim to have read (distantly, admittedly) all of the sources available – in the tool or in archives in general. A user will easily make the same assumptions about, say, a digital tool as they do about how Google or Amazon works. Yet the tool is not necessarily operating on the most appropriate sources or even all of them, and OCR errors mean it may not even be analyzing all of the sources that you think it’s analyzing. Furthermore, these tools (these imperfect, unfinished tools) have the same sleek look as other tools on the web, in part because people expect this, but that look also implies a level of finish, of certainty, that these tools don’t have and can’t offer.
People have a good idea of what the lower numbers mean, say 1-20, then maybe 100, 1000, and then perhaps a million, and then infinity (a lot). So saying that you have 18 million newspaper pages here is not an attempt to be clear about your source material. It’s an attempt, without explicitly claiming something you can’t support (namely any claim to comprehensiveness), to argue that because you have a big number of sources, you can therefore make a greater claim to ‘superior’ service and ‘superior’ conclusions. But a big number like 18 million doesn’t mean anything to the average person, apart from that it’s different from, say, 40,000. And it certainly doesn’t mean anything absolutely; who knows how much news is in 18 million pages? I have no idea how many newspaper pages there are, so all I can do is use that number relatively. 18 million is bigger than 10. But for all I know, there could be 18 million sheets of newspaper all of one ideological persuasion.
Far from adding ‘honesty’ or ‘specificity’, quantifying sources in this way just provides a false sense that the results of the search are somehow trustworthy. The results are, after all, not qualified on the surface, although perhaps the qualifications are there somewhere if you look for them. As people have complained before, the uncertainty is unclear. The complaint is not that such a search is uncertain, it’s that you can’t be certain of the uncertainty.
Yes, archives have always been incomplete, always depended on someone choosing sources according to a logic that may be unknown to the user. If you can only peruse a few books then it doesn’t matter how comprehensive your archive is (though we have seen historical examples where censorship of an archive results in particular articles being written), but if you’re able to analyze a lot, then the limits of a given collection take on a whole new meaning.
It seems to me that historians are not exempt from making the same kinds of assumptions about digital tools as anyone else. After all, in all other digital realms, they are amateurs, consumers and users, so in a sense their relationship to the digital is formed outside of digital humanities. In my talk, I divided my concern into two parts: (a) the public and (b) the professional historian. The public is always liable to have misconceptions about how historians work and how they claim to produce knowledge (indeed they have misconceptions about natural scientists too). Explaining the uncertainty of our knowledge creation processes tends to result in the lay public simply having another reason to dismiss what we say.
More difficult is the second category, that of professional historians. Because these people are liable to be no more knowledgeable about the digital than the public, it doesn’t necessarily help that they are critical historians and understand the historical process. As with many innovations, people tend to minimize or simplify the bits that they don’t understand (well) at the expense of making difficult contortions in the bits that they do know (well) – in this case writing history. How else can we explain the fact that otherwise good, critical historians are transmuted into naive, trusting individuals when it comes to the digital? So can we take the trust in a given historian that we formed from their non-digital work and apply it to work based on digitally produced evidence?
In so far as judging a person’s output depends on trust, digital history is necessarily at a disadvantage. You have to trust that the person you’re dealing with is a critical scholar who has been careful about drawing conclusions, has bothered to visit the archive, and so on, and that trust is built from a long track record. We in a sense know who to trust in the field of history, hence the importance of teachers and colleagues, but we don’t yet know this for digital humanists. And we have every reason to fear that colleagues are not being critical about digital sources, which inclines us yet again not to trust this sort of research.
I was asked which values of history digital history seems to violate. This has been addressed in other places, so I’ll be brief. I think we can usefully start again with traditional divisions that have been used to characterize the knowledge-seeking activities of the humanities versus the sciences: objectivity vs. subjectivity (and relativity doesn’t mean there are no standards and everything is equally good). Another big one is the need to categorize things, which historians are reluctant to do yet computers require in order to function.
We can add to that a value that absolutely motivates digital research but was never a problem before: efficiency. Obviously the cost of doing digital research in people and machines is huge. So more efficient solutions are better. But efficiency has never before been a value in history. The movement to quantify research and teaching output in the humanities is moving this towards center stage – you have to produce or you lose your job – but efficiency has never been valued in producing good or better history. Indeed, a bit of inefficiency was how one came across unexpected things. How much time an individual spent in an archive didn’t matter to the archive, but it does matter to your computer.