Mariona Coll Ardanuy
Historical text mining is subject to certain requirements that are not necessarily met in traditional text mining. When working with historical texts, the time dimension has to be taken into consideration. We propose an entity-centric approach that takes time into account. We focus on the two most relevant types of entities: people and locations. Entities are very often the structural reference points in news; in most cases events revolve around people and take place in specific locations. Entity-centric text mining can help reduce the search space to fit the interests of historians and can readily offer a visualization of relations that would otherwise be hidden by the vast amounts of text. This more sophisticated search approach is meant to be eventually integrated into our tool.
The notion of culture cannot be understood independently of the temporal and spatial framework in which it originated and developed. The ASYMENC project aims at tracing the changing influence between cultures in contact. It can be said that cultures sit in places, and it is therefore critical that our tool can capture and correctly identify the locations referred to in the sources. We distinguish between two kinds of locations: the base location (where the piece of news was written or published), and the target or referring location (any location that is mentioned in the text). We identify the locations mentioned in a text by means of named entity recognition and we disambiguate them against a knowledge base in order to obtain the pair of coordinates that is unique to each of them. Disambiguation of historical locations is not a trivial task: some places no longer exist, others are not as relevant as they were in the past, others are known today by a different name, etc. Assigning a pair of coordinates to each mentioned location allows us to place it on a map. We are currently working on a project using Dutch data from the end of the 19th century and the beginning of the 20th century. Our goal is to obtain a sort of cognitive map of the Europe that was known to the Dutch citizens of that time.
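The disambiguation step described above can be illustrated with a minimal sketch. It assumes a toy in-memory gazetteer and a table of historical name variants; the names `GAZETTEER`, `HISTORICAL_NAMES` and `resolve_location` are hypothetical, the coordinates are approximate, and a real knowledge base would be far larger and would also need to handle ambiguous names.

```python
from typing import Optional, Tuple

# Hypothetical toy gazetteer: canonical name -> approximate (latitude, longitude).
GAZETTEER = {
    "Amsterdam": (52.37, 4.90),
    "Oslo": (59.91, 10.75),
    "Istanbul": (41.01, 28.98),
}

# Hypothetical table of historical or variant names -> canonical name.
HISTORICAL_NAMES = {
    "christiania": "Oslo",
    "constantinopel": "Istanbul",
}


def resolve_location(mention: str) -> Optional[Tuple[str, Tuple[float, float]]]:
    """Map a location mention to (canonical name, coordinates), or None."""
    key = mention.strip().lower()
    # First try the historical variants, then fall back to the canonical names.
    canonical = HISTORICAL_NAMES.get(key)
    if canonical is None:
        canonical = next((name for name in GAZETTEER if name.lower() == key), None)
    if canonical is None:
        return None  # Unknown place: left unresolved rather than guessed.
    return canonical, GAZETTEER[canonical]
```

For example, `resolve_location("Christiania")` resolves the older name to the entry for Oslo, while a mention that is absent from both tables returns `None` and can be set aside for manual inspection.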
The Europe that we want to show on a map would come not only with frequency counts of mentioned locations, but would also be a container of information relating to each location, such as the most relevant or common concepts related to it. A timeline would enable us to see how the image of Europe has changed over time according to our texts.
Our people-centric approach mines newspapers based on the presence of people and their acquaintances in a collection of documents. This strategy focuses on the social dimension of the texts. In particular, we investigate a methodology that allows us to build social networks from unstructured news stories. Historical research has benefited from social network representations before: one definite advantage of representing historical text as a social network is that it produces a very clear visualization of a certain period from the social point of view and can unveil hidden relationships between different social actors. Until now, networks were extracted by hand or from structured text. Our method enables us to automatically extract them from unstructured text, weight them according to their significance in news and distribute them according to co-occurrences of their nodes. The accompanying figure shows a social network extracted from the articles from the Dutch newspaper De Telegraaf from the year 1950 in which the word ‘Europa’ occurs.
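The idea of building a network from co-occurrences can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be the list of person names that named entity recognition found in each article, edges are weighted by the plain number of articles in which two people co-occur (a simplification of the significance weighting mentioned above), and the function name `build_cooccurrence_network` and the placeholder names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_network(article_people):
    """Build a weighted co-occurrence network.

    article_people: dict mapping an article id to the list of person
    names found in that article (assumed to come from an NER step).
    Returns (edge_weights, node_articles).
    """
    edge_weights = Counter()            # (person_a, person_b) -> number of shared articles
    node_articles = defaultdict(list)   # person -> ids of articles mentioning them
    for article_id, people in article_people.items():
        # Count each person once per article and sort so that every
        # pair is always stored in the same order.
        unique_people = sorted(set(people))
        for person in unique_people:
            node_articles[person].append(article_id)
        for pair in combinations(unique_people, 2):
            edge_weights[pair] += 1
    return edge_weights, dict(node_articles)


# Toy input with placeholder names, not real data.
toy_articles = {
    "article-1": ["Person A", "Person B"],
    "article-2": ["Person B", "Person A", "Person C"],
}
toy_edges, toy_nodes = build_cooccurrence_network(toy_articles)
```

Keeping the list of article ids on each node is what later lets a historian move from the network back to the source texts.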
A social network can be much more than just a visualization. Again, it can be a container that gives easy access to otherwise hidden information. Among the many possibilities that such an approach offers, we can provide for each person (node) or interaction between two persons (edge) the most relevant concepts discussed in the texts in which they occur. Each node has as an attribute the list of news articles in which the person is mentioned, thus allowing the historian to close-read the articles. Dynamic networks are successions of monthly or yearly networks that show the evolution of the social network and its nodes. The paper “Laboratories of Community: how Digital Humanities can further new European Integration History” (Coll Ardanuy, van den Bos, and Sporleder 2014, in HistoInformatics 2014) uses this method in order to further European Integration research. With our approach, we could confirm several expected outcomes and propose some interesting hypotheses that might shed new light on the topic. We are currently enhancing the method by introducing name disambiguation, in order to differentiate between entities that share the same name.
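The dynamic networks mentioned above, successions of yearly networks, could be approximated along these lines. The sketch assumes that each article carries an ISO-formatted publication date and a list of recognized person names, and it again uses raw co-occurrence counts as edge weights; `yearly_networks` and the placeholder data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def yearly_networks(articles):
    """Group co-occurrence edges by publication year.

    articles: iterable of (iso_date, people) tuples, where iso_date is a
    string such as "1950-06-10" and people is a list of person names.
    Returns a dict mapping each year to a Counter of weighted edges.
    """
    networks = defaultdict(Counter)
    for iso_date, people in articles:
        year = iso_date[:4]  # The first four characters of an ISO date are the year.
        for pair in combinations(sorted(set(people)), 2):
            networks[year][pair] += 1
    return dict(networks)


# Toy input with placeholder names, not real data.
toy_dated_articles = [
    ("1950-01-03", ["Person A", "Person B"]),
    ("1950-06-10", ["Person B", "Person A"]),
    ("1951-02-01", ["Person A", "Person C"]),
]
toy_networks = yearly_networks(toy_dated_articles)
```

Comparing the edge weights of consecutive years then shows which relationships appear, strengthen or fade. Slicing on `iso_date[:7]` instead would give monthly networks.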