Tuesday, 21 January 2014

There's Pliny of Room at the Bottom1 - Introducing Recogito Pt. 2

In our last post, we introduced Recogito, a tool we built to verify and correct the results of our automatic text-to-map conversion process. Last time, we focused primarily on Recogito's map-based interface, in which we clean up the results of geo-resolution - the step that automatically assigns gazetteer IDs to toponyms.

In this post, we want to talk about Recogito's second view: the text annotation interface. And as usual, we'd like to seize the opportunity to introduce our next Early Geospatial Document along with it: the Natural History by Pliny the Elder.

Naturalis Historia

The Natural History (Naturalis Historia) by Pliny the Elder is an encyclopedia published ca. AD 77–79. This amazing work covers the Roman civilization's knowledge about astronomy, geography, zoology, botany, medicine and mineralogy. In total, it consists of 37 books, and builds on more than 400 sources from the Latin and Greek worlds. Books 3, 4, 5 and 6 focus on geography. In these books, Pliny describes the known world from the Atlantic to the Near East, and from the North of Europe to Africa. He records all the peoples and cities known, with all the geographic features prominent in each territory, such as rivers, mountains, gulfs, or islands.

Fig. 1. Pliny Books 3 and 4 - work in progress in Recogito.

Recogito Text Annotation UI

The Natural History is the largest text we have addressed so far. Fig. 1 shows our current progress with it. (In numbers, we have worked through 98% of the toponyms in Book 3, and have just started Book 4 - now at 5.5%.) It also differs from our previous itinerary texts in that it is prose, and not structured in an almost 'tabular' format. Time to enter our 'reading view' in Recogito: the text annotation interface.

Fig. 2. Recogito text annotation interface.

The text annotation interface (see Fig. 2) is the place where we inspect and correct the results of geo-parsing - the automatic processing step that identifies toponyms in our source texts. Initially, when we start off with a new document, this view shows us our source text, marked up with grey 'highlights' wherever the geoparser thinks it has identified a toponym. We can then remove false matches, annotate toponyms the geoparser has missed, or modify things the geoparser got wrong (e.g. merge multiple identifications into one, turning separate consecutive identifications such as 'Mount' and 'Atlas' into a single toponym 'Mount Atlas').
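To make the merge case concrete, here is a minimal sketch of how two adjacent highlights could be combined into one. It is an illustration only, not Recogito's actual code: the `Span` type, the `merge_spans` function and the sample sentence are all hypothetical, and we simply assume that a highlight is stored as a pair of character offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A toponym highlight as character offsets into the source text (assumed model)."""
    start: int  # inclusive
    end: int    # exclusive


def merge_spans(text: str, spans: list[Span]) -> tuple[Span, str]:
    """Merge several highlights into one span that covers all of them."""
    if not spans:
        raise ValueError("need at least one span to merge")
    merged = Span(min(s.start for s in spans), max(s.end for s in spans))
    return merged, text[merged.start:merged.end]


# Hypothetical sample sentence with two separate detections.
text = "beyond Mount Atlas lies the desert"
mount = Span(7, 12)   # 'Mount'
atlas = Span(13, 18)  # 'Atlas'

merged, toponym = merge_spans(text, [mount, atlas])
print(toponym)  # Mount Atlas
```

The merged span also swallows the whitespace between the two original highlights, which is what we want for a multi-word toponym.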

Going through the source texts is a time-consuming task, and we have made every attempt to make the process as quick and painless as possible. The video above shows how the interface works in practice. Select text in the user interface as you would normally (using click and drag with your mouse, or double click), and confirm the action in the dialog window that pops up. Depending on what you select, the tool will automatically perform the appropriate action: either create a new annotation, delete one, or modify the annotation(s) in the selection. To speed up work even further, there is also an 'advanced' mode that skips the confirmation step.
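One way such selection-driven behaviour could work is sketched below. The exact rules Recogito applies are not spelled out in this post, so the three rules in the code are assumptions made for illustration, and `choose_action` is a hypothetical name.

```python
def choose_action(selection: tuple[int, int],
                  annotations: list[tuple[int, int]]) -> str:
    """Pick an action for a text selection, given existing annotation offsets.

    Assumed rules (not confirmed by the tool itself):
    - the selection touches no annotation        -> create a new one
    - the selection is exactly one annotation    -> delete it
    - anything else (partial or multiple overlap) -> modify
    """
    sel_start, sel_end = selection
    overlapping = [(s, e) for (s, e) in annotations
                   if s < sel_end and sel_start < e]
    if not overlapping:
        return "create"
    if overlapping == [selection]:
        return "delete"
    return "modify"


# Two existing highlights, given as (start, end) character offsets.
existing = [(7, 12), (13, 18)]

print(choose_action((20, 25), existing))  # create
print(choose_action((7, 12), existing))   # delete
print(choose_action((7, 18), existing))   # modify
```

In this sketch, an 'advanced' mode would simply apply the returned action without first showing a confirmation dialog.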

There is one more thing you can see in Fig. 2: annotations are coloured to indicate their 'sign-off status'. We have already talked about this briefly in our previous post. It's a consequence of our practice of manually checking every annotation before releasing it into the wild. Green annotations are those we have verified, and where we have confirmed a valid gazetteer ID. Yellow are the ones we've verified as valid toponyms - but for which, for whatever reason, we have not yet been able to identify a suitable gazetteer ID. Grey are the ones we've either not looked at yet, or that are still 'work in progress' because we haven't verified their gazetteer mapping.
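The colour scheme boils down to a small lookup from sign-off status to highlight colour. The sketch below assumes three status values. The names `SignOff`, `COLOURS` and `highlight_colour` are hypothetical and chosen only for this example.

```python
from collections import Counter
from enum import Enum


class SignOff(Enum):
    """Assumed sign-off states, following the description in the text."""
    VERIFIED = "verified"            # toponym checked, gazetteer ID confirmed
    NO_GAZETTEER_MATCH = "no_match"  # valid toponym, no suitable gazetteer ID yet
    NOT_VERIFIED = "not_verified"    # not looked at, or still work in progress


COLOURS = {
    SignOff.VERIFIED: "green",
    SignOff.NO_GAZETTEER_MATCH: "yellow",
    SignOff.NOT_VERIFIED: "grey",
}


def highlight_colour(status: SignOff) -> str:
    """Return the highlight colour for an annotation's sign-off status."""
    return COLOURS[status]


# Hypothetical statuses for four annotations in a document.
statuses = [SignOff.VERIFIED, SignOff.VERIFIED,
            SignOff.NO_GAZETTEER_MATCH, SignOff.NOT_VERIFIED]
colour_counts = Counter(highlight_colour(s) for s in statuses)
print(colour_counts["green"], colour_counts["yellow"], colour_counts["grey"])  # 2 1 1
```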

Combined with the map-based interface you can think of this as creating the two parts of an annotation. The text annotation interface presents us with a reference to a place in a document (the 'target' of the annotation in Open Annotation terminology), while the map interface identifies a place in a gazetteer (the 'body' of the annotation). Although there are two steps to the process, they are fairly quick and easy. Maybe even fun!
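For readers who prefer to see the two parts side by side, here is a rough sketch of what one such annotation might look like as a data structure. The property names (`oa:hasTarget`, `oa:hasBody`, `oa:TextPositionSelector` and so on) come from the Open Annotation vocabulary, but the way they are combined here, the `make_annotation` helper and the example.org URIs are assumptions for illustration, not the serialisation Pelagios actually publishes.

```python
import json


def make_annotation(document_uri: str, start: int, end: int,
                    gazetteer_uri: str) -> dict:
    """Build a JSON-LD-like dict with a text target and a gazetteer body."""
    return {
        "@type": "oa:Annotation",
        # Target: where in the document the place is referenced.
        "oa:hasTarget": {
            "@type": "oa:SpecificResource",
            "oa:hasSource": document_uri,
            "oa:hasSelector": {
                "@type": "oa:TextPositionSelector",
                "oa:start": start,
                "oa:end": end,
            },
        },
        # Body: which gazetteer place the reference is identified with.
        "oa:hasBody": gazetteer_uri,
    }


# Placeholder URIs; these do not point to real resources.
annotation = make_annotation(
    document_uri="http://example.org/texts/sample-document",
    start=7,
    end=18,
    gazetteer_uri="http://example.org/gazetteer/places/123",
)
print(json.dumps(annotation, indent=2))
```

In this picture, the text annotation interface fills in the target, and the map interface fills in the body.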

1 "There's Plenty of Room at the Bottom" was a lecture given by physicist Richard Feynman in 1959. The talk is considered to be a seminal event in the history of nanotechnology, as it inspired the conceptual beginnings of the field decades later.

Monday, 13 January 2014

From Bordeaux to Jerusalem and Back Again: Introducing Recogito (Pt. 1)

Welcome back to another update from our Infrastructure Workpackage 2 - "Annotation Toolkit", affectionately known as IWP2. In our previous IWP2 post, we talked a little bit about the basics of annotating place references in early geospatial documents. We also presented a first sample dataset based on the Vicarello Beakers. What we did not talk about yet, however, is how we actually annotate our documents in the first place.

The general plan behind the Pelagios annotation workflow is this:

  1. We use Named Entity Recognition (NER) to identify a first batch of place names automatically in our source texts. This step is also called "geo-parsing", and tells us which toponyms there are in our text, and where in the text they occur. We implemented NER using the open source Stanford NLP Toolkit, and presently restrict this step to English translations of our documents. In a later project phase, we intend to cross-match the data gathered from the English translations to the original language versions, which is likely more feasible within the lifetime of the project than attempting Latin-language NER.
  2. NER gives us the toponyms. What it does not tell us, however, is which places they represent, or where these places are located. Next, we therefore look up the toponyms in our gazetteer, and determine the most plausible match. This step is called "geo-resolution", and - like NER - is also fully automated.
  3. Naturally, neither geo-parsing nor geo-resolution works perfectly. Therefore, we need to manually verify the results of our automatic processes, correct erroneous NER or geo-resolution matches, and fill gaps where NER or geo-resolution have failed to produce a result at all. And this is where our new tool Recogito comes in.
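The three steps above can be sketched in a few lines of code. Please note that this is a toy stand-in, not our actual pipeline: instead of real NER it uses a naive 'capitalised word' heuristic, the gazetteer is a two-entry dictionary with made-up IDs, and all function names are hypothetical.

```python
import re
from typing import Optional

# Toy gazetteer with made-up identifiers (assumption for this example).
TOY_GAZETTEER = {
    "Bordeaux": "example:place/1",
    "Jerusalem": "example:place/2",
}


def geo_parse(text: str) -> list[tuple[str, int]]:
    """Step 1 (stand-in for NER): treat every capitalised word as a toponym."""
    return [(m.group(), m.start())
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]


def geo_resolve(toponym: str) -> Optional[str]:
    """Step 2: look the toponym up in the gazetteer; None if nothing matches."""
    return TOY_GAZETTEER.get(toponym)


def annotate(text: str) -> list[dict]:
    """Run both automatic steps; every result still awaits manual sign-off."""
    return [
        {"toponym": name, "offset": offset,
         "gazetteer_id": geo_resolve(name), "verified": False}
        for name, offset in geo_parse(text)
    ]


results = annotate("From Bordeaux the road leads to Jerusalem")
for r in results:
    print(r["toponym"], r["offset"], r["gazetteer_id"])
```

The naive heuristic also picks up 'From' as a toponym. That false detection, together with its missing gazetteer match, is exactly the kind of result that step 3 exists to catch.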

Fig. 1: data from the Bordeaux Itinerary in Recogito (interactive version in Latin and English).

The Itinerarium Burdigalense

The first document we've tackled entirely in Recogito is the Itinerarium Burdigalense (or Bordeaux Itinerary), a travel document that records a pilgrim route between the cities of Bordeaux and Jerusalem. It is considered the oldest Christian pilgrimage document, dated to 333 AD - just 20 years after the Edict of Milan of 313, when the Emperor Constantine granted religious liberty to Christians (and other religions). Formally, this document is in some respects very similar to the Itinerarium Provinciarum Antonini Augusti: both are compiled as a list of places with the distances between them. Additionally, the Itinerarium Burdigalense marks all the places as mutatio, mansio or civitas (change, halt or city), in a similar way to the Peutinger Table. The format of the document changes when the journey reaches Judea, where it offers detailed descriptions of places important to Christian pilgrims. So we can consider it an itinerarium in the tradition of Greek and Roman writing, except for its Christian emphasis. (We've compiled a detailed bibliography for the Itinerarium Burdigalense here. The text of an English translation can be found, for example, on this Website.)

Annotating the Bordeaux Itinerary with Recogito

Recogito presents the results of our automatic processing steps in two flavours: in a text-based user interface, which is primarily designed to inspect and correct what the geoparser has done; and in a map-based interface, which is used to work with the results from the geo-resolution step. A screenshot of the latter is shown in Fig. 2, and we will explore it in more detail below. The former interface (which benefits from a little prior knowledge of the map-based interface) we will discuss in a separate blog post.

Geo-Resolution Verification & Correction

The map-based interface separates the screen into a table listing the toponyms, and a map that shows how they are mapped to places. The primary work area for us in this interface is the table: here, we can scroll through all the toponyms and quickly check the gazetteer IDs they were mapped to. As a matter of policy, we want to explicitly keep track of which toponyms have been looked at by someone, and which haven't. To that end, each entry in the table can be 'signed off' as either a verified gazetteer match, an unknown place, or a false NER detection. (In addition, there is also a generic 'ignore' flag, for toponyms that may be correctly identified in a technical sense, but which we don't want to appear on the map for whatever reason.)

Fig. 2: Recogito map-based geo-resolution correction interface.

Double-clicking an entry in the table opens a window with details for the toponym (Fig. 3): the window shows the previous automatic gazetteer match (if any), the latest manual correction, and a text snippet showing the toponym in context. A list of suggestions for other potential gazetteer matches, along with a small search widget, allows us to quickly re-assign the gazetteer match in case it is incorrect. The change history for each toponym is recorded so we know who has changed what (and when), and whether some places see substantially more edits than others in the long run. Furthermore, manual changes are recorded separately from the initial automatic results. This way we will be able to benchmark the performance of NER and automatic geo-resolution later on. Detailed figures for the Bordeaux Itinerary are not yet out - but our initial figures suggest that NER has caught about 2/3 of all toponyms, and that approx. 80% of NER results were correct detections. The automatic geo-resolution correctly resolved between 30% and 40% of the toponyms.
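In the usual terminology, 'caught about 2/3 of all toponyms' is the recall of the NER step, and '80% correct detections' is its precision. Because automatic results and manual corrections are kept apart, both can be computed by comparing two sets. The sketch below shows the arithmetic with invented counts that happen to reproduce those two ratios. The numbers and the `precision_recall` helper are illustrative assumptions, not our real data or code.

```python
def precision_recall(automatic: set, verified: set) -> tuple[float, float]:
    """Compare automatic detections against the manually verified toponyms."""
    correct = automatic & verified
    precision = len(correct) / len(automatic) if automatic else 0.0
    recall = len(correct) / len(verified) if verified else 0.0
    return precision, recall


# Invented example: 12 toponyms confirmed by hand, identified here by number.
verified = set(range(12))
# 10 automatic detections: 8 of them correct, 2 false (IDs 100 and 101).
automatic = set(range(8)) | {100, 101}

precision, recall = precision_recall(automatic, verified)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.80 recall=0.67
```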

Fig. 3: toponym details.

While Recogito is still under heavy construction, Pau is already deeply buried in the next document - which we will present in one of our next blogposts, together with an overview of the text-based interface.

Tuesday, 7 January 2014

The day of Pelagios: Berlin 11.12.13

Before the seasonal break of mince pies and Glühwein, the Pelagios team held a meeting in Berlin to address a range of issues relating to geospatial data aggregation and analysis. The fact that we were holding this in Berlin reflected the fortunate co-presence there of a number of different digital humanities initiatives. Our hosts at the German Archaeological Institute (DAI) were the ICT Director, Reinhard Förtsch, along with his researchers Philipp Gerth and Wolfgang Schmidle. Others joining us were:
The meeting presented us with the opportunity to talk first about Pelagios and its evolution. The Pelagios model of phases 1 and 2 uses annotations to facilitate linking (in our case through common references to places) rather than trying to unify different models. By enabling linking, each partner’s site also serves as a gateway to another, thereby maximizing the potential discoverability of these resources and avoiding fruitless attempts at creating individual portals that are supposed to do everything. Yet, even if we are decentralized, for linking to be facilitated we need a lightweight structure.

In Pelagios phase 3 work is concentrating on three areas. Since we are extending our model into new regions and time periods, gazetteers - essentially databases of place names - are crucial. Again our approach is to enable the linking between resources rather than trying to build a super gazetteer that contains all place names over time. With the aim of aligning gazetteers, we are currently investigating interoperability: What might a gazetteer 'ecosystem' look like? Options include using popular gazetteers as a backbone, though each comes with drawbacks (the Getty Thesaurus of Geographic Names is heavily curated, minimizing community involvement, while Geonames includes extraneous information like every hotel in Berlin), and using the SKOS vocabulary's 'close match' property to enable links between gazetteers. For the meeting we brought along a first preview of our 'cross gazetteer search', which runs on top of the linkages between the datasets from Pleiades and DARE. A screenshot of the user interface to the system is shown below.
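To give a feel for how a search across linked gazetteers can work in principle, here is a small sketch. It is not the code behind our preview: the two gazetteers, their IDs and names, and the `build_index` and `search` functions are all invented for the example. The only thing taken from SKOS is the idea that a 'close match' link is symmetric but not transitive, which is why the search follows links in both directions but only one hop.

```python
from collections import defaultdict

# Invented records from two hypothetical gazetteers, 'gazA' and 'gazB'.
NAMES = {
    "gazA:1": ["Roma", "Rome"],
    "gazB:7": ["Roma"],
    "gazA:2": ["Athenae"],
}

# Invented 'close match' links between records of different gazetteers.
CLOSE_MATCHES = [("gazA:1", "gazB:7")]


def build_index(links: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Index the links in both directions, since a close match is symmetric."""
    linked = defaultdict(set)
    for a, b in links:
        linked[a].add(b)
        linked[b].add(a)
    return linked


def search(query: str, names: dict[str, list[str]],
           linked: dict[str, set[str]]) -> list[str]:
    """Find records by name, then add records one close-match hop away."""
    wanted = query.lower()
    hits = {pid for pid, labels in names.items()
            if wanted in (label.lower() for label in labels)}
    expanded = set(hits)
    for pid in hits:
        expanded |= linked.get(pid, set())
    return sorted(expanded)


index = build_index(CLOSE_MATCHES)
print(search("Rome", NAMES, index))     # ['gazA:1', 'gazB:7']
print(search("Athenae", NAMES, index))  # ['gazA:2']
```

Searching for 'Rome' finds the 'gazB' record even though that gazetteer only knows the name 'Roma', which is the practical benefit of linking gazetteers rather than merging them.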

Figure 1. Cross-Gazetteer Search Preview UI

Our second task is to enable annotations to be made on primary data (both textual and visual), so that place names can be identified. Initial attempts at building a toolkit for annotating texts will be discussed in forthcoming posts on this blog. As for the challenge of annotating maps, two questions are particularly relevant: where can we get computers to do the heavy lifting? And where do humans have to come into the loop? Finally, we are also investigating ways of visualizing the resources in our network. Our heat map provides an early indication not only of the spatial spread but also the intensity of the resources.

These three areas—relating to gazetteer interoperability, annotation methods and visualization—were the subjects of discussion.

The DAI started work in May to build a gazetteer of the Institute’s archaeological and bibliographical records. They have also been working with Wikidata and Wikimedia to explore how knowledge about the Roman frontier (the ‘Limes’) can be aggregated and used. One such example is an interactive timeline (seen below), showing how the border changed over time. Markus Schnöpf is currently working on a gazetteer for the Islamic world, which could help provide the basis for future Pelagios activity with Islamic texts. Meanwhile, at Stanford, Josh Ober’s team are developing a digital version of Mogens Hansen’s Polis inventory, which will not only provide a comprehensive dataset of settlements in ancient Greece, but also allow them to be searched in various ways using a simple browser plug-in map. (Watch this space for developments.) These projects join a list that includes Pleiades, the Digital Atlas of the Roman Empire, Chinese Historical GIS, and Past Place, as the key protagonists taking the first steps towards creating a gazetteer ecosystem.

Figure 2. An interactive timeline of the Roman ‘Limes’ (frontier)

Annotation methods
With Greg Crane’s Humboldt Professorship at the University of Leipzig, various new initiatives are being launched with the aim of utilizing digital resources for the study of the ancient world. One of these, the Historical Languages eLearning Project, is experimenting with e-learning strategies for teaching ancient Greek and Latin based around annotation. Pelagios could work with this team to help disambiguate names that prove too challenging for our automated workbench, or to experiment with using games to scale up annotation over a larger number of documents. The ARIADNE project, here represented by Martin Doerr and Gerald Hiebel, is laying the foundations for inferencing over data rather than just data retrieval (which is what Pelagios focuses on). In particular, the CIDOC-CRM model adopted by ARIADNE uses a formal structure for describing concepts and relationships that, while more complex semantically, is compatible with the Pelagios annotation model; moreover, the results of Pelagios can be used as the basis for CRM-compliant data.

Throughout the discussion, we were also concerned with visualization developments that can help in the understanding and analysis of potentially massive datasets. Dirk Wintergrün presented on GeoTemCo, a platform for visualising spatio-temporal data. This potentially looks very powerful, and will be especially interesting once temporal content (derived from e.g. publication dates, person references and other sources) is combined with place annotations. We give one example below, since it provides a new way of looking at data that members of the Pelagios team have produced in a previous project, GAP. Figure 3 shows GAP data from Herodotus and Pausanias in GeoTemCo, enabling the analysis and comparison of geographical referencing in these different books. In particular, Marian Dörk demonstrated a wide range of exciting visualization possibilities that could answer specific research questions and appeal more broadly to the general public.

Figure 3. A comparison of places in Herodotus and Pausanias, using GAP data in GeoTemCo