For the annotation of names in Thucydides' text we use four categories, i.e. four elements following the Standard of the Text Encoding Initiative (TEI):
<persName> for historical Persons,
<orgName> for organisations and political groups, like the citizens of Athens ("The Athenians") or alliances,
<geogName> for natural topographic features not built or modified by human beings, such as rivers, mountains etc.,
<placeName> for man-made topographic features and places such as cities, but also walls or harbours.

See the legend in both the reading and the book summary view for the highlighting of the four categories in our annotation.

There are a few exceptions where not only names are annotated. In 1.93 whole sentences are tagged as <placeName>. In this way, the fairly detailed descriptions that the ancient author gives of the ancient wall of Athens and of the wall of Piraeus are highlighted and linked to the corresponding Arachne entity of the artifact. In other cases, Thucydides refers to the city of Athens simply as "the city" (polis). In these cases, the word "polis" is annotated as "Athens" (e. g. in 1.93). There are similar cases for references to the helots, the alliances and others.
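As a rough illustration of how the four categories could be counted in an annotated passage, the following Python sketch parses a small TEI fragment. Only the four element names come from the description above; the sample sentence, the key values and the function name are invented for illustration and are not taken from the Hellespont files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"
NAME_TAGS = ("persName", "orgName", "geogName", "placeName")

# Hypothetical fragment: the @key values are placeholders, not real Arachne IDs.
SAMPLE = f"""
<p xmlns="{TEI_NS}">
  <persName key="example-1">Themistocles</persName> persuaded
  <orgName key="example-2">the Athenians</orgName> to fortify
  <placeName key="example-3">the city</placeName> and
  <placeName key="example-4">Piraeus</placeName>.
</p>
"""

def count_name_tags(xml_text):
    """Count how often each of the four name elements occurs in a TEI fragment."""
    root = ET.fromstring(xml_text.strip())
    counts = Counter()
    for tag in NAME_TAGS:
        # Elements are looked up with their namespace-qualified name.
        counts[tag] = len(list(root.iter(f"{{{TEI_NS}}}{tag}")))
    return counts

if __name__ == "__main__":
    for tag, n in count_name_tags(SAMPLE).items():
        print(tag, n)
```

Note that "the city" carries a placeName tag here even though it is not a proper name, mirroring the "polis" case described above.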

Manual event extraction is based on historical secondary literature. As one example of secondary literature on Thucydides we chose Hornblower, S.: A Commentary on Thucydides. Oxford: Oxford University Press, p. 133-193. Here we take the titles he uses to structure the main content of the Pentecontaetia and derive information about the actors, places and chronology involved from his analysis of the text.

In contrast, we add events from a summary list of Deane, Ph.: Thucydides' Dates 465 - 431 B.C. Don Mills: Longman Canada Ltd, p. 137-139. This list can then be compared to the events extracted from Hornblower, especially regarding the time line of reconstructed history between ca. 480 and 430 BC. There are no links from Deane's events to the ancient text, since references are taken from all over the work of Thucydides and not only from the chapters of the Pentecontaetia.

Introduction

During the last stage of Hellespont, we performed a series of experiments aimed at ascertaining how the available resources for Natural Language Processing (NLP) of Ancient Greek could be leveraged and expanded in order to support historical research on Thucydides, 1.89-118.

Thanks to NLP technologies, linguists can extract the events that are mentioned or narrated, as well as the actors and circumstances that take part in them, directly from the texts [1]. In comparison with other high-resourced languages, such as English or German, Ancient Greek lacks lexical and semantic resources that are commonly used in order to classify and describe the event chains (see e.g. [2]). Therefore, we decided to exploit the guidelines and tools for manual annotation of Ancient Greek literary texts that are available, and to enrich them with further layers of annotation to encode some of the most important semantic information.

Thanks to the 4-layer scenario that we used (where the different levels of linguistic information are connected and analyzed in their relation, as represented in the figure), we were able to capture some of the most important aspects of Thucydides' language, such as the valency frames of verbs, adjectives and nouns, along with all the relevant circumstantials of time and place that are evoked, and with the discursive links (like implication, cause, simultaneity etc.) that are established between them.

Linguistic annotation

Corpora with word-by-word annotation on part-of-speech, morphological features, and syntax (and potentially many other linguistic aspects) are called treebanks [3]. At the moment, two treebanks of Ancient Greek literary texts are available to the scientific community:
  • the PROIEL Corpus (University of Oslo) [4]
  • The Ancient Greek and Latin Dependency Treebank (AGLDT, The Perseus Project) [5]
The two collections are both inspired by the "morphological" and "analytical" (i.e. surface syntax) annotation layer of the Prague Dependency Treebank (PDT) [6]. In each of them, every word is enriched with information on:
  • lemma
  • part of speech
  • syntactic relations (annotated according to a version of Dependency grammar)

Morphological and syntactical annotation

The task of pos-tagging and syntactic parsing of Thucydides' text was conducted according to the guidelines of the Greek version of AGLDT (often referred to as Ancient Greek Dependency Treebank, AGDT). Also, for both the morphological and syntactic annotation, we have used the tools provided by the treebanking environment hosted at Perseus and developed by Alpheios [7].

Pos-tagging and morphological annotation are performed semi-automatically. The sentence is automatically tokenized, and the tokens are pre-processed using the morphological analyzer and lemmatizer Morpheus [8]. Since Morpheus, in the current implementation on the annotation service of Perseus, doesn't perform any statistically-based disambiguation, the task of the annotator is to manually choose between the different combinations of lemma and morphological analysis output by the software.

Syntactic annotation is carried out entirely manually, with the help of a graphical tool. The AGLDT adopts a dependency grammar, which is more appropriate than any constituency-based formalism to account for a strongly non-projective language with rich inflection and free word-order like Greek. According to this formalism, the syntactic structure of the sentence is represented by a series of non-cyclical graphs that link terminal nodes (the words) directly in head-dependent relations, without intervening non-terminal constituents.

The dependency relations are described by a group of 21 basic labels.
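The head-dependent representation can be sketched in Python as a plain list of head indices. This is only an illustration under stated assumptions: the encoding (one head index per token, -1 for the root) and the function names are ours, not the AGLDT file format. The second function detects crossing arcs, which are a common symptom of non-projective structures.

```python
def is_acyclic(heads):
    """heads[i] is the index of the head of token i, or -1 for the root.

    Returns True if following the heads from every token reaches the root
    without visiting the same token twice.
    """
    for start in range(len(heads)):
        seen = set()
        node = start
        while node != -1:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True

def crossing_arcs(heads):
    """Return the (head, dependent) arcs that cross at least one other arc."""
    arcs = [(h, d) for d, h in enumerate(heads) if h != -1]
    crossing = []
    for h1, d1 in arcs:
        lo1, hi1 = sorted((h1, d1))
        for h2, d2 in arcs:
            lo2, hi2 = sorted((h2, d2))
            # Two arcs cross when exactly one endpoint of one lies inside the other.
            if lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1:
                crossing.append((h1, d1))
                break
    return crossing

if __name__ == "__main__":
    # Toy sentence of four tokens; token 2 is the root.
    heads = [2, 3, -1, 2]
    print(is_acyclic(heads))
    print(crossing_arcs(heads))
```

In the toy example the arc from token 2 to token 0 crosses the arc from token 3 to token 1, the kind of configuration that free word order produces frequently.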


Tectogrammatical annotation

Information on discursive and semantic features (e.g. anaphora resolution or semantic roles in verb-argument structure), which is crucial for event recognition, is not included in the AGDT. Perseus' treebank must therefore be integrated with a supplementary level of linguistic analysis.

The PDT, which as we saw is the model followed by the AGDT, is inspired by a theoretical framework known as Functional Generative Description [9, 6]. According to this approach, the semantic aspects of a sentence are captured at a deeper level called "tectogrammatics", which is linked with the analytic layer of the "surface" syntax.

Adopting this model of representation also has one practical advantage: the tectogrammatical annotation of Thucydides' text can profit from and, in its turn, contribute to the further development of the existing software for automatic extraction of relevant semantic and pragmatic information from the analytical layer. Treex, the modular NLP software system implemented by the Institute of Formal and Applied Linguistics of Prague, provides a framework for developing and sharing solutions for the automatic generation of tectogrammatical trees.

In the Tectogrammatical Sentence Representation, the meaning is encoded by the interplay of 5 different elements, all of which are annotated and represented in the trees:
  1. the semantically relevant words and the dependency structure formed by them; a detailed set of semantic-syntactic labels (the so-called functors) is used to describe these relations; special attention is paid to the relations implied by the verb valency
  2. the semantic information that is expressed through the morphology (number, gender, aspect etc.) and is represented as properties of the words (with the so-called grammatemes)
  3. ellipsis resolution: linguistic items that are implied by reference to the context (e.g. the subject and other required valency arguments in Greek) are represented by artificially reconstructed nodes in a tectogrammatical tree
  4. the coreference of the linguistic "pointers" (pronouns and reconstructed nodes)
  5. topic-focus articulation and deep word-order (i.e. reordering of the nodes according to their communicative dynamism)
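The five elements listed above can be pictured as fields of a single node record. The following Python sketch is an assumption-laden illustration: the class, its field names and the example values are ours and do not reproduce the Treex/PML schema. It shows how a reconstructed node can point to its antecedent through a coreference link.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TNode:
    """Simplified tectogrammatical node (hypothetical field names)."""
    node_id: str
    lemma: str
    functor: str                                       # 1. semantic-syntactic label
    grammatemes: dict = field(default_factory=dict)    # 2. e.g. number, gender
    is_generated: bool = False                         # 3. reconstructed (elided) node
    coref_id: Optional[str] = None                     # 4. link to the antecedent
    tfa: Optional[str] = None                          # 5. topic-focus value, if annotated
    children: list = field(default_factory=list)

def resolve(node, index):
    """Follow coreference links until a node without an antecedent is reached."""
    seen = set()
    while node.coref_id is not None and node.node_id not in seen:
        seen.add(node.node_id)
        node = index[node.coref_id]
    return node

if __name__ == "__main__":
    athenians = TNode("t1", "Ἀθηναῖος", "ACT", {"number": "pl"})
    # Implied subject, reconstructed and linked to the explicit mention.
    implied = TNode("t2", "#PersPron", "ACT", is_generated=True, coref_id="t1")
    verb = TNode("t3", "ἔρχομαι", "PRED", children=[implied])
    index = {n.node_id: n for n in (athenians, implied, verb)}
    print(resolve(implied, index).lemma)
```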


Methodology and current status of the annotation of Thuc. 1.89-118:

Tectogrammatical annotation is a very complex task. The representation of sentence meaning and its interplay with different surface phenomena is language-specific: the abundant material that has been produced for the TGSR of Czech [10] and (partially) for English [11] is not immediately applicable to Ancient Greek without thorough investigation. The process of annotation and, especially, the writing of a set of guidelines for tectogrammatical annotation are therefore still in progress.

Users that are interested can download and see the current status of annotation. Refer below for instructions.

Grammatemes

The grammatemes annotation in Thucydides' text is in progress. Only the information that is immediately recoverable from the morphology and the disambiguation of semantic POS for adjectives and pronouns (i.e. the cases where adjectives and pronouns are used as nouns or attributes) is currently recorded.

Functors and Valency

Functors are fully annotated; the tag list and the guidelines for Czech were used also for Ancient Greek.

For verb valency, we have also created a valency lexicon (ThucVal), inspired by the Vallex dictionary for Czech [12]. For the ThucVal, we started from the Liddell-Scott-Jones Greek-English Lexicon (LSJ) available for download on the Perseus Project [13]. In a preliminary step, each meaning distinguished in the LSJ lexicon was converted into a valency frame.

While annotating Thucydides' text, we selected the appropriate frame (or we created a special valency frame if those distinguished in the LSJ weren't sufficient). Each frame is defined by a unique identifier (ID) which corresponds to the ID attribute of each definition in Perseus' LSJ (a new id was assigned for the newly created frames). Each frame has also a special field named "gloss" that contains the English translation taken from the specific verb meaning in the LSJ; for the newly created frames, we selected the correct English translation.

Valency arguments for each frame are recorded according to the same formalism used for the Czech Vallex. In particular, for each frame we listed:
  • the required arguments and free modifiers, i.e. those arguments and modifiers that are necessarily required to complete the verb meaning (e.g. a direction complementation with a movement verb)
  • the free modifiers that, although not strictly required to complete the meaning, are most often expected with a specific verb are recorded as "typical"; we believe that this information will be useful for automatic annotation of valency frames

Each node of the tectogrammatical trees that governs valency arguments has a special attribute "val_frame". The ID of the appropriate valency frame from the ThucVal lexicon is recorded there.
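A lexicon entry and the lookup through "val_frame" might be sketched as follows. The frame ID, lemma and gloss are the ones quoted in the event analysis section of this documentation; the class layout, the use of the PDT functor names ACT and DIR3 for "Actor" and "Direction to", and the helper function are assumptions made for illustration, not the actual ThucVal file format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ValencyFrame:
    """One ThucVal-style frame (hypothetical layout)."""
    frame_id: str          # corresponds to the ID of the LSJ definition
    lemma: str
    gloss: str             # English translation of this specific meaning
    required: tuple        # functors of the required arguments and modifiers
    typical: tuple = ()    # functors of the "typical" free modifiers

LEXICON = {
    "n42827.3": ValencyFrame(
        frame_id="n42827.3",
        lemma="ἔρχομαι",
        gloss="to go somewhere",
        required=("ACT", "DIR3"),
    ),
}

def missing_arguments(val_frame, functors_present, lexicon=LEXICON):
    """List required functors that are not realised among a node's dependents."""
    frame = lexicon[val_frame]
    return [f for f in frame.required if f not in functors_present]

if __name__ == "__main__":
    # A node whose dependents only include an Actor still lacks a direction.
    print(missing_arguments("n42827.3", {"ACT"}))
```

A check of this kind could flag nodes where an elided argument still has to be reconstructed.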

The valency annotation and the ThucVal lexicon are complete for verbs. Nouns and adjectives that have valency frames remain to be annotated.

Ellipsis and Coreference resolution

Annotation on ellipsis and coreference resolution is complete for Thuc. 1.89-118.

Topic-Focus Articulation and Deep Word Order

Topic-focus annotation is in progress. Reordering of tectogrammatical nodes according to the FGD-formalism was not undertaken.

Special information

Every node at the analytic layer has a special attribute called "cite". We stored there the CTS-URN of the word generated by the Perseus CTS API, in order to link every node represented in the analytical treebank to the exact location of the word in the digital edition of Thucydides stored in the Perseus library.

Every tectogrammatical node has also three special boolean attributes: "is_name", "is_name_of_person" and "is_name_of_place". The first one is set to 1 when the node is annotated as a Named Entity in the Hellespont NE annotation. The type of NE (location or person) is then further specified with the help of the other two boolean values.

Event Analysis Based on the Treebank

As Tesnière wrote [10: 102], adjuncts and arguments recorded in the valency frame are the actors in the "drama" ("petit drame") expressed by the governing word. In our first experiments with the annotated treebank, we decided to concentrate on this "drama" and to retrieve all the information that is available. Therefore, we selected and analyzed all the tectogrammatical nodes that have a valency frame, and whose subtree included a Named Entity (or a node in coreference with a NE) at any hierarchical level.

For each extracted node, we reconstructed the effective antecedents of the valency-required complementations, as well as some relevant circumstantial information. Thanks to coreference resolution, we were able to retrieve the actual lexical items that were referred to, even where e.g. a subject was left out because it was carried over from the preceding sentence, and even when dealing with very long coreference chains (in such cases as: "the Athenians... they... [(implied) they]..." etc.).
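The selection step described above can be approximated with a short Python sketch. The dictionary keys ("val_frame", "is_name", "coref", "children") echo the attribute names mentioned in this documentation, but the data layout, the function names and the toy nodes are assumptions; the actual query was run on the Treex files.

```python
def subtree(node):
    """Yield a node and all of its descendants."""
    yield node
    for child in node.get("children", []):
        yield from subtree(child)

def antecedent(node, index):
    """Follow a coreference chain of any length to its last reachable node."""
    seen = set()
    while node.get("coref") in index and node["id"] not in seen:
        seen.add(node["id"])
        node = index[node["coref"]]
    return node

def mentions_named_entity(node, index):
    """True if the subtree contains a NE or a node coreferent with a NE."""
    return any(antecedent(n, index).get("is_name") for n in subtree(node))

def event_candidates(roots, index):
    """Select every node that has a valency frame and involves a named entity."""
    return [
        n
        for root in roots
        for n in subtree(root)
        if n.get("val_frame") and mentions_named_entity(n, index)
    ]

if __name__ == "__main__":
    athenians = {"id": "a", "lemma": "Ἀθηναῖος", "is_name": True}
    they = {"id": "b", "lemma": "#PersPron", "coref": "a"}
    implied = {"id": "c", "lemma": "#PersPron", "coref": "b"}
    go = {"id": "v1", "lemma": "ἔρχομαι", "val_frame": "n42827.3", "children": [implied]}
    other = {"id": "v2", "lemma": "x", "val_frame": "hypothetical-1",
             "children": [{"id": "d", "lemma": "y"}]}
    index = {n["id"]: n for n in (athenians, they, implied, go, other)}
    print([n["id"] for n in event_candidates([go, other], index)])
```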

We also used the information in the ThucVal (see above) to: a) provide a label for the extracted "event" (by using the LSJ English gloss); b) reconstruct a minimal (pseudo-)Greek sentence by putting together the valency arguments of the governing nodes (where this was possible); we used this description for basic orientation. For example, take a participle like ἐλθών ("having gone"), where the direction is inferable from the context and the frame of the verb ἔρχομαι labelled in ThucVal as "n42827.3" is used (with the meaning "to go somewhere" and two required arguments: Actor and Direction to). The event is then defined by: a) the type "go"; b) a minimal description like: ἔρχομαι + the word or the lemma of the subject and direction extracted from the treebank. If the lemma (in the nominative) of the arguments is used, as is the case with reconstructed nodes or with coreference resolution, this will often yield incorrect pseudo-Greek, which should nevertheless be enough to get a quick idea of what is happening in the referenced part of the text.
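A minimal sketch of the two steps, under the assumption that the event type is simply the first word of the gloss after "to" and that arguments are joined in a fixed functor order. Both rules, the function names and the sample arguments are ours; the documentation does not spell out how the label was derived.

```python
def event_type(gloss):
    """Derive a short label from an LSJ-style gloss (assumed rule: drop 'to')."""
    words = gloss.split()
    if words and words[0] == "to":
        words = words[1:]
    return words[0] if words else ""

def minimal_description(verb_lemma, arguments, order=("ACT", "DIR3")):
    """Join the verb lemma with the lemmas of its valency arguments.

    arguments maps a functor to the (resolved) lemma in the nominative, so the
    result is pseudo-Greek meant for orientation only.
    """
    parts = [verb_lemma] + [arguments[f] for f in order if f in arguments]
    return " + ".join(parts)

if __name__ == "__main__":
    # Invented arguments for the frame n42827.3 of ἔρχομαι.
    args = {"ACT": "Ἀθηναῖος", "DIR3": "Σπάρτη"}
    print(event_type("to go somewhere"))
    print(minimal_description("ἔρχομαι", args))
```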

Some of the linguistic information that we were able to extract from the treebank could be intuitively aligned with the CIDOC-CRM properties of events and activities. In particular, time circumstantials (TWHEN.after, TWHEN.before and TPAR) could be tentatively associated with the properties: Occurs before, after or during (P117, P120); nodes annotated with the functor for "cause" were associated to the property "was motivated by" (P17); locations other than directions mentioned in relation to an action can be tentatively identified as places an activity "took place at" (P7); finally, NE's (other than places) that are involved are likely identifiable as Participants (P11).
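The tentative alignment can be written down as a lookup table. The sketch below follows our reading of the paragraph above: in the CIDOC CRM, P120 is "occurs before" (with "occurs after" as its inverse) and P117 is "occurs during". The functor names CAUS and LOC for "cause" and location, the "P120i" notation for the inverse property, and the data layout are assumptions made for illustration.

```python
# Tentative functor-to-property table (assumed pairing, see lead-in).
FUNCTOR_TO_CRM = {
    "TWHEN.before": ("P120", "occurs before"),
    "TWHEN.after": ("P120i", "occurs after"),
    "TPAR": ("P117", "occurs during"),
    "CAUS": ("P17", "was motivated by"),
    "LOC": ("P7", "took place at"),
}

def crm_properties(dependents):
    """Translate the dependents of an event node into candidate CRM statements.

    Each dependent is a dict with 'functor' and 'lemma' and, optionally, the
    boolean attributes 'is_name' and 'is_name_of_place'.
    """
    statements = []
    for dep in dependents:
        prop = FUNCTOR_TO_CRM.get(dep["functor"])
        if prop is not None:
            statements.append((prop[0], prop[1], dep["lemma"]))
        elif dep.get("is_name") and not dep.get("is_name_of_place"):
            # Named entities other than places are likely participants.
            statements.append(("P11", "had participant", dep["lemma"]))
    return statements

if __name__ == "__main__":
    deps = [
        {"functor": "LOC", "lemma": "Σπάρτη", "is_name": True, "is_name_of_place": True},
        {"functor": "ACT", "lemma": "Θεμιστοκλῆς", "is_name": True},
        {"functor": "MANN", "lemma": "x"},
    ]
    for statement in crm_properties(deps):
        print(statement)
```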

With this set of equivalences, we were able to generate a partial comparison between manually annotated events and possible event-candidates extracted from the treebank.

Naturally, by selecting all the items that have a valency frame we set a very broad scope for our query: a lot of noise (and a low precision) should be expected. Our main goal was to show how linguistic annotation can be used to extract a vast collection of material. We also believe, however, that by leveraging other linguistic features in the tectogrammatical representation (e.g. by selecting only those items that are contextually not bound, or only the main predicates of each sentence) a better precision can be reached.

Users are encouraged to use different definitions of "events" and to refine our list!

The Wordclouds

As an example of the kind of representations that can be generated by our data, we chose to include two wordclouds that represent the actions performed by (respectively) the Athenians and the Spartans (verbs only).

Note that:
  • the wordclouds use the glosses of the ThucVal lexicon.

It is important to remember that the gloss is not the first meaning of the LSJ, but a manually-selected meaning of each verb; the gloss is therefore very accurate for the specific meaning of each verb in its context.

  • the verbs selected are those where the Athenians and Spartans are the effective semantic Actors.

By "effective" we mean that coreference resolution is performed! Even those verbs where the Athenians and Spartans are the implied subjects or are referred by pronouns are captured!

  • only the role of semantic Actor (functor ACT) is relevant (note also that the functor ACT does not always coincide with the syntactic subject).

This last point means that those cases where (e.g.) the Athenians are the syntactic subject of a passive verb (e.g. "to be defeated"), which corresponds to the functor PAT and the argument "patient", are not included. That may give the false impression (e.g.) that all is well with the Athenians in Thuc. 1.89-118, since the active verb νικάω (LSJ gloss "to conquer") is so predominant. The other side of the coin, the phrases where the Athenians "were defeated", is not captured there.

This last question ("how many times do the Athenians win VS the cases in which they are defeated") is a good example of the questions that can be tackled with a multilayer treebank that includes tectogrammatical annotation. Others can be:

  • what is the role of individual (e.g. Pericles, Themistocles) VS collective (the Athenians, the Spartans...) actors?
  • how are Athenians and allies associated in the actions? Does the relation change in time or in the course of the narration?
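The question of wins versus defeats could be approached with a simple count over the extracted events, once by Actor and once by Patient. The sketch below uses invented records (they are not results from the treebank) and a hypothetical data layout in which each event lists the resolved lemmas per functor.

```python
from collections import Counter

def gloss_counts(events, lemma, functor="ACT"):
    """Count ThucVal glosses of the events where `lemma` fills `functor`.

    Each event is a dict with a 'gloss' and a 'roles' mapping from functor to
    the list of lemmas obtained after coreference resolution.
    """
    counts = Counter()
    for event in events:
        if lemma in event["roles"].get(functor, []):
            counts[event["gloss"]] += 1
    return counts

if __name__ == "__main__":
    # Invented sample records, for illustration only.
    events = [
        {"gloss": "to conquer", "roles": {"ACT": ["Ἀθηναῖος"], "PAT": ["Αἰγινήτης"]}},
        {"gloss": "to conquer", "roles": {"ACT": ["Λακεδαιμόνιος"], "PAT": ["Ἀθηναῖος"]}},
        {"gloss": "to go somewhere", "roles": {"ACT": ["Ἀθηναῖος"]}},
    ]
    print(gloss_counts(events, "Ἀθηναῖος", "ACT"))
    print(gloss_counts(events, "Ἀθηναῖος", "PAT"))
```

The ACT counts correspond to what the wordclouds show; the PAT counts are the "other side of the coin" that the wordclouds leave out.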

The users are encouraged to play with the tools and these or other questions!

Instructions on how to download and view the treebank

In addition to browsing an SVG dump of the trees, users can download the multilayer treebank as well as the valency lexicon and the XML schemata from the Download area of Hellespont.

The treebank file is in the xml native format used by the PDT nlp-environment Treex. Users need the software TrEd with the EasyTreex extension to visualize, browse and search the treebank. Instructions on how to install TrEd and the other add-on's can be found here: http://ufal.mff.cuni.cz/tred/

In addition, users who want to modify our treebank or generate Treex files for Ancient Greek must install a full distribution of Treex. For instructions and an introduction to NLP with Treex, visit this page: http://ufal.mff.cuni.cz/treex/

References

[1] http://wp.chs.harvard.edu/chs-fellows/2012/11/26/treebanking-in-the-world-of-thucydides/
[2] Burchardt, A., A. Frank, and M. Pinkal (2005). Building text meaning representations from contextually related frames. a case study. In Proceedings of the 6th International Workshop on Computational Semantics (IWCS-6).
[3] Abeillé, Anne, ed. (2003). Treebanks. Building and Using Parsed Corpora. Dordrecht and Boston: Kluwer Academic Publishers.
[4] Haug, D., and M.L. Jøhndal (2008). "Creating a Parallel Treebank of the Old Indo-European Bible Translations." In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), 27–34. Marrakech, Morocco: European Language Resources Association (ELRA).
[5] Bamman, David, Francesco Mambrini, and Gregory Crane (2009). “An Ownership Model of Annotation: The Ancient Greek Dependency Treebank.” In Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories (TLT 8), 5–15. Milan: EDUCatt.
[6] Böhmová, Alena, Jan Hajič, Eva Hajičová, and Barbora Hladká. 2001. “The Prague Dependency Treebank: A Three-Level Annotation Scenario.” In Treebanks: Building and Using Syntactically Annotated Corpora, edited by Anne Abeillé, 103–127. Boston: Kluwer Academic Publishers.
[7] http://alpheios.net/
[8] Crane, Gregory. 1991. "Generating and Parsing Classical Greek." Literary and Linguistic Computing 6 (4): 243–245.
[9] Sgall, Petr, Eva Hajičová, and Jarmila Panevová. 1986. The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Dordrecht: Academia/Reidel Publishing Company.
[10] Tesnière, Lucien (1959). Éléments de syntaxe structurale. Paris: Klincksieck.

The entity annotations of the Thucydides texts were done as inline annotations in the TEI XML. We wanted to be able to (1) make it possible to disseminate these annotations separate from the text itself and (2) link them to the related citation and linguistic annotations.

To facilitate this, we developed an XSLT transformation attachment:inline_to_oac.xsl to convert the inline TEI Annotations to standalone annotations adhering to the OA Annotation data model. Perseus CTS-enabled stable URIs were used to identify the text passages as targets of the annotations.

The transformation was specific to the format of the annotations used for Thucydides, but could be extended to be a more general purpose tool. Some notes on the current functionality and limitations of the transformation:

  • The code converts the following tags to oac:annotations: geogName, persName, placeName, orgName
  • For the moment, it takes the values of the @key and @type fields and makes them both part of an rdfs:label for the annotation. This probably needs refinement. If it's important to capture the @type, it might make sense to make that a separate annotation on the same target.
  • The OAC 1.0 specification recommends use of the oac:motivatedBy property to classify annotations. It may be that the various annotations in Hellespont should use different motivations (e.g. oa:classifying vs oa:linking). The transformation for the TEI annotations uses oa:linking for them all for now.
  • The stylesheet has a few parameters which can be supplied when you run the transformation and which impact how the resource URIs are generated. These are:

<xsl:param name="e_baseCTSUri" select="'http://data.perseus.org/citations/'"/>
<xsl:param name="e_work" select="'urn:cts:greekLit:tlg0003.tlg001.perseus-grc1'"/>
<xsl:param name="e_baseAnnotationUri" select="'http://hellespont.dainst.org/annotations/'"/>

  • The passage urns for these annotations are all at the level of the section (they are taken from the xml:id values on the annotations). The tectogrammatical annotation urns are at the chapter level because the sections overlap with the sentence boundaries. The interface code which uses them to coordinate the display will have to know how to traverse the urns. However, it may also be necessary to switch the TEI annotations to the chapter level if any of them cross section boundaries.
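To make the role of the three parameters more concrete, here is a Python sketch of how target and annotation URIs might be assembled and how a section-level reference can be reduced to the chapter level. The three constant values are the parameter defaults quoted above; the way the annotation URI and the label are composed, and all function names, are assumptions and do not reproduce the logic of the XSLT stylesheet.

```python
E_BASE_CTS_URI = "http://data.perseus.org/citations/"
E_WORK = "urn:cts:greekLit:tlg0003.tlg001.perseus-grc1"
E_BASE_ANNOTATION_URI = "http://hellespont.dainst.org/annotations/"

def target_uri(passage):
    """Stable URI of a passage: base URI + work URN + passage reference."""
    return f"{E_BASE_CTS_URI}{E_WORK}:{passage}"

def chapter_level(passage):
    """Truncate a book.chapter.section reference to book.chapter."""
    return ".".join(passage.split(".")[:2])

def build_annotation(local_id, passage, key, entity_type):
    """Assemble a minimal annotation record (hypothetical layout)."""
    return {
        "id": f"{E_BASE_ANNOTATION_URI}{local_id}",   # assumed URI pattern
        "type": "oa:Annotation",
        "oa:motivatedBy": "oa:linking",
        "oa:hasTarget": target_uri(passage),
        "rdfs:label": f"{entity_type}: {key}",        # assumed label layout
    }

if __name__ == "__main__":
    annotation = build_annotation("example-1", "1.93.2", "Athens", "placeName")
    print(annotation["oa:hasTarget"])
    # Chapter-level URI, e.g. for alignment with the tectogrammatical annotations.
    print(target_uri(chapter_level("1.93.2")))
```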

Perseus

In order to support integration of Perseus and Arachne resources, we needed to make the Perseus data linkable. We started by thinking carefully about the URIs that we are using to name and address the Perseus texts, catalog metadata, and other data objects from the Perseus Digital Library, so that we could feel reasonably confident in ensuring that these URIs will be stable and properly dereferenceable. We solicited and took into account feedback from members of the digital classics community on our approach to the definition of our URIs. Once we completed the step of defining our URI schemes, our next priority has been to embark on publishing stable URIs for the various pre-existing resources in the library. Future steps will alter the way in which the resource content is represented to advertise its linkable features via RDF-A.

Textual Data

For the textual data, we need to be able to support linking to the texts as a whole, as well as to individual passages of the texts. We use CTS Passage URNs in combination with URI prefixes to identify text passages, as described further at http://sites.tufts.edu/perseusupdates/beta-features/perseus-stable-uris/.
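The components of a CTS URN can be separated with a few lines of Python. This is a simplified sketch that assumes the common colon-separated layout (urn:cts:namespace:textgroup.work.version:passage) and ignores ranges and subreferences; the function name and the returned keys are our own.

```python
def parse_cts_urn(urn):
    """Split a CTS URN into its main components (simplified)."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn}")
    work_parts = parts[3].split(".")
    return {
        "namespace": parts[2],
        "textgroup": work_parts[0],
        "work": work_parts[1] if len(work_parts) > 1 else None,
        "version": work_parts[2] if len(work_parts) > 2 else None,
        # The passage component is absent when the URN names the whole text.
        "passage": parts[4] if len(parts) > 4 else None,
    }

if __name__ == "__main__":
    print(parse_cts_urn("urn:cts:greekLit:tlg0003.tlg001.perseus-grc1:1.89"))
    print(parse_cts_urn("urn:cts:greekLit:tlg0003.tlg001"))
```

The first call addresses a single passage of the Greek edition, the second the work as a whole.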

However, due to some limitations in the way the text chunking algorithms work in the current version of the Perseus Digital Library, we needed in many cases to modify the markup structure of the texts to facilitate retrieval of the XML for individual text passages. This is also part of the larger ongoing effort to convert the Perseus TEI XML to TEI P5 and Unicode, and to fully document the markup practices for the Perseus texts. Ongoing work in this area can be seen in the PerseusDL GitHub canonical text repository.

Catalog Data

http://sites.tufts.edu/perseuscatalog/documentation/user-guide/catalogdata-uris/

Using the CIDOC CRM, it is possible to bridge together structured archaeological data and structured textual data. Since CIDOC CRM is an abstraction model created for object databases, the first question has been if it can be used in order to encode metadata of a written source such as the events of the Pentecontaetia of Thucydides.
For this, the historical events of the Pentecontaetia (Thuc. 1.89-118) were first collected in a dedicated workbench. Following the plot and narrative structure of the text, the collection summarizes those events which Thucydides points out for the period of the Pentecontaetia, with their manifold connections among themselves. It is subjective in a double sense, since it represents not just Thucydides’ point of view, but an interpretation of it.
The extraction of events as described in the text of Thucydides is done in two different ways, one based on historical secondary literature about the text and the other based on linguistic analysis. As one example for secondary literature on Thucydides we chose Hornblower, S.: A Commentary on Thucydides. Oxford: Oxford University Press, p. 133-193. In contrast, we added events from a summary list of Deane, Ph.: Thucydides' Dates 465 - 431 B.C. Don Mills: Longman Canada Ltd, p. 137-139. For more details see here.

Simultaneously, we undertook a comprehensive linguistic annotation of Thucydides' text, in order to enrich the chapters of the Pentecontaetia with the basic syntactical, semantical and pragmatic information that we needed for a data-driven approach to the study of events. More information on the theoretical framework and annotation schema that we used can be found in the section dedicated to Linguistic Annotation.
After we concluded a preliminary stage of manual linguistic annotation, we extracted a list of all the elements that were assigned a valency frame (see here for more information on valency). The relevant sections of the text, and the structured information that can be gathered from our multi-layer annotated corpus, were then converted to the model of our workbench and stored there for comparison with the event list based on modern scholarship: more details are offered here.

In a second step the historical events collected before have been modelled in a CIDOC CRM abstraction layer, e. g.

      E7.Activity
          P7.took_place_at
              E53.Place
          P11.had_participant
              E21.Person OR E74.Group
          P86.falls_within
              E4.Period
                  P4.has_time-span
                      E52.Time-Span
                          P78.is_identified_by
                              E50.Date
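The nesting shown above can be flattened into subject-predicate-object statements. The following Python sketch is only an illustration of that structure: the class and property names are copied from the schema above, while the blank-node identifiers, the function name and the example values are placeholders, and the output is a list of tuples rather than the RDF/XML produced by the project's tools.

```python
def activity_triples(event_id, place, participants, period, date):
    """Flatten one activity into (subject, predicate, object) tuples."""
    activity = f"_:{event_id}"
    period_node = f"_:{event_id}-period"
    span_node = f"_:{event_id}-timespan"
    triples = [
        (activity, "rdf:type", "E7.Activity"),
        (activity, "P7.took_place_at", place),
    ]
    # One participant statement per person (E21) or group (E74).
    triples += [(activity, "P11.had_participant", p) for p in participants]
    triples += [
        (activity, "P86.falls_within", period_node),
        (period_node, "rdf:type", "E4.Period"),
        (period_node, "rdfs:label", period),
        (period_node, "P4.has_time-span", span_node),
        (span_node, "rdf:type", "E52.Time-Span"),
        (span_node, "P78.is_identified_by", date),
    ]
    return triples

if __name__ == "__main__":
    # Placeholder identifiers, not real Arachne IDs or dates from the database.
    for triple in activity_triples(
        "event-1",
        place="place:example",
        participants=["actor:example-1", "actor:example-2"],
        period="period:example",
        date="date:example",
    ):
        print(triple)
```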

The crucial point in using the CIDOC CRM is the definition of an event vs an activity. Only the latter "comprises actions intentionally carried out by instances of E39 Actor" (cf the CIDOC CRM ontology for E7 Activity). This means that most of the events of the Pentecontaetia in fact are E7 Activities in terms of the CIDOC CRM. However, our application of the model is able to represent instances of both E5 Event and E7 Activity based on the same structure.

The aim of this type of event modelling is to represent the history of the Pentecontaetia as a sequence of interdependent events according to Thucydides. The original CIDOC CRM ontology of classes and properties often gives a very adequate solution, even if in some cases it is not possible to express the whole complexity of the history written by Thucydides. The original ontology is neither modified nor extended. The modelling itself has been developed in close cooperation with Arachne.
Furthermore, some of the properties that describe an event are close to the types of semantic relations that our linguistic annotation identifies in the text of Thucydides. It is therefore possible to attempt a conversion between some of the semantic labels that we used to define the relation between head-word and dependents in our model and the relevant CIDOC CRM properties: more details are discussed in the section on Linguistic Annotation.

Our model includes information about actors, places, chronology and relations between the events themselves, as well as references both to the ancient textual source and to the modern literature from which historical information about events is taken. References from the events back to the ancient text are given using CTS-URNs and can link to any other annotation of the text, e. g. named entity annotation (TEI), linguistic annotation and the extraction of related secondary literature. The named entities are provided with the same URIs (Arachne IDs) as in the TEI annotation of the text. Using the CTS-URNs for referencing the ancient text and URIs for the named entities, we are also able to link back from the CIDOC mapping to our data in GapVis.
CIDOC Schema Hellespont
CIDOC Schema ThucDB

The ThucDB workbench is built following this schema of event modelling. As a result, each single historical event, as well as the whole content of the database, can be exported as CIDOC triples in RDF/XML. You can read about the technical considerations and details of the ThucDB here.

From this mapping, a bridge is built to the CIDOC CRM mapping of both the Arachne database and the Perseus Art & Archaeology collection in order to match the information about entities such as places and people (topography and persons in the archaeological database) that are involved in the historical events. In this way, the digitally structured data that we have provided so far for both written and archaeological sources are combined in a dedicated interface, together with secondary literature automatically extracted from the web.


The Perseus Art & Archaeology (A&A) collection includes detailed information on artifacts, buildings, coins, gems, sculptures, building sites, and vases, with data that was created over 20 years ago in some cases. Our goal for this project was to be able to provide all this data encoded in RDF/XML according to the CIDOC Specification to provide a common interchange format compatible with the object data from the Arachne database. Mapping the Perseus data to CIDOC proved to be a far more significant undertaking than initially expected due to a large number of inconsistencies in the A&A legacy data and a number of areas where the CIDOC CRM did not contain easily equivalent classes or properties to represent the Perseus data. In particular, the large variety of ways to express both time and place within the CIDOC CRM and the number of slightly inconsistent fields for date types, bibliography, and locations within the A&A database proved to be the most challenging issues to resolve.

We began by using a Google spreadsheet to track the mapping process, listing all of the properties of the Perseus data and suggested mappings to the CIDOC terminology along with notes and examples from the data. This allowed multiple colleagues to work simultaneously, and to share the process and explanations for potential mappings with colleagues in Germany. Perseus staff utilized the Arachne database mapping to the CIDOC CRM (available both as a large XML file and through an online browsing tool), the CIDOC CRM documentation, and a number of examples where Arachne staff had already mapped objects for which the Perseus A&A database also contained records to the CIDOC CRM.

We went through several iterations of review and discussion cycles using the spreadsheet, and then encoded these mappings as an XSLT stylesheet attachment:perseusaa_to_cidoc.xsl used to transform the Perseus A&A data from its legacy format to CIDOC XML. The Perseus and Arachne staff went through additional review iterations of the transformed output and from this point forward the XSLT stylesheet became the authority for the Perseus mappings. The initial CIDOC output was in plain XML, subsequently modified to output RDF/XML. The DAI team developed a RelaxNG schema (attachment:cidoc-schema-rdf.rnc) for the project's implementation of the CIDOC ontology, and the Perseus stylesheet produces output compliant with this schema.

The use of RDF/XML required the identification of stable URIs for references and vocabulary terms within the Perseus data. Not all subjects and objects of the Perseus RDF triples are addressable by URI, due to the nature of the data. In some cases the use of non-controlled vocabulary for data properties presented a challenge, as did long text strings for bibliographic references and citations. Wherever possible we mapped to existing URI identifiers.

For place names we used a process which extracted all of the names for each place identified in the Pleiades database, along with the minimum and maximum dates associated with that name, and the Pleiades URI identifier for the place. This datafile (attachment:allplaces.xml) is an input to the Perseus CIDOC transformation stylesheet, which employs a simple algorithm to find a match on the name referenced in the Perseus data, using the production dates associated with the artifact in the Perseus data to exclude incorrect matches. It is an imperfect algorithm: we ended up with just under 80% of the places mapped to Pleiades URIs, and in many cases multiple potential Pleiades matches were found. This is an opportunity for further improvement in a future project.
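The matching step described above can be sketched in a few lines. This is a minimal Python illustration and not the project's XSLT code: the `PlaceName` record, the `match_place` function, the case-insensitive exact-name comparison, the overlap test, and the `example.org` URIs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlaceName:
    """One name of a gazetteer place with the date range attached to that name."""
    name: str
    uri: str        # placeholder URI, not a real gazetteer identifier
    min_year: int   # negative values stand for years BC
    max_year: int


def match_place(name: str, prod_start: int, prod_end: int,
                names: List[PlaceName]) -> List[str]:
    """Return the URIs whose name matches and whose dates overlap the production dates."""
    wanted = name.strip().lower()
    hits: List[str] = []
    for entry in names:
        if entry.name.strip().lower() != wanted:
            continue
        # Two ranges overlap when each one starts before the other ends.
        if entry.min_year <= prod_end and prod_start <= entry.max_year:
            if entry.uri not in hits:
                hits.append(entry.uri)
    return hits


# Invented sample data: two places share a name but differ in date range.
SAMPLE_NAMES = [
    PlaceName("Athens", "https://example.org/places/a", -750, 640),
    PlaceName("Athens", "https://example.org/places/b", 1800, 2000),
    PlaceName("Sestos", "https://example.org/places/c", -750, 300),
]

print(match_place("athens ", -480, -430, SAMPLE_NAMES))
```

A result list with more than one URI corresponds to the "multiple potential matches" case mentioned above, and an empty list to an unmapped place.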

We did construct new URI identifiers for some controlled vocabulary terms in the Perseus data, in particular artifact types and date periods. The artifact type URIs are under the prefix "http://data.perseus.org/rdfvocab/artifacts/". For date periods, we used the period sort strings to create URIs under the "http://data.perseus.org/aaperiods" prefix. An outstanding goal is to enable those URIs to resolve to RDFa-compatible descriptions of the vocabulary terms.

The final transformation has been included in the Perseus code repository and a feature will soon be added to the Perseus web interface to enable on-demand export of the Perseus A&A data into this format.

GapVis Frontend with Hellespont specific modifications
GitHub repository

History of GapVis

The "Gap" in GapVis originally stood for Google Ancient Places.
The aim of the project was to automatically extract place names from ancient texts that are available in Google Books and to map these places to the Pleiades gazetteer in order to have a URI, coordinates and additional information for each place.
To visualize the results of the project, the interface GapVis was created by programmer Nick Rabinowitz in the JavaScript framework Backbone.js.

GapVis offers the possibility to read a text page by page and see each mentioned place highlighted both in the text itself and on a map next to it.

Each place has its own detail page where you can see related places (that are mentioned on the same pages) and links to more information in other interfaces.

Because GapVis is open source software and because Hellespont in part has similar data (a text by Thucydides and annotated entities inside the text) and similar visualization requirements, it was decided to use the GapVis code and modify it for Hellespont's needs.
A big difference is that Hellespont data is not based on automatically generated place references but on TEI files that were manually tagged. Not only places were tagged but also persons and organisations. Furthermore, a Hellespont visualization needs to be able to show completely different data as well: linguistic trees, secondary literature and both historical and linguistic event data.

The first challenge was to create a backend to communicate with the Backbone.js based GapVis frontend. The original GapVis runs on static JSON files, but for Hellespont we wanted to use a dynamic API that could on the one hand be extended for all our data needs in the future and on the other be able to aggregate data from different sources.
This API was then built on the Java framework Spring. Because the Thucydides text was tagged with the Arachne entity URIs in the TEI files, it was easy to extract all information needed for visualization directly from the Arachne MySQL database. This includes a title, latitude and longitude coordinates, and corresponding URIs in other databases for each entity, for example the place URIs of the iDAI.gazetteer. Person records were assigned a birth place wherever this information was known. This birth place could then be used in GapVis to visualize the persons on the map alongside the annotated places.
The Greek text of the individual chapters themselves comes directly from the TEI files which are transformed into HTML with the help of XSLT and then displayed in the text portion of the interface. The different kinds of entities can be distinguished by their corresponding colours.
As in the original GapVis, a complete chapter of Thucydides is always on display, but the user can also select individual passages. When a passage is selected, the currently active CTS-URN changes along with it. In the event and secondary literature views the data displayed is always in sync with the current CTS-URN. There is also, at every step, a link back to the Thucydides passage in Perseus.
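To illustrate how a view can stay in sync with the active passage, the following Python sketch splits a CTS-URN into its parts and swaps the passage component. It is not the GapVis frontend code (which is JavaScript); the names `CtsUrn`, `parse_cts_urn` and `with_passage` are hypothetical, and the URN value is only an illustrative example of the general `urn:cts:namespace:work:passage` shape.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    work: str
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS-URN into namespace, work and (optional) passage."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(parts[2], parts[3], passage)


def with_passage(urn: str, passage: str) -> str:
    """Return the same work URN pointing at a different passage."""
    parsed = parse_cts_urn(urn)
    return f"urn:cts:{parsed.namespace}:{parsed.work}:{passage}"


active = "urn:cts:greekLit:tlg0003.tlg001:1.89.1"  # illustrative value
print(parse_cts_urn(active).passage)
print(with_passage(active, "1.93"))
```

Any view that loads events or secondary literature could then key its request on the passage component of the active URN.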

Reading View

The Reading View always shows the currently active passage of the Pentecontaetia with annotated entities in different colours on the left-hand side. The user can decide whether to concentrate on the entities themselves, the events, or the secondary literature for the current passage. There is also an English translation available for the Thucydides text.

Time Map View

The Time Map View shows a selected entity from the text on the map, along with an intratextual timeline of all entities mentioned throughout the chapters. There is the possibility to zoom into the map, to go to the Entity Detail View, or to jump to another chapter that also mentions the chosen entity.

Event List View

The Event List View shows all events for the currently active passage: manually created historical events as well as automatically generated linguistic events. It is possible to click on each event to go to the Event Detail View. For linguistic events there are also links to the Tree View, in which the particular event node is then highlighted. To narrow down the list of events, the user can also choose a subsection from the chapter on the left.

Secondary Literature View

For those text passages for which secondary literature has been extracted, this information can also be shown on the right side. There is a list of papers that mention this particular text passage, and for each there is a link to the full text of the paper on JSTOR and an expandable list of other citations occurring in the paper. These citations are again connected to Perseus with CTS-URNs.

Entity Detail View

If an entity has been selected, it can be viewed in the Entity Detail View. Here the user can find links to Arachne and the iDAI.gazetteer and also to other databases like Pleiades, DBpedia and dictionaries hosted on Perseus. There is also a list of co-referenced entities (that is, entities that are frequently mentioned in the same chapters) with links and a visualization on the map. Additionally, there are sometimes pictures from Arachne and Flickr.
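The idea of co-referenced entities can be expressed as a simple co-occurrence count over chapters. The Python sketch below is an assumption about how such a list could be computed, not the Hellespont implementation; the function name `co_referenced`, the ranking rule and the sample chapter data are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def co_referenced(target: str, chapters: Dict[str, Set[str]],
                  top: int = 5) -> List[Tuple[str, int]]:
    """Rank entities by the number of chapters they share with `target`."""
    counts: Counter = Counter()
    for entities in chapters.values():
        if target in entities:
            counts.update(e for e in entities if e != target)
    # Sort by descending count, then by name so the order is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Invented sample data: chapter reference -> entities annotated in it.
SAMPLE_CHAPTERS = {
    "1.89": {"Athens", "Sestos"},
    "1.90": {"Athens", "Sparta", "Themistocles"},
    "1.93": {"Athens", "Piraeus", "Themistocles"},
}

print(co_referenced("Athens", SAMPLE_CHAPTERS, top=2))
```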

Event Detail View

The Event Detail View offers information about a single event. This information is retrieved from the ThucDb and includes related events, actors, places and the secondary literature in which this event has been documented.

Tree View

The Tree View offers the currently active text passage without entity annotations on the left and two types of linguistic trees, syntactical and tectogrammatical, on the right. The images of the trees are based on SVG files created from Treebank data. The user can hover over each word or node to see it highlighted in both trees and the text passage. Clicking keeps the selection in place, which is useful for moving around inside the trees without losing the highlighted node.

The purpose of the "Linked Data Browser for Hellespont" is to provide search and navigation of data that has been aggregated from different cultural heritage information systems. The browser has been used to investigate how cultural heritage Semantic Web data can be conveyed in a fast and user-friendly way. A more technical view on the data has been provided by installing Pubby, a linked data frontend for SPARQL endpoints. [Pubby for Hellespont installation]

In order to provide fast searching and browsing for the aggregated CIDOC CRM data, an indexing layer has been implemented. By using SPARQL, selected metadata is extracted from the CIDOC CRM data and written to an Elasticsearch index. [Elasticsearch Website] A back end has been implemented that controls the data import and processes frontend requests with the help of this Elasticsearch index. Hellespont makes extensive use of global identifiers (URIs) for referring to cultural heritage and geographical entities. Thus, it is prepared to become part of a linked data infrastructure for cultural heritage and geographical entities.
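As a rough illustration of such an indexing step, the Python sketch below flattens a SPARQL result in the standard SPARQL JSON results format into plain documents and serializes them in the newline-delimited form accepted by the Elasticsearch bulk API. It makes no network calls and is not the Hellespont back end; the function names, the index name `hellespont`, the variable names `uri` and `label`, and the sample result are assumptions.

```python
import json
from typing import Any, Dict, List


def bindings_to_docs(sparql_json: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten SPARQL JSON result bindings into one plain dict per row."""
    docs = []
    for row in sparql_json["results"]["bindings"]:
        docs.append({var: cell["value"] for var, cell in row.items()})
    return docs


def to_bulk_body(docs: List[Dict[str, str]], index: str,
                 id_field: str = "uri") -> str:
    """Build a bulk request body: an action line followed by a source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Hand-written sample in SPARQL JSON results format (not real endpoint output).
SAMPLE_RESULT = {
    "head": {"vars": ["uri", "label"]},
    "results": {"bindings": [
        {"uri": {"type": "uri", "value": "http://hellespont.dainst.org/event/1"},
         "label": {"type": "literal", "value": "The Siege of Sestos"}},
    ]},
}

docs = bindings_to_docs(SAMPLE_RESULT)
print(to_bulk_body(docs, index="hellespont"))
```

Using the entity URI as the document id keeps the index aligned with the global identifiers, so re-importing the same entity overwrites its document instead of duplicating it.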

After submitting a search request, the user is presented with a list of results and a number of layout elements for faceted search. The latter can be used to further narrow down the search results, for example by choosing a category or place attribution. The list of search results shows a short description, the type and the provenance (name of the data source) for each result. Selecting a search result opens a single entity view that provides additional information about the described entity. This information comprises a permalink, additional metadata, links to other presentation systems and links to contextual entities.
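The faceting behaviour described above amounts to counting the values of a field across the result list and filtering by a chosen value. The following Python sketch shows that logic on in-memory data only; in the actual browser this work is presumably delegated to the search index, and the function names, field names and sample records here are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def facet_counts(results: List[Dict[str, str]], field: str) -> Dict[str, int]:
    """Count how many results carry each value of `field`."""
    return dict(Counter(r[field] for r in results if field in r))


def apply_facet(results: List[Dict[str, str]], field: str,
                value: str) -> List[Dict[str, str]]:
    """Keep only the results whose `field` equals the chosen facet value."""
    return [r for r in results if r.get(field) == value]


# Invented sample results with a type and a provenance (data source) each.
SAMPLE_RESULTS = [
    {"label": "Result A", "type": "place", "provenance": "Arachne"},
    {"label": "Result B", "type": "object", "provenance": "Perseus"},
    {"label": "Result C", "type": "object", "provenance": "Arachne"},
]

print(facet_counts(SAMPLE_RESULTS, "provenance"))
print(apply_facet(SAMPLE_RESULTS, "type", "object"))
```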

Due to limited development resources, the "Linked Data Browser for Hellespont" has been implemented as a "proof of concept". A number of layout elements require additional work, the available data needs to be completed so that it fully represents the contributing data sources, and a number of bugs still need to be fixed.

To get access to the source code and short developer documentation, write an email to the Hellespont list.

Try the "Hellespont Linked Data Browser".

You can also directly access the SPARQL endpoint.

The ThucDb workbench was created to enter event data into a database from which it can later be exported.

The Idea

The events are not only defined by their corresponding text passages, but also by additional information that can be abstracted from the text and expressed in a standardized way: for example, which actor carried out the event and where it took place, but also how events relate to each other.
The standard that is used throughout the Hellespont project is the CIDOC CRM ontology (see also: Hellespont CIDOC CRM Mapping), and this is also what we modelled the database structure on: the ThucDb was developed as a Grails app that contains actual CIDOC CRM classes and properties in its domain model.
This makes entering events straightforward. The data itself is stored in a MySQL database.

While historical events based on secondary literature were inserted into the database manually, the linguistic Treebank-based events could be imported automatically from the Treebank export (see also: Linguistic Annotation).

The Export

Probably the most useful thing about the ThucDb is its export function:

Each event that has been entered into the database can be exported as a valid CIDOC-CRM RDF/XML file.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:crm="http://www.cidoc-crm.org/cidoc-crm/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" xmlns:owl="http://www.w3.org/2002/07/owl#">
  <crm:E7_Activity rdf:about="http://hellespont.dainst.org/event/1" rdfs:label="The Siege of Sestos">
    <crm:P2_has_type>
      <crm:E55_Type rdf:about="http://hellespont.dainst.org/type/manual_annotation" rdfs:label="manual annotation">
        <rdf:value xml:lang="en">manual annotation</rdf:value>
        <crm:P2_has_type>
          <crm:E55_Type rdf:about="http://hellespont.dainst.org/type/event_type" rdfs:label="event type">
            <rdf:value xml:lang="en">event type</rdf:value>
          </crm:E55_Type>
        </crm:P2_has_type>
        <crm:P3_has_note xml:lang="en" rdf:datatype="http://www.w3.org/2001/XMLSchema#string">historical literature</crm:P3_has_note>
      </crm:E55_Type>
    </crm:P2_has_type>
    <crm:P1_is_identified_by>
      <crm:E41_Appellation>
        <rdf:value>The Siege of Sestos</rdf:value>
      </crm:E41_Appellation>
    </crm:P1_is_identified_by>
    <crm:P3_has_note rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Hornblower</crm:P3_has_note>
    <crm:P9i_forms_part_of>
      <crm:E5_Event rdf:about="http://hellespont.dainst.org/event/58" rdfs:label="The Pentekontaetia or 'Fifty Years'">
        <crm:P1_is_identified_by>
          <crm:E41_Appellation>
            <rdf:value>The Pentekontaetia or 'Fifty Years'</rdf:value>
          </crm:E41_Appellation>
        [...]

It is also possible to export all events, actors, places and so on at once into a single large file. This file can then be ingested by the Triple Store along with the Arachne and Perseus data already in there. The CIDOC CRM events are therefore a direct link between the text-based and the Linked Data part of the project: information is extracted from the Thucydides text and from the secondary literature by S. Hornblower and P. Deane and turned into Linked Data that is in turn connected to Arachne and Perseus entities.

The export is implemented by XML marshalling inside a controller in the Grails app. Grails offers automatic converters for this kind of task, and every detail of the tag structure can be determined exactly so that it is aligned with the Hellespont CIDOC-CRM mapping.
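For readers who want to experiment with this output shape outside Grails, the following Python sketch builds a reduced version of the event structure shown in the sample export, using only the standard library. It is not the ThucDb controller code: the function `event_to_rdfxml` is hypothetical, and only the `E7_Activity`, `P1_is_identified_by`, `E41_Appellation` and `P3_has_note` parts of the sample are reproduced.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
CRM = "http://www.cidoc-crm.org/cidoc-crm/"

# Register prefixes so the serializer writes rdf:, rdfs: and crm: instead of ns0:.
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("crm", CRM)):
    ET.register_namespace(prefix, uri)


def event_to_rdfxml(uri: str, label: str, note: str = None) -> str:
    """Serialize one event as a minimal CIDOC CRM RDF/XML document."""
    root = ET.Element(f"{{{RDF}}}RDF")
    activity = ET.SubElement(
        root, f"{{{CRM}}}E7_Activity",
        {f"{{{RDF}}}about": uri, f"{{{RDFS}}}label": label},
    )
    identified = ET.SubElement(activity, f"{{{CRM}}}P1_is_identified_by")
    appellation = ET.SubElement(identified, f"{{{CRM}}}E41_Appellation")
    ET.SubElement(appellation, f"{{{RDF}}}value").text = label
    if note is not None:
        ET.SubElement(activity, f"{{{CRM}}}P3_has_note").text = note
    return ET.tostring(root, encoding="unicode")


print(event_to_rdfxml("http://hellespont.dainst.org/event/1",
                      "The Siege of Sestos", note="Hornblower"))
```

Building the tree element by element mirrors the point made above about controlling the tag structure explicitly rather than relying on a generic object-to-XML conversion.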