Tuesday, March 12, 2013

BiSciCol, Triples, and Darwin Core

A big part of what we want to accomplish with BiSciCol is supporting biodiversity collections data from lots of different sources.  These data are often organized using a standard called "Darwin Core" (DwC), and Darwin Core-based data are commonly transmitted in a specific format known as a "Darwin Core Archive" (DwCA).  So recently, we've been devoting a lot of thought and effort to figuring out how we can best support DwC and DwCAs in BiSciCol and the Triplifier.  (The "Triplifier" is a tool we are building to make it easy to convert traditional, tabular data formats into RDF triples for use in BiSciCol and the Semantic Web.  DwCAs are just such a format.)  Representing DwC data in RDF triples and "triplifying" DwCAs presented a number of challenges, and in this post we want to discuss one of these challenges:  Figuring out how to use our relations terms to capture the connections found in DwC data.

Darwin Core includes six discrete categories of information: Occurrence, Event, dcterms:Location, GeologicalContext, Identification, and Taxon.  DwC does not formally describe relationships between these categories of information, though.  Formally defining the relationships that join categories, or classes, of information is common practice in standards development, but DwC's developers deliberately chose not to do this in order to make the standard as flexible as possible.

Before proceeding, we should note that in the previous paragraph, we were careful to make a distinction between the words “class” and “category.”  “Class” is a special word typically used to describe categories of information present in a formal ontology (which DwC is not).  However, since we’re describing a method for working towards formalizing DwC content, we’ll use the word “class” hereafter to refer both to the formal model and the original DwC categories.

So, to represent DwC data as RDF triples, we needed a way to relate DwC class instances to one another.  This sounds fancy, but it's really a matter of using a common-sense approach to describe relationships between entities, much as people have been doing with relational databases for decades.  In fact, the darwin-sw project has already developed a complete ontology for representing DwC data in the Semantic Web.  However, because BiSciCol is limited to a small set of generic relations terms, we needed a new approach for handling DwC data.  Plus, by building on the core BiSciCol relations, such a solution could easily include not just DwC, but concepts from other domains such as media, biological samples, genetic material, and the environment.

To make this all a bit more concrete, let's take a look at an example.  Suppose we have a single instance of Occurrence (a specimen in a collection, say) that originated from a particular collecting expedition, which is represented in DwC as an instance of the Event class.  Using RDF and BiSciCol's relations predicates, how should we make the required connection between the Occurrence instance and the Event instance?  More generally, how should the six core DwC classes be related to one another using BiSciCol's relations terms?


The image above illustrates our answer to this question.  Recall that we are using only four relations predicates in BiSciCol: derives_from, depends_on, alias_of, and related_to (see the previous post for much more information).  The diagram should be fairly self-explanatory.  Some relationships are naturally described by depends_on.  For example, an Identification can only exist if there is an Occurrence (e.g., a specimen) to identify and a Taxon to identify it as.  On the other hand, a GeologicalContext gives us information about a collecting Event, but in at least some sense, the collecting event is independent of the geological context.  Thus, the relationship between these two instances is described by related_to.
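To show what the relationships described above might look like as triples, here is a minimal Python sketch.  It covers only the relations spelled out in the text (Identification depends_on Occurrence and Taxon; GeologicalContext related_to Event).  The "ex:" identifiers and the triple() helper are hypothetical names made up for illustration, and the direction of the related_to triple is our assumption; none of this is the Triplifier's actual code or output.

```python
# The four BiSciCol relations predicates named in this post.
PREDICATES = {"derives_from", "depends_on", "alias_of", "related_to"}


def triple(subject, predicate, obj):
    """Build a (subject, predicate, object) tuple, rejecting unknown predicates."""
    if predicate not in PREDICATES:
        raise ValueError(f"unknown relations predicate: {predicate}")
    return (subject, predicate, obj)


# Hypothetical instance identifiers; real data would use resolvable identifiers.
occurrence = "ex:occurrence1"
taxon = "ex:taxon1"
identification = "ex:identification1"
event = "ex:event1"
geological_context = "ex:geologicalContext1"

triples = [
    # An Identification needs an Occurrence to identify...
    triple(identification, "depends_on", occurrence),
    # ...and a Taxon to identify it as.
    triple(identification, "depends_on", taxon),
    # The GeologicalContext informs the Event without depending on it.
    # (Direction chosen for illustration only.)
    triple(geological_context, "related_to", event),
]

for s, p, o in triples:
    print(s, p, o)
```

Plain tuples keep the sketch free of dependencies; the same three statements could be loaded into any RDF library once the identifiers and predicates are given full URIs.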

So far so good, but when dealing with real data, this solution turns out to be insufficient because DwC data sets often do not include all six core classes.  What should we do if a data set includes Occurrence and Taxon, but not Identification?  This scenario is not uncommon, so to deal with all possibilities, we added a few more relations to handle the cases where a class (either Identification or Event) that acts as a bridge connecting Occurrence to other classes is missing.  The following diagram illustrates the complete set of relations, with the dashed, gray lines representing the relations that are used if either Identification or Event is missing.
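The fallback idea can be sketched in a few lines of Python for the Occurrence/Taxon case.  This is only an illustration of the logic: the function name is hypothetical, and the predicate and direction used for the direct Occurrence-to-Taxon link are assumptions (exposed as a parameter), since the specific choice is shown in the diagram rather than stated in the text.

```python
def link_occurrence_to_taxon(occurrence, taxon, identification=None,
                             fallback_predicate="related_to"):
    """Return the triples connecting an Occurrence to a Taxon.

    If an Identification is present, it acts as the bridge and depends on
    both the Occurrence and the Taxon.  If it is missing, a single direct
    triple is returned instead.  The fallback predicate and its direction
    are assumptions made for this sketch.
    """
    if identification is not None:
        return [
            (identification, "depends_on", occurrence),
            (identification, "depends_on", taxon),
        ]
    return [(occurrence, fallback_predicate, taxon)]


# Data set with all three classes: the Identification bridges the other two.
bridged = link_occurrence_to_taxon("ex:occ1", "ex:taxon1", "ex:ident1")

# Data set with no Identification: fall back to one direct relation.
direct = link_occurrence_to_taxon("ex:occ1", "ex:taxon1")

print(bridged)
print(direct)
```

The same pattern would apply when Event is the missing bridge class.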

 


And that's it!  With this set of eight relationship triples, we should be able to handle all possible combinations of the six core DwC classes.

- Brian Stucky, John Deck, Rob Guralnick, and Tom Conlin