Friday, April 6, 2012

Making our System Smarter

Computers are amazing at following instructions. So amazing, in fact, that a seemingly harmless instruction can potentially lead to an entirely false conclusion.
At our recent BiSciCol meeting at the University of Florida, we had a discussion about just such a case.

At its core, BiSciCol is all about connecting objects to each other. To link object identifiers to other objects, we have been using a simple relationship expression called “leadsTo” that indicates the direction of the relationship between one object and another. To illustrate how “leadsTo” works, let's look at a simple example. Suppose we have a collecting event, which we join to a specimen object using our relationship predicate “leadsTo”. The specimen object could then “lead to” a taxonomic determination, which could in turn “lead to” a scientist, and so on.

This is certainly useful, as we can express an endless chain of objects and their derivatives, even if they exist in different databases. However, what if we extended the above example just a bit further, using our “leadsTo” relationship?

Uh oh! By successively following the leadsTo relationships, we could now erroneously conclude that spec2-t1 came from spec3! This is not good! Fortunately, there is a solution.

We realized that the directional “leadsTo” relationship simply doesn't make much sense in some situations, such as the connection between spec3 and person1 in the diagram above. Consequently, instead of the single “leadsTo” relationship, we actually need two relationship terms: one that has a distinct direction and one that implies no direction. Two terms from the Dublin Core standard are already in use that do just this: 1) relation (no direction) and 2) source (has direction).

In the first example above, we could avoid the problem entirely by describing the link between the taxonomic determination and the scientist as a non-directional relation. Using our new terminology, the graph would look something like this:

The computers that traverse the graph of relationships would know not to follow non-directional relationships, and we would no longer infer that spec2-t1 came from spec3. Problem solved!
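To make the idea concrete, here is a minimal sketch in Python of a traversal that follows only directional links. It is not the BiSciCol code. The edge list is a hypothetical stand-in for the diagram, with each triple written as (from, predicate, to) in the direction of the arrow, and the function name `descendants` and its `follow_relation` flag are made up for illustration.

```python
from collections import deque

# Hypothetical edges, loosely modeled on the example in this post.
# The exact edges of the original diagram are an assumption.
TRIPLES = [
    ("CE1", "source", "spec2"),
    ("spec2", "source", "spec2-t1"),
    ("spec2-t1", "relation", "person1"),
    ("CE2", "source", "spec3"),
    ("spec3", "relation", "person1"),
]


def descendants(start, triples, follow_relation=False):
    """Return the set of objects reachable from `start`.

    "source" edges are followed in the arrow direction only.
    "relation" edges have no direction; they are skipped unless
    `follow_relation` is True, in which case they are walked both ways.
    """
    neighbours = {}
    for frm, pred, to in triples:
        if pred == "source":
            neighbours.setdefault(frm, set()).add(to)
        elif pred == "relation" and follow_relation:
            neighbours.setdefault(frm, set()).add(to)
            neighbours.setdefault(to, set()).add(frm)

    # Plain breadth-first search over the edges selected above.
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Following only "source" keeps the two collecting events separate.
print(sorted(descendants("CE2", TRIPLES)))
# Walking "relation" links as well drags in the other specimen's determination.
print(sorted(descendants("CE2", TRIPLES, follow_relation=True)))
```

With these assumed edges, the first call reaches only spec3, while the second also reaches person1 and spec2-t1, which is exactly the kind of false conclusion we want to avoid.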

This post, written by John Deck and Brian Stucky with input from Hilmar Lapp, Steve Baskauf, Andrea Thomer, Rob Guralnick, Lukasz Ziemba, Tim Robertson, Reed Beaman, and Nico Cellinese, summarizes a discussion that took place at the BiSciCol development meeting held on March 31, 2012 at the University of Florida.


  1. Interesting initiative! I came across this blog through a tweet from @ncellinese
    I have a small question though. I am a bit puzzled by the fact that in the 2nd graph the source arrow goes from CE1 to spec2 on one hand, but from spec3 to CE2. I'd be interested to learn why this is the case.

    1. Hi, Aaike,

      Thanks for your comment. The reason for your puzzlement was exactly what Tim suspected -- I made a mistake in the original diagram, and had the arrows pointing the wrong way! I've since fixed the image.


  2. Aaike, I suspect that is actually a typo, unless they are trying to say that they actually collected something from the specimen (e.g. took some leaves from an existing plant specimen). Since the CollectionEvent points to 2 specimens though, I suspect it is simply a typo.

    Others, I still question the rationale for such generic relationships when this is used in practice. Suppose I want to ask the question "How often are specimens identified by an agent who is not the collecting agent?", I can't really do it. The dc:relation to the agents could conceptually mean "identifiedBy", "curatedBy", "preparedBy", "verifiedBy" etc. so I can't reliably infer anything from the dc:relation. Maybe I am missing something obvious, but I sense all that will be possible is loose connection across systems and basic graph navigation. As a community should we not be aiming to answer some specific questions?

    To give you an idea of what we are trying to answer from the GBIF registry, someone working on the political level trying to get a country to join the network wants to know "What is the participation of country X in GBIF?". To this they want to know things like "Y datasets are published by institutions in country X, and shared through the international networks A,B and C but hosted by datacenter A (which resides in Country B). Country Z has Y datasets with specimen information from country X. etc etc". Other questions we are trying to address relate to the attribution chain for publishing data onto the network. We are not quite there yet with our system but are getting closer. In this modeling process we found that we needed a small handful of specific relationship types (owns, hosts, hasConstituentPart, serves etc) to answer the questions, and couldn't just rely on the types of entities at the ends of the edges. It was only when we thought of the questions we wanted to answer though, that this became apparent.


    1. The internal system allows sub-properties of the generic relations but we haven't gone so far as allowing those sub-properties to be designated in the triplifier. Thus, the specific types of relations you expressed would work in our system if they referenced either source or relation as super-property. The reason for this order of priority is that the generic terms discussed above solve the core BiSciCol use cases for tracking objects and we wanted to deliver a simple solution that can be implemented in a shorter timeframe to solve our use cases. That said, we'll have time to explore implementation of relation/source sub-properties this summer.

  3. John-
    Do you have to develop purpose-built reasoners for this graph traversal? Are we to understand these are not RDF graphs? If they are RDF graphs, what RDF is meant by the bidirectional arrows?
    --Bob Morris

    1. Apologies for following up on this comment a bit late. In summary, we are implementing graph traversal in code and following all source predicates as far as they can go. Relation does not imply a distinct direction in our system and thus is not followed in our graph traversal functions. Also, BiSciCol makes no attempt to define domains/ranges, so in a sense it's up to the user to apply these to their own data or not. Hence, we are definitely dealing with a graph; whether one would call this an Ontology or proper RDF, however, is open to discussion!
