Thursday, August 22, 2013

A sequence, a specimen, and an identifier walk into a bar…

As biodiversity scientists, we rely on a breadth of data living in various domain-specific databases to assemble knowledge about particular ecoregions, taxa, or populations.  We visit genetic and genomic, morphological, phylogenetic, image, and specimen databases in search of up-to-date information to answer our questions.  What we find, in general, is a morass of data, disconnected between systems, as if each database were a walled garden.  The main goal of BiSciCol is linking digital objects and datasets, breaking down those walled gardens, and building a true network of biodiversity data in which we can ask questions across domains and follow object instances wherever they are referenced.

How do we enable linking data, on a global scale, across domains?  For BiSciCol, this comes down to two approaches: 1) build better ontologies and vocabularies so we can use the same language when talking about the same thing, and 2) adopt identifiers that are robust enough to persist over time and allow linking across walled gardens.  For this blog post, we'll focus on the problems with identifiers as they're currently used in practice, and how they can be improved to enable better linking.

To provide a point of reference for our discussion, let's look at two databases that contain overlapping references to the same objects: VertNet and GenBank.  VertNet is a project that aggregates vertebrate specimen data housed in collections, and GenBank is a project that houses sequences, often containing references to specimen objects held in those same museums that are part of VertNet.  For our exercise in linking, we'll use a popular method for identifying museum objects, the Darwin Core (DwC) triplet: an institution code, collection code, and catalog number, separated by colons.

In the VertNet database, the DwC triplet fields are stored separately, and the triplet can be constructed programmatically by joining the field values with colons.  The INSDC standard, which GenBank adopts, specifies that the specimen_voucher qualifier should contain the full DwC triplet (institution code, collection code, and catalog number separated by colons) as a single field.  Since these approaches are so similar, it should be a simple task to map the VertNet format to GenBank.  Harvesting all records from GenBank whose institutions map to VertNet institutions and which also have a value in the specimen_voucher field gives us over 38,000 records.  VertNet itself held over 1.4 million records at the time we harvested the data.  Since the institutions providing data to GenBank are the same ones providing data to VertNet, we would expect the GenBank records carrying voucher information to match well against VertNet.
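In code, the mapping looks deceptively simple.  Here is a minimal sketch in Python; the field names follow DwC terms and are illustrative, not the exact VertNet or GenBank schemas:

```python
def vertnet_triplet(record):
    """VertNet stores the three parts of the triplet in separate fields."""
    return ":".join([record["institutionCode"],
                     record["collectionCode"],
                     record["catalogNumber"]])

def genbank_triplet(specimen_voucher):
    """GenBank's specimen_voucher qualifier should already hold the full
    "inst:coll:catalog" string, per the INSDC recommendation."""
    return specimen_voucher.strip()

# In principle, vertnet_triplet(vn_record) == genbank_triplet(gb_voucher)
# whenever both records refer to the same museum specimen.
```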

Instead, on our first pass, we found only 483 matches using the DwC triplet method of linking.  That is about 1% of the matches we would expect!  If we toss the collection code field and match only on institution code plus catalog number, we get 2,351 matches (a 6% match rate).  Although we need to look a little more closely, tossing the collection code does not seem to cause collisions between collections within an institution (though in other settings it could lead to false positive links).  Combining the removal of the collection code with a suite of data parsing tools increases this a bit further, to 3,153 matches (an 8% match rate).  This is still a dismal result.
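For illustration, here is roughly what the two matching passes amount to.  This is a sketch only; the real pipeline includes harvesting and a suite of parsing tools, and the inputs here are idealized "inst:coll:catalog" strings:

```python
def relaxed_key(triplet):
    # Drop the collection code; keep institution code + catalog number.
    parts = triplet.split(":")
    return (parts[0], parts[-1])

def count_matches(genbank_vouchers, vertnet_triplets):
    vn_full = set(vertnet_triplets)
    vn_relaxed = {relaxed_key(t) for t in vertnet_triplets}
    full_hits = sum(1 for v in genbank_vouchers if v in vn_full)
    relaxed_hits = sum(1 for v in genbank_vouchers
                       if relaxed_key(v) in vn_relaxed)
    return full_hits, relaxed_hits   # pass 1 and pass 2 totals
```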

The primary reason for the low match rate is that GenBank will accept any value in the specimen_voucher qualifier field, and providers respond in kind, inserting whatever value they choose.  Consequently, the field values are notoriously noisy and difficult to parse.  The VertNet data were of much higher quality for two reasons: 1) records were curated with distinct values for institution, collection, and catalog number, with clear instructions on the recommended values for each field, and 2) VertNet has a process of data cleaning and customer support for its data publishers.  Clearly, the DwC triplet method suffers from serious issues in practice, and we need a better strategy if we're going to take identifiers seriously.

Are there other options for unique identifiers?  Let's look at the identifier options currently in play and the ramifications of each; but first, let's consider what RDF tells us about identifiers.  Currently, the TDWG-RDF group is discussing what constitutes a valid URI for linking data on the semantic web.  The only hard-and-fast requirement in RDF is that the identifier must be an HTTP URI.  After all, this is the semantic web, built on the world-wide web, which uses the HTTP protocol to transfer data, so what can go wrong here?  Nothing, except that we must have persistence if we want identifiers to remain linkable in the future, and simply being an HTTP URI says nothing about persistence.  It may be available today and next month, but what about in a year?  In 10 years?  In 50 years?  Will machine negotiation even happen over HTTP in 50 years?  There are some workarounds for long-term persistence, such as casting the identifier as a literal or using proxies that point to an HTTP resolver for identifiers.  However, it's clear that RDF by itself does not answer our need for identifier persistence.  We need more specialized techniques.

So.  Some strategies:

DwC Triplets:  We've covered this strategy and its drawbacks above.  DwC triplets are not guaranteed to be globally unique, and they encode metadata into the identifier itself, which is bad practice and leads to persistence problems down the road.  Worse, they are not resolvable, and they can be constructed in various, slightly different ways, leading to matching problems.

LSIDs:  LSIDs (http://en.wikipedia.org/wiki/LSID) have not solved the persistence question either, and their resolvers are built on goodwill and volunteer effort.  More backbone is needed to make these strong persistent identifiers; for example, requiring identifiers to be resolvable rather than merely recommending resolution.

UUIDs:  Programmers love UUIDs (http://en.wikipedia.org/wiki/Universally_unique_identifier) since they can be created instantly, are globally unique for all practical purposes, and can serve directly as database keys.  By themselves, however, they tell us nothing about where to resolve them: a vanilla UUID sitting in the wild says practically nothing about the thing it represents (see the sketch after this list).  Solutions built on UUIDs can be a great option, as long as there is a plan for resolution, which usually requires another solution to be implemented alongside them.

DOIs:  DOIs (http://www.doi.org/) were designed for publications, come with built-in metadata protocols, and are used the world over by many, in fact most, publishers.  There is an organization behind them, the International DOI Foundation, whose mission is geared toward long-term persistence, and a network of resolvers that can resolve any officially minted DOI.  DOIs are available at minimal cost through DataCite or CrossRef.

EZIDs:  EZIDs (http://n2t.net/ezid/) support Archival Resource Keys (ARKs) and DOIs through DataCite.  By registering with EZID you can mint up to 1 million identifiers per year at a reasonable, fixed subscription cost.  EZIDs are backed by the California Digital Library, which not only helps assure persistence but also provides useful services that are hard to build into homebrew resolvers.

BCIDs:  BCIDs (e.g., see http://biscicol.org/bcid/) are an extension of EZIDs that use a hierarchical approach (a technique called suffixPassthrough) to resolve both dataset- and record-level entries.  Since identifier registration is done for groups and extended with locally unique suffixes, BCIDs enable rapid assignment of identifiers keyed to local databases while offering global resolution and persistence, as sketched below.  With this solution, we also sidestep the 1 million identifiers per year limit.
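To make the UUID and BCID ideas concrete, here is a minimal sketch in Python.  The group ARK is hypothetical, and the composition rule is our reading of suffix pass-through, not code from the BCID service itself:

```python
import uuid

# A vanilla UUID: globally unique for all practical purposes, but by
# itself it says nothing about what it names or where to resolve it.
local_suffix = uuid.uuid4()

# A BCID couples a group-level identifier (registered once with EZID;
# the ARK below is hypothetical) with that locally unique suffix.
# EZID's suffix pass-through hands the suffix back to the BCID resolver.
GROUP_ARK = "ark:/21547/R2"
element_id = f"{GROUP_ARK}_{local_suffix}"

# Resolution goes through the Name-to-Thing resolver:
print(f"http://n2t.net/{element_id}")
```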

We conclude by noting that seemingly every aggregator wants to mint its own flavor of GUIDs, perhaps as much to "brand" an identifier space as for any other reason.  We question whether proliferating such spaces is a great idea.  A huge advantage of DOIs and EZIDs is abstraction: you know what they mean and how to resolve them because they are well known and backed by organizations whose specific mission is to support identifier creation.  That strategy ensures identifiers can persist and resolve well into the future, and be recognizable not just within the biodiversity informatics community but in any other community we interoperate with: genomics, publishing, ecology, earth sciences.  This is what we mean when we talk about breaking down walled gardens.

-John Deck, Rob Guralnick, Nico Cellinese, and Tom Conlin

Monday, May 20, 2013

Sneak Peeks, BiSciCol Style

Our blog has been quiet lately, as we coded and tested and waited out the cold, short days of winter and early Spring.  With Spring now firmly here, we are ready to give you the opportunity to directly test some fruits of that labor.  First, a quick review of where we have been.  BiSciCol, like everyone interested in bringing biodiversity data into the semantic web, has been plagued by a chicken-and-egg problem.  For the semantic web to be a sensible solution, there needs to be a way to associate permanent, resolvable, globally unique identifiers with specimens and their metadata.  There ALSO needs to be a community-agreed semantic framework for expressing concepts and how they link together.  You can't move forward without BOTH pieces, and unfortunately the biodiversity community has had essentially neither.  So BiSciCol decided to tackle both problems simultaneously.

The solution we developed leverages one thing that was already in place: a community-developed and agreed-upon biodiversity metadata standard called Darwin Core.  In our last blog post we talked about how we have leveraged Darwin Core, formalizing its "categories" (or classes) and deriving relationships between them.  With this piece of the puzzle complete, we now have a working tool called the Triplifier.  The Triplifier takes a Darwin Core Archive, which contains self-describing metadata about the document along with the data themselves, and converts those data to RDF.  Darwin Core Archives are particularly useful because all the data in such archives are already in a standard form.
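To give a flavor of the output, here is a hand-rolled sketch of a single triplified occurrence record, assuming the rdflib library; the Triplifier itself does far more (it reads the entire archive and applies our class and relation model), and the occurrence URI below is made up:

```python
from rdflib import Graph, Literal, Namespace, URIRef

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")
g = Graph()

occ = URIRef("http://example.org/occurrence/12345")  # hypothetical identifier
g.add((occ, DWC.institutionCode, Literal("MVZ")))
g.add((occ, DWC.catalogNumber, Literal("12345")))
g.add((occ, DWC.scientificName, Literal("Peromyscus maniculatus")))

print(g.serialize(format="turtle"))
```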

Darwin Core Archives are available for download from sources such as the VertNet IPT (http://ipt.vertnet.org) or the Canadensys IPT (http://data.canadensys.net/ipt/).  Just download any Darwin Core Archive you want and load the archive zip file into the Triplifier (not yet deployed to production, but you can try the development server here: http://geomuseblade.colorado.edu/triplifier/ ) via the "File Upload:" link, click the "auto-generate project for" link, and select Darwin Core Archive.  Load the file, review the class and property structures, and then click "Get Triples" at the very end.  You should then be able to save the RDF.  For more information on how the DwC Archive Reader plugin works, see the related JavaDoc page.

So what does this all mean?  First, this is a working tool for creating Darwin Core data in RDF format.  It may not be perfect yet, but it has been stress-tested, and it does the job.  This is a big step forward, in our opinion.  We are currently Triplifying a lot of Darwin Core Archives and putting the results into a data store for querying.  In the next blog post, we'll explain how valuable this can be, especially when looking for digital objects linked to specimens, such as published literature or gene sequences.

The other part of the chicken-and-egg problem is the persistent, and challenging, GUID problem.  Here we also have a working prototype of a service we are calling BCIDs, a form of identifier that is scalable, persistent, and leverages community standards.  BCIDs are a form of EZIDs with a couple of small tweaks to work for our community at scale, and they represent a lot of hard thinking by John Kunze and John Deck.  Here is the general idea: the BCID system resolves identifiers that are passed through the Name-to-Thing resolver (http://n2t.net/).  All BCID group identifiers are registered with EZID and describe related categories of information such as Collecting Event, Occurrence, or Tissue.  EZID then uses its suffix pass-through feature to hand the suffix back to the BCID resolver.  At this point, a series of decisions is made, based on the identifier syntax, to determine how to display the returned content.  Element-level identifiers whose suffixes are registered in the BCID system, and which have targets, resolve to a user-specified homepage.  Unregistered suffixes, identifiers with no defined target, or requests that specifically ask for machine resolution return an HTML rendering of the identifier with embedded RDF/XML describing it.  Machine resolution can be requested for any identifier by appending a "?" to it.  See the diagram below for extra clarity, and check out the BCID homepage and BCID code page.


How does this all work in practice?  Suppose we have the group ID ark:/21547/Et2 (resource=dwc:Event) and do not register any elements.  If someone passes in a resolution request for ark:/21547/Et2_UUID, the system will still tell you that this is some event (dwc:Event), the date it was loaded, a title, and whether there is a DOI/ARK associated with it.  Now, suppose we decide to register the UUIDs associated with ark:/21547/Et2 and also provide web pages with HTML content to look at (targets); then we can show a nicely formatted, human-readable page for the collecting event itself.  But what if we're a machine, and we don't want to wade through style sheets and extraneous, difficult-to-parse text; rather, we just want to know when the record was loaded and its resourceType (regardless of whether there is a target)?  This is where the "?" comes in: if a "?" is appended to the end of the ARK, as in ark:/21547/Et2_UUID?, then we automatically get RDF/XML.  Minimalist, but predictable, and a convention currently in use for EZIDs.
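Put as a decision procedure, our reading of the resolution rules looks roughly like this; a sketch, not the BCID resolver's actual code:

```python
def resolve(identifier, registered, targets):
    """registered: set of known element identifiers;
    targets: mapping of identifier -> homepage URL."""
    wants_machine = identifier.endswith("?")
    ark = identifier.rstrip("?")

    if not wants_machine and ark in registered and ark in targets:
        return ("redirect", targets[ark])   # human-readable target page
    # Unregistered suffix, no target, or an explicit "?" request:
    return ("metadata", ark)                # HTML with embedded RDF/XML
```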

Soon you will be able to call the BCID service for any dataset, whether it's in RDF format or not.  For datasets, one can register an ARK or DOI and associated metadata; for more granular elements, BCIDs will help assign the pass-through suffixes.  We think this represents an elegant system for dealing with the very challenging problem of GUIDs in the biodiversity informatics community: it leverages existing tools and communities, and creates the new ones needed for those involved in biocollections.  If you want to try creating and using BCIDs now, talk to us and we'll work with you to get started.

We will be presenting more about BiSciCol at meetings this Summer, at iEvoBio (http://ievobio.org/) and TDWG (http://www.tdwg.org/conference2013), showing off what amounts to solutions covering both those chickens and eggs.  In the next post we'll finally link all of this up and show how it can be used for some neat discoveries.  Before winding down, BiSciCol owes a gigantic thanks to Brian Stucky, who has put in a tremendous amount of effort developing the Triplifier.  He is off in Panama working on his dissertation research and will be teaching classes next Fall.  We couldn't have come nearly as far as we have without him.

- Rob Guralnick, Nico Cellinese, Tom Conlin, John Deck, and Brian Stucky

Tuesday, March 12, 2013

BiSciCol, Triples, and Darwin Core

A big part of what we want to accomplish with BiSciCol is supporting biodiversity collections data from lots of different sources.  These data are often organized using a standard called "Darwin Core" (DwC), and Darwin Core-based data are commonly transmitted in a specific format known as a "Darwin Core Archive" (DwCA).  So recently, we've been devoting a lot of thought and effort to figuring out how we can best support DwC and DwCAs in BiSciCol and the Triplifier.  (The "Triplifier" is a tool we are building to make it easy to convert traditional, tabular data formats into RDF triples for use in BiSciCol and the Semantic Web; DwCAs are just such a format.)  Representing DwC data as RDF triples and "triplifying" DwCAs presented a number of challenges, and in this post we want to discuss one of them: figuring out how to use our relations terms to capture the connections found in DwC data.

Darwin Core includes six discrete categories of information: Occurrence, Event, dcterms:Location, GeologicalContext, Identification, and Taxon.  DwC does not, however, formally describe the relationships between these categories.  Formally defining the relationships that join categories, or classes, of information is common practice in standards development, but DwC's developers deliberately chose not to do this in order to keep the standard as flexible as possible.

Before proceeding, we should note that in the previous paragraph we were careful to distinguish between the words "class" and "category."  "Class" is a special word typically used to describe categories of information in a formal ontology (which DwC is not).  However, since we're describing a method for working toward formalizing DwC content, we'll use the word "class" hereafter to refer to both the formal model and the original DwC categories.

So, to represent DwC data as RDF triples, we needed a way to relate DwC class instances to one another.  This sounds fancy, but it's really a matter of using a common-sense approach to describe relationships between entities, much as people have been doing with relational databases for decades.  In fact, the darwin-sw project has already developed a complete ontology for representing DwC data in the Semantic Web.  However, because BiSciCol is limited to a small set of generic relations terms, we needed a new approach for handling DwC data.  Plus, by building on the core BiSciCol relations, such a solution could easily include not just DwC, but concepts from other domains such as media, biological samples, genetic material, and the environment.

To make this all a bit more concrete, let's take a look at an example.  Suppose we have a single instance of Occurrence (a specimen in a collection, say) that originated from a particular collecting expedition, which is represented in DwC as an instance of the Event class.  Using RDF and BiSciCol's relations predicates, how should we make the required connection between the Occurrence instance and the Event instance?  More generally, how should the six core DwC classes be related to one another using BiSciCol's relations terms?


The image above illustrates our answer to this question.  Recall that we are using only four relations predicates in BiSciCol: derives_from, depends_on, alias_of, and related_to (see the previous post for much more information).  The diagram should be fairly self-explanatory.  Some relationships are naturally described by depends_on.  For example, an Identification can only exist if there is an Occurrence (e.g., a specimen) to identify and a Taxon to identify it as.  On the other hand, a GeologicalContext gives us information about a collecting Event, but in at least some sense, the collecting event is independent of the geological context.  Thus, the relationship between these two instances is described by related_to.
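Written out as explicit triples (with rdflib, and a made-up bsc: namespace URI, since the text does not give one), the two relationships just described look like this:

```python
from rdflib import Graph, Namespace, URIRef

BSC = Namespace("http://example.org/biscicol/")  # hypothetical namespace
g = Graph()

# Hypothetical instance URIs for the DwC class instances:
identification = URIRef("http://example.org/id/ident1")
occurrence = URIRef("http://example.org/id/occ1")
taxon = URIRef("http://example.org/id/taxon1")
event = URIRef("http://example.org/id/event1")
geo_context = URIRef("http://example.org/id/geo1")

# An Identification can only exist given an Occurrence and a Taxon:
g.add((identification, BSC.depends_on, occurrence))
g.add((identification, BSC.depends_on, taxon))
# A GeologicalContext informs, but is independent of, a collecting Event:
g.add((event, BSC.related_to, geo_context))
```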

So far so good, but when dealing with real data this solution turns out to be insufficient, because DwC data sets often do not include all six core classes.  What should we do if a data set includes Occurrence and Taxon, but not Identification?  This scenario is not uncommon, so to cover all the possibilities we added a few more relations to handle the cases where a class (either Identification or Event) that acts as a bridge connecting Occurrence to other classes is missing.  The following diagram illustrates the complete set of relations, with the dashed, gray lines representing the relations used when Identification or Event is missing; a corresponding sketch follows below.
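Continuing the rdflib sketch from above (same graph, namespace, and nodes), the fallback links might look like this; note that the text does not say which predicate the dashed relations use, so related_to here is only an illustrative guess:

```python
location = URIRef("http://example.org/id/loc1")  # hypothetical dcterms:Location

g.add((occurrence, BSC.related_to, taxon))     # used if Identification is missing
g.add((occurrence, BSC.related_to, location))  # used if Event is missing
```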

And that's it!  With this set of eight relationship triples, we should be able to handle all possible combinations of the six core DwC classes.

- Brian Stucky, John Deck, Rob Guralnick, and Tom Conlin