Thursday, August 22, 2013

A sequence, a specimen and an identifier walk into a bar ….

As biodiversity scientists, we rely on a breadth of data living in various domain-specific databases to assemble knowledge on particular ecoregions, taxa, or populations.  We visit genetic and genomic, morphological, phylogenetic, image, and specimen databases in search of up-to-date information to answer our questions.  What we find, in general, is a morass of data, disconnected between systems, as if each database were a walled garden.  The main goal of BiSciCol is linking digital objects and datasets, breaking down these walled gardens, and building a true network of biodiversity data where we can ask questions across domains and follow object instances wherever they are referenced.

How do we enable linking data on a global scale, across domains? For BiSciCol, this comes down to two approaches: 1) build better ontologies and vocabularies so we can use the same language when talking about the same thing, and 2) enable identifiers that are robust enough to persist over time and allow linking across walled gardens.  For this blog post, we'll focus on the issues we see with identifiers as they are currently used in practice, and on how they can be improved to enable better linking.

To provide a point of reference for our discussion, let's look at two databases that contain overlapping references to the same objects: VertNet and GenBank.  VertNet is a project that aggregates vertebrate specimen data housed in museum collections, and GenBank is a project that houses sequences, often with references to specimen objects housed in those same museums that are part of VertNet.  For our exercise in linking, we'll use a popular method for identifying museum objects, the Darwin Core (DwC) triplet: an institution code, a collection code, and a catalog number, separated by colons.

In the VertNet database, the DwC triplet fields are stored separately, and the triplet can be constructed programmatically by joining the field values with colons. The INSDC standard, which GenBank adopts, specifies that the specimen_voucher qualifier value should contain the full DwC triplet: the institution code, collection code, and catalog number separated by colons as one field.  Since these approaches are very similar, it should be a simple task to map the VertNet format to GenBank.  Harvesting all records from GenBank whose institutions map to VertNet institutions, and which also have a value in the specimen_voucher field, gives us over 38,000 records.  VertNet itself contained over 1.4 million records at the time we harvested the data, all of which we have access to.  We would expect the GenBank records containing voucher specimen information to match well with VertNet data, since the institutions providing data to GenBank are the same ones providing data to VertNet.
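To make the comparison concrete, here is a minimal sketch, in Python, of how a triplet might be assembled from VertNet-style fields and compared against a GenBank specimen_voucher string. The field names, record, and voucher value below are hypothetical placeholders, not actual harvested data.

```python
# A minimal sketch (hypothetical field names and values) of building a DwC
# triplet from VertNet-style fields and comparing it to a GenBank
# specimen_voucher string.

def dwc_triplet(institution_code, collection_code, catalog_number):
    """Join the three Darwin Core fields with colons, e.g. 'MVZ:Mamm:12345'."""
    return ":".join(part.strip() for part in
                    (institution_code, collection_code, str(catalog_number)))

# Hypothetical VertNet record, with the three fields stored separately
vertnet_record = {"institutioncode": "MVZ",
                  "collectioncode": "Mamm",
                  "catalognumber": "12345"}

# Hypothetical GenBank record, with the triplet stored as one specimen_voucher value
genbank_voucher = "MVZ:Mamm:12345"

triplet = dwc_triplet(vertnet_record["institutioncode"],
                      vertnet_record["collectioncode"],
                      vertnet_record["catalognumber"])

# Matches only when both sides follow the convention exactly
print(triplet == genbank_voucher)
```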

In fact, on our first pass we found only 483 matches using the DwC triplet method of linking.  That is roughly 1% of the number of matches we would expect!  If we toss the collection code field and match only on institution code plus catalog number, we get 2,351 matches (a 6% match rate).  Although we need to look a little more closely, tossing the collection code does not seem to cause collisions between collections within an institution (though in other settings it could lead to false positive links).  If we combine removing the collection code with a suite of data parsing tools, we can increase this a bit further to 3,153 matches (an 8% match rate).  This is still a dismal match rate.
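For illustration, here is a rough sketch of the two matching strategies (full triplet versus institution code plus catalog number), assuming each database has already been reduced to (institution, collection, catalog) tuples. The toy records are hypothetical and chosen only to show how dropping the collection code can recover a match.

```python
# A rough sketch of the two matching strategies, assuming each database has
# already been reduced to (institution, collection, catalog) tuples.

def match_counts(vertnet_keys, genbank_keys):
    """Count full-triplet matches and looser institution + catalog number matches."""
    full = len(vertnet_keys & genbank_keys)
    drop_collection = lambda keys: {(inst, cat) for inst, _, cat in keys}
    loose = len(drop_collection(vertnet_keys) & drop_collection(genbank_keys))
    return full, loose

# Hypothetical toy data: the collection codes disagree, so only the loose match hits
vertnet = {("MVZ", "Mamm", "12345")}
genbank = {("MVZ", "Mammals", "12345")}
print(match_counts(vertnet, genbank))  # (0, 1)
```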

The primary reason for the low match rate is that GenBank will accept any value in the specimen_voucher qualifier field, and providers respond in kind, inserting whatever value they choose. Consequently, field values are notoriously noisy and difficult to parse.  VertNet data were of much higher quality for two reasons: 1) records were curated into distinct values for institution, collection, and catalog_number, with clear instructions for the recommended values of each field, and 2) there is a process of data cleaning and customer support for VertNet data publishers.  Either way, we can see that the DwC triplet method suffers from some serious issues when used in practice. Clearly we need a better strategy if we're going to take identifiers seriously.
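To give a flavor of the parsing involved, here is a hedged sketch of the kind of normalization one might apply to noisy specimen_voucher strings. The regular expression and example values are illustrative only, not our actual cleaning rules.

```python
import re

def normalize_voucher(value):
    """Try to pull an (institution, catalog_number) pair out of a free-text voucher."""
    value = value.strip()
    # Accept forms like "MVZ:Mamm:12345", "MVZ 12345", or "MVZ #12345";
    # anything else is left for manual review.
    m = re.match(r"^([A-Za-z]+)[\s:#]+(?:[A-Za-z]+[\s:#]+)?0*(\d+)$", value)
    if m:
        return m.group(1).upper(), m.group(2)
    return None

for raw in ["MVZ:Mamm:12345", "mvz 12345", "field number 27A"]:
    print(raw, "->", normalize_voucher(raw))
```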

Are there other options for unique identifiers?  Let's look at the identifier options currently in play, and the ramifications of each; but first, let's look at what RDF tells us about identifiers.  Currently, the TDWG-RDF group is discussing what constitutes a valid URI for linking data on the semantic web. The only hard and fast recommendation in RDF is that the identifier must be an HTTP URI.  After all, this is the semantic web, built on the world-wide web, which uses the HTTP protocol to transfer data, so what can go wrong here?  Nothing, except that we must have persistence if we want identifiers to remain linkable in the future, and simply being an HTTP URI says nothing about persistence. It may be available today and next month, but what about in a year? In 10 years? In 50 years?  Will machine negotiation even happen over HTTP in 50 years? There are some work-arounds to ensure long-term persistence, such as casting the identifier as a literal or using a proxy that points to an HTTP resolver for the identifier.  However, it's clear that RDF by itself does not answer our need for identifier persistence.  We need more specialized techniques.
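As a small illustration of those two work-arounds, here is a sketch using Python's rdflib: one triple keeps the non-HTTP identifier as a literal value of dwc:occurrenceID, and another attaches a proxied HTTP form via owl:sameAs. Every URI below is a hypothetical placeholder, and this is just one way the linkage could be expressed.

```python
from rdflib import Graph, Literal, Namespace, URIRef

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")
OWL = Namespace("http://www.w3.org/2002/07/owl#")

g = Graph()
specimen = URIRef("http://example.org/specimens/42")  # hypothetical record URI

# Work-around 1: carry the original non-HTTP identifier as a plain literal value
g.add((specimen, DWC.occurrenceID, Literal("urn:lsid:example.org:specimens:42")))

# Work-around 2: expose an HTTP URI by routing the identifier through a proxy resolver
g.add((specimen, OWL.sameAs,
       URIRef("http://proxy.example.org/urn:lsid:example.org:specimens:42")))

print(g.serialize(format="turtle"))
```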

So.  Some strategies:

DwC triplets:  We've talked about this strategy above and about some of its drawbacks.  They are not guaranteed to be globally unique, and they encode metadata into the identifier itself, which is bad practice and leads to persistence problems down the road. Worse, they are not resolvable, and they can be constructed in various, slightly different ways, leading to matching problems.

LSIDs: LSIDs (http://en.wikipedia.org/wiki/LSID) have not solved the persistence question either, and resolvers are built on good will and volunteer effort.  More backbone needs to be provided to make these strong persistent identifiers, for example by requiring identifiers to be resolvable rather than merely recommending resolution.

UUIDs: Programmers love UUIDs (http://en.wikipedia.org/wiki/Universally_unique_identifier) since they can be created instantly, are globally unique for all practical purposes, and can be used directly as database keys. However, by themselves they tell us nothing about where to resolve them.  A vanilla UUID sitting in the wild says practically nothing about the thing it represents.  Solutions built on UUIDs can be a great option, as long as there is a plan for resolution, which usually requires another solution to be implemented alongside them.
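For example (a trivial sketch; the resolver prefix is hypothetical), minting a UUID takes one call, but turning it into something resolvable is a separate decision entirely:

```python
import uuid

# Minting a UUID is instant and needs no central registration ...
record_id = uuid.uuid4()
print(record_id)

# ... but the value alone carries no hint of where to resolve it; any resolver
# prefix (a hypothetical one here) has to be agreed on and maintained separately.
print("http://example.org/id/%s" % record_id)
```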

DOIs: DOIs (http://www.doi.org/) were designed for publications, come with built-in metadata protocols, and are used the world over by many, in fact most, publishers.  There is an organization behind them, the International DOI Foundation, whose mission is geared towards long-term persistence.  There is a network of resolvers that can resolve any officially minted DOI, and DOIs are available at minimal cost through DataCite or CrossRef.

EZIDs: EZIDs (http://n2t.net/ezid/) support Archival Resource Keys (ARKs) and DOIs through DataCite.  By registering with EZID you can mint up to 1 million identifiers per year at a fixed rate, and subscription costs are reasonable.  EZID is supported by the California Digital Library, which not only helps assure persistence but also provides useful services that are hard to build into homebrew resolvers.

BCIDs: BCIDs (e.g., see http://biscicol.org/bcid/) are an extension of EZIDs that use a hierarchical approach (a technique called suffixPassthrough) to resolve simultaneously to dataset- and record-level entries.  Because identifiers are registered for groups and then extended with locally unique suffixes, BCIDs enable rapid assignment of identifiers that are keyed to local databases while offering global resolution and persistence (see the sketch below).  With this solution, we can also sidestep the 1 million identifiers per year limit.
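Here is a conceptual sketch of the suffixPassthrough idea: one registered group-level identifier, extended locally with unique suffixes. The ARK shoulder and catalog numbers below are hypothetical placeholders, not real BCIDs.

```python
# A conceptual sketch of suffixPassthrough: one registered group identifier,
# extended locally with unique suffixes.  The ARK below is a hypothetical
# (test-style) shoulder, not a real BCID.
group_ark = "ark:/99999/fk4"
local_catalog_numbers = ["12345", "12346", "12347"]

# Record-level identifiers are just the group ARK plus a local suffix, so they
# can be generated as fast as local database keys while still resolving
# through the registered group identifier.
record_ids = ["%s/%s" % (group_ark, n) for n in local_catalog_numbers]
for rid in record_ids:
    print("http://n2t.net/" + rid)  # resolution goes through the N2T resolver
```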

We conclude by noting that each aggregator out there seems to want to mint its own flavor of GUIDs, perhaps as much to “brand” an identifier space as for any other reason.  We wonder whether this strategy of proliferating identifier spaces is a great idea.  A huge advantage of DOIs and EZIDs is abstraction: you know what they mean and how to resolve them because they are well known and backed by organizations with specific missions to support identifier creation.  That strategy ensures that identifiers can persist and resolve well into the future, and that they are recognizable not just within the biodiversity informatics community but also in any other community we interoperate with: genomics, publishing, ecology, earth sciences.  This is what we're talking about when we want to break down walled gardens.

-John Deck, Rob Guralnick, Nico Cellinese, and Tom Conlin