Saturday, August 25, 2012

News Update: how do we 'GUID'?

The BiSciCol project team has had a busy summer that included a presentation at the Annual Meeting of the Society for the Preservation of Natural History Collections (SPNHC) in New Haven, CT, and a presentation at the iEvoBio 2012 meeting in Ottawa, Canada.  Additionally, on 13-15 August, John Deck, Nico Cellinese, Rob Guralnick and Neil Davies convened at the University of California, Berkeley, to meet with a few key partners and discuss the next steps for the project (meeting summary).  Before we report more about the meeting with our partners, here is some background information.

BiSciCol's main goal is to break down the walled gardens between databases storing different kinds of biodiversity data, such as specimens or samples from collecting events and the sequences, images, etc. generated from those specimens or samples.  Doing so requires overcoming two separate community challenges.  First, there must be a mechanism to associate globally unique identifiers (GUIDs) with collections records (note that we are using the RSS specification's GUID definition).  Second, the collections records must be expressed such that the terms used to define those records and their relationships are well understood by both humans and computers.  This brings us into the “semantic web” and RDF “triples”.

As BiSciCol has evolved, two key questions related to these challenges have emerged.  The first is whether assigning GUIDs and creating "triples" should happen at the level of individual provider databases, or instead at the level of "aggregators" that enforce a standardized schema and encoding.  In the case of biological collections, an example of a standardized schema is Darwin Core, usually encoded into a Darwin Core Archive.  Example aggregators are GBIF, VertNet and Map of Life.  The second question is equally thorny and deals primarily with the content that the identifier describes: is it a physical object or a digital surrogate of a physical object, and, if a surrogate, is it the primary digital surrogate or a copy?  An example is the specimen metadata attached to a photo record in Morphbank: the photo record contains a copy of the specimen metadata, which in turn references a physical object.

So, let's turn back to the meeting in Berkeley.  That meeting included two key partners with whom we want to further develop and test ways forward given the two huge questions above.  We spent part of the time with the California Digital Library (CDL) folks, who have built a set of excellent tools that may be part of the solution to the problem of GUID assignment.  CDL has developed EZIDs, which are flexible GUIDs built off of DOIs and ARKs.  John Kunze from CDL gave a great rundown on EZIDs and how they work, and was kind enough to meet us again on a couple of separate occasions, formal and informal.  Metadata encoding in the EZID itself may also be used to indicate use restrictions and provenance (John Kunze’s PowerPoint presentation on EZIDs).

The other key partner with whom we met was VertNet, represented by Aaron Steele, the lead systems architect on the project.  The idea behind meeting with VertNet was to test how we might do EZID assignment and triplification using the same approach by which VertNet data is being processed from Darwin Core Archives into a set of tables that can be visualized, queried and replicated.  Aaron was kind enough to participate in our hackathon and start up this process.  We set up a readme file about the hackathon to describe our expected outputs.  Yes, the project is called "Bombus", which reflects the fact that, although a bit wobbly, our goal is to have data flying around to "pollinate" other data.  Happily, the hackathon was very much a success!  We were able to tap into some existing code generated by Matt Jones (NCEAS) to mint EZIDs and VOILA, we had an output file ready for the semantic web (i.e. an output file that shows relationships between occurrences, localities and taxa based on the EZIDs).  We weren't quite able to get to the last step of querying the results, but we're very close.  More work (and more reports) will follow on this, so stay tuned on the Bombus/pollinator link above.
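To give a flavor of what "an output file ready for the semantic web" can look like, here is a minimal sketch in Python.  It is not the Bombus code: the predicate names, the `example.org` namespace and the identifiers are placeholders we made up for illustration, and in practice the identifiers would be minted EZIDs.  It simply turns rows that already carry identifiers into N-Triples lines linking each occurrence to its locality and taxon.

```python
# Illustrative sketch only: predicates and identifiers are hypothetical placeholders,
# not the vocabulary or the EZIDs used in the hackathon.

def to_ntriples(rows, predicate_base="http://example.org/terms/"):
    """Emit N-Triples lines relating each occurrence to its locality and taxon."""
    lines = []
    for row in rows:
        occurrence = row["occurrence_id"]
        # One triple per relationship: <subject> <predicate> <object> .
        lines.append(f"<{occurrence}> <{predicate_base}locatedAt> <{row['locality_id']}> .")
        lines.append(f"<{occurrence}> <{predicate_base}identifiedAs> <{row['taxon_id']}> .")
    return lines


rows = [
    {
        "occurrence_id": "http://example.org/id/occ1",
        "locality_id": "http://example.org/id/loc1",
        "taxon_id": "http://example.org/id/tax1",
    },
]

for line in to_ntriples(rows):
    print(line)
```

Because localities and taxa get their own identifiers, two occurrences that point at the same locality identifier become linked in the resulting graph without any extra work.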

We have been testing a variety of solutions for identifier assignment, including: supporting user-supplied GUIDs, aggregator GUIDs, dataset DOIs, community standard identifiers (e.g. DwC Triplet), and creating QUIDs (Quasi Unique Identifiers) from hashed content.  EZID technology will play a significant role in the implementation of a number of these approaches.  None of these approaches offers a complete solution, but taken together, they let us begin to build an intelligent system that can provide valuable services to aggregators, data providers, and users.  Services we will be supporting include: GUID tracking, identifying use restrictions, and GUID reappropriation.  Integrating our existing Triplifier and BiSciCol Java codebases with a scalable database back-end will fulfill most of the technical requirements.
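Two of the options above are easy to sketch.  The Python below is an illustration under our own assumptions, not BiSciCol's implementation: it builds a DwC Triplet as `institutionCode:collectionCode:catalogNumber` and derives a QUID by hashing a record's content.  The function names, the `quid:sha256:` prefix, the field separator and the choice of SHA-256 are all hypothetical.

```python
import hashlib

# Illustrative sketch only: function names, the "quid:sha256:" prefix and the
# normalization rules are assumptions, not BiSciCol's actual scheme.

def dwc_triplet(record):
    """Join the three Darwin Core terms commonly used as a community identifier."""
    return ":".join(
        str(record[key]).strip()
        for key in ("institutionCode", "collectionCode", "catalogNumber")
    )


def make_quid(record, fields=None):
    """Derive a quasi unique identifier by hashing a record's content."""
    # Sort the field names so the same content always hashes the same way,
    # regardless of the order in which the fields arrive.
    keys = sorted(fields if fields is not None else record.keys())
    canonical = "\x1f".join(f"{key}={str(record.get(key, '')).strip()}" for key in keys)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "quid:sha256:" + digest


record = {"institutionCode": "MVZ", "collectionCode": "Herp", "catalogNumber": "12345"}
print(dwc_triplet(record))
print(make_quid(record))
```

The trade-off is the reason for the "quasi" in QUID: identical content always yields the same identifier, which helps spot copies of a record, but any edit to the hashed fields yields a new one, so a content hash cannot track a record through corrections by itself.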

We are still building our Triplifier to support those who want to take their own datasets and bring them into the semantic web framework, but BiSciCol can operate much more "at scale" with a very simple interface that accepts Darwin Core Archives or other standardized data, such as those generated from Barcode of Life, Morphbank, or iDigBio, and assembles these into a shared triplestore or set of commonly accessible triplestores.  We think the issues we're tackling right now are at the sociotechnical heart of BiSciCol.  We use the term "heart" knowingly: it is the desire and will of the community, along with resources such as BiSciCol that can help motivate and excite, that will get us at least moving in the right direction.  If you have any thoughts, criticisms, or suggestions, we'd of course love to hear them.


John Deck, Rob Guralnick and Nico Cellinese