Monday, May 20, 2013

Sneak Peeks, BiSciCol Style



Our blog has been quiet lately as we coded, tested, and waited out the cold, short days of winter and early spring. With spring now firmly here, we are ready to give you the opportunity to directly test some fruits of that labor. First, a quick review of where we have been. BiSciCol, and all those interested in bringing biodiversity data into the semantic web, have been plagued by a chicken-and-egg problem. For the semantic web to be a sensible solution, there needs to be a way to associate permanent, resolvable, globally unique identifiers with specimens and their metadata. There ALSO needs to be a community-agreed semantic framework for expressing concepts and how they link together. You can't move forward without BOTH pieces, and unfortunately the biodiversity community has had basically neither. So BiSciCol decided to tackle both problems simultaneously.

The solution we developed leverages one thing that was already in place: a community-developed and agreed-upon biodiversity metadata standard called the Darwin Core. In our last blog post we talked about how we have leveraged the Darwin Core, formalized Darwin Core "categories" (or classes), and derived relationships between them. With this piece of the puzzle complete, we now have a working tool called the Triplifier. The Triplifier takes a Darwin Core Archive, which contains the data along with some self-describing metadata about the document, and converts those data to RDF. Darwin Core Archives are particularly useful because all the data in such archives are already in a standard form.
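
To make "converts those data to RDF" concrete, here is a minimal sketch of how one record of a Darwin Core class could be written out as N-Triples. This is not the Triplifier's code or its actual output. The function names, the subject URI, and the sample values are made up for illustration; only the Darwin Core term namespace and the rdf:type URI are real.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DWC = "http://rs.tdwg.org/dwc/terms/"  # Darwin Core term namespace


def escape_literal(value):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def row_to_ntriples(subject_uri, class_name, row):
    """Return N-Triples lines describing one record of a Darwin Core class.

    Hypothetical helper: `row` maps Darwin Core term names to cell values.
    """
    lines = [f"<{subject_uri}> <{RDF_TYPE}> <{DWC}{class_name}> ."]
    for term, value in row.items():
        if value == "":
            continue  # skip empty cells rather than emit empty literals
        lines.append(f'<{subject_uri}> <{DWC}{term}> "{escape_literal(value)}" .')
    return lines


triples = row_to_ntriples(
    "http://example.org/occurrence/1",  # made-up subject URI
    "Occurrence",
    {"catalogNumber": "1234", "scientificName": "", "basisOfRecord": "PreservedSpecimen"},
)
print("\n".join(triples))
```

Each non-empty cell becomes one triple whose predicate is the Darwin Core term for that column, which is why having the data in a standard form matters so much.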

Darwin Core Archives are available for download from sources such as the VertNet IPT (http://ipt.vertnet.org) or the Canadensys IPT (http://data.canadensys.net/ipt/). Just download any Darwin Core Archive you want and load the archive zip file into the Triplifier via the "File Upload:" link. (We have yet to deploy the Triplifier to production, but you can try out the development server here: http://geomuseblade.colorado.edu/triplifier/ .) Then click the "auto-generate project for" link and select Darwin Core Archive. Load the file, get information about class and property structures, and then click "Get Triples" at the very end. You should then be able to save the RDF. For more information on how the DwC Archive Reader plugin works, see the related JavaDoc page.
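
If you are curious what the "self-describing metadata" in an archive looks like, the sketch below reads the meta.xml descriptor from a Darwin Core Archive zip and lists which Darwin Core term each column of the core data file holds. It is our own illustration using only the Python standard library, not the DwC Archive Reader plugin's code; `describe_core` and the tiny in-memory sample archive are made up so the example runs on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the meta.xml descriptor in a Darwin Core Archive.
DWCA_NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}


def describe_core(archive):
    """Return (row type, data file name, {column index: term URI}) for the core table."""
    with zipfile.ZipFile(archive) as zf:
        root = ET.fromstring(zf.read("meta.xml"))
    core = root.find("dwca:core", DWCA_NS)
    location = core.find("dwca:files/dwca:location", DWCA_NS).text
    fields = {
        int(f.get("index")): f.get("term")
        for f in core.findall("dwca:field", DWCA_NS)
        if f.get("index") is not None  # fields with only a default value have no index
    }
    return core.get("rowType"), location, fields


# A made-up, minimal archive built in memory so the sketch is self-contained.
SAMPLE_META = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence" ignoreHeaderLines="1">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
</archive>
"""

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("meta.xml", SAMPLE_META)
    zf.writestr("occurrence.txt", "id\tcatalogNumber\tscientificName\n")
buf.seek(0)

row_type, location, fields = describe_core(buf)
print(row_type, location, fields)
```

To try it on a real download, pass the path of the zip file to `describe_core` instead of the in-memory buffer.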

So what does this all mean? First, this is a working tool for creating Darwin Core data in RDF format. It may not be perfect yet, but it's been stress-tested, and it does the job. This is a big step forward in our opinion. We are currently Triplifying a lot of Darwin Core Archives and putting all the results into a data store for querying. Next blog post, we'll explain how valuable this can be, especially when looking for digital objects linked to specimens, such as published literature or gene sequences.

The other part of the chicken-and-egg problem is the persistent, and challenging, GUID problem. Here we also have a working prototype of a service we are calling BCIDs, a form of identifier that is scalable, persistent, and leverages community standards. BCIDs are a form of EZID with a couple of small tweaks to work for our community at scale. The design represents a lot of hard thinking by John Kunze and John Deck. Here is the general idea: the BCID resolution system resolves BCID identifiers that are passed through the Name-to-Thing resolver (http://n2t.net/). All BCID group identifiers are registered with EZID and describe related categories of information such as Collecting Event, Occurrence, or Tissue. EZID then uses its suffix passthrough feature to pass the suffix back to the BCID resolver. At this point, a series of decisions is made based on the identifier syntax to determine how to display the returned content. Element-level identifiers that have registered suffixes in the BCID system and also have targets can be resolved to a user-specified homepage. Unregistered suffixes, identifiers with no defined target, and requests that specifically ask for machine resolution return an HTML rendering of the identifier with embedded RDF/XML describing the identifier. Machine resolution can be requested for any identifier by appending a "?" to the identifier. See the diagram below for extra clarity. And check out the BCID home page and BCID code page.


How does this all work in practice? Suppose we have group ID = ark:/21547/Et2 (resource=dwc:Event) and do not register any elements. Now suppose someone passes in a resolution request for ark:/21547/Et2_UUID; the system will still tell you that this is some event (dwc:Event), along with the date it was loaded, a title, and whether there is a DOI/ARK associated with it. Next, suppose we decide to register the UUIDs associated with ark:/21547/Et2 and also provide web pages with some HTML content to look at (targets). Then we can show a nicely formatted, human-readable page (HTML) for the collecting event itself. However, what if we're a machine and we don't want to look at all the style sheets and extraneous, difficult-to-parse text? We just want to know when this record was loaded and its resourceType, regardless of whether there is a target. This is where "?" comes in: if a "?" is appended to the end of the ARK, as in ark:/21547/Et2_UUID?, then we automatically get RDF/XML. This is minimalist but predictable, and it is a convention currently in use for EZIDs.
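
The decision logic described above can be sketched in a few lines. This is a toy model under stated assumptions, not the BCID service's implementation: the `Group` class, the `resolve` function, the "_" separator handling, and the example.org target are all hypothetical, and a plain dictionary stands in for the HTML page with embedded RDF/XML that the real resolver returns.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Group:
    """A registered group identifier, e.g. ark:/21547/Et2 typed as dwc:Event."""
    resource_type: str
    # Registered suffix -> target URL (None when registered without a target).
    elements: Dict[str, Optional[str]] = field(default_factory=dict)


def resolve(identifier: str, groups: Dict[str, Group]) -> Tuple[str, object]:
    """Return ("redirect", url), ("metadata", description), or ("unknown", None)."""
    machine = identifier.endswith("?")  # trailing "?" requests machine resolution
    if machine:
        identifier = identifier[:-1]
    # Pick the longest registered group identifier that prefixes the request.
    matches = [g for g in groups if identifier.startswith(g)]
    if not matches:
        return ("unknown", None)
    group_id = max(matches, key=len)
    group = groups[group_id]
    suffix = identifier[len(group_id):]
    if suffix.startswith("_"):  # assumed separator, as in ark:/21547/Et2_UUID
        suffix = suffix[1:]
    target = group.elements.get(suffix)
    if target and not machine:
        return ("redirect", target)  # registered element with a target page
    # Unregistered suffix, no target, or "?" requested: describe the identifier.
    return ("metadata", {"identifier": identifier, "resourceType": group.resource_type})


groups = {
    "ark:/21547/Et2": Group("dwc:Event", {"abc123": "http://example.org/events/abc123"}),
}
print(resolve("ark:/21547/Et2_abc123", groups))
print(resolve("ark:/21547/Et2_abc123?", groups))
print(resolve("ark:/21547/Et2_zzz", groups))
```

The first call redirects to the registered target, while the second and third both fall back to the metadata description: one because "?" was appended, the other because the suffix was never registered.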

Soon you will be able to call the BCID service for any dataset, whether it's in RDF format or not. For datasets, one can register an ARK or DOI and associated metadata, and for more granular elements, BCIDs will help assign the pass-through suffixes. We think this represents a very elegant system for dealing with the very challenging problem of GUIDs in the biodiversity informatics community. It leverages existing tools and communities, and it creates new ones needed for those involved in biocollections. If you want to try creating and using BCIDs now, talk to us and we'll work with you to get this started.

We will be presenting more about BiSciCol at meetings this summer, at iEvoBio (http://ievobio.org/) and TDWG (http://www.tdwg.org/conference2013), showing off what amounts to solutions that cover those chickens and eggs. In the next post we'll finally link all of this up and show how it can be used for some neat discoveries. Before winding down, BiSciCol owes a gigantic thanks to Brian Stucky, who has put in a tremendous amount of effort developing the Triplifier. He is off in Panama working on his dissertation research, and will be teaching classes next fall. We couldn't have come nearly as far as we have without him.

- Rob Guralnick, Nico Cellinese, Tom Conlin, John Deck, and Brian Stucky

2 comments:

  1. Folks, this is all great to read, but I still don't see a canonical specimen ID resolver and URI provider.

    http://biscicol.org/id/ark:/21547/R2_MBIO56 works, but what is the sustainability of this server? One is also left to wonder what 21547 and R2 encode, i.e., what specimen IDs for collections other than Biocode would look like. (And who will register them? Presumably a specimen whose ID isn't registered won't be resolved. That's going to take a lot of adoption - who's working on that?)

    http://n2t.net/ark:/21547/R2 after a long time stalls out at http://noid.cdlib.org/ark:/21547/R2, with an unavailable error.

    http://n2t.net/ark:/21547/R2_MBIO56 seems to result in a randomly chosen 20-50MB PDF document. Not very encouraging.

    I'm left wondering with all the talk about specimens in your RDF graphs, how in the world are you identifying them using HTTP URIs if even this basic stuff doesn't work.

    Replies
    1. Sorry for the late response on this... for some reason I wasn't notified of this comment and just stumbled on it 2 months later! If you have technical questions/comments you can contact the developers or post issues on the development site.

      At any rate, to answer your questions:

      question: What is the specimen ID resolver?
      answer: The specimen ID resolver is using CDL's name-to-thing resolver.

      question: sustainability of the biscicol.org server:
      answer: Currently suffix passthrough, a central feature of this system, only works on the BiSciCol server, which is funded by an NSF grant, and we don't expect this to be the resolver of the future. We are waiting for this feature to be put in place and live on the Name-to-Thing resolver, hosted by CDL. The BiSciCol server is there so you can test the functionality of the system.

      question: what about stalling out of http://n2t.net/ark:/21547/R2 ?
      answer: Currently the Name-to-Thing resolver is forwarding this request to the BiSciCol server. It should return with a metadata response from the BiSciCol server, and I do not know why this stalled out for you. This is configurable based on metadata you provide for the identifier.

      question: http://n2t.net/ark:/21547/R2_MBIO56 seems to result in a randomly chosen 20-50MB PDF document. Not very encouraging.
      answer: as is explained on the BCID page and on this blogpost and again at the top of this reply, the suffix passthrough feature on the name-to-thing server is not yet implemented!!

      question: I'm left wondering with all the talk about specimens in your RDF graphs, how in the world are you identifying them using HTTP URIs if even this basic stuff doesn't work.
      answer: Be patient. As we explain in this post, these features are in development but the core mechanism of ARK plus suffixPassthrough is what we are building on.
