Thursday, August 22, 2013

A sequence, a specimen and an identifier walk into a bar ….

As biodiversity scientists, we rely on a breadth of data living in various domain-specific databases to assemble knowledge on particular ecoregions, taxa, or populations.  We visit genetic and genomic, morphological, phylogenetic, image, and specimen databases in search of up-to-date information to answer our questions.  What we find, in general, is a morass of data, disconnected between systems, as if each database were a walled garden.  The main goal of BiSciCol is to link digital objects and datasets, break down those walled gardens, and build a true network of biodiversity data in which we can ask questions across domains and follow object instances wherever they are referenced.

How do we enable linking data, on a global scale, across domains? For BiSciCol, this comes down to two approaches: 1) build better ontologies and vocabularies so we can use the same language when talking about the same thing, and 2) adopt identifiers that are robust enough to persist over time and allow linking across walled gardens.  In this blog post, we’ll focus on the problems with identifiers as they’re currently used in practice, and how they can be improved to enable better linking.




To provide a point of reference for our discussion, let’s look at two databases that contain overlapping references to the same objects: VertNet and Genbank.  VertNet is a project that aggregates vertebrate specimen data housed in collections, and Genbank is a project that houses sequences, often with references to specimen objects held by the same museums that participate in VertNet.  For our exercise in linking, we’ll use a popular method for identifying museum objects, the Darwin Core (DwC) triplet: an institution code, collection code, and catalog number, separated by colons.

In the VertNet database, the DwC triplet fields are stored separately, and the triplet can be constructed programmatically by joining the field values with colons. The INSDC standard, which Genbank adopts, specifies that the specimen_voucher qualifier should contain the full DwC triplet: the institution code, collection code, and catalog number separated by colons in a single field.  Since these approaches are very similar, it should be a simple task to map the VertNet format to Genbank.  Harvesting all Genbank records from institutions that map to VertNet institutions and that also have a value in the specimen_voucher field gives us over 38,000 records.  VertNet itself had over 1.4 million records at the time we harvested the data.  We might expect the Genbank records containing voucher specimen information to match well against VertNet, since the institutions providing data to Genbank are the same ones providing data to VertNet.
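
To make the exercise concrete, here is a minimal sketch (in Python) of the kind of triplet construction and loosened matching described above.  The field names and sample values are ours, not the actual VertNet or Genbank schemas.

```python
# Illustrative sketch only: field names and sample values are hypothetical and do
# not reflect the real VertNet or Genbank schemas.

def dwc_triplet(institution_code, collection_code, catalog_number):
    """Build a Darwin Core triplet: institution:collection:catalogNumber."""
    return ":".join([institution_code, collection_code, catalog_number])

def loosened(triplet):
    """Relax matching: trim whitespace, uppercase, and drop the collection code."""
    parts = [p.strip().upper() for p in triplet.split(":")]
    if len(parts) == 3:
        parts = [parts[0], parts[2]]      # keep institution code + catalog number
    return ":".join(parts)

# A VertNet-style record (separate fields) and a Genbank-style voucher string
# that refer to the same specimen but spell the collection code differently.
vertnet = {"institutioncode": "MVZ", "collectioncode": "Mamm", "catalognumber": "165939"}
genbank_voucher = "MVZ:mammals:165939"

constructed = dwc_triplet(vertnet["institutioncode"],
                          vertnet["collectioncode"],
                          vertnet["catalognumber"])

print(constructed == genbank_voucher)                      # False: exact triplet match fails
print(loosened(constructed) == loosened(genbank_voucher))  # True: dropping the collection code matches
```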

In fact, on our first pass we found only 483 matches using the DwC triplet method of linking.  That is 1% of the potential number of matches we would expect!  If we toss the collection code field and match only on institution code plus catalog number, we get 2,351 matches (a 6% match rate).  Although we need to look a little more closely, tossing the collection code does not seem to cause collisions between collections within an institution (though in other settings it could lead to false-positive links).  If we combine removing the collection code with a suite of data-parsing tools, we can increase this a bit further to 3,153 matches (an 8% match rate).  This is still a dismal match rate.

The primary reason for the low match rate is that Genbank will accept any value in the specimen_voucher qualifier field, and providers respond in kind, inserting whatever value they choose. Consequently, the field values are notoriously noisy and difficult to parse.  The VertNet data were of much higher quality for two reasons: 1) records were curated into distinct values for institution, collection, and catalog_number, with clear instructions on recommended values for each field, and 2) there is a process of data cleaning and customer support for VertNet data publishers.  Even so, the DwC triplet method clearly suffers from serious issues when used in practice, and we need a better strategy if we’re going to take identifiers seriously.

Are there other options for unique identifiers?  Let’s look at the identifier options currently in play and the ramifications of each; but first, let’s consider what RDF tells us about identifiers.  The TDWG-RDF group is currently discussing what constitutes a valid URI for linking data on the semantic web. The only hard and fast requirement in RDF is that the identifier be an HTTP URI.  After all, this is the semantic web, built on the world-wide web, which uses the HTTP protocol to transfer data, so what can go wrong here?  Nothing, except that we must have persistence if we want identifiers to remain linkable in the future, and simply being an HTTP URI says nothing about persistence. It may be available today and next month, but what about in a year? In 10 years? In 50 years?  Will machine negotiation even be through HTTP in 50 years?  There are work-arounds to ensure long-term persistence, such as treating the identifier as a literal or using proxies that point to an HTTP resolver for identifiers.  However, it’s clear that RDF by itself does not answer our need for identifier persistence.  We need more specialized techniques.

So.  Some strategies:

DwC Triplets:  We’ve talked about this strategy here and some of its drawbacks.  In addition, DwC triplets are not guaranteed to be globally unique, and they encode metadata into the identifier itself, which is bad practice and leads to persistence problems down the road.  Worse: they are not resolvable, and they can be constructed in various, slightly different ways, leading to matching problems.

LSIDs: LSIDs (http://en.wikipedia.org/wiki/LSID) have not solved the persistence question either, and their resolvers are built on good will and volunteer effort.  More backbone needs to be provided to make these strong, persistent identifiers; for example, requiring identifiers to be resolvable rather than merely recommending resolution.

UUIDs: Programmers love UUIDs (http://en.wikipedia.org/wiki/Universally_unique_identifier) since they can be created instantly, are globally unique for all practical purposes, and can be used directly as database keys. By themselves, however, they don’t tell us where to resolve them; a vanilla UUID sitting in the wild says practically nothing about the thing it represents.  Solutions built on UUIDs can be a great option, as long as there is a plan for resolution, which usually requires another solution to be implemented alongside them.
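
As a quick aside, here is what minting one looks like; a minimal sketch using Python's standard library:

```python
import uuid

# Minting a UUID requires no registry, no network call, and no coordination.
record_id = uuid.uuid4()
print(record_id)   # globally unique for all practical purposes

# But the bare identifier says nothing about what it names or where to resolve it;
# a resolver or well-known namespace has to supply that context separately.
```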

DOIs: DOIs (http://www.doi.org/) were designed for publications, come with built-in metadata protocols, and are used by most publishers the world over.  There is an organization behind them, the International DOI Foundation, whose mission is long-term persistence, and a network of resolvers that can resolve any officially minted DOI.  DOIs are available at minimal cost through Datacite or Crossref.

EZIDs: EZIDs (http://n2t.net/ezid/) support Archival Resource Keys (ARKs) and DOIs through Datacite.  By registering with EZID you can mint up to 1 million identifiers per year at a fixed rate.  Subscription costs are reasonable.  EZIDs are supported by the California Digital Library, which not only helps assure persistence, but also provides useful services that are hard to build into homebrew resolvers.  

BCIDs: BCIDs (e.g., see http://biscicol.org/bcid/) are an extension of EZIDs and use a hierarchical approach (via a technique called suffixPassthrough) to resolve both dataset- and record-level entries.  Because identifiers are registered for groups and extended with locally unique suffixes, BCIDs enable rapid assignment of identifiers keyed to local databases while offering global resolution and persistence.  This approach also sidesteps the 1 million identifiers per year limit.

We conclude by noting that each aggregator out there seems to want to mint its own flavor of GUIDs, perhaps as much to “brand” an identifier space as for any other reason.  We wonder if this strategy of proliferating such spaces is a great idea.  A huge advantage of DOIs and EZIDs is abstraction. You know what they mean and how to resolve them because they are well-known and have organizations with specific missions to support identifier creation.  This strategy ensures that identifiers can persist and resolve well into the future, and be recognizable not just within the biodiversity informatics community but any other community we interoperate with: genomics, publishing, ecology, earth sciences.  This is what we’re talking about when we want to break down walled gardens.

-John Deck, Rob Guralnick, Nico Cellinese, and Tom Conlin

Monday, May 20, 2013

Sneak Peeks, BiSciCol Style



Our blog has been quiet lately, as we coded and tested and waited out the cold, short days of winter and early spring.  With spring now firmly here, we are ready to give you the opportunity to directly test some fruits of that labor.  First, a quick review of where we have been.  BiSciCol, along with everyone interested in bringing biodiversity data into the semantic web, has been plagued by a chicken-and-egg problem.  For the semantic web to be a sensible solution, there needs to be a way to associate permanent, resolvable, globally unique identifiers with specimens and their metadata.  There ALSO needs to be a community-agreed semantic framework for expressing concepts and how they link together.  You can't move forward without BOTH pieces, and unfortunately the biodiversity community has had essentially neither.  So BiSciCol decided to tackle both problems simultaneously.

The solution we developed leverages one thing that was already in place --- a community-developed and agreed-upon biodiversity metadata standard called Darwin Core.  We talked in our last blog post about how we have leveraged Darwin Core, formalized its "categories" (or classes), and derived relationships between them.  With this piece of the puzzle complete, we now have a working tool called the Triplifier.  The Triplifier takes a Darwin Core Archive, which contains data along with some self-describing metadata about the document, and converts those data to RDF.  Darwin Core Archives are particularly useful because all the data in such archives are already in a standard form.
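
To show what "triplifying" amounts to, here is a minimal sketch (not the Triplifier itself) that converts a single Darwin Core occurrence row to N-Triples with the rdflib library; the subject identifier is a made-up example.

```python
# Not the Triplifier itself -- just a sketch of what triplifying one Darwin Core
# occurrence row amounts to. The subject identifier is a made-up example.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

row = {"occurrenceID": "urn:catalog:MVZ:Mamm:165939",
       "scientificName": "Puma concolor",
       "country": "United States"}

g = Graph()
subject = URIRef(row["occurrenceID"])
g.add((subject, RDF.type, DWC.Occurrence))
g.add((subject, DWC.scientificName, Literal(row["scientificName"])))
g.add((subject, DWC.country, Literal(row["country"])))

print(g.serialize(format="nt"))   # N-Triples, ready to load into a triplestore
```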

Darwin Core Archives are available for download from sources such as the VertNet IPT (http://ipt.vertnet.org) or the Canadensys IPT (http://data.canadensys.net/ipt/).  Download any Darwin Core Archive you want and load the archive zip file into the Triplifier (not yet deployed to production, but try the development server here: http://geomuseblade.colorado.edu/triplifier/ ) via the "File Upload:" link, then click the "auto-generate project for" link and select Darwin Core Archive.  Load the file, review the class and property structures, and then click "Get Triples" at the very end.  You should then be able to save the RDF.  For more information on how the DwC Archive Reader plugin works, see the related JavaDoc page.

So what does this all mean?  First, this is a working tool for creating Darwin Core data in RDF format.  It may not be perfect yet, but it's been stress-tested, and it does the job. This is a big step forward in our opinion. We are currently Triplifying a lot of Darwin Core Archives and putting all the results into a data store for querying.  In the next blog post, we'll explain how valuable this can be, especially when looking for digital objects linked to specimens, such as published literature or gene sequences.

The other part of the chicken-and-egg problem is the persistent, and challenging, GUID problem.  Here we also have a working prototype of a service we are calling BCIDs, a form of identifier that is scalable, persistent, and leverages community standards.  BCIDs are a form of EZIDs with a couple of small tweaks to work for our community at scale, and they represent a lot of hard thinking by John Kunze and John Deck.  Here is the general idea: the BCID resolution system resolves BCID identifiers that are passed through the Name-to-Thing resolver (http://n2t.net/). All BCID group identifiers are registered with EZID and describe related categories of information such as Collecting Event, Occurrence, or Tissue. EZID then uses its suffix passthrough feature to pass the suffix back to the BCID resolver. At this point, a series of decisions based on the identifier syntax determines how to display the returned content. Element-level identifiers with suffixes registered in the BCID system, and with targets, resolve to a user-specified homepage. Unregistered suffixes, identifiers with no defined target, or requests that specifically ask for machine resolution return an HTML rendering of the identifier with embedded RDF/XML describing it. Machine resolution can be specifically requested for any identifier by appending a "?" to it.  See the diagram below for extra clarity, and check out the BCID home page and BCID code page.


How does this all work in practice?  Suppose we have group ID = ark:/21547/Et2 (resource=dwc:Event) and do not register any elements. Now suppose someone passes in a resolution request for ark:/21547/Et2_UUID; the system will still tell you that this is some event (dwc:Event), the date it was loaded, a title, and whether there is a DOI/ARK associated with it.  Now suppose we decide to register the UUIDs associated with ark:/21547/Et2 and also provide web pages with HTML content to look at (targets); then we can show a nicely formatted, human-readable page for the collecting event itself.  But what if we're a machine and we don't want to wade through style sheets and extraneous, difficult-to-parse text; we just want to know when this record was loaded and its resourceType (regardless of whether there is a target)? This is where "?" comes in: if a "?" is appended to the end of the ARK, like ark:/21547/Et2_UUID?, then we automatically get RDF/XML. Minimalist, but predictable, and a convention already in use for EZIDs.
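
Here is a rough sketch of what that resolution convention looks like from a script; the element-level suffix below is a placeholder, and whether this particular ARK still resolves depends on the current state of the BCID service.

```python
# Sketch of machine resolution through the Name-to-Thing resolver, following the
# "?" convention described above. The element-level suffix is a placeholder.
import urllib.request

group_ark = "ark:/21547/Et2"
element_ark = group_ark + "_9f1c2f0e"               # hypothetical locally unique suffix

html_url = "http://n2t.net/" + element_ark          # human-readable rendering
rdf_url = "http://n2t.net/" + element_ark + "?"     # appending "?" requests RDF/XML

for url in (html_url, rdf_url):
    print(url)

# Uncomment to actually fetch the machine-readable form (may fail if the
# identifier is not, or is no longer, registered):
# with urllib.request.urlopen(rdf_url) as response:
#     print(response.read().decode("utf-8"))
```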

Soon you will be able to call the BCID service for any dataset, whether it's in RDF format or not.  For datasets, one can register an ARK or DOI and associated metadata, and for more granular elements, BCIDs will help assign the pass-through suffixes.  We think this represents a very elegant system for dealing with the very challenging problem of GUIDs in the biodiversity informatics community.  It leverages existing tools and communities and creates the new ones needed by those involved in biocollections.  If you want to try creating and using BCIDs now, talk to us and we'll work with you to get started.

We will be presenting more about BiSciCol in meetings this Summer, at iEvoBio (http://ievobio.org/) and TDWG (http://www.tdwg.org/conference2013) , showing off what amounts to solutions that cover those chickens and eggs.   In the next post we'll finally link all of this up and show how it can be used for some neat discoveries.  Before winding down, BiSciCol owes a gigantic thanks to Brian Stucky who has put in a tremendous amount of effort developing the Triplifier.  He is off in Panama working on his dissertation research, and will be teaching classes next Fall.  We couldn't have come nearly as far as we have without him.

- Rob Guralnick, Nico Cellinese, Tom Conlin, John Deck, and Brian Stucky

Tuesday, March 12, 2013

BiSciCol, Triples, and Darwin Core

A big part of what we want to accomplish with BiSciCol is supporting biodiversity collections data from lots of different sources.  These data are often organized using a standard called "Darwin Core" (DwC), and Darwin Core-based data are commonly transmitted in a specific format known as a "Darwin Core Archive" (DwCA).  So recently, we've been devoting a lot of thought and effort to figuring out how we can best support DwC and DwCAs in BiSciCol and the Triplifier.  (The "Triplifier" is a tool we are building to make it easy to convert traditional, tabular data formats into RDF triples for use in BiSciCol and the Semantic Web.  DwCAs are just such a format.)  Representing DwC data in RDF triples and "triplifying" DwCAs presented a number of challenges, and in this post we want to discuss one of these challenges:  Figuring out how to use our relations terms to capture the connections found in DwC data.

Darwin Core includes six discrete categories of information: Occurrence, Event, dcterms:Location, GeologicalContext, Identification, and Taxon.  DwC does not formally describe relationships between these categories of information, though.  Formally defining the relationships that join categories, or classes, of information is common practice in standards development, but DwC's developers deliberately chose not to do this in order to keep the standard as flexible as possible.

Before proceeding, we should note that in the previous paragraph we were careful to distinguish between the words “class” and “category.”  “Class” is a word typically used for categories of information in a formal ontology (which DwC is not).  However, since we’re describing a method for working toward formalizing DwC content, we’ll use the word “class” hereafter to refer to both the formal model and the original DwC categories.

So, to represent DwC data as RDF triples, we needed a way to relate DwC class instances to one another.  This sounds fancy, but it's really a matter of using a common-sense approach to describe relationships between entities, much as people have been doing with relational databases for decades.  In fact, the darwin-sw project has already developed a complete ontology for representing DwC data in the Semantic Web.  However, because BiSciCol is limited to a small set of generic relations terms, we needed a new approach for handling DwC data.  Plus, by building on the core BiSciCol relations, such a solution could easily include not just DwC, but concepts from other domains such as media, biological samples, genetic material, and the environment.

To make this all a bit more concrete, let's take a look at an example.  Suppose we have a single instance of Occurrence (a specimen in a collection, say) that originated from a particular collecting expedition, which is represented in DwC as an instance of the Event class.  Using RDF and BiSciCol's relations predicates, how should we make the required connection between the Occurrence instance and the Event instance?  More generally, how should the six core DwC classes be related to one another using BiSciCol's relations terms?


The image above illustrates our answer to this question.  Recall that we are using only four relations predicates in BiSciCol: derives_from, depends_on, alias_of, and related_to (see the previous post for much more information).  The diagram should be fairly self-explanatory.  Some relationships are naturally described by depends_on.  For example, an Identification can only exist if there is an Occurrence (e.g., a specimen) to identify and a Taxon to identify it as.  On the other hand, a GeologicalContext gives us information about a collecting Event, but in at least some sense, the collecting event is independent of the geological context.  Thus, the relationship between these two instances is described by related_to.
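
For readers who think in triples, here is a minimal sketch of the relations just described, written with rdflib; the namespace URIs and instance identifiers are placeholders, and only the relations named in the text above are shown.

```python
# Sketch of the relations described above. The namespace URIs and instance
# identifiers are placeholders, not the project's actual URIs.
from rdflib import Graph, Namespace

BSC = Namespace("http://example.org/biscicol/")   # placeholder for the relations vocabulary
EX = Namespace("http://example.org/data/")        # placeholder for instance data

g = Graph()
g.add((EX.identification1, BSC.depends_on, EX.occurrence1))   # needs a specimen to identify
g.add((EX.identification1, BSC.depends_on, EX.taxon1))        # ...and a taxon to identify it as
g.add((EX.event1, BSC.related_to, EX.geologicalContext1))     # informative but independent

print(g.serialize(format="nt"))
```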

So far so good, but when dealing with real data, this solution turns out to be insufficient because DwC data sets often do not include all six core classes.  What should we do if a data set includes Occurrence and Taxon, but not Identification?  This scenario is not uncommon, so to deal with all possibilities, we added a few more relations to handle the cases where a class (either Identification or Event) that acts as a bridge connecting Occurrence to other classes is missing.  The following diagram illustrates the complete set of relations, with the dashed, gray lines representing the relations that are used if either Identification or Event are missing.

 


And that's it!  With this set of eight relationship triples, we should be able to handle all possible combinations of the six core DwC classes.

- Brian Stucky, John Deck, Rob Guralnick, and Tom Conlin

Thursday, December 27, 2012

BiSciCol in Four Pictures

People always say a picture is worth a thousand words.  Given that, we want to present “4,000+” words here.  That is, below are four images and some “captions” of explanatory text that, while not fully inclusive of current efforts, give a close-to-complete update on our progress and next steps.



Figure the first.  One of the things the BiSciCol crew has thought a lot about is how to express relationships among different kinds of physical and digital biological collection “objects”.  Our work is focused on tracking those relationships, which means following the linkages between objects as they move about on the Internet of Things (http://en.wikipedia.org/wiki/Internet_of_Things).  Early in the BiSciCol project we had exactly one relationship, which we expanded a few blog posts ago by adding a second predicate called “relatedTo”, which is directionless and limits how searches can traverse our network.  We have now settled on what we hope is a final set of predicates, which also includes “derives_from” and “alias_of”.  “Derives_from” is important because it recognizes that properties of a biological object can be shared with its derivatives; for example, a tissue sample can be inferred to have been collected in Moorea, French Polynesia because the specimen (whole organism) it came from was recorded as collected there (“derives_from” is borrowed from the Relations Ontology and defined as transitive).  Finally, “alias_of” is a way of handling duplicate identifiers for the same object.
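
The payoff of a transitive “derives_from” is that inferences like the Moorea example can be automated.  A minimal sketch, with made-up identifiers and a toy in-memory store rather than a real triplestore:

```python
# Toy illustration of the "derives_from" inference described above: a locality
# recorded on a specimen carries over to objects derived from it.
derives_from = {"tissue42": "specimen7"}                 # tissue42 derives_from specimen7
locality = {"specimen7": "Moorea, French Polynesia"}     # recorded collecting locality

def inferred_locality(obj):
    """Walk derives_from links until a recorded locality is found."""
    while obj is not None:
        if obj in locality:
            return locality[obj]
        obj = derives_from.get(obj)
    return None

print(inferred_locality("tissue42"))   # Moorea, French Polynesia
```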



Figure the second.  We know you love technical architecture diagrams during the holidays.  Although this looks a bit complicated, let’s take it apart and discuss the various pieces, because it summarizes a lot of work we invested in some challenging social and technical issues.  The diagram is built on three main components: the GetMyGUID service, the Triplifier (Simplifier), and the Triplifier Repository.  The GetMyGUID service is used to mint EZIDs that can be passed directly to biocollections managers for use at the source, or associated with data in the triplestore.  The Triplifier (Simplifier) is a tool for creating RDF from biocollections data and pushing it to a user via web services or into a triplestore.  We are now working out the back-end architecture to deal with storing a large number of triples.  We have designed this architecture to be flexible, simple, and based on an understanding of user needs (and concerns) with regard to permanent, unique identifiers and semantic web approaches.



Figure the third.  The Triplifier is web-based software that takes input files and creates triples (http://en.wikipedia.org/wiki/N-Triples) from them.  The process involves multiple steps: uploading a database or spreadsheet to the Triplifier, specifying any known joins between the uploaded tables, mapping properties in those local files to known terms in an appropriate vocabulary, relating terms using predicates, and then hitting “Triplify!”  For those not versed in ontologies and the semantic web, the whole process can be intimidating!  So we made it easier.  The Triplifier Simplifier can take any dataset in Darwin Core format, and we’ll do the work for you.  We read the header rows, verify that they map to Darwin Core terms, and set it all up to Triplify correctly.  Voilà!  We have a bit more work to do before the Simplifier is ready; the big challenge is taking these flat-file “spreadsheets” and recreating a set of tables based on Darwin Core classes such as “occurrence”, “event”, “taxon”, etc.  We will spend more time discussing this in future blog posts!
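
A minimal sketch of what reading and verifying header rows against Darwin Core terms could look like; this is our illustration, not the Simplifier's actual code, and the term set is deliberately tiny.

```python
# Sketch of checking a flat file's column headers against known Darwin Core terms.
# The term set is deliberately tiny; the real standard defines many more.
import csv, io

DWC_TERMS = {"occurrenceID", "scientificName", "eventDate", "decimalLatitude",
             "decimalLongitude", "catalogNumber"}

sample = "occurrenceID,scientificName,eventDate,myCustomColumn\n"

headers = next(csv.reader(io.StringIO(sample)))
recognized = [h for h in headers if h in DWC_TERMS]
unrecognized = [h for h in headers if h not in DWC_TERMS]

print("Mapped to Darwin Core:", recognized)
print("Needs manual mapping:", unrecognized)
```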

Figure the fourth.  This is another “in preparation” web interface for users to get Great and Useful EZIDs.  The options include pasting in a set of local identifiers, which could be a set of catalog numbers or locally specific event identifiers; the GetMyGUID service creates a second column with an EZID per row linked to the local identifier, which a user can then import right back into their database to have EZIDs on their source material.  The “Create GUIDs” link simply mints a set of EZIDs for later use; some authentication will be required, and we might put an expiration date on how long you can wait to use them.  The last option is “Mint a DOI for your dataset”: you type in the digital object's location and some key metadata, and you get a DOI that resolves to at least the metadata and links to the actual digital object.  As always, BiSciCol will accept any well-formed, valid, persistent URI identifiers supplied by clients.  We are working closely with the California Digital Library and extending their EZID API for use in this part of our project.

Summary:  We end 2012 on a BiSciCol high note, and not just because the meeting was in Boulder, Colorado (because of the elevation, people!  Not the legal cannabis!).  We have made a lot of progress based on productive meetings, a lot of input from various folks, and a lot of time and effort by our talented programmers, who work hard both to develop this and to canvass the community.  We should also take this opportunity to give a shout-out to a new developer on the team, Tom Conlin, who is joining us as our back-end database expert.  Great to have him on board!

- John Deck, Rob Guralnick, Brian Stucky, Tom Conlin, and Nico Cellinese

Friday, October 12, 2012

Making it 'EZ' to GUID

On Global Unique Identifiers (again) for Natural History Collections Data:  How to Stop People From Saying “You’re Doing It Wrong” (or conversely, “Yay! We’re Doing It Right!”)
From Gary Larson, adapted by Barry Smith in a Referent Tracking presentation at the Semantics of Biodiversity Workshop, 2012.

The natural history collections community has been hearing about GUIDs (globally unique identifiers) for a long time.  However, what we’ve typically heard are comments like “ARGH!  These don’t work” or “These are hard to implement”, or we’ve been subjected to long “policy documents” that seem to be written by computer scientists rather than the people actually working in collections.  The bottom line is that it would be nice to have some clear, short “thinky things” about GUIDs that make the value a bit more obvious and that provide a simple, clear way forward.  We want to take a stab at that here and VERY MUCH WELCOME feedback.  Lots of it.  We’ve thought a ton about this and we are ready!
 
A recurrent question we have gotten from people developing collections databases (or working at the level of aggregators such as VertNet or GBIF) is why we need to go beyond self-minted, internal GUIDs, and why GUIDs need to resolve and be persistent.  We could envision a large data aggregator such as iDigBio or GBIF keeping track of digital records by assigning UUIDs (universally unique identifiers, which are very, very easy to mint!), but likely without any connection to the physical source objects stored in providers' institutions, or to the same objects stored in other institutional repositories or aggregators.  Yet the ultimate value of assigning GUIDs to objects, their metadata, and their derivatives is that we can track all of these back to their source and pose queries that involve semantic reasoning over a robust digital landscape. In such a landscape, answering the core, challenging questions generated by collaborative projects becomes possible.  The digitization process thereby acquires a much deeper meaning and value: it goes beyond straightforward data capture and moves toward an environment where we can track relationships among physical objects, annotations, and descriptive metadata as part of a global network.  If as a community we agree on the benefit of assigning GUIDs, this is the opportunity to craft a strategy that adds long-term value to the effort.  In other words, if we need to invest our resources, let’s do it in ways that benefit us now and in the future.

A big question is how best to implement such a vision.  GUID implementations within our community have proven problematic, as evidenced by roughly 20% of Rod Page’s blog posts.  After much vetting of possible solutions, we believe the right answer is to leverage existing expertise in developing not just GUIDs, but a set of services built around them.  In particular, we have talked to the California Digital Library (CDL) about EZIDs, which elegantly solve a lot of community needs at once and nicely position us for the future. Speaking of community needs, the solution we advocate is not just “go get EZIDs”.  BiSciCol was funded, in part, to work with the community and foster the implementation of GUIDs as a necessary step toward bringing our digital resources into a Linked Open Data framework.  BiSciCol wants to build services that support the community, working with CDL and with you, to make that happen.

What are EZIDs and why do we love them?
As we mentioned in a previous blog post (http://biscicol.blogspot.com/2012/08/the-biscicol-project-team-has-had-busy.html), CDL has developed EZIDs, which are flexible GUIDs built on DOIs and ARKs.  The big win is that a bunch of CDL services have already been developed to help mint these GUIDs and to ensure they are resolvable, linkable, persistent, and sustainable over the long term.  EZIDs have some lovely features, including the flexibility to be associated with datasets and objects through the whole digital data life cycle.  EZIDs also let us mix and match DOIs, which are well understood and widely used in the publishing community, with ARKs, which were developed in the archives, library, and museum community and provide a bit more flexibility, including assignment at a more granular level to individual data objects rather than datasets.  For more details, see John Kunze’s PowerPoint presentation on EZIDs.  We can work with CDL and their EZID system to build a prototype collections-community GUID service.
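
For the curious, here is roughly what minting a test ARK through the EZID API looks like from a script, based on our reading of the CDL documentation; the account credentials are placeholders, and the shoulder and metadata conventions should be checked against the current EZID docs.

```python
# Rough sketch of minting a test ARK via the EZID API; credentials are placeholders
# and the details should be verified against the current EZID documentation.
import requests

EZID = "https://ezid.cdlib.org"
TEST_SHOULDER = "ark:/99999/fk4"               # EZID's test ARK shoulder

metadata = "_target: http://example.org/specimens/MVZ-165939\n"   # ANVL-formatted metadata

response = requests.post(
    f"{EZID}/shoulder/{TEST_SHOULDER}",
    data=metadata.encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
    auth=("your_username", "your_password"),   # placeholder credentials
)
print(response.status_code, response.text)     # e.g. "success: ark:/99999/fk4..."
```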

So you are thinking to yourself... how much does it cost?  The answer: nothing to you, very little to BiSciCol, and ultimately far less than what has already been spent in people-hours trying to sort through this very complex landscape and develop home-grown solutions.  Sustainability has costs, and the goal is to scale those down, by leveraging economies of scale, to the point where they are orders of magnitude lower than they have been before. We do that with this solution.  Big win.

Our View on Best Practices:

  1. GUIDs must be globally unique.  The “Darwin Core Triplet” might not be good enough.  
  2. GUIDs must be persistent.  Most projects generating GUIDs have < 10 year lifespans.  Having persistent GUIDs means we need to think about strategies for resolution services (if required) that have a > 10 year lifespan and in the context of an institution that is designed to be persistent.
  3. GUIDs must be assigned as close to the source as possible.  For example, if data is collected in the field, the identifier for that data needs to be assigned in the field and attached to the field database with ownership initially stated by the maintainers of that database.  For existing data, assignment can be made in the source database.
  4. GUIDs must propagate downstream to other systems.  Creating new GUIDs in warehouses that duplicate existing ones is bad practice, and thus aggregators need to honor well-curated GUIDs from providers.
  5. Don’t conflate GUIDs for physical material with GUIDs for metadata about that physical material.  We promote assigning one GUID to the physical material and a separate GUID to the metadata about it (a minimal sketch follows this list).  While physical object IDs can be any type of GUID, we recommend EZIDs because they are short, unique, opaque, resolved by a persistent entity, and redirection to metadata can be stored with the identifier itself.  UUIDs can be used for this purpose as well, BUT they are not as robust as EZIDs since they lack redirection and resolution, or require local solutions (see #2 above for the problems with such solutions).
  6. GUIDs need to be attached in a meaningful way to semantic services.  Knowing semantically that a GUID is referring to either an information artifact, a process, or a physical thing is vital to understanding how to interpret the meaning of its relationship to other GUIDs expressed in other areas and to inform aggregators how to interpret content.
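
Here is the minimal sketch promised in practice 5: one identifier for the physical specimen, a separate identifier for the metadata record about it, and an explicit link between the two.  The ARKs and the linking predicate are placeholders, not a recommendation of specific terms.

```python
# Sketch of practice 5: separate GUIDs for the physical object and for the metadata
# record that describes it. Identifiers and the predicate URI are placeholders.
from rdflib import Graph, Namespace, URIRef

EX = Namespace("http://example.org/terms/")   # placeholder vocabulary

physical_specimen = URIRef("http://n2t.net/ark:/99999/fk4specimen1")   # placeholder ARK
metadata_record = URIRef("http://n2t.net/ark:/99999/fk4metadata1")     # placeholder ARK

g = Graph()
g.add((metadata_record, EX.describes, physical_specimen))
print(g.serialize(format="nt"))
```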

A prototype collections community GUID service.
GetMyGUID Service - “Promoting GUID Standard Design Practices”.  We have blathered on long enough here, but want to give a hint of where we are going.  We want to create a service built by natural history collections folks (and our computer science friends) for natural history collections folks, one that taps into existing goodness already created.  That is, we tap into the existing EZID services and then build a service on top that encodes best practices that work in this community.  In the near future, we will explain how the service works, how you can access it, and why it does what it does.  We know how hard it is to get folks to make updates and additions to their databases, so we want to figure out how to get over that barrier!  We want to find those early adopters (and hint, hint: we are working with BiSciCol partners already to get this ball rolling!).  So, more soon.  Pass the word along!


- John Deck, Rob Guralnick, and Nico Cellinese

Saturday, August 25, 2012

News Update: how do we 'GUID'?

The BiSciCol project team has had a busy summer that included a presentation at the Annual Meeting of the Society for the Preservation of Natural History Collections (SPNHC) in New Haven, CT, and a presentation at the iEvoBio 2012 meeting in Ottawa, Canada.  Additionally, on 13-15 August, John Deck, Nico Cellinese, Rob Guralnick, and Neil Davies convened at the University of California, Berkeley to meet with a few key partners and discuss next steps for the project (meeting summary).  Before we report more about the meeting with our partners, here is some background information.

BiSciCol's main goal is to break down the walled gardens between databases storing different kinds of biodiversity data, such as specimens or samples from collecting events and the sequences, images, and other products generated from those specimens or samples.  Doing so requires overcoming two separate community challenges. First, there must be a mechanism for associating globally unique identifiers (GUIDs) with collections records (note that we are using the RSS specification's definition of a GUID).  Second, the collections records must be expressed such that the terms used to define those records and their relationships are well understood by humans and computers.  This brings us to the “semantic web” and RDF “triples”.

As BiSciCol has evolved, two key questions related to these challenges have emerged.  The first is whether assigning GUIDs and creating "triples" should happen at the level of individual provider databases, or instead at the level of "aggregators" that enforce a standardized schema and encoding.  In the case of biological collections, an example of a standardized schema is Darwin Core, usually encoded as a Darwin Core Archive; example aggregators are GBIF, VertNet, and Map of Life. The second question is equally thorny and deals primarily with the content the identifier describes: is the identifier describing a physical object or a digital surrogate of a physical object, and is that surrogate the primary one or a copy?  For example, specimen metadata attached to a photo record in Morphbank is a copy of specimen metadata that in turn references a physical object.

So, let's turn back to the meeting in Berkeley. That meeting included two key partners with whom we want to further develop and test ways forward on the two huge questions above.  We spent part of the time with the California Digital Library (CDL) folks, who have built a set of excellent tools that may be part of the solution to the problem of GUID assignment.  CDL has developed EZIDs, which are flexible GUIDs built on DOIs and ARKs.  John Kunze from CDL gave a great rundown on EZIDs and how they work, and was kind enough to meet with us again on a couple of separate occasions, formal and informal.  Metadata encoded in the EZID itself may also be used to indicate use restrictions and provenance (see John Kunze’s PowerPoint presentation on EZIDs).

The other key partner we met with was VertNet, represented by Aaron Steele, the lead systems architect on the project.  The idea behind meeting with VertNet was to test how we might do EZID assignment and triplification using the same approach by which VertNet data are processed from Darwin Core Archives into a set of tables that can be visualized, queried, and replicated.  Aaron was kind enough to participate in our hackathon and start up this process.  We set up a readme file about the hackathon to describe our expected outputs.  Yes, the project is called "Bombus", reflecting the fact that, although a bit wobbly, our goal is to have data flying around to "pollinate" other data.  Happily, the hackathon was very much a success!  We were able to tap into some existing code from Matt Jones (NCEAS) to mint EZIDs and, voilà, we had an output file ready for the semantic web (e.g., a file that shows relationships between occurrences, localities, and taxa based on the EZIDs).  We weren't quite able to get to the last step of querying the results, but we're very close.  More work (and reports) will follow, so stay tuned on the Bombus/pollinator link above.

We have been testing a variety of solutions for identifier assignment, including supporting user-supplied GUIDs, aggregator GUIDs, dataset DOIs, community-standard identifiers (e.g., the DwC triplet), and creating QUIDs (quasi-unique identifiers) from hashed content.  EZID technology will play a significant role in implementing a number of these approaches.  None of these approaches offers a complete solution, but taken together they let us begin to build an intelligent system that provides valuable services to aggregators, data providers, and users.  Services we will support include GUID tracking, identifying use restrictions, and GUID reappropriation.  Integrating our existing triplifier and biscicol Java codebases with a scalable database back end will fulfill most of the technical requirements.
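
As an illustration of the last idea, here is one way a quasi-unique identifier could be derived from hashed record content; this is our sketch of the concept, not the algorithm in the BiSciCol codebase.

```python
# Sketch of a quasi-unique identifier (QUID) derived from hashed record content.
import hashlib

record = {"institutionCode": "MVZ", "catalogNumber": "165939",
          "scientificName": "Puma concolor"}

# Hash a canonical, order-independent serialization of the record's fields so the
# same content always yields the same identifier.
canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
quid = hashlib.sha1(canonical.encode("utf-8")).hexdigest()

print(quid)
```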

We are still building our Triplifier to support those who want to take their own datasets and bring them into the semantic web framework, but BiSciCol can operate much more "at scale" with a very simple interface that accepts Darwin Core Archives or other standardized data, such as those generated by Barcode of Life, Morphbank, or iDigBio, and assembles them into a shared triplestore or set of commonly accessible triplestores.  We think the issues we're tackling right now are at the sociotechnical heart of BiSciCol.  We use the term heart knowingly, because it is the desire and will of the community, along with resources such as BiSciCol, that can help motivate and excite, and that will get us at least moving in the right direction. If you have any thoughts, criticisms, or suggestions, we'd of course love to hear them.


John Deck, Rob Guralnick and Nico Cellinese

Friday, April 6, 2012

Making our System Smarter

Computers are amazing at following instructions. So amazing, in fact, that a seemingly harmless instruction can potentially lead to an entirely false conclusion.
At our recent BiSciCol meeting at the University of Florida, we had a discussion about just such a case.

At its core, BiSciCol is all about connecting objects to each other. To connect object identifiers to other objects, we have been using a simple relationship expression called “leadsTo”, which indicates a direction in the relationship between one object and another. To illustrate how “leadsTo” works, let's use a simple example. Suppose we have a collecting event, which we join to a specimen object using our relationship predicate “leadsTo”. The specimen object could then “lead to” a taxonomic determination, which could in turn “lead to” a scientist, and so on.

This is certainly useful, as we can express an endless chain of objects and their derivatives, even if they exist in different databases. However, what if we extended the above example just a bit further using our “leadsTo” relationship?


Uh oh--- By successively following the leadsTo relationships, we could now erroneously conclude that spec2-t1 came from spec3! This is not good! Fortunately, there is a solution.

We realized that the directional “leadsTo” relationship simply doesn't make much sense in some situations, such as the connection between spec3 and person1 in the diagram above. Consequently, instead of the single “leadsTo” relationship, we actually need two relationship terms: one that has a distinct direction and one that implies no direction. Two terms from the Dublin Core standard do just this: 1) relation (no direction) and 2) source (has direction).


In the first example above, we could avoid the problem entirely by describing the link between the taxonomic determination and the scientist as a non-directional relation. Using our new terminology, the graph would look something like this:


The computers involved in figuring out how to traverse the graph of relationships would know not to follow non-directional relationships and we would no longer infer that spec2-t1 came from spec3. Problem solved!
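
For the programmatically inclined, here is a toy sketch of that traversal rule: directed “source”-style links are followed when tracing where an object came from, while non-directional “relation” links are recorded but never traversed. The little graph is adapted loosely from the diagrams, so the exact objects and links are illustrative.

```python
# Toy sketch of the traversal rule: follow only directed links when tracing provenance.
directed = {                     # child -> parent, i.e. "has source"
    "spec2": "event1",
    "spec2-t1": "spec2",
    "spec3": "event1",
}
non_directional = {              # dcterms:relation links; listed only to show what is ignored
    ("spec2-t1", "person1"),
    ("spec3", "person1"),
}

def provenance(obj):
    """Trace an object back toward its origin along directed links only."""
    chain = [obj]
    while obj in directed:
        obj = directed[obj]
        chain.append(obj)
    return chain

print(provenance("spec2-t1"))    # ['spec2-t1', 'spec2', 'event1'] -- never reaches spec3
```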


This post, written by John Deck and Brian Stucky with input from Hilmar Lapp, Steve Baskauf, Andrea Thomer, Rob Guralnick, Lukasz Ziemba, Tim Robertson, Reed Beaman, and Nico Cellinese, summarizes a discussion that took place at the BiSciCol development meeting held on March 31, 2012 at the University of Florida.