Friday, October 12, 2012

Making it 'EZ' to GUID

On Globally Unique Identifiers (again) for Natural History Collections Data: How to Stop People From Saying “You’re Doing It Wrong” (or, conversely, “Yay! We’re Doing It Right!”)
Cartoon by Gary Larson, adapted by Barry Smith in his Referent Tracking
presentation at the Semantics of Biodiversity Workshop, 2012.

The natural history collections community has been hearing about GUIDs (globally unique identifiers) for a long time.  However, what we’ve typically heard are comments like “ARGH!  These don’t work” or “These are hard to implement”; or we’ve been subjected to long “policy documents” that seem to be generated by computer scientists, not the people actually working in the collections.  So the bottom line is that it’d be nice to have some clear, short “thinky things” about GUIDs that help us understand their value a bit more clearly, and that provide a simple and clear way forward.  We want to take a stab at that here and VERY MUCH WELCOME feedback.  Lots of it.  We’ve thought a ton about it and we are ready!
 
A recurrent question we have gotten from people developing collections databases (or working at the level of aggregators such as VertNet or GBIF) is why we need to go beyond self-minted, internal GUIDs, and why GUIDs need to resolve and be persistent.  We could envision a large data aggregator such as iDigBio or GBIF keeping track of digital records by assigning UUIDs (universally unique identifiers, which are very, very easy to mint!) to them, but likely without any connection to the physical source objects stored in provider institutions, or to the same objects stored in other institutional repositories or aggregators.  Yet the ultimate value of assigning GUIDs to objects, their metadata, and derivatives is that we can track all of these back to their source and run queries that involve semantic reasoning over a robust digital landscape.  In such a landscape, answering the challenging core questions generated by collaborative projects becomes possible.  The digitization process thereby acquires a much deeper meaning and value: it goes beyond straightforward data capture and moves towards an environment where we can track relationships among physical objects, annotations, and descriptive metadata as part of a global network.  If as a community we agree on the benefit of assigning GUIDs, this is the opportunity to build a strategy that adds long-term value to the effort.  In other words, if we need to invest our resources, let’s do it in ways that deliver benefit now and in the future.
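As a minimal illustration of just how easy UUIDs are to mint, the Python snippet below creates one locally, with no registration authority involved. Note what it does not give you: nothing about the bare identifier tells you where to resolve it, which is exactly the limitation discussed above.

```python
import uuid

# Minting a UUID is a purely local operation: no authority, no fees,
# and collisions are astronomically unlikely.
record_id = uuid.uuid4()
print(record_id)  # e.g. 6f1c2b9e-0d3a-4d7e-9b1a-2c5e8f4a7d10

# The trade-off: a bare UUID carries no resolution mechanism, so nothing
# connects it back to the physical object or to copies held elsewhere.
```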

A big question is how best to implement such a vision.  GUID implementations within our community have proven problematic, as evidenced by 20% of Rod Page’s blog posts.  After much vetting of possible solutions, we believe the right answer is to leverage existing expertise in developing not just GUIDs, but a set of services built around them.  In particular, we have talked to the California Digital Library (CDL) about EZIDs, and about the value of using EZIDs given that they elegantly solve a lot of community needs at once and nicely position us for the future.  Speaking of community needs, the solution we advocate is not just “go get EZIDs”.  BiSciCol was funded, in part, to work with the community and foster the implementation of GUIDs as a necessary step towards bringing our digital resources into a Linked Open Data framework.  BiSciCol wants to build out services that support the community, working with CDL and you, to make that happen.

What are EZIDs and why do we love them?
As we mentioned in a previous blog post (http://biscicol.blogspot.com/2012/08/the-biscicol-project-team-has-had-busy.html), CDL has developed EZIDs, which are flexible GUIDs built off of DOIs and ARKs.  The big win is that there are a bunch of CDL services already developed to help with minting these GUIDs and to ensure that they are resolvable, linkable, persistent, and sustainable over the long term.  EZIDs have some lovely features, including the flexibility to be associated with datasets and objects through the whole digital data life cycle.  Also, EZIDs allow us to mix and match DOIs, which are well understood and used in the publishing community, with ARKs, which were developed in the archives, library, and museum community and provide a bit more flexibility, including the ability to be assigned at a more granular level to individual data objects rather than datasets.  For more details, see John Kunze’s PowerPoint presentation on EZIDs.  We can work with CDL and their EZID system to build a prototype collections community GUID service.
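To give a flavor of what “minting services” means in practice, here is a rough sketch of a mint request against the EZID REST interface. The account credentials, specimen URL, and metadata values are placeholders, not a working recipe; a real integration would use an institutional EZID account and shoulder.

```python
import requests

EZID = "https://ezid.cdlib.org"
AUTH = ("username", "password")  # placeholder EZID account credentials

# EZID metadata is sent as ANVL-style "key: value" lines; _target is the
# URL the minted identifier will redirect to.
metadata = (
    "_target: http://example.org/specimens/12345\n"
    "erc.what: herpetology specimen record"
)

# POSTing to a shoulder asks EZID to mint a new identifier under it;
# ark:/99999/fk4 is EZID's documented ARK test shoulder.
resp = requests.post(
    f"{EZID}/shoulder/ark:/99999/fk4",
    data=metadata.encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
    auth=AUTH,
)
print(resp.text)  # e.g. "success: ark:/99999/fk4xxxxx"
```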

So you are thinking to yourself... how much does it cost?  The answer is: nothing to you, very little to BiSciCol, and ultimately far less than what has already been spent, in people-hours, trying to sort through this very complex landscape and develop home-grown solutions.  Sustainability has costs --- and the goal is to scale those down, by leveraging economies of scale, to the point where they are orders of magnitude lower than they have been before.  We do that with this solution.  Big win.

Our View on Best Practices:

  1. GUIDs must be globally unique.  The “Darwin Core Triplet” (institutionCode:collectionCode:catalogNumber) might not be good enough, because its parts are not guaranteed to be unique across institutions (see the sketch after this list).
  2. GUIDs must be persistent.  Most projects generating GUIDs have < 10 year lifespans.  Having persistent GUIDs means we need to think about strategies for resolution services (if required) that have a > 10 year lifespan and in the context of an institution that is designed to be persistent.
  3. GUIDs must be assigned as close to the source as possible.  For example, if data is collected in the field, the identifier for that data needs to be assigned in the field and attached to the field database with ownership initially stated by the maintainers of that database.  For existing data, assignment can be made in the source database.
  4. GUIDs must propagate downstream to other systems.  Creating new GUIDs in warehouses that duplicate existing ones is bad practice; aggregators need to honor well-curated GUIDs from providers.
  5. Don’t conflate GUIDs for physical material with GUIDs for metadata about that physical material.  We promote assigning a unique GUID to the physical material; metadata about that material gets a separate GUID (again, see the sketch after this list).  While physical object IDs can be any type of GUID, we recommend EZIDs: they are short, unique, opaque, and resolved by a persistent entity, and redirection to metadata can be stored with the identifier itself.  UUIDs can be used for this purpose as well, BUT they are not as robust as EZIDs, since they lack built-in redirection and resolution and so require local solutions (see #2 above for the problems with such solutions).
  6. GUIDs need to be attached in a meaningful way to semantic services.  Knowing whether a GUID refers to an information artifact, a process, or a physical thing is vital to interpreting its relationships to other GUIDs and to informing aggregators how to interpret content.
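To make #1 and #5 concrete, here is a minimal sketch (the specimen codes are hypothetical) of why a Darwin Core Triplet can collide while a minted GUID cannot, and of keeping the physical object's GUID separate from the GUID of its metadata record:

```python
import uuid

# Best practice #1: a Darwin Core Triplet is only as unique as its parts.
# Two unrelated institutions can legitimately share an institutionCode, so
# this string could collide with a record from a different collection.
dwc_triplet = ":".join(["MVZ", "Herps", "12345"])  # hypothetical codes

# A minted GUID carries no such risk.
physical_object_guid = f"urn:uuid:{uuid.uuid4()}"

# Best practice #5: the metadata record describing the specimen gets its
# own GUID, distinct from the GUID of the physical object itself.
metadata_record_guid = f"urn:uuid:{uuid.uuid4()}"
links = {metadata_record_guid: {"describes": physical_object_guid}}
```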

A prototype collections community GUID service.
GetMyGUID Service - “Promoting GUID Standard Design Practices”.  We have blathered on long enough here, but want to give a hint of where we are going.  We want to create a service built by natural history collections folks (and our computer science friends) for natural history collections folks, one that taps into existing goodness already created.  That is, we tap into the existing EZID services but then further develop a service that encodes best practices that work in this community.  In the near future, we are going to explain how the service works, how you can access it, and why it does what it does.  We know how hard it is to get folks to make updates and additions to their databases, so we want to find out how to get over that barrier!  We want to find those early adopters (and hint hint, we are working with BiSciCol partners already to get this ball rolling!).  So, more soon.  Pass the word along!


- John Deck, Rob Guralnick, and Nico Cellinese

12 comments:

  1. This sounds very encouraging. As always, a few comments.

    Regarding DOIs and ARKs, without wishing to focus on details, I'd much rather you go for DOIs as the default identifier. People "get" DOIs, plus there is a well known global resolver for them (http://dx.doi.org). ARKs don't have this infrastructure, and the ARKs I can find in the wild aren't recognised by http://n2t.net. The Gallica archive makes extensive use of them (http://gallica.bnf.fr/ark:/12148/cb34349289k/date, for example), yet http://n2t.net pleads ignorance about these identifiers. Nor does it know anything about Internet Archive ARKs such as ark:/13960/t00z7hg8f. My concern here is that ARKs are reminiscent of LSIDs in that there's little obligation to make them actually work (they can be minted without any commitment to resolution). Once that happens, you are screwed. This, and the lack of a global ARK resolver, is, for me, a deal breaker.

    I like the emphasis on identifying the specimen, not the metadata. This is equivalent to CrossRef insisting that DOIs identify articles, and not the metadata, nor any particular physical representation of them. Had CrossRef chosen another approach (e.g., having separate identifiers for print and electronic versions) it would have been a mess.

    Regarding point 3 ("GUIDs must be assigned as close to the source as possible"), this assumes everyone has the resources to do this. One thing I like about identifiers that lack domain names (e.g., DOIs, Handles, ARKs) is that they enable an alternative approach, namely a body with more resources could assign and manage a namespace on another's behalf, until such time as the primary provider is ready to take over that namespace. Translation - someone like GBIF could mint identifiers for the contents of a museum's collection (using a unique namespace for that museum), then hand over running those identifiers to the museum at a later date. From the user's perspective nothing happens; the identifier is unchanged.

    Lastly, GUIDs are like fax machines: having just one user makes no sense. With DOIs for literature, their utility became almost immediately apparent once publishers started using them in their lists of literature cited. I suspect part of the problem for collections is that there's been no compelling display of the possible benefits of having GUIDs. For me the immediate use cases are:

    Persistent identification of records in GBIF, so that GBIF avoids the occasionally massive duplication of records I've discussed elsewhere (see "How many specimens does GBIF really have?").

    Reuse of GUIDs by GBIF makes it trivial to transmit corrections back to providers (GBIF needs to do something better than emails for this).

    Collections could track citations in the literature (and GenBank) to their holdings. This task requires matching museum specimen codes to newly minted GUIDs, but would be doable (if tedious). I've done a little work on this using GBIF occurrence ids as a proxy for specimen GUIDs, but became frustrated by their lack of stability.


    My concern is that unless GUIDs give immediate, tangible benefits people will ask (quite rightly) why they went through all the hassle of adopting them. And I suspect the big benefits will come from getting these GUIDs used by GBIF and GenBank in the first place, then cited in the primary literature.

    Replies
    1. This reply is from John Kunze at the California Digital Library. I am posting on his behalf.
      ___________________________________________________________

      There are lots of places to jump in, but first I'd like to clear up some confusion about ARKs and DOIs.

      Global vs local resolution of ARKs is a choice. Unlike other schemes, ARK decouples syntax from resolution mechanism. It is thus a feature that you may choose to mint, maintain, and resolve them without belonging to a membership organization or paying fees. OTOH, a global resolver designed for ARKs and other kinds of ids is n2t.net (Name-to-Thing). While the EZID service uses n2t.net, the National Library of France and the Internet Archive have not sought out global resolution with n2t.net.

      The majority of the world's DOIs up until now have a perceived level of quality that is due to the well-funded and well-organized community that CrossRef has created. Not surprisingly, there is no magic bullet for keeping ids persistent -- CrossRef puts a lot of effort into policing broken ids by reminding and chastising wayward publishers. Newer DOI registration agencies, such as

      + DataCite (DOIs for research datasets -- where EZID gets DOIs) and
      + EIDR (DOIs for film and television inventory, including porn)

      are now trying to build their own communities. They have different kinds of content from CrossRef and will be responsible for maintaining their own persistence. With no policing, for example, one would expect DOIs to break at a rate similar to ordinary URLs. The director of strategic initiatives at CrossRef, Geoffrey Bilder, says the important base functionality for persistence is not that they be DOIs, but that they be "persistable" (that is, redirectable). He also expects the perception of DOI quality to change rapidly and he is exhorting CrossRef members to start saying "CrossRef DOI" instead of just "DOI". DataCite also wishes to maintain high quality DOIs and is putting together link validation mechanisms to try to enforce this.

  2. Rod says: "Regarding DOIs and ARKs, without wishing to focus on details, I'd much rather you go for DOIs as the default identifier. People "get" DOIs, plus there is a well known global resolver for them (http://dx.doi.org). ARKs don't have this infrastructure... My concern here is that ARKs are reminiscent of LSIDs in that there's little obligation to make them actually work (they can be minted without any commitment to resolution). Once that happens, you are screwed. This, and the lack of a global ARK resolver, is, for me, a deal breaker."

    Response: Rod, thanks for the great response. I am nearly completely in agreement re: DOIs versus ARKs. It's the simplicity, persistence, and globalness of the resolver that really works. I think most people just want to pop DOIs into their URL bar and resolve to the primary object. John Kunze makes the good point, however, that ARKs might have their uses. Because you can mix and match DOIs and ARKs, one possibility is to assign DOIs to datasets and ARKs to individual records - ARKs are built to be more granular and have these pass-through functions. Both are supported long term by the California Digital Library. There is some concern that assigning DOIs to 450 million data points might be problematic, given that there are only 60 million DOIs out there right now. Maybe we are wailing and gnashing teeth for no reason.

    As for point 3, man, we are totally with you. We have thought about this A TON and have had conversations with some aggregators about just this idea: assign GUIDs at the aggregator, and then hand these back to the museum to take over. In fact, this is a service aggregators could provide. There are some among the BiSciCol contingent who very seriously believe that collections managers and curators will be very reluctant to add fields to their databases unless they understand what the value proposition is for having GUIDs. I think you touch on this as well, and I really like these use cases. Maybe one thing we can do is develop a huge set of those use cases and make them realllly clear to the collections community. We have a set of use cases we've been putting together as well.

    One thing we have been doing on the BiSciCol end is disambiguating _tracking_, which involves putting GUIDs on physical specimens and then developing the means to link up the physical specimens with their downstream derivatives (metadata records, images, sequences), from _discovery_, where linkages were broken in the past and you want to find them again. For example, if two different specimens in two different museum collections came from the same collecting event, can you reassociate them post hoc and re-establish this inferred relationship? We definitely see strong use cases on both fronts.

    Rod says: "My concern is that unless GUIDs give immediate, tangible benefits people will ask (quite rightly) why they went through all the hassle of adopting them. And I suspect the big benefits will come from getting these GUIDs used by GBIF and GenBank in the first place, then cited in the primary literature."

    Response: I TOTALLY AGREE. We have been advocating to GBIF, VertNet, and iDigBio the value of persistent, well understood, and globally resolvable GUIDs such as DOIs. I also think there is great value in early adopters showing how we - as a community - can push along new, obvious, and needed frontiers. Our approach is: let's get EZIDs (DOIs) onto specimens and records, working at the source and the aggregator, and apply principle #4 (aggregators need to honor well-curated GUIDs from providers). We also really do see the value of _helping_ people with this GUID assignment task - building simple services and actually putting some face-time into getting adoption happening at sources and aggregators. It might not be that people don't want to; more likely it is a matter of activation energy, and of finding ways to overcome that barrier.

  3. Quick comment on granularity and pass-through. For me the granularity that matters is what gets cited - if it gets cited it gets a DOI. Specimens get cited, so they should get DOIs (i.e., first-class identifiers). Let's keep things simple - if you cite a paper, a data set, a specimen, you cite the DOI.

    Regarding pass-through, this is nice, but could be done with DOIs (at the cost of minting a new DOI for each path), or perhaps the original data provider could have a service that supports suffixes for a DOI (i.e., strings appended to the DOI that might indicate parts of the thing being referred to).

    Whatever additional features ARK might have, my sense is that the traction DOIs have (e.g., CrossRef, DataCite, Dryad, Figshare) presents an opportunity to make the case for GUIDs, and that multiplying technologies dilutes that message.

    Replies
    1. Indeed Rod! But everything can potentially get cited: datasets, specimens, their images, tissues, in fact all of their derivatives and metadata. It really depends on an individual's interests or research focus. So we are talking about millions, and perhaps eventually billions(?), of DOIs - at what cost? I think the issue of how many DOIs we would eventually need came up several times in our discussion with CDL, and it seemed to us that ARK provided a workable compromise. It's a low-hanging fruit right now, but I do understand your valid points.

    2. Nico, costs can change; indeed, DOI costs have dropped a lot since registration agencies like DataCite have come along. We regard the Internet as essentially free, but clearly lots of money is being spent to make it work. Perhaps we may reach a point where identifiers like DOIs are considered part of the infrastructure and hence essentially "free". I can imagine national science funders deciding that so much publicly funded data is identified by DOIs that they need to fund its continued existence.

      It frustrates me that cost is the issue which consistently comes up when discussing DOIs. Obviously it's a consideration, but surely the primary consideration is what we want to achieve? We want persistent, resolvable identifiers that support useful services, and that people will trust and use. None of that is "free"; indeed, that DOIs cost money is one reason they are trusted (given that it costs money for a publisher to get DOIs from CrossRef, I have more faith that the content of a journal using DOIs won't simply disappear than that of a journal that simply uses URLs).

      Discussions driven by cost invariably end with choosing the "cheap" solution, at the expense of actually solving the problem. A big reason the community previously settled on LSIDs was because they were cheap, and look where that got us. DOIs imply permanence and citability; ARKs have no such implications - sure, these properties can be asserted, but that doesn't make it so. Is it not time for our field to step up to the plate and act like our digital resources matter? If we were really serious about this stuff, wouldn't we simply get on with using DOIs, and do everything we can to encourage their widespread use and citation?

    3. One thing that the California Digital Library has done that is very smart is to offer both DOIs and ARKs under their EZID approach. This doesn't settle the debate above, but it recognizes that there are alternate approaches with alternate values. I believe that John Kunze (and I will forward this debate on to him) would probably argue that ARKs are good early in the digital data life cycle, when there is value in being able to quickly create and destroy GUIDs, when things are still fluid. When _publishing_, using DOIs makes more sense - so DOIs on the physical material are absolutely the way to go, I think. And for finished digital products, I think DOIs are smart too. I am inclined to see BiSciCol build a service where there are still choices between ARKs and DOIs, but hopefully not tie these to cost models. I don't want to speak for anyone at CDL or CrossRef, but I think we can work this out in a way where absolutely everyone wins - and I don't say that lightly.

    4. Referring to "It frustrates me that cost is the issue which consistently comes up when discussing DOIs. ... DOIs imply permanence and citability...", I'd like to build a little on my earlier comments (thanks to Rob for posting them).

      There is no magic bullet for persistence, which is a sweaty, onerous service undertaking. It is true that the fees for DOIs have gone down; in particular, the per-DOI fee has been eliminated, but the owner of the DOI is completely responsible for actual persistence, which implies staffing, training, reviewing broken link reports, and applying fixes. The burden of maintaining 80 million CrossRef DOIs -- which set the current high standard for DOI quality -- is distributed across over 4000 publishers and societies, many of them well-funded. I believe we're talking about approximately 300 million specimen ids alone, with the maintenance burden spread across 200 (? wild guess) financially struggling museums. DataCite, the source of data DOIs, has fewer than 20 members, mostly from the traditionally underfunded library community. There is a vast disparity in wealth and projected DOI registration numbers between DataCite and CrossRef.

      I also mentioned that the concept of what a DOI implies is changing. The CDL (California Digital Library, which runs EZID) is a member of DataCite, which is working hard to uphold the quality of DataCite DOIs. DataCite requires that, to get a DataCite DOI, you provide a core set of 'data-as-publication'-type metadata (author, title, publisher, publication year, target URL, resource type) as well as a landing page for each 'dataset'. They are also contemplating requiring a CC0 license for all _metadata_. We have EZID customers who are challenged to provide this for 100,000 ordinary datasets. How would this play out with Natural History Collections data? Some of our customers have been surprised that the bar to obtain DOIs was so high, and that their DOIs would break if they didn't maintain the target URLs themselves. But quality isn't cheap.

      Keeping maintenance costs down is important, and the suffix-passthrough feature of our n2t.net resolver will make it possible (by December) to manage, by default, all 10,000 separately identifiable granules of a complex dataset with only one top-level registration. This will work with ARKs (and with DOIs, but we don't say so since no one thinks to resolve them there).

      "Regarding pass-though, this is nice, but could be done with DOIs (at the cost of minting a new DOI for each path), or perhaps the original data provider could have a service that supports suffixes for a DOI (i.e., strings appended to the DOI that might indicate parts of the thing being referred to)."

      Minting and maintaining a new DOI for each path presents a big maintenance burden. Unfortunately, there is no general way to pass strings along on the end of DOIs due to the way the Handle system works (which is how DOIs are resolved). There is a special "template" syntax that can be established for a subset of your DOIs by working with the Handle system folks to accomplish something similar, but I don't know if anyone uses it for DOIs. With the current infrastructure, all the granules would require separate registration and maintenance.

      ARKs also come with maintenance costs, but people can obtain them more easily than DOIs, and they tend to do so early in the life cycles of their data "products". When selected, mature, publication-like products roll out of their research, that's when folks want to obtain DOIs.
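To illustrate the suffix-passthrough idea discussed above, here is a minimal sketch of the resolver logic, assuming a resolver that stores one registered target URL per identifier; the identifiers and registry below are made up:

```python
# Minimal sketch of suffix-passthrough resolution; the registry is a
# hypothetical table mapping registered identifiers to target URLs.
registered = {
    "ark:/99999/fk4abc": "http://example.org/dataset",
}

def resolve(identifier: str) -> str:
    """Return a redirect target, passing any unregistered suffix through."""
    # Exact match: the identifier itself was registered.
    if identifier in registered:
        return registered[identifier]
    # Otherwise find a registered prefix and append the remainder to its
    # target: one registration then covers every granule beneath it.
    for prefix, target in registered.items():
        if identifier.startswith(prefix + "/"):
            return target + identifier[len(prefix):]
    raise KeyError(f"unknown identifier: {identifier}")

print(resolve("ark:/99999/fk4abc/granule/42"))
# -> http://example.org/dataset/granule/42
```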

  4. Nico and Rob articulated well where we're coming from. Just a couple of points from my perspective. Given the need to track not only specimen derivatives but also the processes acting on specimens/samples, the number of identifiers we envision is way, way past 450 million or even 1 billion.

    Great points about the necessity for global resolution. However, I still think there is a use for ARKs, because of the importance of being able to assign identifiers in the field --- so many times spreadsheets leave the field and go in two or more directions at once, leaving us to rely on heuristics or long lookup tables of identifiers to match things up later. ARKs are useful for delivering pre-minted identifiers to the field when one doesn't know the exact number of ids required (e.g., for collecting events). Later, in the lab/museum, one can assign a DOI to the actual specimen or to the dataset.

    To that end, assigning ARKs to processes and DOIs to physical material could be developed as a common practice.
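To picture the pre-minting workflow, here is a rough sketch, reusing the hypothetical minting call from the EZID example in the post above, that mints a block of ARKs before a trip and writes them to a CSV the collectors can carry offline; the account, shoulder, batch size, and target URL are all placeholders:

```python
import csv
import requests

EZID = "https://ezid.cdlib.org"
AUTH = ("username", "password")  # placeholder EZID account credentials

# Pre-mint a block of ARKs on the test shoulder; guess high, since unused
# identifiers are cheap, and running out in the field is not.
with open("field_identifiers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ark", "collecting_event"])  # event filled in offline
    for _ in range(100):
        resp = requests.post(
            f"{EZID}/shoulder/ark:/99999/fk4",
            data=b"_target: http://example.org/pending",
            headers={"Content-Type": "text/plain; charset=UTF-8"},
            auth=AUTH,
        )
        writer.writerow([resp.text.split("success:")[1].strip(), ""])
```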





  5. Canadensys now assigns DOIs to its datasets served from its IPT. See http://www.canadensys.net/2012/link-love-dois-for-darwin-core-archives. What do you recommend for next steps?

  6. Great to hear that! As for next steps, BiSciCol is now working in conjunction with the California Digital Library (CDL) on tools the community can use for assigning identifiers to data elements within datasets. These persistent identifiers will be used to track physical objects (specimens, samples), events (collecting events, identification events) or digital information (photographs, sequences). We're actively working on this now and will be demonstrating these tools next June (at iEvoBio).
