Friday, October 12, 2012

Making it 'EZ' to GUID

On Globally Unique Identifiers (again) for Natural History Collections Data: How to Stop People From Saying “You’re Doing It Wrong” (or, conversely, “Yay! We’re Doing It Right!”)
Cartoon by Gary Larson, adapted by Barry Smith in his Referent Tracking
presentation at the Semantics of Biodiversity Workshop, 2012.

The natural history collections community has been hearing about GUIDs (globally unique identifiers) for a long time.  However, what we’ve typically heard are comments like “ARGH!  These don’t work” or “These are hard to implement”; or we’ve been subjected to long “policy documents” that seem to be generated by computer scientists, not the people actually working in the collections.  So the bottom line is that it’d be nice to have some clear, short “thinky things” about GUIDs that help us understand their value a bit more clearly, and that provide a simple and clear way forward.  We want to take a stab at that here and VERY MUCH WELCOME feedback.  Lots of it.  We’ve thought a ton about it and we are ready!
 
A recurrent question we have gotten from people developing collections databases (or working at the level of aggregators such as VertNet or GBIF) is why we need to go beyond self-minted, internal GUIDs, and why GUIDs need to resolve and be persistent.  We could envision a large data aggregator such as iDigBio or GBIF keeping track of digital records by assigning UUIDs (universally unique identifiers, which are very, very easy to mint!) to them, but likely without any connection to the physical source objects stored in provider institutions, or to the same objects stored in other institutional repositories or aggregators.  Yet the ultimate value of assigning GUIDs to objects, their metadata, and derivatives is that we can track all of these back to their source and run queries that involve semantic reasoning over a robust digital landscape.  In such a landscape, answering the challenging core questions generated by collaborative projects becomes possible.  The digitization process thereby acquires a much deeper meaning and value: it goes beyond straightforward data capture and moves towards an environment where we can track relationships among physical objects, annotations, and descriptive metadata as part of a global network.  If as a community we agree on the benefit of assigning GUIDs, this is the opportunity to build a strategy that adds long-term value to the effort.  In other words, if we need to invest our resources, let’s do it in ways that deliver benefit now and in the future.
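As a minimal illustration of just how easy UUIDs are to mint, the Python snippet below creates one locally, with no registration authority involved. Note what it does not give you: nothing about the bare identifier tells you where to resolve it, which is exactly the limitation discussed above.

```python
import uuid

# Minting a UUID is a purely local operation: no authority, no fees,
# and collisions are astronomically unlikely.
record_id = uuid.uuid4()
print(record_id)  # e.g. 6f1c2b9e-0d3a-4d7e-9b1a-2c5e8f4a7d10

# The trade-off: a bare UUID carries no resolution mechanism, so nothing
# connects it back to the physical object or to copies held elsewhere.
```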

A big question is how best to implement such a vision.  GUID implementations within our community have proven problematic, as evidenced by 20% of Rod Page’s blog posts.  After much vetting of possible solutions, we believe the right answer is to leverage existing expertise in developing not just GUIDs, but a set of services built around them.  In particular, we have talked to the California Digital Library (CDL) about EZIDs, and about the value of using EZIDs given that they elegantly solve a lot of community needs at once and nicely position us for the future.  Speaking of community needs, the solution we advocate is not just “go get EZIDs”.  BiSciCol was funded, in part, to work with the community and foster the implementation of GUIDs as a necessary step towards bringing our digital resources into a Linked Open Data framework.  BiSciCol wants to build out services that support the community, working with CDL and you, to make that happen.

What are EZIDs and why do we love them?
As we mentioned in a previous blog post (http://biscicol.blogspot.com/2012/08/the-biscicol-project-team-has-had-busy.html), CDL has developed EZIDs, which are flexible GUIDs built off of DOIs and ARKs.  The big win is that there are a bunch of CDL services already developed to help with minting these GUIDs and to ensure that they are resolvable, linkable, persistent, and sustainable over the long term.  EZIDs have some lovely features, including the flexibility to be associated with datasets and objects through the whole digital data life cycle.  Also, EZIDs allow us to mix and match DOIs, which are well understood and used in the publishing community, with ARKs, which were developed in the archives, library, and museum community and provide a bit more flexibility, including the ability to be assigned at a more granular level to individual data objects rather than datasets.  For more details, see John Kunze’s PowerPoint presentation on EZIDs.  We can work with CDL and their EZID system to build a prototype collections community GUID service.
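To give a flavor of what “minting services” means in practice, here is a rough sketch of a mint request against the EZID REST interface. The account credentials, specimen URL, and metadata values are placeholders, not a working recipe; a real integration would use an institutional EZID account and shoulder.

```python
import requests

EZID = "https://ezid.cdlib.org"
AUTH = ("username", "password")  # placeholder EZID account credentials

# EZID metadata is sent as ANVL-style "key: value" lines; _target is the
# URL the minted identifier will redirect to.
metadata = (
    "_target: http://example.org/specimens/12345\n"
    "erc.what: herpetology specimen record"
)

# POSTing to a shoulder asks EZID to mint a new identifier under it;
# ark:/99999/fk4 is EZID's documented ARK test shoulder.
resp = requests.post(
    f"{EZID}/shoulder/ark:/99999/fk4",
    data=metadata.encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
    auth=AUTH,
)
print(resp.text)  # e.g. "success: ark:/99999/fk4xxxxx"
```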

So you are thinking to yourself... how much does it cost?  The answer is: nothing to you, very little to BiSciCol, and ultimately far less than what has already been spent, in people-hours, trying to sort through this very complex landscape and develop home-grown solutions.  Sustainability has costs --- and the goal is to scale those down, by leveraging economies of scale, to the point where they are orders of magnitude lower than they have been before.  We do that with this solution.  Big win.

Our View on Best Practices:

  1. GUIDs must be globally unique.  The “Darwin Core Triplet” (institutionCode:collectionCode:catalogNumber) might not be good enough, because its parts are not guaranteed to be unique across institutions (see the sketch after this list).
  2. GUIDs must be persistent.  Most projects generating GUIDs have < 10 year lifespans.  Having persistent GUIDs means we need to think about strategies for resolution services (if required) that have a > 10 year lifespan and in the context of an institution that is designed to be persistent.
  3. GUIDs must be assigned as close to the source as possible.  For example, if data is collected in the field, the identifier for that data needs to be assigned in the field and attached to the field database with ownership initially stated by the maintainers of that database.  For existing data, assignment can be made in the source database.
  4. GUIDs must propagate downstream to other systems.  Creating new GUIDs in warehouses that duplicate existing ones is bad practice; aggregators need to honor well-curated GUIDs from providers.
  5. Don’t conflate GUIDs for physical material with GUIDs for metadata about that physical material.  We promote assigning a unique GUID to the physical material; metadata about that material gets a separate GUID (again, see the sketch after this list).  While physical object IDs can be any type of GUID, we recommend EZIDs: they are short, unique, opaque, and resolved by a persistent entity, and redirection to metadata can be stored with the identifier itself.  UUIDs can be used for this purpose as well, BUT they are not as robust as EZIDs, since they lack built-in redirection and resolution and so require local solutions (see #2 above for the problems with such solutions).
  6. GUIDs need to be attached in a meaningful way to semantic services.  Knowing whether a GUID refers to an information artifact, a process, or a physical thing is vital to interpreting its relationships to other GUIDs and to informing aggregators how to interpret content.
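To make #1 and #5 concrete, here is a minimal sketch (the specimen codes are hypothetical) of why a Darwin Core Triplet can collide while a minted GUID cannot, and of keeping the physical object's GUID separate from the GUID of its metadata record:

```python
import uuid

# Best practice #1: a Darwin Core Triplet is only as unique as its parts.
# Two unrelated institutions can legitimately share an institutionCode, so
# this string could collide with a record from a different collection.
dwc_triplet = ":".join(["MVZ", "Herps", "12345"])  # hypothetical codes

# A minted GUID carries no such risk.
physical_object_guid = f"urn:uuid:{uuid.uuid4()}"

# Best practice #5: the metadata record describing the specimen gets its
# own GUID, distinct from the GUID of the physical object itself.
metadata_record_guid = f"urn:uuid:{uuid.uuid4()}"
links = {metadata_record_guid: {"describes": physical_object_guid}}
```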

A prototype collections community GUID service.
GetMyGUID Service - “Promoting GUID Standard Design Practices”.  We have blathered on long enough here, but want to give a hint of where we are going.  We want to create a service built by natural history collections folks (and our computer science friends) for natural history collections folks, one that taps into existing goodness already created.  That is, we tap into the existing EZID services but then further develop a service that encodes best practices that work in this community.  In the near future, we are going to explain how the service works, how you can access it, and why it does what it does.  We know how hard it is to get folks to make updates and additions to their databases, so we want to find out how to get over that barrier!  We want to find those early adopters (and hint hint, we are working with BiSciCol partners already to get this ball rolling!).  So, more soon.  Pass the word along!


- John Deck, Rob Guralnick, and Nico Cellinese

12 comments:

  1. This sounds very encouraging. As always, a few comments.

    Regarding DOIs and ARKs, without wishing to focus on details, I'd much rather you go for DOIs as the default identifier. People "get" DOIs, plus there is a well known global resolver for them (http://dx.doi.org). ARKs don't have this infrastructure, and the ARKs I can find in the wild aren't recognised by http://n2t.net. The Gallica archive makes extensive use of them (http://gallica.bnf.fr/ark:/12148/cb34349289k/date, for example), yet http://n2t.net pleads ignorance about these identifiers. Nor does it know anything about Internet Archive ARKs such as ark:/13960/t00z7hg8f. My concern here is that ARKs are reminiscent of LSIDs in that there's little obligation to make them actually work (they can be minted without any commitment to resolution). Once that happens, you are screwed. This, and the lack of a global ARK resolver, is, for me, a deal breaker.

    I like the emphasis on identifying the specimen, not the metadata. This is equivalent to CrossRef insisting that DOIs identify articles, and not the metadata, nor any particular physical representation of them. Had CrossRef chosen another approach (e.g., having separate identifiers for print and electronic versions) it would have been a mess.

    Regarding point 3 ("GUIDs must be assigned as close to the source as possible"), this assumes everyone has the resources to do this. One thing I like about identifiers that lack domain names (e.g., DOIs, Handles, ARKs) is that they enable an alternative approach, namely a body with more resources could assign and manage a namespace on another's behalf, until such time as the primary provider is ready to take over that namespace. Translation - someone like GBIF could mint identifiers for the contents of a museum's collection (using a unique namespace for that museum), then hand over running those identifiers to the museum at a later date. From the user's perspective nothing happens; the identifier is unchanged.

    Lastly, GUIDs are like fax machines: having just one user makes no sense. With DOIs for literature, their utility became almost immediately apparent once publishers started using them in their lists of literature cited. I suspect part of the problem for collections is that there's been no compelling display of the possible benefits of having GUIDs. For me the immediate use cases are:

    Persistent identification of records in GBIF, so that GBIF avoids the occasionally massive duplication of records I've discussed elsewhere (see "How many specimens does GBIF really have?").

    Reuse of GUIDs by GBIF makes it trivial to transmit corrections back to providers (GBIF needs to do something better than emails for this).

    Collections could track citations in the literature (and GenBank) to their holdings. This task requires matching museum specimen codes to newly minted GUIDs, but would be doable (if tedious). I've done a little work on this using GBIF occurrence ids as a proxy for specimen GUIDs, but became frustrated by their lack of stability.


    My concern is that unless GUIDs give immediate, tangible benefits people will ask (quite rightly) why they went through all the hassle of adopting them. And I suspect the big benefits will come from getting these GUIDs used by GBIF and GenBank in the first place, then cited in the primary literature.

    Replies
    1. This reply is from John Kunze at the California Digital Library. I am posting on his behalf.
      ___________________________________________________________

      There are lots of places to jump in, but first I'd like to clear up some confusion about ARKs and DOIs.

      Global vs local resolution of ARKs is a choice. Unlike other schemes, ARK decouples syntax from resolution mechanism. It is thus a feature that you may choose to mint, maintain, and resolve them without belonging to a membership organization or paying fees. OTOH, a global resolver designed for ARKs and other kinds of ids is n2t.net (Name-to-Thing). While the EZID service uses n2t.net, the National Library of France and the Internet Archive have not sought out global resolution with n2t.net.

      The majority of the world's DOIs up until now have a perceived level of quality that is due to the well-funded and well-organized community that CrossRef has created. Not surprisingly, there is no magic bullet for keeping ids persistent -- CrossRef puts a lot of effort into policing broken ids by reminding and chastising wayward publishers. Newer DOI registration agencies, such as

      + DataCite (DOIs for research datasets -- where EZID gets DOIs) and
      + EIDR (DOIs for film and television inventory, including porn)

      are now trying to build their own communities. They have different kinds of content from CrossRef and will be responsible for maintaining their own persistence. With no policing, for example, one would expect DOIs to break at a rate similar to ordinary URLs. The director of strategic initiatives at CrossRef, Geoffrey Bilder, says the important base functionality for persistence is not that they be DOIs, but that they be "persistable" (that is, redirectable). He also expects the perception of DOI quality to change rapidly and he is exhorting CrossRef members to start saying "CrossRef DOI" instead of just "DOI". DataCite also wishes to maintain high quality DOIs and is putting together link validation mechanisms to try to enforce this.

  2. Rod says: "Regarding DOIs and ARKs, without wishing to focus on details, I'd much rather you go for DOIs as the default identifier. People "get" DOIs, plus there is a well known global resolver for them (http://dx.doi.org). ARKs don't have this infrastructure... My concern here is that ARKs are reminiscent of LSIDs in that there's little obligation to make them actually work (they can be minted without any commitment to resolution). Once that happens, you are screwed. This, and the lack of a global ARK resolver, is, for me, a deal breaker."

    Response: Rod, thanks for the great response. I am nearly completely in agreement re: DOIs versus ARKs. It's the simplicity, persistence, and globalness of the resolver that really works. I think most people just want to pop DOIs into their URL bar and resolve to the primary object. John Kunze makes the good point, however, that ARKs might have their uses. Because you can mix and match DOIs and ARKs, one possibility is to assign DOIs to datasets and ARKs to individual records - ARKs are built to be more granular and have these pass-through functions. Both are supported long term by the California Digital Library. There is some concern that assigning DOIs to 450 million data points might be problematic, given that there are only 60 million DOIs out there right now. Maybe we are wailing and gnashing teeth for no reason.

    As for point 3, man, we are totally with you. We have thought about this A TON and have had conversations with some aggregators about just this idea: assign GUIDs at the aggregator, and then hand these back to the museum to take over. In fact, this is a service aggregators could provide. There are some among the BiSciCol contingent who very seriously believe that collections managers and curators will be very reluctant to add fields to their databases unless they understand what the value proposition is for having GUIDs. I think you touch on this as well, and I really like these use cases. Maybe one thing we can do is develop a huge set of those use cases and make them realllly clear to the collections community. We have a set of use cases we've been putting together as well.

    One thing we have been doing on the BiSciCol end is disambiguating _tracking_, which involves putting GUIDs on physical specimens and then developing the means to link up the physical specimens with their downstream derivatives (metadata records, images, sequences), from _discovery_, where linkages were broken in the past and you want to find them again. For example, if two different specimens in two different museum collections came from the same collecting event, can you reassociate them post hoc and re-establish this inferred relationship? We definitely see strong use cases on both fronts.

    Rod says: "My concern is that unless GUIDs give immediate, tangible benefits people will ask (quite rightly) why they went through all the hassle of adopting them. And I suspect the big benefits will come from getting these GUIDs used by GBIF and GenBank in the first place, then cited in the primary literature."

    Response: I TOTALLY AGREE. We have been advocating to GBIF, VertNet, and iDigBio the value of persistent, well understood, and globally resolvable GUIDs such as DOIs. I also think there is great value in early adopters showing how we - as a community - can push along new, obvious, and needed frontiers. Our approach is: let's get EZIDs (DOIs) onto specimens and records, working at the source and the aggregator, and apply principle #4 (aggregators need to honor well-curated GUIDs from providers). We also really do see the value of _helping_ people with this GUID assignment task - building simple services and actually putting some face-time into getting adoption happening at sources and aggregators. It might not be that people don't want to; more likely it is a matter of activation energy, and of finding ways to overcome that barrier.

  3. Quick comment on granularity and pass-through. For me the granularity that matters is what gets cited - if it gets cited it gets a DOI. Specimens get cited, so they should get DOIs (i.e., first-class identifiers). Let's keep things simple - if you cite a paper, a data set, a specimen, you cite the DOI.

    Regarding pass-through, this is nice, but could be done with DOIs (at the cost of minting a new DOI for each path), or perhaps the original data provider could have a service that supports suffixes for a DOI (i.e., strings appended to the DOI that might indicate parts of the thing being referred to).

    Whatever additional features ARK might have, my sense is that the traction DOIs have (e.g., CrossRef, DataCite, Dryad, Figshare) presents an opportunity to make the case for GUIDs, and that multiplying technologies dilutes that message.

    Replies
    1. Indeed Rod! But everything can potentially get cited: datasets, specimens, their images, tissues, in fact all of their derivatives and metadata. It really depends on an individual's interests or research focus. So we are talking about millions, and perhaps eventually billions(?), of DOIs - at what cost? I think the issue of how many DOIs we would eventually need came up several times in our discussion with CDL, and it seemed to us that ARK provided a workable compromise. It's a low-hanging fruit right now, but I do understand your valid points.

    2. Nico, costs can change; indeed, DOI costs have dropped a lot since registration agencies like DataCite have come along. We regard the Internet as essentially free, but clearly lots of money is being spent to make it work. Perhaps we may reach a point where identifiers like DOIs are considered part of the infrastructure and hence essentially "free". I can imagine national science funders deciding that so much publicly funded data is identified by DOIs that they need to fund its continued existence.

      It frustrates me that cost is the issue which consistently comes up when discussing DOIs. Obviously it's a consideration, but surely the primary consideration is what we want to achieve? We want persistent, resolvable identifiers that support useful services, and that people will trust and use. None of that is "free"; indeed, that DOIs cost money is one reason they are trusted (given that it costs money for a publisher to get DOIs from CrossRef, I have more faith that the content of a journal using DOIs won't simply disappear than that of a journal that simply uses URLs).

      Discussions driven by cost invariably end with choosing the "cheap" solution, at the expense of actually solving the problem. A big reason the community previously settled on LSIDs was because they were cheap, and look where that got us. DOIs imply permanence and citability; ARKs have no such implications - sure, these properties can be asserted, but that doesn't make it so. Is it not time for our field to step up to the plate and act like our digital resources matter? If we were really serious about this stuff, wouldn't we simply get on with using DOIs, and do everything we can to encourage their widespread use and citation?

    3. One thing that the California Digital Library has done that is very smart is to offer both DOIs and ARKs under their EZID approach. This doesn't settle the debate above, but it recognizes that there are alternate approaches with alternate values. I believe that John Kunze (and I will forward this debate on to him) would probably argue that ARKs are good early in the digital data life cycle, when there is value in being able to quickly create and destroy GUIDs, when things are still fluid. When _publishing_, using DOIs makes more sense - so DOIs on the physical material are absolutely the way to go, I think. And for finished digital products, I think DOIs are smart too. I am inclined to see BiSciCol build a service where there are still choices between ARKs and DOIs, but hopefully not tie these to cost models. I don't want to speak for anyone at CDL or CrossRef, but I think we can work this out in a way where absolutely everyone wins - and I don't say that lightly.

    4. Referring to "It frustrates me that cost is the issue which consistently comes up when discussing DOIs. ... DOIs imply permanence and citability...", I'd like to build a little on my earlier comments (thanks to Rob for posting them).

      There is no magic bullet for persistence, which is a sweaty, onerous service undertaking. It is true that the fees for DOIs have gone down; in particular, the per-DOI fee has been eliminated, but the owner of the DOI is completely responsible for actual persistence, which implies staffing, training, reviewing broken link reports, and applying fixes. The burden of maintaining 80 million CrossRef DOIs -- which set the current high standard for DOI quality -- is distributed across over 4000 publishers and societies, many of them well-funded. I believe we're talking about approximately 300 million specimen ids alone, with the maintenance burden spread across 200 (? wild guess) financially struggling museums. DataCite, the source of data DOIs, has fewer than 20 members, mostly from the traditionally underfunded library community. There is a vast disparity in wealth and projected DOI registration numbers between DataCite and CrossRef.

      I also mentioned that the concept of what a DOI implies is changing. The CDL (California Digital Library, which runs EZID) is a member of DataCite, which is working hard to uphold the quality of DataCite DOIs. DataCite requires that, to get a DataCite DOI, you provide a core set of 'data-as-publication'-type metadata (author, title, publisher, publication year, target URL, resource type) as well as a landing page for each 'dataset'. They are also contemplating requiring a CC0 license for all _metadata_. We have EZID customers who are challenged to provide this for 100,000 ordinary datasets. How would this play out with Natural History Collections data? Some of our customers have been surprised that the bar to obtain DOIs was so high, and that their DOIs would break if they didn't maintain the target URLs themselves. But quality isn't cheap.

      Keeping maintenance costs down is important, and the suffix-passthrough feature of our n2t.net resolver will make it possible (by December) to manage, by default, all 10,000 separately identifiable granules of a complex dataset with only one top-level registration. This will work with ARKs (and with DOIs, but we don't say so since no one thinks to resolve them there).

      "Regarding pass-though, this is nice, but could be done with DOIs (at the cost of minting a new DOI for each path), or perhaps the original data provider could have a service that supports suffixes for a DOI (i.e., strings appended to the DOI that might indicate parts of the thing being referred to)."

      Minting and maintaining a new DOI for each path presents a big maintenance burden. Unfortunately, there is no general way to pass strings along on the end of DOIs due to the way the Handle system works (which is how DOIs are resolved). There is a special "template" syntax that can be established for a subset of your DOIs by working with the Handle system folks to accomplish something similar, but I don't know if anyone uses it for DOIs. With the current infrastructure, all the granules would require separate registration and maintenance.

      ARKs also come with maintenance costs, but people can obtain them more easily than DOIs, and they tend to do so early in the life cycles of their data "products". When selected, mature, publication-like products roll out of their research, that's when folks want to obtain DOIs.
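To illustrate the suffix-passthrough idea discussed above, here is a minimal sketch of the resolver logic, assuming a resolver that stores one registered target URL per identifier; the identifiers and registry below are made up:

```python
# Minimal sketch of suffix-passthrough resolution; the registry is a
# hypothetical table mapping registered identifiers to target URLs.
registered = {
    "ark:/99999/fk4abc": "http://example.org/dataset",
}

def resolve(identifier: str) -> str:
    """Return a redirect target, passing any unregistered suffix through."""
    # Exact match: the identifier itself was registered.
    if identifier in registered:
        return registered[identifier]
    # Otherwise find a registered prefix and append the remainder to its
    # target: one registration then covers every granule beneath it.
    for prefix, target in registered.items():
        if identifier.startswith(prefix + "/"):
            return target + identifier[len(prefix):]
    raise KeyError(f"unknown identifier: {identifier}")

print(resolve("ark:/99999/fk4abc/granule/42"))
# -> http://example.org/dataset/granule/42
```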

  4. Nico and Rob articulated well where we're coming from. Just a couple of points from my perspective. Given the need to track not only specimen derivatives but also the processes acting on specimens/samples, the number of identifiers we envision is way, way past 450 million or even 1 billion.

    Great points about the necessity for global resolution. However, I still think there is a use for ARKs, because of the importance of being able to assign identifiers in the field --- so many times spreadsheets leave the field and go in two or more directions at once, leaving us to rely on heuristics or long lookup tables of identifiers to match things up later. ARKs are useful for delivering pre-minted identifiers to the field when one doesn't know the exact number of ids required (e.g., for collecting events). Later, in the lab/museum, one can assign a DOI to the actual specimen or to the dataset.

    To that end, assigning ARKs to processes and DOIs to physical material could be developed as a common practice.
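To picture the pre-minting workflow, here is a rough sketch, reusing the hypothetical minting call from the EZID example in the post above, that mints a block of ARKs before a trip and writes them to a CSV the collectors can carry offline; the account, shoulder, batch size, and target URL are all placeholders:

```python
import csv
import requests

EZID = "https://ezid.cdlib.org"
AUTH = ("username", "password")  # placeholder EZID account credentials

# Pre-mint a block of ARKs on the test shoulder; guess high, since unused
# identifiers are cheap, and running out in the field is not.
with open("field_identifiers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ark", "collecting_event"])  # event filled in offline
    for _ in range(100):
        resp = requests.post(
            f"{EZID}/shoulder/ark:/99999/fk4",
            data=b"_target: http://example.org/pending",
            headers={"Content-Type": "text/plain; charset=UTF-8"},
            auth=AUTH,
        )
        writer.writerow([resp.text.split("success:")[1].strip(), ""])
```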





  5. Canadensys now assigns DOIs to its datasets served from its IPT. See http://www.canadensys.net/2012/link-love-dois-for-darwin-core-archives. What do you recommend for next steps?

  6. Great to hear that! As for next steps, BiSciCol is now working in conjunction with the California Digital Library (CDL) on tools the community can use for assigning identifiers to data elements within datasets. These persistent identifiers will be used to track physical objects (specimens, samples), events (collecting events, identification events) or digital information (photographs, sequences). We're actively working on this now and will be demonstrating these tools next June (at iEvoBio).
