Thursday, December 27, 2012

BiSciCol in Four Pictures

People always say a picture is worth a thousand words. Given that, we want to present “4000+” words here. That is, below are four images and some “captions” or explanatory text that, while not fully inclusive of current efforts, give a close-to-complete update on our progress and next steps.



Figure the first.  One of the things that the BiSciCol crew has thought a lot about is how to express relationships among different kinds of physical and digital biological collection “objects”.  Our work is focused on tracking those relationships, which means following the linkages between objects as they move about on the Internet of Things (http://en.wikipedia.org/wiki/Internet_of_Things).  Early in the BiSciCol project we had exactly one relationship; a few blog posts ago we added a second predicate called “relatedTo”, which is directionless and therefore limits how searches can traverse our network.  We have now settled on what we hope is a final set of predicates, which also includes “derives_from” and “alias_of”.  “Derives_from” is important because it recognizes that properties of a biological object can be shared with its derivatives: for example, a tissue sample can be inferred to have been collected in Moorea, French Polynesia, because the specimen (whole organism) from which it derives was recorded as collected there (“derives_from” is borrowed from the Relations Ontology and defined as transitive).  Finally, “alias_of” is a way of handling duplicate identifiers for the same object.
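For the technically inclined, here is a minimal sketch of what those predicates look like as triples, using Python and rdflib. The bsc: namespace, the term spellings, and the identifiers are placeholders for illustration, not BiSciCol's published vocabulary; the point is simply that “derives_from” chains let a derivative inherit properties (like locality) from its source.

```python
# A minimal sketch of the BiSciCol predicates as RDF triples, using rdflib.
# The bsc: namespace, the term spellings, and all identifiers are placeholders.
from rdflib import Graph, Namespace, Literal

BSC = Namespace("http://example.org/biscicol/terms/")    # placeholder vocabulary
OBJ = Namespace("http://example.org/biscicol/objects/")  # placeholder identifiers

g = Graph()
g.add((OBJ.tissue1, BSC.derives_from, OBJ.specimen1))     # tissue derives from specimen
g.add((OBJ.specimen1, BSC.locality, Literal("Moorea, French Polynesia")))
g.add((OBJ.specimen1, BSC.alias_of, OBJ.specimen1_dup))   # duplicate identifier, same object
g.add((OBJ.specimen1, BSC.relatedTo, OBJ.photo1))         # directionless association

def inferred_locality(graph, node):
    """Walk derives_from links (transitive) and return the first locality found."""
    for source in graph.objects(node, BSC.derives_from):
        for loc in graph.objects(source, BSC.locality):
            return loc
        return inferred_locality(graph, source)
    return None

print(inferred_locality(g, OBJ.tissue1))  # -> "Moorea, French Polynesia"
```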



Figure the second.  We know you love technical architecture diagrams during the holidays.  Although this looks a bit complicated, let’s take it apart and discuss the various pieces, because it summarizes a lot of work we invested to deal with some challenging social and technical issues.  The diagram is built around three main components: the GetMyGUID service, the Triplifier (Simplifier), and the Triplifier repository.  The GetMyGUID service is used to mint EZIDs that can be passed directly to biocollections managers for use at the source, or that can be associated with data in the triplestore.  The Triplifier (Simplifier) is a tool for creating RDF from biocollections data and pushing it to a user via web services or into a triplestore.  We are now working out the back-end architecture to deal with storing a large number of triples.  We have developed this architecture to be flexible, simple, and grounded in an understanding of user needs (and concerns) with regard to permanent, unique identifiers and semantic web approaches.
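The back end is still a work in progress, so take the following only as one possibility we are weighing: many triplestores accept data over the SPARQL 1.1 Graph Store HTTP protocol, and pushing Triplifier output could look roughly like this sketch. The endpoint URL and graph name are placeholders, not a BiSciCol service.

```python
# One possible way Triplifier output could be pushed to a triplestore: an HTTP
# POST of N-Triples to a SPARQL 1.1 Graph Store endpoint. The endpoint URL and
# graph name below are placeholders; the BiSciCol back end is still being designed.
import requests

ntriples = (
    '<http://example.org/occurrence/1> '
    '<http://rs.tdwg.org/dwc/terms/catalogNumber> "MVZ:Mamm:12345" .\n'
)

resp = requests.post(
    "http://localhost:3030/biscicol/data",               # placeholder Graph Store endpoint
    params={"graph": "http://example.org/graph/demo"},   # named graph to receive the triples
    data=ntriples.encode("utf-8"),
    headers={"Content-Type": "application/n-triples"},
)
print(resp.status_code)  # 200/201/204 on success, depending on the store
```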



Figure the third.  The Triplifier is web-based software that takes input files and creates triples (http://en.wikipedia.org/wiki/N-Triples) from them.  The process involves multiple steps: uploading a database or a spreadsheet to the Triplifier, specifying any known joins between the uploaded tables, mapping properties in those local files to known terms in an appropriate vocabulary, relating terms using predicates, and then hitting “Triplify!”  For those not versed in ontologies and the semantic web, the whole process can be intimidating!  So we made it easier.  The Triplifier Simplifier can take any dataset in Darwin Core format, and we’ll do the work for you.  We’ll read the header rows, verify that they map to Darwin Core terms, and set everything up to triplify correctly.  Voilà!  We have a bit more work to do before the Simplifier is ready: the big challenge is taking these flat “spreadsheet” files and recreating a set of tables based on Darwin Core classes such as “occurrence”, “event”, “taxon”, etc.  We will spend more time discussing this in future blog posts!
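The Simplifier itself isn't released yet, but the first step it performs, reading the header row and sorting recognized Darwin Core terms into their classes, can be sketched in a few lines. The term-to-class mapping below is a small hand-picked subset for illustration only.

```python
# A rough sketch of the Simplifier's first step: read a flat Darwin Core
# spreadsheet's header row and group recognized columns by Darwin Core class.
# The term-to-class mapping is a small, hand-picked subset for illustration.
import csv

DWC_CLASSES = {
    "occurrenceID": "Occurrence", "catalogNumber": "Occurrence", "recordedBy": "Occurrence",
    "eventDate": "Event", "samplingProtocol": "Event",
    "decimalLatitude": "Location", "decimalLongitude": "Location",
    "scientificName": "Taxon", "family": "Taxon",
}

def check_headers(path):
    with open(path, newline="") as f:
        headers = next(csv.reader(f))
    grouped, unknown = {}, []
    for h in headers:
        cls = DWC_CLASSES.get(h)
        if cls:
            grouped.setdefault(cls, []).append(h)
        else:
            unknown.append(h)  # not a Darwin Core term we recognize
    return grouped, unknown

# Example: grouped -> {"Occurrence": ["catalogNumber"], "Taxon": ["scientificName"], ...}
```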

Figure the fourth.  This is another “in preparation” web interface for users to get Great and Useful EZIDs.  The options include pasting in a set of local identifiers, which could be a set of catalog numbers or locally specific event identifiers.  The GetMyGUID service creates a second column, minting an EZID per row linked to the local identifier.  A user can then import this right back into their database and have EZIDs on their source material.  The “Create GUIDs” link just mints a set of EZIDs for later use.  Some authentication will be required, and we might put an expiration date on how long you can wait to use them.  The last option is “Mint a DOI for your dataset”.  You basically just type in the digital object location and some key metadata, and you get a DOI that resolves to at least the metadata and links to the actual digital object.  As always, BiSciCol will accept any well-formed, valid, persistent URI identifiers supplied by clients.  We are working closely with the California Digital Library and extending their EZID API for use in this part of our project.
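Since this interface is still in preparation, here is only a sketch of the intended result: pair each pasted local identifier with a freshly minted identifier and hand back both columns for re-import. The mint_ezid() function below is a stand-in; a real implementation would call the EZID service rather than fabricate an ARK-shaped string.

```python
# A sketch of the "paste in local identifiers" workflow: pair each local catalog
# number with a newly minted identifier and write both columns back out so they
# can be re-imported into the source database. mint_ezid() is a placeholder;
# a real implementation would call the EZID API instead.
import csv

def mint_ezid(local_id):
    # Placeholder only: real EZIDs come from the EZID service, not from hashing.
    return "ark:/99999/fk4demo%08d" % (abs(hash(local_id)) % 10**8)

def attach_guids(local_ids, out_path="guids.csv"):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["localIdentifier", "ezid"])
        for local_id in local_ids:
            writer.writerow([local_id, mint_ezid(local_id)])

attach_guids(["MVZ:Mamm:12345", "MVZ:Mamm:12346"])
```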

Summary:  We end 2012 on a BiSciCol high note, and not just because the meeting was in Boulder, Colorado (because of the elevation, people!  Not the legal cannabis!).  We have made a lot of progress based on productive meetings, a lot of input from various folks, and a lot of time and effort by our talented staff of programmers, who work so hard to develop this and also canvass the community.  We should also take this opportunity to give a shout out to a new developer on the team, Tom Conlin, who is joining us as our back-end database expert.  Great to have him on board!

- John Deck, Rob Guralnick, Brian Stucky, Tom Conlin, and Nico Cellinese

Friday, October 12, 2012

Making it 'EZ' to GUID

On Global Unique Identifiers (again) for Natural History Collections Data:  How to Stop People From Saying “You’re Doing It Wrong” (or conversely, “Yay! We’re Doing It Right!”)
From Gary Larson, adapted by Barry Smith in a Referent Tracking presentation at the Semantics of Biodiversity Workshop, 2012.

The natural history collections community has been hearing about GUIDs (globally unique identifiers) for a long time.  However, what we’ve typically heard are comments like “ARGH!  These don’t work” or “These are hard to implement”; or we’ve been subjected to long “policy documents” that seem to be written by computer scientists, not the people actually working in the collections.  So the bottom line is that it’d be nice to have some clear, short “thinky things” about GUIDs that help us understand the value a bit more clearly and that provide a simple and clear way forward.  We want to take a stab at that here and VERY MUCH WELCOME feedback.  Lots of it.  We’ve thought a ton about it and we are ready!
 
A recurrent question we have gotten from people developing collections databases (or working at the level of aggregators such as VertNet or GBIF) is why we need to go beyond self-minted, internal GUIDs and why GUIDs need to resolve and be persistent.  We could envision a large data aggregator such as iDigBio or GBIF that keeps track of digital records by assigning UUIDs (universally unique identifiers, which are very, very easy to mint!) to them, but likely without any connection to the physical source objects stored in provider institutions, and without any connection to the same objects stored in other institutional repositories or aggregators.  Yet the ultimate value of assigning GUIDs to objects, their metadata, and their derivatives is that we can track all of these back to their source and run queries that apply semantic reasoning over a robust digital landscape. In such a landscape, answering the challenging core questions posed by collaborative projects becomes possible.  The digitization process therefore acquires a much deeper meaning and value: it goes beyond straightforward data capture and moves toward an environment where we can track relationships among physical objects, annotations, and descriptive metadata as part of a global network.  If as a community we agree on the benefit of assigning GUIDs, this is the opportunity to adopt a strategy that adds long-term value to the effort.  In other words, if we need to invest our resources, let’s do it in ways that bring benefit now and in the future.

A big question is how to best implement such a vision.  GUID implementations within our community have proven problematic, as evidenced by 20% of Rod Page’s blog posts.  After much vetting of possible solutions, we believe the right answer is to leverage existing expertise in developing not just GUIDs, but a set of services built around them.   In particular, we have talked to the California Digital Library (CDL) about EZIDs and the value of using them, given that they elegantly solve a lot of community needs at once and nicely position us for the future. Speaking of community needs, the solution we advocate is not just “go get EZIDs”.  BiSciCol was funded, in part, to work with the community and foster the implementation of GUIDs as a necessary step toward bringing our digital resources into a Linked Open Data framework.  BiSciCol wants to build out services that support the community, working with CDL and you, to make that happen.

What are EZIDs and why do we love them?
As we mentioned in a previous blog post (http://biscicol.blogspot.com/2012/08/the-biscicol-project-team-has-had-busy.html), CDL has developed EZIDs, which are flexible GUIDs built off of DOIs and ARKs.  The big win is that there are a bunch of CDL services already developed to help with minting these GUIDs and to ensure that they are resolvable, linkable, persistent, and sustainable over the long term.  EZIDs have some lovely features, including the flexibility to be associated with datasets and objects through the whole digital data life cycle.  EZIDs also allow us to mix and match DOIs, which are well understood and widely used in the publishing community, with ARKs, which were developed in the archives, library, and museum community and provide a bit more flexibility, including the ability to assign identifiers at a more granular level to individual data objects rather than datasets (for more details, see John Kunze’s PowerPoint presentation on EZIDs).   We can work with CDL and their EZID system to build a prototype collections community GUID service.
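For the curious, minting through EZID is just an authenticated HTTP call with a small block of metadata. The sketch below uses EZID's ARK test shoulder and a placeholder account; check CDL's EZID API documentation for the current endpoints and metadata profiles before relying on any of it.

```python
# A minimal sketch of minting an identifier with the CDL EZID API. The account
# credentials are placeholders and the shoulder shown is EZID's ARK test shoulder;
# consult the EZID API documentation for current details.
import requests

EZID = "https://ezid.cdlib.org"
AUTH = ("apitest", "apitest-password")  # placeholder credentials

metadata = "erc.who: BiSciCol\nerc.what: demo specimen record\nerc.when: 2012"

resp = requests.post(
    EZID + "/shoulder/ark:/99999/fk4",   # mint a new ARK on the test shoulder
    data=metadata.encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
    auth=AUTH,
)
print(resp.text)  # e.g. "success: ark:/99999/fk4..." when the mint works
```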

So you are thinking to yourself... how much does it cost?  The answer is: nothing to you, very little to BiSciCol, and ultimately remarkably less than what has already been spent in people-hours trying to sort through this very complex landscape and develop home-grown solutions.  Sustainability has costs, and the goal is to scale those down, by leveraging economies of scale, to the point where they are orders of magnitude lower than they have been before. We do that with this solution.  Big win.

Our View on Best Practices:

  1. GUIDs must be globally unique.  The “Darwin Core Triplet” might not be good enough.  
  2. GUIDs must be persistent.  Most projects generating GUIDs have < 10 year lifespans.  Having persistent GUIDs means we need to think about strategies for resolution services (if required) that have a > 10 year lifespan and operate in the context of an institution that is designed to be persistent.
  3. GUIDs must be assigned as close to the source as possible.  For example, if data is collected in the field, the identifier for that data needs to be assigned in the field and attached to the field database with ownership initially stated by the maintainers of that database.  For existing data, assignment can be made in the source database.
  4. GUIDs must propagate downstream to other systems.  Creating new GUIDs in warehouses that duplicate existing ones is bad practice, and thus aggregators need to honor well-curated GUIDs from providers.
  5. Don’t conflate GUIDs for physical material with GUIDs for metadata about that physical material.  We promote the assignment of unique GUIDs to physical material; metadata about physical material will have a separate GUID (see the sketch after this list).  While physical object IDs can be any type of GUID, we recommend EZIDs as they are short, unique, opaque, and resolved by a persistent entity, and redirection to metadata can be stored with the identifier itself.  UUIDs can be used for this purpose as well BUT are not as robust as EZIDs, since they lack redirection and resolution or require local solutions (see #2 above for problems with such solutions).
  6. GUIDs need to be attached in a meaningful way to semantic services.  Knowing semantically whether a GUID refers to an information artifact, a process, or a physical thing is vital to understanding how to interpret its relationships to other GUIDs expressed elsewhere, and to informing aggregators how to interpret content.
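To make point 5 concrete, here is a minimal sketch of keeping the physical object and the metadata record about it as two distinct identifiers, linked by an explicit statement. The identifiers and the bsc:describes predicate are placeholders chosen for illustration, not terms BiSciCol has settled on.

```python
# Point 5 in practice: the physical specimen and the metadata record about it get
# distinct GUIDs, linked by an explicit predicate. The ARK strings and the
# bsc:describes / bsc:catalogNumber terms are placeholders for illustration.
from rdflib import Graph, Namespace, Literal, URIRef

BSC = Namespace("http://example.org/biscicol/terms/")  # placeholder predicate namespace

physical = URIRef("http://n2t.net/ark:/99999/fk4-physical-0001")  # GUID for the physical object
record   = URIRef("http://n2t.net/ark:/99999/fk4-metadata-0001")  # GUID for its metadata record

g = Graph()
g.add((record, BSC.describes, physical))                       # the record is about the object
g.add((record, BSC.catalogNumber, Literal("MVZ:Mamm:12345")))  # metadata lives on the record
print(g.serialize(format="nt"))
```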

A prototype collections community GUID service.
GetMyGUID Service - “Promoting GUID Standard Design Practices”.  We have blathered on long enough here, but want to give a hint of where we are going.  We want to create a service that is built by natural history collections folks (and our computer science friends) for natural history collections folks, and that taps into existing goodness already created.  That is, we tap into the existing EZID services, but then further develop a service that encodes best practices that work in this community.  In the near future, we are going to explain how the service works, how you can access it, and why it does what it does.   We know how hard it is to get folks to make updates and additions to their databases, so we want to find out how to get over that barrier!  We want to find those early adopters (and hint hint, we are working with BiSciCol partners already to get this ball rolling!).  So, more soon.  Pass the word along!


- John Deck, Rob Guralnick, and Nico Cellinese

Saturday, August 25, 2012

News Update: how do we 'GUID'?

The BiSciCol project team has had a busy summer that included a presentation at the Annual Meeting of the Society for the Preservation of Natural History Collections (SPNHC) in New Haven, CT, and a presentation at the iEvoBio 2012 meeting in Ottawa, Canada.  Additionally, on 13-15 August, John Deck, Nico Cellinese, Rob Guralnick, and Neil Davies convened at the University of California, Berkeley to meet with a few key partners and discuss the next steps forward for the project (meeting summary).  Before we report more about the meeting with our partners, here is some background information.

BiSciCol's main goal is to break down the walled gardens between databases storing different kinds of biodiversity data, such as specimens or samples from collecting events and the sequences, images, etc. generated from those specimens or samples.  Doing so requires overcoming two separate community challenges. First, there must be a mechanism to associate globally unique identifiers (GUIDs) with collections records (note, we are using the RSS specification's GUID definition).  Second, the collections records must be expressed such that the terms used to define those records and their relationships are well understood by humans and computers.  This brings us into the “semantic web” and RDF “triples”.

As BiSciCol has evolved, two key questions related to these challenges have emerged.  The first is whether assigning GUIDs and creating "triples" should happen at the level of individual provider databases, or instead at the level of "aggregators" that enforce a standardized schema and encoding.  In the case of biological collections, an example of a standardized schema is Darwin Core, usually encoded into a Darwin Core Archive.  Example aggregators are GBIF, VertNet, and Map of Life. The second question is equally thorny and deals primarily with the content the identifier is describing: is the identifier describing a physical object or a digital surrogate of a physical object, and if the latter, is it a primary digital surrogate or a copy?  An example is specimen metadata attached to a photo record in Morphbank, which contains a copy of specimen metadata that in turn references a physical object.

So, let's turn back to the meeting in Berkeley. That meeting included two key partners with whom we want to further develop and test ways forward, given the two huge questions above.  We spent part of the time with the California Digital Library (CDL) folks, who have built a set of excellent tools that may be part of the solution to the problem of GUID assignment.  CDL has developed EZIDs, which are flexible GUIDs built off of DOIs and ARKs.  John Kunze from CDL gave a great rundown on EZIDs and how they work, and was kind enough to meet with us again on a couple of separate occasions, formal and informal.  Metadata encoded in the EZID itself may also be used to indicate use restrictions and provenance (see John Kunze’s PowerPoint presentation on EZIDs).

The other key partner with whom we met was VertNet, represented by Aaron Steele, the lead systems architect on the project.  The idea behind meeting with VertNet was to test out how we might do EZID assignment and triplification using the same approach by which VertNet data is processed from Darwin Core Archives into a set of tables that can be visualized, queried, and replicated.  Aaron was kind enough to participate in our hackathon and start up this process.  We set up a readme file about the hackathon to describe our expected outputs.  Yes, the project is called "Bombus", which reflects the fact that, although a bit wobbly, our goal is to have data flying around to "pollinate" other data.  Happily, the hackathon was very much a success!  We were able to tap into some existing code generated by Matt Jones (NCEAS) to mint EZIDs and VOILA, we had an output file ready for the semantic web (e.g. an output file that shows relationships between occurrences, localities, and taxa based on the EZIDs).   We weren't quite able to get to the last step of querying the results, but we're very close.  More work (and reports) will follow on this, so stay tuned via the Bombus/pollinator link above.

We have been testing a variety of solutions for identifier assignment, including: supporting user-supplied GUIDs, aggregator GUIDs, dataset DOIs, community standard identifiers (e.g. DwC Triplet), and creating QUIDs (Quasi Unique Identifiers) from hashed content.  EZID technology will play a significant role in the implementation of a number of these approaches.  None of these approaches offer a complete solution, but taken together, we can begin to build an intelligent system that can provide valuable services to aggregators, data providers, and users.  Services we will be supporting include: GUID tracking, identifying use restrictions, and GUID reappropriation.  Integrating our existing triplifier and biscicol java codebases with a scalable database back-end will fulfill most of the technical requirements needed.
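We haven't spelled out here how QUIDs are built, but the general idea of a content-hash identifier looks something like the sketch below; which fields feed the hash and the choice of SHA-256 are assumptions for illustration, not a BiSciCol specification.

```python
# A sketch of a "quasi unique identifier" (QUID) derived from hashed record content.
# The fields included and the use of SHA-256 are illustrative assumptions only.
import hashlib

def quid(record, fields=("institutionCode", "collectionCode", "catalogNumber")):
    # Concatenate the chosen fields in a fixed order so identical content
    # always hashes to the same identifier.
    content = "|".join(str(record.get(f, "")) for f in fields)
    return "quid:" + hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]

print(quid({"institutionCode": "MVZ", "collectionCode": "Mamm", "catalogNumber": "12345"}))
```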

We are still building our Triplifier to support those who want to take their own datasets and bring them into the semantic web framework, but BiSciCol can operate much more "at scale" with a very simple interface that accepts Darwin Core Archives or other standardized data, such as those generated by Barcode of Life, Morphbank, or iDigBio, and assembles these into a shared triplestore or set of commonly accessible triplestores.  We think the issues we're tackling right now are at the sociotechnical heart of BiSciCol.  We use the term heart knowingly, because it is going to be the desire and will of the community, along with resources such as BiSciCol, that can help motivate and excite, and that will get us at least moving in the right direction. If you have any thoughts, criticisms, or suggestions, we'd of course love to hear them.


- John Deck, Rob Guralnick, and Nico Cellinese

Friday, April 6, 2012

Making our System Smarter

Computers are amazing at following instructions. So amazing, in fact, that a seemingly harmless instruction can potentially lead to an entirely false conclusion.
At our recent BiSciCol meeting at the University of Florida, we had a discussion about just such a case.

At its core, BiSciCol is all about connecting objects to each other. To connect object identifiers to other objects, we have been using a simple relationship expression called “leadsTo” that indicates a direction in the relationship between one object and another. To illustrate how “leadsTo” works, let's walk through a simple example. Suppose we have a collecting event, which we join to a specimen object using our relationship predicate “leadsTo”. The specimen object could then “lead to” a taxonomic determination, which could in turn “lead to” a scientist, and so on.
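Written out as triples, that chain might look like the sketch below. The identifiers and the bsc:leadsTo predicate are placeholders that mirror the figures, not published BiSciCol terms.

```python
# The "leadsTo" chain from the example, written out as triples with rdflib.
# All identifiers and the bsc:leadsTo predicate are placeholders for illustration.
from rdflib import Graph, Namespace

BSC = Namespace("http://example.org/biscicol/terms/")
OBJ = Namespace("http://example.org/biscicol/objects/")

g = Graph()
g.add((OBJ.event1, BSC.leadsTo, OBJ.spec1))          # collecting event -> specimen
g.add((OBJ.spec1, BSC.leadsTo, OBJ["spec1-t1"]))     # specimen -> taxonomic determination
g.add((OBJ["spec1-t1"], BSC.leadsTo, OBJ.person1))   # determination -> scientist

for s, p, o in g:
    print(s, p, o)
```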

This is certainly useful, as we can express an endless chain of objects and their derivatives, even if they exist in different databases. However, what if we extended the above example just a bit further, using our “leadsTo” relationship?


Uh oh!  By successively following the leadsTo relationships, we could now erroneously conclude that spec2-t1 came from spec3! This is not good! Fortunately, there is a solution.

We realized that the directional “leadsTo” relationship simply doesn't make very much sense in some situations, such as the connection between spec3 and person1 in the diagram above. Consequently, instead of the single “leadsTo” relationship, we actually need two relationship terms: one that has a distinct direction and one that implies no direction. Two terms from the Dublin Core standard do just this and are in use currently: 1) relation (no direction) and 2) source (has direction).


In the first example above, we could avoid the problem entirely by describing the link between the taxonomic determination and the scientist as a non-directional relation. Using our new terminology, the graph would look something like this:


The computers involved in figuring out how to traverse the graph of relationships would know not to follow non-directional relationships, and we would no longer infer that spec2-t1 came from spec3. Problem solved!
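For the programmers in the audience, here is a small sketch of that traversal rule: follow only the directional predicate (dcterms:source) and refuse to cross the non-directional one (dcterms:relation). The identifiers and the direction chosen for the source edges are illustrative, mirroring the figures rather than any published BiSciCol data.

```python
# Sketch: a traversal that follows only directional links (dcterms:source) and
# never crosses non-directional ones (dcterms:relation). The identifiers and the
# direction of the source edges are illustrative only.
from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS

OBJ = Namespace("http://example.org/biscicol/objects/")

g = Graph()
g.add((OBJ["spec2-t1"], DCTERMS.source, OBJ.spec2))      # determination derived from specimen 2
g.add((OBJ["spec3-t1"], DCTERMS.source, OBJ.spec3))      # determination derived from specimen 3
g.add((OBJ["spec2-t1"], DCTERMS.relation, OBJ.person1))  # non-directional link to the scientist
g.add((OBJ["spec3-t1"], DCTERMS.relation, OBJ.person1))

def provenance(graph, node, seen=None):
    """Everything the node is derived from, following only directional edges."""
    seen = set() if seen is None else seen
    for src in graph.objects(node, DCTERMS.source):
        if src not in seen:
            seen.add(src)
            provenance(graph, src, seen)
    return seen

print(provenance(g, OBJ["spec2-t1"]))  # contains spec2, but never spec3 or person1
```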


This post, written by John Deck and Brian Stucky with input from Hilmar Lapp, Steve Baskauf, Andrea Thomer, Rob Guralnick, Lukasz Ziemba, Tim Robertson, Reed Beaman, and Nico Cellinese, summarizes a discussion that took place at the BiSciCol development meeting held on March 31, 2012, at the University of Florida.

Friday, February 17, 2012

Development Meeting, Boulder, Colorado, 2-4 February, 2012

Two Days of Triplifying

The Boulder contingent of BiSciCol hosted a short two-day "developers' meeting" that included John Deck, Brian Stucky, Lukasz Ziemba, Bryan Heidorn, and his students Alyssa Janning and Qianjin Zhang.  As luck would have it, the BiSciCol crew arrived on a blustery morning just ahead of the biggest single February snowfall on record.  It proceeded to snow from Thursday afternoon to Saturday morning without much pause, lending a surreal quality to the proceedings.  It also fubared some plans to use campus meeting facilities, since that Friday was the first snow day in many a moon.  Enough about the weather!  Let's talk about what we did!

All participants were very pleased with the outcome of this meeting. Brian had been hard at work developing a generic plug-in interface so that anyone can write some simple code to connect whatever kinds of record sets they have and begin an import into the Triplifier.  OH! WAIT.  WAIT.  First things first!  What the heck is the Triplifier, you ask?  And why do we think this Triplifier is such a good idea?

The BiSciCol project works by linking data across data sources based on logical relationships, independent of any particular implementation. This is a different kettle of fish compared to a data standard. BiSciCol works where standards stop.  We particularly want to, for example, represent how a sequence is related to a specimen, which is related to an event and a location.  The problem is creating a common "format" for expressing simple relationships and then using a set of those simple relationships to build more complex "graphs". So what is that "common format"?  In the world of the Resource Description Framework (RDF), the format is called a "triple".

A triple is not that complicated; it basically expresses a unique fact about how things are related to one another.  The format of a triple is subject - predicate - object (thus the "triple": three pieces of data). The triple format is not all that different from what is expressed in a database, spreadsheet, or other structured data document.  The relationships that allow joins to happen in relational databases are, in theory, very similar.  So similar that one can convert a database or other document into triples.  And that is the point and value of a "Triplifier": a way to convert any set of documents into triples so that we can begin to compile a larger set of resources.
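To make that concrete, here is what "triplifying" a single spreadsheet row by hand might look like with Python and rdflib. The row contents, the record URI, and the particular Darwin Core terms are illustrative, not output from the Triplifier itself.

```python
# "Triplifying" one spreadsheet row by hand: each populated column becomes a
# subject - predicate - object statement. The row, the record URI, and the
# Darwin Core terms chosen are illustrative only.
from rdflib import Graph, Namespace, URIRef, Literal

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")  # Darwin Core term namespace

row = {"catalogNumber": "MVZ:Mamm:12345",
       "scientificName": "Peromyscus maniculatus",
       "decimalLatitude": "37.87"}

record = URIRef("http://example.org/occurrence/12345")  # placeholder GUID for this record

g = Graph()
for column, value in row.items():
    g.add((record, DWC[column], Literal(value)))  # one triple per populated column

print(g.serialize(format="nt"))
```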

So back to where we started... Brian Stucky has developed a generic plug-in for ingesting different types of data into the Triplifier, and Lukasz has used that generic plug-in to build a Darwin Core Archive ingester.  The big news, however, has to do with a platform called D2RQ.  Basically, D2RQ does the heavy lifting of representing relational databases (or your own declarations of relationships between objects) as triples.  At the heart of D2RQ are: 1) the "ClassMap", which represents classes from a schema; 2) the "PropertyBridge", which defines the properties in RDF using the class map; and 3) joins that link tables.  In a nutshell, a user can pass along a relational database (or a specification of one) with the right information about its tables and foreign keys, and dump out RDF triples.
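D2RQ's own mapping language is worth a post of its own, but the effect of a ClassMap plus a join can be sketched by hand in a few lines. The table contents, URIs, and the use of dwc:recordedBy as the linking property are invented for illustration; D2RQ expresses the same idea declaratively in a mapping file.

```python
# What a ClassMap plus a join accomplish, sketched by hand: rows in two related
# tables become resources, and the foreign key becomes a triple linking them.
# Table contents, URIs, and the dwc:recordedBy choice are invented for illustration.
from rdflib import Graph, Namespace, Literal

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")
EX = Namespace("http://example.org/")

specimens = [{"id": 1, "catalogNumber": "MVZ:Mamm:12345", "collector_id": 7}]
collectors = {7: {"id": 7, "name": "J. Grinnell"}}

g = Graph()
for spec in specimens:
    spec_uri = EX["specimen/%d" % spec["id"]]        # "ClassMap": one resource URI per row
    g.add((spec_uri, DWC.catalogNumber, Literal(spec["catalogNumber"])))
    collector = collectors[spec["collector_id"]]     # the "join" on the foreign key
    coll_uri = EX["collector/%d" % collector["id"]]
    g.add((spec_uri, DWC.recordedBy, coll_uri))      # "PropertyBridge": link the two rows
    g.add((coll_uri, EX.name, Literal(collector["name"])))

print(g.serialize(format="nt"))
```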

The good news is that we were able to test D2RQ with a very simple relational database relating collectors and specimens, to verify that we get the right outputs.  After some trial and error, and after specifying the right class maps, we succeeded in generating meaningful RDF output.  Given this, we are ready to rock and roll with developing the Triplifier fully, and will be cranking on this over the next few months.  Lukasz has already made progress on a web interface, and we are preparing to test the system with data from the Moorea Biocode and HERBIS projects.

- Rob Guralnick reports