
Thursday, December 27, 2012

BiSciCol in Four Pictures

People always say a picture is worth a thousand words. Given that, we want to present “4,000+” words here. That is, below are four images and some “captions”, or explanatory text, that, while not fully inclusive of current efforts, give a close-to-complete update on our progress and next steps.



Figure the First. One of the things the BiSciCol crew has thought a lot about is how to express relationships among different kinds of physical and digital biological collection “objects”. Our work is focused on tracking those relationships, which means following the linkages between objects as they move about on the Internet of Things (http://en.wikipedia.org/wiki/Internet_of_Things). Early in the BiSciCol project we had exactly one relationship; a few blog posts ago we expanded that by adding a second predicate, “relatedTo”, which is directionless and therefore limits how searches can traverse our network. We have now settled on what we hope is a final set of predicates, which also includes “derives_from” and “alias_of”. “Derives_from” is important because it recognizes that properties of a biological object can be shared with its derivatives: for example, a tissue sample can be inferred to have been collected in Moorea, French Polynesia, because the specimen (whole organism) it derives from was recorded as collected there (“derives_from” is borrowed from the Relations Ontology and defined as transitive). Finally, “alias_of” is a way of handling duplicate identifiers for the same object.
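For the RDF-curious, here is a minimal sketch of what these predicates look like as triples. This is illustration only: it uses Python's rdflib library, made-up identifiers, and a placeholder namespace rather than our actual output, and in practice “derives_from” would point at the Relations Ontology term rather than our stand-in.

```python
# Illustrative only: fake identifiers and a placeholder namespace, not BiSciCol output.
from rdflib import Graph, Namespace, URIRef

BSC = Namespace("http://example.org/biscicol/")  # hypothetical predicate namespace

g = Graph()
specimen  = URIRef("http://example.org/id/specimen123")       # whole organism
tissue    = URIRef("http://example.org/id/tissue456")          # sample taken from it
duplicate = URIRef("http://example.org/id/specimen123-dup")    # second ID for the same object

# The tissue derives_from the specimen, so "collected in Moorea" can be inferred
# for the tissue because the relation is transitive.
g.add((tissue, BSC.derives_from, specimen))
# Two identifiers that refer to one and the same physical object.
g.add((duplicate, BSC.alias_of, specimen))
# The directionless, catch-all relationship.
g.add((tissue, BSC.relatedTo, specimen))

print(g.serialize(format="nt"))
```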



Figure the Second. We know you love technical architecture diagrams during the holidays. Although this looks a bit complicated, let's take it apart and discuss the various pieces, because it summarizes a lot of work we invested in some challenging social and technical issues. The diagram is built on three main components: the GetMyGUID service, the Triplifier (with its Simplifier), and the Triplifier Repository. The GetMyGUID service is used to mint EZIDs that can be passed directly to biocollections managers for use at the source, or that can be associated with data in the triplestore. The Triplifier (and its Simplifier) is a tool for creating RDF from biocollections data and pushing it to a user via web services or into a triplestore. We are now working out the backend architecture for storing a large number of triples. We have designed this architecture to be flexible, simple, and grounded in user needs (and concerns) with regard to permanent, unique identifiers and semantic web approaches.
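To make the boxes and arrows a bit more concrete, here is a rough, stubbed-out sketch of the data flow among the three components. The function names, record fields, and fake ARK-style identifiers are ours for illustration; none of this is the real BiSciCol code.

```python
# Rough sketch of the diagram's data flow; all names and IDs are made up.

def mint_guids(local_ids):
    """GetMyGUID (stub): return one fake ARK-style EZID per local identifier."""
    return {lid: f"ark:/99999/fk4{i:04d}" for i, lid in enumerate(local_ids)}

def triplify(records, guid_map):
    """Triplifier (stub): turn flat records into (subject, predicate, object) triples."""
    triples = []
    for rec in records:
        subject = guid_map[rec["localID"]]
        for field, value in rec.items():
            if field != "localID":
                triples.append((subject, field, value))
    return triples

def store(triples):
    """Triple repository (stub): persist the triples (here we just print them)."""
    for t in triples:
        print(t)

records = [{"localID": "MVZ:Herp:12345", "locality": "Moorea, French Polynesia"}]
store(triplify(records, mint_guids([r["localID"] for r in records])))
```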



Figure the Third. The Triplifier is web-based software that takes input files and creates triples (http://en.wikipedia.org/wiki/N-Triples) from them. The process involves several steps: uploading a database or spreadsheet to the Triplifier, specifying any known joins between the uploaded tables, mapping properties in those local files to known terms in an appropriate vocabulary, relating terms using predicates, and then hitting “Triplify!” For those not versed in ontologies and the semantic web, the whole process can be intimidating! So we made it easier. The Triplifier Simplifier can take any dataset in Darwin Core format, and we'll do the work for you: we'll read the header rows, verify that they map to Darwin Core terms, and set everything up to triplify correctly. Voilà! We have a bit more work to do before the Simplifier is ready; the big challenge is taking these flat files (“spreadsheets”) and recreating a set of tables based on Darwin Core classes such as “Occurrence”, “Event”, and “Taxon”. We will spend more time discussing this in future blog posts!
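In the meantime, here is a toy version of the Simplifier's first step: reading a flat file's header row, checking it against Darwin Core terms, and grouping the columns by class. The term-to-class table below is a small hand-picked subset we chose for illustration, not the full vocabulary the real Simplifier will cover.

```python
# Toy header check: a hand-picked subset of Darwin Core terms, for illustration only.
import csv

DWC_CLASS_OF = {
    "occurrenceID": "Occurrence", "catalogNumber": "Occurrence",
    "eventDate": "Event", "samplingProtocol": "Event",
    "scientificName": "Taxon", "family": "Taxon",
    "decimalLatitude": "Location", "decimalLongitude": "Location",
}

def inspect_headers(csv_path):
    """Read the header row and group recognized Darwin Core terms by class."""
    with open(csv_path, newline="") as f:
        headers = next(csv.reader(f))
    grouped, unknown = {}, []
    for h in headers:
        if h in DWC_CLASS_OF:
            grouped.setdefault(DWC_CLASS_OF[h], []).append(h)
        else:
            unknown.append(h)  # would need manual mapping in the full Triplifier
    return grouped, unknown
```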

Figure the Fourth. This is another in-preparation web interface, this one for users to get Great and Useful EZIDs. One option is to paste in a set of local identifiers, which could be a set of catalog numbers or locally specific event identifiers. The GetMyGUID service creates a second column, minting one EZID per row linked to the local identifier. A user can then import this right back into their database and have EZIDs on their source material. The “Create GUIDs” link simply mints a set of EZIDs for later use; some authentication will be required, and we might put an expiration date on how long you can wait to use them. The last option is “Mint a DOI for your dataset”: you type in the digital object's location and some key metadata, and you get a DOI that resolves at least to the metadata and links to the actual digital object. As always, BiSciCol will accept any well-formed, valid, persistent identifier (URI) supplied by clients. We are working closely with the California Digital Library and extending their EZID API for use in this part of our project.
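And here is roughly what the “paste in local identifiers, get back a two-column table” option amounts to, sketched in Python. The minting step is a stub we made up for this post; a real deployment would call the GetMyGUID service (and, behind it, the EZID API), whose endpoint details are not shown here.

```python
# Sketch only: the minting step is faked; a real version would call the service.
import csv, io

def mint_ezid(local_id):
    """Stand-in for a service call that returns an ARK-style EZID."""
    return f"ark:/99999/fk4{abs(hash(local_id)) % 100000:05d}"

def add_guid_column(local_ids):
    """Pair each pasted local identifier with a freshly minted EZID, as CSV."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["localIdentifier", "EZID"])
    for lid in local_ids:
        writer.writerow([lid, mint_ezid(lid)])
    return out.getvalue()

# A user could paste this straight back into the source database.
print(add_guid_column(["MVZ:Herp:12345", "EVT-2012-117"]))
```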

Summary: We end 2012 on a BiSciCol high note, and not just because the meeting was in Boulder, Colorado (because of the elevation, people, not the legal cannabis!). We have made a lot of progress thanks to productive meetings, plenty of input from various folks, and a lot of time and effort by our talented programmers, who work so hard to develop all of this and also canvass the community. We should also take this opportunity to give a shout-out to a new developer on the team, Tom Conlin, who is joining us as our backend database expert. Great to have him on board!

- John Deck, Rob Guralnick, Brian Stucky, Tom Conlin, and Nico Cellinese