Friday, February 17, 2012

Development Meeting, Boulder, Colorado, 2-4 February, 2012

Two Days of Triplifying

The Boulder contingent of BiSciCol hosted a short two-day "developer's meeting" that included John Deck, Brian Stucky, Lukasz Ziemba, Bryan Heidorn and his students Alyssa Janning and Qianjin Zhang.  As luck would have it, the BiSciCol crew arrived on a blustery morning just ahead of the biggest single February snowfall on record.  It proceeded to snow from Thursday afternoon to Saturday morning without much pause, lending a surreal quality to the proceedings.  It also fubared some plans to use campus meeting facilities, since that Friday was the first snow day in many a moon.  Enough about the weather!  Let's talk about what we did!

All participants were very pleased with the outcome of this meeting. Brian had been hard at work developing a generic plug-in interface so that anyone can write some simple code to connect whatever kinds of record sets they have and begin an import into the Triplifier.  OH! WAIT.  WAIT.  First things first!  What the heck is Triplifier, you ask?  And why do we think this Triplifier is such a good idea?

The BiSciCol project works by linking data in data sources based on logical relationships independent of a particular implementation. This is a different kettle of fish compared to a data standard. BiSciCol works where standards stop.  We particularly want to, for example, represent how a sequence is related to a specimen which is related to an event and a location.  The problem is creating a common "format" for expressing simple relationships and then using a set of those simple ones to build more complex "graphs" of these relationships. So what is that "common format"?  In the world of the Resource Description Framework (RDF), the format is called a "triple".

A triple is not that complicated; it basically expresses a unique fact about how things are related to one another.  The format of a triple is subject - predicate - object (thus the "triple" - three pieces of data). The triple format is not all that different from what is expressed in a database or spreadsheet or other structured data document.  The set of relationships that allows joins to happen in relational databases is in theory very similar.  So similar that one can convert a database or other document into triples.  And thus the point and value of a "Triplifier" -- a way to convert any set of documents into triples so that we can begin to compile a larger set of resources.
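To make the idea concrete, here is a minimal sketch of how one row of a spreadsheet or database table might be turned into triples. It is not the Triplifier's actual code; the `http://example.org/` namespace, the `row_to_triples` helper, and the sample specimen record are all made up for illustration.

```python
# Illustrative sketch only: the namespace, helper name, and sample data
# below are assumptions, not part of BiSciCol or the Triplifier.
BASE = "http://example.org/"


def row_to_triples(table, key_column, row):
    """Turn one table row (a dict) into (subject, predicate, object) triples.

    The row's key becomes the subject; every other non-empty column
    becomes one predicate/object pair.
    """
    subject = f"{BASE}{table}/{row[key_column]}"
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the subject; skip empty cells
        predicate = f"{BASE}{table}#{column}"
        triples.append((subject, predicate, value))
    return triples


# A hypothetical specimen record, as it might appear in a spreadsheet row.
specimen = {"id": 42, "taxon": "Quercus alba", "collector_id": 7}
triples = row_to_triples("specimen", "id", specimen)

for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Each column that is not the key yields one fact about the row, which is why a table with many columns fans out into many small triples.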

So back where we started...Brian Stucky has developed a generic plug-in for ingesting different types of data into the Triplifier. And Lukasz has used that generic plug-in to build a Darwin Core Archive ingester.  The big news, however, has to do with a platform called D2RQ.  Basically, D2RQ does the heavy lifting of representing relational databases (or your own declarations of relationships between objects) as triples.  At the heart of D2RQ are: 1) the "ClassMap", which represents classes from a schema; 2) the "PropertyBridge", which defines the properties in RDF using the class map; and 3) joins that link tables.  In a nutshell, a user can pass along a relational database (or specify the relationships by hand), supply the right information about the database and its foreign keys, and dump out RDF triples.
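The three pieces listed above can be sketched conceptually in a few lines of Python. This is emphatically not D2RQ's mapping language or API (D2RQ is driven by its own mapping file); it is a toy model, assuming a made-up `http://example.org/` namespace, two hypothetical tables (`collector` and `specimen`), and invented predicate names, meant only to show how class maps, property bridges, and a foreign-key join combine to produce triples.

```python
import sqlite3

# Toy model of the class map / property bridge / join idea.
# All names and URIs here are illustrative assumptions, not D2RQ's API.
BASE = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collector (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE specimen  (id INTEGER PRIMARY KEY, taxon TEXT,
                        collector_id INTEGER REFERENCES collector(id));
INSERT INTO collector VALUES (1, 'A. Collector');
INSERT INTO specimen  VALUES (10, 'Quercus alba', 1);
""")

# "Class maps": table -> (RDF class URI, URI pattern for each row).
class_maps = {
    "collector": (BASE + "Collector", BASE + "collector/{id}"),
    "specimen": (BASE + "Specimen", BASE + "specimen/{id}"),
}

# "Property bridges": (table, column, predicate, joined table or None).
# A non-None joined table means the column is a foreign key, so the
# object becomes the URI of the referenced row instead of a literal.
property_bridges = [
    ("collector", "name", BASE + "name", None),
    ("specimen", "taxon", BASE + "taxon", None),
    ("specimen", "collector_id", BASE + "collectedBy", "collector"),
]


def dump_triples(conn):
    """Walk the mappings and emit (subject, predicate, object) triples."""
    triples = []
    # One rdf:type triple per row, from the class maps.
    for table, (class_uri, pattern) in class_maps.items():
        for (row_id,) in conn.execute(f"SELECT id FROM {table} ORDER BY id"):
            triples.append((pattern.format(id=row_id), RDF_TYPE, class_uri))
    # One triple per non-null cell, from the property bridges.
    for table, column, predicate, joined in property_bridges:
        pattern = class_maps[table][1]
        query = f"SELECT id, {column} FROM {table} ORDER BY id"
        for row_id, value in conn.execute(query):
            if value is None:
                continue
            if joined is not None:
                value = class_maps[joined][1].format(id=value)
            triples.append((pattern.format(id=row_id), predicate, value))
    return triples


triples = dump_triples(conn)
for triple in triples:
    print(triple)
```

The foreign key is what turns two separate tables into a connected graph: the specimen's `collector_id` becomes a link to the collector's own URI rather than a bare number.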

The good news is that we were able to test D2RQ with a very simple relational database that relates collectors and specimens to verify that we get the right outputs.  After some trial and error, and specifying the right class maps, we succeeded in generating meaningful RDF output.  Given this, we are ready to rock and roll with developing the Triplifier fully, and will be cranking on this over the next few months.  Lukasz has already made progress on a Web interface, and we are preparing to test the system with data from the Moorea Biocode and HERBIS projects.

- Rob Guralnick reports