BiSciCol

A sequence, a specimen and an identifier walk into a bar ….

2013-08-22T13:33:00.001-07:00

As biodiversity scientists, we rely on a breadth of data living in various domain-specific databases to assemble knowledge on particular ecoregions, taxa, or populations. We visit genetic and genomic, morphological, phylogenetic, image, and specimen databases in search of up-to-date information to answer our questions. What we find, in general, is a morass of data, disconnected between systems, as if each database were a walled garden. The main goal of BiSciCol is linking digital objects and datasets, breaking down walled gardens, and building a true network of biodiversity data where we can ask questions across domains and follow object instances wherever they are referenced.

How do we enable linking data, on a global scale, across domains? For BiSciCol, this comes down to two approaches: 1) build better ontologies and vocabularies so we can use the same language when talking about the same thing, and 2) enable identifiers that are robust enough to persist in time and allow linking across walled gardens. For this blog-post, we’ll focus on issues we have with identifiers as they’re currently used in practice, and how they can be improved to enable better linking.

To provide a point of reference for our discussion, let’s look at two databases that contain some overlap of references to the same objects: VertNet and Genbank. VertNet is a project that aggregates vertebrate specimen data housed in collections, and Genbank is a project that houses sequences, often containing references to specimen objects housed in those same museums that are part of VertNet. For our exercise in linking, we’ll use a popular method for identifying museum objects, the Darwin Core (DwC) triplet; a DwC triplet is composed of an institution code, collection code, and catalog number, separated by colons.

In the VertNet database, the DwC triplet fields are stored separately, and the DwC triplet can be constructed programmatically by joining the field values with colons. The INSDC standard, which Genbank adopts, specifies that the specimen_voucher qualifier value should be provided as the full DwC triplet: that is, the institution code, collection code, and catalog number separated by colons as one field. Since these approaches are very similar, it should be a simple task to map the VertNet format to Genbank. Harvesting all records from Genbank institutions that map to VertNet institutions, and which also have a value in the specimen_voucher field, gives us over 38,000 records. VertNet itself has over 1.4 million records (at the time we harvested data) that we also have access to. We can assume that the Genbank records containing voucher specimen information should match well with VertNet data since the institutions providing data to Genbank are the same ones that are providing data to VertNet.
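To make the construction concrete, here is a minimal sketch of building a triplet from separately stored fields. The function name, the dictionary keys, and the sample values ("INST", "Coll", "12345") are illustrative assumptions, not VertNet's actual schema or data.

```python
def dwc_triplet(institution_code, collection_code, catalog_number):
    """Join the three Darwin Core fields into one colon-separated string."""
    parts = [institution_code, collection_code, catalog_number]
    # Convert to text and trim stray whitespace before joining.
    return ":".join(str(part).strip() for part in parts)


# Hypothetical record with the three fields stored separately.
record = {
    "institutioncode": "INST",
    "collectioncode": "Coll",
    "catalognumber": "12345",
}

triplet = dwc_triplet(
    record["institutioncode"],
    record["collectioncode"],
    record["catalognumber"],
)
print(triplet)
```

The trimming step is a small design choice: it removes one source of near-miss mismatches, but it does nothing about differences in case, punctuation, or field order.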

In fact, on our first pass, we found only 483 matches using the DwC triplet method of linking. That is 1% of the potential number of matches we would expect! If we toss the collection code field and match only on institution code plus catalog number, we get 2,351 matches (6% match rate). Although we need to look a little more closely, tossing the collection code does not seem to cause collisions between collections within an institution (but in other instances it could lead to false positive links). If we combine removing collection code with a suite of data parsing tools, we can increase this a bit further to 3,153 matches (8% match rate). This is still a dismal match rate.
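The difference between strict and relaxed matching can be sketched as follows. This is not the code we ran; the parsing rule (split on colons, accept two or three parts), the function names, and the toy records are assumptions for illustration. The rate calculation uses 38,000 as the denominator, which the text gives only as a lower bound.

```python
def parse_voucher(value):
    """Split a specimen_voucher string on colons.

    Returns (institution, collection, catalog); collection is None for
    two-part values. Returns None when the value has any other shape.
    """
    parts = [part.strip() for part in value.split(":")]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        return parts[0], None, parts[1]
    return None


def match_counts(vouchers, specimen_records):
    """Count strict (full triplet) and relaxed (no collection code) matches."""
    full_keys = {(inst, coll, num) for inst, coll, num in specimen_records}
    loose_keys = {(inst, num) for inst, _, num in specimen_records}
    strict = loose = 0
    for voucher in vouchers:
        parsed = parse_voucher(voucher)
        if parsed is None:
            continue  # too noisy to parse with this simple rule
        inst, coll, num = parsed
        if coll is not None and (inst, coll, num) in full_keys:
            strict += 1
        if (inst, num) in loose_keys:
            loose += 1
    return strict, loose


def rate(matches, total):
    """Format a match rate as a percentage with one decimal place."""
    return f"{matches / total:.1%}"


# Toy data: the last voucher matches only when collection code is ignored,
# which is exactly the situation that can produce a false positive link.
specimens = [("INST", "Mamm", "100"), ("INST", "Herp", "200")]
vouchers = ["INST:Mamm:100", "INST:200", "INST 300", "INST:Bird:200"]
strict, loose = match_counts(vouchers, specimens)
print(strict, loose)

for matches in (483, 2351, 3153):
    print(matches, rate(matches, 38000))
```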

The primary reason for the low match rate is that Genbank will take any value in the specimen_voucher qualifier field and providers will respond in kind, inserting any value they choose. Consequently, field values are notoriously noisy, making them difficult to parse. VertNet data was of much higher quality for two reasons: 1) records were curated according to distinct values for institution, collection, and catalog_number, with clear instructions for recommended values for each field, and 2) there is a process of data cleaning and customer support for VertNet data publishers. However, we can see that the DwC triplet method suffers from some serious issues when used in practice. Clearly we need a better strategy if we’re going to take identifiers seriously.

Are there other options for unique identifiers? Let’s look at identifier options currently in play, and the ramifications of each; but first, let’s take a look at what RDF tells us about identifiers. Currently, the TDWG-RDF group is discussing what constitutes a valid URI for linking data on the semantic web. The only hard and fast recommendation in RDF is that the identifier must be an HTTP URI. After all, this is the semantic web, built on the world-wide web, which uses the HTTP protocol to transfer data, so what can go wrong here? Nothing, except that we must have persistence if we want to ensure identifiers are linkable in the future, and simply being an HTTP URI says nothing about persistence. It may be available today and next month, but what about in a year? In 10 years? In 50 years? Will machine negotiation be through HTTP in 50 years? There are some work-arounds to ensure long-term persistence, such as casting the identifier as a literal or using proxies to point to an HTTP resolver for identifiers. However, it’s clear that RDF by itself does not answer our need for identifier persistence. We need more specialized techniques.

So. Some strategies:

DwC Triplets: We’ve talked about this strategy here and some of the drawbacks. Also, they are not guaranteed to be globally unique, and they encode metadata into the identifier itself, which is bad practice and leads to persistence issues in the future. Worse: they are not resolvable, and they can be constructed in various, slightly different ways, leading to matching problems.

LSIDs: LSIDs (http://en.wikipedia.org/wiki/LSID) have not solved the persistence question either, and resolvers are built on good-will and volunteer effort. More backbone needs to be provided to make these strong persistent identifiers, for example by requiring identifiers to be resolvable rather than merely recommending resolution.

UUIDs: Programmers love UUIDs (http://en.wikipedia.org/wiki/Universally_unique_identifier) since they can be created instantly, are always globally unique (for all practical purposes), and can be built directly as database keys. However, by themselves we don’t know where to resolve them. A vanilla UUID sitting in the wild tells us practically nothing about the thing it is representing. Solutions advocating UUIDs can be a great option, as long as there is a plan for resolution, usually requiring another solution to be implemented along with it.
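A short sketch shows both halves of this point, using Python's standard `uuid` module. Minting is a one-liner, but the result says nothing about where to resolve it. The resolver base URL below is a made-up placeholder; some service would have to exist behind it for the link to mean anything.

```python
import uuid

# Mint a random (version 4) UUID; no registry or network call is involved.
record_id = uuid.uuid4()
print(record_id)          # different on every run
print(record_id.version)  # version number of this UUID

# The bare UUID carries no hint of where it resolves. Pairing it with a
# resolver is a separate decision; this base URL is purely hypothetical.
HYPOTHETICAL_RESOLVER = "http://example.org/resolve/"
resolvable_form = HYPOTHETICAL_RESOLVER + str(record_id)
print(resolvable_form)
```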

DOIs: DOIs (http://www.doi.org/) were designed for publications, contain built-in metadata protocols, and are used the world over by many, in fact most, publishers. There is an organization behind them, the International DOI Foundation, which is geared towards persisting for a long time. There is a network of resolvers which can resolve any officially minted DOI. DOIs are available at minimal cost through Datacite or Crossref.

EZIDs: EZIDs (http://n2t.net/ezid/) support Archival Resource Keys (ARKs) and DOIs through Datacite. By registering with EZID you can mint up to 1 million identifiers per year at a fixed rate. Subscription costs are reasonable. EZIDs are supported by the California Digital Library, which not only helps assure persistence, but also provides useful services that are hard to build into homebrew resolvers.

BCIDs: BCIDs (e.g. see http://biscicol.org/bcid/) are an extension of EZIDs, and use a hierarchical approach (using a technique called suffixPassthrough) to simultaneously resolve to dataset and record-level entries. Since identifier registration is done for groups, and extended using locally unique suffixes it enables rapid assignment of identifiers that are keyed to local databases while offering global resolution and persistence. With this solution, we can also sidestep the 1 million identifiers per year limit.

We conclude by noting that each aggregator out there seems to want to mint its own flavor of GUIDs, perhaps as much to “brand” an identifier space as for any other reason. We wonder if this strategy of proliferating such spaces is a great idea. A huge advantage of DOIs and EZIDs is abstraction. You know what they mean and how to resolve them because they are well-known and have organizations with specific missions to support identifier creation. This strategy ensures that identifiers can persist and resolve well into the future, and be recognizable not just within the biodiversity informatics community but any other community we interoperate with: genomics, publishing, ecology, earth sciences. This is what we’re talking about when we want to break down walled gardens.

-John Deck, Rob Guralnick, Nico Cellinese, and Tom Conlin

Sneak Peeks, BiSciCol Style

2013-05-20T13:59:00.000-07:00

Our blog has been quiet lately, as we coded and tested and waited out the cold, short days of winter and early Spring. With Spring now firmly here, we are ready to give you the opportunity to directly test some fruits of that labor. First, a quick review of where we have been. BiSciCol, and all those interested in bringing biodiversity data into the semantic web, has been plagued by a chicken and egg problem. In order for the semantic web to be a sensible solution, there needs to be a way to associate permanent, resolvable globally unique identifiers to specimens and their metadata. There ALSO needs to be a community-agreed semantic framework for expressing concepts and how they link together. You can't move forward without BOTH pieces and unfortunately the biodiversity community basically has had neither. So BiSciCol decided to tackle both problems simultaneously.

The solution we developed leverages one thing that was already in place --- a community developed and agreed-upon biodiversity metadata standard called the Darwin Core. We talked about how we have leveraged the Darwin Core in our last blog post, and how we have formalized Darwin Core "categories" (or classes), and derived relationships between them. With this piece of the puzzle complete, we now have a working tool called the Triplifier. The Triplifier takes a Darwin Core Archive, which contains some self-describing metadata about the document along with data, and converts those data to RDF. Darwin Core Archives are particularly useful because all the data in such archives is already in a standard form.

Darwin Core Archives are available for download from sources such as the VertNet IPT (http://ipt.vertnet.org), or the Canadensys IPT (http://data.canadensys.net/ipt/). Just download any Darwin Core Archive you want, load the archive zip file into the Triplifier (which we have yet to deploy to production, but try out the development server here: http://geomuseblade.colorado.edu/triplifier/ ) via the "File Upload:" link, click the "auto-generate project for" link, and select Darwin Core Archive. Load the file, get information about class and property structures, and then click "Get Triples" at the very end. You should then be able to save the RDF. For more information on how the DwC Archive Reader plugin works, see the related JavaDoc page.

So what does this all mean? First, this is a working tool for creating Darwin Core data in RDF format. It may not be perfect yet, but it's been stress tested, and it does the job. This is a big step forward in our opinion. We are currently Triplifying a lot of Darwin Core Archives and putting all the results into a data store for querying. Next blog post, we'll explain how valuable this can be, especially when looking for digital objects linked to specimens, such as published literature or gene sequences.

The other part of the chicken-egg problem is this persistent, and challenging, GUID problem. Here we also have a working prototype of a service we are calling BCIDs, which are a form of identifier that is scalable, persistent, and leverages community standards. BCIDs are a form of EZIDs with a couple of small tweaks to work for our community at scale. They represent a lot of hard thinking by John Kunze and John Deck. Here is the general idea: The BCID Resolution system resolves BCID identifiers that are passed through the Name-to-thing resolver (http://n2t.net/). All BCID group identifiers are registered with EZID, describing related categories of information such as Collecting Event, Occurrence, or Tissue. EZID then uses its suffix passthrough feature to pass the suffix back to the BCID resolver. At this point, a series of decisions are made based on the identifier syntax to determine how to display returned content. Element-level identifiers that have registered suffixes in the BCID system and also have targets can be resolved to a user-specified homepage. Identifiers with unregistered suffixes, identifiers with no defined target, and any identifier for which machine resolution is specifically requested will return an HTML rendering of the identifier with embedded RDF/XML syntax describing the identifier. Machine resolution can be requested for any identifier by appending a "?" to the identifier. See the diagram below for extra clarity. And check out the BCID home-page and BCID codepage.

How does this all work in practice? Suppose we have group ID = ark:/21547/Et2 (resource=dwc:Event) and do not register any elements. Now, suppose someone passes in a resolution request for ark:/21547/Et2_UUID; the system will still tell you that this is some event (dwc:Event), the date it was loaded, a title, and whether there is a DOI/ARK associated with it. Now, suppose we decide to register those UUIDs associated with ark:/21547/Et2 and also provide web pages that have some HTML content to look at (targets); then we can show a nicely formatted, human-readable page of the collecting event itself. However, what if we're a machine and we don't want to look at all the style sheets and extraneous, difficult-to-parse text? Rather, we just want to know when this record was loaded and the resourceType (regardless of whether there is a target or not). This is where "?" comes in: if the "?" is appended to the end of the ARK, as in ark:/21547/Et2_UUID?, then we automatically get RDF/XML. Minimalist but predictable, and a convention currently in use for EZIDs.
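The decision logic described above can be sketched in a few lines. This is not the BCID resolver's code: the in-memory dictionaries, the `resolve` function, the underscore as a group/suffix separator, the returned tuples, and the example target URL are all assumptions made for illustration. Only the group identifier ark:/21547/Et2 and its dwc:Event resource type come from the example in the text.

```python
# Group identifiers registered up front, mapped to their resource type.
GROUPS = {"ark:/21547/Et2": "dwc:Event"}


def resolve(identifier, groups, targets):
    """Decide how to answer a resolution request.

    targets maps (group, suffix) pairs to a web page URL. Returns
    ("redirect", url) or ("metadata", description) or ("unknown", None).
    """
    # A trailing "?" asks for machine-readable metadata.
    machine_request = identifier.endswith("?")
    if machine_request:
        identifier = identifier[:-1]

    # Find the longest registered group that prefixes the identifier.
    group = max(
        (g for g in groups if identifier.startswith(g)),
        key=len,
        default=None,
    )
    if group is None:
        return ("unknown", None)

    # Whatever follows the group is the locally unique suffix.
    suffix = identifier[len(group):].lstrip("_")
    target = targets.get((group, suffix))

    if target is not None and not machine_request:
        return ("redirect", target)
    # Unregistered suffix, no target, or machine request: describe it.
    description = {
        "group": group,
        "suffix": suffix,
        "resourceType": groups[group],
    }
    return ("metadata", description)


no_targets = {}
with_target = {("ark:/21547/Et2", "UUID"): "http://example.org/events/UUID"}

print(resolve("ark:/21547/Et2_UUID", GROUPS, no_targets))
print(resolve("ark:/21547/Et2_UUID", GROUPS, with_target))
print(resolve("ark:/21547/Et2_UUID?", GROUPS, with_target))
```

The point of the design is that only the group needs registering in advance; any suffix still gets a meaningful answer because the group's metadata applies to it.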

Soon you will be able to call the BCID service for any dataset, whether it's in RDF format or not. For datasets, one can register an ARK or DOI and associated metadata, and for more granular elements, BCIDs will help assign the pass-through suffixes. We think this represents a very elegant system for dealing with the very challenging problem of GUIDs in the biodiversity informatics community. It leverages existing tools and communities and it creates new ones needed for those involved in biocollections. If you want to try creating and using BCIDs now, talk to us and we'll work with you to get this started.

We will be presenting more about BiSciCol in meetings this Summer, at iEvoBio (http://ievobio.org/) and TDWG (http://www.tdwg.org/conference2013), showing off what amounts to solutions that cover those chickens and eggs. In the next post we'll finally link all of this up and show how it can be used for some neat discoveries. Before winding down, BiSciCol owes a gigantic thanks to Brian Stucky, who has put in a tremendous amount of effort developing the Triplifier. He is off in Panama working on his dissertation research, and will be teaching classes next Fall. We couldn't have come nearly as far as we have without him.

- Rob Guralnick, Nico Cellinese, Tom Conlin, John Deck, and Brian Stucky

BiSciCol, Triples, and Darwin Core

2013-03-12T20:07:00.001-07:00

A big part of what we want to accomplish with BiSciCol is supporting biodiversity collections data from lots of different sources. These data are often organized using a standard called "Darwin Core" (DwC), and Darwin Core-based data are commonly transmitted in a specific format known as a "Darwin Core Archive" (DwCA). So recently, we've been devoting a lot of thought and effort to figuring out how we can best support DwC and DwCAs in BiSciCol and the Triplifier. (The "Triplifier" is a tool we are building to make it easy to convert traditional, tabular data formats into RDF triples for use in BiSciCol and the Semantic Web. DwCAs are just such a format.) Representing DwC data in RDF triples and "triplifying" DwCAs presented a number of challenges, and in this post we want to discuss one of these challenges: Figuring out how to use our relations terms to capture the connections found in DwC data.

Darwin Core includes six discrete categories of information: Occurrence, Event, dcterms:Location, GeologicalContext, Identification, and Taxon. DwC does not formally describe relationships between these categories of information, though. Formally defining the relationships that join categories, or classes, of information is common practice in standards development, but DwC's developers deliberately chose not to do this in order to make the standard as flexible as possible.

Before proceeding, we should note that in the previous paragraph, we were careful to make a distinction between the words “class” and “category.” “Class” is a special word typically used to describe categories of information present in a formal ontology (which DwC is not). However, since we’re describing a method for working towards formalizing DwC content, we’ll use the word “class” hereafter to refer both to the formal model and the original DwC categories.

So, to represent DwC data as RDF triples, we needed a way to relate DwC class instances to one another. This sounds fancy, but it's really a matter of using a common-sense approach to describe relationships between entities, much as people have been doing with relational databases for decades. In fact, the darwin-sw project has already developed a complete ontology for representing DwC data in the Semantic Web. However, because BiSciCol is limited to a small set of generic relations terms, we needed a new approach for handling DwC data. Plus, by building on the core BiSciCol relations, such a solution could easily include not just DwC, but concepts from other domains such as media, biological samples, genetic material, and the environment.

To make this all a bit more concrete, let's take a look at an example. Suppose we have a single instance of Occurrence (a specimen in a collection, say) that originated from a particular collecting expedition, which is represented in DwC as an instance of the Event class. Using RDF and BiSciCol's relations predicates, how should we make the required connection between the Occurrence instance and the Event instance? More generally, how should the six core DwC classes be related to one another using BiSciCol's relations terms?

The image above illustrates our answer to this question. Recall that we are using only four relations predicates in BiSciCol: derives_from, depends_on, alias_of, and related_to (see the previous post for much more information). The diagram should be fairly self-explanatory. Some relationships are naturally described by depends_on. For example, an Identification can only exist if there is an Occurrence (e.g., a specimen) to identify and a Taxon to identify it as. On the other hand, a GeologicalContext gives us information about a collecting Event, but in at least some sense, the collecting event is independent of the geological context. Thus, the relationship between these two instances is described by related_to.
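As a rough sketch of how class-level relations turn into instance-level triples, the snippet below encodes only the three relations spelled out in the paragraph above; it is not the full set from the diagram. The function name, the tuple layout, and the `ex:` identifiers are hypothetical.

```python
# The four BiSciCol relations predicates named in the text.
PREDICATES = {"derives_from", "depends_on", "alias_of", "related_to"}

# Only the class-level relations stated explicitly in the prose.
CLASS_RELATIONS = [
    ("Identification", "depends_on", "Occurrence"),
    ("Identification", "depends_on", "Taxon"),
    ("GeologicalContext", "related_to", "Event"),
]


def instance_triples(instances, class_relations):
    """Build triples for one record.

    instances maps a class name to that record's identifier for the class.
    A relation is emitted only when both classes are present.
    """
    triples = []
    for subject_class, predicate, object_class in class_relations:
        if predicate not in PREDICATES:
            raise ValueError(f"unknown predicate: {predicate}")
        if subject_class in instances and object_class in instances:
            triples.append(
                (instances[subject_class], predicate, instances[object_class])
            )
    return triples


# Hypothetical record that has no GeologicalContext or Event.
record = {
    "Occurrence": "ex:occ1",
    "Identification": "ex:ident1",
    "Taxon": "ex:taxon1",
}
for triple in instance_triples(record, CLASS_RELATIONS):
    print(triple)
```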

So far so good, but when dealing with real data, this solution turns out to be insufficient because DwC data sets often do not include all six core classes. What should we do if a data set includes Occurrence and Taxon, but not Identification? This scenario is not uncommon, so to deal with all possibilities, we added a few more relations to handle the cases where a class (either Identification or Event) that acts as a bridge connecting Occurrence to other classes is missing. The following diagram illustrates the complete set of relations, with the dashed, gray lines representing the relations that are used if either Identification or Event is missing.

And that's it! With this set of eight relationship triples, we should be able to handle all possible combinations of the six core DwC classes.

- Brian Stucky, John Deck, Rob Guralnick, and Tom Conlin

BiSciCol in Four Pictures

2012-12-27T09:17:00.002-08:00

People always say a picture is worth a thousand words. Given that, we want to present “4000+” words here. That is, below are 4 images and some “captions” or explanatory text that, while not fully inclusive of current efforts, give a close-to-complete update on progress and our next steps.

Figure the first. One of the things that the BiSciCol crew has thought a lot about is how to express relationships among different kinds of physical and digital biological collection “objects”. Our work is focused on tracking those relationships, which means following the linkages between objects as they move about on the Internet of Things (http://en.wikipedia.org/wiki/Internet_of_Things). Early in the BiSciCol project, we had exactly one relationship, which we expanded a few blog posts ago by adding a second predicate called “relatedTo”, which is directionless and limits how searches could traverse our network. We have now settled on what we hope is a final set of predicates, which also includes “derives_from” and “alias_of”. “Derives_from” is important because it recognizes that properties of a biological object can be shared with its derivatives, such as saying that a tissue sample can be inferred to be collected in Moorea, French Polynesia because the specimen (whole organism) was defined as being collected there (“derives_from” is borrowed from the Relations Ontology and defined as transitive). Finally, “alias_of” is a way of handling duplicate identifiers for the same object.
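The tissue-sample inference can be sketched as a walk up a chain of “derives_from” links. This is only an illustration of transitivity, not BiSciCol code: the function names, the object names, and the simplifying assumption that each object derives from at most one parent are ours.

```python
def ancestors(obj, derives_from):
    """Follow derives_from links transitively, nearest ancestor first.

    derives_from maps an object to the single object it derives from.
    The membership check guards against accidental cycles.
    """
    chain = []
    current = obj
    while current in derives_from and derives_from[current] not in chain:
        current = derives_from[current]
        chain.append(current)
    return chain


def inferred_property(obj, prop, properties, derives_from):
    """Return the object's own value, or the nearest ancestor's value."""
    for candidate in [obj] + ancestors(obj, derives_from):
        value = properties.get(candidate, {}).get(prop)
        if value is not None:
            return value
    return None


# Hypothetical objects: a DNA extract from a tissue from a specimen.
links = {"dna1": "tissue1", "tissue1": "specimen1"}
props = {"specimen1": {"locality": "Moorea, French Polynesia"}}

print(ancestors("dna1", links))
print(inferred_property("dna1", "locality", props, links))
```

Only the specimen carries a locality here, yet both derivatives can answer the question of where they were collected.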

Figure The Second. We know you love technical architecture diagrams during the holidays. Although this looks a bit complicated, let’s take this apart and discuss the various parts, because it summarizes a lot of work we invested to deal with some challenging social and technical issues. This diagram is really built on three main components: the GetMyGUID service, the Triplifier Simplifier, and the Triplifier Repository. The GetMyGUID service is used to mint EZIDs that can be directly passed to biocollections managers for use at the source, or that can be associated with data in the triplestore. The Triplifier (Simplifier) is a tool for creating RDF from biocollections data, and pushing that to a user via web services or to a triplestore. We are now working out the backend architecture to deal with storing a large number of triples. We have developed this architecture to be flexible, simple, and based on understanding user needs (and concerns) with regards to permanent, unique identifiers and semantic web approaches.

Figure the third. The Triplifier is web-based software that takes input files and creates triples (http://en.wikipedia.org/wiki/N-Triples) from them. The process for doing this involves multiple steps, starting with uploading a database or a spreadsheet to the Triplifier, specifying any known joins between tables that are uploaded, mapping properties in those local files to known terms in an appropriate vocabulary, relating terms using predicates, and then hitting “Triplify!” For those not versed in ontologies and the semantic web, the whole process can be intimidating! So we made it easier. The Triplifier Simplifier can take any dataset in Darwin Core format, and we’ll do the work for you. We’ll read the header rows, verify that they map to Darwin Core terms, and set it all up to Triplify correctly. Voilà! We have a bit more work to do here before the Simplifier is ready – the big challenge is taking these flat files (“spreadsheets”) and recreating a set of tables based on Darwin Core classes such as “occurrence”, “event”, “taxon”, etc. We will spend more time discussing this in future blog posts!
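The header-checking step might look something like the sketch below. It is not the Simplifier's implementation: the small set of terms, the case-insensitive comparison, and the function name are assumptions, and a real check would cover the whole Darwin Core vocabulary. The namespace and the listed term names are genuine Darwin Core.

```python
import csv
import io

DWC_NAMESPACE = "http://rs.tdwg.org/dwc/terms/"

# A small subset of Darwin Core terms, enough for the illustration.
KNOWN_TERMS = {
    "occurrenceID",
    "institutionCode",
    "collectionCode",
    "catalogNumber",
    "scientificName",
    "eventDate",
    "decimalLatitude",
    "decimalLongitude",
}


def map_header(header):
    """Map column names to Darwin Core term URIs, ignoring case.

    Returns (mapped, unmapped): a dict of column -> URI, and a list of
    columns that did not match any known term.
    """
    lookup = {term.lower(): term for term in KNOWN_TERMS}
    mapped, unmapped = {}, []
    for column in header:
        term = lookup.get(column.strip().lower())
        if term is not None:
            mapped[column] = DWC_NAMESPACE + term
        else:
            unmapped.append(column)
    return mapped, unmapped


# Hypothetical flat file with one column that is not a Darwin Core term.
sample = "catalogNumber,ScientificName,collector_notes\n1,Aus bus,none\n"
header = next(csv.reader(io.StringIO(sample)))
mapped, unmapped = map_header(header)
print(mapped)
print(unmapped)
```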

Figure the Fourth. This is another “in preparation” web interface for users to get Great and Useful EZIDs. The options for doing so include pasting in a set of local identifiers, which could be a set of catalog numbers or locally specific event identifiers. The GetMyGUID service creates a second column and makes an EZID per row linked to the local identifier. A user can then import this right back into their database and have EZIDs on their source material. The “Create GUIDs” link just mints a set of EZIDs for later use. Some authentication will be required and we might put an expiration date on how long you can wait to use them. The last option is “Mint a DOI for your dataset”. You basically just type in the digital object location and some key metadata, and you get a DOI that can resolve to at least the metadata and link to the actual digital object. As always, BiSciCol will accept any well-formed, valid, persistent URI identifiers supplied by clients. We are working closely with the California Digital Library and extending their EZID API for use in this part of our project.
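The “second column” workflow can be sketched without the real service. In the snippet below, the minting step is a stand-in that returns a random UUID URN, not an EZID; the function names and the column headings are likewise hypothetical. The idea it illustrates is just the pairing of each pasted local identifier with a newly minted one, in a form that can be imported back into a source database.

```python
import csv
import io
import uuid


def placeholder_mint():
    """Stand-in for a real minting call; returns a UUID URN, not an EZID."""
    return uuid.uuid4().urn


def pair_with_guids(local_ids, mint=placeholder_mint):
    """Return (local_id, guid) rows, skipping blank input lines."""
    rows = []
    for local_id in local_ids:
        local_id = local_id.strip()
        if local_id:
            rows.append((local_id, mint()))
    return rows


def to_csv(rows):
    """Render the rows as two-column CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["local_id", "guid"])
    writer.writerows(rows)
    return out.getvalue()


# Pasted-in catalog numbers (made up), including one blank line.
pasted = ["CAT-001", "", "CAT-002"]
print(to_csv(pair_with_guids(pasted)))
```

Passing the minting function as a parameter keeps the pairing logic independent of whichever identifier service ends up behind it.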

Summary: We end 2012 on a BiSciCol high note, and not just because the meeting was in Boulder, Colorado (because of the elevation, people! Not the legal cannabis!). We have made a lot of progress based on productive meetings, a lot of input from various folks, and a lot of time and effort by our talented staff of programmers who work so hard to develop this and also canvass the community. We should also take this opportunity to give a shout out to a new developer on the team, Tom Conlin, who is joining us as our backend database expert. Great to have him on board!

- John Deck, Rob Guralnick, Brian Stucky, Tom Conlin, and Nico Cellinese

Making it 'EZ' to GUID

2012-10-12T12:53:00.001-07:00

On Global Unique Identifiers (again) for Natural History Collections Data: How to Stop People From Saying “You’re Doing It Wrong” (or conversely, “Yay! We’re Doing It Right!”)

From Gary Larsen and adapted by Barry Smith in Referent Tracking presentation at the Semantics of Biodiversity Workshop, 2012.

The natural history collections community has been hearing about GUIDs (globally unique identifiers) for a long time. However, what we’ve typically heard are comments like “ARGH! These don’t work” or “These are hard to implement”; or we’ve been subjected to long “policy documents” that seem to be generated by computer scientists, not the people actually working in the collections. So the bottom line is that it’d be nice to have some clear, short “thinky things” about GUIDs that help us understand the value a bit more clearly, and that provide a simple and clear way forward. We want to take a stab at that here and VERY MUCH WELCOME feedback. Lots of it. We’ve thought a ton about it and we are ready!

A recurrent question we have gotten from people developing collections databases (or at the level of aggregators such as VertNet or GBIF) is why we need to go beyond the self-minted, internal GUIDs and why GUIDs need to resolve and be persistent. We could envision a large data aggregator such as iDigBio or GBIF that keeps track of digital records by assigning UUIDs (universally unique identifiers --- which are very, very easy to mint!) to these, but likely without any connection to the physical source objects stored in providers’ institutions, or to the same objects stored in other institutional repositories or aggregators. Yet, the ultimate value of assigning GUIDs to objects, their metadata and derivatives is that we can track all these back to their source and generate queries that imply semantic reasoning over a robust digital landscape. In such a landscape, answering those core-challenging questions generated by collaborative projects becomes possible. Therefore, the digitization process acquires a much deeper meaning and value by going beyond the process of straightforward data capture and moving towards an environment where we can track relationships among physical objects, annotations, and descriptive metadata as part of a global network. If as a community we agree on the benefit of assigning GUIDs, this is the opportunity to generate a strategy that would add long-term value to this effort. In other words, if we need to invest our resources, let’s do it in ways that let us draw benefit now and in the future.

A big question is how to best implement such a vision. GUID implementations within our community have proven problematic, as evidenced by 20% of Rod Page’s blog posts. After much vetting of possible solutions, we believe the right answer is to leverage existing expertise developing not just GUIDs, but a set of services built around them. In particular, we have talked to the California Digital Library (CDL) about EZIDs, and the value of using EZIDs given that these elegantly solve a lot of community needs at once and nicely position us for the future. Speaking of community needs, the solution we advocate is not just “go get EZIDs”. BiSciCol was funded, in part, to help with this task of working with the community and fostering the implementation of GUIDs as a necessary step towards bringing our digital resources into a Linked Open Data framework. BiSciCol wants to build out services that support the community, working with CDL and you, to make that happen.

What are EZIDs and why do we love them?
As we mentioned in a previous blog post (http://biscicol.blogspot.com/2012/08/the-biscicol-project-team-has-had-busy.html), CDL has developed EZIDs, which are flexible GUIDs built off of DOIs and ARKs. The big win is that there are a bunch of CDL services already developed to help with minting these GUIDs, and to assure that these are resolvable, linkable, persistent and long-term sustainable. EZIDs have some lovely features, including their flexibility to be associated with datasets and objects through the whole digital data life cycle. Also, EZIDs allow us to mix and match DOIs, which are well understood and used in the publishing community, with ARKs, which were developed in the archives, library and museum community and provide a bit more flexibility and the ability to assign on a more granular level to individual data objects rather than datasets. (For more details, see John Kunze’s PowerPoint presentation on EZIDs.) We can work with CDL and their EZID system to build a prototype collections community GUIDs service.

So you are thinking to yourself... how much does it cost? The answer is: Nothing to you, very little to BiSciCol, and ultimately remarkably lower than what has already been spent in terms of people-hours trying to sort through this very complex landscape, and develop home-grown solutions. Sustainability has costs --- and the goal is to scale those down to the point where they are orders of magnitude lower than where they have been before by leveraging economies of scale. We do that with this solution. Big win.

Our View on Best Practices:

1. GUIDs must be globally unique. The “Darwin Core Triplet” might not be good enough.
2. GUIDs must be persistent. Most projects generating GUIDs have < 10 year lifespans. Having persistent GUIDs means we need to think about strategies for resolution services (if required) that have a > 10 year lifespan and in the context of an institution that is designed to be persistent.
3. GUIDs must be assigned as close to the source as possible. For example, if data is collected in the field, the identifier for that data needs to be assigned in the field and attached to the field database with ownership initially stated by the maintainers of that database. For existing data, assignment can be made in the source database.
4. GUIDs propagate downstream to other systems. Creating new GUIDs in warehouses that duplicate existing ones is bad practice, and thus aggregators need to honor well-curated GUIDs from providers.
5. Don’t conflate GUIDs for physical material with GUIDs for metadata about that physical material. We promote the assignment of unique GUIDs to physical material; metadata about physical material will have a separate GUID. While physical object IDs can be any type of GUID, we recommend EZIDs as they are short, unique, opaque, resolved by a persistent entity, and redirection to metadata can be stored with the identifier itself. UUIDs can be used for this purpose as well BUT are not as robust as EZIDs since they lack redirection or resolution or require local solutions (see #2 above for problems with such solutions).
6. GUIDs need to be attached in a meaningful way to semantic services. Knowing semantically that a GUID is referring to either an information artifact, a process, or a physical thing is vital to understanding how to interpret the meaning of its relationship to other GUIDs expressed in other areas and to inform aggregators how to interpret content.

A prototype collections community GUID service
GetMyGUID Service - “Promoting GUID Standard Design Practices”. We have blathered on long enough here, but want to just give a hint of where we are going. We want to create a service that is built by natural history collections folks (and our computer science friends) for natural history collections folks, that taps into existing goodness already created. That is, we tap into the existing services from EZIDs but then further develop a service that encodes best practices that work in this community. In the near future, we are going to explain how the service works, how you can access it, why it does what it does. We know how hard it is to get folks to make updates and additions to their databases, so we want to find out how to get over that barrier! We want to find those early adopters (and hint hint, we are working with BiSciCol partners already to get this ball rolling!). So, more soon. Pass the word along!

- John Deck, Rob Guralnick, and Nico Cellinese

News Update: how do we 'GUID'?

2012-08-25T11:30:00.002-07:00

The BiSciCol project team has had a busy summer that included a presentation at the Annual Meeting of the Society for the Preservation of Natural History Collections (SPNHC) in New Haven, CT, and a presentation at the iEvoBio 2012 meeting in Ottawa, Canada. Additionally, on 13-15 August, John Deck, Nico Cellinese, Rob Guralnick and Neil Davies convened at the University of California, Berkeley to meet with a few key partners and discuss the next steps forward for the project (meeting summary). Before we report more about the meeting with our partners, here is some background information.

BiSciCol's main goal is to break down the walled gardens between databases storing different kinds of biodiversity data such as specimens or samples from collecting events, sequences, images, etc. generated from those specimens or samples. Doing so requires overcoming two separate community challenges. First, there must be a mechanism to associate globally unique identifiers (GUIDs) to collections records (Note, we are using the RSS specification GUID definition). Second, the collections records must be expressed such that the terms used to define those records and their relationships are well understood by humans and computers. This brings us into the “semantic web” and RDF “triples”.

As BiSciCol has evolved, two key questions related to these challenges have emerged. The first is whether GUIDs and creating "triples" should happen at the level of individual provider databases, or instead at the level of "aggregators" that enforce a standardized schema and encoding. In the case of biological collections, an example of standardized schema is Darwin Core, usually encoded into a Darwin Core Archive. Example aggregators are GBIF, VertNet and Map of Life. The second question is equally thorny and deals primarily with the content that the identifier is describing: is the identifier describing a physical object or a digital surrogate of a physical object, and, if a surrogate, is it the primary digital surrogate or a copy? An example is the specimen metadata attached to a photo record in Morphbank: the photo record contains a copy of specimen metadata, which in turn references a physical object.

So, let's turn back to the meeting in Berkeley. That meeting included two key partners with whom we want to further develop and test ways forward given the two huge questions above. We spent part of the time with the California Digital Library (CDL) folks, who have built a set of excellent tools that may be part of the solution to the problem of GUID assignment. CDL has developed EZIDs, which are flexible GUIDs built off of DOIs and ARKs. John Kunze from CDL gave a great rundown on EZIDs and how they work, and was kind enough to meet us again on a couple of separate occasions, formal and informal. Metadata encoding in the EZID itself may also be used to indicate use restrictions and provenance (John Kunze’s powerpoint presentation on EZIDs).

The other key partner with whom we met was VertNet and Aaron Steele, the lead systems architect on the project. The idea behind meeting with VertNet was to test out how we might do EZID assignment and triplification utilizing the same approach by which VertNet data is being processed from Darwin Core archives into a set of tables that can be visualized, queried and replicated. Aaron was kind enough to participate in our hackathon and start up this process. We set up a readme file about the Hackathon to describe our expected outputs. Yes, the project is called "Bombus", which reflects the fact that, although a bit wobbly, our goal is to have data flying around to "pollinate" other data. Happily, the hackathon was very much a success! We were able to tap into some existing code generated by Matt Jones (NCEAS) to mint EZIDs and VOILA, we had an output file ready for the semantic web (e.g. an output file that shows relationships between occurrences, localities and taxa based on the EZIDs). We weren't quite able to get to the last step of querying the results, but we're very close. More work (and reports) are to follow on this so stay tuned on the Bombus/pollinator link above.

We have been testing a variety of solutions for identifier assignment, including: supporting user-supplied GUIDs, aggregator GUIDs, dataset DOIs, community standard identifiers (e.g. DwC Triplet), and creating QUIDs (Quasi Unique Identifiers) from hashed content. EZID technology will play a significant role in the implementation of a number of these approaches. None of these approaches offers a complete solution, but taken together, we can begin to build an intelligent system that can provide valuable services to aggregators, data providers, and users. Services we will be supporting include: GUID tracking, identifying use restrictions, and GUID reappropriation. Integrating our existing Triplifier and BiSciCol Java codebases with a scalable database back-end will fulfill most of the technical requirements needed.
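
To make the "QUIDs from hashed content" idea a little more concrete, here is a minimal sketch in Python. It is only an illustration and not the BiSciCol implementation: the choice of SHA-256, the way fields are normalized, the `quid:` prefix and the truncation length are all assumptions made for the example.

```python
import hashlib

def make_quid(record: dict) -> str:
    """Derive a quasi-unique identifier from a record's content (illustrative only)."""
    # Sort the keys and normalize case/whitespace so equivalent records hash alike.
    parts = [f"{key}={str(record[key]).strip().lower()}" for key in sorted(record)]
    canonical = "|".join(parts)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Hypothetical prefix and truncation; a real service would choose its own.
    return "quid:" + digest[:16]

# Two spellings of the same (made-up) record yield the same identifier.
a = make_quid({"institutionCode": "MVZ", "catalogNumber": "12345"})
b = make_quid({"catalogNumber": " 12345 ", "institutionCode": "mvz"})
```

Identifiers built this way are only "quasi" unique: records with identical content collide by design, and any edit to the hashed fields produces a different identifier.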

We are still building our Triplifier to support those who want to take their own datasets and bring them into the semantic web framework, but BiSciCol can operate much more "at scale" with a very simple interface that accepts Darwin Core Archives or other standardized data such as those generated from Barcode of Life, Morphbank, or iDigBio, and assemble these into a shared triplestore or set of commonly accessible triplestores. We think the issues we're tackling right now are at the sociotechnical heart of BiSciCol. We use the term heart knowingly because it is going to be the desire and will of the community, along with the resources such as BiSciCol, that can help motivate and excite, and that will get us at least moving in the right direction. If you have any thoughts, criticisms, suggestions, we'd of course love to hear them.

John Deck, Rob Guralnick and Nico Cellinese

Making our System Smarter

2012-04-06T15:20:00.004-07:00

Computers are amazing at following instructions. So amazing, in fact, that a seemingly harmless instruction can potentially lead to an entirely false conclusion.

At our recent BiSciCol meeting at the University of Florida, we had a discussion about just such a case.

At its core, BiSciCol is all about connecting objects to each other. To connect object identifiers to other objects, we have been using a simple relationship expression called “leadsTo” that indicates a direction in the nature of the relationship between one object and another. To illustrate how “leadsTo” works, let's walk through a simple example. Suppose we have a collecting event, which we join to a specimen object using our relationship predicate “leadsTo”. The specimen object could then “lead to” a taxonomic determination, which could in turn “lead to” a scientist, and so on.

This is certainly useful, as we can express an endless chain of objects and their derivatives, even if they exist in different databases. However, what if we extended the above example just a bit further, using our “leadsTo” relationship?

Uh oh! By successively following the leadsTo relationships, we could now erroneously conclude that spec2-t1 came from spec3! This is not good! Fortunately, there is a solution.

We realized that the directional “leadsTo” relationship simply doesn't make very much sense in some situations, such as the connection between spec3 and person1 in the diagram above. Consequently, instead of the single “leadsTo” relationship, we actually need two relationship terms: one that has a distinct direction and one that implies no direction. Two terms currently in use in the Dublin Core standard do just this: 1) relation (no direction) and 2) source (has direction).

In the first example above, we could avoid the problem entirely by describing the link between the taxonomic determination and the scientist as a non-directional relation. Using our new terminology, the graph would look something like this:

The computers involved in figuring out how to traverse the graph of relationships would know not to follow non-directional relationships and we would no longer infer that spec2-t1 came from spec3. Problem solved!
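
The difference can be sketched in a few lines of Python. The graph below is a made-up stand-in for the diagrams (which are not reproduced here), and the function name is hypothetical. In Dublin Core, "source" points from a derived resource to the resource it came from, so the sketch follows "source" links from subject to object and crosses "relation" links only when explicitly asked to.

```python
from collections import deque

# (subject, predicate, object); "source" means the subject was derived from the object.
TRIPLES = [
    ("spec2-t1", "source", "spec2"),
    ("spec2", "source", "event1"),
    ("spec2", "relation", "person1"),
    ("spec3", "relation", "person1"),
    ("spec3", "source", "event2"),
]

def reachable(start, triples, follow_relation=False):
    """Return everything reachable from start, optionally crossing 'relation' links."""
    neighbours = {}
    for s, p, o in triples:
        if p == "source":
            neighbours.setdefault(s, set()).add(o)
        elif p == "relation" and follow_relation:
            # Non-directional: usable from either end.
            neighbours.setdefault(s, set()).add(o)
            neighbours.setdefault(o, set()).add(s)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen

provenance = reachable("spec2-t1", TRIPLES)                      # directional links only
everything = reachable("spec2-t1", TRIPLES, follow_relation=True)
```

With directional links only, the tissue traces back to spec2 and event1. Crossing the non-directional links as well drags in person1, spec3 and event2, which is exactly the false inference described above.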

This post, written by John Deck and Brian Stucky with input from Hilmar Lapp, Steve Baskauf, Andrea Thomer, Rob Guralnick, Lukasz Ziemba, Tim Robertson, Reed Beaman, and Nico Cellinese, summarizes a discussion that took place at the BiSciCol development meeting held on March 31, 2012 at the University of Florida.

Development Meeting, Boulder, Colorado, 2-4 February, 2012

2012-02-17T15:06:00.000-08:00

Two Days of Triplifying

The Boulder contingent of BiSciCol hosted a short two-day "developer's meeting" that included John Deck, Brian Stucky, Lukasz Ziemba, Bryan Heidorn and his students Alyssa Janning and Qianjin Zhang. As luck would have it, the BiSciCol crew arrived on a blustery morning just ahead of the biggest single February snowfall on record. It proceeded to snow from Thursday afternoon to Saturday morning without much pause, lending a surreal quality to the proceedings. It also fubared some plans to use campus meeting facilities, since that Friday was the first snow day in many a moon. Enough about the weather! Let's talk about what we did!

All participants were very pleased with the outcome of this meeting. Brian had been hard at work developing a generic plug-in interface so that anyone can write some simple code to connect whatever kinds of record sets they have and begin an import into the Triplifier. OH! WAIT. WAIT. First things first! What the heck is Triplifier, you ask? And why do we think this Triplifier is such a good idea?

The BiSciCol project works by linking data in data sources based on logical relationships independent of a particular implementation. This is a different kettle of fish compared to a data standard. BiSciCol works where standards stop. We particularly want to, for example, represent how a sequence is related to a specimen which is related to an event and a location. The problem is creating a common "format" for expressing simple relationships and then using a set of those simple ones to build more complex "graphs" of these relationships. So what is that "common format"? In the world of the Resource Description Framework (RDF), the format is called a "triple".

A triple is not that complicated; it basically expresses a unique fact about how things are related to one another. The format of a triple is subject - predicate - object (thus the "triple" - three pieces of data). The triple format is not all that different from what is expressed in a database or spreadsheet or other structured data document. The set of relationships that allow joins to happen in relational databases are in theory very similar. So similar that one can convert a database or other document into triples. And thus the point and value of a "Triplifier" -- a way to convert any set of documents into triples so that we can begin to compile a larger set of resources.
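
As a rough illustration of that conversion (not the Triplifier's actual code), the Python sketch below turns one spreadsheet-style row into subject - predicate - object triples. The identifier scheme, column names and values are invented for the example.

```python
def row_to_triples(subject_id, row):
    """Turn one spreadsheet-style row into (subject, predicate, object) triples."""
    # Each non-empty cell becomes one fact about the row's subject.
    return [
        (subject_id, column, value)
        for column, value in row.items()
        if value not in (None, "")
    ]

# A made-up specimen record; the empty "remarks" cell produces no triple.
row = {"scientificName": "Bombus terrestris", "collector": "J. Smith", "remarks": ""}
triples = row_to_triples("specimen:1", row)
```

Every non-empty cell yields one fact about the row, which is why a table with a stable key column converts to triples so directly.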

So, back where we started... Brian Stucky has developed a generic plug-in for ingesting different types of data into the Triplifier, and Lukasz has used that generic plug-in to build a Darwin Core Archive ingester. The big news, however, has to do with a platform called D2RQ. Basically, D2RQ does the heavy lifting of representing relational databases (or your own declarations of relationships between objects) as triples. At the heart of D2RQ are: 1) the "ClassMap", which represents classes from a schema; 2) the "PropertyBridge", which basically defines the properties in RDF using the class map; and 3) joins that link tables. In a nutshell, a user can pass along a relational database (or a specification of one) with the right information about its tables and foreign keys, and dump out RDF triples.
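
The sketch below is not D2RQ and does not use its mapping language. It is a hand-rolled Python stand-in, using an in-memory SQLite database, that shows the three ideas in miniature: each table maps to a class, each column to a property, and each foreign key to a link between objects. The schema, the `ex:` property names and the identifier patterns are all assumptions for illustration.

```python
import sqlite3

def dump_triples(conn):
    """Emit triples for a collectors/specimens schema; the foreign key becomes a link."""
    triples = []
    # Table -> class, columns -> properties (the role a class map and property bridges play).
    for cid, name in conn.execute("SELECT id, name FROM collector ORDER BY id"):
        triples.append((f"collector:{cid}", "rdf:type", "ex:Collector"))
        triples.append((f"collector:{cid}", "ex:name", name))
    query = "SELECT id, catalog_number, collector_id FROM specimen ORDER BY id"
    for sid, catalog, cid in conn.execute(query):
        triples.append((f"specimen:{sid}", "rdf:type", "ex:Specimen"))
        triples.append((f"specimen:{sid}", "ex:catalogNumber", catalog))
        # Foreign key -> relationship between two identified objects (the join).
        triples.append((f"specimen:{sid}", "ex:collectedBy", f"collector:{cid}"))
    return triples

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collector (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE specimen (id INTEGER PRIMARY KEY, catalog_number TEXT,
                       collector_id INTEGER REFERENCES collector(id));
INSERT INTO collector VALUES (1, 'J. Smith');
INSERT INTO specimen VALUES (10, '12345', 1);
""")
triples = dump_triples(conn)
```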

The good news is that we were able to test D2RQ with a very simple relational database that relates collectors and specimens to verify that we get the right outputs. After some trial and error, and specifying the right class maps, we succeeded in generating meaningful RDF output. Given this, we are ready to rock and roll with developing the Triplifier fully, and will be cranking on this over the next few months. Lukasz has already made progress on a Web interface, and we are preparing to test the system with data from the Moorea Biocode and HERBIS project.

- Rob Guralnick reports

Development meeting at UF, Sept. 12-13, 2011 - A quick summary

2011-09-23T11:56:00.000-07:00

Our core development team met again, and although a few key people were missing, there were enough to generate a healthy discussion and set the course for more progress on BiSciCol development. The following people attended: Reed Beaman (UF), Nico Cellinese (UF), Brian Stucky (University of Colorado, Boulder), John Deck (UC Berkeley), Bryan Heidorn (UA), Tom Orrell (SI), Kate Rachwal (UF), Russell Watkins (UF), Lukasz Ziemba (UF).

Overall, we discussed a number of topics, including user scenarios and potential queries, output of query results, coding requirements, how to represent geographic information, how to handle taxonomic names, tasks to be completed, and the timetable. Consistent threads throughout these discussions were the form of BiSciCol, from a structural and user interface perspective, and its role with respect to data providers (clients), processing and service.

On Monday the 12th, we reviewed the user scenarios we generated in the past and discussed potential queries derived from the workflows presented in these scenarios. It quickly became clear that there was a convergence of common queries across all scenarios. Potential query types are listed below, each followed by a brief example in parentheses. It should be noted that the data schema is based on transitive relations between objects, including “relatedTo” and “sameAs” inferences between objects. Also, filters can be applied to any Object ID with respect to: date, source database, and level or depth of relatedness. These filters can be applied in combination within a single query (see below). We will have to spend more time thinking about scalability and how to handle very large and/or complex queries.

Enumeration of Query Types

Relations = The graph of transitive relations attached to a particular object. Including “sameAs” inferences between this object and all related objects.

DateFilters = All queries below can be modified with a date filter for any Object ID.

Source DB Filters = Ability to filter on source database.

Potential Query Types:

1. WHERE Object=X, return relations (Given a specimen ID, return eventID, tissueIDs, etc… that are related)

2. WHERE Object=X, return List of Objects that are "sameAs" (From Use Case #4: Given a collectors specimen ObjectID, return internal IDs, and other institution IDs that refer to this same ID)

3. WHERE Object=X, and has duplicate non-matching literals (A GUID has two different taxonomic names "Jim" and "Fred")
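
The three query types above can be sketched over a plain list of triples. This is a toy Python illustration with made-up identifiers and predicate names, not the BiSciCol query layer. For brevity, it returns only direct relations; the transitive part is left out.

```python
TRIPLES = [
    ("spec1", "relatedTo", "event1"),
    ("spec1", "relatedTo", "tissue1"),
    ("spec1", "sameAs", "museumX:777"),
    ("spec1", "taxonName", "Jim"),
    ("spec1", "taxonName", "Fred"),
]

def relations(obj, triples):
    """Query type 1: objects directly related to obj."""
    return sorted(o for s, p, o in triples if s == obj and p == "relatedTo")

def same_as(obj, triples):
    """Query type 2: identifiers asserted to be the same as obj (either direction)."""
    found = set()
    for s, p, o in triples:
        if p == "sameAs" and obj in (s, o):
            found.add(o if s == obj else s)
    return sorted(found)

def conflicting_literals(obj, triples, literal_predicates=("taxonName",)):
    """Query type 3: predicates for which obj carries more than one distinct literal."""
    values = {}
    for s, p, o in triples:
        if s == obj and p in literal_predicates:
            values.setdefault(p, set()).add(o)
    return {p: sorted(v) for p, v in values.items() if len(v) > 1}
```

For the sample data, `conflicting_literals("spec1", TRIPLES)` flags the "Jim" versus "Fred" disagreement from the third example.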

Filtering Queries (on rdf:Type):

Recursive filters can be applied to derived graphs (returned output).

We need to support multiple types of filters, which means (1) multiple filters at one level of relatedness and/or/not (2) multiple filters across several levels of relatedness.

1. Given an Object, return all related objects that are a certain rdf:type (e.g. Show all images that are attached to a Specimen ID).

2. Given an Object, return related objects that are a certain rdf:type AND related objects that are another rdf:type (e.g. given an EventID, only search for specimens with identifications so we don't also include tissues, extractions, sequences with identifications).
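
A type filter at a single level of relatedness might look like the following Python sketch. The data structures, the function name and the `ex:` type names are hypothetical, chosen only to mirror the two examples above.

```python
def related_of_type(obj, wanted_types, related, types):
    """Return objects related to obj whose rdf:type is one of wanted_types."""
    return sorted(o for o in related.get(obj, ()) if types.get(o) in wanted_types)

# Made-up data: one event with a specimen, a tissue and an image attached.
related = {"event1": ["spec1", "tissue1", "img1"]}
types = {"spec1": "ex:Specimen", "tissue1": "ex:Tissue", "img1": "ex:Image"}

images_only = related_of_type("event1", {"ex:Image"}, related, types)
specimens_and_images = related_of_type("event1", {"ex:Specimen", "ex:Image"}, related, types)
```

Filtering across several levels of relatedness would apply the same test again to each returned object.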

A consideration that arose from the query conversation was database updates and tracking of changes. A discrete timetable schedule for database updates needs to be determined. The mechanism for updates will involve applying an inferencing process to data delivered from partners to create new relations tables. Currently, we think that the process will proceed as follows:

1. Clients publish data to a database (Virtuoso, for example), perhaps via GBIF’s IPT, into (a) a client-specific relations table and (b) a client-specific sameAs table

2. BiSciCol will periodically re-infer all relations (all previously submitted data is still accessible)

3. BiSciCol publishes results of user queries periodically (e.g. through RSS)

Regarding real-time changes and incorporation of individual object records, we think this is easily done for relatedObjects (e.g. by putting them in a newRelationsTable), but we need more discussion on the inferencing process that must run to completion (1x / month) for sameAs relations.

So, having stated all of the above we still need to:

- Think about complex queries and scalability
- Express “Darwin-Triples” (institution code, collection code, catalog #) as a GUID. E.g. http://mydarwincore.org/resolver/ic=MVZ;cc=VERTS;cn=12345 OR LSID --- this is a project relegated to a specific implementation, probably the HERBIS test case.
- Update Test case to put Multiple Literals for same Object
- Pre-inferred relations as a separate table?
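
The triplet-as-URL idea in the list above can be sketched as follows. The base URL is the example address given in the list, not a working service, and the percent-encoding of field values is an assumption added so that values containing ";" or "=" survive a round trip.

```python
from urllib.parse import quote, unquote

RESOLVER = "http://mydarwincore.org/resolver/"  # example base URL from the list above

def triplet_to_url(institution, collection, catalog):
    """Encode a Darwin Core triplet in the ic=...;cc=...;cn=... form."""
    fields = (("ic", institution), ("cc", collection), ("cn", catalog))
    return RESOLVER + ";".join(f"{k}={quote(str(v), safe='')}" for k, v in fields)

def url_to_triplet(url):
    """Recover (institution, collection, catalog) from a URL built by triplet_to_url."""
    tail = url[len(RESOLVER):]
    parts = dict(item.split("=", 1) for item in tail.split(";"))
    return tuple(unquote(parts[k]) for k in ("ic", "cc", "cn"))

url = triplet_to_url("MVZ", "VERTS", 12345)
```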

We then discussed various forms of output of query results (see summary listed below). Suggestions included some standard forms of visualization such as summary statistics, charting (e.g. bar, pie, etc.), browsable lists, sortable HTML tables, a timeline of object updates and changes, and mapping of geolocated objects. Other ideas included RDF triples, various JSON outputs, some sort of standards compliance measure(s), incremental levels (e.g. list a specified depth or level of the sameAs or relatedTo). One interesting idea was to provide a means to evaluate the completeness and quality of collector efforts and databases. Two suggested approaches were user vetting, with Facebook-like “up or down” voting buttons, or a possibly more objective ranking algorithm (e.g. Google’s page ranking). For now, we are going to provide outputs in the form of tables, hierarchical (tree view), maps, and by rdf:Type. However, we agreed to seek out other interesting and innovative forms of visualizing results.

0. pie charts, statistics on data that is returned (e.g. Sencha)
1. RDF Output (integrate w/ freebase)
2. JSON Output
3. Standards Compliance Measure (e.g. MIMARKS score in percent)
4. Sortable HTML Table  
5. Summarize results by rdf:Type (either modified HTML Table or modified JSTree e.g. see http://library.conservefloridawater.org/WCC?act=view&oid=11964077&lng=1)

dwc:Identifications

dwc:Taxon

6. Search for interesting visualizations
7. Service renders incremental portions of graphs (e.g. 1 transitive level deep)
8. Timeline – what has changed over a period (graph of date last modified)
9. Browse lists: institutions, collectors, etc …
10. Mapping results
11. Vote Up or Down on Datasets or Collectors (e.g. see http://answers.semanticweb.com/questions/1581/whats-missing-from-rdf-databases) - for example, consider facebook like buttons, or google page rank on output methods for vetting, objective algorithms.

Tuesday morning was largely occupied with technical coding discussions. Specific tasks were identified and prioritized relative to project milestones (see summary below). This included user interface development, documentation of code, REST service tuning, data ingestion filters, and RDF-specific enhancements such as indexing and inferencing. One major component identified for development has been dubbed “TriSciCol”, which establishes and expresses the triple-store object relations from client-supplied databases. Details on the design and functionality of this component will be circulated soon.

Focus areas

Lukasz Ziemba (UF) User Interface Development
Brian Stucky (CU) REST Service / Tweaks
Brian Stucky (CU) Code-Base / Documentation of Code / Unit Testing
John Deck Data Ingest (GBIF data?, Cam Web)
Deck/Stucky Data Indexing / RDF Structure / Inferencing Engines / Reasoners
Bryan Heidorn (UA) HERBIS Use Case / Arctos Integration
Orrell (SI), Deck, Cellinese (UF) TriSciCol – flesh out specifications and submit for review by BiSciCol collaborators

How to store and represent geographic information in BiSciCol

“Location” is a reference to the verbatim location information by the original collector. This GUID points to the Darwin Core Location class except for the parts of that class that refer to a georeference. The “Georeference” class expresses a georeferencing instance that is related to “Location”. This allows us to express multiple georeferencing instances for an individual “Location” and track DateLastModified dates for georeferencing instances atomically.

The result is that “Locations” are never mapped by BiSciCol, only “Georeferences” are. In this way, we can define types of “Georeferences” that correspond to the manner in which they will be mapped or utilized in consuming applications. E.g. geo:lat/geo:lng, SpatialThing, MGRS, etc.

A note for consideration here is that georeference terms are currently embedded in the Location class in Darwin Core, and we will probably want to make a recommendation to move the georeferencing terms into their own class.

Terms

dwc:Location is an object type that indicates the associated identifier is a location (E.g. text description, coordinate, any direct reading from an instrument, or observed). The content linked by this identifier is normally verbatim location information. This identifier cannot contain references to location literals.

:Georeference is an object type that indicates the associated identifier is a geo-referenced location. This is the only object type that can reference spatial representations as literals. Need update from DwC folks on status of Georeference being its own DwC Class?

:hasGeoreference is a predicate that joins an identifier of type GeoreferenceID to a literal representation of a spatial representation. :hasGeoreference can have the following subProperties:

:hasMGRSGeoreference
:hasSpatialThingGeoreference
:hasWKTGeoreference

Example N3 file

@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix : <http://example.org/biscicol#> .   # placeholder namespace for this example

:ObjectX a dwc:Event .
:ObjectX dwc:DateLastModified "2011-01-01" .

:ObjectY a dwc:Location .
:ObjectY dwc:DateLastModified "2011-01-02" .

:ObjectZ1 a dwc:Georeference .
:ObjectZ1 :hasSpatialThingGeoreference "48.198634,16.371648;crs=wgs84;u=40" .
:ObjectZ1 dwc:DateLastModified "2011-01-03" .

:ObjectZ2 a dwc:Georeference .
:ObjectZ2 :hasMGRSGeoreference "4QFJ1234567890" .
:ObjectZ2 dwc:DateLastModified "2011-01-03" .

:ObjectX :relatedTo :ObjectY .
:ObjectY :relatedTo :ObjectZ1 .
:ObjectY :relatedTo :ObjectZ2 .

Java Code

Georeference class implements Location interface. SpatialThingGeoreference, MGRSGeoreference, WKTGeoreference classes extend the Georeference class. A georeference is returned by BSCObject with the method getGeoreference.
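
The BiSciCol code itself is Java and is not shown here. The sketch below only mirrors the class relationships just described, in Python, and adds a hypothetical `parse` method that splits the SpatialThing literal from the N3 example into coordinates and parameters. It does not interpret what the parameters mean.

```python
from dataclasses import dataclass

class Location:
    """Marker base standing in for the Location interface described above."""

@dataclass
class Georeference(Location):
    literal: str  # the spatial representation exactly as stored in the triple

@dataclass
class SpatialThingGeoreference(Georeference):
    def parse(self):
        """Split 'lat,lng;key=value;...' into coordinates and a parameter dict."""
        coords, *params = self.literal.split(";")
        lat, lng = (float(x) for x in coords.split(","))
        return lat, lng, dict(p.split("=", 1) for p in params)

@dataclass
class MGRSGeoreference(Georeference):
    pass  # MGRS strings are kept verbatim in this sketch

geo = SpatialThingGeoreference("48.198634,16.371648;crs=wgs84;u=40")
lat, lng, params = geo.parse()
```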

We’ll meet again as a much larger group on the 19th of October during the TDWG meeting in New Orleans. If you happen to be there and interested in our discussion, do join us and/or stay tuned for more news.

Tagging use case 585

2011-07-21T13:05:00.000-07:00

Tagging use case 585 requires semantic linking to the taxonomy of the shark, to the geospatial track and place names, and likely to a tracking methods ontology.

BiSciCol core software architecture

2011-06-18T00:04:00.000-07:00

This simplified UML diagram illustrates the current architecture of the core BiSciCol classes. These classes are responsible for most of the lower-level BiSciCol functionality: interacting with data, working with queries, traversing the BiSciCol data structures, and converting BiSciCol data to other representations, such as XML. Note that this diagram does not depict all relationships or dependencies among the classes. Instead, relationships most important for a general understanding of the code are included, such as inheritance, interface implementation, and aggregation. Further, most private class methods and members are excluded for simplicity.

In general, we've been working to develop an architecture characterized by classes with clearly-defined responsibilities and a high degree of flexibility. We've also tried to develop objects that map naturally to the BiSciCol problem domain. For instance, both BiSciCol data objects and models are represented by corresponding abstract data types. Furthermore, the code that implements BiSciCol objects and data operations is completely separated from the code responsible for converting BiSciCol data to other formats, such as XML.

This is simply a snapshot of the state of the code at this time. Some of the classes illustrated above are incomplete and/or will almost certainly be redesigned. Collaborations between classes will likely change as well. Nevertheless, the diagram shows not only progress we've made, but also many of the design ideas we've been working with.

BiSciCol VertNet Integration Meeting

2011-06-02T08:45:00.000-07:00

Recent big news is the funding of VertNet! Given this, some of us got together to discuss ways we can work together towards some common goals.

Location: “Jupiter Cafe”
John Deck (Kolsch)
Aaron Steele (Jupiter Red)
John Wieczorek (Hot Chocolate)

VertNet will publish to PubSubHubbub. BiSciCol can subscribe to the hub and receive updates on what is going on.
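
For readers unfamiliar with PubSubHubbub, subscribing comes down to a form-encoded POST to the hub that names the topic feed and a callback URL. The Python sketch below only builds that form body and sends nothing. Both URLs are placeholders, since no actual feed or callback addresses were settled at this meeting.

```python
from urllib.parse import urlencode

def subscription_form(topic_url, callback_url):
    """Build the form body a subscriber POSTs to a hub to request updates."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed to follow
        "hub.callback": callback_url,  # where the hub should deliver updates
    })

# Placeholder addresses, not real VertNet or BiSciCol endpoints.
body = subscription_form("http://vertnet.example/feed", "http://biscicol.example/callback")
```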

JohnD:
- VN can encourage, promote, implement use of GUIDs
- Links to GUIDs of Annotation references (of all kinds like data quality stuff e.g. Creation of WGS84/dec.lat/lng instance of DMS data)
- BiSciCol can help VertNet in augmenting search (relationship graph search)

JohnW:
- BiSciCol is base study for annotation use cases (query on annotation relations?)
- VertNet won’t look at relationships between objects, so BiSciCol could be FTW
- Recommends Twitter dialog between BiSciCol and VN

Aaron:
- Annotations
- PubSubHubbub
- Twitter dialog

More to come. If 2 of the attendees can figure out how to use (and if they really want to use) twitter we can stay plugged in that way (per JohnW's suggestion). Otherwise, we'll be forced to meet over beers and use our speech.

Revised Prototype Implementation Diagram

2011-04-14T09:31:00.000-07:00

This is a refined diagram describing our preliminary implementation.

Development Update

2011-03-26T07:43:00.001-07:00

Brian Stucky and I made a first pass at an online prototype back in mid-February. This prototype featured a text-file triple-store back-end, a bit of javascript, a cool modeling widget, and some JSP glue.

Since then, I've been working on a few things to make this prototype more solid:

1) Implementing a database back-end for the triple-store. I've been working with Virtuoso under an evaluation license and have found it fairly easy to work with. I would prefer a free or more open-source solution but it seems Virtuoso is the best product out there for our needs.

2) Building queries on REST services. Not wanting to make the REST services an afterthought to our architecture, I am building the REST services into our web prototype.

3) Re-building SPARQL queries to be more portable, scalable, and faster. Specifically, the transitive closure functions were previously built using specialized syntax not portable to larger systems. The new syntax is more portable and will be a lot faster. E.g.:

SELECT ?s ?dist FROM
WHERE
{
  {
    SELECT ?s ?o
    WHERE
    {
      ?s bsc:relatedTo ?o .
    }
  } OPTION (TRANSITIVE, t_distinct, t_in(?s), t_out(?o), t_min (1), t_max (20), t_step ('step_no') as ?dist) .
  FILTER (?o = dwc:spec12345)
}
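
As we read it, the query asks for every subject whose chain of relatedTo links reaches dwc:spec12345 within 1 to 20 steps, together with the number of steps. The Python sketch below does the same kind of bounded search over an in-memory edge list. It illustrates the idea only and is not a drop-in replacement for the Virtuoso query; the function name and sample edges are made up.

```python
from collections import deque

def subjects_leading_to(target, edges, max_steps=20):
    """Return {subject: steps} for subjects whose relatedTo chain reaches target."""
    incoming = {}
    for s, o in edges:
        incoming.setdefault(o, []).append(s)
    dist, queue = {}, deque([(target, 0)])
    while queue:
        node, steps = queue.popleft()
        if steps == max_steps:
            continue  # do not expand beyond the step limit
        for s in incoming.get(node, ()):
            if s not in dist and s != target:
                dist[s] = steps + 1  # breadth-first, so this is the shortest distance
                queue.append((s, steps + 1))
    return dist

# Made-up edges: (subject, object) pairs standing for "subject relatedTo object".
edges = [("event1", "spec12345"), ("site1", "event1"), ("spec12345", "tissue1")]
result = subjects_leading_to("spec12345", edges)
```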

Hoping to get this new prototype online in the next week or two.

Preliminary Implementation Diagram

2011-03-03T10:49:00.000-08:00

An indication of where we are headed with our architecture/implementation. This is a draft form and comments are welcome!

BiSciCol Technical Meeting: University of Florida, 23-24 February 2011

2011-02-25T04:25:00.000-08:00

The BiSciCol Technical team had a successful brainstorming meeting over two warm Florida days.

Attendees:

Nico Cellinese (UF), Kate Rachwal (UF), Reed Beaman (UF), Gustav Paulay (UF), Russ Watkins (UF), John Deck (Berkeley), Brian Stucky (Colorado University), Rich Pyle (remotely, Bishop Museum) and our very special guest Steve Baskauf (Vanderbilt University).

The main goal of the meeting was to define the BiSciCol scope, technical goals and the model. Happy to report we made great progress. We are now ready to move to an implementation phase and are confident we will be able to release a prototype in July 2011. See below a snapshot of our design document.

Purpose of the System

1. Notify all objects that are related to Object X that Object X has changed.

2. Manage relationship definitions between objects that exist in distributed databases.

3. Manage annotations of objects where annotation is treated as an Object relationship.

Technical Goals of the System

1. scalable design

2. easy coding but must be robust

3. expandable to different domains

Envisioning this as a distributed network with structured content defined. Goal is to allow for multiple data sources, each with their own thematic content. The thematic content is built on defining relationships between objects.

The Model

Linkages between objects are unstructured, which allows any object to be tied to any other object via the predicate “relatedTo”. This is similar to SKOS “related” but we have made relatedTo transitive. relatedTo does not require objects to be organized hierarchically.

Following are the triples recognized in the network:

:objectId :relatedTo :objectId

:objectId rdf:type :objectType

:relatedTo :hasProperty :propertyDefinitionURI

:objectId dwc:dateLastModified “YYYY-MM-DD”
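
To connect this model back to the first purpose of the system (notifying related objects of a change), here is a small Python sketch. It assumes, for illustration only, that a change shows up as a different dateLastModified between two snapshots, and that notifications go to objects directly linked by relatedTo in either direction. Following the transitive links is left out.

```python
def changed_objects(old_dates, new_dates):
    """Objects whose dateLastModified differs from (or is missing in) the old snapshot."""
    return sorted(o for o, d in new_dates.items() if old_dates.get(o) != d)

def notifications(changed, related_pairs):
    """Map each changed object to the objects directly linked to it by relatedTo."""
    out = {}
    for obj in changed:
        linked = {b if a == obj else a for a, b in related_pairs if obj in (a, b)}
        out[obj] = sorted(linked)
    return out

# Made-up snapshots of :objectId -> dateLastModified, and relatedTo pairs.
old = {":spec1": "2011-01-01", ":event1": "2011-01-01"}
new = {":spec1": "2011-02-01", ":event1": "2011-01-01"}
pairs = [(":event1", ":spec1"), (":spec1", ":tissue1")]

changed = changed_objects(old, new)
to_notify = notifications(changed, pairs)
```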

News and updates

2011-02-14T10:47:00.001-08:00

Since our last meeting in Washington, DC we worked out some of the object relationships and how we want to formalize concept/objects such as individuals, specimens, vouchers etc. for the purposes of our network. We also agreed to consider implementing RDF in our design. To this end, a number of people have since purchased and read (hopefully past the introduction) the book "Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL". Finally, we agreed to deliver a working prototype of the BiSciCol network in July 2011, implementing data from Biocode, CalPhotos, and UF.

During January and the first part of February, John Deck (Berkeley), Rob Guralnick (Colorado U.), and Brian Stuckey (Colorado U.) have been crafting a technical implementation plan for the BiSciCol prototype. CU has offered a server and Tomcat instance for our prototypes, and Brian has been setting that up along with getting the web interface going. John has been working on a model that ties together collecting events, specimens, tissues, DNA extractions, and photographs (more object types later), while incorporating location and modification date. Taxonomy is notably absent for now, and we will soon be working more closely with Rich Pyle and Rob Whitton (Bishop Museum). The codebase is Java, and the inferencing/RDF work is being implemented in Jena/ARQ (open source Java).

On the 23rd and 24th of February, John and Brian will be at UF for a technical meeting with Nico Cellinese, Kate Rachwal, Russ Watkins, and special guest Steve Baskauf (Vanderbilt University) to review the UF implementation and proposed model. We will also work remotely with Rich Pyle and Rob Whitton. More details on this meeting will be posted soon. Given our prototype deadline of July 2011, we have a lot of work ahead of us, especially in integrating the taxonomic names components, but we feel we are on target with progress to date.

16-17 December 2010 - Meeting outcome

2011-01-31T07:13:00.000-08:00

Our December meeting brought us all together, and we were fortunate to have Dave Vieglais from DataOne and Bob Morris from FilteredPush (a.k.a. Push-me-pull-you). We reviewed use cases and case scenarios, and discussed object relationships and ways to model them.

We are making progress with designing a technical architecture, and on February 23-25, 2011, John Deck (Berkeley), Brian Stuckey (U. Colorado), Kate Rachwal (UF), Russ Watkins (UF), Nico Cellinese (UF), and our guest Steve Baskauf (Vanderbilt) will meet at the Florida Museum of Natural History to refine solutions to some of the impending technical issues.

We created a number of subgroups that will be in charge of developing specific aspects of the project:

1. Taxonomic Names (Rich Pyle [chair], Gustav Paulay, Tom Orrell, Chris Meyer, Nico Cellinese, John Deck, Jonathan Coddington, Rob Whitton).

2. Ontology (Nico Cellinese [chair], Bob Morris, Kate Rachwal, John Deck, Jonathan Coddington, Rich Pyle).

3. Geospatial (Rob Guralnick [chair], Reed Beaman, Russ Watkins, Kate Rachwal, John Deck).

4. Technical Architecture (John Deck [chair], Rob Whitton, Dave Vieglais, Kate Rachwal, Rob Guralnick, Russ Watkins, Tom Orrell, Bryan Heidorn, Jonathan Coddington).

5. Domain Scientists (Chris Meyer [chair], Nico Cellinese, Gustav Paulay, George Roderick, Neil Davies, Rob Guralnick, John Deck, Tom Orrell, Jonathan Coddington, Reed Beaman). Improve on test case list.

6. Sustainability Group (Neil Davies [chair], Rob Guralnick, Reed Beaman, Chris Meyer).

Reed Beaman, Rob Guralnick, and Russ Watkins are compiling a conceptual plan that will include specifics on who is doing what, the timeline, and priorities. We all agreed to deploy a prototype in July 2011 for testing by the community.

More later.....

Meeting in Washington DC, 16-17 December, 2010

2010-12-17T21:11:00.000-08:00

We had a great, productive, fun meeting, and notes will be posted soon. In the meantime, I wanted to expose another inconvenient truth: informatics and technology are not always best friends. Check this out!

Here in the above "picture": Chris Meyer, Tom Orrell, Rob Guralnick, John Deck, Neil Davies, Gustav Paulay, Bob Morris, Jamie Whitacre, Kate Rachwal, Meghan Parker, and Nico Cellinese. Additional participants were Dave Vieglais, Reed Beaman, John Keltner, Linda Ward and Jon Coddington. "Photograph" taken by Reed Beaman.

BiSciCol Technical Architecture

2010-12-02T16:08:00.000-08:00

Here is a quick diagram showing the proposed BiSciCol architecture. Click the image to see the details.

More information and code are available on the Google Code page.

NSF original proposal

2010-10-27T10:09:00.000-07:00

The original proposal submitted in July 2009 to the Division of Biological Infrastructure (DBI-BRC) can be downloaded here: http://bit.ly/cCq3DH

BiSciCol Tracker: Ready, set, go!

2010-10-27T08:29:00.000-07:00

BiSciCol (Biological Science Collections) Tracker is a recently funded NSF project (September 2010) with the goal of building an infrastructure designed to tag and track scientific collections and all of their derivatives.

Scientific collections created and used in basic research are an integral part of our scientific infrastructure. Individual specimens in these collections serve as the anchor for an expanding array of information, about the specimen and the group it represents, that grows and changes with time. Unfortunately, as we all know, specimens and subsamples are scattered geographically across institutions. Taxonomic, genomic, geospatial, and other information about the specimens is also scattered across independent computer systems and on paper, and is very difficult to access or synthesize. Current data sharing systems such as DiGIR are one-way channels and do not allow for quick and easy two-way linking of information or updates as new knowledge is gained.

The BiSciCol team will take the appropriate next steps to address a challenge facing the entire biological collections community – linking and tracking scientific collection objects (specimens, sequences, images, etc.) and their digital metadata across multiple institutional collections with heterogeneous information management systems. In current distributed data systems (e.g., GBIF, MANIS, HerpNET, ORNIS), information is passed one-way from data providers to users. No mechanism exists to tag or annotate collection objects and link information to other collection objects or data records and back to the original collections. Our deliverables are to 1) develop a tracking and annotation system based on globally unique identifiers (GUIDs) and ontological relationships; 2) deploy this system and others in a Virtual Information Appliance (VIA) as a Virtual Machine (VM); and 3) document and implement a set of use cases and practices, based on characteristic physical and digital workflows in the community.

The need to provide access to validated biodiversity information has been documented in a number of workshops, reports, etc., but as yet there is no single implementation that would support collections and research information management using the proposed approach. BiSciCol is designed on the simple premise that changes to data objects are trackable with GUIDs, and that semantic relationships are assignable and discoverable among physical and data objects, for example when a specimen is imaged or sampled for DNA extraction. Ultimately, this project enables discovery, accessibility, and networking of collections, in order to advance semantic interoperability for collection information systems.

Our deliverables are designed to benefit the entire biological collections community by taking initial steps to implement core information infrastructure based on established challenges in the community. Collections data are critical to land management decisions, maintenance of biodiversity, and analysis of the causes and consequences of climate change. Finally, we will actively engage user communities through training workshops, summer student internships, and community BioBlitz enhancements.

Who we are: The BiSciCol collaborative represents a broadly trained team of biologists, collections curators, and information and technology specialists. Our team includes 6 Institutions (University of Florida, The Smithsonian, University of California, Berkeley, University of Colorado, Boulder, University of Arizona, and The Bishop Museum, Hawaii) and 15 Investigators (Nico Cellinese, Jonathan Coddington, Neil Davies, John Deck, Rob Guralnick, Bryan Heidorn, Steve Manchester, Chris Meyer, Tom Orrell, Gustav Paulay, Rich Pyle, Kate Rachwal, George Roderick, Russell Watkins, Rob Whitton, and Norris Williams).

Needless to say, we are anxious to start!