Thursday, December 27, 2012

BiSciCol in Four Pictures

People always say a picture is worth a thousand words.   Given that, we want to present “4000+” words here.  That is, below are 4 images and some “captions” or explanatory text – that, while not fully inclusive of current efforts,  gives a  close to a complete update on progress and our next steps.



Figure the first.   One of the things that the BiSciCol crew has thought a lot about is how to express relationships among different kinds of physical and digital biological collection “objects”.  Our work is focused on tracking those relationships, which means following the linkages between objects as they move about on the Internet of Things (http://en.wikipedia.org/wiki/Internet_of_Things).  Early in the BiSciCol project, we had exactly one relationship, which we expanded a few blog posts ago, by adding a second predicate called “relatedTo” which is directionless and limits how searches could traverse our network.  We have now settled on what we hope is a final set of predicates, which also includes “derives_from” and “alias_of”.  “Derives_from” is important because it recognizes that properties of biological objects can be shared among its derivatives, such as saying that a tissue sample can be inferred to be collected in Moorea, French Polynesia because the specimen (whole organism) was defined as being collected there (“derives_from” is borrowed from the Relations Ontology and defined as transitive).  Finally, “alias_of” is a way of handling duplicate identifiers for the same object.



Figure The Second.  We know you love technical architecture diagrams during the holidays.  Although this looks a bit complicated, let’s take this apart and discuss the various parts, because it summarizes a lot of work we invested to deal with some challenging social and technical issues.  This diagram is really built on three main components:  the GetMyGuid service, the Triplifier Simplifier, and the Triplifier Repository.  The GetMyGUID service is used to mint EZIDs that can be directly passed to biocollections managers for using at the source, or that can be associated with data in the triplestore.    The Triplifier (Simplifier) is a tool for creating RDF from biocollections data, and pushing that to a user via web services or to a triplestore.  We are now working out the backend architecture to deal with storing a large number of triples.  We have developed this architecture to be flexible, simple, and based on understanding user needs (and concerns) with regards to permanent, unique identifiers and semantic web approaches.



Figure the third.  The Triplifier is a web-based software that takes input files and creates triples (http://en.wikipedia.org/wiki/N-Triples) from them.  The process for doing this involves multiple steps, starting with uploading a database or a spreadsheet to the Triplifier, specifying any known joins between tables that are uploaded, and mapping properties in those local files to known terms in an appropriate vocabulary, relating terms using predicates and then hitting “Triplify!”  For those not versed in ontologies and the semantic web, the whole process can be intimidating!  So we made it easier.  The Triplifier Simplifier can take any dataset in Darwin Core format, and we’ll do the work for you.  We’ll read the header rows, verify that they map to Darwin Core terms, and set it all up to Triplify correctly.  VoilĂ !  We have a bit more work to do here before the Simplifier is ready – the big challenge is taking these flat files “spreadsheets” and recreating a set of tables based on Darwin Core classes such as “occurrence”, “event”, “taxon”, etc.  We will spend more time discussing this in future blog posts!

Figure the Fourth.   This is another “in preparation” web interface for users to get Great and Useful EZIDs.  The options for doing so include pasting in a set of local identifiers, which could be set of catalog numbers or locally specific event Identifiers.  The GetMyGuid service creates a second column and makes an EZID per row linked to the local identifier.  A user can then import this right back into their database and have EZIDs on their source material.   The “Create GUIDs” link just mints a set of EZIDs for later use.  Some authentication will be required and we might put an expiration data on how long you can wait to use them.  The last option is “Mint a DOI for your dataset”.  You basically just type in the digital object location, and some key metadata and you get a DOI that can resolve to at least the metadata and link to the actual digital object.   As always, BiSciCol will accept any well-formed, valid URI, persistent identifiers expressed by clients.  We are working closely with the California Digital Library and extending their EZID API for use in this part of our project.

Summary:  We end 2012 on a BiSciCol high note, and not just because the meeting was in Boulder Colorado either (because of the elevation, people!  Not the legal cannabis!)  We have made a lot of progress based on productive meetings, a lot of input from various folks, and a lot of time and effort by our talented staff of programmers who work so hard to develop this and also canvas the community.  We should also take this opportunity to give a shout out to a new developer on the team, Tom Conlin, who is joining us as our backend database expert.  Great to have him on board!

- John Deck, Rob Guralnick, Brian Stucky, Tom Conlin, and Nico Cellinese

Friday, October 12, 2012

Making it 'EZ' to GUID

On Global Unique Identifiers (again) for Natural History Collections Data:  How to Stop People From Saying “You’re Doing It Wrong” (or conversely, “Yay! We’re Doing It Right!”)
'
From Gary Larsen and adapted by Barry Smith in Referent Tracking
presentation at the Semantics of Biodiversity Workshop, 2012.

The natural history collections community has been hearing about GUIDs (globally unique identifiers) for a long time.  However, what we’ve typically heard are comments like “ARGH!  These don’t work” or “These are hard to implement”; or we’ve been subjected to long “policy documents” that seem to be generated by computer scientists, not the people actually working in the collections.  So the bottom line is that it’d be nice to have some clear, short “thinky things” about GUIDs that help us understand the value a bit more clearly, and that provides a simple and clear way forward.  We want to take a stab at that here and VERY MUCH WELCOME feedback.  Lots of it.  We’ve thought a ton about it and we are ready!
 
A recurrent question we have gotten from people developing collections database (or at the level of aggregators such as Vertnet or GBIF) is why we need to go beyond the self-minted, internal GUIDs and why GUIDs need to resolve and be persistent.  We could envision a large data aggregator such as iDigBio or GBIF that keeps track of digital records by assigning UUIDs (universally unique identifiers --- which are very, very easy to mint!) to these but likely without any connection to the physical source objects stored in providers institutions, and/or any connection to the same objects stored in other institutional repositories or aggregators.  Yet, the ultimate value of assigning GUIDs to objects, their metadata and derivatives is that we can track all these back to their source and generate queries that imply semantic reasoning over a robust digital landscape. In such a landscape, answering those core-challenging questions generated by collaborative projects becomes possible.  Therefore, the digitization process acquires a much deeper meaning and value by going beyond the process of straightforward data capture and moves towards an environment where we can track relationships among physical objects, annotations, and descriptive metadata as part of a global network.  If as a community we agree on the benefit of assigning GUIDs, this is the opportunity to generate a strategy that would add long-term value to this effort.  In other words, if we need to invest our resources, let’s do it in ways that we can draw benefit now and in the future.

A big question is how to best implement such a vision.  GUID implementations within our community have proven problematic as evidenced by 20% of Rod Page’s blog posts.  After much vetting of possible solutions, we believe the right answer is to leverage existing expertise developing not just GUIDs, but a set of services built around them.   In particular, we have talked to the California Digital Library (CDL) about EZIDs, and the value of using EZIDs given that these elegantly solves a lot of community needs at once and nicely positions us for the future. Speaking of community needs, the solution we advocate is not just “go get EZIDs”.  BiSciCol was funded, in part, to help with this task of working with the community and foster the implementation of GUIDs as a necessary step towards bringing our digital resources into a Linked Open Data framework.  BiSciCol wants to build out services that support the community, working with CDL and you, to make that happen.

What are EZIDS and why do we love them?  
As we mentioned in a previous blog post (http://biscicol.blogspot.com/2012/08/the-biscicol-project-team-has-had-busy.html), CDL has developed EZIDs, which are flexible GUIDs built off of DOIs and ARKs.  The big win is that there are a bunch of CDL services already developed to help with minting these GUIDs, and assure that these are resolvable, linkable, persistent and long-term sustainable.  EZIDs have some lovely features, including their flexibility to be associated with datasets and objects through the whole digital data life cycle.  Also, EZIDs allow us to mix and match DOIs, which are well understood and used in the publishing community, with ARKs, which were developed in the archives, library and museum community and provide a bit more flexibility and the ability to assign on a more granular level to individual data objects rather than datasets.  For more details, see John Kunze’s powerpoint presentation on EZIDs).   We can work with CDL and their EZID system to build a prototype collections community GUIDs service.

So you are thinking to yourself... how much does it cost?  The answer is:  Nothing to you,   very little to BiSciCol, and ultimately remarkably lower than what has already been spent in terms of people-hours trying to sort through this very complex landscape, and develop home-grown solutions.  Sustainability has costs --- and the goal is to scale those down to the point where they are orders of magnitude lower than where they have been before by leveraging economies of scale. We do that with this solution.  Big win.  

Our View on Best Practices:

  1. GUIDs must be globally unique.  The “Darwin Core Triplet” might not be good enough.  
  2. GUIDs must be persistent.  Most projects generating GUIDs have < 10 year lifespans.  Having persistent GUIDs means we need to think about strategies for resolution services (if required) that have a > 10 year lifespan and in the context of an institution that is designed to be persistent.
  3. GUIDs must be assigned as close to the source as possible.  For example, if data is collected in the field, the identifier for that data needs to be assigned in the field and attached to the field database with ownership initially stated by the maintainers of that database.  For existing data, assignment can be made in the source database.
  4. GUIDs propagate downstream to other systems.  Creating new GUIDs in warehouses that duplicate existing ones is bad practice, and thus aggregators need to honor well-curated GUIDs from providers.  
  5. Don’t conflate GUIDs for physical material with GUIDs for metadata about that physical material.  We promote the assignment of unique GUIDs to physical material; metadata about physical material will have a separate GUID.  While physical object IDs can be any type of GUID, we recommend EZIDs as they are short, unique, opaque, resolved by a persistent entity, and redirection to metadata can be stored with the identifier itself.  UUIDs can be used for this purpose as well BUT are not as robust as EZIDs since they lack redirection or resolution or require local solutions (see #2 above for problems with such solutions).
  6. GUIDs need to be attached in a meaningful way to semantic services.  Knowing semantically that a GUID is referring to either an information artifact, a process, or a physical thing is vital to understanding how to interpret the meaning of its relationship to other GUIDs expressed in other areas and to inform aggregators how to interpret content.

A prototype collections community guid service.
GetMyGUID Service - “Promoting GUID Standard Design Practices”.  We have blathered on long enough here, but want to just give a hint of where we are going.  We want to create a service that is built by natural history collections folks (and our computer science friends) for natural history collections folks, that taps into existing goodness already created.  That is, we tap into the existing services from EZIDs but then further develop a service that encodes best practices that work in this community.  In the near future, we are going to explain how the service works, how you can access it, why it does what it does.   We know how hard it is to get folks to make updates and additions to their databases, so we want to find out how to get over that barrier!  We want to find those early adopters (and hint hint, we are working with BiSciCol partners already to get this ball rolling!).  So, more soon.  Pass the word along!


- John Deck, Rob Guralnick, and Nico Cellinese

Saturday, August 25, 2012

News Update: how do we 'GUID'?

The BiSciCol project team has had a busy summer that included a presentation at the Annual Meeting of the Society for the Preservation of Natural History Collections (SPNHC) in New Haven, CT, and a presentation at the iEvoBio 2012 meeting in Ottawa, Canada.  Additionally, on the 13-15 August John Deck, Nico Cellinese, Rob Guralnick and Neil Davies convened at the University of California, Berkeley in order to meet with a few key partners and discuss the next steps forward for the project (meeting summary).  Before we report more about the meeting with our partners, here is some background information.  

BiSciCol's main goal is to break down the walled gardens between databases storing different kinds of biodiversity data such as specimens or samples from collecting events, sequences, images, etc. generated from those specimens or samples.  Doing so requires overcoming two separate community challenges. First, there must be a mechanism to associate globally unique identifiers (GUIDs) to collections records (Note, we are using the RSS specification GUID definition).  Second, the collections records must be expressed such that the terms used to define those records and their relationships are well understood by humans and computers.  This brings us into the “semantic web” and RDF “triples”.    

As BiSciCol has evolved, two key questions related to these challenges have emerged.  The first is whether GUIDs and creating "triples" should happen at the level of individual provider databases, or instead at the level of "aggregators" that enforce a standardized schema and encoding.  In the case of biological collections, an example of standardized schema is Darwin Core, usually encoded into a Darwin Core Archive.  Example aggregators are GBIF, VertNet and Map of Life. The second question is equally thorny and deals primarily with the content that the identifier is describing: is the identifier describing a physical object, a digital surrogate of a physical object, and is it a primary digital surrogate or a copy?  An example would be provided by specimen metadata attached to a photo record in Morphbank, which contains a copy of specimen metadata which in turn references a physical object.  

So, lets turn back to the meeting in Berkeley. That meeting included two key partners with whom we want to further develop and test ways forward given the two huge questions above.  We spent part of the time with the California Digital Library (CDL) folks, who have built a set of excellent tools that may be part of the solution to the problem of GUID assignment.  CDL has developed EZIDs, which are flexible GUIDs built off of DOIs and ARKs.  John Kunze from CDL gave a great rundown on EZIDs and how they work, and was kind enough to meet us again on a couple of separate occasions, formal and informal.  Metadata encoding in the EZID itself may also be used to indicate use restrictions and provenance (John Kunze’s powerpoint presentation on EZIDs).  

The other key partner with whom we met was VertNet and Aaron Steele, the lead systems architect on the project.  The idea behind meeting with VertNet was to test out how we might do EZID assignment and triplification utilizing the same approach by which VertNet data is being processed from Darwin Core archives into a set of tables that can be visualized, queried and replicated.  Aaron was kind enough to participate to our hackathon and start up this process.  We set up a readme file about the Hackathon to describe our expected outputs.  Yes, the project is called "Bombus" which reflects the fact that although a bit wobbly, our goal is to have data flying around to "pollinate" other data.  Happily, the hackathon was very much a success!  We were able to tap into some existing code generated by Matt Jones (NCEAS) to mint EZIDs and VOILA, we had an output file ready for the semantic web (e.g. an output file that shows relationships between occurrences, localities and taxa based on the EZIDs).   We weren't quite able to get to the last step of querying the results, but we're very close.  More work (and reports) are to follow on this so stay tuned on the Bombus/pollinator link above.  

We have been testing a variety of solutions for identifier assignment, including: supporting user-supplied GUIDs, aggregator GUIDs, dataset DOIs, community standard identifiers (e.g. DwC Triplet), and creating QUIDs (Quasi Unique Identifiers) from hashed content.  EZID technology will play a significant role in the implementation of a number of these approaches.  None of these approaches offer a complete solution, but taken together, we can begin to build an intelligent system that can provide valuable services to aggregators, data providers, and users.  Services we will be supporting include: GUID tracking, identifying use restrictions, and GUID reappropriation.  Integrating our existing triplifier and biscicol java codebases with a scalable database back-end will fulfill most of the technical requirements needed.

We are still building our Triplifier to support those who want to take their own datasets and bring them into the semantic web framework, but BiSciCol can operate much more "at scale" with a very simple interface that accepts Darwin Core Archives or other standardized data such as those generated from Barcode of Life, Morphbank, or iDigBio, and assemble these into a shared triplestore or set of commonly accessible triplestores.  We think the issues we're tackling right now are at the sociotechnical heart of BiSciCol.  We use the term heart knowingly because it is going to be the desire and will of the community, along with the resources such as BiSciCol, that can help motivate and excite, and that will get us at least moving in the right direction. If you have any thoughts, criticisms, suggestions, we'd of course love to hear them.  


John Deck, Rob Guralnick and Nico Cellinese

Friday, April 6, 2012

Making our System Smarter

Computers are amazing at following instructions. So amazing, in fact, that a seemingly harmless instruction can potentially lead to an entirely false conclusion.
At our recent BiSciCol meeting at the University of Florida, we had a discussion about just such a case.

At its core, BiSciCol is all about connecting objects to each other. In order to accomplish connecting object identifiers to other objects, we have been using a simple relationship expression called “leadsTo” that indicates a direction in the nature of the relationship between one object and another. To illustrate how “leadsTo” works, lets provide a simple example. Suppose we have a collecting event, which we join to a specimen object using our relationship predicate “leadsTo”. The specimen object could then “lead to” a taxonomic determination, which could in turn “lead to” a scientist, and so on.

This is certainly useful as we can express an endless chain of objects and their derivatives , even if they exist in different databases. However, what if we extended the above example just a bit further, using our “leadsTo” relationship?


Uh oh--- By successively following the leadsTo relationships, we could now erroneously conclude that spec2-t1 came from spec3! This is not good! Fortunately, there is a solution.

We realized that the directional “leadsTo” relationship simply doesn't make very much sense in some situations, such as the connection between spec3 and person1 in the diagram above. Consequently, instead of the single “leadsTo” relationship, we actually need two relationship terms: one that has a distinct direction and one that implies no direction. Two terms are in use currently that do just this from the Dublin core standard: 1) relation (no direction) and 2) source (has direction).


In the first example above, we could avoid the problem entirely by describing the link between the taxonomic determination and the scientist as a non-directional relation. Using our new terminology, the graph would look something like this:


The computers involved in figuring out how to traverse the graph of relationships would know not to follow non-directional relationships and we would no longer infer that spec2-t1 came from spec3. Problem solved!


This post written by John Deck and Brian Stucky with input from Hilmar Lapp, Steve Baskauf, Andrea Thomer, Rob Guralnick, Lukasz Ziemba, Tim Robertson, Reed Beaman, and Nico Cellinese, summarizes a discussion that took place at the BiSciCol development meeting held on the March 31, 2012 at the University of Florida.

Friday, February 17, 2012

Development Meeting, Boulder, Colorado, 2-4 February, 2012

Two Days of Triplifying

The Boulder contingent of BiSciCol hosted a short two day "developer's meeting" that included John Deck, Brian Stucky, Lukasz Ziemba, Bryan Heidorn and his students Alyssa Janning and Qianjin Zhang.  As luck would have it, the BiSciCol crew arrived on a blustery morning just ahead of the biggest single February snowfall on record.  It proceeded to snow from Thursday afternoon to Saturday morning without much pause, lending a surreal quality to the proceedings.  It also fubared some plans to use campus meeting facilities, since that Friday was the first snow day in many a moon.  Enough about the weather!  Lets talk about what we did!

All participants were very pleased with the outcome of this meeting. Brian had been hard at work developing a generic plug-in interface so that anyone can write some simple code to connect whatever kinds of record sets they have and begin an import into the Triplifier.  OH! WAIT.  WAIT.  First things first!  What the heck is Triplifier, you ask?  And why do we think this Triplifier is such a good idea?

The BiSciCol project works by linking data in data sources based on logical relationships independent of a particular implementation. This is a different kettle of fish compared to a data standard. BiSciCol works where standards stop.  We particularly want to, for example, represent how a sequence is related to a specimen which is related to an event and a location.  The problem is creating a common "format" for expressing simple relationships and then using a set of those simple ones to build more complex "graphs" of these relationships. So what is that "common format"?  In the world of the Resource Description Framework (RDF), the format is called a "triple".

A triple is not that complicated; it basically expresses a unique fact about how things are related to one another.  The format of a triple is subject - predicate - object (thus the "triple" - three pieces of data). The triple format is not all that different from what is expressed in a database or spreadsheet or other structured data document.  The set of relationships that allow joins to happen in relational databases are in theory very similar.  So similar that one can convert a database or other document into triples.  And thus the point and value of a "Triplifier" -- a way to convert any set of documents into triples so that we can begin to compile a larger set of resources.

So back where we started...Brian Stucky has developed a generic plug-in for ingesting different types of data into the Triplifier. And Lukasz has used that generic plug-in to build a Darwin Core Archive ingester.  The big news, however, has to do with a platform called D2RQ.  Basically, D2RQ does the heavy lifting of representing relational databases (or your own declarations of relationships between objects) as triples.  At the heart of D2RQ are: 1) the "ClassMap" which represents classes from a schema; 2)"PropertyBridge" which basically defines the properties in RDF using the class map.; 3) Joins that link tables.  In a nutshell, a user can specify (or pass along a relational database) with the right information about the database and its foreign keys, and dump out RDF triples.

The good news is that we were able to test D2RQ with a very simple relational database that relates collectors and specimens to verify that we get the right outputs.  After some trial and error, and specifying the right class maps, we succeeded in generating meaningful RDF output.  Given this, we are ready to rock and roll with developing the Triplifier fully, and will be cranking on this over the next few months.  Lukasz has already made progress on a Web interface, and we are preparing to test the system with data from the Moorea Biocode and HERBIS project.

- Rob Guralnick reports


Friday, September 23, 2011

Development meeting at UF, Sept. 12-13, 2011 - A quick summary

Our core development team met again, and although a few key people were missing, there were enough to generate a healthy discussion and set the course for more progress on BiSciCol development. The following people attended: Reed Beaman (UF), Nico Cellinese (UF), Brian Stucky (University of Colorado, Boulder), John Deck (UC Berkeley), Bryan Heidorn (UA), Tom Orrell (SI), Kate Rachwal (UF), Russell Watkins (UF), Lukasz Ziemba (UF). Overall, we discussed a number of topics to include user scenarios and potential queries, output of query results, coding requirements, how to represent geographic information, how to handle taxonomic names, tasks to be completed, timetable. Consistent threads throughout these discussions were the form of BiSciCol, from a structural and user interface perspective, and its role with respect to data providers (clients), processing and service. On Monday 12th, we reviewed the user scenarios we generated in the past and discussed potential queries derived from the workflows presented in these scenarios. It quickly became clear that there was a convergence of common queries across all scenarios. Potential query types are listed below followed by a brief example in parentheses. It should be noted that the data schema is based on transitive relations between objects, including “relatedTo” and “sameAs” inferences between objects. Also, filters can be applied to any Object Id with respect to: date, source database, and level or depth of relatedness. These filters can be applied in combination within a single query (see below). We will have to spend more time thinking about scalability and how to handle very large and/or complex queries.

Enumeration of Query Types 

Relations = The graph of transitive relations attached to a particular object. Including “sameAs” inferences between this object and all related objects.

DateFilters = All queries below can be modified with a date filter for any Object ID Source DB Filters = Ability to filter on source database.

 Potential Query Types:

 1. WHERE Object=X, return relations (Given a specimen ID, return eventID, tissueIDs, etc… that are related)

 2. WHERE Object=X, return List of Objects that are "sameAs" (From Use Case #4: Given a collectors specimen ObjectID, return internal IDs, and other institution IDs that refer to this same ID)

 3. WHERE Object=X, and has duplicate non-matching literals (A GUID has two different taxonomic names "Jim" and "Fred")

 Filtering Queries (on rdf:Type):

 Recursive filters can be applied to derived graphs (returned output.).

We need to support multiple types of filters, which means (1) multiple filters at one level of relatedness and/or/not (2) multiple filters across several levels of relatedness.

1. Given an Object, return all related objects that are a certain rdf:type (e.g. Show all images that are attached to a Specimen ID).

 2. Given an Object, return related objects that are a certain rdf:type AND related objects that are another rdf:type (e.g. given an EventID, only search for specimens with identifications so we don't also include tissues, extractions, sequences with identifications).

A consideration that arose from the query conversation was database updates and tracking of changes. A discrete timetable schedule for database updates needs to be determined. The mechanism for updates will involve applying an inferencing process to data delivered from partners to create new relations tables. Currently, we think that the process will proceed as follows:

1. Clients publish data to a Virtuoso Database for example, (perhaps via GBIF’s IPT) into a. client specific relations table and b. client specific sameAs table

2. BiSciCol will periodically re-infer all relations (all previously submitted data is still accessible)

3. BiSciCol publishes results of user queries periodically (e.g. through RSS) Regarding real-time changes/incorporation of individual object records we think this is easily done for relatedObjects (e.g. putting in a newRelationsTable) but we need more discussion on the inferencing process to complete (1x / month) for sameAs relations

So, having stated all of the above we still need to: 

- Think about complex queries and scalability
- Expressing “Darwin-Triples” (institution code, collection code, catalog #) as a GUID. E.g. http://mydarwincore.org/resolver/ic=MVZ;cc=VERTS;cn=12345 OR LSID --- this is a project relegated to some implementation, probably HERBIS test case.
- Update Test case to put Multiple Literals for same Object
- Pre-inferred relations as a separate table?

We then discussed various forms of output of query results (see summary listed below). Suggestions included some standard forms of visualization such as summary statistics, charting (e.g. bar, pie, etc.), browsable lists, sortable HTML tables, a timeline of object updates and changes, and mapping of geolocated objects. Other ideas included RDF triples, various JSON outputs, some sort of standards compliance measure(s), incremental levels (e.g. list a specified depth or level of the sameAs or relatedTo). One interesting idea was to provide a means to evaluate the completeness and quality of collector efforts and databases. Two suggested approaches were user vetting, with Facebook-like “up or down” voting buttons, or a possibly more objective ranking algorithm (e.g. Google’s page ranking). For now, we are going to provide outputs in the form of tables, hierarchical (tree view), maps, and by rdf:Type. However, we agreed to seek out other interesting and innovative forms of visualizing results.

 0. pie charts, statistics on data that is returned (e.g. Sencha)
1. RDF Output (integrate w/ freebase)
2. JSON Output
3. Standards Compliance Measure (e.g. MIMARKS score in percent)
4. Sortable HTML Table 

5. Summarize results by rdf:Type (either modified HTML Table or modified JSTree e.g. see http://library.conservefloridawater.org/WCC?act=view&oid=11964077&lng=1)
6. Search for interesting visualizations
7. Service renders incremental portions of graphs (e.g. 1 transitive level deep)
8. Timeline – what has changed over a period (graph of date last modified)
9. Browse lists: institutions, collectors, etc …
10. Mapping results
11. Vote Up or Down on Datasets or Collectors (e.g. see http://answers.semanticweb.com/questions/1581/whats-missing-from-rdf-databases) - for example, consider facebook like buttons, or google page rank on output methods for vetting, objective algorithms.

Tuesday morning was largely occupied with technical coding discussions. Specific tasks were identified and prioritized relative to project milestones (see summary below). This included user interface development, documentation of code, REST service tuning, data ingestion filters, and RDF-specific enhancements such as indexing and inferencing. One major component identified for development has been dubbed “TriSciCol”, which establishes and expresses the triple-store object relations from client-supplied databases. Details on the design and functionality of this component will be circulated soon.

Focus areas

Lukasz Ziemba (UF) User Interface Development
Brian Stucky (CU) REST Service / Tweeks
Brian Stucky (CU) Code-Base / Documentation of Code / Unit Testing
John Deck Data Ingest (GBIF data?, Cam Web)
Deck/Stucky Data Indexing / RDF Structure / Inferencing Engines / Reasoners
Bryan Heidron (UA) Herbis Use Case / Arctos Integration
Orrell (SI), Deck, Cellinese (UF) TriSciCol – flesh out specifications and submit for review by BiSciCol collaborators

 How to store and represent geographic information in BiSciCol 

 “Location” is a reference to the verbatim location information by the original collector. This GUID points to the Darwin Core Location class except for the parts of that class that refer to a georeference. The “Georeference” class expresses a georeferencing instance that is related to “Location”. This allows us to express multiple georeferencing instances for an individual “Location” and track DateLastModified dates for georeferencing instances atomically.

 The result is that “Locations” are never mapped by BiSciCol, only “Georeferences” are. In this way, we can define types of “Georeferences” that correspond to the manner in which they will be mapped or utilized in consuming applications. E.g. geo:lat/geo:lng, SpatialThing, MGRS, etc.

 A note for consideration here is that currently georeference terms are embedded in the Location class in Darwin Core and we will probably want to make a reccomendation to move the georeferencing terms into their own class.

 Terms

 dwc:Location is an object type that indicates the associated identifier is a location (E.g. text description, coordinate, any direct reading from an instrument, or observed). The content linked by this identifier is normally verbatim location information. This identifier cannot contain references to location literals. 

:Georeference is an object type that indicates the associated identifier is a geo-referenced location. This is the only object type that can reference spatial representations as literals. Need update from DwC folks on status of Georeference being its own DwC Class?

:hasGeoreference is a predicate that joins an identifier of type GeoreferenceID to a literal representation of a spatial representation. :hasGeoreference can have the following subProperties:
  • :hasMGRSGeoreference 
  • :hasSpatialThingGeoreference 
  • :hasWKTGeoreference 
 Example N3 file
:ObjectX a dwc:Event  
:ObjectX dwc:DateLastModified “2011-01-01” 

:ObjectY a dwc:Location 
:ObjectY dwc:DateLastModified “2011-01-02” 

:ObjectZ1 a dwc:Georeference 
:ObjectZ1 :hasSpatialThingGeoreference “48.198634,16.371648;crs=wgs84;u=40” 
:ObjectZ1 dwc:DateLastModified “2011-01-03” 

:ObjectZ2 a dwc:Georeference 
:ObjectZ2 :hasMGRSGeoreference “4QFJ1234567890” 
:ObjectZ2 dwc:DateLastModified “2011-01-03” 

:ObjectX relatedTo :ObjectY 
:ObjectY relatedTo :ObjectZ1 
:ObjectY relatedTo :ObjectZ2 

Java Code 
Georeference class implements Location interface. SpatialThingGeoreference, MGRSGeoreference, WKTGeoreference classes extend the Georeference class. A georeference is returned by BSCObject with the method getGeoreference. 

 We’ll meet again as a much larger group on the 19th of October during the TDWG meeting in New Orleans. If you happen to be there and interested in our discussion, do join us and/or stay tuned for more news.

Thursday, July 21, 2011

Tagging use case 585

Tagging use case 585. requires semantic linking to taxonomy of the shark, geospatial track and place names and likely a tracking methods ontology.