Friday, September 23, 2011

Development meeting at UF, Sept. 12-13, 2011 - A quick summary

Our core development team met again, and although a few key people were missing, there were enough of us to generate a healthy discussion and set the course for more progress on BiSciCol development. The following people attended: Reed Beaman (UF), Nico Cellinese (UF), Brian Stucky (University of Colorado, Boulder), John Deck (UC Berkeley), Bryan Heidorn (UA), Tom Orrell (SI), Kate Rachwal (UF), Russell Watkins (UF), Lukasz Ziemba (UF).

Overall, we discussed a number of topics, including user scenarios and potential queries, output of query results, coding requirements, how to represent geographic information, how to handle taxonomic names, tasks to be completed, and the timetable. Consistent threads throughout these discussions were the form of BiSciCol, from a structural and user interface perspective, and its role with respect to data providers (clients), processing and service.

On Monday the 12th, we reviewed the user scenarios we generated in the past and discussed potential queries derived from the workflows presented in these scenarios. It quickly became clear that there was a convergence of common queries across all scenarios. The potential query types are listed below, each followed by a brief example in parentheses. Note that the data schema is based on transitive relations between objects, including “relatedTo” and “sameAs” inferences between objects. Filters can also be applied to any Object ID with respect to date, source database, and level or depth of relatedness, and these filters can be combined within a single query (see below). We will have to spend more time thinking about scalability and how to handle very large and/or complex queries.

Enumeration of Query Types 

Relations = The graph of transitive relations attached to a particular object, including “sameAs” inferences between this object and all related objects.

DateFilters = All queries below can be modified with a date filter for any Object ID.

Source DB Filters = Ability to filter on source database.

 Potential Query Types:

 1. WHERE Object=X, return relations (Given a specimen ID, return eventID, tissueIDs, etc… that are related)

 2. WHERE Object=X, return List of Objects that are "sameAs" (From Use Case #4: Given a collector's specimen ObjectID, return internal IDs, and other institution IDs that refer to this same ID)

 3. WHERE Object=X, and has duplicate non-matching literals (A GUID has two different taxonomic names "Jim" and "Fred")
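
The three query types above can be sketched against a toy in-memory graph. This is only an illustration under stated assumptions, not BiSciCol's implementation: the class RelationGraph, its method names, and the identifiers such as "specimen:1" are hypothetical, relatedTo edges are treated as directed, and sameAs edges are treated as symmetric.

```java
import java.util.*;

/** Toy in-memory graph for illustration; not BiSciCol's actual storage or API. */
class RelationGraph {
    private final Map<String, Set<String>> relatedToEdges = new HashMap<>();
    private final Map<String, Set<String>> sameAsEdges = new HashMap<>();
    // objectId -> predicate -> distinct literal values
    private final Map<String, Map<String, Set<String>>> literals = new HashMap<>();

    void addRelatedTo(String from, String to) {
        relatedToEdges.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    /** sameAs is assumed symmetric, so the edge is stored in both directions. */
    void addSameAs(String a, String b) {
        sameAsEdges.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        sameAsEdges.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    void addLiteral(String objectId, String predicate, String value) {
        literals.computeIfAbsent(objectId, k -> new HashMap<>())
                .computeIfAbsent(predicate, k -> new TreeSet<>())
                .add(value);
    }

    /** Query type 1: objects reachable over relatedTo within maxDepth hops. */
    Set<String> relations(String objectId, int maxDepth) {
        return reachable(relatedToEdges, objectId, maxDepth);
    }

    /** Query type 2: every identifier inferred to refer to the same object. */
    Set<String> sameAs(String objectId) {
        return reachable(sameAsEdges, objectId, Integer.MAX_VALUE);
    }

    /** Query type 3: predicates that carry more than one distinct literal. */
    Map<String, Set<String>> conflictingLiterals(String objectId) {
        Map<String, Set<String>> conflicts = new TreeMap<>();
        literals.getOrDefault(objectId, Collections.emptyMap()).forEach((predicate, values) -> {
            if (values.size() > 1) {
                conflicts.put(predicate, values);
            }
        });
        return conflicts;
    }

    /** Breadth-first walk that stops after maxDepth levels; the start node is excluded. */
    private static Set<String> reachable(Map<String, Set<String>> edges, String start, int maxDepth) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(start);
        List<String> frontier = Collections.singletonList(start);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier) {
                for (String neighbor : edges.getOrDefault(node, Collections.emptySet())) {
                    if (seen.add(neighbor)) {
                        next.add(neighbor);
                    }
                }
            }
            frontier = next;
        }
        seen.remove(start);
        return seen;
    }

    public static void main(String[] args) {
        RelationGraph graph = new RelationGraph();
        graph.addRelatedTo("specimen:1", "event:1");
        graph.addRelatedTo("specimen:1", "tissue:1");
        graph.addLiteral("specimen:1", "scientificName", "Jim");
        graph.addLiteral("specimen:1", "scientificName", "Fred");
        System.out.println(graph.relations("specimen:1", 1));
        System.out.println(graph.conflictingLiterals("specimen:1"));
    }
}
```

The maxDepth argument plays the role of the "level or depth of relatedness" filter mentioned earlier; date and source-database filters would need extra metadata per object and are left out of this sketch.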

 Filtering Queries (on rdf:Type):

 Recursive filters can be applied to derived graphs (returned output).

We need to support multiple types of filters, which means (1) multiple filters at one level of relatedness, combined with and/or/not, and (2) multiple filters across several levels of relatedness.

1. Given an Object, return all related objects that are a certain rdf:type (e.g. Show all images that are attached to a Specimen ID).

 2. Given an Object, return related objects that are a certain rdf:type AND related objects that are another rdf:type (e.g. given an EventID, only search for specimens with identifications so we don't also include tissues, extractions, sequences with identifications).
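
One way to picture these rdf:type filters is as one predicate per level of relatedness, where predicates at a single level can be combined with and/or/not. The sketch below assumes a toy in-memory graph; the class TypeFilteredGraph, the helper isType, and the type labels ("Specimen", "Tissue", "Identification") are hypothetical and chosen only for illustration.

```java
import java.util.*;
import java.util.function.Predicate;

/** Toy graph with an rdf:type label per object; names are illustrative only. */
class TypeFilteredGraph {
    private final Map<String, Set<String>> relatedTo = new HashMap<>();
    private final Map<String, String> rdfType = new HashMap<>();

    void addObject(String id, String type) {
        rdfType.put(id, type);
    }

    void addRelatedTo(String from, String to) {
        relatedTo.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    /**
     * Follows one relatedTo hop per filter. At each level only objects whose
     * rdf:type passes that level's filter are kept and expanded further.
     */
    Set<String> query(String start, List<Predicate<String>> levelFilters) {
        Set<String> current = new LinkedHashSet<>(Collections.singleton(start));
        for (Predicate<String> filter : levelFilters) {
            Set<String> next = new LinkedHashSet<>();
            for (String node : current) {
                for (String neighbor : relatedTo.getOrDefault(node, Collections.emptySet())) {
                    String type = rdfType.get(neighbor);
                    if (type != null && filter.test(type)) {
                        next.add(neighbor);
                    }
                }
            }
            current = next;
        }
        return current;
    }

    /** Filter that accepts exactly one rdf:type; combine with or(), and(), negate(). */
    static Predicate<String> isType(String type) {
        return type::equals;
    }

    public static void main(String[] args) {
        TypeFilteredGraph graph = new TypeFilteredGraph();
        graph.addObject("specimen:1", "Specimen");
        graph.addObject("tissue:1", "Tissue");
        graph.addObject("identification:1", "Identification");
        graph.addObject("identification:2", "Identification");
        graph.addRelatedTo("event:1", "specimen:1");
        graph.addRelatedTo("event:1", "tissue:1");
        graph.addRelatedTo("specimen:1", "identification:1");
        graph.addRelatedTo("tissue:1", "identification:2");
        // Identifications reached through specimens only, skipping the tissue branch.
        System.out.println(graph.query("event:1",
                Arrays.asList(isType("Specimen"), isType("Identification"))));
    }
}
```

Passing a single filter such as isType("Specimen").or(isType("Tissue")) covers the "multiple filters at one level" case, while a list of several filters covers filtering across levels.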

A consideration that arose from the query conversation was database updates and tracking of changes. A discrete schedule for database updates needs to be determined. The mechanism for updates will involve applying an inferencing process to data delivered from partners to create new relations tables. Currently, we think that the process will proceed as follows:

1. Clients publish data, for example to a Virtuoso database (perhaps via GBIF’s IPT), into (a) a client-specific relations table and (b) a client-specific sameAs table

2. BiSciCol will periodically re-infer all relations (all previously submitted data is still accessible)

3. BiSciCol publishes results of user queries periodically (e.g. through RSS)

Regarding real-time changes and the incorporation of individual object records, we think this is easily done for relatedObjects (e.g. by putting them in a newRelationsTable), but we need more discussion on the inferencing process (to complete 1x / month) for sameAs relations.
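
To make the periodic sameAs re-inference concrete, here is a minimal sketch that merges the sameAs pairs from several client tables into groups of equivalent identifiers. It uses a union-find structure, which is an assumption of this sketch and not something the meeting settled on; the class SameAsInference, its methods, and the sample identifiers are hypothetical.

```java
import java.util.*;

/** Groups identifiers linked by sameAs pairs using union-find (an assumed technique). */
class SameAsInference {
    private final Map<String, String> parent = new HashMap<>();

    /** Returns the representative of id's group, compressing the path on the way. */
    private String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    private void union(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    /** Adds one client's sameAs table; each row holds exactly two identifiers. */
    void addTable(List<List<String>> sameAsTable) {
        for (List<String> row : sameAsTable) {
            union(row.get(0), row.get(1));
        }
    }

    /** Re-infers the groups from everything submitted so far. */
    List<Set<String>> groups() {
        Map<String, Set<String>> byRoot = new TreeMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(id), k -> new TreeSet<>()).add(id);
        }
        return new ArrayList<>(byRoot.values());
    }

    public static void main(String[] args) {
        SameAsInference inference = new SameAsInference();
        inference.addTable(Arrays.asList(Arrays.asList("mvz:1", "collector:77")));
        inference.addTable(Arrays.asList(
                Arrays.asList("collector:77", "si:9"),
                Arrays.asList("uf:3", "uf:4")));
        System.out.println(inference.groups());
    }
}
```

Because previously submitted tables stay in the structure, re-running groups() after each new client submission reflects all data received so far, which matches step 2 above in spirit.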

So, having stated all of the above, we still need to:

- Think about complex queries and scalability
- Express “Darwin-Triples” (institution code, collection code, catalog #) as a GUID, e.g. http://mydarwincore.org/resolver/ic=MVZ;cc=VERTS;cn=12345 or an LSID. This project is relegated to some implementation, probably the HERBIS test case.
- Update the test case to put multiple literals on the same object
- Decide whether pre-inferred relations belong in a separate table
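
As an illustration of the “Darwin-Triple” idea, the sketch below builds and parses a GUID in the shape of the example URL above. The resolver base and the ic/cc/cn keys come from that example; the class DarwinTriple and its methods are hypothetical, and the sketch assumes the three values contain no ";" or "=" characters, since no escaping scheme was discussed.

```java
import java.util.*;

/** Hypothetical value class for an (institution, collection, catalog number) triple. */
final class DarwinTriple {
    // Resolver base taken from the example URL in the post.
    static final String RESOLVER = "http://mydarwincore.org/resolver/";

    final String institutionCode;
    final String collectionCode;
    final String catalogNumber;

    DarwinTriple(String institutionCode, String collectionCode, String catalogNumber) {
        this.institutionCode = institutionCode;
        this.collectionCode = collectionCode;
        this.catalogNumber = catalogNumber;
    }

    /** Renders the triple as a resolvable-looking GUID; values are not escaped. */
    String toGuid() {
        return RESOLVER + "ic=" + institutionCode
                + ";cc=" + collectionCode
                + ";cn=" + catalogNumber;
    }

    /** Parses a GUID produced by toGuid(); rejects anything missing ic, cc, or cn. */
    static DarwinTriple parse(String guid) {
        if (!guid.startsWith(RESOLVER)) {
            throw new IllegalArgumentException("Not a resolver GUID: " + guid);
        }
        Map<String, String> parts = new HashMap<>();
        for (String component : guid.substring(RESOLVER.length()).split(";")) {
            int equals = component.indexOf('=');
            if (equals < 1) {
                throw new IllegalArgumentException("Malformed component: " + component);
            }
            parts.put(component.substring(0, equals), component.substring(equals + 1));
        }
        if (!parts.keySet().containsAll(Arrays.asList("ic", "cc", "cn"))) {
            throw new IllegalArgumentException("Missing ic, cc, or cn in: " + guid);
        }
        return new DarwinTriple(parts.get("ic"), parts.get("cc"), parts.get("cn"));
    }

    public static void main(String[] args) {
        DarwinTriple triple = new DarwinTriple("MVZ", "VERTS", "12345");
        System.out.println(triple.toGuid());
        System.out.println(parse(triple.toGuid()).catalogNumber);
    }
}
```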

We then discussed various forms of output of query results (see summary listed below). Suggestions included some standard forms of visualization such as summary statistics, charting (e.g. bar, pie, etc.), browsable lists, sortable HTML tables, a timeline of object updates and changes, and mapping of geolocated objects. Other ideas included RDF triples, various JSON outputs, some sort of standards compliance measure(s), incremental levels (e.g. list a specified depth or level of the sameAs or relatedTo). One interesting idea was to provide a means to evaluate the completeness and quality of collector efforts and databases. Two suggested approaches were user vetting, with Facebook-like “up or down” voting buttons, or a possibly more objective ranking algorithm (e.g. Google’s page ranking). For now, we are going to provide outputs in the form of tables, hierarchical (tree view), maps, and by rdf:Type. However, we agreed to seek out other interesting and innovative forms of visualizing results.

0. Pie charts, statistics on data that is returned (e.g. Sencha)
1. RDF Output (integrate w/ Freebase)
2. JSON Output
3. Standards Compliance Measure (e.g. MIMARKS score in percent)
4. Sortable HTML Table
5. Summarize results by rdf:Type (either modified HTML Table or modified JSTree, e.g. see http://library.conservefloridawater.org/WCC?act=view&oid=11964077&lng=1)
6. Search for interesting visualizations
7. Service renders incremental portions of graphs (e.g. 1 transitive level deep)
8. Timeline – what has changed over a period (graph of date last modified)
9. Browse lists: institutions, collectors, etc …
10. Mapping results
11. Vote Up or Down on Datasets or Collectors (e.g. see http://answers.semanticweb.com/questions/1581/whats-missing-from-rdf-databases). For example, consider Facebook-like buttons for user vetting, or a more objective ranking algorithm such as Google's page ranking.

Tuesday morning was largely occupied with technical coding discussions. Specific tasks were identified and prioritized relative to project milestones (see summary below). This included user interface development, documentation of code, REST service tuning, data ingestion filters, and RDF-specific enhancements such as indexing and inferencing. One major component identified for development has been dubbed “TriSciCol”, which establishes and expresses the triple-store object relations from client-supplied databases. Details on the design and functionality of this component will be circulated soon.

Focus areas

Lukasz Ziemba (UF): User Interface Development
Brian Stucky (CU): REST Service / Tweaks
Brian Stucky (CU): Code-Base / Documentation of Code / Unit Testing
John Deck: Data Ingest (GBIF data?, Cam Web)
Deck/Stucky: Data Indexing / RDF Structure / Inferencing Engines / Reasoners
Bryan Heidorn (UA): HERBIS Use Case / Arctos Integration
Orrell (SI), Deck, Cellinese (UF): TriSciCol (flesh out specifications and submit them for review by BiSciCol collaborators)

 How to store and represent geographic information in BiSciCol 

 “Location” is a reference to the verbatim location information recorded by the original collector. This GUID points to the Darwin Core Location class except for the parts of that class that refer to a georeference. The “Georeference” class expresses a georeferencing instance that is related to “Location”. This allows us to express multiple georeferencing instances for an individual “Location” and track DateLastModified dates for georeferencing instances atomically.

 The result is that “Locations” are never mapped by BiSciCol, only “Georeferences” are. In this way, we can define types of “Georeferences” that correspond to the manner in which they will be mapped or utilized in consuming applications. E.g. geo:lat/geo:lng, SpatialThing, MGRS, etc.

 A note for consideration here is that georeference terms are currently embedded in the Location class in Darwin Core, and we will probably want to make a recommendation to move the georeferencing terms into their own class.

 Terms

 dwc:Location is an object type that indicates the associated identifier is a location (e.g. a text description, a coordinate, or any location read directly from an instrument or observed). The content linked by this identifier is normally verbatim location information. This identifier cannot contain references to location literals.

:Georeference is an object type that indicates the associated identifier is a geo-referenced location. This is the only object type that can reference spatial representations as literals. We need an update from the DwC folks on the status of Georeference becoming its own DwC class.

:hasGeoreference is a predicate that joins an identifier of type GeoreferenceID to a literal spatial representation. :hasGeoreference can have the following subProperties:
  • :hasMGRSGeoreference 
  • :hasSpatialThingGeoreference 
  • :hasWKTGeoreference 
 Example N3 file
# Prefix declarations added so the file parses; the default namespace is a placeholder.
@prefix : <http://example.org/biscicol#> .
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .

:ObjectX a dwc:Event .
:ObjectX dwc:DateLastModified "2011-01-01" .

:ObjectY a dwc:Location .
:ObjectY dwc:DateLastModified "2011-01-02" .

:ObjectZ1 a dwc:Georeference .
:ObjectZ1 :hasSpatialThingGeoreference "48.198634,16.371648;crs=wgs84;u=40" .
:ObjectZ1 dwc:DateLastModified "2011-01-03" .

:ObjectZ2 a dwc:Georeference .
:ObjectZ2 :hasMGRSGeoreference "4QFJ1234567890" .
:ObjectZ2 dwc:DateLastModified "2011-01-03" .

:ObjectX :relatedTo :ObjectY .
:ObjectY :relatedTo :ObjectZ1 .
:ObjectY :relatedTo :ObjectZ2 .

Java Code 
The Georeference class implements the Location interface. The SpatialThingGeoreference, MGRSGeoreference, and WKTGeoreference classes extend the Georeference class. BSCObject returns a georeference through its getGeoreference method.
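
The class layout described here might look roughly like the following. Only the type names and getGeoreference come from the description; every constructor, field, and other method is an assumption made for illustration, as is the reading of the SpatialThing literal as "latitude,longitude" followed by ";key=value" parameters.

```java
/** Anything that identifies a place; the getId member is an assumption. */
interface Location {
    String getId();
}

/** A georeferencing instance holding its literal spatial representation. */
abstract class Georeference implements Location {
    private final String id;
    private final String literal;

    Georeference(String id, String literal) {
        this.id = id;
        this.literal = literal;
    }

    @Override
    public String getId() {
        return id;
    }

    String getLiteral() {
        return literal;
    }

    /** The :hasGeoreference subproperty this instance would be published under. */
    abstract String predicate();
}

class SpatialThingGeoreference extends Georeference {
    final double latitude;
    final double longitude;

    SpatialThingGeoreference(String id, String literal) {
        super(id, literal);
        // Assumes "lat,lng" comes before any ";key=value" parameters.
        String[] coordinates = literal.split(";")[0].split(",");
        this.latitude = Double.parseDouble(coordinates[0].trim());
        this.longitude = Double.parseDouble(coordinates[1].trim());
    }

    @Override
    String predicate() {
        return ":hasSpatialThingGeoreference";
    }
}

class MGRSGeoreference extends Georeference {
    MGRSGeoreference(String id, String literal) {
        super(id, literal);
    }

    @Override
    String predicate() {
        return ":hasMGRSGeoreference";
    }
}

class WKTGeoreference extends Georeference {
    WKTGeoreference(String id, String literal) {
        super(id, literal);
    }

    @Override
    String predicate() {
        return ":hasWKTGeoreference";
    }
}

/** Hypothetical shape of BSCObject; only getGeoreference is named in the post. */
class BSCObject {
    private final String id;
    private final Georeference georeference;

    BSCObject(String id, Georeference georeference) {
        this.id = id;
        this.georeference = georeference;
    }

    String getId() {
        return id;
    }

    Georeference getGeoreference() {
        return georeference;
    }

    public static void main(String[] args) {
        BSCObject objectY = new BSCObject(":ObjectY",
                new SpatialThingGeoreference(":ObjectZ1", "48.198634,16.371648;crs=wgs84;u=40"));
        System.out.println(objectY.getGeoreference().predicate());
    }
}
```

Since the N3 example attaches two georeferences to one Location, a real BSCObject might need to return a collection; the single return value here simply follows the description above.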

 We’ll meet again as a much larger group on the 19th of October during the TDWG meeting in New Orleans. If you happen to be there and interested in our discussion, do join us and/or stay tuned for more news.