Friday, September 23, 2011

Development meeting at UF, Sept. 12-13, 2011 - A quick summary

Our core development team met again, and although a few key people were missing, there were enough to generate a healthy discussion and set the course for more progress on BiSciCol development. The following people attended: Reed Beaman (UF), Nico Cellinese (UF), Brian Stucky (University of Colorado, Boulder), John Deck (UC Berkeley), Bryan Heidorn (UA), Tom Orrell (SI), Kate Rachwal (UF), Russell Watkins (UF), Lukasz Ziemba (UF). Overall, we discussed a number of topics, including user scenarios and potential queries, output of query results, coding requirements, how to represent geographic information, how to handle taxonomic names, tasks to be completed, and the timetable. Consistent threads throughout these discussions were the form of BiSciCol, from a structural and user interface perspective, and its role with respect to data providers (clients), processing, and service.

On Monday the 12th, we reviewed the user scenarios we generated in the past and discussed potential queries derived from the workflows presented in these scenarios. It quickly became clear that there was a convergence of common queries across all scenarios. Potential query types are listed below, each followed by a brief example in parentheses. It should be noted that the data schema is based on transitive relations between objects, including “relatedTo” and “sameAs” inferences between objects. Also, filters can be applied to any Object ID with respect to: date, source database, and level or depth of relatedness. These filters can be applied in combination within a single query (see below). We will have to spend more time thinking about scalability and how to handle very large and/or complex queries.

Enumeration of Query Types 

Relations = The graph of transitive relations attached to a particular object, including “sameAs” inferences between this object and all related objects.

DateFilters = All queries below can be modified with a date filter for any Object ID.

Source DB Filters = Ability to filter on source database.

 Potential Query Types:

 1. WHERE Object=X, return relations (Given a specimen ID, return eventID, tissueIDs, etc… that are related)

2. WHERE Object=X, return List of Objects that are "sameAs" (From Use Case #4: Given a collector's specimen ObjectID, return internal IDs and other institution IDs that refer to this same object)

 3. WHERE Object=X, and has duplicate non-matching literals (A GUID has two different taxonomic names "Jim" and "Fred")

 Filtering Queries (on rdf:Type):

Recursive filters can be applied to derived graphs (returned output).

We need to support multiple types of filters, which means (1) multiple filters at one level of relatedness (combined with AND/OR/NOT) and/or (2) multiple filters across several levels of relatedness.

1. Given an Object, return all related objects that are a certain rdf:type (e.g. Show all images that are attached to a Specimen ID).

2. Given an Object, return related objects that are a certain rdf:type AND related objects that are another rdf:type (e.g. given an EventID, only search for specimens with identifications, so we don't also include tissues, extractions, or sequences with identifications).
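To make the depth and rdf:type filters concrete, here is a minimal, self-contained Java sketch of a type-filtered, depth-limited traversal over relatedTo edges. All identifiers, types, and method names are made-up illustrations, not the actual BiSciCol code:

```java
import java.util.*;

// Toy in-memory graph of relatedTo edges plus an rdf:type map, sketching
// "given an Object, return all related objects of a certain rdf:type" with
// a depth (level-of-relatedness) filter applied during traversal.
public class RelatedToQuery {
    static Map<String, List<String>> relatedTo = new HashMap<>();
    static Map<String, String> rdfType = new HashMap<>();

    static void relate(String a, String b) {
        relatedTo.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        relatedTo.computeIfAbsent(b, k -> new ArrayList<>()).add(a); // treat relatedTo as symmetric
    }

    // Breadth-first traversal up to maxDepth, keeping only objects of the given type.
    static Set<String> related(String start, String type, int maxDepth) {
        Set<String> visited = new HashSet<>(List.of(start));
        Set<String> hits = new LinkedHashSet<>();
        List<String> frontier = List.of(start);
        for (int depth = 0; depth < maxDepth; depth++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier)
                for (String nbr : relatedTo.getOrDefault(node, List.of()))
                    if (visited.add(nbr)) {
                        next.add(nbr);
                        if (type.equals(rdfType.get(nbr))) hits.add(nbr);
                    }
            frontier = next;
        }
        return hits;
    }

    public static void main(String[] args) {
        rdfType.put("event1", "dwc:Event");
        rdfType.put("spec12345", "dwc:Specimen");
        rdfType.put("img99", "bsc:Image");
        relate("event1", "spec12345");
        relate("spec12345", "img99");
        // Images reachable within 2 levels of the event:
        System.out.println(related("event1", "bsc:Image", 2)); // prints [img99]
    }
}
```

Combining several type filters at one level (AND/OR/NOT) would simply mean testing each neighbor against a set of predicates instead of a single type string.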

A consideration that arose from the query conversation was database updates and tracking of changes. A discrete timetable schedule for database updates needs to be determined. The mechanism for updates will involve applying an inferencing process to data delivered from partners to create new relations tables. Currently, we think that the process will proceed as follows:

1. Clients publish data to, for example, a Virtuoso database (perhaps via GBIF’s IPT) into (a) a client-specific relations table and (b) a client-specific sameAs table

2. BiSciCol will periodically re-infer all relations (all previously submitted data is still accessible)

3. BiSciCol publishes the results of user queries periodically (e.g. through RSS)

Regarding real-time incorporation of changes to individual object records: we think this is easily done for related objects (e.g. by inserting into a newRelationsTable), but we need more discussion on completing the inferencing process (1x / month) for sameAs relations.
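Since sameAs is an equivalence relation, the periodic re-inference step in (2) can be sketched as collapsing client-supplied sameAs pairs with a union-find structure, so every member of a group resolves to one canonical ID. The IDs below are made-up placeholders and the implementation is only a sketch of the idea, not the planned inferencing engine:

```java
import java.util.*;

// Union-find over identifiers: each sameAs(a, b) merges the groups of a and b,
// and find(id) returns the group's canonical representative.
public class SameAsInference {
    static Map<String, String> parent = new HashMap<>();

    static String find(String id) {
        parent.putIfAbsent(id, id);
        String p = parent.get(id);
        if (!p.equals(id)) {
            p = find(p);          // path compression keeps lookups fast
            parent.put(id, p);
        }
        return p;
    }

    static void sameAs(String a, String b) { parent.put(find(a), find(b)); }

    public static void main(String[] args) {
        // Pairs as they might arrive from two client sameAs tables:
        sameAs("mvz:12345", "gbif:987");
        sameAs("gbif:987", "si:abc");
        // After inference, all three resolve to the same canonical ID:
        System.out.println(find("mvz:12345").equals(find("si:abc"))); // prints true
    }
}
```

Re-running this over the full set of submitted pairs each cycle is what makes previously submitted data stay accessible: nothing is deleted, only re-grouped.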

So, having stated all of the above we still need to: 

- Think about complex queries and scalability
- Express “Darwin-Triples” (institution code, collection code, catalog #) as a GUID, e.g. http://mydarwincore.org/resolver/ic=MVZ;cc=VERTS;cn=12345 OR an LSID --- this is deferred to a later implementation, probably the HERBIS test case
- Update the test case to include multiple literals for the same object
- Store pre-inferred relations as a separate table?
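A minimal sketch of the “Darwin-Triples” item: building a resolver-style GUID from the three codes. The resolver host and key names (ic, cc, cn) come straight from the example URL in the notes; the URL-encoding of the parts is our own assumption:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Express a Darwin triplet (institution code, collection code, catalog number)
// as a single resolvable GUID string, following the pattern in the notes.
public class DarwinTriplet {
    static String toGuid(String institutionCode, String collectionCode, String catalogNumber) {
        return "http://mydarwincore.org/resolver/"
                + "ic=" + enc(institutionCode)
                + ";cc=" + enc(collectionCode)
                + ";cn=" + enc(catalogNumber);
    }

    // URL-encode each part so codes containing spaces or punctuation stay valid.
    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toGuid("MVZ", "VERTS", "12345"));
        // prints http://mydarwincore.org/resolver/ic=MVZ;cc=VERTS;cn=12345
    }
}
```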

We then discussed various forms of output of query results (see the summary listed below). Suggestions included some standard forms of visualization such as summary statistics, charting (e.g. bar, pie, etc.), browsable lists, sortable HTML tables, a timeline of object updates and changes, and mapping of geolocated objects. Other ideas included RDF triples, various JSON outputs, some sort of standards compliance measure(s), and incremental levels (e.g. listing results to a specified depth or level of sameAs or relatedTo relations). One interesting idea was to provide a means to evaluate the completeness and quality of collector efforts and databases. Two suggested approaches were user vetting, with Facebook-like “up or down” voting buttons, or a possibly more objective ranking algorithm (e.g. Google’s PageRank). For now, we are going to provide outputs in the form of tables, hierarchical (tree view), maps, and by rdf:Type. However, we agreed to seek out other interesting and innovative forms of visualizing results.

0. Pie charts, statistics on data that is returned (e.g. Sencha)
1. RDF Output (integrate w/ freebase)
2. JSON Output
3. Standards Compliance Measure (e.g. MIMARKS score in percent)
4. Sortable HTML Table
5. Summarize results by rdf:Type (either a modified HTML Table or a modified jsTree; e.g. see http://library.conservefloridawater.org/WCC?act=view&oid=11964077&lng=1)
6. Search for interesting visualizations
7. Service renders incremental portions of graphs (e.g. 1 transitive level deep)
8. Timeline – what has changed over a period (graph of date last modified)
9. Browse lists: institutions, collectors, etc …
10. Mapping results
11. Vote up or down on datasets or collectors (e.g. see http://answers.semanticweb.com/questions/1581/whats-missing-from-rdf-databases) - for example, Facebook-style like buttons for user vetting, or a Google PageRank-style objective algorithm applied to output methods

Tuesday morning was largely occupied with technical coding discussions. Specific tasks were identified and prioritized relative to project milestones (see summary below). This included user interface development, documentation of code, REST service tuning, data ingestion filters, and RDF-specific enhancements such as indexing and inferencing. One major component identified for development has been dubbed “TriSciCol”, which establishes and expresses the triple-store object relations from client-supplied databases. Details on the design and functionality of this component will be circulated soon.

Focus areas

Lukasz Ziemba (UF) User Interface Development
Brian Stucky (CU) REST Service / Tweaks
Brian Stucky (CU) Code-Base / Documentation of Code / Unit Testing
John Deck Data Ingest (GBIF data?, Cam Web)
Deck/Stucky Data Indexing / RDF Structure / Inferencing Engines / Reasoners
Bryan Heidorn (UA) HERBIS Use Case / Arctos Integration
Orrell (SI), Deck, Cellinese (UF) TriSciCol – flesh out specifications and submit for review by BiSciCol collaborators

 How to store and represent geographic information in BiSciCol 

“Location” is a reference to the verbatim location information recorded by the original collector. This GUID points to the Darwin Core Location class, except for the parts of that class that refer to a georeference. The “Georeference” class expresses a georeferencing instance that is related to a “Location”. This allows us to express multiple georeferencing instances for an individual “Location” and track DateLastModified dates for georeferencing instances atomically.

 The result is that “Locations” are never mapped by BiSciCol, only “Georeferences” are. In this way, we can define types of “Georeferences” that correspond to the manner in which they will be mapped or utilized in consuming applications. E.g. geo:lat/geo:lng, SpatialThing, MGRS, etc.

A note for consideration here: georeference terms are currently embedded in the Location class in Darwin Core, and we will probably want to make a recommendation to move the georeferencing terms into their own class.

 Terms

dwc:Location is an object type that indicates the associated identifier is a location (e.g. a text description, a coordinate, any direct reading from an instrument, or an observation). The content linked by this identifier is normally verbatim location information. This identifier cannot contain references to location literals.

:Georeference is an object type that indicates the associated identifier is a georeferenced location. This is the only object type that can reference spatial representations as literals. (We still need an update from the DwC folks on the status of Georeference becoming its own DwC class.)

:hasGeoreference is a predicate that joins an identifier of type GeoreferenceID to a literal spatial representation. :hasGeoreference can have the following subProperties:
  • :hasMGRSGeoreference 
  • :hasSpatialThingGeoreference 
  • :hasWKTGeoreference 
Example N3 file

:ObjectX a dwc:Event .
:ObjectX dwc:DateLastModified "2011-01-01" .

:ObjectY a dwc:Location .
:ObjectY dwc:DateLastModified "2011-01-02" .

:ObjectZ1 a dwc:Georeference .
:ObjectZ1 :hasSpatialThingGeoreference "48.198634,16.371648;crs=wgs84;u=40" .
:ObjectZ1 dwc:DateLastModified "2011-01-03" .

:ObjectZ2 a dwc:Georeference .
:ObjectZ2 :hasMGRSGeoreference "4QFJ1234567890" .
:ObjectZ2 dwc:DateLastModified "2011-01-03" .

:ObjectX :relatedTo :ObjectY .
:ObjectY :relatedTo :ObjectZ1 .
:ObjectY :relatedTo :ObjectZ2 .

Java Code 
The Georeference class implements the Location interface. The SpatialThingGeoreference, MGRSGeoreference, and WKTGeoreference classes extend the Georeference class. A georeference is returned by BSCObject via the getGeoreference method.
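A skeletal rendering of that class structure might look as follows. The class and method names match the description above; the predicate() method, the stored literal, and the literal formats are illustrative guesses, not the real implementation:

```java
// Sketch of the Georeference hierarchy: Georeference implements Location,
// three concrete subclasses carry format-specific spatial literals, and
// BSCObject exposes its georeference via getGeoreference().
public class GeoSketch {
    interface Location { }

    static abstract class Georeference implements Location {
        final String literal;            // the spatial representation as a literal
        Georeference(String literal) { this.literal = literal; }
        abstract String predicate();     // which :hasGeoreference subproperty applies
    }

    static class SpatialThingGeoreference extends Georeference {
        SpatialThingGeoreference(String literal) { super(literal); }
        String predicate() { return ":hasSpatialThingGeoreference"; }
    }

    static class MGRSGeoreference extends Georeference {
        MGRSGeoreference(String literal) { super(literal); }
        String predicate() { return ":hasMGRSGeoreference"; }
    }

    static class WKTGeoreference extends Georeference {
        WKTGeoreference(String literal) { super(literal); }
        String predicate() { return ":hasWKTGeoreference"; }
    }

    static class BSCObject {
        private final Georeference georeference;
        BSCObject(Georeference g) { this.georeference = g; }
        Georeference getGeoreference() { return georeference; }
    }

    public static void main(String[] args) {
        BSCObject obj = new BSCObject(new MGRSGeoreference("4QFJ1234567890"));
        System.out.println(obj.getGeoreference().predicate()); // prints :hasMGRSGeoreference
    }
}
```

Keeping Georeference abstract and type-specific subclasses concrete mirrors the N3 design: only georeferences carry spatial literals, and each subclass knows which :hasGeoreference subproperty it serializes to.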

 We’ll meet again as a much larger group on the 19th of October during the TDWG meeting in New Orleans. If you happen to be there and interested in our discussion, do join us and/or stay tuned for more news.

Thursday, July 21, 2011

Tagging use case 585

Tagging use case 585 requires semantic linking to the taxonomy of the shark, its geospatial track and place names, and likely a tracking-methods ontology.

Saturday, June 18, 2011

BiSciCol core software architecture


This simplified UML diagram illustrates the current architecture of the core BiSciCol classes. These classes are responsible for most of the lower-level BiSciCol functionality: interacting with data, working with queries, traversing the BiSciCol data structures, and converting BiSciCol data to other representations, such as XML. Note that this diagram does not depict all relationships or dependencies among the classes. Instead, relationships most important for a general understanding of the code are included, such as inheritance, interface implementation, and aggregation. Further, most private class methods and members are excluded for simplicity.

In general, we've been working to develop an architecture characterized by classes with clearly-defined responsibilities and a high degree of flexibility. We've also tried to develop objects that map naturally to the BiSciCol problem domain. For instance, both BiSciCol data objects and models are represented by corresponding abstract data types. Furthermore, the code that implements BiSciCol objects and data operations is completely separated from the code responsible for converting BiSciCol data to other formats, such as XML.

This is simply a snapshot of the state of the code at this time. Some of the classes illustrated above are incomplete and/or will almost certainly be redesigned. Collaborations between classes will likely change as well. Nevertheless, the diagram shows not only progress we've made, but also many of the design ideas we've been working with.

Thursday, June 2, 2011

BioSciCol VertNet Integration Meeting

Recent big news is the funding of VertNet! Given this, some of us got together to discuss ways we can work together towards some common goals.

Location: “Jupiter Cafe”
John Deck (Kolsch)
Aaron Steele (Jupiter Red)
John Wieczorek (Hot Chocolate)

VertNet will publish to PubSubHubbub. BiSciCol can subscribe to the hub and receive updates on what is going on.

JohnD:
- VN can encourage, promote, implement use of GUIDs
- Links to GUIDs of annotation references (of all kinds, such as data-quality annotations, e.g. creation of a WGS84 decimal lat/lng instance from DMS data)
- BiSciCol can help VertNet in augmenting search (relationship graph search)

JohnW:
- BiSciCol is base study for annotation use cases (query on annotation relations?)
- VertNet won’t look at relationships between objects, so BiSciCol could be FTW
- Recommends Twitter dialog between BiSciCol and VN

Aaron:
- Annotations
- PubSubHubbub
- Twitter dialog

More to come. If 2 of the attendees can figure out how to use (and if they really want to use) twitter we can stay plugged in that way (per JohnW's suggestion). Otherwise, we'll be forced to meet over beers and use our speech.

Thursday, April 14, 2011

Saturday, March 26, 2011

Development Update

Brian Stucky and I made a first pass at an online prototype back in mid-February. This prototype featured a text-file triple-store back-end, a bit of JavaScript, a cool modeling widget, and some JSP glue.

Since then, I've been working on a few things to make this prototype more solid:

1) Implementing a database back-end for the triple-store. I've been working with Virtuoso under an evaluation license and have found it fairly easy to work with. I would prefer a free or more open-source solution but it seems Virtuoso is the best product out there for our needs.

2) Building queries on REST services. Not wanting to make the REST services an afterthought to our architecture, I am building them directly into our web prototype.

3) Re-building SPARQL queries to be more portable, scalable, and faster. Specifically, the transitive closure functions were previously built using specialized syntax not portable to larger systems. The new syntax is more portable and will be a lot faster. For example:

SELECT ?s ?dist
WHERE
{
  {
    SELECT ?s ?o
    WHERE
    {
      ?s bsc:relatedTo ?o .
    }
  } OPTION (TRANSITIVE, t_distinct, t_in(?s), t_out(?o), t_min (1), t_max (20), t_step ('step_no') as ?dist) .
  FILTER (?o = dwc:spec12345)
}

Hoping to get this new prototype online in the next week or two.

Thursday, March 3, 2011

Preliminary Implementation Diagram

An indication of where we are headed with our architecture/implementation. This is a draft form and comments are welcome!

Friday, February 25, 2011

BiSciCol Technical Meeting: University of Florida, 23-24 February 2011


The BiSciCol Technical team had a successful brainstorming meeting over two warm Florida days.

Attendees:
Nico Cellinese (UF), Kate Rachwal (UF), Reed Beaman (UF), Gustav Paulay (UF), Russ Watkins (UF), John Deck (Berkeley), Brian Stucky (University of Colorado), Rich Pyle (remotely, Bishop Museum) and our very special guest Steve Baskauf (Vanderbilt University).

The main goal of the meeting was to define the BiSciCol scope, technical goals, and the model. Happy to report we made great progress. We are now ready to move to an implementation phase and are confident we will be able to release a prototype in July 2011. See below a snapshot of our design document.

Purpose of the System
1. Notify all objects that are related to Object X that Object X has changed.
2. Manage relationship definitions between objects that exist in distributed databases.
3. Manage annotations of objects where annotation is treated as an Object relationship.

Technical Goals of the System
1. scalable design
2. easy coding but must be robust
3. expandable to different domains

We envision this as a distributed network with defined, structured content. The goal is to allow for multiple data sources, each with its own thematic content. The thematic content is built on defining relationships between objects.

The Model
Linkage between objects is unstructured, which allows any object to be tied to any other object via the predicate “relatedTo”. This is similar to SKOS “related”, but we have made relatedTo transitive. relatedTo does not require objects to be organized hierarchically.

Following are the triples recognized in the network:

:objectId :relatedTo :objectId
:objectId rdf:type :objectType

:relatedTo :hasProperty :propertyDefinitionURI
:objectId dwc:dateLastModified “YYYY-MM-DD”
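As a sketch of how these four patterns might be enforced on ingest, here is a toy validator in plain Java (no RDF library). The pattern checks are our loose reading of the list above, with a leading “:” standing in for any identifier in the default namespace:

```java
// Check an incoming (subject, predicate, object) statement against the four
// triple patterns the network recognizes; anything else is rejected.
public class TriplePatterns {
    static boolean isRecognized(String s, String p, String o) {
        boolean sIsObjectId = s.startsWith(":");
        // :objectId :relatedTo :objectId
        if (sIsObjectId && p.equals(":relatedTo") && o.startsWith(":")) return true;
        // :objectId rdf:type :objectType
        if (sIsObjectId && p.equals("rdf:type")) return true;
        // :relatedTo :hasProperty :propertyDefinitionURI
        if (s.equals(":relatedTo") && p.equals(":hasProperty")) return true;
        // :objectId dwc:dateLastModified "YYYY-MM-DD"
        if (sIsObjectId && p.equals("dwc:dateLastModified")
                && o.matches("\"\\d{4}-\\d{2}-\\d{2}\"")) return true;
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isRecognized(":spec12345", ":relatedTo", ":event1"));                  // prints true
        System.out.println(isRecognized(":spec12345", "dwc:dateLastModified", "\"2011-01-01\"")); // prints true
        System.out.println(isRecognized(":spec12345", "dwc:scientificName", "\"Jim\""));          // prints false
    }
}
```

The last call shows the point of keeping the pattern list closed: arbitrary literal-valued predicates are not part of the model, so they would be filtered out rather than stored.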

Monday, February 14, 2011

News and updates

Since our last meeting in Washington, DC we worked out some of the object relationships and how we want to formalize concept/objects such as individuals, specimens, vouchers etc. for the purposes of our network. We also agreed to consider implementing RDF in our design. To this end, a number of people have since purchased and read (hopefully past the introduction) the book "Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL". Finally, we agreed to deliver a working prototype of the BiSciCol network in July 2011, implementing data from Biocode, CalPhotos, and UF.

During January and the first part of February, John Deck (Berkeley), Rob Guralnick (Colorado U.), and Brian Stucky (Colorado U.) have been crafting a technical implementation plan for the BiSciCol prototype. CU has offered a server and Tomcat instance for our prototypes, and Brian has been implementing that along with getting the web interface going. John has been working on a model that ties together collecting events, specimens, tissues, DNA extractions, and photographs (more object types later), while incorporating location and modification date. Taxonomy is notably absent for now, and we will soon be working more closely with Rich Pyle and Rob Whitton (Bishop Museum). The codebase is Java, and the inferencing/RDF work is being implemented in Jena/ARQ (open-source Java).

On the 23rd and 24th of February, John and Brian will be at UF for a technical meeting with Nico Cellinese, Kate Rachwal, Russ Watkins, and special guest Steve Baskauf (Vanderbilt University) to review the UF implementation and proposed model. We will also work remotely with Rich Pyle and Rob Whitton. More details on this meeting will be posted soon. Given our prototype deadline of July 2011 we have a lot of work ahead of us, especially in integrating the taxonomic names components, but we feel we are on target with progress to date.

Monday, January 31, 2011

16-17 December 2010 - Meeting outcome

Our December meeting brought us all together, and we were fortunate to have Dave Vieglais from DataONE and Bob Morris from FilteredPush (a.k.a. Push-me-pull-you). We reviewed use cases and case scenarios, and discussed object relationships and ways to model them.


We are making progress with designing a technical architecture, and on February 23-25, 2011 John Deck (Berkeley), Brian Stucky (U. Colorado), Kate Rachwal (UF), Russ Watkins (UF), Nico Cellinese (UF), and our guest Steve Baskauf (Vanderbilt) will meet at the Florida Museum of Natural History to refine solutions to some of the impending technical issues.


We created a number of subgroups that will be in charge of developing specific aspects of the project:


1. Taxonomic Names (Rich Pyle [chair], Gustav Paulay, Tom Orrell, Chris Meyer, Nico Cellinese, John Deck, Jonathan Coddington, Rob Whitton).
2. Ontology  (Nico Cellinese [chair], Bob Morris, Kate Rachwal, John Deck, Jonathan Coddington, Rich Pyle).
3. Geospatial (Rob Guralnick [chair], Reed Beaman, Russ Watkins, Kate Rachwal, John Deck).
4. Technical Architecture (John Deck [chair], Rob Whitton, Dave Vieglais, Kate Rachwal, Rob Guralnick, Russ Watkins, Tom Orrell, Bryan Heidorn, Jonathan Coddington).
5. Domain Scientists (Chris Meyer [chair], Nico Cellinese, Gustav Paulay, George Roderick, Neil Davies, Rob Guralnick, John Deck, Tom Orrell, Jonathan Coddington, Reed Beaman). Improve on test case list.
6. Sustainability Group (Neil Davies [chair], Rob Guralnick, Reed Beaman, Chris Meyer).
Reed Beaman, Rob Guralnick, and Russ Watkins are compiling a conceptual plan that will include specifics on who is doing what, timeline, and priorities. We all agreed to deploy a prototype in July 2011 for testing by the community.

More later.....