Friday, January 30, 2015

While reading about how SPARQL computes its solutions, I realized that we can get much better efficiency by choosing when and how to do joins based on our own needs, rather than letting RDFLib's SPARQL plugin handle it. I think this change in thinking took me so long because I'm used to working in relational databases with static queries and a set number of joins. I've started writing some demo programs that build paths in the RDF graph to query. The solution uses generators, so porting to languages that do not support them, like Java, will require a slightly different approach that breaks up the code in the generators. Happily, other facilities, like being able to iterate over an RDF graph given a triple pattern, are available in Java through Jena with listStatements and in C through Redland with librdf_model_find_statements.
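As a rough sketch of the idea (using a plain in-memory set of triples in place of an rdflib Graph, and made-up neuron names), a chain of generators lets each step of the path constrain the next, so the join order stays under our control:

```python
# A tiny in-memory triple store standing in for an RDF graph.
# triples() mimics the triple-pattern matching that rdflib's
# Graph.triples(), Jena's listStatements, and Redland's
# librdf_model_find_statements all provide; None acts as a wildcard.
def triples(store, pattern):
    s, p, o = pattern
    for t in store:
        if ((s is None or t[0] == s) and
                (p is None or t[1] == p) and
                (o is None or t[2] == o)):
            yield t

def follow(store, subjects, predicate):
    """One join step as a generator: yield every object reachable
    from any of the given subjects via the given predicate."""
    for subj in subjects:
        for _, _, obj in triples(store, (subj, predicate, None)):
            yield obj

store = {
    ("AVAL", "connection", "AVAR"),
    ("AVAR", "neurotransmitter", "glutamate"),
}

# A two-step path: chain the generators so that each step only ever
# sees the bindings produced by the previous one.
step1 = follow(store, ["AVAL"], "connection")
result = list(follow(store, step1, "neurotransmitter"))
print(result)  # ['glutamate']
```

Nothing runs until the final list() forces the chain, so intermediate results are never materialized in full.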

Thursday, January 29, 2015

Reading

I am currently doing some reading on how SPARQL queries are evaluated, in order to reduce the inefficiency of complicated queries created by PyOpenWorm.

Thursday, January 22, 2015

PyOpenWorm R1

We (myself, Stephen Larson, Anton Shekhov, Travis Jacobs) have continued working on PyOpenWorm to get it release-ready. The work has focused on documenting previously existing work, adding data-integrity tests, and including data in the PyOpenWorm database from existing spreadsheets and the sqlite3 database contributed by Timothy Busbice.

Tuesday, September 16, 2014

End of GSOC

We've reached the end of Google Summer of Code (it was weeks ago now, but I'm writing this blog post anyway).

Although I've written a lot of code and made a lot of plans, I can't really back off of the project until I see it being used by people (outside and inside of OpenWorm). When I came into the project, my long-term goal was an "annotated simulation": a system for adding annotations to data, carrying them through to model files, and finally exposing the annotations to end-users when they inspect simulation results. This is to answer our long-standing criticism that, in some way or another, our simulations don't line up with the actual science. (Similar concerns are addressed from a computational perspective in this thread.) Over this past summer, I've gotten the first step to a place where I almost like it, but the second step has stalled significantly, and the third was never started.

Monday, August 18, 2014

Sunday, August 3, 2014

As I've been using PyOpenWorm, I've encountered some cracks in the design. The primary issue is the interaction between configuration, object properties, and database set-up. Some examples of problems that have resulted:
  1. Run-time configuration has been set up such that each object has to receive its configuration anew from the object that creates it, or else use a globally set default configuration. This is a very brittle framework that falls apart any time the appropriate configuration is not passed to Configureable objects contained within other Configureable objects, which can result in configuration values hanging around between connect() and disconnect(). This isn't a big deal for simple one-off scripts, but it can be annoying when testing or when designing more complex scripts.
  2. Registration of a Property doesn't take place until the first time its owner is initialized. This prevents us from resolving a Property to an object directly from the graph without having made one of its owners first. It also results in properties which are identical having different names in sub-classes, due to the dynamic nature of property-name creation.
  3. Each DataObject sub-class has to be registered as a separate step from defining the class. This is a minor annoyance when creating new classes.
  4. It's necessary either to know the structure of generated namespace URIs (e.g., http://openworm.org/entities/Neuron/) and recreate them, or to create an object of the desired type when referencing objects in another namespace. The first approach is sub-optimal because we might have reason to change the structure of the URIs in the future, which is already a problem for some auxiliary methods designed to extract information from a URI. The second approach is also problematic since it requires the creation of objects which are never used otherwise, creating overhead for the garbage collector as well as the programmer.
My proposed solution to these problems is to push the configuration and class-dependent initialization to the class-definition phase through Python class decorators, which run when the library is loaded. This takes care of the fourth point by making all of the namespaces and namespace structure static by the time any object is created.
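A minimal sketch of the decorator approach, with a hypothetical register decorator, REGISTRY dict, and base URI (none of these names are from the actual PyOpenWorm code):

```python
from urllib.parse import quote

# Assumed base URI for generated namespaces, for illustration only.
BASE = "http://openworm.org/entities/"

# Hypothetical registry mapping class name -> (class, namespace URI).
REGISTRY = {}

def register(cls):
    """Class decorator: runs at class-definition time, when the module
    is loaded, so every class's namespace exists before any instance
    of any class is created."""
    namespace = BASE + quote(cls.__name__) + "/"
    REGISTRY[cls.__name__] = (cls, namespace)
    cls.rdf_namespace = namespace
    return cls

@register
class Neuron:
    pass

print(Neuron.rdf_namespace)   # http://openworm.org/entities/Neuron/
print("Neuron" in REGISTRY)   # True
```

Because the decorator fires as a side effect of the class statement itself, registration can never be forgotten as a separate step, and namespace URIs can be resolved from the registry without constructing throwaway objects.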

The third point is covered by folding the current registration into the class decoration.

The second point is covered by attaching each Property subclass to the owner's class, while retaining the initialization in the __init__() call for access from owner instances.
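A toy sketch of what that class-level attachment might look like, with a hypothetical Property base class and register_property helper standing in for the real machinery:

```python
class Property:
    """Stand-in for PyOpenWorm's Property: holds per-owner values."""
    def __init__(self, name):
        self.name = name
        self.values = []

def register_property(owner_cls, name):
    """Create a named Property subclass and attach it to the owner's
    *class*, so the subclass is resolvable without any owner instance."""
    prop_cls = type(name, (Property,), {"owner_type": owner_cls})
    setattr(owner_cls, name, prop_cls)
    return prop_cls

class Neuron:
    def __init__(self):
        # Per-instance initialization stays in __init__: the instance
        # attribute shadows the class-level Property subclass.
        self.type = Neuron.type("type")

register_property(Neuron, "type")

n = Neuron()
n.type.values.append("interneuron")
```

The subclass lives on Neuron itself (and keeps one stable name there), while each Neuron instance still gets its own value-holding Property object.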

Finally, the first point can be addressed without the proposed solution by not passing configuration through __init__. Initially I conceived of a tree structure for configuration, which could be augmented by internal nodes and passed on to children with specific configuration needs that didn't need to be communicated to higher-level users. This structure is exemplified by Data, which is Configureable but also a Configure object suitable for configuring other objects. Unfortunately, this is the only place where the feature gets any use, and it isn't required for passing augmented configuration to the module as a whole, so it can simply be removed.

I hope to address some of these issues soon after outstanding issues are closed and features for the next release are well-defined.

UPDATE: Most of these issues have been addressed in the new-classes branch, which has not yet been merged into master.

Sunday, July 20, 2014

It seems I have more to learn about rdflib's SPARQL query engine. A persistent issue was the misuse of indexing for what (I thought) should be simple queries. My error was including triples of the form ?x rdf:type openworm:type in the query patterns. For reasons I don't yet understand, the engine evaluates such statements before others, even when there's a more selective join it could perform from the other patterns. Hopefully I can understand this eventually.

In any case, standard object->property->value queries are effectively O(1) time again--yay!