Sunday, July 20, 2014

It seems I have more to learn about rdflib's SPARQL query engine. A persistent issue was the misuse of indexes for what (I thought) should be simple queries. My error was including triples of the form ?x rdf:type openworm:type in the query patterns. For reasons I don't yet understand, the engine evaluates such patterns before the others, even when joining on other patterns first would narrow the results much more. Hopefully I can understand this eventually.
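To see why the order matters, here is a toy sketch in plain Python. It is not rdflib's evaluator, and the identifiers and the naive left-to-right join are made up for illustration. It only counts how many candidate bindings survive each step when the rdf:type pattern is evaluated first versus last.

```python
# Toy illustration only: the data and the naive evaluator are made up.

def build_triples(n):
    """n neurons, each with one typed name-property node."""
    triples = []
    for i in range(n):
        neuron, prop = "neuron%d" % i, "name_prop%d" % i
        triples.append((neuron, "neuron:name", prop))
        triples.append((prop, "rdf:type", "openworm:Neuron_name"))
        triples.append((prop, "sp:value", "NAME%d" % i))
    return triples


def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):          # variable: bind it or check it
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:               # constant: must match exactly
            return None
    return extended


def evaluate(patterns, triples):
    """Join patterns left to right; report bindings alive after each step."""
    bindings, sizes = [{}], []
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            for extended in [match(pattern, triple, binding)]
            if extended is not None
        ]
        sizes.append(len(bindings))
    return bindings, sizes


TYPE_FIRST = [
    ("?p", "rdf:type", "openworm:Neuron_name"),
    ("?p", "sp:value", "NAME7"),
    ("?n", "neuron:name", "?p"),
]
VALUE_FIRST = [TYPE_FIRST[1], TYPE_FIRST[2], TYPE_FIRST[0]]

triples = build_triples(200)
for label, patterns in (("type first", TYPE_FIRST), ("value first", VALUE_FIRST)):
    solutions, sizes = evaluate(patterns, triples)
    print(label, sizes, solutions)
```

Both orders give the same single solution, but the type-first order carries 200 intermediate bindings through the first step while the value-first order never carries more than one.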

In any case, standard object->property->value queries are effectively O(1) time again--yay!

Thursday, July 17, 2014

small update:

The issue from the previous post was one of indexing. The Property object identifiers were being generated differently for every Neuron object, which meant queries could make only minimal use of the index. Basing identifiers on the identity of the owner object and the value of the property brings the time to query a single property back to constant. For example:

for x in range(1000):
    Neuron().save() # creates and saves a randomly generated neuron object
Neuron().name()
This takes about 10 seconds for the call to name() on rdflib's memory store, and closer to 11 seconds on the SleepyCat store bundled with rdflib. That's about 0.01 seconds to get the name of a specific neuron, regardless of the database size.
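A minimal sketch of what "identifiers based on the owner and the value" could look like. The function name, the hashing choice and the URI layout are assumptions for illustration, not the actual PyOpenWorm code.

```python
import hashlib

# Hypothetical base URI and naming scheme, for illustration only.
BASE = "http://openworm.org/entities/"


def property_identifier(owner_id, property_name, value):
    """Derive a stable identifier from the owner, the property name and the value."""
    key = "\x00".join((owner_id, property_name, repr(value)))
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return BASE + property_name + "/" + digest


aval = BASE + "Neuron/AVAL"
first = property_identifier(aval, "Neuron_name", "AVAL")
second = property_identifier(aval, "Neuron_name", "AVAL")
print(first == second)  # same owner and value give the same identifier
```

Because the identifier is a pure function of its inputs, the same property on the same object always maps to the same node, so a lookup can go straight to it instead of searching for a randomly named one.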

Tuesday, July 15, 2014

I adjusted the query according to the previous post and also added patterns for the cell the property is actually on (oops):
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix openworm: <http://openworm.org/entities/>
prefix neuron: <http://openworm.org/entities/Neuron/>
prefix sp: <http://openworm.org/entities/SimpleProperty/>
select distinct ?type where
{ 
?Neuron rdf:type openworm:Neuron .

?Neuron neuron:lineageName ?Neuron_lineageName .
?Neuron_lineageName rdf:type openworm:Neuron_lineageName .
?Neuron_lineageName sp:value ?lineageName .

?Neuron neuron:name ?Neuron_name .
?Neuron_name rdf:type openworm:Neuron_name .
?Neuron_name sp:value "AVAL" .

?Neuron neuron:type ?Neuron_type .
?Neuron_type rdf:type openworm:Neuron_type .
?Neuron_type sp:value ?type .

?Neuron neuron:receptor ?Neuron_receptor .
?Neuron_receptor rdf:type openworm:Neuron_receptor .
?Neuron_receptor sp:value ?receptor 
}
For the data stores we've been using, such queries don't seem to execute very efficiently.

I would hope for it to be quick since

  1. there should be exactly one statement matching (a, sp:value, "AVAL") to look up in the 'pos' index,
  2. and then only one matching (b, neuron:name, a) looked up again in the 'osp' index,
  3. and an index lookup on the 'spo' index for the types of 'b' to check it has the correct type should be efficient.
  4. The neuron should then have only one neuron:name,
  5. and one of all of the other properties which can be looked up in the 'spo' index.
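The lookup sequence I have in mind can be sketched with a toy in-memory store. This is plain Python with nested dictionaries standing in for the 'spo', 'pos' and 'osp' indexes; the class, the function and the sample data are made up, and none of it reflects how rdflib or Sesame actually implement their indexes.

```python
from collections import defaultdict


class TinyStore(object):
    """Toy in-memory triple store with 'spo', 'pos' and 'osp' indexes."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)


def neuron_types_by_name(store, name):
    """Follow the numbered steps: value -> property node -> neuron -> type."""
    found = set()
    # Step 1: (a, sp:value, name) through the 'pos' index.
    for a in store.pos.get("sp:value", {}).get(name, ()):
        # Step 2: (b, neuron:name, a) through the 'osp' index.
        for b, predicates in store.osp.get(a, {}).items():
            if "neuron:name" not in predicates:
                continue
            # Step 3: check the type of b through the 'spo' index.
            if "openworm:Neuron" not in store.spo.get(b, {}).get("rdf:type", ()):
                continue
            # Steps 4 and 5: remaining properties of b through 'spo'.
            for type_node in store.spo[b].get("neuron:type", ()):
                found |= store.spo.get(type_node, {}).get("sp:value", set())
    return found


store = TinyStore()
for s, p, o in [
    ("n1", "rdf:type", "openworm:Neuron"),
    ("n1", "neuron:name", "n1_name"),
    ("n1_name", "sp:value", "AVAL"),
    ("n1", "neuron:type", "n1_type"),
    ("n1_type", "sp:value", "interneuron"),
    ("n2", "rdf:type", "openworm:Neuron"),
    ("n2", "neuron:name", "n2_name"),
    ("n2_name", "sp:value", "TOY2"),
    ("n2", "neuron:type", "n2_type"),
    ("n2_type", "sp:value", "sensory"),
]:
    store.add(s, p, o)

print(neuron_types_by_name(store, "AVAL"))
```

Every step is a dictionary lookup keyed on terms that are already known, which is why I'd expect the whole query to be cheap if the engine picked this order.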

For queries against Sesame, I need to try the Java API to confirm these indexes are actually there; otherwise I'm making a lot of assumptions about how well the indexes work. Besides that, I don't have an easy way of seeing how the queries are executed other than measuring the (often very long) query times. Maybe I can get some sense of how these queries are being handled with some test data.

I've started reading a little about ZODB, a Python object storage database. The core of what I've been making for PyOpenWorm is an Object<->RDF mapper. The choice of RDF storage solutions to back PyOpenWorm was based on the work that had already been done, my experience in working with some RDF tools, and an expectation that joining with existing datasets may be easier with RDF than with other storage options. It may be worth re-evaluating that choice going forward.

I've been trying out the KyotoCabinet store for rdflib. Uploading the whole worm under the current parameters takes about 17 minutes.

Timing any of the queries on the full data set is difficult since even 'simple' ones tend to take several minutes. The queries I'm running to get the type ('interneuron', 'sensory', 'motor') look something like this:

select distinct ?type where { ?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/SimpleProperty> .
?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/Property> .
?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/DataObject> .
?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/Neuron_type> .
?Neuron98fc353f03788d3e <http://openworm.org/entities/Neuron/type> ?Neuron_type0056fa3c098f873e .
?Neuron_type0056fa3c098f873e <http://openworm.org/entities/SimpleProperty/value> ?type }

I've adopted a pattern of mirroring the same triples into both the upload query and the select query. The objective is to make it easier to change the data structures without having to update both the upload format and the select with every change. This is why the query above contains four rdf:type patterns: in lieu of RDF inference (which I intend to add later), I want to be able to use the class hierarchy to, for instance, get all of the Neuron objects when querying for Cell objects. To allow this, I insert, for each object, all of its super-types. This lets me query for objects of type Cell and get all those of type Neuron (and Muscle). Every object carries all of its super-type tags in addition to its most specific type, so there's no problem, theoretically, querying for the right object.
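As a rough sketch of the super-type tagging, assuming a Python class hierarchy that mirrors the RDF one: the classes below are bare stand-ins and both helper functions are hypothetical, not PyOpenWorm's API.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NS = "http://openworm.org/entities/"


# Stand-in classes; the real PyOpenWorm classes carry much more than this.
class DataObject(object):
    pass


class Cell(DataObject):
    pass


class Neuron(Cell):
    pass


class Muscle(Cell):
    pass


def type_triples(ident, cls):
    """One rdf:type triple for cls and for each of its super-types."""
    return [
        (ident, RDF_TYPE, NS + c.__name__)
        for c in cls.__mro__
        if c is not object
    ]


def idents_of_type(triples, cls):
    """Identifiers tagged with cls, whether directly or through a subclass."""
    wanted = NS + cls.__name__
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == wanted)


uploaded = type_triples("neuron_1", Neuron) + type_triples("muscle_1", Muscle)
print(idents_of_type(uploaded, Cell))    # both objects carry the Cell tag
print(idents_of_type(uploaded, Neuron))  # only the neuron
```

The cost is a few extra triples per object at upload time, in exchange for super-type queries that need no inference step.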

The main issue is that these queries get executed very slowly. I suspect it's something to do with scanning the table. Interrupting a query (because who has time for that?) we find ourselves here:

  File "build/bdist.linux-x86_64/egg/rdflib/plugins/sparql/evaluate.py", line 47, in evalBGP
  File "build/bdist.linux-x86_64/egg/rdflib/graph.py", line 1373, in triples
  File "build/bdist.linux-x86_64/egg/rdflib_kyotocabinet/KyotoCabinet.py", line 315, in triples
There is an index on predicate-object pairs (the c^p^o^s^ file below), which should make getting the objects of a given type quick:
%ls -lh worm.db
total 193M
-rw-r--r-- 1 markw markw 6.1M Jul 15 13:35 contexts.kch
-rw-r--r-- 1 markw markw  55M Jul 15 13:35 c^o^s^p^.kch
-rw-r--r-- 1 markw markw  55M Jul 15 13:35 c^p^o^s^.kch
-rw-r--r-- 1 markw markw  55M Jul 15 13:52 c^s^p^o^.kch
-rw-r--r-- 1 markw markw  14M Jul 15 13:35 i2k.kch
-rw-r--r-- 1 markw markw  14M Jul 15 13:35 k2i.kch
-rw-r--r-- 1 markw markw 6.1M Jul 15 13:35 namespace.kch
-rw-r--r-- 1 markw markw 6.1M Jul 15 13:35 prefix.kch

I'm thinking, though, that for my query above the database engine is joining the results of the individual index lookups. Adding an option to toggle between the triples generated for querying and those generated for uploading should make these queries more reasonable while still allowing us to get subtypes from a super-type query. Some of the issues I've run into (there are others...) suggest that the shared query/upload pattern may be inappropriate. I'll have to think more about it.
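The toggle could be as small as a flag on whatever generates the rdf:type triples. A minimal sketch, with stand-in classes named after the types in the query above and a hypothetical for_query parameter:

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NS = "http://openworm.org/entities/"


# Stand-ins mirroring the type names that appear in the query above.
class DataObject(object):
    pass


class Property(DataObject):
    pass


class SimpleProperty(Property):
    pass


class Neuron_type(SimpleProperty):
    pass


def type_triples(term, cls, for_query=False):
    """All super-type triples for upload; only the queried type for a select."""
    classes = [c for c in cls.__mro__ if c is not object]
    if for_query:
        classes = classes[:1]  # the class asked for comes first in the MRO
    return [(term, RDF_TYPE, NS + c.__name__) for c in classes]


upload = type_triples("example_node", Neuron_type)
select = type_triples("?prop", SimpleProperty, for_query=True)
print(len(upload), len(select))
# The single select pattern still matches the uploaded Neuron_type node.
print(any(t[1:] == select[0][1:] for t in upload))
```

The upload side keeps writing every super-type tag, so a select with a single rdf:type pattern on a super-type still finds the subtypes, and the query carries one type pattern instead of four.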