Tuesday, July 15, 2014

I've been trying out the KyotoCabinet store for rdflib. Uploading the whole worm under the current parameters takes about 17 minutes.

Timing any of the queries on the full data set is difficult since even 'simple' ones tend to take several minutes. The queries I'm running look something like this one, which gets the neuron type ('interneuron', 'sensory', 'motor'):

select distinct ?type where {
  ?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/SimpleProperty> .
  ?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/Property> .
  ?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/DataObject> .
  ?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/Neuron_type> .
  ?Neuron98fc353f03788d3e <http://openworm.org/entities/Neuron/type> ?Neuron_type0056fa3c098f873e .
  ?Neuron_type0056fa3c098f873e <http://openworm.org/entities/SimpleProperty/value> ?type
}

I've adopted a pattern of generating the same triples for both the upload query and the select query. The objective is to make it easier to change the data structures without having to update the upload format and the select separately with every change. This is why the query above contains four rdf:type patterns on the same variable (the most specific type plus its three super-types): in lieu of RDF inference (which I intend to include later), I want to be able to use the class hierarchy to, for instance, get all of the Neuron objects when querying for Cell objects. To get this, I insert, for each object, an rdf:type triple for each of its super-types. That lets me query for objects of type Cell and get all those of type Neuron (and Muscle). Every object carries all of its super-type tags in addition to its most specific type, so there's no problem, theoretically, querying for the right object.
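
Roughly, the super-type tagging described above could look like the sketch below. It is not the actual project code: the class bodies, the `type_triples` and `instances_of` helpers, and the `Neuron/n1` and `Muscle/m1` identifiers are all made up for illustration, and plain tuples stand in for rdflib triples so the snippet runs on its own.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://openworm.org/entities/"


# Hypothetical stand-ins for the Python class hierarchy.
class DataObject:
    pass


class Cell(DataObject):
    pass


class Neuron(Cell):
    pass


class Muscle(Cell):
    pass


def type_triples(subject, cls):
    """One rdf:type triple for cls and for each of its super-types."""
    return [
        (subject, RDF_TYPE, BASE + c.__name__)
        for c in cls.__mro__
        if c is not object  # skip Python's implicit root class
    ]


def instances_of(triples, cls):
    """Subjects tagged with cls, found by a plain scan over the triples."""
    target = BASE + cls.__name__
    return sorted(s for (s, p, o) in triples if p == RDF_TYPE and o == target)


# "Upload": every object gets its most specific type and all super-types.
store = []
store += type_triples(BASE + "Neuron/n1", Neuron)  # illustrative identifier
store += type_triples(BASE + "Muscle/m1", Muscle)  # illustrative identifier

cells = instances_of(store, Cell)      # both objects, with no inference step
neurons = instances_of(store, Neuron)  # only the neuron
```

The cost is a few extra triples per object at upload time, in exchange for not needing an inference step at query time.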

The main issue is that these queries get executed very slowly. I suspect it's something to do with scanning the table. Interrupting a query (because who has time for that?) we find ourselves here:

  File "build/bdist.linux-x86_64/egg/rdflib/plugins/sparql/evaluate.py", line 47, in evalBGP
  File "build/bdist.linux-x86_64/egg/rdflib/graph.py", line 1373, in triples
  File "build/bdist.linux-x86_64/egg/rdflib_kyotocabinet/KyotoCabinet.py", line 315, in triples

There is an index on predicate-object pairs (presumably the c^p^o^s^.kch file), which should make getting the objects of a given type quick:

%ls -lh worm.db
total 193M
-rw-r--r-- 1 markw markw 6.1M Jul 15 13:35 contexts.kch
-rw-r--r-- 1 markw markw  55M Jul 15 13:35 c^o^s^p^.kch
-rw-r--r-- 1 markw markw  55M Jul 15 13:35 c^p^o^s^.kch
-rw-r--r-- 1 markw markw  55M Jul 15 13:52 c^s^p^o^.kch
-rw-r--r-- 1 markw markw  14M Jul 15 13:35 i2k.kch
-rw-r--r-- 1 markw markw  14M Jul 15 13:35 k2i.kch
-rw-r--r-- 1 markw markw 6.1M Jul 15 13:35 namespace.kch
-rw-r--r-- 1 markw markw 6.1M Jul 15 13:35 prefix.kch
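
To make concrete what a predicate-object index buys, here is a toy in-memory version. It says nothing about how rdflib_kyotocabinet actually lays out its files: `ToyStore` and its methods are invented for illustration, a dict stands in for the index, and the short type names are placeholders.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # abbreviated for the toy example


class ToyStore:
    """Toy triple store with a single (predicate, object) -> subjects index."""

    def __init__(self):
        self.triples = []
        self.pos = defaultdict(set)

    def add(self, s, p, o):
        self.triples.append((s, p, o))
        self.pos[(p, o)].add(s)

    def subjects_by_scan(self, p, o):
        # Without an index: look at every triple in the store.
        return {s for (s, p2, o2) in self.triples if p2 == p and o2 == o}

    def subjects_by_index(self, p, o):
        # With the index: a single dictionary lookup.
        return set(self.pos.get((p, o), ()))

    def subjects_matching_all(self, patterns):
        # Several patterns on one variable: one lookup per pattern,
        # then intersect the resulting subject sets.
        result = None
        for p, o in patterns:
            found = self.subjects_by_index(p, o)
            result = found if result is None else result & found
        return result if result is not None else set()


store = ToyStore()
store.add("n1", RDF_TYPE, "Neuron")
store.add("n1", RDF_TYPE, "Cell")
store.add("m1", RDF_TYPE, "Muscle")
store.add("m1", RDF_TYPE, "Cell")

cells = store.subjects_by_index(RDF_TYPE, "Cell")
neuron_cells = store.subjects_matching_all(
    [(RDF_TYPE, "Cell"), (RDF_TYPE, "Neuron")]
)
```

A single type lookup is cheap here, but each additional pattern on the same variable adds another lookup and another intersection.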

I'm thinking that for my query above, though, the database engine is doing a join across the results of a separate index lookup for each triple pattern, so the redundant super-type patterns add work without narrowing the result. Adding an option to toggle between the triples generated for querying and those generated for uploading should make these queries more reasonable: since the upload already tags each object with every super-type, a query would only need a single rdf:type pattern and would still return subtypes from a super-type query. Some of the issues I've run into (there are others...) suggest that the shared query/upload pattern may be inappropriate. I'll have to think more about it.
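
One way the toggle might look, as a sketch only: the `for_query` flag, `type_patterns`, `to_sparql`, and the stand-in classes are hypothetical names, not existing code.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://openworm.org/entities/"


# Hypothetical stand-ins for the real class hierarchy.
class DataObject:
    pass


class Cell(DataObject):
    pass


class Neuron(Cell):
    pass


def type_patterns(node, cls, for_query=False):
    """rdf:type triples for node.

    Upload mode (default): one triple per class in the hierarchy.
    Query mode: only the most specific class, since uploaded objects
    already carry every super-type tag.
    """
    classes = [c for c in cls.__mro__ if c is not object]
    if for_query:
        classes = classes[:1]
    return [(node, RDF_TYPE, BASE + c.__name__) for c in classes]


def to_sparql(var, patterns):
    """Render patterns as the body of a 'select distinct' query."""
    body = " .\n  ".join(f"{s} <{p}> <{o}>" for (s, p, o) in patterns)
    return f"select distinct {var} where {{\n  {body} .\n}}"


upload = type_patterns(BASE + "Neuron/n1", Neuron)      # 3 triples to insert
query = type_patterns("?cell", Cell, for_query=True)    # 1 pattern to match
query_text = to_sparql("?cell", query)
```

A query for Cell built this way has a single rdf:type pattern, yet it would still match every Neuron, because the upload side wrote the Cell tag onto each one.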
