Tuesday, September 16, 2014

End of GSOC

We've reached the end of Google Summer of Code (it was weeks ago now, but I'm writing this blog post anyway).

Although I've written a lot of code and made a lot of plans, I can't really step back from the project until I see it being used in a way that people outside and inside of OpenWorm actually notice. When I came into the project, my long-term goal was an "annotated simulation": a system for adding annotations to data, carrying them through to model files, and finally exposing the annotations to end-users when they inspect simulation results. This is meant to answer our long-standing criticism that, in one way or another, our simulations don't line up with the actual science. (Similar concerns are addressed from a computational perspective in this thread.) Over this past summer, I've gotten the first step to a place where I almost like it, but the second step has stalled significantly, and the third was never started.

Sunday, August 3, 2014

As I've been using PyOpenWorm, I've encountered some cracks in the design. The primary issue is the interaction between configuration, object properties, and database set-up. Some examples of problems that have resulted:
  1. Run-time configuration has been set up such that each object has to receive its configuration anew from the object that creates it, or else fall back on a globally set default configuration. This is a brittle arrangement that falls apart any time the appropriate configuration isn't passed to Configureable objects contained within other Configureable objects, which can result in configuration values hanging around between connect() and disconnect(). This isn't a big deal for simple one-off scripts, but it is annoying when testing or when designing more complex scripts.
  2. Registration of a Property doesn't take place until the first time its owner is initialized. This prevents us from resolving a Property to an object directly from the graph without having made one of its owners first, and, because property names are created dynamically, it means that identical properties end up with different names in sub-classes.
  3. Each DataObject sub-class has to be registered as a separate step from defining the class. This is a minor annoyance when creating new classes.
  4. It's necessary either to know the structure of generated namespace URIs (e.g., http://openworm.org/entities/Neuron/) and recreate them, or to create an object of the desired type when referencing objects in another namespace. The first approach is sub-optimal because we might have reason to change the structure of the URIs in the future, which is already a problem for some auxiliary methods designed to extract information from a URI. The second approach is also problematic since it requires creating objects which are never otherwise used, adding overhead for the garbage collector as well as the programmer.
My proposed solution to these problems is to push the configuration and class-dependent initialization to the class-definition phase through Python class decorators, which run when the library is loaded. This takes care of the fourth point by making all of the namespaces and the namespace structure static by the time any object can be created.

The third point is covered by folding the current registration into the class decoration.

The second point is covered by attaching each Property subclass to its owner's class, while retaining the initialization in the __init__() call for access to owner instances.

Finally, the first point can be addressed without the proposed solution by not passing configuration through __init__. Initially I conceived of a tree structure for configuration which could be augmented by internal nodes and passed on to children with specific configuration needs that didn't need to be communicated to higher-level users. This structure is exemplified by Data, which is Configureable but also a Configure object suitable for configuring other objects. Unfortunately, this is the only place the feature gets any use, and it isn't required for passing augmented configuration to the module as a whole, so it can simply be removed.
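As a rough illustration of the decorator idea -- the names here (register, BASE_NS, rdf_namespace) are placeholders, not the final API:

from rdflib import Namespace

BASE_NS = Namespace("http://openworm.org/entities/")
REGISTRY = {}  # filled in at import time, before any object is created

def register(cls):
    """Class decorator: record the DataObject subclass and fix its namespace
    at class-definition time rather than on first instantiation."""
    cls.rdf_namespace = Namespace(str(BASE_NS[cls.__name__]) + "/")
    REGISTRY[cls.__name__] = cls
    return cls

@register
class Neuron(object):
    pass

print(Neuron.rdf_namespace)  # http://openworm.org/entities/Neuron/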

I hope to address some of these issues soon after outstanding issues are closed and features for the next release are well-defined.

UPDATE: Most of these issues have been addressed in the new-classes branch, which has not yet been merged into master.

Sunday, July 20, 2014

It seems I have more to learn about rdflib's SPARQL query engine. A persistent issue was the misuse of indexing for what (I thought) should be simple queries. My error was in including triples of the form ?x rdf:type openworm:type in the query patterns. For reasons I don't yet understand, the engine evaluates such statements before others even when there's a more selective join it could perform from the other patterns. Hopefully I can understand this eventually.

In any case, standard object->property->value queries are effectively O(1) time again--yay!

Thursday, July 17, 2014

small update:

The issue from the previous post was one of indexing. The Property object identifiers were being generated differently for every Neuron object, which obviously made minimal use of the index. Basing identifiers on the identity of the owner object and the value of the property brings the query time for getting a single property back down to constant time. For example:

for x in range(1000):
    Neuron().save() # creates and saves a randomly generated neuron object
Neuron().name()
The call to name() takes about 10 seconds on rdflib's memory store, and slightly longer (closer to 11 seconds) on the SleepyCat store bundled with rdflib. That works out to about .01 seconds for getting the name of a specific neuron, regardless of the database size.
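For reference, a deterministic identifier along these lines might be computed like the following; the hashing scheme and helper name are my assumption, not necessarily what PyOpenWorm does:

import hashlib
from rdflib import URIRef

def property_uri(owner_uri, property_name, value):
    """Build a stable URI for a property node from its owner's identifier and
    its value, so repeated saves of the same fact reuse the same node."""
    digest = hashlib.md5(
        (str(owner_uri) + "|" + property_name + "|" + str(value)).encode("utf-8")
    ).hexdigest()
    return URIRef("http://openworm.org/entities/" + property_name + "/" + digest)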

Tuesday, July 15, 2014

I adjusted the query according to the previous post and also added patterns for the cell the property is actually on (oops):
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix openworm: <http://openworm.org/entities/>
prefix neuron: <http://openworm.org/entities/Neuron/>
prefix sp: <http://openworm.org/entities/SimpleProperty/>
select distinct ?type where
{ 
?Neuron rdf:type openworm:Neuron .

?Neuron neuron:lineageName ?Neuron_lineageName .
?Neuron_lineageName rdf:type openworm:Neuron_lineageName .
?Neuron_lineageName sp:value ?lineageName .

?Neuron neuron:name ?Neuron_name .
?Neuron_name rdf:type openworm:Neuron_name .
?Neuron_name sp:value "AVAL" .

?Neuron neuron:type ?Neuron_type .
?Neuron_type rdf:type openworm:Neuron_type .
?Neuron_type sp:value ?type .

?Neuron neuron:receptor ?Neuron_receptor .
?Neuron_receptor rdf:type openworm:Neuron_receptor .
?Neuron_receptor sp:value ?receptor 
}
For the data stores we've been using, such queries don't seem to execute very efficiently.

I would hope for it to be quick since

  1. there should be exactly one statement matching (a, sp:value, "AVAL") to look up in the 'pos' index,
  2. then only one matching (b, neuron:name, a), looked up in the 'osp' index,
  3. and an index lookup in the 'spo' index for the types of 'b', to check that it has the correct type, should be efficient.
  4. The neuron should then have only one neuron:name,
  5. and one of each of the other properties, which can be looked up in the 'spo' index (roughly the traversal sketched below).
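To make that expected access pattern concrete, here is roughly what those lookups would look like as direct rdflib calls against a hypothetical local copy of the graph (a sketch of the access pattern, not how the query engine actually plans the query):

from rdflib import Graph, Namespace, Literal, RDF

OW = Namespace("http://openworm.org/entities/")
NEURON = Namespace("http://openworm.org/entities/Neuron/")
SP = Namespace("http://openworm.org/entities/SimpleProperty/")

g = Graph()
g.parse("worm.n3")  # hypothetical local dump of the data

# (?a, sp:value, "AVAL") -- one 'pos' index lookup
for name_prop, _, _ in g.triples((None, SP.value, Literal("AVAL"))):
    # (?b, neuron:name, ?a) -- one 'osp' index lookup
    for neuron, _, _ in g.triples((None, NEURON.name, name_prop)):
        # type check on ?b via the 'spo' index
        if (neuron, RDF.type, OW.Neuron) in g:
            # one 'spo' lookup per remaining property of interest
            for _, _, type_prop in g.triples((neuron, NEURON.type, None)):
                for _, _, value in g.triples((type_prop, SP.value, None)):
                    print(value)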

For queries against Sesame, I'll have to try the Java API to be more confident that these indexes are actually there; otherwise I'm making a lot of assumptions about how well the indexes work. Beyond that, I don't have an easy way of understanding how the queries are being executed short of measuring the often very long query times. Maybe I can get some sense of how these queries are being handled with some test data.

I've started reading a little about ZODB, a Python object database. The core of what I've been building for PyOpenWorm is an Object<->RDF mapper. The choice of RDF storage to back PyOpenWorm was based on the work that had already been done, my experience with some RDF tools, and an expectation that joining with existing datasets may be easier with RDF than with other storage options. It may be worth re-evaluating going forward.

I've been trying out the KyotoCabinet store for rdflib. Uploading the whole worm under the current parameters takes about 17 minutes.

Timing any of the queries on the full data set is difficult since even 'simple' ones tend to take several minutes. The queries I'm running look something like this one, which gets the type ('interneuron', 'sensory', 'motor'):

select distinct ?type where {
?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/SimpleProperty> .
?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/Property> .
?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/DataObject> .
?Neuron_type0056fa3c098f873e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://openworm.org/entities/Neuron_type> .
?Neuron98fc353f03788d3e <http://openworm.org/entities/Neuron/type> ?Neuron_type0056fa3c098f873e .
?Neuron_type0056fa3c098f873e <http://openworm.org/entities/SimpleProperty/value> ?type
}

I've adopted a pattern of mirroring the same triples into both the upload query and the select query. The objective is to make it easier to change the data structures without having to update both the upload format and the select with every change. This is the reason the query above contains rdf:type patterns for three super-types in addition to Neuron_type: in lieu of RDF inference (which I intend to add later), I want to be able to use the class hierarchy to, for instance, get all of the Neuron objects when querying for Cell objects. To make that work, I insert all of an object's super-types along with it, which lets me query for objects of type Cell and get all those of type Neuron (and Muscle). Any object should carry all of its super-type tags in addition to its most specific type, so, theoretically, there's no problem querying for the right objects.

The main issue is that these queries execute very slowly. I suspect it has something to do with scanning the table. Interrupting a query (because who has time for that?), we find ourselves here:

  File "build/bdist.linux-x86_64/egg/rdflib/plugins/sparql/evaluate.py", line 47, in evalBGP
  File "build/bdist.linux-x86_64/egg/rdflib/graph.py", line 1373, in triples
  File "build/bdist.linux-x86_64/egg/rdflib_kyotocabinet/KyotoCabinet.py", line 315, in triples
There is an index on the predicates-objects which should make getting the objects of a given type quick:
%ls -lh worm.db
total 193M
-rw-r--r-- 1 markw markw 6.1M Jul 15 13:35 contexts.kch
-rw-r--r-- 1 markw markw  55M Jul 15 13:35 c^o^s^p^.kch
-rw-r--r-- 1 markw markw  55M Jul 15 13:35 c^p^o^s^.kch
-rw-r--r-- 1 markw markw  55M Jul 15 13:52 c^s^p^o^.kch
-rw-r--r-- 1 markw markw  14M Jul 15 13:35 i2k.kch
-rw-r--r-- 1 markw markw  14M Jul 15 13:35 k2i.kch
-rw-r--r-- 1 markw markw 6.1M Jul 15 13:35 namespace.kch
-rw-r--r-- 1 markw markw 6.1M Jul 15 13:35 prefix.kch

I'm thinking that for my query above, though, the database engine is doing a join over the results of the index lookups. Adding an option to toggle between the triples generated for querying and those generated for uploading should make these queries more reasonable while still letting us get subtypes from a super-type query. Some of the issues I've run into (there are others...) suggest that the shared query/upload pattern may be inappropriate. I'll have to think more about it.
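A sketch of what that toggle could look like -- a hypothetical helper, not the current PyOpenWorm code:

from rdflib import RDF

def type_patterns(subject, most_specific_type, supertypes, for_query=False):
    """On upload, assert the most specific type and every super-type so that a
    query for a super-type (e.g. Cell) also matches Neuron and Muscle objects.
    When building a SELECT query, emit only the most specific type to keep the
    basic graph pattern small for the query engine."""
    yield (subject, RDF.type, most_specific_type)
    if not for_query:
        for sup in supertypes:
            yield (subject, RDF.type, sup)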

Saturday, June 28, 2014

I haven't posted in a couple weeks!

Since the last update, our design has moved away from a remote database to one hosted locally. Each user of PyOpenWorm would have their own copy of the database on their machine, together with a serialization of the database which can be version-controlled under git. The serialization isn't too large (40MB for TriX, 20MB for N-Quads, 13MB for TriG).

One of my long-standing issues is attributing data to data sources -- labs, articles, curated data. This aspect of attribution, assigning metadata, can be done before the data model is fully specified, and it already has been in various places. We would only (!) need to translate into our data model after that.

For now I'm implementing the API.

Sunday, June 15, 2014

Lately, I've been attempting to generate morphology files from the database to use in RegenerateConnectome.py. Generating them is fine, but, obviously, pulling from the database for every one of the 302 neurons is pretty slow. It may not be worth speeding up that loop directly rather than caching the database. Need to test this.

EDIT: With the full database copied to a local Sleepycat store, the queries are actually slower. Same with a local OpenRDF Sesame... probably due to my laptop being way underpowered compared to the server, or to resource sharing.

EDIT2: Think I'll look into HTTP caching.

Thursday, June 12, 2014

PyOpenWorm feature freeze

I have declared a feature freeze for PyOpenWorm. For this week we're just writing code to pass the tests we've already written and those obviously derived from the API.

Wednesday, June 11, 2014

I am trying to figure out how ReadTheDocs can bring in documentation from the python source files.
  • update (10:33): Sphinx installed
  • update (10:44): Got Network's documentation printing out. Let's see about the others.
  • update (10:57): aaand, now the docs don't update from the source files. Not even deleting the build directory changes it.
  • update (11:19): Started using sphinx-apidoc. Has really noisy output.

Wednesday, June 4, 2014

Citations, DOIs, Pubmed IDs

Since we started discussing the API for PyOpenWorm, we've been talking about attaching citations to data sources. What we're looking at doing is making a declarative API that allows us to write things like:

n1 = Neuron('AVAL')
n2 = Neuron('ADER')
r1 = n1.connectsWith(n2) # r1 is now a representation of this 'connectsWith' relationship
s = Chemical('X')
r2 = s.modulates(r1) # r2 is the representation of the 'modulates' relationship
e = Evidence(doi="http://dx.doi.org/10.1007%2Fs00454-010-9273-0")
e.asserts(r2)

In English, "10.1007%2Fs00454-010-9273-0 suggests that chemical X modulates the connection between AVAL and ADER". The primary feature of this 'relationship' API (a sub-API of the PyOpenWorm API) is that it allows statements to be made part of higher-order statements.

Wednesday, May 28, 2014

Lineage data

I finally dug into the daughter_of tree (unfortunately, a forest; see below) of cell division data.

After cleaning it up a little I put it into OpenRDF: http://107.170.133.175:8080/openrdf-workbench/repositories/OpenWorm2

There are some discontinuities which leave the graph disconnected. The roots in our forest of division trees can be shown with this query:

PREFIX ns1:<http://openworm.org/entities/>

select distinct ?z where { ?y ns1:daughter_of ?z . filter(not exists { ?z ns1:daughter_of ?p }) }
These discontinuities have nothing to do with the input errors mentioned in the GitHub issue, as all of those in the daughter_of table were easily fixed.
To look at the path-to-root for 'AB plppaaaapa':

PREFIX ns1:<http://openworm.org/entities/>

select distinct ?z where { ns1:AB_plppaaaapa ns1:daughter_of+ ?z }
The code I used to extract and upload: http://pastebin.com/NVnAfD8D

It's just the cell divisions that we've had for some time. We've discussed bringing in volume data for each of the cells in development in order to model differentiation waves discussed by Richard Gordon and colleagues. As of yet, we don't have this data readily available.

Thursday, May 22, 2014

OpenRDF problems

I was having difficulty setting up OpenRDF on a Digital Ocean machine, but figured it out. A permissions problem was preventing Sesame from creating its log files. Changing the ownership of /usr/share/tomcat7 to the tomcat7 user fixed it.

Tuesday, May 20, 2014

I've switched from using a local SQLite database to an OpenRDF SPARQL service which supports SPARQL 1.1 updates. The update statements should include mandatory information about the updater as well as other checks specific to the kind of data being modified. I'm looking into an inference engine that stands between the DB and PyOpenWorm, but in the meantime I'll be encoding the rules in PyOpenWorm itself.

Sunday, May 18, 2014

I have an OWL file with WormBase data in it, and a SPARQL endpoint that serves the same data.

Wormbase has it going on (with respect to anatomical tree displays).

Also:

Work has started on the requirements writeup.

Saturday, May 10, 2014

Updates, Persistence, and Berkeley DB store

For persisting updates to data through PyOpenWorm, we need some stable store. For now I'm looking at how we can save the updates locally. From there, we would have mechanisms to write local changes back to the shared repository, possibly with staging for review by other members, like a GitHub pull request.

Stephen L. has suggested integrating git with the database system, in particular by presenting changes as diffs of SQL statements. With the scheme above, this could involve dumping to SQL statements on commit so that they are legible in the diff. To extract a meaningful diff from these dumps, we'd need to sort the values first, so that no change means matching subsequences for diff to find. From a cursory assessment, I don't think this is too difficult, and we can already see that it would be valuable for exposing changes in a human-readable format and for integrating into existing GitHub-based workflows.
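As a sketch of the sorting step (assuming we already have a line-oriented dump in hand, whether SQL statements or an N-Triples/N-Quads serialization):

def write_sorted_dump(dump_lines, path):
    """Write the dump with its lines sorted, so an unchanged database produces
    a byte-identical file and git diff only shows real additions/removals."""
    with open(path, "w") as f:
        for line in sorted(dump_lines):
            f.write(line.rstrip("\n") + "\n")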

I have started experimenting with Berkeley DB through the "Sleepycat" store for RDFlib. I've done a little profiling and found that performing queries is about 10 times slower (.02 ms for in-memory vs. .17 ms for Sleepycat). What's interesting is that query times fall precipitously after a number of repeated queries, suggesting some caching. I'd like to read up on Berkeley DB to get a fuller picture. I'll also be looking at times for writes back to the database.
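The kind of micro-benchmark behind those numbers might look roughly like this; the query, store path, and repetition count here are placeholders:

import timeit
from rdflib import Graph

mem = Graph()                       # default in-memory store
bdb = Graph(store="Sleepycat")
bdb.open("./worm_db", create=True)  # hypothetical local path

QUERY = "SELECT ?o WHERE { ?s ?p ?o } LIMIT 1"

for label, g in (("memory", mem), ("sleepycat", bdb)):
    # averaging over repeated runs is also what exposes the caching effect
    seconds = timeit.timeit(lambda: list(g.query(QUERY)), number=100)
    print(label, seconds / 100)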

Wednesday, May 7, 2014

Coming from the data meeting held today, I looked further into the movement validation repo.

I will review later today.

Tuesday, May 6, 2014

Data sources and collection

This is a list of data sources I've identified as actually being used in projects, starting from here. The goal is to understand the different types of data source to know how we can access them and what input/output formats should be targeted:

I've also found a lot of scripts for extracting data from different formats:

I'm still looking around. I'll make a new post with any updates.

Sunday, May 4, 2014

neuroConstruct and CElegansNeuroML

I am thinking that a good model to start working from is the CElegansNeuroML project. For right now I'm going to browse around to figure out how it all fits together. One of the problems I have is that my neuroConstruct build can't open the 3D display. I'm using Xephyr in XMonad so I can get a non-tiling WM in here (Java AWT programs don't seem to play well with XMonad). Mayhaps I can use it in a different session.

For right now, I'm going to explore some more data sources, start looking at where there are shared entities. Regardless of what we end up doing, we'll need to handle this somehow.

Friday, May 2, 2014

I'm thinking of trying out the RDFlib store API

Thursday, May 1, 2014

PyOpenWorm and NeuroML

I'm trying to make a NeuroML file from the generated RDF database in PyOpenWorm. Trying to make sense of the schema. Everything is related through integer keys and I haven't found a human-readable description yet.

Update 1:
All of the methods read through to an RDF graph built up in the network module. However, a neuron is not checked for existence in the database on creation. I'm going to add a primary access method that does the check.

Update 2:
Thinking about the structure of these modules. I'm guessing a config file would be good for moving the configuration around. PyOpenWorm should be a library that can be used to make the wrappers, so it shouldn't be tied to any specific database. Sounds like a job for dependency inversion (i.e., parametrizing the network class).

Update 3:
Learning libNeuroML from Padraig G's examples in the GitHub repo. Right now, I'm leaning towards separating the file-type-specific components from PyOpenWorm proper. Eventually I want to stop opening up the library files (closely tied to the schema) and spend my time with the file-specific converters.

Wednesday, April 30, 2014

Just found this conversation with Steve McGrew and Richard Gordon from January. What happened from here?

Provenance

I am looking at the question of data sources for the project. I have latched onto the question of provenance: tracking where data was generated, when, by whom, and for what. Here is some evidence from the project for my concern:

Other things that are just relevant to research into data sources:

As a side note, I found this post from Jim H. that makes me feel better about my initial bewilderment with the project organization :) It's also a good reference for the state of the project's data from about a year ago.

Monday, April 28, 2014

Pipelines

It will be invaluable for me, in determining the needs of the OpenWorm project, to have a set of the "pipelines" between the different data sources and target simulation engines (this is in the name of the project, right?). For example, in movement validation I have
Raw Video => Measurements => Features => Statistics
from this progress report.

I think I have a feel for the general flow of things:

  1. Collected data (primary sources) and/or journal articles describing measurements of chemical(?)/electrical pathways in C. elegans; neuronal/muscular connections; movement patterns; avoidance behaviors.
    1. If we have raw data, then measurements from the data
  2. Translate data into a form which can be checked against existing models
  3. Run models against translated data and measure how they differ
  4. Store these disparities between the model and "real world" measurements
  5. Feed back the model inaccuracy to improve the model. That is, iterate through steps 2-5.
Of course, when more raw data come in, they need to be fed back in through step 1.

There's also, I suspect, a need to extract the data from the models (obtained in step 4) and put them into reports and articles for dissemination. This last step would require holding onto the origins of the data all the way through (provenance!) so you don't have to go back and ask, "who generated this data?" How is this being handled currently?

There is, certainly, a "zone of deference" to the ability of project members to keep track of their own data sources (they're scientists, after all), but we'd like to make it harder to lose track of where the data came from. Moving into the simulation formats, it isn't clear to me what the best way is to preserve that information, beyond procedural recommendations for tagging data on re-entry into the backing stores.

We are probably looking at something like auxiliary files for storing the metadata needed to keep track of semantic web data. When writing out to a distinct format like, for example, some_model.nml, we would also write out some_model.nml.rdf for the metadata. The RDF file would include not only provenance but also other information, like more detailed simulation parameters than those supported by the target format, the version of the RDF->simulation-format translator, etc.
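A hypothetical sketch of such a sidecar file, using PROV-O terms for the provenance part; the META namespace, predicate names, and translator label here are placeholders:

from rdflib import Graph, Namespace, Literal, URIRef

PROV = Namespace("http://www.w3.org/ns/prov#")
META = Namespace("http://openworm.org/entities/meta/")  # placeholder namespace

def write_sidecar(model_path, source_uri, translator_version):
    """Write some_model.nml.rdf next to some_model.nml, holding provenance and
    parameters that the target format can't carry itself."""
    g = Graph()
    model = URIRef("file://" + model_path)
    g.add((model, PROV.wasDerivedFrom, URIRef(source_uri)))
    g.add((model, META.translatorVersion, Literal(translator_version)))
    g.serialize(destination=model_path + ".rdf", format="xml")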

These are just "spit ball" thoughts. My main thing is first to gather details on what the project is doing now and what it want's to do. These details should take the form of the pipelines I mention at the start of the post but with greater detail and information on what needs to be preserved, added, removed in moving between formats in both directions.

General meeting

Full meeting notes for 28 April (10:30-12:00 Central time)

Some things I missed: Johannes (the other GSOC student), the Reddit AMA, and discussions on Geppetto and Sibernetic.

The parts that I got (from ~11:00 Central time):

Came in during Matteo's update. Discussing Waffle.io and Geppetto tutorial for people unfamiliar with some of the underlying code.

Introduced myself.

Link to the notes for the meeting in my google drive.

Giovanni discussed Kickstarter
Kickstarter page repo: https://github.com/openworm/org.openworm.website/tree/kickstarter

Someone from Gateway contacted by Stephen. Previous assistance given in a similar context. Hopefully extending exposure.

Kickstarter dip in the middle of the campaign. Need to pump up the volume (reach more people). Typically picks up at the end of the campaign.

Michael: prospectives paper, Turing test section, movement validation, describe testing pipeline.

Jim converting processing pipeline from Matlab to Python with Michael in next couple of weeks.

Andrey (and someone else) absent. possibly sleeping.

Raynor working on modeling the NMJ of the worm https://github.com/openworm/OpenWorm/issues/124. Simulation running in NEURON. Trying to match the NeuroML to specifications in a paper which describes graded responses. Distinction in modeling between ACh- and GABA-ergic synapses.

Padraig suggests generating from neuroConstruct into NEURON and building from that.

Disconnected

Michael getting exact modelling of sinusoidal motion.

Tom making fitness functions for the worm's motion based on more complex sinusoids.

Meeting ends.

Wednesday, April 23, 2014

PyOpenWorm

Read through PyOpenWorm source. It looks like it's drawing from some overlapping data sources:
I need to look over these data sources to get an idea of what they contain and whether there's overlap that should be eliminated.

Thursday, March 27, 2014

Today I'm reading about the OpenWorm community's communication norms.
There's a list of best practices among the project members for communication and documentation, which I've taken a look at. The main thing I take from this is that I should announce what I'm going to do before I start working on a ticket from GitHub.

I also found a document for data collection from videos.