Wednesday, April 30, 2014

Just found this conversation with Steve McGrew and Richard Gordon from January. What happened from here?

Provenance

I am looking at the question of data sources for the project. I have latched onto the question of provenance: tracking where data was generated, when, by whom, and for what. Here is some evidence from the project for my concern:

Other things that are relevant to research into data sources

As a side note, I found this post from Jim H. that makes me feel better about my initial bewilderment with the project organization :) It's also a good reference for the state of the project's data from about a year ago.

Monday, April 28, 2014

Pipelines

To determine the needs of the OpenWorm project, it will be invaluable to have a set of the "pipelines" between different data sources and target simulation engines (this is in the name of the project, right?). For example, in movement validation I have

Raw Video => Measurements => Features => Statistics

from this progress report.
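As a toy illustration of that chain, each stage can be written as a function whose output feeds the next. Everything here is invented for illustration; these are not the real movement-validation APIs, and the "video" is just a few lists of numbers standing in for frames.

```python
# Hypothetical sketch of the Raw Video => Measurements => Features => Statistics
# pipeline. All functions and data are stand-ins, not OpenWorm code.

def measure(frames):
    """Raw video frames -> per-frame measurements (here, a centroid-like mean)."""
    return [sum(f) / len(f) for f in frames]

def featurize(measurements):
    """Measurements -> features (here, frame-to-frame displacement)."""
    return [b - a for a, b in zip(measurements, measurements[1:])]

def summarize(features):
    """Features -> summary statistics (mean and range of displacement)."""
    return {"mean": sum(features) / len(features),
            "range": max(features) - min(features)}

frames = [[0.0, 2.0], [1.0, 3.0], [3.0, 5.0]]  # toy "video"
stats = summarize(featurize(measure(frames)))
print(stats)  # -> {'mean': 1.5, 'range': 1.0}
```

The point is only that each arrow in the pipeline is a place where provenance can be attached or lost.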

I think I have a feel for the general flow of things:

  1. Collected data (primary sources) and/or journal articles describing measurements of chemical(?)/electrical pathways in C. elegans; neuronal/muscular connections; movement patterns; avoidance behaviors.
    1. If we have raw data, then derive measurements from it
  2. Translate the data into a form which can be checked against existing models
  3. Run models against the translated data and measure how they differ
  4. Store these disparities between the model and "real world" measurements
  5. Feed back the model's inaccuracies to improve the model. That is, iterate through steps 2-5.
Of course, when more raw data come in, they need to be fed back in through step 1.
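The iterate-through-2-5 loop above can be sketched as a toy program. Every function, parameter name, and number here is a hypothetical stand-in (a one-parameter "model" fit by a crude damped correction), not anything from the actual project:

```python
# Toy sketch of the feedback loop (steps 2-5). Not an OpenWorm API.

def translate(raw):
    """Step 2: put raw measurements into a model-comparable form."""
    return [m * 1.0 for m in raw]

def run_model(params, inputs):
    """Step 3: a toy 'model' that scales its input by one parameter."""
    return [params["gain"] * x for x in inputs]

def disparity(predicted, observed):
    """Steps 3-4: mean absolute difference between model and data."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

def improve(params, predicted, observed):
    """Step 5: feed the signed error back to adjust the model."""
    signed = sum(p - o for p, o in zip(predicted, observed)) / len(observed)
    params["gain"] -= 0.2 * signed  # crude damped correction
    return params

raw = [1.0, 2.0, 3.0]            # step 1: collected data
observed = translate(raw)        # step 2
params = {"gain": 2.0}           # deliberately wrong starting model
for _ in range(20):              # iterate steps 2-5
    predicted = run_model(params, raw)
    err = disparity(predicted, observed)
    if err < 1e-3:
        break
    params = improve(params, predicted, observed)
print(params, err)
```

In the real project each of these steps is a different tool and format, which is exactly why the disparities (and their provenance) need to be stored rather than held in one process's memory.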

There's also, I suspect, a need to extract data from the models (gotten in step 4) and put it into reports and articles for dissemination. This last would require carrying the origins of the data all the way through (provenance!) so you don't have to go back through and ask, "who generated this data?" How is this being handled currently?

There is, certainly, a "zone of deference" to the capabilities of project members to keep track of their data sources (they're scientists after all), but we'd like to make it harder to lose track of the sources of the data. Moving into the simulation formats, it isn't clear to me what the best way is to preserve that data outside of procedural recommendations for tagging data on re-entry into the backing stores.

We are probably looking at something like auxiliary files for storing the metadata needed to keep track of semantic web data. In writing out to a target format (for example, some_model.nml) we would also write out some_model.nml.rdf for the metadata. The RDF file would include not only provenance, but also other information, like more detailed simulation parameters than those supported by the target format, the version of the RDF->simulation-format translator, etc.

These are just spitball thoughts. My main task is first to gather details on what the project is doing now and what it wants to do. These details should take the form of the pipelines I mention at the start of this post, but with greater detail and with information on what needs to be preserved, added, or removed when moving between formats in both directions.

General meeting

Full meeting notes for 28 April (10:30-12:00 Central time)

Some things I missed: Johannes (the other GSoC student), the Reddit AMA, and discussions on Geppetto and Sibernetic.

The parts that I got (from ~11:00 Central time):

Came in during Matteo's update. Discussing Waffle.io and Geppetto tutorial for people unfamiliar with some of the underlying code.

Introduced myself.

Link to the notes for the meeting in my google drive.

Giovanni discussed Kickstarter
Kickstarter page repo: https://github.com/openworm/org.openworm.website/tree/kickstarter

Someone from Gateway was contacted by Stephen; they've given assistance in a similar context before. Hopefully this extends exposure.

Kickstarter dip in the middle of the campaign. Need to pump up the volume (reach more people). Typically picks up at the end of the campaign.

Michael: prospectives paper, Turing test section, movement validation, describing the testing pipeline.

Jim converting processing pipeline from Matlab to Python with Michael in next couple of weeks.

Andrey (and someone else) absent; possibly sleeping.

Raynor working on modeling the NMJ of the worm https://github.com/openworm/OpenWorm/issues/124. Simulation running in NEURON. Trying to match the NeuroML to specifications in a paper which describes graded responses. Distinction in modeling between ACh- and GABA-ergic synapses.

Padraig suggests generating from neuroConstruct into NEURON and building from that.

Disconnected

Michael getting exact modelling of sinusoidal motion.

Tom making fitness functions for the worm's motion based on more complex sinusoids.

Meeting ends.

Wednesday, April 23, 2014

PyOpenWorm

Read through the PyOpenWorm source. It looks like it's drawing from some overlapping data sources:
I need to look over these data sources to get an idea of what they contain and whether there's overlap that should be eliminated.