Monday, April 28, 2014

Pipelines

It would be invaluable for me, in determining the needs of the OpenWorm project, to have a set of the "pipelines" between different data sources and target simulation engines (this is in the name of the project, right?). For example, in movement validation I have
Raw Video => Measurements  =>  Features  =>  Statistics
from this progress report.

I think I have a feel for the general flow of things:

  1. Collected data (primary sources) and/or journal articles describing measurements of chemical(?)/electrical pathways in C. elegans; neuronal/muscular connections; movement patterns; avoidance behaviors.
    1. If we have raw data, then measurements derived from that data
  2. Translate data into a form which can be checked against existing models
  3. Run models against translated data and measure how they differ
  4. Store these disparities between the model and "real world" measurements
  5. Feed the model's inaccuracy back to improve the model. That is, iterate through steps 2-5.
Of course, when more raw data come in, they need to be fed back in through step 1.
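
To make the loop in steps 2-5 concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `toy_model` is a hypothetical stand-in for a real simulation engine, the quantity names are made up, and the "improvement" in step 5 is a crude proportional correction rather than anything the project actually uses.

```python
from dataclasses import dataclass


@dataclass
class Disparity:
    """Step 4: one stored difference between the model and a measurement."""
    quantity: str
    measured: float
    simulated: float

    @property
    def error(self) -> float:
        return self.simulated - self.measured


def compare(measured: dict, simulated: dict) -> list:
    """Step 3: measure how the model differs, for quantities both sides report."""
    return [
        Disparity(name, measured[name], simulated[name])
        for name in sorted(measured)
        if name in simulated
    ]


def toy_model(gain: float) -> dict:
    """Hypothetical stand-in for a simulation engine with one free parameter."""
    return {"wave_speed": 2.0 * gain, "body_bends": 10.0 * gain}


def iterate(measured: dict, gain: float, rounds: int = 20, rate: float = 0.05):
    """Step 5: feed the disparities back into the model, then repeat."""
    history = []
    for _ in range(rounds):
        disparities = compare(measured, toy_model(gain))
        history.append(disparities)  # keep every round's disparities (step 4)
        gain -= rate * sum(d.error for d in disparities)
    return gain, history
```

Calling `iterate({"wave_speed": 3.0, "body_bends": 15.0}, gain=1.0)` drives the toy parameter toward 1.5, the value at which both disparities vanish. The point is only the shape of the loop: each round's disparities are kept, so they can later be reported alongside the model.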

There's also, I suspect, a need to extract the data from the models (obtained in step 4) and put them into reports and articles for dissemination. This would require carrying the origins of the data all the way through (provenance!) so that you don't have to go back and ask, "who generated this data?" How is this being handled currently?
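
One way to picture "carrying the origins all the way through" is a small wrapper that travels with the data. This is only a sketch under my own assumptions: the `Traced` class, the origin string, and the stand-in stage functions are all hypothetical, and the stages just mimic the Raw Video => Measurements => Features => Statistics pipeline with toy arithmetic.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Traced:
    """A value bundled with where it came from and what was done to it."""
    value: Any
    origin: str                  # who or what generated the raw data
    steps: Tuple[str, ...] = ()  # names of the stages applied so far

    def apply(self, name: str, func: Callable[[Any], Any]) -> "Traced":
        # Run one pipeline stage and record its name; the origin is never lost.
        return Traced(func(self.value), self.origin, self.steps + (name,))


# Hypothetical raw data and toy stage functions, for illustration only.
raw = Traced([1.0, 2.0, 4.0], origin="hypothetical-lab/video-0001")
measurements = raw.apply("measure", lambda frames: [f * 0.5 for f in frames])
features = measurements.apply(
    "features", lambda m: [b - a for a, b in zip(m, m[1:])]
)
statistics = features.apply("statistics", lambda f: sum(f) / len(f))
```

At the end, `statistics.origin` still names the source and `statistics.steps` lists the stages in order, so the question of who generated the data can be answered from the result itself.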

There is, certainly, a "zone of deference" to the capabilities of project members to keep track of their data sources (they're scientists after all), but we'd like to make it harder to lose track of the sources of the data. Moving into the simulation formats, it isn't clear to me what the best way is to preserve that information, beyond procedural recommendations for tagging data on re-entry into the backing stores.

We are probably looking at something like auxiliary files for storing the metadata needed to keep track of semantic web data. When writing out to a distinct format, for example some_model.nml, we would also write out some_model.nml.rdf for the metadata. The RDF file would include not only provenance but also other information, such as simulation parameters more detailed than the target format supports, the version of the RDF->simulation format translator, etc.

These are just "spit ball" thoughts. My main aim is first to gather details on what the project is doing now and what it wants to do. These details should take the form of the pipelines I mention at the start of the post, but with greater detail and with information on what needs to be preserved, added, or removed in moving between formats in both directions.
