Saturday, May 10, 2014

Updates, Persistence, and Berkeley DB store

For persisting updates to data through PyOpenWorm, we need some stable store. For now I'm looking at how we can save the updates locally. From there we would have mechanisms to write local changes back to the shared repository, possibly with staging for review by other members, like a GitHub pull request.

Stephen L. has suggested integrating git with the database system, in particular by presenting changes as diffs of SQL statements. With the scheme above, this could involve dumping the store to SQL statements on commit so the changes are legible in the diff. To extract a meaningful diff from these dumps, we'd need to sort the values first, so that unchanged data produces matching subsequences for diff to find. From a cursory assessment, I don't think this is too difficult, and we can already see that it would be valuable for exposing changes in a human-readable format and for integrating into existing GitHub-based workflows.
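A minimal sketch of the sorted-dump idea: if each dump emits one INSERT statement per row in sorted order, two dumps of the same data are byte-identical regardless of insertion order, so a git diff shows only real changes. The table name and rows here are hypothetical.

```python
# Sketch: dump rows as sorted SQL INSERT statements so that a diff
# between two dumps only shows rows that actually changed.

def dump_sorted_sql(rows, table="triples"):
    """Return a deterministic SQL dump: one INSERT per row, sorted."""
    stmts = []
    for row in sorted(rows):  # sorting makes the dump order-independent
        values = ", ".join("'%s'" % str(v).replace("'", "''") for v in row)
        stmts.append("INSERT INTO %s VALUES (%s);" % (table, values))
    return "\n".join(stmts)

# Same data inserted in a different order yields an identical dump:
a = [("worm", "has", "neuron"), ("neuron", "type", "motor")]
b = [("neuron", "type", "motor"), ("worm", "has", "neuron")]
assert dump_sorted_sql(a) == dump_sorted_sql(b)
```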

I have started experimenting with Berkeley DB using the "Sleepycat" store for RDFlib. I've done a little profiling and found that performing queries is about 10 times slower (0.02 ms for in-memory vs. 0.17 ms for Sleepycat). What's interesting is that query times fall precipitously after a number of repeated queries, suggesting some caching. I'd like to read up on Berkeley DB to get a fuller picture. I'll also be looking at times for writes back to the database.
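A sketch of the kind of timing harness that makes the warm-up effect visible: time each call individually and compare early (cold) calls against later (warm) ones. In the real profiling the query function would run a SPARQL query against an RDFlib graph opened on the Sleepycat store; here `cached_query` is a hypothetical stand-in that simulates an on-disk hit followed by cached results, so the sketch stays self-contained.

```python
import time

def profile_query(query_fn, repeats=50):
    """Call query_fn repeatedly, returning per-call times in milliseconds.

    Comparing the first few times against the last few exposes cache
    warm-up effects like the drop observed with the Sleepycat store.
    """
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        query_fn()
        times.append((time.perf_counter() - start) * 1000.0)
    return times

# Hypothetical stand-in for a disk-backed query: slow on a cold cache,
# fast once the result has been cached.
cache = {}
def cached_query():
    if "result" not in cache:
        time.sleep(0.005)  # simulate hitting the on-disk store
        cache["result"] = 42
    return cache["result"]

times = profile_query(cached_query)
assert times[0] > times[-1]  # cold first call is slower than warm later calls
```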
