Quantum Keep

Quantum Keep was an exploratory project investigating different ways of using git for storage and versioning of data as opposed to source files.

I would be lying if I said the main reason for this project was anything other than my own curiosity, but one use case I was particularly interested in was the idea of a small-scale content management system for personal websites that keeps both the site's configuration and design and the site's content in a single git repository, for ease of development, deployment and backup, rather than forcing the user to deal with a separate database system.

The Storage Model

In my experiments I tried several different storage models, all building on the raw git concepts of blobs, trees and commits.

The first model I built as a prototype was a naive mapping of a simple nested map data structure onto trees and blobs. Maps were represented as trees whose entries were named after the keys, and string values were represented as blobs. This was a very straightforward mapping, but also very inefficient on reads since many separate git objects must be visited in order re-constitute the data structure.

However, the first prototype did result in some useful utilities for manipulating git's primitives as data structures rather than files, so armed with those utilities my second approach was to build a schema-based store with each commit containing the following structure:

a top-level tree with exactly two entries of 'schema' and 'data', each referring to other trees
each entry in the schema tree is a blob describing the structure of a particular container (a table in SQL perlance).
each entry in the data tree is a further tree for each container described in the schema.
the per-container tree contains further trees whose entry names form a search tree for the object primary keys, eventually leading to a blob containing the actual data.
all data blobs contain data serialized using msgpack, a compact binary serialization with a data model similar to JSON.

The second prototype was also the first that attempted to solve for diffing and merging of data structures, described in more detail below, but in the end the presence of the strong schema led to some pretty complex cases to handle when merging commits containing schema changes with commits containing data changes, etc.

The third and final prototype was a compromise between the first two approaches. I returned to the idea of simply applying versioning to an arbitrary data structure and discarded the idea of schema in favor of self-describing objects. This version retained the idea of using msgpack for compact storage of individual objects, but introduced the idea of using git trees to represent various storage structures for efficient storage and retrieval, with various different implementations optimized for different situations. For example:

A key-value map with few items but whose values are large objects could be represented as a single tree whose entries are named after the keys and whose values are the blobs representing the large objects.
A larger map could instead be handled by splitting the key into multiple chunks to allow a smaller bucket to be quickly located rather than scanning the entire set.
A list data structure could be represented as a binary search tree with keys representing ranges of integer indices. These ranges can grow or shrink to different sizes to allow for single-item inserts and removals to happen while changing only a single leaf node of the tree, for efficient diffing.

Diffing and Merging

An area that particularly interested me for this project was a way to exploit git's distributed nature to create collaborative datastores in a new and interesting way.

For example, MusicBrainz and OpenStreetMap are both large databases that are collaboratively edited but centrally maintained; there is a single online data store and the organizations behind these projects must provide online tools to apply changes to this single authoritative store. It would be interesting if such databases could instead be edited in a distributed manner similar to software projects.

The key requirement for distributed change management is effective diffing and merging. Effectiveness here can mean a few different things:

Ability to accurately discern the "meaning" of a change so that it can be automatically re-applied to a different tree without conflicts.
Ability to detect differences quickly without visiting every entry in the store.
Ability to describe the scope of a change to a human user as an aid to correctly resolving a conflict that cannot be handled automatically.

For version control of source code the first of these problems is often reduced to the problem of categorizing each source code line as either added, removed or unchanged. This approximates the nature of software development, where often that is exactly what is going on: new features primarily add new lines, removing features removes lines, and really only large-scale refactoring tends to create sticky situations that this simple model can't comprehend.

Databases are a trickier beast. Consider, for example, an integer that describes how many of something there are. If the value starts off at 5 and two other changes set it to 6, that doesn't necessarily mean that the correct answer is 6. In fact, in the case of a counter in this case the correct outcome is arguably 7, since both changes added 1.

Traditional database engines handle this sort of situation via locks, since they are able to control all writes and serialize them. We do not have this luxury in a decentralized system, since the equivalent behavior would be to simply refuse to merge automatically and defer to a human operator to resolve the situation.

An approach I investigated for Quantum Keep was to define a merge strategy for each field. The default strategy is to simply blow up and ask for human intervention, since this is the only safe default. However, other automatic merge strategies could include a numeric diff as in the above example, or append-only behavior for a list, or traditional word/line based diffs for textual data.

This additional metadata helps the automatic merging system understand the intent of a change, and a data model designed around the available set of merge strategies could be pretty effective.

Conclusions

I did not develop Quantum Keep into a working, finished product, but the thought experiments and prototypes described above convince me that such a storage strategy is viable, and that for certain problem spaces distributed data modification could work well.

However, git's expectation that each participant has a complete copy of the repository means that ths approach is unlikely to be pleasant for larger datasets like OpenStreetMap, unless a different synchronization strategy were used to allow for partial clones.

The source code for these prototypes, such that it is, is available in a github repository, with each of the three prototypes living in a separate (poorly-named) branch. I welcome interested parties to study the code, but note well that none of these prototypes are finished, working products and I worked on them long enough ago that I can't make any representations about what state they are in.

My colleague Sam Vilain coincidentally also worked on a git data store research project, and many of the above ideas and conclusions came from us sharing our ideas and experiences; Sam stuck with the table-based approach with full ACID compliance and took that line of inquiry further than I did, so it may be an interesting read for anyone who is interested in exploring that idea further.

(The sorta-logo I used for Quantum Keep on my home page incorporates the Git logo by Jason Long under the terms of the Creative Commons Attribution 3.0 Unported License.)