Category Archives: RDF

Versioning / Revisioning for Data, Databases and Domain Models: Copy-on-Write and Diffs

There are several ways to implement revisioning (versioning) of domain models, databases, and data generally:

  • Copy on write – so one has a ‘full’ copy of the model/DB at each version.
  • Diffs: store diffs between versions (plus, usually, a full version of the model at a given point in time, e.g. store HEAD)

In both cases one will usually want an explicit Revision/Changeset object which records:

  • timestamp
  • author of change
  • log message

In more complex revisioning models this metadata may also be used to store key data relevant to the revisioning structure (e.g. revision parents).
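As a concrete sketch, such a Revision/Changeset object might look like the following in Python. All names here are illustrative, not from any particular library:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Revision:
    """Changeset metadata attached to every change (illustrative sketch)."""
    author: str
    log_message: str
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # In more complex revisioning models, parent revisions capture the
    # revision graph (branching and merging).
    parents: list = field(default_factory=list)

r1 = Revision("alice", "initial import")
r2 = Revision("bob", "fix address", parents=[r1])
```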

Copy on write

In its simplest form copy-on-write (CoW) would copy the entire DB on each change. However, this is clearly very inefficient, and hence one usually restricts the copy-on-write to the relevant changed “objects”. The advantage of doing this is that it limits the changes we have to store (in essence, objects unchanged between revision X and revision Y get “merged” into a single object).

For example, if our domain model had Person, Address and Job objects, a change to Person X would only require a copy of the Person X record (an even more standard example is wiki pages). Obviously, for this to work, one needs to be able to partition the data (domain model). With a normal domain model this is trivial: pick the object types, e.g. Person, Address, Job etc. However, for a graph setup (as with RDF) this is not so trivial.

Why? In essence, for copy on write to work we need:

  1. a way to reference entities/records
  2. support for putting objects in a deleted state

The (RDF) graph model has no good way of referencing triples (we could use named graphs, quads or reification, but none are great). We could move to the object level and only work with groups of triples (e.g. those corresponding to a “Person”). You’d also need to add a state triple to every base entity (be that a triple or named graph) and add that to every query statement. This seems painful.
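To make the per-object copy-on-write idea concrete, here is a minimal Python sketch (a hypothetical store, not any real implementation). Each write creates a new copy of just the changed record tagged with a revision number; deletion is modelled as a write of a tombstone state, satisfying the two requirements above:

```python
class CowStore:
    """Per-object copy-on-write: unchanged objects are shared between
    revisions; only changed objects get a new copy."""

    def __init__(self):
        # (object_id, revision) -> state dict, or None for "deleted"
        self.records = {}
        self.head = 0

    def write(self, object_id, state):
        self.head += 1
        self.records[(object_id, self.head)] = state
        return self.head

    def delete(self, object_id):
        # Deletion is just another write: a tombstone marking the
        # "deleted" state required for CoW to work.
        return self.write(object_id, None)

    def read(self, object_id, revision):
        # Latest copy at or before `revision`; objects unchanged since
        # an earlier revision need no copy of their own.
        for rev in range(revision, 0, -1):
            if (object_id, rev) in self.records:
                return self.records[(object_id, rev)]
        return None

store = CowStore()
r1 = store.write("person:1", {"name": "X", "job": "editor"})
r2 = store.write("person:1", {"name": "X", "job": "writer"})
r3 = store.delete("person:1")
```

Note that reading at an old revision still works after the delete: the tombstone only masks the object from revision r3 onwards.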

Diffs

The diff model involves computing diffs (forward or backward) for each change. A given version of the model is then computed by composing diffs.

Usually, for performance reasons, full representations of the model/DB at a given version are cached; most commonly HEAD is kept available. It is also possible to cache more frequently and, as with copy-on-write, to cache selectively (i.e. only cache items which have changed since the last cache period).

The disadvantage of the diff model is the need for (and cost of) creating and composing diffs (CoW is, generally, easier to implement and use). However, it is more efficient in storage terms and works better with general data (one can always compute diffs), especially data that doesn’t have such a clear domain model, such as the RDF case discussed above.
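A minimal Python sketch of the backward-diff approach, under the assumption that diffs are simple dicts of changed keys (with None meaning “key absent”; purely illustrative): a full HEAD is stored, and any earlier version is recovered by composing diffs backwards from it.

```python
def apply_diff(state, diff):
    """Apply one backward diff to a state dict, returning a new dict."""
    state = dict(state)
    for key, value in diff.items():
        if value is None:
            state.pop(key, None)  # key did not exist in the older version
        else:
            state[key] = value
    return state

# HEAD (revision 3) is cached in full; backward_diffs[n] turns
# revision n+1 back into revision n.
HEAD_REV = 3
head = {"name": "X", "job": "writer", "city": "London"}
backward_diffs = {
    2: {"city": None},     # revision 2 had no city yet
    1: {"job": "editor"},  # revision 1 had the old job
}

def version(n, head, diffs, head_rev=HEAD_REV):
    """Recover revision n by composing backward diffs from HEAD."""
    state = head
    for rev in range(head_rev - 1, n - 1, -1):
        state = apply_diff(state, diffs[rev])
    return state
```

This is exactly the trade-off described above: each read of an old version costs a chain of diff applications, but storage is proportional to what actually changed.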

Usage

  • Wikis: Many wikis implement a full copy-on-write model with a full copy of each page being made on each write.
  • Source control: diff model (usually with HEAD cached and backwards diffs)
  • vdm: copy-on-write using SQL tables as core ‘domain objects’
  • ordf: (RDF) diffs with HEAD caching

Howto Install 4store

My experiences (with the assistance of Will Waites) of installing 4store on Ubuntu Jaunty.

There are no packaged versions of the code (there is one in fact from Yves Raimond from mid-2009, but it is now out of date …), so you need to get it from github.

I recommend using Will Waites’ fork, which adds useful features like:

  • multiple connections
  • triple deletion

Note that I had to make various fixes to get this to compile on my Ubuntu machine. See the diffs below.

Install standard ubuntu/debian dependencies:

  • See 4store wiki
  • rasqal needs to be the latest version
    • Get it
    • ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
    • make, make install
  • Now install 4store itself

Now to start a DB:

  • 4s-backend-setup {db-name}
  • 4s-backend {db-name}

Now for the Python bindings, also created by Will Waites, which can be found here.

  • On my Jaunty machine I needed to convert size_t to int everywhere
  • Needed to run with the latest Cython (v0.12), installed via pip/easy_install
  • To run the tests you need a backend DB called py4s_test (hardcoded)

To run multiple backends at once you will probably need to have the avahi dev libraries installed (not sure which ones!).

Diff for wwaites 4store fork (updated diff as of 2010-04-28)


diff --git a/src/backend/Makefile b/src/backend/Makefile
index 51a957c..e64eb13 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -2,7 +2,7 @@ include ../discovery.mk
 include ../rev.mk
 include ../darwin.mk

-CFLAGS = -Wall -Wstrict-prototypes -Werror -g -std=gnu99 -O2 -I.. -DGIT_REV=\"$(gitrev)\" `pkg-config --cflags raptor glib-2.0`
+CFLAGS = -Wall -Wstrict-prototypes -g -std=gnu99 -O2 -I.. -DGIT_REV=\"$(gitrev)\" `pkg-config --cflags raptor glib-2.0`
 LDFLAGS = $(ldfdarwin) $(ldflinux) -lz `pkg-config --libs raptor glib-2.0` $(avahi)

 LIB_OBJS = chain.o bucket.o list.o tlist.o rhash.o mhash.o sort.o \
diff --git a/src/common/Makefile b/src/common/Makefile
index 9b33e94..60cd04f 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -21,7 +21,7 @@ ifdef dnssd
 mdns_flags = -DUSE_DNS_SD
 endif

-CFLAGS = -std=gnu99 -fno-strict-aliasing -Wall -Werror -Wstrict-prototypes -g -O2 -I../ -DGIT_REV=\"$(gitrev)\" $(mdns_flags) `pkg-config --cflags $(pkgs)`
+CFLAGS = -std=gnu99 -fno-strict-aliasing -Wall -Wstrict-prototypes -g -O2 -I../ -DGIT_REV=\"$(gitrev)\" $(mdns_flags) `pkg-config --cflags $(pkgs)`
 LDFLAGS = $(ldfdarwin) $(lfdlinux)
 LIBS = `pkg-config --libs $(pkgs)`

diff --git a/src/frontend/results.c b/src/frontend/results.c
index 485ac31..162aa3d 100644
--- a/src/frontend/results.c
+++ b/src/frontend/results.c
@@ -381,12 +381,12 @@ fs_value fs_expression_eval(fs_query *q, int row, int block, rasqal_expression *
         return v;
     }
-    case RASQAL_EXPR_SUM:
-    case RASQAL_EXPR_AVG:
-    case RASQAL_EXPR_MIN:
-    case RASQAL_EXPR_MAX:
-    case RASQAL_EXPR_LAST:
-        return fs_value_error(FS_ERROR_INVALID_TYPE, "unsupported aggregate operation");
+    //case RASQAL_EXPR_SUM:
+    //case RASQAL_EXPR_AVG:
+    //case RASQAL_EXPR_MIN:
+    //case RASQAL_EXPR_MAX:
+    //case RASQAL_EXPR_LAST:
+    //    return fs_value_error(FS_ERROR_INVALID_TYPE, "unsupported aggregate operation");

Diff to wwaites py4s (updated diff as of 2010-04-28)


diff --git a/_py4s.pxd b/_py4s.pxd
index 5251289..0e26250 100644
--- a/_py4s.pxd
+++ b/_py4s.pxd
@@ -110,7 +110,7 @@ cdef extern from "frontend/results.h":

 cdef extern from "frontend/import.h":
     int fs_import_stream_start(fsp_link *link, char *model_uri, char *mimety
-    int fs_import_stream_data(fsp_link *link, unsigned char *data, size_t co
+    int fs_import_stream_data(fsp_link *link, unsigned char *data, int count
     int fs_import_stream_finish(fsp_link *link, int *count, int *errors)

cdef extern from "frontend/update.h":