Category Archives: Knowledge Systems

Datapkg 0.7 Released

A major new release (v0.7) of datapkg is out!

There’s a quick getting started section below (also see the docs).

About the release

This release brings major new functionality to datapkg especially in regard to its integration with CKAN. datapkg now supports uploading as well as downloading and can now be easily extended via plugins. See the full changelog below for more details.

Get started fast

# 1. Install (requires Python and easy_install)
$ easy_install datapkg
# Or, if you don't like easy_install:
$ pip install datapkg
# (or install from the raw source!)

# 2. [optional] Take a look at the manual
$ datapkg man

# 3. Search for something
$ datapkg search ckan:// gold
gold-prices -- Gold Prices in London 1950-2008 (Monthly)

# 4. Get some data
# This will result in a csv file at /tmp/gold-prices/data
$ datapkg download ckan://gold-prices /tmp

Find out more » — including how to create, register and distribute your own ‘data packages’.

Changelog

  • MAJOR: Support for uploading datapkgs (upload.py)
  • MAJOR: Much improved and extended documentation
  • MAJOR: New sqlite-based DB index giving support for a simple, central, ‘local’ index (ticket:360)
  • MAJOR: Make datapkg easily extendable

    • Support for adding new Index types with plugins
    • Support for adding new Commands with command plugins
    • Support for adding new Distributions with distribution plugins
  • Improved package download support (also now pluggable)

  • Reimplement url download using only python std lib (removing urlgrabber requirement and simplifying installation)
  • Improved spec: support for db type index + better documentation
  • Better configuration management (especially internally)
  • Reduce dependencies by removing usage of PasteScript and PasteDeploy
  • Various minor bugfixes and code improvements

Versioning / Revisioning for Data, Databases and Domain Models: Copy-on-Write and Diffs

There are several ways to implement revisioning (versioning) of a domain model (and of databases and data generally):

  • Copy on write – so one has a ‘full’ copy of the model/DB at each version.
  • Diffs: store diffs between versions (plus, usually, a full version of the model at a given point in time e.g. store HEAD)

In both cases one will usually want an explicit Revision/Changeset object to which one attaches:

  • timestamp
  • author of change
  • log message

In more complex revisioning models this metadata may also be used to store key data relevant to the revisioning structure (e.g. revision parents).
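Such a Revision/Changeset object can be sketched in a few lines of Python (an illustrative sketch only; the field names are made up, not the API of vdm or any other library):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Revision:
    """Changeset metadata shared by all objects changed in one commit."""
    author: str
    message: str
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    # In richer revisioning models the revision also records structural
    # data, e.g. the id(s) of its parent revision(s):
    parents: tuple = ()


rev = Revision(author="anna", message="Correct the 1851 prices")
```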

Copy on write

In its simplest form copy-on-write (CoW) would copy the entire DB on each change. However, this is clearly very inefficient and hence one usually restricts the copy-on-write to the relevant changed “objects”. The advantage of doing this is that it limits the changes we have to store (in essence objects unchanged between revision X and revision Y get “merged” into a single object).

For example, if our domain model had Person, Address, Job, a change to Person X would only require a copy of the Person X record (an even more standard example is wiki pages). Obviously, for this to work, one needs to be able to partition the data (domain model). With a normal domain model this is trivial: pick the object types e.g. Person, Address, Job etc. However, for a graph setup (as with RDF) this is not so trivial.
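A minimal copy-on-write store along these lines might look as follows (an illustrative sketch, not the vdm implementation; all names are made up). Only the changed record is copied and stamped with the revision that created it; unchanged records are shared between revisions:

```python
class CowStore:
    """Copy-on-write sketch: records are keyed by (object id, revision)."""

    def __init__(self):
        self.revisions = []   # ordered list of revision ids
        self.records = {}     # (obj_id, revision) -> record dict

    def commit(self, changes):
        """changes: {obj_id: record-dict}. Returns the new revision id."""
        rev = len(self.revisions)
        self.revisions.append(rev)
        for obj_id, record in changes.items():
            self.records[(obj_id, rev)] = record
        return rev

    def get(self, obj_id, rev):
        """Latest copy of obj_id at or before revision rev."""
        for r in range(rev, -1, -1):
            if (obj_id, r) in self.records:
                return self.records[(obj_id, r)]
        return None


store = CowStore()
r0 = store.commit({"person-x": {"name": "X", "job": "smith"}})
# Changing Person X copies only the Person X record:
r1 = store.commit({"person-x": {"name": "X", "job": "mason"}})
assert store.get("person-x", r0) == {"name": "X", "job": "smith"}
assert store.get("person-x", r1)["job"] == "mason"
```

Reading at revision r walks backwards to the most recent copy at or before r, which is exactly the “merging” of unchanged objects described above.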

Why? In essence, for copy on write to work we need:

  1. a way to reference entities/records
  2. support for putting objects in a deleted state

The (RDF) graph model has no good way of referencing triples (we could use named graphs, quads or reification but none are great). We could move to the object level and only work with groups of triples (e.g. those corresponding to a “Person”). You’d also need to add a state triple to every base entity (be that a triple or named graph) and add that to every query statement. This seems painful.

Diffs

The diff model involves computing diffs (forward or backward) for each change. A given version of the model is then computed by composing diffs.

Usually for performance reasons full representations of the model/DB at a given version are cached — most commonly HEAD is kept available. It is also possible to cache more frequently and, like copy-on-write, to cache selectively (i.e. only cache items which have changed since the last cache period).

The disadvantage of the diff model is the need (and cost) of creating and composing diffs (CoW is, generally, easier to implement and use). However, it is more efficient in storage terms and works better with general data (one can always compute diffs), especially that which doesn’t have such a clear domain model — e.g. the RDF case discussed above.
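A toy backward-diff store can be sketched with the standard library’s difflib (illustrative only: ndiff deltas keep full context, so a real system would store compact deltas to get the storage benefit described above):

```python
import difflib


class DiffStore:
    """Keep HEAD in full plus a diff between each pair of consecutive
    versions; older versions are recovered by composing backwards."""

    def __init__(self, initial_lines):
        self.head = list(initial_lines)
        self.diffs = []  # diffs[i] records the change version i -> i+1

    def commit(self, new_lines):
        self.diffs.append(list(difflib.ndiff(self.head, new_lines)))
        self.head = list(new_lines)

    def version(self, i):
        """Recover version i by walking backward from HEAD."""
        lines = self.head
        for delta in reversed(self.diffs[i:]):
            lines = list(difflib.restore(delta, 1))  # 1 = the 'from' side
        return lines


store = DiffStore(["magna carta\n"])
store.commit(["magna carta\n", "annotated\n"])
assert store.version(0) == ["magna carta\n"]
assert store.version(1) == store.head
```

HEAD stays cheap to read (it is stored in full), while older versions pay the cost of composing diffs, which is exactly the trade-off described above.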

Usage

  • Wikis: Many wikis implement a full copy-on-write model with a full copy of each page being made on each write.
  • Source control: diff model (usually with HEAD cached and backwards diffs)
  • vdm: copy-on-write using SQL tables as core ‘domain objects’
  • ordf: (RDF) diffs with HEAD caching

Author “Significance” From Catalogue Data

Continuing the series of posts on analyzing catalogue data, here are some stats on author “significance”, as measured by the number of book entries (‘items’) for that author in the Cambridge University Library catalogue from 1400-1960 (there being 1m+ such entries).

I’ve termed this measure “significance” (with intentional quotes) as it co-mingles a variety of factors:

  • Prolificness — how many distinct works an author produced (since usually each work will get an item)
  • Popularity — this influences how many times the same work gets reissued as a new ‘item’ and the library decision to keep the item
  • Merit — as for popularity

The following table shows the top 50 authors by “significance”. Some of the authors aren’t real people but entities such as “Great Britain. Parliament” and for our purposes can be ignored. What’s most striking to me is how closely the listing correlates with the standard literary canon. Other features of note:

  • Shakespeare is number 1 (2)
  • Classics (Latin/Greek) authors are well-represented with Cicero at number 2 (4), Horace at 5 (9), followed by Homer, Euripides, Ovid, Plato, Aeschylus, Xenophon, Sophocles, Aristophanes and Euclid.
  • Surprise entries (from a contemporary perspective): Hannah More, Oliver Goldsmith, Gilbert Burnet (perhaps accounted for by his prolificity).
  • Also surprising is the limited number of entries from the 19th-century UK, with only Scott (26), Dickens (28) and Byron (41)

Rank  No. of Items  Name
1     3112          Great Britain. Parliament.
2     1154          Shakespeare, William
3     1076          Church of England.
4     973           Cicero, Marcus Tullius
5     825           Great Britain.
6     766           Catholic Church.
7     721           Erasmus, Desiderius
8     654           Defoe, Daniel
9     620           Horace
10    599           Aristotle
11    547           Voltaire
12    539           Virgil
13    527           Swift, Jonathan
14    520           Goethe, Johann Wolfgang Von
15    486           Rousseau, Jean-Jacques
16    479           Homer
17    444           Milton, John
18    388           Sterne, Laurence
19    387           England and Wales. Sovereign (1660-1685 : Charles II)
20    386           Euripides
21    372           Ovid
22    358           Goldsmith, Oliver
23    358           Plato
24    351           Wang
25    349           Alighieri, Dante
26    338           Scott, Walter (Sir)
27    326           More, Hannah
28    322           Dickens, Charles
29    315           Aeschylus
30    304           Burnet, Gilbert
31    302           Luther, Martin
32    295           Dryden, John
33    290           Xenophon
34    280           Sophocles
35    262           Pope, Alexander
36    259           Fielding, Henry
37    258           Li
38    250           Calvin, Jean
39    248           Zhang
40    247           Aristophanes
41    247           Byron, George Gordon Byron (Baron)
42    247           Bacon, Francis
43    247           Chen
44    245           Terence
45    241           Euclid
46    235           Augustine (Saint, Bishop of Hippo.)
47    232           Burke, Edmund
48    223           Johnson, Samuel
49    222           Bunyan, John
50    222           De la Mare, Walter

Top 50 authors based on CUL Catalogue 1400-1960

The other thing we could look at is the overall distribution of titles per author (and how it varies with rank — a classic “is it a power law” question). Below is the histogram (NB log scale for counts) together with a plot of rank against count (which equates, very crudely, to a transposed plot of the tail of the histogram). In both cases it looks (!) like a power law is a reasonable fit given the (approximate) linearity, but this should be backed up with a proper K-S test.
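The linearity eyeballed on the log-log plots can be made concrete with a toy least-squares slope of log(count) against log(rank) (illustrative only; as said, a proper power-law test would use maximum-likelihood fitting plus a K-S test, not least squares on log-log data):

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) vs log(rank)."""
    counts = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Items-per-author counts drawn from an exact power law count = C/rank
# give a slope of -1; real catalogue data is only approximately linear.
slope = loglog_slope([1000 / r for r in range(1, 201)])
assert abs(slope + 1.0) < 1e-6
```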

culbooks_person-item-hist-logxlogy.png

Histogram of items-per-author distribution (log-log)

culbooks_person-item-by-rank-logxlogy.png

Rank versus no. of items (log-log)

TODO

  • K-S tests
  • Extend data to present day
  • Check against other catalogue data
  • Look at occurrence of people in title names
  • Look at when items appear over time

Colophon

Code to generate the table and graphs is in the open Public Domain Works repository, specifically the method ‘person_work_and_item_counts’ in this file: http://knowledgeforge.net/pdw/hg/file/tip/contrib/stats.py

Exploring Patterns of Knowledge Production

I’m posting up some work-in-progress entitled Exploring Patterns of Knowledge Production (link to full pdf) that follows up on my earlier post of a year and a bit ago. Below I’ve excerpted the introduction plus the list of motivational questions. Comments (and critique) very welcome!

Exploring Patterns of Knowledge Production Paper ‘Alpha’ (pdf)

Introduction

In what follows, the term ‘knowledge’ is used broadly to signify all forms of information production, including those involved in technological innovation, cultural creativity and academic advance.

Today, thanks to rapid advances in IT, we have available substantial datasets pertaining both to the extent and the structure of knowledge production across disciplines, space and time.

Especially recent is the availability of good ‘structural’ data — that is, data on the linkages and relationships between different pieces of knowledge, for example as provided by citation information. This new material allows us to explore the “patterns of knowledge production” in deeper and richer ways than ever previously possible, often using entirely new methods.

For example, it has long been accepted that innovation and creativity are cumulative processes, in which new ideas build upon old. However, other than anecdotal and case-study material provided by historians of ideas and sociologists of science there has been little data with which to study this issue — and almost none of a comprehensive kind that would make possible a systematic examination.

However, the recent availability of comprehensive databases containing ‘citation’ information has allowed us to begin really examining the extent to which new work, be it a new technology as represented by a patent or a new idea in academia as represented by a paper, builds upon old.

Similar opportunities present themselves in relation to identifying the creation of new fields of research or technology, and tracing their evolution over time. Here the existence of extensive “structural information” as presented, for example, by citation databases, enables new systematic approaches — for example, can new fields be identified (or perhaps defined) as points in ‘knowledge space’ far away from the existing loci of effort? Or, alternatively, by the nature of their connections to the existing body of work?

Structural information of this kind can also be used in charting other changes in the life-cycle of knowledge creation. For example, to offer a specific conjecture, a field entering decline, though still exhibiting a similar level of output (papers etc) and even citations to a field in rude health, may display a citation structure which is markedly different — for example, more clustered within the field itself. Thus, by using this additional structural information we may be able to gain insights not available with simpler approaches.

At the same time, structure must also play a central role in any attempt to estimate knowledge-related ‘output’ measures. This is of course not the case for other forms of ‘output’, for example corn or steel, where we have relatively well-defined objective measures available: tonnes of such-and-such a quality.

But knowledge is different: the most obvious metrics, such as number of patents or papers produced, seem entirely inadequate: one particular innovation or paper may be ‘worth’ as much as a hundred or a thousand others.

The issue here is that, compared to corn or steel, knowledge is extremely inhomogeneous, or put slightly differently, quality (or significance) differs very substantially across the individual pieces of knowledge (papers, patents etc).

Thus, any serious attempt to measure the progress of knowledge must find some way to do this quality-adjustment, and structural information seems essential to this.

What specific questions might we explore with such datasets?

The following is a (non-exhaustive) list of the kinds of questions one might explore using these new datasets:

  • Can we use structure to infer information about quality of individual items? Clearly the answer is yes, for example by using a citation-based metric where a work’s value is estimated based on its citation by others.
  • Can we then use this information together with the more global structure of the production network to gain a better idea of total (quality-adjusted) output? This would allow one to chart progress, or the lack of it, over time.
  • Can we use structural information to investigate the life-cycle of fields? For example, can we see fields ‘dying out’ or the onset of diminishing returns? Can we see new fields coming into existence and their initial growth patterns?
  • What about productivity per capita and its variation across the population? It is likely that one would need to focus here within a discipline as it would be difficult to directly compare across disciplines, at least when using quality adjusted productivity.
  • Do the structures of knowledge production vary over time and across disciplines and does this have implications for their productivity? Can we compare the structure of evolution in technology or economics with that in ‘natural’ evolution and, if not, what are the primary differences?
  • How do other (observable) attributes related to the producers of knowledge (their collaboration with others, their geographical location) affect the structures we observe and the associated outcomes (output, productivity) already discussed above?
  • Do different policies (for example openness vs. closedness — weak vs. strong IP) have implications for the structure of production and hence for output and productivity?
  • Is knowledge production (in a particular area) ergodic or path-dependent? Crudely: do we always end up in the same place or do small shocks have large long-term effects?
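The citation-based metric mentioned in the first question can be sketched as a PageRank-style score, in which a work’s value flows from the works that cite it (a toy illustration only, not the metric of any particular study; the function name and graph are made up):

```python
def citation_rank(cites, damping=0.85, iters=50):
    """cites: {work: [works it cites]} -> {work: score}.

    Power iteration: each work passes a damped share of its score
    to the works it cites; a work with no references ('dangling')
    spreads its score evenly across all works.
    """
    nodes = set(cites) | {w for refs in cites.values() for w in refs}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, refs in cites.items():
            if refs:
                share = damping * score[src] / len(refs)
                for dst in refs:
                    new[dst] += share
            else:
                for n in nodes:
                    new[n] += damping * score[src] / len(nodes)
        score = new
    return score


# 'classic' is cited (directly or indirectly) by everything, so it
# receives the highest score:
graph = {"a": ["classic"], "b": ["classic", "a"], "classic": []}
scores = citation_rank(graph)
assert scores["classic"] == max(scores.values())
```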

Colophon

Update 2011-01-31: have now broken out the data work into dedicated repos on bitbucket.

Exploring Patterns of Knowledge Production

A definition: the term ‘knowledge’ is here used broadly to signify all forms of information production including those involved in technological innovation, cultural creativity and academic advance.

Largely as a result of better ICT we now have available some very substantial datasets regarding both the extent and structure of knowledge production across different jurisdictions and different disciplines.

Of particular interest here is this second aspect, the structure of knowledge production, as it has long been accepted that innovation and creativity are cumulative processes, in which new ideas build upon old.

However, other than the anecdotal and case-study material provided by historians of ideas and sociologists of science there has been little evidence on this issue — and almost none of a comprehensive kind that would make a systematic examination possible.

In particular, the existence of databases containing ‘citation’ information allows us to, at least partially, determine the extent to which new work, be it a new technology as represented by a patent or a new idea in academia as represented by a paper, builds upon old.

What specific issues might we explore with such datasets?

Given the availability of these new datasets and the basic cumulative nature of most knowledge production, what specific issues and questions might we explore? The following provides a basic, but non-exhaustive, list:

  • Can we use structure to infer information about quality of individual items? Clearly the answer is yes, for example by using a citation-based metric where a work’s value is computed from its citation by others.
  • Can we then use this information together with the more global structure of the production network to gain a better idea of total (quality-adjusted) output? This would allow one to chart progress, or the lack of it, over time.
  • What about productivity per capita and its variation across the population? It is likely that one would need to focus here within a discipline as it would be difficult to directly compare across disciplines, at least when using quality adjusted productivity.
  • Do the structures of knowledge production vary over time and across disciplines and does this have implications for their productivity? Can we compare the structure of evolution in technology or economics with that in ‘natural’ evolution and, if not, what are the primary differences?
  • How do other (observable) attributes related to the producers of knowledge (their collaboration with others, their geographical location) affect the structures we observe and the associated outcomes (output, productivity) already discussed above?
  • Do different policies (for example openness vs. closedness — weak vs. strong IP) have implications for the structure of production and hence for output and productivity?
  • Is knowledge production (in a particular area) ergodic or path-dependent? Crudely: do we always end up in the same place and do small shocks have small or large effects in the long term?

Overlord: D-Day and the Battle for Normandy 1944 by Max Hastings

7.5/10. Finished a few weeks ago, this is another (rather earlier) example of Hastings’ skill in writing penetrating and engaging military history, as well as his willingness to be critical of existing ‘sacred cows’. Among other things Hastings:

  • Argues that the famous Mulberrys were probably a waste of time and resources.
  • Shows how the Air Force’s extreme unhelpfulness (largely driven by its own ambitions and obsession with civilian bombing) was a serious handicap to the whole campaign.
  • Supplies a sharp corrective regarding Patton’s reputation, pointing out that up against reasonable German opposition Patton did little better than anyone else.
  • Shows clearly how it was Hitler, almost more than anyone else, who contributed to the disastrous collapse of German forces in August-October 1944 by his insistence that no retreat of any kind be considered.
  • Provides many examples of the poor quality of equipment, leadership, and men, especially among the American forces, and of how these deficiencies hindered the Allied campaign. In particular, Allied tanks were almost never a match for their German counterparts and on any occasion that Allied and German troops met on anything near an equal footing the Germans won.1 In addition he details several clear cases of simple cowardice or unwillingness to fight among the Allied troops and/or extremely poor leadership stretching from the lowest levels to the highest. This is not to criticize — who can say what they would do in such circumstances — and in many ways reflects the fact that while the Germans were a nation that had for many years been ‘obsessed’ with soldiering, the Allied troops were ‘civilians in uniform’; but it does supply a useful corrective to those rose-tinted visions supplied by films such as The Longest Day or the newsreel footage showing Allied soldiers racing past cheering French civilians.

Finally, and as an aside, while good, the book also displays the limitations of the traditional book format as a method for presenting this sort of material (i.e. military history with its strong connections between the temporal and spatial aspects of events). At least for me, the attempt to render particular troop movements, or the direction of battles, in prose never really succeeds and one finds oneself constantly flicking back to the (rather limited) maps in an attempt to connect the descriptions of events, the failures and successes of particular thrusts, with their location, both geographically and within the overall direction of the campaign. Thus, it seems to me that this kind of subject is the sort of thing most suited to being integrated with the kind of approach proposed by the Microfacts / Weaving History project currently in the early stages of its development at the Open Knowledge Foundation. Here one would be able to marry maps with descriptions, photos with actions, time with space to provide a much clearer insight into what was going on.

  1. From p. 84 ff. “The American Colonel Trevor Dupuy has conducted a detailed statistical study of German actions in the Second World War. Some of his explanations as to why Hitler’s armies performed so much more impressively than their enemies seem fanciful. But no critic has challenged his essential finding that on almost every battlefield of the war, including Normandy, the German soldier performed more impressively than his opponents:

     On a man for man basis, the German ground soldier consistently inflicted casualties at about a 50% higher rate than they incurred from opposing British and American troops UNDER ALL CIRCUMSTANCES. [emphasis in original] This was true when they were attacking and when they were defending, when they had local numerical superiority and when, as was usually the case, they were outnumbered, when they had air superiority and when they did not, when they won and when they lost.

     It is undoubtedly true that the Germans were much more efficient than the Americans in making use of available manpower. An American army corps staff contained 55 per cent more officers and 44 per cent fewer other ranks than its German equivalent. …

     Events on the Normandy battlefield demonstrated that most British or American troops continued a given operation for as long as reasonable men could. Then – when they had fought for many hours, suffered many casualties, or were running low on fuel or ammunition – they disengaged. The story of German operations, however, is landmarked with repeated examples of what could be achieved by soldiers prepared to attempt more than reasonable men could.”

Path-Dependent vs. Ergodic Systems

Consider a metal arm fixed by a pin. If it is hung vertically then the arm, no matter where it starts, will always end up in the same position. However, if you fix the arm (perfectly) horizontally it will stay forever in its initial position. The first case is ergodic: we converge independent of the starting point to some particular configuration; while the second is ‘path-dependent’ (or dependent on initial conditions): where you end up depends crucially on where you start. The question:

Is animal/technological/historical/linguistic evolution ergodic or path dependent?

More generally, how ergodic or path-dependent are the following processes?

  • (Natural) Evolution
  • Technological change
  • Human history
  • Communication systems such as natural languages
  • Other symbol systems (e.g. games or mathematics)
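One way to make the contrast concrete (a toy illustration only, not a model of any of the processes above) is to compare a fair coin, which is ergodic, with a Polya urn, whose long-run state is locked in by its early, random draws:

```python
import random


def coin_share(seed, steps=10_000):
    """Fair coin flips: the share of heads converges to 0.5 whatever
    the seed (the hanging arm)."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(steps)) / steps


def polya_share(seed, steps=10_000):
    """Polya urn: draw a ball, replace it plus one more of the same
    colour. Early draws lock in the limiting share (the horizontal arm)."""
    rng = random.Random(seed)
    red, black = 1, 1
    for _ in range(steps):
        if rng.random() < red / (red + black):
            red += 1
        else:
            black += 1
    return red / (red + black)


# Across different seeds the coin shares cluster tightly around 0.5,
# while the urn shares are scattered over (0, 1): where the urn ends
# up depends on its path.
coin_runs = [coin_share(s) for s in range(5)]
urn_runs = [polya_share(s) for s in range(5)]
```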

Versioned Domain Models

I’ve been thinking for over two years about how to have a versioned domain model, similar to the way we have versioned filesystems (e.g. subversion). Over the last few months whatever bits of free time I’ve had have gone into developing a prototype built on top of sqlobject and I’ve now got a rough and ready (but fully functional) library:

http://project.knowledgeforge.net/ckan/svn/vdm/branches/sqlobj/

A demo of how it is used is best shown by the tests:

http://project.knowledgeforge.net/ckan/svn/vdm/branches/sqlobj/vdm/dm_test.py

Why be tied to SQLObject? Obviously being so directly tied to sqlobject is not such a great thing, but I intentionally chose to build on it because so many people will already be writing their domain models using SQLObject.

Thinking about Annotation

Annotation means the adding of comments/notes/etc to an underlying resource. For the present I’ll focus on the situation where the underlying resource is textual (as opposed to being an image, or a piece of film or some data). Various things to consider when implementing an annotation/comment system:

  1. Addressing and atomisation: Are annotations specific to particular parts of the resource? If so, how do we store this address? (Relatedly: how is the resource ‘atomised’ and how do we address these atoms, or ranges of atoms?) For example, do we address by word, by character, by paragraph or by section? Do we wish to store ranges rather than a single address? Do we wish to allow a given annotation to be associated with multiple ranges/atoms?

  2. Permissions: Are there restrictions on the creation (deletion/updating etc) of annotations?

  3. Will the underlying resource change and, if so, are annotations intended to be robust to those changes?

Let’s concentrate on the first issue for the time being as it is the most immediately important. Furthermore, defining the ‘atoms’ of the resource sharply narrows the implementation options.

The Simple Case: Mod a Blog

If one is happy to have fairly large atoms (pages, or even sections of some piece of text) then implementing an annotation system can be reduced to grabbing your favourite CMS or blogging software and feeding the text in, in appropriate chunks. This is often satisfactory and is a simple, low-tech solution that will pretty much work out of the box. A classic example of this approach is http://www.pepysdiary.com/ which works so well because the subject matter (Samuel Pepys’ diary) has a very obvious atomisation (namely the daily diary entries) perfectly suited to blog software (in this case Movable Type).

You can even start doing a bit of modding, for example to present recent annotations (http://www.pepysdiary.com/recent/) or to present the text plus annotations all in one piece. (Given that commentonpower seems to fall neatly into this category, with most commentable atoms of the right size for ‘blog’ entries, I wonder why they didn’t just implement it as a plugin for wordpress — perhaps it was such a simple app that it was easier to ‘roll their own’.)

Getting More Atomic

Once you want to have atoms below a size comfortable for individual html pages/blog entries, wish to allow people to comment on chunks too large for an individual page, or to comment on ranges, one starts to have problems with this approach. The main challenge at this point is to find some way to extract the addressing information from the client doing the annotation. Confining ourselves to the web, the challenge becomes how to structure the interface and the text so that one can determine range start and end points. This is a non-trivial matter. Possible options include:

  • Javascript: in theory the selection/range objects should help us out here; unfortunately cross-browser support is patchy (Firefox as usual is excellent and IE pretty bad). If one does not want to be as precise as to get ranges, javascript could also be used to extract e.g. element ids.
  • Copy and paste of the quote to annotate with some backend algorithm to determine the actual range. Nice and simple but not clear that one can ‘invert’ (i.e. find a unique range from a given selection) unless the selection is large.
  • If addressing fairly large atoms (e.g. a paragraph or larger) one could just insert a unique piece of user interface equipment (e.g. a button or link) with each atom. Note however that this prevents support for ranges.
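The copy-and-paste option can be sketched in a few lines (an illustrative sketch; the function name is made up). The backend locates the quoted selection in the source text and stores character offsets, refusing selections that are missing or ambiguous — which is exactly the ‘inversion’ problem noted above:

```python
def locate_quote(text, quote):
    """Return (start, end) character offsets of quote in text, or None
    if the quote is missing or ambiguous (occurs more than once)."""
    start = text.find(quote)
    if start == -1:
        return None
    if text.find(quote, start + 1) != -1:
        return None  # ambiguous: a longer selection is needed
    return (start, start + len(quote))


text = "To be, or not to be, that is the question."
assert locate_quote(text, "that is the question") == (21, 41)
assert locate_quote(text, "be") is None  # occurs several times
```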

Separating Data and Presentation

Whatever one chooses to do it does seem sensible to clearly separate data and presentation. This is particularly important when there is so much uncertainty over the user interface. In particular, it would be good to clearly specify the annotation format and implement a programmatic interface to it independent of the standard (human) user interface. That way it is easy to switch interfaces (or have multiple ones). Given that annotations are essentially just comments it would seem sensible to try and reuse an existing format such as Atom (or RSS) for the machine interface to the comment store. [marginalia] already had such a format based on Atom. I’ve recently reimplemented a stripped-down version of this format for the annotation store backend in python, in preparation for adding annotation support to the openshakespeare web interface, see:

http://project.knowledgeforge.net/shakespeare/svn/annotater/trunk/

Of course, as discussed above, this isn’t quite as simple as it looks, as your user interface can constrain what you can and can’t store (using a blog approach you can’t store ranges, and from what I have read getting reliable character offsets is problematic). Nevertheless it seems the best place to start.