
OpenHDI: Open Human Development Index

A few members of the Open Knowledge Foundation’s nascent open economics working group are having a code-sprint this Friday and Saturday to work on an app for the World Bank competition. The app is currently called ‘Open HDI’ (Human Development Index):

The idea is to look at ‘development beyond GDP’ by collecting weightings on particular aspects of ‘development’ (health, education, GDP, inequality) from users and using those weightings to build our own human development index.
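As a rough illustration of the idea, here is a minimal sketch that combines component scores using user-supplied weightings. It assumes a simple weighted arithmetic mean over scores already scaled to [0, 1]; the function names, the aggregation rule and the numbers are all hypothetical, not what the app actually does.

```python
def normalise(weights):
    """Scale user-supplied weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


def custom_index(scores, weights):
    """Weighted arithmetic mean of component scores (assumed to lie in [0, 1])."""
    w = normalise(weights)
    return sum(w[name] * scores[name] for name in w)


# Hypothetical component scores for one country (made-up numbers).
scores = {"health": 0.8, "education": 0.6, "gdp": 0.5, "inequality": 0.4}
# Hypothetical user weightings: health and education count double.
weights = {"health": 2, "education": 2, "gdp": 1, "inequality": 1}

print(round(custom_index(scores, weights), 3))
```

Normalising the weights first means users can enter them on any scale (sliders, points out of 10, and so on) without changing the result.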

We first talked about this a few months ago at the open economics online meetup. Dirk Heine and Guo Xu then put together an excellent demo version, and now we’re working to turn that into a full app!

Progress in the last 3 months

As part of my Shuttleworth Fellowship I’m preparing quarterly reports on what I’ve been up to. So, herewith are some highlights from the last 3 months.

Talks and Events

Open Data Projects


Where Does My Money Go? Spending Explorer using Protovis and jQuery

Over the last couple of months I’ve been playing around with Protovis in my spare time to create an interactive, pure JavaScript Government Spending Explorer for Where Does My Money Go? (datastore API):

Warning: this won’t work in IE (at the moment, due to lack of SVG support) and works best (i.e. fastest) in Chrome!

I’d be interested in any feedback, and in hearing about people’s experience with Protovis or any other JavaScript libraries (I’ve also used flot and thejit a bit). In particular, one thing currently lacking in Protovis is any animation (something that’s good in thejit …).


  • True ‘explorer’: you can choose any set of breakdown ‘keys’ to visualize
  • Primary ‘financial bubbles’ view with interactive navigation into bubbles
    • Support for arbitrary depth of data ‘tree’ so you can keep navigating down (though currently limited by user interface to select at most 3 levels)
  • Multiple other visualizations including treemap, sunburst, dendrogram and ‘icicle’
  • Time support
  • View the source data in table or as json
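To make the ‘breakdown keys’ and arbitrary-depth ‘tree’ idea concrete, here is a minimal sketch in Python (the explorer itself is JavaScript). It nests flat spending records by whichever keys are chosen, summing amounts at each level. The record fields (`dept`, `region`, `amount`) and the function name are hypothetical and not taken from the datastore API.

```python
def build_tree(records, keys):
    """Nest flat records by the given breakdown keys, summing amounts per node."""
    root = {"name": "total", "amount": 0.0, "children": {}}
    for rec in records:
        node = root
        node["amount"] += rec["amount"]
        # Walk down one level per breakdown key, creating nodes as needed.
        for key in keys:
            label = rec[key]
            child = node["children"].setdefault(
                label, {"name": label, "amount": 0.0, "children": {}}
            )
            child["amount"] += rec["amount"]
            node = child
    return root


# Made-up records purely for illustration.
records = [
    {"dept": "Health", "region": "London", "amount": 10.0},
    {"dept": "Health", "region": "Wales", "amount": 5.0},
    {"dept": "Education", "region": "London", "amount": 7.0},
]

tree = build_tree(records, ["dept", "region"])
print(tree["children"]["Health"]["amount"])
```

Because the depth is simply `len(keys)`, swapping the key order (say `["region", "dept"]`) gives a different breakdown of the same data, which is what lets a bubble, treemap or sunburst view share one structure.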

I’m sure there’s tons to improve especially on the usability (e.g. should default labels have amounts in them?) so if you take a look please let me know any feedback.

Some specific limitations:

  • Does not work in IE — but hope to fix this using svg.js soon
  • Colours and general ‘look’ could be improved — help wanted!
  • Occasional bugs e.g. weird redraws — if you find one please let me know

Author “Significance” From Catalogue Data

Continuing the series of posts on analyzing catalogue data, here are some stats on author “significance”, as measured by the number of book entries (‘items’) for that author in the Cambridge University Library catalogue from 1400-1960 (there are 1m+ such entries).

I’ve termed this measure “significance” (with intentional quotes) as it co-mingles a variety of factors:

  • Prolificness — how many distinct works an author produced (since usually each work will get an item)
  • Popularity — this influences how many times the same work gets reissued as a new ‘item’ and the library decision to keep the item
  • Merit — as for popularity

The following table shows the top 50 authors by “significance”. Some of the authors aren’t real people but entities such as “Great Britain. Parliament” and for our purposes can be ignored. What’s most striking to me is how closely the listing correlates with the standard literary canon. Other features of note:

  • Shakespeare is number 1 (2)
  • Classics (Latin/Greek) authors are well-represented with Cicero at number 2 (4), Horace at 5 (9), followed by Homer, Euripides, Ovid, Plato, Aeschylus, Xenophon, Sophocles, Aristophanes and Euclid.
  • Surprise entries (from a contemporary perspective): Hannah More, Oliver Goldsmith, Gilbert Burnet (perhaps accounted for by how prolific he was).
  • Also surprising are the limited entries from 19th century UK, with only Scott (26), Dickens (28) and Byron (41)


Rank  No. of Items  Name
1     3112          Great Britain. Parliament.
2     1154          Shakespeare, William
3     1076          Church of England.
4     973           Cicero, Marcus Tullius
5     825           Great Britain.
6     766           Catholic Church.
7     721           Erasmus, Desiderius
8     654           Defoe, Daniel
9     620           Horace
10    599           Aristotle
11    547           Voltaire
12    539           Virgil
13    527           Swift, Jonathan
14    520           Goethe, Johann Wolfgang Von
15    486           Rousseau, Jean-Jacques
16    479           Homer
17    444           Milton, John
18    388           Sterne, Laurence
19    387           England and Wales. Sovereign (1660-1685 : Charles II)
20    386           Euripides
21    372           Ovid
22    358           Goldsmith, Oliver
23    358           Plato
24    351           Wang
25    349           Alighieri, Dante
26    338           Scott, Walter (Sir)
27    326           More, Hannah
28    322           Dickens, Charles
29    315           Aeschylus
30    304           Burnet, Gilbert
31    302           Luther, Martin
32    295           Dryden, John
33    290           Xenophon
34    280           Sophocles
35    262           Pope, Alexander
36    259           Fielding, Henry
37    258           Li
38    250           Calvin, Jean
39    248           Zhang
40    247           Aristophanes
41    247           Byron, George Gordon Byron (Baron)
42    247           Bacon, Francis
43    247           Chen
44    245           Terence
45    241           Euclid
46    235           Augustine (Saint, Bishop of Hippo.)
47    232           Burke, Edmund
48    223           Johnson, Samuel
49    222           Bunyan, John
50    222           De la Mare, Walter

Top 50 authors based on CUL Catalogue 1400-1960
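The ranking itself is just a count of items per author, sorted. A minimal sketch of that step follows, including the second ‘people only’ rank used in the notes above (e.g. Shakespeare at 1 (2)). It assumes the catalogue has already been reduced to one author name per item; the function name and the `entities` argument are hypothetical, and this is not the actual `person_work_and_item_counts` code.

```python
from collections import Counter


def rank_authors(item_authors, entities=(), n=50):
    """Return (rank, person_rank, count, author) rows for the n most common authors.

    person_rank skips the names listed in `entities` (corporate authors such
    as parliaments or churches) and is None for those names.
    """
    counts = Counter(item_authors)
    rows = []
    person_rank = 0
    for rank, (author, count) in enumerate(counts.most_common(n), start=1):
        if author in entities:
            rows.append((rank, None, count, author))
        else:
            person_rank += 1
            rows.append((rank, person_rank, count, author))
    return rows


# Toy input: one author string per catalogue item (made-up counts).
item_authors = (
    ["Great Britain. Parliament."] * 3
    + ["Shakespeare, William"] * 2
    + ["Cicero, Marcus Tullius"]
)

for row in rank_authors(item_authors, entities={"Great Britain. Parliament."}):
    print(row)
```

Which names count as entities would have to be decided by hand or by some heuristic; here they are simply passed in.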

The other thing we could look at is the overall distribution of titles per author and how it varies with rank (a classic “is it a power law” question). Below are the histogram (NB log scale for counts) together with a plot of rank against count (which equates, very crudely, to a transposed plot of the tail of the histogram …). In both cases it looks (!) like a power-law is a reasonable fit given the (approximate) linearity, but this should be backed up with a proper Kolmogorov-Smirnov (K-S) test.


Histogram of items-per-author distribution (log-log)


Rank versus no. of items (log-log)
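For the power-law question, a first step beyond eyeballing the log-log plots might look like the sketch below: estimate the exponent by maximum likelihood, then measure the K-S distance between the empirical and fitted distributions. This is only a sketch under stated assumptions. It uses the continuous power-law formulas as an approximation (item counts are discrete), takes `xmin` as given, and runs on synthetic data rather than the catalogue. The function names are hypothetical.

```python
import math


def fit_power_law(values, xmin):
    """Continuous MLE for alpha in p(x) ~ x^(-alpha), using only values >= xmin."""
    tail = sorted(v for v in values if v >= xmin)
    log_sum = sum(math.log(v / xmin) for v in tail)
    if not tail or log_sum == 0:
        raise ValueError("need values above xmin to estimate the exponent")
    alpha = 1.0 + len(tail) / log_sum
    return alpha, tail


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF of `tail` (sorted) and the fitted CDF."""
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


# Synthetic data: evenly spaced quantiles of a power law with alpha = 2.5.
n, true_alpha, xmin = 1000, 2.5, 1.0
sample = [xmin * (1 - (i + 0.5) / n) ** (-1 / (true_alpha - 1)) for i in range(n)]

alpha, tail = fit_power_law(sample, xmin)
print(round(alpha, 2), round(ks_distance(tail, xmin, alpha), 4))
```

A small distance on its own is not yet a test: turning it into a p-value usually means comparing it against distances obtained from data simulated under the fitted model.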


  • K-S tests
  • Extend data to present day
  • Check against other catalogue data
  • Look at occurrence of people in title names
  • Look at when items appear over time


The code to generate the table and graphs is in the open Public Domain Works repository, specifically the method ‘person_work_and_item_counts’ in this file: