A few members of the Open Knowledge Foundation’s nascent open economics working group are having a code sprint this Friday and Saturday to work on an app, currently called ‘Open HDI’ (Human Development Index), for the World Bank competition:
The idea is to look at ‘development beyond GDP’ by collecting weightings on particular aspects of ‘development’ (health, education, GDP, inequality) from users and using those weightings to build our own human development index.
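To make the idea concrete, here is a minimal sketch of how user weightings could be combined into a single index. It assumes each component has already been normalised to a 0..1 score and uses a plain weighted average. The function name, the scores and the weights are all made up for illustration, and the real app may well aggregate differently.

```python
def weighted_index(scores, weights):
    """Weighted average of normalised component scores (each in 0..1)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total


# Hypothetical component scores for an imaginary country (not real data).
scores = {"health": 0.8, "education": 0.6, "gdp": 0.5, "inequality": 0.7}

# Hypothetical weights from one user: health matters most, inequality ignored.
weights = {"health": 2, "education": 1, "gdp": 1, "inequality": 0}

index = weighted_index(scores, weights)
print(round(index, 3))
```

Dividing by the sum of the weights means users can enter weights on any scale (1 to 5, percentages, etc.) without changing the ranking the index produces.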
We first talked about this a few months ago at the open economics online meetup. Dirk Heine and Guo Xu then put together an excellent demo version: http://eutopia.guoxu.org/ and now we’re working to take that to the status of a full app!
As part of my Shuttleworth Fellowship I’m preparing quarterly reports on what I’ve been up to. So, herewith are some highlights from the last 3 months.
Talks and Events
Open Data Projects
Explorer for Where Does My Money Go? (datastore api):
Warning: won’t work in IE (at the moment, due to lack of SVG support) and
works best (i.e. fastest) in Chrome!
- True ‘explorer’: you can choose any set of breakdown ‘keys’ to visualize
- Primary ‘financial bubbles’ view with interactive navigation into bubbles
- Support for arbitrary depth of data ‘tree’ so you can keep
navigating down (though currently limited by user interface to select
at most 3 levels)
- Multiple other visualizations including treemap, sunburst,
dendrogram and ‘icicle’
- Time support
- View the source data in table or as json
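For anyone curious how "any set of breakdown keys" and "arbitrary depth" can fit together, here is a rough sketch of turning flat spending records into a nested tree, one level per chosen key. This is not the explorer's actual code: the function name, the field names and the sample records are hypothetical.

```python
def build_tree(records, keys, amount_field="amount"):
    """Nest flat records into a tree, one level per breakdown key."""
    root = {"name": "total", "amount": 0.0, "children": {}}
    for rec in records:
        node = root
        node["amount"] += rec[amount_field]
        for key in keys:
            label = rec[key]
            # Create the child node on first sight, then descend into it.
            child = node["children"].setdefault(
                label, {"name": label, "amount": 0.0, "children": {}}
            )
            child["amount"] += rec[amount_field]
            node = child
    return root


# Made-up records purely for illustration.
records = [
    {"dept": "Health", "region": "North", "amount": 10.0},
    {"dept": "Health", "region": "South", "amount": 5.0},
    {"dept": "Education", "region": "North", "amount": 7.0},
]

tree = build_tree(records, ["dept", "region"])
print(tree["children"]["Health"]["amount"])
```

Because the keys are just a list, swapping `["dept", "region"]` for `["region", "dept"]` gives a different breakdown of the same data, and each node carries the subtotal a bubble or treemap cell would need.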
I’m sure there’s tons to improve especially on the usability (e.g.
should default labels have amounts in them?) so if you take a look
please let me know any feedback.
Some specific limitations:
- Does not work in IE — but hope to fix this using svg.js soon
- Colours and general ‘look’ could be improved — help wanted!
- Occasional bugs e.g. weird redraws — if you find one please let me know
Continuing the series of posts on analyzing catalogue data, here are some stats on author “significance” as measured by the number of book entries (‘items’) for that author in the Cambridge University Library catalogue from 1400-1960 (there are 1m+ such entries).
I’ve termed this measure “significance” (with intentional quotes) as it co-mingles a variety of factors:
- Prolificness — how many distinct works an author produced (since usually each work will get an item)
- Popularity — this influences how many times the same work gets reissued as a new ‘item’ and the library decision to keep the item
- Merit — as for popularity
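In code terms the measure is nothing more than a count of items per author name. The sketch below shows the idea on a toy list. It is not the ‘person_work_and_item_counts’ method from the repository, and the function name and sample entries are invented for illustration.

```python
from collections import Counter


def rank_authors(item_authors, top=50):
    """Return (rank, item count, name) rows for the most frequent authors."""
    counts = Counter(item_authors)
    return [
        (rank, n, name)
        for rank, (name, n) in enumerate(counts.most_common(top), start=1)
    ]


# One author string per catalogue item (toy data, not the real catalogue).
items = [
    "Shakespeare, William",
    "Cicero, Marcus Tullius",
    "Shakespeare, William",
    "Great Britain. Parliament.",
    "Shakespeare, William",
    "Cicero, Marcus Tullius",
]

table = rank_authors(items)
for row in table:
    print(row)
```

Note that this counts whatever string appears in the author field, which is exactly why corporate "authors" end up in the ranking alongside real people.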
The following table shows the top 50 authors by “significance”. Some of the authors aren’t real people but entities such as “Great Britain. Parliament” and for our purposes can be ignored. What’s most striking to me is how closely the listing correlates with the standard literary canon. Other features of note:
- Shakespeare is number 1 among real people (2 in the table below, which also counts entities)
- Classical (Latin/Greek) authors are well represented, with Cicero at number 2 (4) and Horace at 5 (9), followed by Homer, Euripides, Ovid, Plato, Aeschylus, Xenophon, Sophocles, Aristophanes and Euclid.
- Surprise entries (from a contemporary perspective): Hannah More, Oliver Goldsmith and Gilbert Burnet (the last perhaps accounted for by his prolificness).
- Also surprising are the limited entries from the 19th century UK, with only Scott (26), Dickens (28) and Byron (41)
|Rank||No. of Items||Name|
|1||3112||Great Britain. Parliament.|
|3||1076||Church of England.|
|4||973||Cicero, Marcus Tullius|
|14||520||Goethe, Johann Wolfgang Von|
|19||387||England and Wales. Sovereign (1660-1685 : Charles II)|
|26||338||Scott, Walter (Sir)|
|41||247||Byron, George Gordon Byron (Baron)|
|46||235||Augustine (Saint, Bishop of Hippo.)|
|50||222||De la Mare, Walter|
Top 50 authors based on CUL Catalogue 1400-1960
The other thing we could look at is the overall distribution of items per author and how it varies with rank: a classic “is it a power law” question. Below are the histogram (NB log scale for counts) together with a plot of rank against count (which equates, very crudely, to a transposed plot of the tail of the histogram). In both cases it looks (!) like a power law is a reasonable fit given the (approximate) linearity on log-log axes, but this should be backed up with a proper K-S test.
Histogram of items-per-author distribution (log-log)
Rank versus no. of items (log-log)
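As a pointer to what a more careful check might involve, here is a small sketch that fits a power-law exponent by maximum likelihood and computes the K-S distance between the data and the fitted curve. It treats the counts as continuous (item counts are really integers, so this is only an approximation), the function name and `xmin` parameter are my own, and it stops short of a real test: turning the distance into a p-value needs something like a bootstrap on top. The sample below is synthetic, not the catalogue data.

```python
import math


def fit_power_law(values, xmin=1.0):
    """Fit p(x) ~ x^(-alpha) for x >= xmin; return (alpha, K-S distance)."""
    tail = sorted(v for v in values if v >= xmin)
    n = len(tail)
    log_sum = sum(math.log(v / xmin) for v in tail)
    if n == 0 or log_sum == 0:
        raise ValueError("need values above xmin to fit")
    # Maximum likelihood estimate for a continuous power law.
    alpha = 1.0 + n / log_sum
    ks = 0.0
    for i, v in enumerate(tail):
        # Fitted CDF: F(x) = 1 - (x / xmin)^(1 - alpha).
        model_cdf = 1.0 - (v / xmin) ** (1.0 - alpha)
        # Compare with the empirical CDF just before and just after v.
        ks = max(ks, abs((i + 1) / n - model_cdf), abs(model_cdf - i / n))
    return alpha, ks


# Synthetic sample drawn at evenly spaced quantiles of a power law
# with exponent 2.5 (for illustration only).
true_alpha = 2.5
n = 1000
sample = [
    (1.0 - (i + 0.5) / n) ** (-1.0 / (true_alpha - 1.0)) for i in range(n)
]

alpha, ks = fit_power_law(sample)
print(alpha, ks)
```

A small K-S distance says the fitted curve tracks the data closely. Approximate linearity on a log-log plot, by contrast, is also produced by other heavy-tailed distributions, which is why the plots alone are not conclusive.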
- K-S tests
- Extend data to present day
- Check against other catalogue data
- Look at occurrence of people in title names
- Look at when items appear over time
The code to generate the table and graphs is in the open Public Domain Works repository, specifically the method ‘person_work_and_item_counts’ in this file: http://knowledgeforge.net/pdw/hg/file/tip/contrib/stats.py