Category Archives: Open Data

Putting Open at the Heart of the Digital Age



I’m Rufus Pollock.

In 2004 I founded a non-profit called Open Knowledge

The mission we set ourselves was to open up all public interest information – and see it used to create insight that drives change.

What sort of public interest information? In short, all of it. From big issues like how our government spends our taxes or how fast climate change is happening to simple, everyday things like when the next bus is arriving or the exact address of that coffee shop down the street.

For the last decade, we have been pioneers and leaders in the open data and open knowledge movement. We wrote the original definition of open data in 2005, and we’ve helped unlock thousands of datasets. We’ve built tools like CKAN, which powers dozens of open data portals, including those of the US and the UK. And we’ve created a network of individuals and organizations in more than 30 countries, all working to make information open because they want to drive insight and change.

But today I’m not here to talk specifically about Open Knowledge or what we do.

Instead, I want to step back and talk about the bigger picture. I want to talk to you about the digital age, where all that glitters is bits, and why we need to put openness at its heart.

Gutenberg and Tyndale

To do that I first want to tell you a story. It’s a true story and it happened a while ago – nearly 500 years ago. It involves two people. The first one is Johannes Gutenberg. In 1450 Gutenberg invented this: the printing press. Like the Internet in our own time, it was revolutionary. It is estimated that before the printing press was invented, there were just 30,000 books in all of Europe. 50 years later, there were more than 10 million. Revolutionary, then, though it moved at the pace of the fifteenth century, a pace of decades not years. Over the next five hundred years, Gutenberg’s invention would transform our ability to share knowledge and help create the modern world.

The second is William Tyndale. He was born in England around 1494, so he grew up in the world of Gutenberg’s invention.

Tyndale followed the classic path of a scholar at the time and was ordained as a priest. In the 1510s, when he was still a young man, the Reformation still hadn’t happened and the Pope was the supreme ruler of a united church across Europe. The Church – and the papacy – guarded its power over knowledge, forbidding the translation of the Bible from Latin so that only its official priests could understand and interpret it.

Tyndale had an independent mind. There’s a story that he got into an argument with a local priest. The priest told him:

“We are better to be without God’s laws than the Pope’s.”

Tyndale replied:

“If God spare my life ere many years, I will cause the boy that drives the plow to know more of the scriptures than you!”

What Tyndale meant was that he would open up the Bible to everyone.

Tyndale made good on his promise. Having fled abroad to avoid persecution, between 1524 and 1527 he produced the first printed English translation of the Bible which was secretly shipped back to England hidden in the barrels of merchant ships. Despite being banned and publicly burnt, his translation spread rapidly, giving ordinary people access to the Bible and sowing the seeds of the Reformation in England.

However, Tyndale did not live to see it. In hiding because of his efforts to liberate knowledge, he was betrayed and captured in 1535. Convicted of heresy for his work, on the 6th of October 1536 he was strangled and then burnt at the stake in a prison yard at Vilvoorde Castle, just north of modern-day Brussels. He was just over 40 years old.


So let’s fast forward now back to today, or not quite today – the late 1990s.

I go to college and I discover the Internet.

It just hit me: wow! I remember days spent just surfing around. I’d always been an information junkie, and I felt like I’d found this incredible, never-ending information funfair.

And I got that I was going to grow up in a special moment, at the transition to an information age. We’d be living in this magical world, where the main thing we create and use – information – could be instantaneously and freely shared with everyone on the whole planet.

But … Why Openness?

So, OK the Internet’s awesome …

Bet you haven’t heard that before!

BUT … – and this is the big but.

The Internet is NOT my religion.

The Internet – and digital technology – are not enough.

I’m not sure I have a religion at all, but if I believe in something in this digital age, I believe in openness.

This talk is not about technology. It’s about how putting openness at the heart of the digital age is essential if we really want to make a difference, really create change, really challenge inequity and injustice.

Which brings me back to Tyndale and Gutenberg.

Tyndale revisited

Because, you see, the person that inspired me wasn’t Gutenberg. It was Tyndale.

Gutenberg created the technology that laid the groundwork for change. But the printing press could very well have been used to pump out more Latin bibles, which would then only have made it easier for local priests to be in charge of telling their congregations the word of God every Sunday. More of the same, basically.

Tyndale did something different. Something so threatening to the powers that be that he was executed for it.

What did he do? He translated the Bible into English.

Of course, he needed the printing press. In a world of hand-copying by scribes or painstaking woodcut printing, it wouldn’t make much difference if the Bible was in English or not because so few people could get their hands on a copy.

But, the printing press was just the means: it was Tyndale’s work putting the Bible in everyday language that actually opened it up. And he did this with the express purpose of empowering and liberating ordinary people – giving them the opportunity to understand, think and decide for themselves. This was open knowledge as freedom, open knowledge as systematic change.

Now I’m not religious, but when I talk about opening up knowledge I am coming from a similar place: I want anyone and everyone to be able to access, build on and share that knowledge for themselves and for any purpose. I want everyone to have the power and freedom to use, create and share knowledge.

Knowledge power in the 16th century was controlling the Bible. Today, in our data-driven world, it’s much broader: it’s about everything from maps to medicines, sonnets to statistics. It’s about opening up all the essential information and building insight and knowledge together.

This isn’t just dreaming – we have inspiring, concrete examples of what this means. Right now I’ll highlight just two: medicines and maps.

Example: Medicines

Every day, millions of people around the world take billions of pills.

Whether those drugs actually do you good – and what side effects they have – is obviously essential information for researchers, for doctors, for patients, for regulators – pretty much everyone.

We have a great way of assessing the effectiveness of drugs: randomized controlled trials, in which a drug is compared to its next best alternative.

So all we need is all the data on all those trials (this would be non-personal information only – any information that could identify individuals would be removed). In an Internet age you’d imagine that this would be a simple matter – we just need all the data openly available and maybe some way to search it.

You’d be wrong.

Many studies, especially negative ones, are never published – the vast majority of studies are funded by industry, which uses restrictive contracts to control what gets published. Even where pharmaceutical companies are required to report on the clinical trials they perform, the regulator often keeps the information secret or publishes it as 8,000-page PDFs, each page hand-scanned and unreadable by a computer.

If you think I’m joking, I’ll give just one very quick example, which comes straight from Ben Goldacre’s Bad Pharma. In 2007 researchers in Europe wanted to review the evidence on a diet drug called rimonabant. They asked the European regulator for access to the original clinical trials information submitted when the drug was approved. For three years they were refused access on a variety of grounds. When they did get access, this is what they got initially – that’s right, 60 pages of blacked-out PDF.

We might think this was funny if it weren’t so deadly serious: in 2009, just before the researchers finally got access to the data, rimonabant was removed from the market on the grounds that it increased the risk of serious psychiatric problems and suicide.

This situation needs to change.

And I’m happy to say something is happening. Working with Ben Goldacre, author of Bad Pharma, we’ve just started the OpenTrials project. This will bring together all the data on all the trials, link it together, and make it open so that everyone from researchers to regulators, doctors to patients, can find it, access it and use it.

Example: Maps

Our second example is maps. If you were looking for the “scriptures” of this age of digital data, you might well pick maps, or, more specifically the geographic data on which they are built. Geodata is everywhere: from every online purchase to the response to the recent earthquakes in Nepal.

Though you may not realize it, most maps are closed and proprietary – you can’t get the raw data that underpins the map, you can’t alter it or adapt it yourself.

But since 2004 a project called OpenStreetMap has been creating a completely open map of the planet – raw geodata and all. Not only is it open for access and reuse, the database itself is collaboratively built by hundreds of thousands of contributors from all over the world.

What does this mean? Just one example. Because of its openness, OpenStreetMap is perfect for rapid updating when disaster strikes – showing which bridges are out, which roads are still passable, which buildings are still standing. For example, when a disastrous earthquake struck Nepal in April this year, volunteers updated 13,199 miles of roads and 110,681 buildings in under 48 hours, providing crucial support to relief efforts.

The Message not the Medium

To repeat then: technology is NOT teleology. The medium is NOT the message – and it’s the message that matters.

The printing press made possible an “open” bible but it was Tyndale who made it open – and it was the openness that mattered.

Digital technology gives us unprecedented potential for creativity, for sharing, for freedom. But these are possibilities, not inevitabilities. Technology alone does not make the choice for us.

Remember that we’ve been here before: the printing press was revolutionary but we still ended up with a print media that was often dominated by the few and the powerful.

Think of radio. If you read how people talked about it in the 1910s and 1920s, it sounds like the way we talk about the Internet today. Radio was going to revolutionize human communications and society. It was going to enable a peer-to-peer world where everyone could broadcast, it was going to allow new forms of democracy and politics, and so on. What happened? We got a one-way medium, controlled by the state and a few huge corporations.

Look around you today.

The Internet’s costless transmission can just as easily create – and is creating – information empires and information robber barons as digital democracy and information equality.

We already know that this technology offers unprecedented opportunities for surveillance, for monitoring, for tracking. It can just as easily exploit us as empower us.

We need to put openness at the heart of this information age, and at the heart of the Net, if we are really to realize its possibilities for freedom, empowerment, and connection.

The fight, then, is over the soul of this information age, and we have a choice.

A choice of open versus closed.

Of collaboration versus control.

Of empowerment versus exploitation.

It’s a long road ahead – longer perhaps than our lifetimes. But we can walk it together.

In this 21st-century knowledge revolution, William Tyndale isn’t one person. It’s all of us, making small and big choices: from getting governments and private companies to release their data, to building open databases and infrastructures together; from choosing apps on your phone that are built on openness, to using social networks that give you control of your data rather than taking it from you.

Let’s choose openness, let’s choose freedom, let’s choose the infinite possibilities of this digital age by putting openness at its heart.

Thank you.

Open Data Can Speed up Research – Andy Beck of Harvard Medical School

Dr Andy Beck of Harvard Medical School in Reddit AMA thread:

Interesting question. I think there is a lot of value in actually showing the utility of open data, by using it creatively to answer important research questions. There are now huge public databases available and growing every day (e.g., …). I think it’s powerful to show a student that, using open data, they can answer a question in 5 minutes that previously may have taken an entire PhD dissertation to complete. In addition to advocating through use of data, supporting high quality open access journals is also a great way to advocate. [Source]

A Data Revolution that Works for All of Us

Many of today’s global challenges are not new. Economic inequality, the unfettered power of corporations and markets, the need to cooperate to address global problems and the unsatisfactory levels of accountability in democratic governance – these were as much problems a century ago as they remain today.

What has changed, however – and most markedly – is the role that new forms of information and information technology could potentially play in responding to these challenges.

What’s going on?

The incredible advances in digital technology mean we have an unprecedented ability to create, share and access information. Furthermore, these technologies are increasingly not just the preserve of the rich, but are available to everyone – including the world’s poorest. As a result, we are living in a (veritable) data revolution – never before has so much data – public and personal – been collected, analysed and shared.

However, the benefits of this revolution are far from being shared equally.

On the one hand, some governments and corporations are already using this data to greatly increase their ability to understand – and shape – the world around them. Others, however, including much of civil society, lack the necessary access and capabilities to truly take advantage of this opportunity. Faced with this information inequality, what can we do?

How can we enable people to hold governments and corporations to account for the decisions they make, the money they spend and the contracts they sign? How can we unleash the potential for this information to be used for good – from accelerating research to tackling climate change? And, finally, how can we make sure that personal data collected by governments and corporations is used to empower rather than exploit us?

So how should we respond?

Fundamentally, we need to make sure that the data revolution works for all of us. We believe that key to achieving this is to put “open” at the heart of the digital age. We need an open data revolution.

We must ensure that essential public-interest data is open, freely available to everyone. Conversely, we must ensure that data about me – whether collected by governments, corporations or others – is controlled by and accessible to me. And finally, we have to empower individuals and communities – especially the most disadvantaged – with the capabilities to turn data into the knowledge and insight that can drive the change they seek.

In this rapidly changing information age – where the rules of the game are still up for grabs – we must be active, seizing the opportunities we have, if we are to ensure that the knowledge society we create is an open knowledge society, benefiting the many not the few, built on principles of collaboration not control, sharing not monopoly, and empowerment not exploitation.

Save the Date – OGP Pre-Conference, London Wednesday 30th October

This autumn the Open Government Partnership Annual Conference is coming to London and will take place on the 31st October and 1st November. As a lead-in to the main event, OGP is planning a 1-day civil society Pre-Conference event on Wednesday 30th October, and we here at the Open Knowledge Foundation will be collaborating with them on it.

An informal group discussion on open government data

The aim is for this to be informal with lots of open space and a collaboratively organized schedule with activities and discussions like:

  • What does civil society want from OGP?
  • What’s next for open government data and open government?
  • Small group conversations about challenges and what can be learnt from other initiatives like EITI, IATI and the like
  • Workshops and data expeditions
  • Space for individual communities groups to meet, share and plan
  • Your suggestion here

If you’re interested you can pre-register now so as to be notified once registration opens and more information becomes available.

Pre-register now »

Further details coming soon!

Git (and Github) for Data

The ability to do “version control” for data is a big deal. There are various options, but one of the most attractive is to reuse existing tools for doing this with code, like git and mercurial. This post describes a simple “data pattern” for storing and versioning data using those tools, which we’ve been using for some time and have found to be very effective.


The ability to do revisioning and versioning of data – storing the changes made and sharing them with others, especially in a distributed way – would be a huge benefit to the (open) data community. I’ve discussed why at some length before (see also this earlier post) but to summarize:

  • It allows effective distributed collaboration – you can take my dataset, make changes, and share those back with me (and different people can do this at once!)
  • It allows one to track provenance better (i.e. what changes came from where)
  • It allows for sharing updates and synchronizing datasets in a simple, effective way – e.g. an automated way to get last month’s GDP or employment data without pulling the whole file again

There are several ways to address the “revision control for data” problem. The approach here is to get data into a form that lets us take existing, powerful distributed version control systems designed for code, like git and mercurial, and apply them to the data. As such, the best GitHub for data may, in fact, be GitHub (of course, you may want to layer data-specific interfaces on top of git(hub) – this is what we do with …).

There are limitations to this approach and I discuss some of these and alternative models below. In particular, it’s best for “small (or even micro) data” – say, under 10Mb or 100k rows. (One alternative model can be found in the very interesting Dat project recently started by Max Ogden — with whom I’ve talked many times on this topic).

However, given the maturity and power of the tooling – and its likely evolution – and the fact that so much data is small, we think this approach is very attractive.

The Pattern

The essence of the pattern is:

  1. Storing data as line-oriented text, specifically as CSV[1] (comma-separated values) files. “Line-oriented text” just means that individual units of the data, such as a row of a table (or an individual cell), correspond to one line.[2]

  2. Using best-of-breed (code) versioning tools like git or mercurial to store and manage the data.

Line-oriented text is important because it enables the powerful distributed version control tools like git and mercurial to work effectively (this, in turn, is because those tools are built for code which is (usually) line-oriented text). It’s not just version control though: there is a large and mature set of tools for managing and manipulating these types of files (from grep to Excel!).
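To make that concrete, here is a minimal sketch in Python (not taken from the original post; the file name, field names and values are invented) of serializing a small table as line-oriented CSV, so that editing one record later shows up as a one-line change in a git diff:

```python
# Minimal sketch: write a small table as line-oriented CSV so that one record
# corresponds to one line. The data and file name are invented for illustration.
import csv

rows = [
    {"country": "gb", "year": 2012, "gdp_growth": 0.3},
    {"country": "us", "year": 2012, "gdp_growth": 2.2},
]

with open("gdp.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "year", "gdp_growth"])
    writer.writeheader()
    writer.writerows(rows)

# After committing gdp.csv, correcting a single figure and re-running this
# script produces a diff that touches only the affected line, which is what
# makes line-oriented tools like git diff and git merge useful for data.
```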

In addition to the basic pattern, there are a few optional extras you can add:

  • Store the data in GitHub (or Gitorious or Bitbucket or …) – all the examples below follow this approach
  • Turn the collection of data into a Simple Data Format data package by adding a datapackage.json file which provides a small set of essential information like the license, sources, and schema (this column is a number, this one is a string) – a minimal sketch of such a file follows this list
  • Add the scripts you used to process and manage data — that way everything is nicely together in one repository
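As a rough illustration of that datapackage.json file, here is a minimal sketch in Python; the dataset name, license id, source and field names are invented for the example (matching the hypothetical gdp.csv above) and are not taken from any real package:

```python
# Minimal, hypothetical datapackage.json for the gdp.csv file above; the name,
# license, source and schema are illustrative, not a definitive specification.
import json

datapackage = {
    "name": "gdp",
    "licenses": [{"id": "odc-pddl"}],
    "sources": [{"name": "Example statistics office"}],
    "resources": [
        {
            "path": "gdp.csv",
            "schema": {
                "fields": [
                    {"name": "country", "type": "string"},
                    {"name": "year", "type": "integer"},
                    {"name": "gdp_growth", "type": "number"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(datapackage, f, indent=2)
```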

What’s good about this approach?

The set of tools that exists for managing and manipulating line-oriented files is huge and mature. In particular, powerful distributed version control systems like git and mercurial are already extremely robust ways to do distributed, peer-to-peer collaboration around code, and this pattern takes that model and makes it applicable to data. Here are some concrete examples of why it’s good.

Provenance tracking

Git and mercurial provide a complete history of individual contributions with “simple” provenance via commit messages and diffs.

Example of commit messages

Peer-to-peer collaboration

Forking and pulling data allows independent contributors to work on it simultaneously.

Timeline of pull requests

Data review

By using git or mercurial, tools for code review can be repurposed for data review.

Pull screen

Simple packaging

The repo model provides a simple way to store data, code, and metadata in a single place.

A repo for data


This method of storing and versioning data is very low-tech. The format and tools are both very mature and ubiquitous. For example, every spreadsheet and every relational database can handle CSV. Every Unix platform has a suite of tools like grep, sed and cut that can be used on these kinds of files.
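For instance, here is a small Python sketch (standard library only; the file and column names are the same invented ones used above) doing the sort of row filtering and column selection you might otherwise do with grep and cut:

```python
# Filter rows and select columns from the CSV using only the standard library,
# much as you might with grep and cut on the command line.
import csv

with open("gdp.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["country"] == "gb":                 # roughly: grep '^gb,'
            print(row["year"], row["gdp_growth"])  # roughly: cut -d, -f2,3
```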


We’ve been using this approach for a long time: in 2005 we first stored CSVs in Subversion, then in Mercurial, and then, when we switched to git (and GitHub) 3 years ago, we started storing them there. In 2011 we started the datasets organization on GitHub, which contains a whole list of datasets managed according to the pattern above. Here are a couple of specific examples:

Note: most of these examples not only show CSVs being managed in GitHub but are also Simple Data Format data packages – see the datapackage.json they contain.


Limitations and Alternatives

Line-oriented text and its tools are, of course, far from perfect solutions to data storage and versioning. They will not work for datasets of every shape and size, and in some respects they are awkward tools for tracking and merging changes to tabular data. For example:

  • Simple actions on data stored as line-oriented text can lead to a very large changeset. For example, swapping the order of two fields (= columns) leads to a change in every single line. Given that diffs, merges, etc. are line-oriented, this is unfortunate.[3]
  • It works best for smallish data (e.g. < 100k rows, < 50MB files, optimally < 5MB files). git and mercurial don’t handle big files that well, and features like diffs get more cumbersome with larger files.[4]
  • It works best for data made up of lots of similar records, ideally tabular data. In order for line-oriented storage and tools to be appropriate, you need the record structure of the data to fit with the CSV line-oriented structure. The pattern is less good if your CSV is not very line-oriented (e.g. you have a lot of fields with line breaks in them), causing problems for diff and merge.
  • CSV lacks a lot of information, e.g. information on the types of fields (everything is a string). There is no way to add metadata to a CSV without compromising its simplicity or making it no longer usable as pure data. You can, however, add this kind of information in a separate file, and this is exactly what the Data Package standard provides with its datapackage.json file.

The most fundamental limitations above all arise from applying line-oriented diffs and merges to structured data whose atomic unit is not a line (it’s a cell, or a transform of some kind like swapping two columns).

The first issue discussed above, where a simple change to a table is treated as a change to every line of the file, is a clear example. In a perfect world, we’d have both a convenient structure and a whole set of robust tools to support it, e.g. tools that recognize swapping two columns of a CSV as a single, simple change or that work at the level of individual cells.
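As a rough illustration of what such a cell-level tool might look like, here is a Python sketch that compares two versions of a keyed CSV table and reports changes per cell rather than per line. This is purely illustrative: it is not an existing tool or API, and the key column and file names are assumptions.

```python
# Sketch of a cell-level diff between two versions of a keyed CSV table.
# Hypothetical helper, not an existing tool: it reports (key, column, old, new)
# tuples, ignoring purely cosmetic changes such as reordered columns.
import csv

def load(path, key):
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

def cell_diff(old_path, new_path, key="country"):
    old, new = load(old_path, key), load(new_path, key)
    changes = []
    for k in old.keys() & new.keys():      # records present in both versions
        for col in old[k]:
            if old[k].get(col) != new[k].get(col):
                changes.append((k, col, old[k].get(col), new[k].get(col)))
    return changes

# Example: cell_diff("gdp_v1.csv", "gdp_v2.csv") might return
# [("gb", "gdp_growth", "0.3", "0.4")], whereas a line-oriented diff of the
# same two files would flag every line if the columns had also been reordered.
```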

Fundamentally, a revision system is built around a diff format and a merge protocol. Get these right and much of the rest follows. The three basic options you have are:

  • Serialize to line-oriented text and use the great existing tools like git (what we’ve described above)
  • Identify the atomic structure (e.g. document) and apply diffs at that level (think CouchDB, or standard copy-on-write for an RDBMS at row level)
  • Record transforms (e.g. Refine)

At the Open Knowledge Foundation we built a system along the lines of (2) and have been involved in exploring and researching both (2) and (3) – see our earlier work on changes and syncing for data. These options are definitely worth exploring – and, for example, Max Ogden, with whom I’ve had many great discussions on this topic, is currently working on an exciting project called Dat, a collaborative data tool which will use the “sleep” protocol.

However, our experience so far is that the line-oriented approach beats any currently available options along those other lines (at least for smaller sized files!).

Having already been storing data in GitHub like this for several years, we recently launched a site which is explicitly based on this approach:

  • Data is CSV stored in git repos on GitHub
  • All datasets are data packages with datapackage.json metadata
  • The frontend site is ultra-simple – it just provides a catalog and API and pulls data directly from GitHub

Why line-oriented

Line-oriented text is the natural form of code and so is supported by a huge number of excellent tools. But line-oriented text is also the simplest and most parsimonious form for storing general record-oriented data—and most data can be turned into records.

At its most basic, structured data requires a delimiter for fields and a delimiter for records. Comma- or tab-separated values (CSV, TSV) files are a very simple and natural implementation of this encoding. They delimit records with the most natural separation character besides the space, the line break. For a field delimiter, since spaces are too common in values to be appropriate, they naturally resort to commas or tabs.

Version control systems require an atomic unit to operate on. A versioning system for data can quite usefully treat records as the atomic units. Using line-oriented text as the encoding for record-oriented data automatically gives us a record-oriented versioning system in the form of existing tools built for versioning code.

  1. Note that, by CSV, we really mean “DSV”, as the delimiter in the file does not have to be a comma. However, the row terminator should be a line break (or a carriage return plus line break). 

  2. CSVs do not always have one row to one line (it is possible to have line-breaks in a field with quoting). However, most CSVs are one-row-to-one-line. CSVs are pretty much the simplest possible structured data format you can have. 

  3. As a concrete example, the merge function will probably work quite well in reconciling two sets of changes that affect different sets of records, hence lines. Two sets of changes which each move a column will not merge well, however. 

  4. For larger data, we suggest swapping out git (and e.g. GitHub) for simple file storage like S3. Note that S3 can support basic copy-on-write versioning. However, being copy-on-write, it is comparatively very inefficient. 

Shuttleworth Fellowship Quarterly Review – Feb 2012

As part of my Shuttleworth Fellowship I’m preparing quarterly reviews of what I and the Open Knowledge Foundation have been up to. So, herewith are some highlights from the last 3 months.


  • Substantial new project support from several funders including support for Science working group and Economics working group
  • Our CKAN Data Management System selected in 2 major new data portal initiatives
  • Continuing advance of projects across the board with several projects reaching key milestones (v1.0 or beta release, adoption by third parties)
  • Rapid expansion of chapters and local groups — e.g. London Meetup now has more than 100 participants, new chapters in Belgium and Switzerland are nearly finalized
  • Completion of a major upgrade of our core web presence, with new branding and theme used across our network of sites (now numbering more than 40)
  • Announcement of the School of Data, which drew huge attention from the community. This will be a joint Open Knowledge Foundation / P2PU project.
  • Major strengthening of organizational capacity with new staff


Major new project support including:

CKAN and the DataHub


  • Major breakthrough with achievement of simple data upload and management process – result of more than 9 months of work
  • OpenSpending now contains more than 30 datasets with ~7 million spending items (up from 2 datasets and ~200k items a year ago, and under 10 datasets and ~1.5m items just 4 months ago)
  • Substantial expansion in set of collaborators and a variety of new funding opportunities

Other Projects

  • BibServer and BibSoup, our bibliographic software and service, reached beta and have been receiving increasing attention

  • Public Domain Review celebrated its 1st Birthday. Some stats:

    • The Review now has more than 800 email subscribers and ~800 followers on Twitter
    • 20k visitors with over 40k page views per month
    • An increasing number of supporters making a monthly donation
  • Initiated a substantive collaboration on the PyBossa crowdsourcing platform with Shuttleworth Fellow Emeritus Francois Grey and his Citizen Cyberscience Centre

  • Annotator and AnnotateIt v1.0 Completed and Released

    • Annotator is now seeing uptake from several third-party projects and developers
    • Project components now have more than 100 followers on GitHub (up from ~20 in December)

Working Groups and Local Groups and Chapters

Working groups have continued to develop well:

  • New dedicated Working Group coordinator (Laura Newman)
  • Panton Fellowships run under auspices of Science Working Group
  • Funding of Economics Working Group

Rapid Chapter and local group development:

Additional items

Events and Meetings

Participated in numerous events and meetings including:

Shuttleworth Fellowship Bi-Annual Review

As part of my Shuttleworth Fellowship I’m preparing bi-annual reviews of what I — and projects I’m involved in — have been up to. So, herewith are some highlights from the last 6 months.

CKAN and the DataHub


  • Two major point releases of OpenSpending software v0.10 and v0.11 (v0.11 just last week!). Huge maturing and development of the system. Backend architecture now finalized after a major refactor and reworking.
  • The community has grown significantly, with almost 50 datasets now on OpenSpending and a growing group of core “data wranglers”
  • Spending Stories was a winner of the Knight News Challenge. Spending Stories will build on and extend OpenSpending.

Open Bibliography and the Public Domain

Open Knowledge Foundation and the Community

  • In September we received a 3 year grant from the Omidyar Network to help the Open Knowledge Foundation sustain and expand its community especially in the formation of new chapters
  • Completed a major recruitment process (Summer–Autumn 2011) to bring on more paid OKFN team members, including community coordinators, a foundation coordinator and developers
  • The Foundation participated in the launch of the Open Government Partnership and the CSO events surrounding the meeting
  • Working groups continuing to develop. Too much activity to summarize it all here but some highlights include:
    • WG Science Coordinator Jenny Molloy travelling to OSS2011 in SF to present Open Research Reports with Peter Murray-Rust
    • Open Economics WG developing an Open Knowledge Index in August
    • Open Bibliography working group’s work on a metadata guide.
    • Open Humanities / Open Literature working group winning the Inventare il Futuro competition with their idea to use the Annotator
  • Development of new Local Groups and Chapters
    • Lots of ongoing activities in existing local groups and chapters, such as those in Germany and Italy
    • In addition, interest from a variety of areas in the establishment of new chapters and local groups, for example in Brazil and Belgium
  • Start of work on OKFN labs

Meetups and Events

Talks and Events

  • Attended Open Government Partnership meeting in July in Washington DC and launch event in New York in September
  • Attended Chaos Computer Camp with other OKFNers in August near Berlin
  • September: Spoke at PICNIC in Amsterdam
  • October: Code for America Summit in San Francisco (plus meetings) – see partial writeup
  • October: Open Government Data Camp in Warsaw (organized by Open Knowledge Foundation)
  • November: South Africa – see this post on Africa@Home and Open Knowledge meetup in Cape Town


Talking at Legal Aspects of Public Sector Information (LAPSI) Conference in Milan

This week on Thursday and Friday I’ll be in Milan to speak at the 1st LAPSI (Legal Aspects of Public Sector Information) Primer & Public Conference.

I’m contributing to a “primer” session on The Perspective of Open Data Communities and then giving a conference talk on Collective Costs and Benefits in Opening PSI for Re-use in a session on PSI Re-use: a Tool for Enhancing Competitive Markets where I’ll be covering work by myself and others on pricing and regulation of PSI (see e.g. the “Cambridge Study” and the paper on the Economics of the Public Sector of Information).

Update: slides are up.

Community, Openness And Technology

PSI: Costs And Benefits Of Openness

Creative Commons and the Commons

Background: I first got involved with Creative Commons (CC) in 2004 soon after its UK chapter started. Along with Damian Tambini, the then UK ‘project lead’ for CC, and the few other members of ‘CC UK’, I spent time working to promote CC and its licenses in the UK (and elsewhere). By mid-2007 I was no longer very actively involved and to most intents and purposes was no longer associated with the organization. I explain this to give some background to what follows.

Creative Commons as a brand has been fantastically successful and is now very widely recognized. While in many ways this success has been beneficial for those interested in free/open material it has also raised some issues that are worth highlighting.

Creative Commons is not a Commons

Ironically, despite its name, Creative Commons – or more precisely its licenses – do not produce a commons. The CC licenses are not mutually compatible: for example, material with a CC Attribution-ShareAlike (by-sa) license cannot be intermixed with material licensed under any of the CC NonCommercial licenses (e.g. Attribution-NonCommercial, Attribution-NonCommercial-ShareAlike).

Given that a) the majority of CC licenses in use are ‘non-commercial’ and b) there is also large usage of ShareAlike (e.g. Wikipedia), this is an issue that affects a large set of ‘Creative Commons’ material.

Unfortunately, the presence of the word ‘Commons’ in CC’s name and the prominence of ‘remix’ in the advocacy around CC tend to make people think, falsely, that all CC licenses are in some way similar or substitutable.

The ‘Brand’ versus the Licenses

More and more frequently I hear people say (or more significantly write) things like: “This material is CC-licensed”. But as just discussed there is large, and very significant, variation in the terms of the different CC licenses. It appears that for many people the overall ‘Brand’ dominates the actual specifics of the licenses.

This is in marked contrast to the Free/Open Source software community, where even in the case of the Free Software Foundation’s licenses people tend to specify the exact license they are talking about.

Standards and interoperability are what really matter for licenses (cf the “Commons” terminology). Licensing and rights discussions are pretty dull for most people — and should be. They are important only because they determine what you and I can and can’t do, and specifically what material you and I can ‘intermix’ — possible only where licenses are ‘interoperable’.

To put it the other way round: licenses are interoperable if you can intermix freely material licensed under one of those licenses with material licensed under another. This interoperability is crucial and it is, in license terms, what underlies a true commons.

More broadly we are interested in a ‘license standard’: in knowing not only that a set of licenses are interoperable, but that they all allow certain things, for example for anyone to use, reuse and redistribute the licensed material (or, to put it in terms of freedom, that they guarantee those freedoms to users). This very need for a standard is why we created the Open Definition for content and data, building directly on the work on a similar standard (the Open Source Definition) in the Free/Open Source software community.

The existence of non-commercial

CC took a crucial decision in including NonCommercial licenses in their suite. Given the ‘Brand’ success of Creative Commons, the effect of including NC licenses has been to give them a status close to, if not identical with, the truly open, commons-supporting licenses in the CC suite.

There is a noticeable difference here with the software world, where NC-style licensing also exists, but under the ‘freeware’ and ‘shareware’ names (these terms aren’t always used consistently), and with this material clearly distinguished from Free/Open Source software.

As the CC brand has grown, there is a desire by some individuals and institutions to use CC licenses simply because they are CC licenses (this is also encouraged by the baking-in of CC licenses to many products and services). Faced with choosing a license, many people, and certainly many institutions, tend to go for the most restrictive option available (especially when the word commercial is in there – who wants to sanction exploitation of their work for gain by some third party!). Thus, it is no surprise that non-commercial licenses appear to be by far the most popular.

Without the NC option, some of these people would have chosen one of the open CC licenses instead. Of course, some would not have licensed at all (or at least not with a CC license), sticking with pure copyright or some other set of terms. Nevertheless, the benefit of gaining a clear dividing line, and of creating brand pressure for a real commons and real openness, would have been substantial – worth, in my opinion, the loss of the non-commercial option.

Structure and community

It is notable in the F/OSS community that most licenses, especially the most popular, are either not ‘owned’ by anyone (MIT/BSD) or are run by an organization with a strong community base (e.g. the Free Software Foundation). Creative Commons seems rather different. While there are public mailing lists, ultimately decisions regarding the licenses – and about crucial features thereof such as compatibility with third-party licenses – remain with CC central, based in San Francisco.

Originally, there was a fair amount of autonomy given to country projects but over time this autonomy has gradually been reduced (there are good reasons for this – such as a need for greater standardization across licenses). This has concrete effects on the terms in the licenses.

For example, for v3.0 the Netherlands team was asked to remove provisions which included things like database rights in their share-alike clause and instead standardize on a waiver of these additional rights (rights which are pretty important if you are doing data(base) licensing). Most crucially, the CC licenses reserve to Creative Commons as an organization the right to determine compatibility decisions. This is arguably the single most important aspect of licensing, at least in respect of interoperability and the Commons.

Creative Commons and Data

Update: as of September 2011 there has been further discussion between Open Data Commons and Creative Commons on these matters, especially regarding interoperability and Creative Commons v4.0.

From my first involvement in the ‘free/open’ area, I’d been interested in data licensing, both because of personal projects and requests from other people.

When first asked how to deal with this I’d recommended ‘modding’ a specific CC license (e.g. Attribution-Sharealike) to include provisions for data and data(bases). However, starting from 2006 there was a strong push from John Wilbanks, then at Science Commons but with the apparent backing of CC generally, against this practice as part of a general argument for ‘PD-only’ for data(bases) (with the associated implication that the existing CC licenses were content-only). While I respect John, I didn’t really agree with his arguments about PD-only and furthermore it was clear that there was a need in the community for open but non-PD licenses for data(bases).

In late 2007 I spoke with Jordan Hatcher and discovered the work he and Charlotte Waelde were doing for Talis to draft a new ‘open’ license for data(bases). I was delighted and started helping Jordan with these licenses – licenses that became the Open Data Commons PDDL and the ODbL. We sought input from CC during the drafting of these licenses, specifically the ODbL, but the primary response we had (from John Wilbanks and colleagues) was just “don’t do this”.

Once the ODbL was finalized we then contacted CC further about potential compatibility issues.

The initial response then was that, as CC did not recommend use of its licenses (other than CCZero) for data(bases), there should not be an issue since, as with CC licenses and software, there should be an ‘orthogonality’ of activity – CC licenses would license content, F/OSS licenses would license code, and data(base) licenses (such as the ODC ones) would license data. We pressed on this and had a phone call about it with Diane Peters and John Wilbanks in January 2010, with a follow-up email detailing the issues a bit later.

We’ve also explained on several occasions to senior members of CC central our desire to hear from CC on this issue and our willingness to look at ways to make any necessary amendments to ODC licenses (though obviously such changes would be conditional on full scrutiny by the Advisory Council and consultation with the community).

No response has been forthcoming. To this date, over a year later, we have yet to receive any response from CC, despite having been promised one at least three times (we’ve basically given up asking).

Further to this lack of response, and without any notice or discussion with ODC, CC recently put out a blog post in which they stated, in marked contrast to previous statements, that CC licenses were entirely suited to data. In many ways this is a welcome step (cf. my original efforts to use CC licenses for data, above) but CC have made no statement about a) how they would seek to address data properly or b) the relationship of these efforts to existing work in Open Data Commons, especially the ODbL. One can only assume, at least in the latter case, that the omission was intentional.

All of this has led me, at least, to wonder what exactly CC’s aims are here. In particular, is CC genuinely concerned with interoperability (beyond a simple ‘everyone uses CC’) and the broader interests of the community who use and apply their licenses?


Creating a true commons for content and data is incredibly important (it’s one of the main things I work on day to day). Creative Commons have done amazing work in this area but, as I outline above, there is an important distinction between the (open) commons and CC licenses.

Many organisations, institutions, governments and individuals are currently making important decisions about licensing and legal tools – in relation to opening up everything from scientific information, to library catalogues to government data. CC could play an important role in the creation of an interoperable commons of open material. The open CC licenses (CC0, CC-BY and CC-BY-SA) are an important part of the legal toolbox which enables this commons.

I hope that CC will be willing to engage constructively with others in the ‘open’ community to promote licenses and standards which enable a true commons, particularly in relation to data where interoperability is especially crucial.