Shuttleworth Fellowship: Bootstrapping the Open Data Ecosystem

This month, I’m starting a year-long Shuttleworth Foundation Fellowship. Thanks to the Shuttleworth Foundation’s support I’ll be able to dedicate myself full-time to open knowledge and the Open Knowledge Foundation.

I’ll be working to promote open knowledge and open data around the world – open knowledge being any kind of content or data from sonnets to statistics, genes to geodata, that can be freely used, reused and redistributed.

Specifically I’ll be:

Promoting open knowledge in different domains such as the governmental, scientific, economic and bibliographic. This will involve working to develop communities of advocates and practitioners – organising regular meetings, bringing people together for events, working on standards and consensus building. Initiating and sustaining independent and active communities that use and promote open data in different fields is key to advancing open knowledge around the world.
Helping to grow the open data ecosystem, for example by adapting the tools and methodologies of the free/open source software community for use with open data. In particular, I’ll be working heavily to develop CKAN, an open source registry for datasets. CKAN, which I helped initiate as an Open Knowledge Foundation project, is being used by the UK in its official data catalogue, data.gov.uk, and already has community instances in many other countries around the world – including Austria, Canada, Finland, France, Germany, Hungary, Italy, New Zealand, and Norway. There’s a lot of interesting work both to extend CKAN and to improve associated tools like datapkg which enable “data developers” to automate working with datasets.
Working on specific projects that exemplify the open knowledge development process from end to end – going from opening up the raw data, to cleaning and aggregation, to re-exporting for reuse or integration into end user applications that explore, analyze and present the data. For example, Where Does My Money Go?, a project to allow users to explore and visually represent UK public spending, and Open Biblio, which will be bringing together a substantial amount of open bibliographic data as well as tools for its use and reuse.

Below is the full version of my proposal.

Bootstrapping the Open Data Ecosystem

Describe the world as it is.

During the last several decades the world has seen an explosion of digital technologies which have the potential to transform the way that knowledge is disseminated in society. The combination of powerful computers, increasingly cheap storage and global communication networks should in principle allow us to find what we’re looking for faster, to represent and explore vast and complicated datasets more intuitively, and to connect and cross reference different pieces of information more comprehensively. We now have the means to create a shared ecosystem of knowledge that would bring countless social and economic benefits – whether this is enabling collaboration on the development of life-saving pharmaceutical drugs, increasing transparency and improving public service delivery, or democratising access to cultural and educational materials.

However we are still in the process of making the transition from analogue to digital – from books to bits – and we’re still working out the details of laws, policies, practices, and technologies that will help us to optimise and improve the way knowledge is shared. In many areas of knowledge production we still have a long way to go. Our copyright laws mean that in many cases we are not permitted to republish or combine different sources of information available online. Publication workflows in government still revolve around polished documents for people to read in print rather than datasets which can be manipulated, analysed, and represented by computers. Scientists often do not publish the raw data underlying their research publications – meaning that potentially valuable experimental data or analysis can sit gathering dust. In many countries public bodies are often protective of their information assets, hoping to sell them to private companies rather than opening them up for reuse by the public. Across the board we still have vast silos of data that is not shared.

In some cases we have overcome some of the various obstacles to sharing knowledge. We have licenses and legal tools which can be used to give the green light to those wishing to reuse documents or datasets, akin to those used for open source software. We have technologies such as wikis and versioned databases to enable widespread collaboration in knowledge development. We have policy documents and good examples to point to which indicate the benefits of opening up data for others to reuse. But these are the exception rather than the rule. Many institutions and communities are now facing decisions which will help to shape the future of how knowledge is shared – and which will help to determine whether we will have a plethora of poorly connected walled gardens, or a shared ecosystem that everyone can benefit from.

What change do you want to make?

Ultimately I’d like to see a world in which open knowledge – knowledge that can be freely shared and used without restriction – becomes ubiquitous and routine. In particular I’d like to see the growth of an ecosystem of open data, using tools and methodologies similar to those used in the development of open source software. In order to achieve this we will need good tools and good documentation to help people open up their data and good examples to encourage them to do so. We will also need an active community of people creating, using, and promoting open data in different fields.

There are some amazing examples of open content and open data out there. Community-driven projects such as Wikipedia and OpenStreetMap are now almost household names. Over the past year there have been major official initiatives to open up government data for the public to reuse such as data.gov or data.gov.uk. Several national research bodies have official policies requiring open access to publicly funded research publications – and there is increasing support for open access to data in fields such as bioinformatics and chemistry. There is a tremendous opportunity to build on these examples and to encourage others to follow suit by opening up their information.

In open source software, perhaps we have the most sophisticated example of widespread and decentralised collaboration on the development of material which anyone is free to reuse. There are also many key similarities between the development of code and the development of data – indeed, in many ways, the distinction between code and data is beginning to blur.

The most important similarity is that both lend themselves naturally to being broken down into smaller chunks, which can then be reused and recombined. You can break down projects, whether they are data sets or software programs, into pieces of a manageable size – after all, the human brain can only handle so much data – and do it in a way that makes it easier to put the pieces back together again. And splitting things up means people can work independently on different pieces of a project, while others can work on putting the pieces back together.

By creating and sharing “packages” of data, using the same principles you see at work in Linux distributions, we can start to move towards something like Debian for data. Debian currently has something like 18,000 software packages, and these are maintained by hundreds, if not thousands, of people – many of whom have never met. We envision various communities being able to do the same thing with scientific and other types of data. This way, we can begin to divide and conquer the complexity inherent in the vast amounts of material being produced – complexity I don’t see us being able to manage any other way.
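
To make the “Debian for data” idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `DataPackage` record, the `install_order` helper and the dataset names are hypothetical, and this is not the metadata format or API of CKAN or datapkg. It only shows how small data packages that declare what they depend on can be put back together in a safe order.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class DataPackage:
    """Hypothetical metadata for one small, reusable chunk of data."""
    name: str
    version: str
    depends: tuple[str, ...] = ()  # names of packages this one builds on


def install_order(packages: list[DataPackage], target: str) -> list[str]:
    """Return the target and everything it needs, dependencies first."""
    index = {p.name: p for p in packages}
    needed: dict[str, set[str]] = {}
    stack = [target]
    # Walk the dependency graph to collect the target's transitive needs.
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        if name not in index:
            raise KeyError(f"unknown data package: {name}")
        needed[name] = set(index[name].depends)
        stack.extend(index[name].depends)
    # TopologicalSorter yields each package after all of its dependencies
    # and raises graphlib.CycleError if packages depend on each other.
    return list(TopologicalSorter(needed).static_order())


# Illustrative registry; the dataset names are made up.
REGISTRY = [
    DataPackage("uk-spending-raw", "0.1"),
    DataPackage("uk-department-codes", "0.3"),
    DataPackage("uk-regions", "1.0"),
    DataPackage("uk-spending-clean", "0.2",
                depends=("uk-spending-raw", "uk-department-codes")),
    DataPackage("uk-spending-by-region", "0.1",
                depends=("uk-spending-clean", "uk-regions")),
    DataPackage("unrelated-geodata", "2.0"),
]

if __name__ == "__main__":
    for name in install_order(REGISTRY, "uk-spending-by-region"):
        print(name)
```

Each package can be maintained by a different person, and only the packages a given project actually needs are pulled in; the unrelated one in the registry above is never touched.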

What do you want to explore?

I’m interested in learning more about the roadblocks to opening up data in different domains and in different countries – and how they can be overcome. There are several areas that I’m thinking of specifically here: encouraging academic institutions and researchers to share the results of their research openly and getting more governments to follow the example of countries like the UK and the US and open up official data. An important point to make here is that it is essential to understand the idiosyncrasies of a given discipline or organization if one is to make a compelling argument for going the open route.

I’m also interested in questions related to the way we develop, represent and analyse large public datasets. What will the open data ecosystem look like? How can we learn from software development to make something that is incremental, decentralised, collaborative, and componentised? How can we build robust and scalable infrastructures for the collaborative development of data? How can we build the technology to allow citizens to combine multiple sources of official data in a wiki-like manner – so that changes can be tracked, and provenance can be traced? How can we break down large and complex datasets into smaller manageable components, and then successfully recombine them again? How can we ‘package’ data and create knowledge APIs to enable automated distribution and reuse of datasets? How can we achieve real read/write status for official information – not just access alone? How can we build better interfaces to make large complex datasets easier for citizens and journalists to understand?

In practical terms I am keen to learn about how we can build tools for the discovery and machine automated installation, analysis, and representation of data that respond to the concrete needs and requirements of those working with large datasets – whether this is civic hackers trying to build better web services for citizens, scientists working on climate models or NGOs formulating strategies for international development.

What are you going to do to get there?

First and foremost I would like to continue to promote open knowledge in different domains. This will include continuing to advocate open government data in different countries around the world and continuing to encourage scientists to open up research data. Also I would like to look at ways of encouraging greater openness in other domains – including in libraries and disciplines such as economics and linguistics. This will involve working to build communities of advocates and practitioners – by organising regular meetings, bringing people together for events, starting work on standards and consensus building, and encouraging people to become evangelists for open data and open content in their field.

I’m also very keen to continue to work on tools for finding and working with open data. A key project in this area is CKAN, which is an open source registry for open data. It is being used by the UK in its official data catalogue, data.gov.uk, and I am currently helping open government data advocates around the world to set up instances in over 10 countries – including Austria, Canada, Finland, France, Germany, Hungary, Italy, New Zealand, and Norway. I would like to continue work on support for multiple languages, getting different instances to talk to each other, allowing federated search for datasets in different countries, tagging in multiple languages and so on. I would also like to continue work that I started on a project called ‘datapkg’, which will allow users to work with large datasets in increasingly sophisticated and machine automatable ways. Developing these kinds of tools and working closely alongside open data advocates and reusers to implement them will help to bootstrap a critical backbone for the open data ecosystem.

Finally I would like to work on several projects that will exemplify the open knowledge development process from end to end – from opening up datasets, to cleaning them up and aggregating them in a nice machine readable format for others to reuse, to creating an intuitive way to explore, represent, and add to the data. These include Where Does My Money Go?, a project to allow users to explore and visually represent UK public spending, and Open Biblio, which will combine numerous large sources of bibliographic information with various web services orientated towards different groups of users. All projects will have a strong community-driven component and all code, content, and data will be fully open. Each project will aim to demonstrate the concrete benefits of open material, and to help to refine key technologies for the collaborative development of data – such as versioning for data, exchanging data between different instances of a web service, and so on.

Key milestones will include:

10 active working groups promoting open knowledge in a different key area – e.g. open data in science, open government data, open data in international development, and so on
10 actively used instances of CKAN for open data in different countries – each led by a partner organisation or advocate in each country who will help to build a community of users around the registry
A major international workshop on open government data – bringing together key government representatives, advocates, policymakers, stakeholders, and reusers
Reaching version 1.0 with two major open data projects: Where Does My Money Go? and Open Biblio – each with an active user community and with significant publicity in its respective field