Open Knowledge Foundation
The next Open Data Maker Night London will be on Tuesday 16th July, 6-9pm (you can drop in any time during the evening). Like the last two, it is kindly hosted by the wonderful Centre for Creative Collaboration, 16 Acton Street, London.
- When: Tuesday 16th July 2013
- Where: Centre for Creative Collaboration, 16 Acton Street, London.
- Signup: on the Meetup page (optional, but it's nice to know numbers!)
Look forward to seeing folks there!
Open Data Maker Nights are informal events focused on “making” with open data – whether that’s creating apps or insights. They aren’t a general meetup – if you come, expect to get pulled into actually building something, though we won’t force you!
The events usually have short introductory talks about specific projects and suggestions for things to work on – it’s absolutely fine to turn up knowing nothing about data or openness or tech, as there’ll be an activity for you to help with and someone to guide you in contributing!
Organize your own!
This is the first of a series of regular updates on the Labs project http://data.okfn.org/ and summarizes some of the changes and improvements over the last few weeks.
1. Refactor of site layout and focus.
We’ve done a refactor of the site to put a stronger focus on the data. The front-page tagline is now:
We’re providing key datasets in high quality, easy-to-use and open form
Tools and standards are there in a clear supporting role. Thanks for all the suggestions and feedback on this – more is welcome, as we’re still iterating.
2. Pull request data workflow
There was a nice example of the pull request data workflow being used (by a complete stranger!): https://github.com/datasets/house-prices-uk/pull/1
3. New datasets
- US house prices http://data.okfn.org/data/house-prices-us
- Annual consumer price index http://data.okfn.org/data/cpi
4. New tools
- We have a datapackage.json creator tool in progress at http://data.okfn.org/tools/dp/create (here’s the relevant github issue)
- We have a new Data Package viewer created by James Smith
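For context, the creator tool’s output is a datapackage.json file describing a dataset and its fields. A minimal one might look like the following (the name, paths and field schema here are purely illustrative, not the actual schema of any dataset above):

```json
{
  "name": "cpi",
  "title": "Annual Consumer Price Index",
  "resources": [
    {
      "path": "data/cpi.csv",
      "schema": {
        "fields": [
          {"name": "Country", "type": "string"},
          {"name": "Year", "type": "date"},
          {"name": "CPI", "type": "number"}
        ]
      }
    }
  ]
}
```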
5. Feedback on standards
There’s been a lot of valuable feedback on the Data Package and JSON Table Schema standards, including some quite major suggestions (e.g. a substantial change to JSON Table Schema to align it more closely with JSON Schema – thanks to jpmckinney)
There’s plenty more coming up soon in terms of the data, the site and the tools:
- Complete the datapackage.json generator (support for gdocs especially)
- Complete the datapackage.json validator
- More datasets especially key indices
Anyone can contribute and it’s easy – if you can use a spreadsheet you can help!
Instructions for getting involved here: http://data.okfn.org/about/contribute
PublicBodies.org is a database and website of “Public Bodies” – that is, government-run or controlled organizations (which may or may not have distinct corporate existence). Examples include government ministries or departments, state-run organizations such as libraries, police and fire departments, and more.
We run into public bodies all the time in projects like OpenSpending (either as spenders or recipients). Back in 2011 as part of the “Organizations” data workshop at OGD Camp 2011, Labs member Friedrich Lindenberg scraped together a first database and site of “public bodies” from various sources (primarily FoI sites like WhatDoTheyKnow, FragDenStaat and AskTheEU).
We’ve recently redone the site, converting the sqlite DB to simple flat CSV files:
- Main github repo: https://github.com/okfn/publicbodies
- Example raw CSV: https://raw.github.com/okfn/publicbodies/master/data/gb.csv
The site itself is now super-simple flat files hosted on S3 (build code here). Here’s an example of the output:
- European Parliament: http://publicbodies.org/eu/european-parliament.html
- Associated JSON API (with CORS!) http://publicbodies.org/eu/european-parliament.json
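The build step is conceptually simple: read each CSV row, render an HTML page from a template, and write the row itself out as JSON alongside it. A minimal sketch in Python (the template, column names and paths here are hypothetical – the actual build code is linked above):

```python
import csv
import json
import string
from pathlib import Path

# Hypothetical page template; the real site has its own, richer templates.
TEMPLATE = string.Template(
    "<html><body><h1>$title</h1><p>$description</p></body></html>"
)

def build(csv_path, out_dir):
    """Render one HTML page and one JSON file per row of a public-bodies CSV."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            slug = row["id"]  # assumes an id/slug column in the CSV
            (out / f"{slug}.html").write_text(
                TEMPLATE.substitute(
                    title=row["title"],
                    description=row.get("description", ""),
                )
            )
            # The JSON "API" is just each row serialised next to its HTML page.
            (out / f"{slug}.json").write_text(json.dumps(row))
```

Everything is static, so CORS is just a header on the S3 bucket rather than anything the build step has to do.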
The simplicity of CSV for data, plus simple templating to flat files, is very attractive. There are some drawbacks (e.g. any change to the primary template means a full rebuild and upload of ~6k files), so, especially as the data grows, we may want to look into something a bit nicer; for the time being this works well.
There’s plenty that could be improved e.g.
- More data - other jurisdictions (we only cover EU, UK and Germany) + descriptions for the bodies (this could be a nice crowdcrafting app)
- Search and Reconciliation (via nomenklatura)
- Making it easier to submit corrections or additions
The full list of issues is on github here: https://github.com/okfn/publicbodies/issues
Help is most definitely wanted! Just grab one of the issues or get in touch …
I’m playing around with some large(ish) CSV files as part of an OpenSpending-related data investigation into UK government spending last year – example question: which companies were the top 10 recipients of government money? (More details can be found in this issue on OpenSpending’s things-to-do repo.)
The dataset I’m working with is the consolidated spending (over £25k) by all UK government departments. Thanks to the efforts of the OpenSpending folks (and specifically Friedrich Lindenberg) this data is already nicely ETL’d from thousands of individual CSV (and xls) files into one big 3.7 GB file (see below for links and details).
My question is: what is the best way to do quick-and-dirty analysis on this?
Examples of the kinds of options I was considering were:
- Simple scripting (Python, Perl, etc.)
- PostgreSQL – load, build indexes and then sum, avg, etc.
- Elastic MapReduce (AWS Hadoop)
- Google BigQuery
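To give a feel for the simple-scripting option: 3.7 GB will stream through Python’s csv module in a single pass without loading anything into memory. A sketch of the top-recipients question (the column names here are assumptions – check the Data Package file for the real schema):

```python
import csv
import heapq
from collections import defaultdict

def top_recipients(path, n=10, supplier_col="SupplierName", amount_col="Amount"):
    """Stream the CSV once, summing spend per supplier, and return the top n.

    supplier_col and amount_col are guesses at the field names; the
    datapackage.json for the dataset is the authoritative reference.
    """
    totals = defaultdict(float)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                # Amounts in the source files often carry thousands separators.
                amount = float(row[amount_col].replace(",", ""))
            except (KeyError, ValueError):
                continue  # skip blank or malformed rows rather than crash
            totals[row[supplier_col]] += amount
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])
```

The trade-off versus PostgreSQL or BigQuery is that each new question costs another full pass over the file, whereas a loaded, indexed database answers follow-up questions in seconds.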
I’d love to hear what folks think, and whether there are tools or approaches they would specifically recommend.
- Here’s the 3.7 GB CSV
- A Data Package file for the data describing the fields: https://raw.github.com/openspending/dpkg-uk25k/master/datapackage.json
I’ve been working to get Greater London Authority spending data cleaned up and into OpenSpending. Primary motivation comes from this question:
Which companies got paid the most (and for doing what)? (see this issue for more)
I wanted to share where I’m up to and some of the experience so far, as I think these can inform our wider efforts – and illustrate the challenges of just getting and cleaning up data. Note that the code and README for this ongoing work are in a repo on github: https://github.com/rgrp/dataset-gla
Data Quality Issues
There are 61 CSV files as of March 2013 (a list can be found in scrape.json).
Unfortunately the “format” varies substantially across files (even though they are all CSV!), which makes using this data a real pain. Some examples:
- The number of fields and their names vary across files (e.g. SAP Document no vs Document no)
- The number of blank columns or blank lines varies (some files have no blank lines (good!), many have blank lines plus some metadata, etc.)
- There is also at least one “bad” file which looks to be an Excel file saved as CSV
- Amounts are frequently formatted with “,” thousands separators, making them appear as strings to computers
- Dates vary substantially in format, e.g. “16 Mar 2011”, “21.01.2011”
- There is no unique transaction number (though the document number may serve)
They have also switched from monthly reporting to period reporting (where there are 13 periods of approximately 28 days each).
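Most of these issues are mechanical to fix in code once spotted. A sketch of the amount and date normalisation, assuming only the formats seen in the examples above (the real clean-up script is linked at the end of this post):

```python
from datetime import datetime

def clean_amount(raw):
    """Turn a string like '2,300,000.00' into a float by stripping separators."""
    return float(raw.replace(",", "").strip())

# Date formats observed so far; extend the list as new variants turn up.
DATE_FORMATS = ["%d %b %Y", "%d.%m.%Y", "%d/%m/%Y"]

def clean_date(raw):
    """Try each known format in turn; fail loudly on anything unrecognised."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            pass
    raise ValueError(f"unrecognised date format: {raw!r}")
```

Failing loudly on unknown date formats is deliberate: silently skipping rows is how cleanup bugs hide in spending data.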
Progress so far
I do have one month loaded (Jan 2013), with a nice breakdown by “Expenditure Account”.
Interestingly, after some fairly standard grants to other bodies, “Claim Settlements” comes in as the biggest item at £2.3m.
- Data is being archived at http://data.openspending.org/datasets/gb-local-gla/
- Clean up script: https://github.com/rgrp/dataset-gla/blob/master/scripts/process.js