Category Archives: Code

Open Data Maker Night London No 3 – Tuesday 16th July

The next Open Data Maker Night London will be on Tuesday 16th July 6-9pm (you can drop in any time during the evening). Like the last two it is kindly hosted by the wonderful Centre for Creative Collaboration, 16 Acton Street, London.

Look forward to seeing folks there!

What

Open Data Maker Nights are informal events focused on “making” with open data – whether that’s creating apps or insights. They aren’t a general meetup – if you come, expect to get pulled into actually building something, though we won’t force you!

Who

The events usually have short introductory talks about specific projects and suggestions for things to work on – it’s absolutely fine to turn up knowing nothing about data or openness or tech, as there’ll be an activity for you to help with and someone to guide you in contributing!

Organize your own!

Not in London? Why not organize your own Open Data Maker night in your city? Anyone can and it’s easy to do – find out more »

data.okfn.org – update no. 1

This is the first of our regular updates on the Labs project http://data.okfn.org/. It summarizes some of the changes and improvements over the last few weeks.

1. Refactor of site layout and focus.

We’ve refactored the site to put a stronger focus on the data itself. The front-page tagline is now:

We’re providing key datasets in high quality, easy-to-use and open form

Tools and standards are there in a clear supporting role. Thanks for all the suggestions and feedback on this so far – more is welcome, as we’re still iterating.

2. Pull request data workflow

There was a nice example of the pull request data workflow being used (by a complete stranger!): https://github.com/datasets/house-prices-uk/pull/1

3. New datasets

For example:

Looking to contribute data? Check out the instructions at http://data.okfn.org/about/contribute#data and the outstanding requests: https://github.com/datasets/registry/issues

4. Tooling

5. Feedback on standards

There’s been a lot of valuable feedback on the Data Package and JSON Table Schema standards, including some quite major suggestions (e.g. a substantial change to JSON Table Schema to align it more closely with JSON Schema – thanks to jpmckinney).
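
For anyone not familiar with the standards: a Data Package is, at its core, just a datapackage.json descriptor sitting next to the data files, with a JSON Table Schema describing the columns of each CSV resource. A minimal, illustrative example (the column names and path here are made up, not taken from a real dataset):

{
  "name": "house-prices-uk",
  "title": "UK House Prices",
  "resources": [
    {
      "path": "data/data.csv",
      "schema": {
        "fields": [
          {"name": "Date", "type": "date"},
          {"name": "Price", "type": "number"}
        ]
      }
    }
  ]
}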

Next steps

There’s plenty more coming up soon in terms of data and the site and tools.

Get Involved

Anyone can contribute and it’s easy – if you can use a spreadsheet you can help!

Instructions for getting involved here: http://data.okfn.org/about/contribute

Update on PublicBodies.org – a URL for every part of Government

This is an update on PublicBodies.org - a Labs project whose aim is to provide a “URL for every part of Government”: http://publicbodies.org/

PublicBodies.org is a database and website of “Public Bodies” – that is Government-run or controlled organizations (which may or may not have distinct corporate existence). Examples would include government ministries or departments, state-run organizations such as libraries, police and fire departments and more.

We run into public bodies all the time in projects like OpenSpending (either as spenders or recipients). Back in 2011 as part of the “Organizations” data workshop at OGD Camp 2011, Labs member Friedrich Lindenberg scraped together a first database and site of “public bodies” from various sources (primarily FoI sites like WhatDoTheyKnow, FragDenStaat and AskTheEU).

We’ve recently redone the site, converting the SQLite DB to simple flat CSV files.

The site itself is now super-simple flat files hosted on S3 (build code here). Here’s an example of the output:

The simplicity of CSV for the data plus simple templating to flat files is very attractive. There are some drawbacks – for example, a change to the primary template means a full rebuild and upload of ~6k files – so, especially as the data grows, we may want to look into something a bit nicer. For the time being, though, this works well.
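
For the curious, the build boils down to something like the following – a rough sketch only (the paths, column names and template here are illustrative, not the actual publicbodies build code linked above):

var fs = require('fs');
var Mustache = require('mustache');

// one HTML template for a public body page (illustrative path)
var template = fs.readFileSync('templates/body.html', 'utf8');

// naive CSV read - fine as a sketch; the real build uses a proper CSV parser
var lines = fs.readFileSync('data/gb.csv', 'utf8').trim().split('\n');
var headers = lines[0].split(',');

lines.slice(1).forEach(function(line) {
  var cells = line.split(',');
  var body = {};
  headers.forEach(function(h, i) { body[h] = cells[i]; });
  // one static page per public body, keyed on a hypothetical 'id' column,
  // ready to be synced up to S3
  fs.writeFileSync('build/' + body.id + '.html', Mustache.render(template, body));
});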

Next Steps

There’s plenty that could be improved e.g.

  • More data – other jurisdictions (we currently only cover the EU, UK and Germany) plus descriptions for the bodies (this could be a nice Crowdcrafting app)
  • Search and Reconciliation (via nomenklatura)
  • Making it easier to submit corrections or additions

The full list of issues is on github here: https://github.com/okfn/publicbodies/issues

Help is most definitely wanted! Just grab one of the issues or get in touch.

Quick and Dirty Analysis on Large CSVs

I’m playing around with some large(ish) CSV files as part of an OpenSpending-related data investigation into UK government spending last year – example question: which companies were the top 10 recipients of government money? (More details can be found in this issue on OpenSpending’s things-to-do repo).

The dataset I’m working with is the consolidated spending (over £25k) by all UK government departments. Thanks to the efforts of OpenSpending folks (and specifically Friedrich Lindenberg), this data is already nicely ETL’d from thousands of individual CSV (and xls) files into one big 3.7 Gb file (see below for links and details).

My question is: what is the best way to do quick and dirty analysis on this?

Examples of the kinds of options I was considering were:

  • Simple scripting (python, perl etc)
  • Postgresql - load, build indexes and then sum, avg etc
  • Elastic MapReduce (AWS Hadoop)
  • Google BigQuery

I’d love to hear what folks think and whether there are tools or approaches they would specifically recommend.
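
To make the first option concrete, here’s the sort of single streaming pass I have in mind – a sketch in Node (which I’ve been using for scraping anyway), with hypothetical column names (‘supplier’, ‘amount’) and deliberately naive CSV parsing:

var fs = require('fs');

// stream the big CSV once, summing amounts per supplier
// NB: column names are hypothetical and split(',') will break on quoted commas
var totals = {};
var headers = null;
var leftover = '';

var stream = fs.createReadStream('spending.csv', { encoding: 'utf8' });
stream.on('data', function(chunk) {
  var lines = (leftover + chunk).split('\n');
  leftover = lines.pop();  // keep any partial last line for the next chunk
  lines.forEach(function(line) {
    var cells = line.split(',');
    if (!headers) { headers = cells; return; }
    var row = {};
    headers.forEach(function(h, i) { row[h] = cells[i]; });
    totals[row.supplier] = (totals[row.supplier] || 0) + (parseFloat(row.amount) || 0);
  });
});
stream.on('end', function() {
  // print the top 10 recipients by total amount
  Object.keys(totals)
    .sort(function(a, b) { return totals[b] - totals[a]; })
    .slice(0, 10)
    .forEach(function(name) { console.log(name, Math.round(totals[name])); });
});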

The Data

Cleaning up Greater London Authority Spending (for OpenSpending)

I’ve been working to get Greater London Authority spending data cleaned up and into OpenSpending. Primary motivation comes from this question:

Which companies got paid the most (and for doing what)? (see this issue for more)

I wanted to share where I’m up to and some of the experience so far, as I think these can inform our wider efforts – and illustrate the challenges of just getting and cleaning up data. The code and README for this ongoing work are in a repo on GitHub: https://github.com/rgrp/dataset-gla

Data Quality Issues

There are 61 CSV files as of March 2013 (a list can be found in scrape.json).

Unfortunately the “format” varies substantially across files (even though they are all CSV!), which makes using this data a real pain. Some examples:

  • the number of fields and their names vary across files (e.g. SAP Document no vs Document no)
  • the number of blank columns or blank lines varies (some files have no blank lines (good!), many have blank lines plus some metadata, etc.)
  • there is also at least one “bad” file which looks to be an Excel file saved as CSV
  • amounts are frequently formatted with “,”, making them appear as strings to computers
  • dates vary substantially in format, e.g. “16 Mar 2011”, “21.01.2011” etc
  • there is no unique transaction number (though the document number may serve as one)

They also switched from monthly reporting to period reporting (where there are 13 periods of approximately 28 days each).
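
Most of the cleanup is small, fiddly transformations along the following lines – a sketch rather than the actual code in the repo:

// strip thousands separators (and stray currency symbols / whitespace)
// so that "2,300,000" becomes the number 2300000
function cleanAmount(value) {
  return parseFloat(String(value).replace(/[£,\s]/g, '')) || 0;
}

// normalize the different date formats seen in the files,
// e.g. "16 Mar 2011" and "21.01.2011", to ISO "YYYY-MM-DD"
var MONTHS = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'];
function cleanDate(value) {
  var m = String(value).trim().match(/^(\d{1,2}) (\w{3}) (\d{4})$/);
  if (m) {
    var month = ('0' + (MONTHS.indexOf(m[2]) + 1)).slice(-2);
    return m[3] + '-' + month + '-' + ('0' + m[1]).slice(-2);
  }
  m = String(value).trim().match(/^(\d{2})\.(\d{2})\.(\d{4})$/);
  if (m) {
    return m[3] + '-' + m[2] + '-' + m[1];
  }
  return null;  // unrecognized format - flag for manual inspection
}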

Progress so far

I do have one month loaded (Jan 2013) with a nice breakdown by “Expenditure Account”:

http://openspending.org/gb-local-gla

Interestingly, after some fairly standard grants to other bodies, “Claim Settlements” comes in as the biggest item at £2.3m.

Progress on the Data Explorer

This is an update on progress with the Data Explorer (aka Data Transformer).

Progress is best seen from this demo which takes you on a tour of house prices and the difference between real and nominal values.

More information on recent developments can be found below. Feedback is very welcome – either here or in the issues at https://github.com/okfn/dataexplorer.

House prices tutorial

What is the Data Explorer

For those not familiar, the Data Explorer is an HTML+JS app to view, visualize and process data just in the browser (no backend!). It draws heavily on the Recline library and its features now include:

  • Importing data from various sources (the UX of this could be much improved!)
  • Viewing and visualizing using Recline to create grids, graphs and maps
  • Cleaning and transforming data using a scripting component that allows you to write and run JavaScript (a sketch of the kind of transform you might write follows below)
  • Saving and sharing: everything you create (scripts, graphs etc) can be saved and then shared via public URL.
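
To give a flavour of the cleaning/transform step, the kind of script you end up writing is a simple per-record map like the following (purely illustrative – the field names are made up and this is not the exact API the scripting component exposes; see the house prices demo for a real script):

// deflate nominal prices to real prices using a consumer price index,
// assuming `records` is the dataset as an array of plain objects and
// `cpi` maps year to the index value, with 2013 as the base year
function toRealPrices(records, cpi) {
  return records.map(function(record) {
    record.realPrice = record.price * (cpi[2013] / cpi[record.year]);
    return record;
  });
}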

Note that persistence (for sharing) is to Gists (here’s the gist for the House Prices demo linked above). This has some nice benefits such as versioning; offline editing (clone the gist, edit and push); and a bl.ocks.org-style ability to create a gist and have it result in publicly viewable output (though with substantial differences vs blocks …).

What’s Next

There are many areas that could be worked on – a full list of issues is on GitHub. The most important ones, I think, at the moment are:

I’d be very interested in people’s thoughts on the app so far and on what should be done next. Code contributions are also very welcome – the app has already benefitted from the efforts of many people, including the likes of Martin Keegan and Michael Aufreiter on the app itself, and folks like Max Ogden, Friedrich Lindenberg, James Casbon, Gregor Aisch, Nigel Babu (and many more) in the form of ideas, feedback, work on Recline etc.

Recline JS – Componentization and a Smaller Core

Over time Recline JS has grown. In particular, since the first public announcement of Recline last summer we’ve had several people producing new backends and views (e.g. backends for Couch, a view for d3, a map view based on Ordnance Survey’s tiles, etc.).

As I wrote to the labs list recently, continually adding these to core Recline runs the risk of bloat. Instead, we think it’s better to keep the core lean and move more of these “extensions” out of core with a clear listing and curation process - the design of Recline means that new backends and views can extend the core easily and without any complex dependencies.

This approach is useful in other ways. For example, Recline backends are designed to support standalone use as well as use with Recline core (they have no dependency on any other part of Recline, including core), but this is not very obvious as things stand, with each backend bundled into the main Recline repository. To take a concrete example, the Google Docs backend is a useful wrapper for the Google Spreadsheets API in its own right – splitting it out into its own repo with its own README makes this much clearer.
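
As a rough illustration of standalone use (following the usual Recline backend contract of a fetch function that returns a Deferred – check the split-out repo’s README for the exact signature and supported spreadsheet URL formats):

// use the GDocs backend on its own as a Google Spreadsheets wrapper
// (the spreadsheet key is a placeholder)
var config = {
  url: 'https://docs.google.com/spreadsheet/ccc?key=YOUR-SPREADSHEET-KEY'
};
recline.Backend.GDocs.fetch(config).done(function(result) {
  // result holds the column definitions and the rows as plain objects
  console.log(result.fields);
  console.log(result.records);
});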

So the plan is …

  • Announce this approach of a leaner core and more “Extensions”
  • Identify first items to split out from core - see this issue
  • Identify which components should remain in core (I’m thinking Dataset + Memory DataStore plus one Grid, Graph and Map view)

So far I’ve already started the process of factoring out some backends (and soon views) into standalone repos, e.g. here’s GDocs:

https://github.com/okfn/recline.backend.gdocs

Any thoughts are very welcome – and if you already have Recline extensions lurking in your repos, please add them to the wiki page.

Web Scraping with CSS Selectors in Node using JSDOM or Cheerio

I’ve traditionally used python for web scraping but I’d been increasingly thinking about using Node given that it is pure JS and therefore could be a more natural fit when getting info out of web pages.

In particular, my first step when looking to extract information from a website is to open up the Chrome Developer Tools (or Firebug in Firefox) and try to extract information by inspecting the page and playing around in the console – the latter is especially attractive if jQuery is available.

What I often end up with from this is a few lines of jQuery selectors. My desire here was to find a way to reuse these same CSS selectors from my browser experimentation directly in the scraping script. Now, things like pyquery do exist in Python (and there is some CSS selector support in the brilliant BeautifulSoup), but a connection with something like Node seems even more natural – it is, after all, the JS engine from a browser!

UK Crime Data

My immediate motivation for this work was wanting to play around with the UK Crime data (all open data now!).

To do this I needed to:

  1. Get the data in consolidated form by scraping the file list and data files from http://police.uk/data/ – while they commendably provide the data in bulk, there is no single file to download; instead there is one file per force per month.
  2. Do data cleaning and analysis - this included some fun geo-conversion and csv parsing

I’m just going to talk about the first part in what follows – though I hope to cover the second part in a follow-up post.

I should also note that all the code used for scraping and working with this data can be found in the UK Crime dataset data package on GitHub – the scrape.js file is here. You can also see some of the ongoing results of these data experiments in an experimental UK crime “dashboard” here.

Scraping using CSS Selectors in Node

Two options present themselves when doing simple scraping using CSS selectors in Node.js:

  • jsdom
  • cheerio

For the UK crime work I used jsdom but I’ve subsequently used cheerio as it is substantially faster so I’ll cover both here (I didn’t discover cheerio until I’d started on the crime work!).

Here’s an excerpted code example (full example in the source file):

var jsdom = require('jsdom');

var url = 'http://police.uk/data';
// holder for results
var out = {
  'streets': []
};
jsdom.env({
  html: url,
  scripts: [
    'http://code.jquery.com/jquery.js'
  ],
  done: function(errors, window) {
    var $ = window.$;
    // find all the html links to the street zip files
    $('#downloads .months table tr td:nth-child(2) a').each(function(idx, elem) {
      // push the url (href attribute) onto the list
      out['streets'].push( $(elem).attr('href') );
    });
  }
});

As an example of Cheerio scraping, here’s an example from work scraping info from the EU’s TED database (sample html file):

var request = require('request');
var cheerio = require('cheerio');

var url = 'http://files.opented.org.s3.amazonaws.com/scraped/100120-2011/summary.html';
// place to store results
var data = {};
// do the request using the request library
request(url, function(err, resp, body) {
  var $ = cheerio.load(body);

  data.winnerDetails = $('.txtmark .addr').html();

  $('.mlioccur .txtmark').each(function(i, html) {
    var spans = $(html).find('span');
    var span0 = $(spans[0]);
    if (span0.text() == 'Initial estimated total value of the contract ') {
      // cleanAmount is a small helper (defined elsewhere in the full script)
      // that strips the currency formatting and parses the number
      var amount = $(spans[4]).text();
      data.finalamount = cleanAmount(amount);
      data.initialamount = cleanAmount($(spans[1]).text());
    }
  });
});

Archiving Twitter the Hacky Way

There are many circumstances where you want to archive tweets – maybe just from your own account, or perhaps for a hashtag for an event or topic.

Unfortunately, Twitter search queries do not return data more than 7 days old, and for a given account you can only get approximately your last 3200 tweets and 800 items from your timeline. [Update: People have pointed out that Twitter released a feature to download an archive of your personal tweets at the end of December – this, of course, still doesn’t help with queries or hashtags]

Thus, if you want to archive Twitter you’ll need to come up with another solution (or pay them, or a reseller, a bunch of money – see the Appendix below!). Sadly, most of the online solutions have tended to disappear or be acquired over time (e.g. twapperkeeper), so a DIY solution would be attractive. After reading various proposals on the web I’ve found the following to work pretty well (but see also this excellent Google Spreadsheet based solution).

The proposed process involves 3 steps:

  1. Locate the Twitter Atom Feed for your Search
  2. Use Google Reader as your Archiver
  3. Get your data out of Google Reader (1000 items at a time!)

One current drawback of this solution is that each stage has to be done by hand. It could be possible to automate more of this, and especially the important third step, if I could work out how to do more with the Google Reader API. Contributions or suggestions here would be very welcome!

Note that the above method will become obsolete as of March 5 2013 when Twitter closes down its RSS and Atom feeds – continuing their long march towards a more closed and controlled ecosystem.

As you struggle, like me, to get precious archival information out of Twitter it may be worth reflecting on just how much information you’ve given to Twitter that you are now unable to retrieve (at least without paying) …

Twitter Atom Feed

Twitter still have Atom feeds for their search queries:

http://search.twitter.com/search.atom?q=my_search

Note that if you want to search for a hash tag like #OpenData or a user e.g. @someone you’ll need to escape the symbols:

http://search.twitter.com/search.atom?q=%23OpenData

Unfortunately, Twitter Atom queries are limited to only a few items (around 20), so we’ll need to continuously archive that feed to get full coverage.

Archiving in Google Reader

Just add the feed URL above to your Google Reader account. It will then start archiving.

Aside: because the Twitter Atom feed is limited to a small number of items and Google Reader only checks the feed every 3 hours (1 hour if someone else is archiving the same feed), you can miss a lot of tweets. One option could be to use Topsy’s RSS feeds, e.g. http://otter.topsy.com/searchdate.rss?q=%23okfn (though it’s not clear how to get more items from that feed either!)

Getting Data out of Google Reader

Google Reader offers a decent (though still beta) API. Unofficial docs for it can be found here: http://undoc.in/

The key URL we need is:

http://www.google.com/reader/atom/feed/[feed_address]?n=1000

Note that the feed is limited to a maximum of 1000 items and you can only access it for your account if you are logged in. This means:

  • If you have more than 1000 items you need to find the continuation token in each set of results and then append &c={continuation-token} to your query.
  • Because you need to be logged in in your browser, you need to do this by hand :-( (it may be possible to automate this via the API but I couldn’t get anything to work – any tips much appreciated!)

Here’s a concrete example (note, as you need to be logged in this won’t work for you):

http://www.google.com/reader/atom/feed/http://search.twitter.com/search.atom%3Fq%3D%2523OpenData?n=1000

And that’s it! You should now have a local archive of all your tweets!

Appendix

Increasingly, Twitter is selling access to the full Twitter archive, and there are a variety of 3rd-party services (such as Gnip, DataSift, Topsy and possibly more) offering full or partial access for a fee.

Recline JS Search Demo

Recline JS

We’ve recently finished a demo for ReclineJS showing how it can be used to build JS-based (ajax-style) search interfaces in minutes (or even seconds!): http://reclinejs.com/demos/search/

Because of Recline’s pluggable backends you get out-of-the-box support for data sources such as SOLR, Google Spreadsheets, ElasticSearch, or plain old JSON or CSV – see below for live examples using different backends.
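
Concretely, switching data source is largely a matter of how you construct the Dataset – roughly as follows (a sketch: the endpoint URL is a placeholder and the exact backend names/options are in the Recline docs):

// local JSON data (as in the simple example)
var dataset = new recline.Model.Dataset({
  records: [ { title: 'First item' }, { title: 'Second item' } ]
});

// ... or the same app pointed at a SOLR endpoint instead
var solrDataset = new recline.Model.Dataset({
  url: 'http://example.com/solr/select',  // placeholder endpoint
  backend: 'solr'
});

// either way, searching is just a query against the dataset
dataset.query({ q: 'drugs', size: 10 });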

Interested in using this yourself? The (prettified) source JS for the demo is available (plus the raw version) and it shows how simple it is to build an app like this using Recline – plus it has tips on how to customize and extend.


More Examples

In addition to the simple example with local data there are several other examples showing how one can use this with other data sources including Google Docs and SOLR:

  1. A search example using a Google Doc listing Shell Oil spills in the Niger delta

  2. A search example running off the OpenSpending SOLR API – we suggest searching for something interesting like “Drugs” or “Nuclear power”!

Code

The full (prettified) source JS for the demo is available (plus the raw version) but here’s a key code sample to give a flavour:

// ## Simple Search View
//
// This is a simple bespoke Backbone view for the Search. It Pulls together
// various Recline UI components and the central Dataset and Query (state)
// object
//
// It also provides simple support for customization e.g. of template for list of results
// 
//      var view = new SearchView({
//        el: $('some-element'),
//        model: dataset,
//        // EITHER a mustache template (passed a JSON version of recline.Model.Record
//        // OR a function which receives a record in JSON form and returns html
//        template: mustache-template-or-function
//      });
var SearchView = Backbone.View.extend({
  initialize: function(options) {
    this.el = $(this.el);
    _.bindAll(this, 'render');
    this.recordTemplate = options.template;
    // Every time we do a search the recline.Dataset.records Backbone
    // collection will get reset. We want to re-render each time!
    this.model.records.bind('reset', this.render);
    this.templateResults = options.template;
  },

  // overall template for this view
  template: ' \
    <div class="controls"> \
      <div class="query-here"></div> \
    </div> \
    <div class="total"><h2><span></span> records found</h2></div> \
    <div class="body"> \
      <div class="sidebar"></div> \
      <div class="results"> \
        {{{results}}} \
      </div> \
    </div> \
    <div class="pager-here"></div> \
  ',
 
  // render the view
  render: function() {
    var results = '';
    if (_.isFunction(this.templateResults)) {
      var results = _.map(this.model.records.toJSON(), this.templateResults).join('\n');
    } else {
      // templateResults is just for one result ...
      var tmpl = '{{#records}}' + this.templateResults + '{{/records}}';
      var results = Mustache.render(tmpl, {
        records: this.model.records.toJSON()
      });
    }
    var html = Mustache.render(this.template, {
      results: results
    });
    this.el.html(html);

    // Set the total records found info
    this.el.find('.total span').text(this.model.recordCount);

    // ### Now setup all the extra mini-widgets
    // 
    // Facets, Pager, QueryEditor etc

    var view = new recline.View.FacetViewer({
      model: this.model
    });
    view.render();
    this.el.find('.sidebar').append(view.el);

    var pager = new recline.View.Pager({
      model: this.model.queryState
    });
    this.el.find('.pager-here').append(pager.el);

    var queryEditor = new recline.View.QueryEditor({
      model: this.model.queryState
    });
    this.el.find('.query-here').append(queryEditor.el);
  }
});