Bad Data: real-world examples of how not to do data

We’ve just started a mini-project called Bad Data. Bad Data provides real-world examples of how not to publish data. It showcases the poorly structured, the mis-formatted, and the just plain ugly.

This isn’t about being critical but about educating—providing examples of how not to do something may be one of the best ways of showing how to do it right. It also provides a source of good practice material for budding data wranglers!

Bad Data: ASCII spreadsheet

Each “bad” dataset gets a simple page that shows what’s wrong, along with a preview or screenshot.

We’ve started to stock the site with some of the better examples of bad data that we’ve come across over the years. This includes machine-unreadable Transport for London passenger numbers from the London Datastore and a classic “ASCII spreadsheet” from the US Bureau of Labor Statistics.

We welcome contributions of new examples! Submit them here.

Labs newsletter: 14 November, 2013

Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post.

Data Pipes: lots of improvements

Data Pipes is a Labs project that provides a web API for a set of simple data-transforming operations that can be chained together in the style of Unix pipes.

This past week, Andy Lulham has made a huge number of improvements to Data Pipes. Just a few of the new features and fixes:

  • new operations: strip (removes empty rows), tail (truncate dataset to its last rows)
  • new features: a range function and a “complement” switch for cut; options for grep
  • all operations in the pipeline are now trimmed of whitespace
  • basic tests have been added

Have a look at the closed issues to see more of what Andy has been up to.
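To get a feel for how the new operations slot into a pipeline, here is an illustrative sketch of a URL that chains a few of them together. The host is the Data Pipes demo server and the exact flag spellings are our best guess from the examples on the site, so treat this as a sketch rather than documentation:

    # Illustrative only: a pipeline URL using the new strip and tail operations
    # plus grep. Check the Data Pipes site for the exact operation syntax.
    base = "http://datapipes.okfnlabs.org"
    ops = "/csv/strip/grep LONDON/tail -n 10/html"   # drop empty rows, filter, keep last 10, render
    source = "https://raw.github.com/okfn/datapipes/master/test/data/gla.csv"

    print(base + ops + "?url=" + source)   # open the printed URL in a browser to see the result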

Webshot: new homepage and feature

Last week we introduced you to Webshot, a web API for screenshots of web pages.

Back then, Webshot’s home page was just a screenshot of GitHub. Now Webshot has a proper home page with a form interface to the API.

Webshot has also added support for full page screenshots. Now you can capture the whole page rather than just its visible portion.

On the blog: natural language processing with Python

Labs member Tarek Amr has contributed an awesome post on Python natural language processing with the NLTK toolkit to the Labs blog.

“The beauty of NLP,” Tarek says, “is that it enables computers to extract knowledge from unstructured data inside textual documents.” Read his post to learn how to do text normalization, frequency analysis, and text classification with Python.
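If you haven’t used NLTK before, here is a tiny, self-contained taster (ours, not Tarek’s) of the tokenize, normalize and count workflow that his post walks through in much more depth:

    # A minimal illustration (not Tarek's code) of the NLP steps mentioned above,
    # using the NLTK toolkit: tokenize, normalize, then count word frequencies.
    import nltk
    from nltk.stem import PorterStemmer

    nltk.download("punkt", quiet=True)  # tokenizer models, needed on first run

    text = "Computers can extract knowledge from unstructured textual documents."

    tokens = nltk.word_tokenize(text)                        # split into word tokens
    normalized = [t.lower() for t in tokens if t.isalpha()]  # lowercase, drop punctuation
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in normalized]            # crude normalization

    freq = nltk.FreqDist(stems)                              # frequency analysis
    print(freq.most_common(5))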

Data Packages workflow à la Node

Wouldn’t it be nice to be able to initialize new Data Packages as easily as you can initialize a Node module with npm init?

Max Ogden started a discussion thread around this enticing idea, eventually leading to Rufus Pollock booting a new repo for dpm, the Data Package Manager. Check out dpm’s Issues to see what needs to happen next with this project.

Nomenklatura: looking forward

Nomenklatura is a Labs project that does data reconciliation, making it possible “to maintain a canonical list of entities such as persons, companies or even streets and to match messy input, such as their names, against that canonical list”.
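For the uninitiated, here is a toy sketch of the kind of matching such a reconciliation service does. It is purely illustrative and is not Nomenklatura’s actual code or API:

    # Toy illustration of reconciling messy names against a canonical list --
    # only to show the idea; this is not Nomenklatura's code or API.
    import difflib

    canonical = ["Transport for London", "Greater London Authority",
                 "Bureau of Labor Statistics"]

    def reconcile(messy_name, choices, cutoff=0.6):
        """Return the best canonical match for a messy input name, if any."""
        matches = difflib.get_close_matches(messy_name, choices, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    print(reconcile("Transport for London (TfL)", canonical))
    # -> 'Transport for London'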

Friedrich Lindenberg has noted on the Labs mailing list that Nomenklatura has some serious problems, and he has proposed “a fairly radical re-framing of the service”.

The conversation around what this re-framing should look like is still underway—check out the discussion thread and jump in with your ideas.

Data Issues: following issues

Last week, the idea of Data Issues was floated: using GitHub Issues to track problems with public datasets. The idea has generated a few comments, and we’d love to hear more.

Discussion on the Labs list highlighted another benefit of using GitHub. Alioune Dia suggested that Data Issues should let users register to be notified when a particular issue is fixed. But Chris Mear pointed out that GitHub already makes this possible: “Any GitHub user can ‘follow’ a specific issue by using the notification button at the bottom of the issue page.”

Get involved

Anyone can join the Labs community and get involved! Read more about how you can join the community and participate by coding, wrangling data, or doing outreach and engagement. Also check out the Ideas Page to see what’s cooking in the Labs.

Northern Mariana Islands Retirement Fund Bankruptcy

Back on April 17, 2012, the Northern Mariana Islands Retirement Fund attempted to file for bankruptcy under Chapter 11. There was some pretty interesting reading in their petition for bankruptcy, including this section (para 10), which suggests some pretty bad public financial management (emphasis added): “Debtor has had difficulty maintaining healthy funding levels due to […]

Labs newsletter: 7 November, 2013

There was lots of interesting activity around Labs this week, with two launched projects, a new initiative in the works, and an Open Data Maker Night in London.

Webshot: online screenshot service

webshot.okfnlabs.org, an online service for taking screenshots of websites, is now live, thanks to Oliver Searle-Barnes and Simon Gaeremynck.

Try it out with an API call like this:

http://webshot.okfnlabs.org/?url=http://okfnlabs.org&width=640&height=480
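If you’d rather call the service from a script, a minimal sketch using the same documented parameters looks like this (we’re assuming the response body is the screenshot image itself):

    # Minimal sketch of calling the Webshot API from Python and saving the result.
    # Only the documented url/width/height parameters are used here.
    import requests

    params = {"url": "http://okfnlabs.org", "width": 640, "height": 480}
    resp = requests.get("http://webshot.okfnlabs.org/", params=params)
    resp.raise_for_status()

    with open("okfnlabs.png", "wb") as f:
        f.write(resp.content)   # the response body is the screenshot image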

Read more about the development behind the service here.

Product Open Data Android app

The first version of the Android app for Product Open Data has launched, allowing you to conveniently look up open data associated with a product on your phone.

The source code for the app is available on GitHub.

Crowdcrafting for Public Bodies

PublicBodies.org aims to provide “a URL for every part of government”. Many entries in the database lack good description text, though, making them harder to use effectively. Fixing this would be a good use of CrowdCrafting.org, the crowd-sourcing platform powered by PyBossa.

Rufus suggests starting small, with EU public bodies. It should be easy to build a CrowdCrafting app to cover those, says Daniel Lombraña González. Friedrich Lindenberg thinks this approach could work for other datasets as well.

Discussion of this idea is still happening on the list, so jump in and say what you think—or help build the app!

Open Data Maker Night #6

The sixth Open Data Maker Night took place this past Tuesday in London. Open Data Maker Nights are informal events where people make things with open data, whether apps or insights.

This night’s focus was on adding more UK and London data to OpenSpending, and it featured special guest Max Ogden. It was hosted by the Centre for Creative Collaboration.

Our next Open Data Maker Night will happen in early December. If you want to organize your own, though, it’s super easy: just see the Open Data Maker Night website for help booting, promoting, and running the event.

Tracking Issues with Data the Simple Way

Data Issues is a prototype initiative to track “issues” with data using a simple bug tracker—in this case, GitHub Issues.

We’ve all come across “issues” with data, whether it’s “data” that turns out to be provided as a PDF, the many ways to badly format tabular data (empty rows, empty columns, inlined metadata …), “ASCII spreadsheets”, or simply erroneous data.

Key to starting to improve data quality is a way to report and record these issues.

We’ve thought about ways to address this for quite some time and, led by Labs member Friedrich Lindenberg, even experimented with building our own service. But recently, thanks to a comment from Labs member David Miller, we were hit with a blinding insight: why not do the simplest thing possible and just use an existing bug tracker? And so was born the current version of Data Issues, based on a GitHub issue tracker!

Data Issues

Aside: Before you decide we were completely crazy not to see this in the first place, it should be said that doing data issues “properly” (in the medium term) probably does require something a bit more than a normal bug tracker. For example, it would be nice to be able both to pinpoint an issue precisely (e.g. the date in column 5 on line 3751 is invalid) and to group similar issues (e.g. all amounts in column 7 have commas in them). Doing this would require a tracker that was customized for data. The solution described in this post, however, seems like a great way to get started.

Introducing Data Issues

Given the existence of so many excellent issue-tracking systems, we thought the best way to start was to reuse one—in the simplest possible way.

With Data Issues, we’re using GitHub Issues to track issues with datasets. Data Issues is essentially just a GitHub repository whose Issues are used to report problems on open datasets. Any problem with any dataset can be reported on Data Issues.

To report an issue with some data, just open an issue in the tracker, add relevant info on the data (its URL, who’s responsible for it, the line number of the bug, etc.), and explain the problem. You can add labels to group related issues—for example, if multiple datasets from the same site have problems, you can add a label that identifies the dataset’s site of origin.
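If you prefer scripts to the web form, you can open the same kind of report through the standard GitHub Issues API. Here is a hedged sketch: the repository name and token below are placeholders, so check the Data Issues page for the real tracker location.

    # Sketch: opening a Data Issues report via the GitHub Issues API instead of
    # the web form. The repository name (okfn/data-issues) and token here are
    # placeholders -- check the Data Issues page for the actual tracker location.
    import requests

    issue = {
        "title": "Invalid dates in GLA spending CSV",
        "body": ("Dataset: http://example.org/spending.csv\n"
                 "Publisher: Example Authority\n"
                 "Problem: the date in column 5, line 3751 is not a valid date."),
        "labels": ["example-authority"],   # labels group related issues
    }

    resp = requests.post(
        "https://api.github.com/repos/okfn/data-issues/issues",
        json=issue,
        headers={"Authorization": "token YOUR_GITHUB_TOKEN"},
    )
    resp.raise_for_status()
    print(resp.json()["html_url"])   # public URL of the newly created issue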

Straightaway, the issue you raise becomes a public notice of the problem with the dataset. Everyone interested in the dataset has access to the issue. The issue is also actionable: each issue contains a thread of comments that can be used to track the issue’s status, and the issue can be closed when it has been fixed. All issues submitted to Data Issues are visible in a central list, which can be filtered by keyword or label to zoom in on relevant issues. All of these great features come for free because we’re using GitHub Issues.

Get Involved

For Data Issues to work, people need to use it. If civic hackers, journalists, and other data wranglers learn about Data Issues and start using it to track their work on datasets, we might find that the problem of tracking issues with datasets has already been solved.

You can also contribute by helping develop the project into something richer than a simple Issues page. One limitation of Data Issues is that raising an issue does not actually contact the parties responsible for the data. Our next goal is to automate sending along feedback from Data Issues, making it a more effective bug tracker.

If you want to discuss new directions for Data Issues or point out something you’ve built that contributes to the project, get in touch via the Labs mailing list.

Introducing TimeMapper – Create Elegant TimeMaps in Seconds

TimeMapper lets you create elegant and embeddable timemaps quickly and easily from a simple spreadsheet.

Medieval philosophers timemap

A timemap is an interactive timeline whose items connect to a geomap. Creating a timemap with TimeMapper is as easy as filling in a spreadsheet template and copying its URL.

In this quick walkthrough, we’ll learn how to recreate the timemap of medieval philosophers shown above using TimeMapper.

Getting started with TimeMapper

To get started, go to the TimeMapper website and sign in using your Twitter account. Then click Create a new Timeline or TimeMap to start a new project. As you’ll see, it really is as easy as 1-2-3.

TimeMapper projects are generated from Google Sheets spreadsheets. Each item on the timemap – an event, an individual, or anything else associated with a date (or two, for the start and end of a period) – is a spreadsheet row.

What can you put in the spreadsheet? Check out the TimeMapper template. It contains all of the columns that TimeMapper understands, plus a row of cells explaining what each of them means. Your timemap doesn’t have to use all of these columns, though—it just requires a Start date, a Title, and a Description for each item, plus geographical coordinates for the map.
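To make that concrete, here is an illustrative sketch of a couple of rows with those minimal columns. The column headings and location format below are our shorthand; the template itself has the authoritative names.

    # Illustrative only: writing a couple of rows with the minimal columns the
    # walkthrough mentions (start date, title, description, coordinates). The
    # exact column headings come from the TimeMapper template, so check it there.
    import csv

    rows = [
        {"Title": "Thomas Aquinas", "Start": "1225", "End": "1274",
         "Description": "Dominican friar and scholastic philosopher.",
         "Location": "41.9028, 12.4964"},   # lat, lon -- format per the template
        {"Title": "William of Ockham", "Start": "1287", "End": "1347",
         "Description": "Franciscan friar, famous for Ockham's razor.",
         "Location": "51.3004, -0.4506"},
    ]

    with open("philosophers.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)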

So you’ve put your data in a Google spreadsheet—how can you make it into a timemap? Easy! From Google Sheets, go to File -> Publish to the web and hit Start publishing. Then click on your sheet’s Share button and set the sheet’s visibility to Anyone who has the link can view. You can either copy the URL from Link to share and paste it into the box in Step 2 of the TimeMapper creation process, or click on Select from Your Google Drive to browse to the sheet. Either way, hit Connect and Publish—and voilà!

Share your spreadsheet

Embedding your new timemap is just as easy as creating it. Click on Embed in the top right corner. It will pop up a snippet of HTML which you can paste into your webpage to embed the timemap. And that’s all it takes!

Embed your timemap

Coming next

We have big plans for TimeMapper, including:

  • Support for indicating size and time on the map
  • Quick creation of TimeMaps using information from Wikipedia
  • Connecting markers on the map to form a route
  • Options for timeline-only and map-only project layouts
  • Disqus-based comments
  • A core JS library, timemapper.js, so you can build your own apps with timemaps

Check out the TimeMapper issues list to see what ideas we’ve got and to leave suggestions.

Code

In terms of internals, the app is a simple Node.js app with storage in S3. The timemap visualization is pure JavaScript, built using Knight Lab’s excellent TimelineJS for the timeline and Leaflet (with OpenStreetMap) for the maps. For those interested, the code can be found at: https://github.com/okfn/timemapper/

History and credits

TimeMapper is made possible by awesome open source libraries like TimelineJS, Backbone, and Leaflet, not to mention open data from OpenStreetMap. When we first built a TimeMapper-style site in 2007 under the title “Weaving History”, it was a real struggle over many months to build a responsive JavaScript-heavy app. Today, thanks to libraries like these and advances in browsers, it’s now a matter of weeks.

PublicBodies.org – Update no. 2

Herewith is a report on recent improvements to PublicBodies.org, our Open Knowledge Foundation Labs project to provide “a URL (and information) on every public body” – that is, every government-funded agency, department or organization.

New data

New data contributed over the last couple of months is now validated and live – this includes new data for Switzerland, Greece, Brazil and the US. A huge thank-you to the contributors, including Hannes, Charalampos, Augusto and Todd.

We also have pending data for Italy and China to add once it has been reviewed, and data in progress for Canada!

We’d love to have more data - if you’re interested in contributing see https://github.com/okfn/publicbodies#contribute-data

Updated Schema for Data

Thanks to input from James McKinney and others we’ve reworked the schema quite extensively to match up as much as possible with the Popolo spec. You can see the new schema in the datapackage.json.

Or, if you don’t care for raw JSON, there’s a prettier HTML version at: http://data.okfn.org/community/okfn/publicbodies

Search support

We now have basic search support via Google Custom Search: http://publicbodies.org/search

Get Involved

As always, we’d love help! There is a full list of issues here, including some example items to get started on.

Data as Code Deja-Vu

Someone just pointed me at this post from Ben Balter about Data as Code, in which he emphasizes the analogies between data and code (and especially between open data and open source – e.g. “data is where code was 2 decades ago” …).

I was delighted to see this post as it makes many points I deeply agree with – and have for some time. In fact, reading it gave me something of a sense of (very positive) déjà vu, since it made similar points to several posts that I and others wrote several years ago – suggesting that perhaps we’re now getting close to the critical mass we need to create a real distributed and collaborative open data ecosystem!

It also suggested it was worth dusting off and recapping some of this earlier material as much of it was written more than 6 years ago, a period which, in tech terms, can seem like the stone age.

Previous Thinking

For example, there is this essay from 2007 on Componentization and Open Data that Jo Walsh and I wrote for our XTech talk that year on CKAN. It emphasized analogies with code and the importance of componentization and packaging.

This, in turn, was based on Four principles for Open Knowledge Development and What do we mean by componentization for knowledge. We also emphasized the importance of “version control” in facilitating distributed collaboration, for example in Collaborative Development of Data (2006/2007) and, more recently, in Distributed Revision Control for Data (2010) and this year in Git (and GitHub) for Data.

Package Managers and CKAN

This also brings me to a point relevant both to Ben’s post and to Michal’s comment: the original purpose (and design) of CKAN was precisely to be a package manager à la rubygems, pypi, debian etc. Because of user demand, it has since evolved a lot, into more of a “wordpress for data” – i.e. a platform for publishing, managing (and storing) data. (Note that in early CKAN “datasets” were called packages in both the interface and the code – a poor UX decision ;-) that showed we were definitely ahead of our time – or just wrong!)

Some sense of what was intended is evidenced by the fact that in 2007 we were writing a command-line tool called datapkg (since renamed dpm, for data package manager) to act as the command-line equivalent of gem / pip / apt-get – see this Introducing DataPkg post, which included this diagram illustrating how things were supposed to work.

Recent Developments

As CKAN has evolved into a more general-purpose tool – with less of a focus on just being a registry supporting automated access – we’ve continued to develop those ideas. For example:

Data Packages and Frictionless Data - from data.okfn.org

Data Pipes – streaming online data transformations

Data Pipes provides an online service, built in NodeJS, for simple data transformations – deleting rows and columns, find and replace, filtering, viewing as HTML – and, furthermore, for connecting these transformations together in the style of Unix pipes to build more complex ones. Because Data Pipes is a web service, data transformation takes place entirely online, and the results and process are completely shareable simply by sharing the URL.

An example

The pipeline below takes the input data (sourced from this original Greater London Authority financial data), slices out the first 50 rows (head), deletes the first column (it’s blank!) (cut), deletes rows 1 through 7 (delete) and finally renders the result as HTML (html).

http://datapipes.okfnlabs.org/csv/head -n 50/cut 0/delete 1:7/html?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv

Before

Data pipes: GLA data, HTML view

After

Data pipes: GLA data, trimmed
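If you’d rather fetch the transformed output from a script than a browser, here is a minimal sketch using the same pipeline (requests percent-encodes the spaces in the operation arguments for you):

    # Sketch: fetching the same pipeline from a script. requests will
    # percent-encode the spaces in the operation arguments for us.
    import requests

    base = "http://datapipes.okfnlabs.org"
    pipeline = "/csv/head -n 50/cut 0/delete 1:7/html"
    source = "https://raw.github.com/okfn/datapipes/master/test/data/gla.csv"

    resp = requests.get(base + pipeline, params={"url": source})
    resp.raise_for_status()
    print(resp.text[:300])   # start of the HTML rendering of the cleaned table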

Motivation - Data Wrangling, Pipes, NodeJS and the Unix Philosophy

When you find data in the wild, you usually need to poke around in it and then do some cleaning before it is usable.

Much of the inspiration for Data Pipes comes from our experience using Unix command-line tools like grep, sed, and head to do this kind of work. These tools provide a powerful way to operate on streams of text (or, more precisely, streams of lines of text, since Unix tools process text files line by line). By using streams, they can scale to large files easily (they don’t load the whole file but process it bit by bit) and, more importantly, they allow “piping” – that is, direct connection of the output of one command with the input of another.

This already provides quite a powerful way to do data wrangling (see here for more). But there are limits: data isn’t always line-oriented, and command-line tools aren’t online, so it’s difficult to share and repeat what you are doing. Inspired by Unix pipes and by NodeJS’s great streaming capabilities, we wanted to take the pipes online for data processing – and so Data Pipes was born.

We wanted to use the Unix philosophy that teaches us to solve problems with cascades of simple, composable operations that manipulate streams, an approach which has proven almost universally effective.

Data Pipes brings the Unix philosophy and the Unix pipes style to online data. Any CSV data can be piped through a cascade of transformations to produce a modified dataset, without ever downloading the data and with no need for your own backend. Being online means that the operations are immediately shareable and linkable.

More Examples

Take, for example, this copy of a set of Greater London Authority financial data. It’s unusable for most purposes, simply because it doesn’t abide by the CSV convention that the first line should contain the headers of the table. The header is preceded by six lines of useless commentary. Another problem is that the first column is totally empty.

Data pipes: Greater London Authority financial data, in the raw

First of all, let’s use the Data Pipes html operation to get a nicer-looking view of the table.

GET /csv/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv

Data pipes: GLA data, HTML view

Now let’s get rid of those first six lines and the empty column. We can do this by chaining together the delete operation and the cut operation:

GET /csv/delete 0:6/cut 0/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv

And just like that, we’ve got a well-formed CSV!

Data pipes: GLA data, trimmed

But why stop there? Why not take the output of that transformation and, say, search it for the string “LONDON” with the grep transform, then take just the first 20 entries with head?

GET /csv/delete 0:6/cut 0/grep LONDON/head -n 20/html/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv

Data pipes: GLA data, final view

Awesome!

What’s next?

Data Pipes already supports a useful collection of operations, but it’s still in development, and more are yet to come, including a find-and-replace operation (sed), plus support for arbitrary map and filter functions.

You can see the full list on the Data Pipes site, and you can suggest more transforms to implement by raising an issue.

Data Pipes needs more operations for its toolkit. That means its developers need to know what you do with data – and to think about how it can be broken down in the grand old Unix fashion. To join in, check out Data Pipes on GitHub and let us know what you think.

data.okfn.org – update no. 2

data.okfn.org is the Labs’ repository of high-quality, easy-to-use open data. This update summarizes some of the improvements to data.okfn.org that have taken place over the past two months.

New tools

Several tools which make it easier to use the Data Package standard are now operational. These include a Data Package Creator and a Data Package Viewer, and there’s progress on a validator for Data Packages.

Data Package Creator

Turning a CSV into a Data Package means creating a file, datapackage.json, which houses the metadata associated with the CSV. The Data Package Creator simplifies this process.

Provide the Creator with the URL of a CSV and it will return a well-formed JSON object with the required fields, as well as a raw JSON URL (the JSON URL serves as a basic machine-accessible API).

Data Package Creator in action
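For reference, here is roughly the shape of datapackage.json the Creator produces, written by hand as a sketch against the Data Package spec; the dataset name, URL and fields below are placeholders.

    # Roughly the kind of datapackage.json the Creator generates for a CSV --
    # a sketch following the Data Package spec; the name, URL and field list
    # are placeholders for whatever your CSV actually contains.
    import json

    datapackage = {
        "name": "gold-prices",
        "title": "Gold prices (example)",
        "resources": [{
            "url": "https://example.org/data/gold-prices.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "date", "type": "date"},
                    {"name": "price", "type": "number"},
                ]
            },
        }],
    }

    with open("datapackage.json", "w") as f:
        json.dump(datapackage, f, indent=2)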

Data Package Viewer

The metadata included with Data Packages makes it possible to construct a simple view of the data. We now provide an online Data Package Viewer to do this for you.

Just provide the link to your Data Package and the Viewer generates a user-friendly description, a graph of the data, and a summary of the data fields. Here, for example, is the Viewer’s display of US wheat production data.

Data Package Viewer in action

New datasets

The biggest data news was having our first ‘out-of-the-blue’ contribution of an ‘official’ dataset! Evan Wheeler pinged us to offer a comprehensive collection of country codes for the world’s countries in Simple Data Format. Here it is:

Country codes data, table view

Also new:

If you want to contribute a new dataset, check out the instructions and the outstanding requests.

New standards pages

Among data.okfn.org’s chief purposes is promoting simple standards for data transport, in the form of Data Package and Simple Data Format – helping to create a world of frictionless data.

Key here is providing simple, easy-to-understand information, so we’ve revamped the standards page and created two new pages dedicated to providing a simple introduction and overview for Data Package and Simple Data Format.

Get involved

Anyone can contribute, and it’s easy – if you can use a spreadsheet, you can help!

Instructions for getting involved can be found here.