Labs newsletter: 12 December, 2013

We’re back after taking a break last week, with a bumper crop of updates. A few things have changed: Labs activities are now coordinated entirely through GitHub. Meanwhile, there have been updates around the Nomenklatura, Annotator, and Data Protocols projects, and some new posts on the Labs blog.

Migration from Trello to GitHub

For some time now, Labs activities requiring coordination have been organized on Trello—but those days are now over. Labs has moved its organizational setup over to GitHub, coordinating actions and making plans by means of GitHub issues. This change comes as a big relief to the many Labs members who already use GitHub as their main platform for collaboration.

General Labs-related activities are now tracked on the Labs site’s issues, and activities around individual projects are managed (as before!) through those projects’ own issues.

New Bad Data

New examples of bad data continue to roll in—and we welcome further submissions.

Bad datasets added since last newsletter include the UK’s Greater London Authority spend data (65+ files with 25+ different structures!), Nature Magazine’s supplementary data (an awful PDF jumble), and more.

Nomenklatura: new alpha

As we’ve previously noted, Labs member Friedrich Lindenberg has been thinking about producing “a fairly radical re-framing” of the Nomenklatura data reconciliation service.

Friedrich has now released an alpha of the new Nomenklatura at nk-dev.pudo.org. The major changes in this alpha include:

  • A fully JavaScript-driven frontend
  • String matching now happens inside the PostgreSQL database
  • Better introductory text explaining what Nomenklatura does
  • “entity” and “alias” domain objects have been merged into “entity”

Friedrich is keen to hear what people think about this prototype—so jump in, give it a try, and leave your comments at the Nomenklatura repo.

Annotator v1.2.9

A new maintenance release of Annotator came out ten days ago. This version is intended to be one of the last in the v1.2.x series—indeed, v1.2.8 was meant to be the last, but it had some significant issues that this new release corrects.

Fixes in this version include:

  • Fixed a major packaging error in v1.2.8. Annotator no longer exports an excessive number of tokens to the page namespace.
  • Notification display bugfixes. Notification levels are now correctly removed after notifications are hidden.

The new Annotator is available, as always, from GitHub.

Data Protocols updates

Data Protocols is a project to develop simple protocols and formats for working with open data. Rufus Pollock cross-posted to the list about several new Data Protocols developments of interest to Labs, including:

  • Close to final agreement on a spec for adding “primary keys” to the JSON Table Schema (discussion)
  • Close to consensus on spec for “foreign keys” (discussion)
  • Proposal for a JSON spec for views of data, e.g. graphs or maps (discussion)

For more, check out Rufus’s message and the Data Protocols issues.
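To make the proposals concrete, here is a rough Python sketch of what a JSON Table Schema carrying a primary key and a foreign key along these lines might look like. The exact keywords were still under discussion at the time of writing, and the field names below are made up for illustration:

    import json

    # A hypothetical JSON Table Schema for a spending table.
    # "primaryKey" and "foreignKeys" reflect the proposals under
    # discussion; the final spec may differ.
    schema = {
        "fields": [
            {"name": "id", "type": "integer"},
            {"name": "department_id", "type": "string"},
            {"name": "amount", "type": "number"},
        ],
        "primaryKey": "id",
        "foreignKeys": [
            {
                "fields": "department_id",
                "reference": {"resource": "departments", "fields": "id"},
            }
        ],
    }

    print(json.dumps(schema, indent=2))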

On the blog

Labs members have added a couple of new posts to the blog since the last newsletter. Yours truly (with extensive help from Rufus) posted on using Data Pipes to view a CSV. Michael Bauer, meanwhile, wrote about the new Reconcile-CSV service he developed while working on education data in Tanzania. Look to the Labs blog for the full scoop.

Get involved

If you have some spare time this holiday season, why not spend it helping out with Labs? We’re always looking for new people to join the community—visit the Labs issues and the Ideas Page to get some ideas for how you can join in.

View a CSV (Comma Separated Values) in Your Browser

This post introduces one of the handiest features of Data Pipes: fast (pre)viewing of CSV files in your browser (and you can share the result just by copying a URL).

The Raw CSV

CSV files are frequently used for storing tabular data and are widely supported by spreadsheets and databases. However, you usually can’t view a CSV file in your browser; most browsers will simply download it instead. And even when you can see the raw CSV, it’s not very pleasant to read:

Raw CSV

The Result

But using the Data Pipes html feature, you can turn an online CSV into a pretty HTML table in a few seconds. For example, the CSV you’ve just seen becomes this table:

CSV, HTML view

Using it

To use this service, just visit http://datapipes.okfnlabs.org/html/ and paste your CSV’s URL into the form.

For power users (or for use from the command line or an API), you can just append your CSV’s URL to:

http://datapipes.okfnlabs.org/csv/html/?url=
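For example, here is a minimal Python sketch of calling that endpoint from a script, using the requests library:

    import requests

    # Any publicly accessible CSV will do; this one is from the
    # Data Pipes test data.
    csv_url = "https://raw.github.com/okfn/datapipes/master/test/data/gla.csv"

    # Append the CSV's URL to the html endpoint...
    response = requests.get("http://datapipes.okfnlabs.org/csv/html/",
                            params={"url": csv_url})

    # ...and the response body is an HTML table rendering of the CSV.
    print(response.text)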

Previewing Just the First Part of a CSV File

You can also extend this basic previewing using other Data Pipes features. For example, suppose you have a big CSV file (say, one with more than a few thousand rows). If you tried to turn the whole thing into an HTML table and view it in your browser, it would probably crash the browser.

So what if you could see just part of the file? After all, you may well only be interested in what the CSV file looks like, not in every row. Fortunately, Data Pipes supports showing just the first 10 lines of a CSV file via its head operation. To demonstrate, let’s extend our example above to use head. This gives us the following URL (click to see the live result):

http://datapipes.okfnlabs.org/csv/head/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv

Colophon

Data Pipes is a free and open service run by Open Knowledge Foundation Labs. You can find the source code on GitHub at: https://github.com/okfn/datapipes. It is also available as a Node library and command-line tool.

If you like previewing CSV files in your browser, you might also be interested in the Recline CSV Viewer, a Chrome plugin that automatically turns CSVs into searchable HTML tables in your browser.

Labs newsletter: 28 November, 2013

Another busy week at the Labs! We’ve had lots of discussion around the idea of “bad data”, a blog post about Mark’s aid tracker, new PyBossa developments, and a call for help with a couple of projects. Next week we can look forward to another Open Data Maker Night in London.

Bad Data

Last Friday, Rufus announced Bad Data, a new educational mini-project that highlights real-world examples of how data shouldn’t be published.

This announcement was greeted with glee and with contributions of new examples. Open government activist Ivan Begtin chimed in with the Russian Ministry of the Interior’s list of regional offices and the Russian government’s tax rates for municipalities. Labs member Friedrich Lindenberg added the German finance ministry’s new open data initiative. As Andy Lulham said, “bad data” will be very useful for testing the new Data Pipes operators.

You can follow the whole discussion thread in the list archive.

Blog post: Looking at aid in the Philippines

At last week’s hangout, you heard about Mark Brough’s new project, a browser for aid projects in the Philippines generated from IATI data.

Now you can read more about Mark’s project on the blog, learning about where the data comes from, how the site is generated from the data (interestingly, it uses the Python-based static site generator Frozen-Flask), and what Mark plans to do next.

New PyBossa cache system

Labs member and citizen science expert Daniel Lombraña González has been “working really hard to add a new cache system to PyBossa”, the open source crowdsourcing platform.

As Daniel has discovered, the Redis key-value store meets all his requirements for a load-balanced, high-availability, persistent cache. As he put it: “Redis is amazing. Let me repeat it: amazing.”

Read the blog post to learn more about the new Redis-based PyBossa setup and its benefits.
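For a flavour of what a Redis-backed cache looks like in Python, here is a minimal sketch (not PyBossa’s actual implementation) using the redis-py client and assuming a Redis server on localhost:

    import redis

    # Assumes a Redis server listening on localhost:6379.
    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    def cached(key, ttl, compute):
        """Return the cached value for `key`, computing and storing it
        with a time-to-live of `ttl` seconds on a cache miss."""
        value = r.get(key)
        if value is None:
            value = compute()
            r.set(key, value)
            r.expire(key, ttl)
        return value

    # Example: cache an expensive computation for five minutes.
    stats = cached("app:1:stats", 300, lambda: "42 tasks completed")
    print(stats)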

Contributions needed: iOS and Python development

Philippe Plagnol of Product Open Data needs a few good developers to help with some projects.

Firstly, the Product Open Data Android app has been out for a while (source code), and it’s high time there was a port for Apple devices. If you’re interested in contributing to the port, leave a comment at this GitHub issue.

Secondly, work is now underway on a brand repository which will assign a Brand Standard Identifier Number (BSIN) to each brand worldwide, making it possible to integrate products into the product repository. Python developers are needed to help make this happen. If you want to help out, join in this GitHub thread. (Lots of people have already signed up!)

Next week: Open Data Maker Night London #7

Next week, on Wednesday the 4th of December, the seventh London Open Data Maker Night takes place. Anyone interested in building tools or insights from data is invited to drop in any time after 6:30 and join the fun. (Please note that the event will take place on Wednesday rather than the usual Tuesday.)

What is an Open Data Maker Night? Read more about them here.

Get involved

Labs is always looking for new contributors. Read more about how you can join the community, whether you’re a coder, a data wrangler, or a communicator, and check out the Ideas Page to see what else is brewing.

Labs newsletter: 21 November, 2013

This week, Labs members gathered in an online hangout to discuss what they’ve been up to and what’s next for Labs. This special edition of the newsletter recaps that hangout for those who weren’t there (or who want a reminder).

Data Pipes update

Last week you heard about Andy Lulham’s improvements to Data Pipes, the online streaming data transformations service. He didn’t stop there, and in this week’s hangout, Andy described some of the new features he has been adding:

  • parse and render are now streaming operations
  • option parsing now uses optimist
  • a basic command-line interface
  • … and much, much more

Coming up next: map & filter with arbitrary functions!

Crowdcrafting: progress and projects

New Shuttleworth fellow Daniel Lombraña González reported on progress with CrowdCrafting, the citizen science platform built with PyBossa.

CrowdCrafting now has more than 3,500 users (though Daniel cautions that this doesn’t mean much in terms of participation), and the site has more answers than tasks.

Last week, the team at MicroMappers used CrowdCrafting to classify tweets about the typhoon disaster in the Philippines. Digital mapping activists SkyTruth, meanwhile, have used CrowdCrafting to map and track fracking sites in the northeast United States. Daniel has also been in contact with EpiCollect about a project on trash collection in Spain.

Open Data Button

Labs member Oleg Lavrovsky discussed the Open Data Button, an interesting fork of the recently-launched Open Access Button.

The Open Access Button, an idea of the Open Science working group at OKCon 2013, is a bookmarklet that allows users to report their experiences of having their research blocked by paywalls. The Open Data Button applies this same idea to Open Data: users can use it to report their problems with legal and technical restrictions on data. (As Rufus pointed out, this ties in nicely with the IsItOpenData project.)

Queremos Saber

Labs ally Vítor Baptista reported on a new development with Queremos Saber, the Brazilian FOI request portal.

Changes in the way the Brazilian federal government accepts FOI requests have caused problems for Queremos Saber. The federal government no longer accepts requests by email, forcing the use of a specialized FOI system which it is now promoting for local governments as well. This limits the number of places that will accept requests from Queremos Saber.

A solution to this problem is underway: an email-based API that will take emails received at certain addresses (e.g. ministryofhealthcare@queremossaber.org.br) and turn them into instructions for a web crawler to create an FOI request in the appropriate system. An interesting side effect of this would be the creation of an anonymization layer, allowing users to bypass the legal requirement that FOI requests not be placed anonymously.
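As a rough illustration of the idea (not the actual implementation), here is a Python sketch of the first step: mapping an inbound email to instructions for a crawler. All of the names here are hypothetical:

    from email.parser import Parser

    # Hypothetical mapping from inbox addresses to government FOI systems.
    TARGETS = {
        "ministryofhealthcare@queremossaber.org.br": "esic/ministry-of-healthcare",
    }

    def email_to_instruction(raw_message):
        """Turn a raw inbound email into a crawler instruction dict."""
        msg = Parser().parsestr(raw_message)
        return {
            "system": TARGETS[msg["To"]],  # which FOI system to file into
            "subject": msg["Subject"],     # request title
            "body": msg.get_payload(),     # request text; sender stays anonymous
        }

    raw = ("To: ministryofhealthcare@queremossaber.org.br\n"
           "Subject: Hospital spending data\n\n"
           "Please provide hospital spending data for 2013.")
    print(email_to_instruction(raw))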

Philippines Projects

Labs data wrangler Mark Brough showed off a test project collecting data on aid activities in the Philippines. Mark’s small static site, updated each night, collects IATI aid data on projects in the Philippines and republishes it in a more browsable form.

Mark also discussed another data-mashup project, still in the planning stage, that would combine budget and aid data for Tanzania (or any other developing country)—similar to Publish What You Fund’s old Uganda project but based on a non-static dataset.

Global Economic Map

Alex Peek discussed his initiative to create the Global Economic Map, “a collection of standardized data set of economic statistics that can be applied to every country, region and city in the world”.

The GEM will draw data from sources like government publications and SEC filings and will cover eleven statistics that touch on GDP, employment, corporations, and budgets. The GEM aims to be fully integrated with Wikidata.

Frictionless data

Finally, Rufus Pollock discussed data.okfn.org and the mission of “frictionless data”: making it “as simple as possible to get the data you want into the tool of your choice.”

data.okfn.org aims to help achieve this goal by promoting, among other things, simple data standards and the tooling to support them. As reported in last week’s newsletter, this now includes a Data Package Manager based on npm, working at a very basic level. It also includes the data.okfn.org Data Package Viewer, which provides a nice view on data packages hosted on GitHub, S3, or wherever else.
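At the heart of these standards is the datapackage.json descriptor. Here is a minimal Python sketch of one, with just enough metadata to name a package and point at its data (the values are hypothetical):

    import json

    # A minimal Data Package descriptor: a name plus a list of resources.
    datapackage = {
        "name": "gold-prices",
        "title": "Gold Prices (example)",
        "resources": [
            {"path": "data/data.csv", "format": "csv"},
        ],
    }

    with open("datapackage.json", "w") as f:
        json.dump(datapackage, f, indent=2)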

Improving the Labs site

The hangout wrapped up with a discussion of how to improve the Labs site. Besides some discussion of the possibility of a one-click creation system for Open Data Maker Nights, talk focused on improving the projects page.

Oleg, who has volunteered to take the lead in reforming the projects page, highlighted the need for a way to differentiate projects by their activity level and their need for more contributors. Mark agreed, suggesting also that it would be nice to be able to filter projects by the languages and technologies they use. Both ideas were proposed as a way to fill out Tod Robbins’s suggestion that the projects page needs categories.

See the Labs hangout notes for the full details of this discussion.

Get involved

As always, Labs wants you to join in and get involved! Read more about how you can join the community and participate by coding, wrangling data, or doing outreach and engagement, and have a look at the Ideas Page to see what other members have been thinking.

Bad Data: real-world examples of how not to do data

We’ve just started a mini-project called Bad Data. Bad Data provides real-world examples of how not to publish data. It showcases the poorly structured, the mis-formatted, and the just plain ugly.

This isn’t about being critical but about educating—providing examples of how not to do something may be one of the best ways of showing how to do it right. It also provides a source of good practice material for budding data wranglers!

Bad Data: ASCII spreadsheet

Each “bad” dataset gets a simple page on the site that shows what’s wrong, along with a preview or screenshot.

We’ve started to stock the site with some of the better examples of bad data that we’ve come across over the years. This includes machine-unreadable Transport for London passenger numbers from the London Datastore and a classic “ASCII spreadsheet” from the US Bureau of Labor Statistics.

We welcome contributions of new examples! Submit them here.

Labs newsletter: 14 November, 2013

Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post.

Data Pipes: lots of improvements

Data Pipes is a Labs project that provides a web API for a set of simple data-transforming operations that can be chained together in the style of Unix pipes.

This past week, Andy Lulham has made a huge number of improvements to Data Pipes. Just a few of the new features and fixes:

  • new operations: strip (removes empty rows), tail (truncate dataset to its last rows)
  • new features: a range function and a “complement” switch for cut; options for grep
  • all operations in pipeline are now trimmed for whitespace
  • basic tests have been added

Have a look at the closed issues to see more of what Andy has been up to.
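Because operations are chained Unix-pipe style in the URL path, these new operations compose with the existing ones. Here is a small Python sketch of one such pipeline (the particular combination is illustrative): strip empty rows, take the head of the file, then render it as HTML:

    import requests

    csv_url = "https://raw.github.com/okfn/datapipes/master/test/data/gla.csv"

    # Chain operations in the URL path:
    # parse CSV -> strip empty rows -> keep first rows -> render HTML.
    pipeline = "http://datapipes.okfnlabs.org/csv/strip/head/html/"

    response = requests.get(pipeline, params={"url": csv_url})
    print(response.text)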

Webshot: new homepage and feature

Last week we introduced you to Webshot, a web API for screenshots of web pages.

Back then, Webshot’s home page was just a screenshot of GitHub. Now Webshot has a proper home page with a form interface to the API.

Webshot has also added support for full page screenshots. Now you can capture the whole page rather than just its visible portion.

On the blog: natural language processing with Python

Labs member Tarek Amr has contributed an awesome post on Python natural language processing with the NLTK toolkit to the Labs blog.

“The beauty of NLP,” Tarek says, “is that it enables computers to extract knowledge from unstructured data inside textual documents.” Read his post to learn how to do text normalization, frequency analysis, and text classification with Python.
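As a small taste of what the post covers, here is a minimal NLTK sketch of normalization and frequency analysis (it assumes you have fetched the punkt and stopwords data via nltk.download()):

    import nltk
    from nltk.corpus import stopwords

    text = ("NLP enables computers to extract knowledge from "
            "unstructured data inside textual documents.")

    # Normalize: tokenize into words and lowercase them.
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

    # Drop common English stopwords.
    words = [t for t in tokens if t not in stopwords.words("english")]

    # Frequency analysis: count what remains.
    freq = nltk.FreqDist(words)
    for word, count in freq.most_common(5):
        print(word, count)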

Data Packages workflow à la Node

Wouldn’t it be nice to be able to initialize new Data Packages as easily as you can initialize a Node module with npm init?

Max Ogden started a discussion thread around this enticing idea, eventually leading to Rufus Pollock booting a new repo for dpm, the Data Package Manager. Check out dpm’s Issues to see what needs to happen next with this project.

Nomenklatura: looking forward

Nomenklatura is a Labs project that does data reconciliation, making it possible “to maintain a canonical list of entities such as persons, companies or even streets and to match messy input, such as their names, against that canonical list”.

Friedrich Lindenberg has noted on the Labs mailing list that Nomenklatura has some serious problems, and he has proposed “a fairly radical re-framing of the service”.

The conversation around what this re-framing should look like is still underway—check out the discussion thread and jump in with your ideas.

Data Issues: following issues

Last week, the idea of Data Issues was floated: using GitHub Issues to track problems with public datasets. The idea has generated a few comments, and we’d love to hear more.

Discussion on the Labs list highlighted another benefit of using GitHub. Alioune Dia suggested that Data Issues should let users register to be notified when a particular issue is fixed. But Chris Mear pointed out that GitHub already makes this possible: “Any GitHub user can ‘follow’ a specific issue by using the notification button at the bottom of the issue page.”

Get involved

Anyone can join the Labs community and get involved! Read more about how you can join the community and participate by coding, wrangling data, or doing outreach and engagement. Also check out the Ideas Page to see what’s cooking in the Labs.

Northern Mariana Islands Retirement Fund Bankruptcy

Back on April 17, 2012, the Northern Mariana Islands Retirement Fund attempted to file for bankruptcy under Chapter 11. There was some pretty interesting reading in its petition for bankruptcy, including this section (para 10), which suggests some pretty bad public financial management (emphasis added): “Debtor has had difficulty maintaining healthy funding levels due to […]

Labs newsletter: 7 November, 2013

There was lots of interesting activity around Labs this week, with two launched projects, a new initiative in the works, and an Open Data Maker Night in London.

Webshot: online screenshot service

webshot.okfnlabs.org, an online service for taking screenshots of websites, is now live, thanks to Oliver Searle-Barnes and Simon Gaeremynck.

Try it out with an API call like this:

http://webshot.okfnlabs.org/?url=http://okfnlabs.org&width=640&height=480
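Or, to grab a screenshot from a script, something like this Python sketch would do (it uses the same documented parameters and assumes the endpoint returns the image bytes directly):

    import requests

    # Request a 640x480 screenshot of okfnlabs.org.
    response = requests.get("http://webshot.okfnlabs.org/",
                            params={"url": "http://okfnlabs.org",
                                    "width": 640, "height": 480})

    # The response body is the image itself; save it to disk.
    with open("okfnlabs.png", "wb") as f:
        f.write(response.content)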

Read more about the development behind the service here.

Product Open Data Android app

The first version of the Android app for Product Open Data has launched, allowing you to conveniently look up open data associated with a product on your phone.

The source code for the app is available on GitHub.

Crowdcrafting for Public Bodies

PublicBodies.org aims to provide “a URL for every part of government”. Many entries in the database lack good description text, though, making them harder to use effectively. Fixing this would be a good use of CrowdCrafting.org, the crowd-sourcing platform powered by PyBossa.

Rufus suggests starting small, beginning with EU public bodies. It should be easy to build a CrowdCrafting app to cover those, says Daniel Lombraña González. Friedrich Lindenberg thinks this approach could work for other datasets as well.

Discussion of this idea is still happening on the list, so jump in and say what you think—or help build the app!

Open Data Maker Night #6

The sixth Open Data Maker Night took place this past Tuesday in London. Open Data Maker Nights are informal events where people make things with open data, whether apps or insights.

This night’s focus was on adding more UK and London data to OpenSpending, and it featured special guest Max Ogden. It was hosted by the Centre for Creative Collaboration.

Our next Open Data Maker Night will happen in early December. If you want to organize your own, though, it’s super easy: just see the Open Data Maker Night website for help booting, promoting, and running the event.

Tracking Issues with Data the Simple Way

Data Issues is a prototype initiative to track “issues” with data using a simple bug tracker—in this case, GitHub Issues.

We’ve all come across “issues” with data, whether it’s “data” that turns out to be provided as a PDF, the many ways to badly format tabular data (empty rows, empty columns, inlined metadata …), “ASCII spreadsheets”, or simply erroneous data.

A key first step in improving data quality is a way to report and record these issues.

We’ve thought about ways to address this for quite some time and, led by Labs member Friedrich Lindenberg, even experimented with building our own service. But recently, thanks to a comment from Labs member David Miller, we were hit with a blinding insight: why not do the simplest thing possible and just use an existing bug tracker? And so the current version of Data Issues, based on a GitHub issue tracker, was born!

Data Issues

Aside: Before you decide we were completely crazy not to see this in the first place, it should be said that doing data issues “properly” probably does, in the medium term, require something a bit more than a normal bug tracker. For example, it would be nice to be able both to pinpoint an issue precisely (e.g. the date in column 5 on line 3751 is invalid) and to group similar issues (e.g. all amounts in column 7 have commas in them). Doing this would require a tracker customized for data. The solution described in this post, however, seems like a great way to get started.

Introducing Data Issues

Given the existence of so many excellent issue-tracking systems, we thought the best way to start was to reuse one—in the simplest possible way.

With Data Issues, we’re using GitHub Issues to track issues with datasets. Data Issues is essentially just a GitHub repository whose Issues are used to report problems on open datasets. Any problem with any dataset can be reported on Data Issues.

To report an issue with some data, just open an issue in the tracker, add relevant info on the data (its URL, who’s responsible for it, the line number of the bug, etc.), and explain the problem. You can add labels to group related issues—for example, if multiple datasets from the same site have problems, you can add a label that identifies the dataset’s site of origin.
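Since Data Issues lives on GitHub, you can even file issues programmatically through the standard GitHub API. Here is a rough Python sketch; the repository path, token, and issue details are placeholders:

    import requests

    # Placeholder repository path and personal access token.
    repo = "okfn/data-issues"
    token = "YOUR_GITHUB_TOKEN"

    issue = {
        "title": "Invalid dates in GLA spend data",
        "body": ("The date in column 5 on line 3751 of "
                 "http://example.org/spend.csv is not a valid date."),
        "labels": ["gla"],
    }

    # POST /repos/:owner/:repo/issues creates a new issue.
    response = requests.post(
        "https://api.github.com/repos/%s/issues" % repo,
        json=issue,
        headers={"Authorization": "token %s" % token},
    )
    print(response.status_code)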

Straightaway, the issue you raise becomes a public notice of the problem with the dataset. Everyone interested in the dataset has access to the issue. The issue is also actionable: each issue contains a thread of comments that can be used to track the issue’s status, and the issue can be closed when it has been fixed. All issues submitted to Data Issues are visible in a central list, which can be filtered by keyword or label to zoom in on relevant issues. All of these great features come for free because we’re using GitHub Issues.

Get Involved

For Data Issues to work, people need to use it. If civic hackers, journalists, and other data wranglers learn about Data Issues and start using it to track their work on datasets, we might find that the problem of tracking issues with datasets has already been solved.

You can also contribute by helping develop the project into something richer than a simple Issues page. One limitation of Data Issues is that raising an issue does not actually contact the parties responsible for the data. Our next goal is to automate sending along feedback from Data Issues, making it a more effective bug tracker.

If you want to discuss new directions for Data Issues or point out something you’ve built that contributes to the project, get in touch via the Labs mailing list.

Introducing TimeMapper – Create Elegant TimeMaps in Seconds

TimeMapper lets you create elegant and embeddable timemaps quickly and easily from a simple spreadsheet.

Medieval philosophers timemap

A timemap is an interactive timeline whose items connect to a geomap. Creating a timemap with TimeMapper is as easy as filling in a spreadsheet template and copying its URL.

In this quick walkthrough, we’ll learn how to recreate the timemap of medieval philosophers shown above using TimeMapper.

Getting started with TimeMapper

To get started, go to the TimeMapper website and sign in using your Twitter account. Then click Create a new Timeline or TimeMap to start a new project. As you’ll see, it really is as easy as 1-2-3.

TimeMapper projects are generated from Google Sheets spreadsheets. Each item on the timemap – an event, an individual, or anything else associated with a date (or two, for the start and end of a period) – is a spreadsheet row.

What can you put in the spreadsheet? Check out the TimeMapper template. It contains all of the columns that TimeMapper understands, plus a row of cells explaining what each of them means. Your timemap doesn’t have to use all of these columns, though—it just requires a Start date, a Title, and a Description for each item, plus geographical coordinates for the map.

So you’ve put your data in a Google spreadsheet—how can you make it into a timemap? Easy! From Google Sheets, go to File -> Publish to the web and hit Start publishing. Then click on your sheet’s Share button and set the sheet’s visibility to Anyone who has the link can view. You can either copy the URL from Link to share and paste it into the box in Step 2 of the TimeMapper creation process, or click on Select from Your Google Drive to browse to the sheet. Either way, then hit Connect and Publish—and voilà!

Share your spreadsheet

Embedding your new timemap is just as easy as creating it. Click on Embed in the top right corner. It will pop up a snippet of HTML which you can paste into your webpage to embed the timemap. And that’s all it takes!

Embed your timemap

Coming next

We have big plans for TimeMapper, including:

  • Support for indicating size and time on the map
  • Quick creation of timemaps using information from Wikipedia
  • Connecting markers on the map to form a route
  • Options for timeline-only and map-only project layouts
  • Disqus-based comments
  • A core JS library, timemapper.js, so you can build your own apps with timemaps

Check out the TimeMapper issues list to see what ideas we’ve got and to leave suggestions.

Code

Internally, the app is a simple Node.js app with storage in S3. The timemap visualization is pure JavaScript, built using KnightLab’s excellent TimelineJS for the timeline and Leaflet (with OpenStreetMap) for the maps. For those interested, the code can be found at: https://github.com/okfn/timemapper/

History and credits

TimeMapper is made possible by awesome open source libraries like TimelineJS, Backbone, and Leaflet, not to mention open data from OpenStreetMap. When we first built a TimeMapper-style site in 2007 under the title “Weaving History”, it was a real struggle over many months to build a responsive JavaScript-heavy app. Today, thanks to libraries like these and advances in browsers, it’s a matter of weeks.