Labs newsletter: 20 February, 2014

The past few weeks have seen major improvements to the Labs website, another Open Data Maker Night in London, updates to the TimeMapper project, and more.

Labs Hangout: today

The next Labs online hangout is taking place today in just a few hours—now’s your chance to sign up on the hangout’s Etherpad!

Labs hangouts are informal online gatherings held on Google Hangout at which Labs members and friends get together to discuss their work and to set the agenda for Labs activities.

Today’s hangout will take place at 1700 - 1800 GMT. Check the hangout pad for more details, and watch the pad for notes from the meeting.

Crowdcrafting at Citizen Cyberscience Summit 2014

In today’s other news, Labs’s Daniel Lombraña González is presenting Crowdcrafting at the Citizen Cyberscience Summit 2014. You can read more about his presentation here.

Crowdcrafting is an open-source citizen science platform that “empowers citizens to become active players in scientific projects by donating their time in order to solve micro-task problems”. Crowdcrafting has been used by institutions including CERN, the United Nations, and the National Institute of Space Research of Brazil.

Labs site updates

Labs has been discussing improving the website for some time now, and the past weeks have seen many of those proposed improvements being put into action.

One of the biggest changes is a new projects page. Besides having a beautiful new layout, the new projects page implements filtering by tags, language, and more.

The site now also features reciprocal linking of users and projects. The projects page now shows projects’ maintainers (n.b. plural!), and user pages now show which projects each user contributes to (e.g. Andy Lulham’s page highlights his Data Pipes contributions).

TimeMapper improvements

TimeMapper is a Labs project allowing you to create elegant timelines with inline maps from Google Spreadsheets in a matter of seconds.

A number of improvements have been made to TimeMapper:

Open Data Maker Night February

Two weeks ago today, the ninth Open Data Maker Night London was hosted by Andy Lulham. This edition was a mapping special, featuring OpenStreetMap contributor Harry Wood.

Open Data Maker Nights are informal, action-oriented get-togethers where things get made with open data. Visit the Labs website for more information on them, including info on how to host your own.

DataPackage + Bubbles

In last week’s newsletter, you heard about Štefan Urbánek’s abstract data processing framework Bubbles. Štefan just notified the OKFN Labs list that he has created a demo of Bubbles using Data Packages, Labs’s simple standard for data publication.

“The example is artificial”, Štefan says, but it highlights the power of the Bubbles framework and the potential of the Data Package format.

Get involved

We’re always looking for new contributions at the Labs. Read about how you can join, and see the Ideas Page to get in on the ground floor of a Labs project—or just join the Labs mailing list to participate by offering feedback.

Labs newsletter: 30 January, 2014

From now on, the Labs newsletter will arrive through a special announce-only mailing list, newsletter@okfnlabs.org, more details on which can be found below.

Keep reading for other new developments including the fifth Labs Hangout, the launch of SayIt, and new developments in the vision of “Frictionless Data”.

New newsletter format

Not everyone who wants to know about Labs activities wants or needs to observe those activities unfolding on the main Labs list. For friends of Labs who just want occasional updates, we’ve created a new, Sendy-based announce-only list that will bring you a Labs newsletter every two weeks.

Everyone currently subscribed to okfn-labs@lists.okfn.org has been added to the new list. To join the new announce list, see the Labs Contact page, where there’s a form.

Labs Hangout no. 5

Last Thursday, Andy Lulham hosted the fifth OKFN Labs Hangout. The Labs Hangouts are a way for people curious about Labs projects to informally get together, share their work, and talk about the future of Labs.

For full details, check out the minutes from the hangout. Highlights included:

SayIt

SayIt, an open-source tool for publishing and sharing transcripts, has just been launched by Poplus. At last week’s Labs Hangout, Tom Steinberg of mySociety (one half of Poplus, alongside Ciudadano Inteligente) shared some of the motivations behind the creation of the tool, which was also discussed on the okfn-discuss mailing list.

As Tom explained, mySociety’s They Work For You has proven the popularity of transcript data. But making the transcripts available in a nice way (e.g. with a decent API) has so far called for bespoke software development. SayIt is designed to encourage “nice” publication as the starting-point—and to serve as a pedagogical example of what a good data publication tool looks like.

Frictionless data: vision, roadmap, composability

We’ve heard about Rufus’s vision for an ecosystem of “frictionless data” in the past. Now the discussion is starting to get serious. data.okfn.org now hosts two key documents generated through the conversation:

  • the vision: what will create a dynamic, productive, and attractive open data ecosystem?
  • the roadmap: what has to happen to bring this vision to life?

The new roadmap is a particularly lucid overview of how the frictionless data vision connects with concrete actions. Would-be creators of this new ecosystem should consult the roadmap to see where to join in.

Discussion on the Labs list has also generated some interesting insights. Data Unity’s Kev Kirkland discussed his work with Semantic Web formalization of composable data manipulation processes, and Štefan Urbánek made a connection with his work on “abstracting datasets and operations” in the ETL framework Bubbles.

On the blog: OLAP part two

Last week, Štefan Urbánek wrote us an introduction to Online Analytical Processing. Shortly afterwards, he followed up with a second post taking a closer look at how OLAP data is structured and why.

Check out Štefan’s post to learn about how OLAP represents data as multidimensional “cubes” that users can slice and dice to explore the data along its many dimensions.
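
To make the cube idea a little more concrete, here is a minimal sketch in plain Python. It is not taken from the post or from any OLAP library: the fact rows, the dimension names, and the `slice_cube` and `roll_up` helpers are all hypothetical, chosen only to show what "slicing" (fixing one or more dimensions) and aggregating along a dimension might look like.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions (year, region, product)
# and one measure (amount). None of this data comes from a real source.
FACTS = [
    {"year": 2012, "region": "Europe", "product": "maps", "amount": 10},
    {"year": 2012, "region": "Asia", "product": "maps", "amount": 5},
    {"year": 2013, "region": "Europe", "product": "maps", "amount": 7},
    {"year": 2013, "region": "Europe", "product": "charts", "amount": 3},
]


def slice_cube(facts, **fixed):
    """Keep only the facts whose dimensions match the fixed values (a "slice")."""
    return [f for f in facts if all(f[dim] == value for dim, value in fixed.items())]


def roll_up(facts, dimension, measure="amount"):
    """Sum the measure for each distinct value of one dimension."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension]] += fact[measure]
    return dict(totals)
```

Calling `roll_up(slice_cube(FACTS, region="Europe"), "product")` first fixes the region and then totals the amounts per product. A real OLAP engine would precompute or index such aggregates rather than scanning every row.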

TimeMapper improvements

Andy Lulham has started working on TimeMapper, Labs’s easy-to-use tool for the creation of interactive timelines linked to geomaps.

Some of the improvements he has made so far have been bugfixes (e.g. preventing overflowing form controls, fixing the template settings file), but one of them is a new user feature: adding a way to change the starting event on a timeline so that they don’t always have to start at the beginning.

Get involved

Want to get involved with Labs’s projects? Now is a great time to join in! Check out the Ideas Page to see some of the many things you can do once you join Labs, or just jump on the Labs mailing list and take part in a conversation.

Labs newsletter: 16 January, 2014

Welcome back from the holidays! A new year of Labs activities is well underway, with long-discussed improvements to the Labs projects page, many new PyBossa developments, a forthcoming community hangout, and more.

Labs projects page

Getting the Labs project page organized better has been high on the agenda for some time now. In the past little while, significant progress has been made. New improvements to the project page include:

Oleg Lavrovsky, Daniel Lombraña González, and Andy Lulham have all contributed to this development, and work is still ongoing, with further enhancements to attributes and more work on the UI still to come.

Lots of PyBossa milestones

PyBossa has achieved so many milestones since the last newsletter that it’s hard to know where to begin.

PyBossa v0.2.1 was released by Daniel Lombraña González, becoming a more robust service through the inclusion of a new rate-limiting feature for API calls. Alongside rate limits, the new PyBossa has improved security through the addition of a secure cookie-based solution for posting task runs. Full details can be found in the documentation.

Daniel also released a new PyBossa template for annotating pictures. The template, which incorporates the Annotorious.JS JavaScript library, “allow[s] anyone to extract structured information from pictures or photos in a very simple way”.

The Enki package for analyzing PyBossa applications was also released over the break. Enki makes it possible to download completed PyBossa tasks and associated task runs, analyze them with Pandas, and share the result as an IPython Notebook. Check out Daniel’s blog post on Enki to see what it’s about.

New on the blog

We’ve had a couple of great new contributions on the Labs blog since the last newsletter.

Thomas Levine has written about how he parses PDF files, lovingly exploring a problem that all data wranglers will encounter and gnash their teeth over at least a few times in their lives.

Štefan Urbánek, meanwhile, has written an introduction to OLAP, “an approach to answering multi-dimensional analytical queries swiftly”, explaining what that means and why we should take notice.

Dānabox

Labs friend Darwin Peltan reached out to the list to point out that his friend’s project Dānabox is looking for testers and general feedback. Labs members are invited to pitch in by finding bugs and breaking it.

Dānabox is “Heroku but with public payment pages”, crowdsourcing the payment for an app’s hosting costs. Dānabox is open source and built on the Deis platform.

Community hangout

It’s almost time for the Labs community hangout. The Labs hangout is the regular event where Labs members meet up online to discuss their work, find ways to collaborate, and set the agenda for the weeks to come.

When will the hangout take place? Rufus proposes moving the hangout from the 21st to the 23rd. If you want to participate, leave a comment on the thread to let Labs know what time would work for you.

Get involved

Labs is the Labs community, no more and no less, and you’re invited to become a part of it! Join the community by coding, blogging, kicking around ideas on the Ideas Page, or joining the conversation on the Labs mailing list.

Convert data between formats with Data Converters

Data Converters is a command line tool and Python library making routine data conversion tasks easier. It helps data wranglers with everyday tasks like moving between tabular data formats—for example, converting an Excel spreadsheet to a CSV or a CSV to a JSON object.

The current release of Data Converters can convert between Excel spreadsheets, CSV data, and JSON tables, as well as some geodata formats (with additional requirements).

Its smart parser can guess the types of data, correctly recognizing dates, numbers, strings, and so on. It works as easily with URLs as with local files, and it is designed to handle very large files (bigger than memory) as easily as small ones.
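
To give a feel for what type guessing involves, here is a rough sketch using only the Python standard library. It is not Data Converters' actual parser: the `guess_type` function is hypothetical, the "Float" label is our own assumption, and only ISO-style dates are recognized. The idea is simply to try the strictest interpretation first and fall back to a plain string.

```python
from datetime import datetime


def guess_type(value):
    """Return a (type_name, converted_value) pair for one raw string cell."""
    text = value.strip()
    try:
        return "Integer", int(text)
    except ValueError:
        pass
    try:
        return "Float", float(text)
    except ValueError:
        pass
    try:
        # Only dates written as YYYY-MM-DD are recognized in this sketch.
        return "DateTime", datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        # Nothing stricter matched, so keep the cell as text.
        return "String", text
```

A production parser would usually look at many rows of a column before settling on a type, rather than deciding cell by cell as this sketch does.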

Data Converters homepage

Converting data

Converting an Excel spreadsheet to a CSV or a JSON table with the Data Converters command line tool is easy. Data Converters is able to read XLS(X) and CSV files and to write CSV and JSON, and input files can be either local or remote.

dataconvert simple.xls out.csv
dataconvert out.csv out.json

# URLs also work
dataconvert https://github.com/okfn/dataconverters/raw/master/testdata/xls/simple.xls out.csv

Data Converters will try to guess the format of your input data, but you can also specify it manually.

dataconvert --format=xls input.spreadsheet out.csv

Instead of writing the converted output to a file, you can also send it to stdout (and then pipe it to other command-line utilities).

dataconvert simple.xls _.json  # JSON table to stdout
dataconvert simple.xls _.csv   # CSV to stdout

Converting data files can also be done within Python using the Data Converters library. The dataconvert convenience function shares the dataconvert command line utility’s file reading and writing functionality.

from dataconverters import dataconvert
dataconvert('simple.xls', 'out.csv')
dataconvert('out.csv', 'out.json')
dataconvert('input.spreadsheet', 'out.csv', format='xls')

Parsing data

Data Converters can do more than just convert data files. It can also parse tabular data into Python objects that capture the semantics of the source data.

Data Converters’ various parse functions each return an iterator over the records of the source data along with a metadata dictionary containing information about the data. The records returned by parse are not just (e.g.) split strings: they’re hash representations of the contents of the row, with column names and data types auto-detected.

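import dataconverters.xls as xls

# Python 2 example: open the spreadsheet in binary mode, since XLS is a binary format
with open('simple.xls', 'rb') as f:
    # parse returns an iterator over the records plus a metadata dictionary
    records, metadata = xls.parse(f)
    print metadata
    print [r for r in records]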
=> {'fields': [{'type': 'DateTime', 'id': u'date'}, {'type': 'Integer', 'id': u'temperature'}, {'type': 'String', 'id': u'place'}]}
=> [{u'date': datetime.datetime(2011, 1, 1, 0, 0), u'place': u'Galway', u'temperature': 1.0}, {u'date': datetime.datetime(2011, 1, 2, 0, 0), u'place': u'Galway', u'temperature': -1.0}, {u'date': datetime.datetime(2011, 1, 3, 0, 0), u'place': u'Galway', u'temperature': 0.0}, {u'date': datetime.datetime(2011, 1, 1, 0, 0), u'place': u'Berkeley', u'temperature': 6.0}, {u'date': datetime.datetime(2011, 1, 2, 0, 0), u'place': u'Berkeley', u'temperature': 8.0}, {u'date': datetime.datetime(2011, 1, 3, 0, 0), u'place': u'Berkeley', u'temperature': 5.0}]
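
Because the records are plain dictionaries, they can be fed straight into ordinary Python code. The sketch below does not call Data Converters at all: it writes out by hand an iterator of records shaped like the output above, and the `mean_by` helper is a hypothetical name of our own. It shows one way you might summarize such records in a single pass, which matters because an iterator can only be read once.

```python
from collections import defaultdict
from datetime import datetime

# Records shaped like the parse output above, written out by hand.
records = iter([
    {"date": datetime(2011, 1, 1), "place": "Galway", "temperature": 1.0},
    {"date": datetime(2011, 1, 2), "place": "Galway", "temperature": -1.0},
    {"date": datetime(2011, 1, 3), "place": "Galway", "temperature": 0.0},
    {"date": datetime(2011, 1, 1), "place": "Berkeley", "temperature": 6.0},
    {"date": datetime(2011, 1, 2), "place": "Berkeley", "temperature": 8.0},
    {"date": datetime(2011, 1, 3), "place": "Berkeley", "temperature": 5.0},
])


def mean_by(rows, key, measure):
    """Average `measure` per distinct value of `key`, reading rows only once."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[measure]
        counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# After this call the iterator is exhausted; keep a list if you need the rows again.
averages = mean_by(records, "place", "temperature")
```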

What’s next?

Excel spreadsheets and CSVs aren’t the only kinds of data that need converting.

Data Converters also supports geodata conversion, including converting between KML (the format for geographical data used in Google Maps and Google Earth), GeoJSON, and ESRI Shapefiles.

Data Converters’ ability to convert between tabular data may also grow, adding JSON support on the input side and XLS(X) support on the output side—as well as new conversions for XML, SQL dumps, and SPSS.

Visit the Data Converters home page to learn how to install Data Converters and its dependencies, and check out Data Converters on GitHub to see how you can contribute to the project.

Labs newsletter: 12 December, 2013

We’re back after taking a break last week with a bumper crop of updates. A few things have changed: Labs activities are now coordinated entirely through GitHub. Meanwhile, there have been some updates to the Nomenklatura, Annotator, and Data Protocols projects and some new posts on the Labs blog.

Migration from Trello to GitHub

For some time now, Labs activities requiring coordination have been organized on Trello—but those days are now over. Labs has moved its organizational setup over to GitHub, coordinating actions and making plans by means of GitHub issues. This change comes as a big relief to the many Labs members who already use GitHub as their main platform for collaboration.

General Labs-related activities are now tracked on the Labs site’s issues, and activities around individual projects are managed (as before!) through those projects’ own issues.

New Bad Data

New examples of bad data continue to roll in—and we invite even more new submissions.

Bad datasets added since last newsletter include the UK’s Greater London Authority spend data (65+ files with 25+ different structures!), Nature Magazine’s supplementary data (an awful PDF jumble), and more.

Nomenklatura: new alpha

As we’ve previously noted, Labs member Friedrich Lindenberg has been thinking about producing “a fairly radical re-framing” of the Nomenklatura data reconciliation service.

Friedrich has now released an alpha version of a new release of Nomenklatura at nk-dev.pudo.org. The major changes with this alpha include:

  • A fully JavaScript-driven frontend
  • String matching now happens inside the PostgreSQL database
  • Better introductory text explaining what Nomenklatura does
  • “entity” and “alias” domain objects have been merged into “entity”

Friedrich is keen to hear what people think about this prototype—so jump in, give it a try, and leave your comments at the Nomenklatura repo.

Annotator v1.2.9

A new maintenance release of Annotator came out ten days ago. This new version is intended to be one of the last in the v1.2.x series—indeed, v1.2.8 itself was intended to be the last, but that version had some significant issues that this new release corrects.

Fixes in this version include:

  • Fixed a major packaging error in v1.2.8. Annotator no longer exports an excessive number of tokens to the page namespace.
  • Notification display bugfixes. Notification levels are now correctly removed after notifications are hidden.

The new Annotator is available, as always, from GitHub.

Data Protocols updates

Data Protocols is a project to develop simple protocols and formats for working with open data. Rufus Pollock wrote a cross-post to the list about several new developments with Data Protocols of interest to Labs. These included:

  • Close to final agreement on a spec for adding “primary keys” to the JSON Table Schema (discussion)
  • Close to consensus on spec for “foreign keys” (discussion)
  • Proposal for a JSON spec for views of data, e.g. graphs or maps (discussion)

For more, check out Rufus’s message and the Data Protocols issues.

On the blog

Labs members have added a couple of new posts to the blog since the last newsletter. Yours truly (with extensive help from Rufus) posted on using Data Pipes to view a CSV. Michael Bauer, meanwhile, wrote about the new Reconcile-CSV service he developed while working on education data in Tanzania. Look to the Labs blog for the full scoop.

Get involved

If you have some spare time this holiday season, why not spend it helping out with Labs? We’re always looking for new people to join the community. Visit the Labs issues and the Ideas Page to get some ideas for how you can join in.

View a CSV (Comma Separated Values) in Your Browser

This post introduces one of the handiest features of Data Pipes: fast previewing of CSV files in your browser (and you can share the result just by copying a URL).

The Raw CSV

CSV files are frequently used for storing tabular data and are widely supported by spreadsheets and databases. However, you can’t usually look at a CSV file in your browser, which will typically download the file instead. And even if you could, a raw CSV file is not very pleasant to look at:

Raw CSV

The Result

But using the Data Pipes html feature, you can turn an online CSV into a pretty HTML table in a few seconds. For example, the CSV you’ve just seen would become this pretty table:

CSV, HTML view

Using it

To use this service, just visit http://datapipes.okfnlabs.org/html/ and paste your CSV’s URL into the form.

For power users (or for use from the command line or API), you can just append your CSV’s URL to:

http://datapipes.okfnlabs.org/csv/html/?url=

Previewing Just the First Part of a CSV File

You can also extend this basic previewing using other Data Pipes features. For example, suppose you have a big CSV file (say with more than a few thousand rows). If you tried to turn this into an HTML table and then view it in your browser, the browser would probably crash.

So what if you could just see a part of the file? After all, you may well only be interested in seeing what that CSV file looks like, not every row. Fortunately, Data Pipes supports showing only the first 10 lines of a CSV file using a head operation. To demonstrate, let’s just extend our example above to use head. This gives us the following URL (click to see the live result):

http://datapipes.okfnlabs.org/csv/head/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv
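
If you want to build these URLs from a script, something like the following sketch could work. It only assembles a string and never contacts the service. The `pipeline_url` helper is a hypothetical name of our own, and whether the service accepts percent-encoded characters in the `url` parameter is an assumption we have not verified.

```python
from urllib.parse import quote

BASE = "http://datapipes.okfnlabs.org"


def pipeline_url(csv_url, operations):
    """Join operation names into a path and append the CSV's URL as `url`."""
    path = "/".join(operations)
    # Leave ':' and '/' unescaped so the result matches the URLs shown above;
    # other unsafe characters (such as spaces) are percent-encoded.
    return "{}/{}/?url={}".format(BASE, path, quote(csv_url, safe=":/"))
```

For example, `pipeline_url(csv_url, ["csv", "head", "html"])` gives the same shape of URL as the head example above, with the operations applied in the order listed.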

Colophon

Data Pipes is a free and open service run by Open Knowledge Foundation Labs. You can find the source code on GitHub at: https://github.com/okfn/datapipes. It is also available as a Node library and command line tool.

If you like previewing CSV files in your browser, you might also be interested in the Recline CSV Viewer, a Chrome plugin that automatically turns CSVs into searchable HTML tables in your browser.

Labs newsletter: 28 November, 2013

Another busy week at the Labs! We’ve had lots of discussion around the idea of “bad data”, a blog post about Mark’s aid tracker, new PyBossa developments, and a call for help with a couple of projects. Next week we can look forward to another Open Data Maker Night in London.

Bad Data

Last Friday, Rufus announced Bad Data, a new educational mini-project that highlights real-world examples of how data shouldn’t be published.

This announcement was greeted with glee and with contributions of new examples. Open government activist Ivan Begtin chimed in with the Russian Ministry of the Interior’s list of regional offices and the Russian government’s tax rates for municipalities. Labs member Friedrich Lindenberg added the German finance ministry’s new open data initiative. As Andy Lulham said, “bad data” will be very useful for testing the new Data Pipes operators.

You can follow the whole discussion thread in the list archive.

Blog post: Looking at aid in the Philippines

At last week’s hangout, you heard about Mark Brough’s new project, a browser for aid projects in the Philippines generated from IATI data.

Now you can read more about Mark’s project on the blog, learning about where the data comes from, how the site is generated from the data (interestingly, it uses the Python-based static site generator Frozen-Flask), and what Mark plans to do next.

New PyBossa cache system

Labs member and citizen science expert Daniel Lombraña González has been “working really hard to add a new cache system to PyBossa”, the open source crowdsourcing platform.

As Daniel has discovered, the Redis key-value store meets all his requirements for a load-balanced, high-availability, persistent cache. As he put it: “Redis is amazing. Let me repeat it: amazing.”

Read the blog post to learn more about the new Redis-based PyBossa setup and its benefits.

Contributions needed: iOS and Python development

Philippe Plagnol of Product Open Data needs a few good developers to help with some projects.

Firstly, the Product Open Data Android app has been out for a while (source code), and it’s high time there was a port for Apple devices. If you’re interested in contributing to the port, leave a comment at this GitHub issue.

Secondly, work is now underway on a brand repository which will assign a Brand Standard Identifier Number (BSIN) to each brand worldwide, making it possible to integrate products in the product repository. Python developers are needed to help make this happen. If you want to help out, join in this GitHub thread. (Lots of people have already signed up!)

Next week: Open Data Maker Night London #7

On the 4th of December next week, the seventh London Open Data Maker Night is taking place. Anyone interested in building tools or insights from data is invited to drop in at any time after 6:30 and join the fun. (Please note that the event will take place on Wednesday rather than the usual Tuesday.)

What is an Open Data Maker Night? Read more about them here.

Get involved

Labs is always looking for new contributors. Read more about how you can join the community, whether you’re a coder, a data wrangler, or a communicator, and check out the Ideas Page to see what else is brewing.

Labs newsletter: 21 November, 2013

This week, Labs members gathered in an online hangout to discuss what they’ve been up to and what’s next for Labs. This special edition of the newsletter recaps that hangout for those who weren’t there (or who want a reminder).

Data Pipes update

Last week you heard about Andy Lulham’s improvements to Data Pipes, the online streaming data transformations service. He didn’t stop there, and in this week’s hangout, Andy described some of the new features he has been adding:

  • parse and render are now streaming operations
  • option parsing now uses optimist
  • a basic command-line interface
  • … and much, much more

Coming up next: map & filter with arbitrary functions!

Crowdcrafting: progress and projects

New Shuttleworth fellow Daniel Lombraña González reported on progress with CrowdCrafting, the citizen science platform built with PyBossa.

CrowdCrafting now has more than 3,500 users (though Daniel cautions that this doesn’t mean much in terms of participation), and the site now has more answers than tasks.

Last week, the team at MicroMappers used CrowdCrafting to classify tweets about the typhoon disaster in the Philippines. Digital mapping activists SkyTruth, meanwhile, have used CrowdCrafting to map and track fracking sites in the northeast United States. Daniel has also been in contact with EpiCollect about a project on trash collection in Spain.

Open Data Button

Labs member Oleg Lavrovsky discussed the Open Data Button, an interesting fork of the recently-launched Open Access Button.

The Open Access Button, an idea of the Open Science working group at OKCon 2013, is a bookmarklet that allows users to report their experiences of having their research blocked by paywalls. The Open Data Button applies this same idea to Open Data: users can use it to report their problems with legal and technical restrictions on data. (As Rufus pointed out, this ties in nicely with the IsItOpenData project.)

Queremos Saber

Labs ally Vítor Baptista reported on a new development with Queremos Saber, the Brazilian FOI request portal.

Changes in the way the Brazilian federal government accepts FOI requests have caused Queremos Saber problems. The federal government no longer accepts requests by email, forcing the use of a specialized FOI system which they are now promoting for local governments as well. This limits the number of places that will accept requests from Queremos Saber.

A solution to this problem is underway: an email-based API that will take emails received at certain addresses (e.g. ministryofhealthcare@queremossaber.org.br) and turn them into instructions for a web crawler to create an FOI request in the appropriate system. An interesting side effect of this would be the creation of an anonymization layer, allowing users to bypass the legal requirement that FOI requests not be placed anonymously.
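
As a rough illustration of the first step, here is a sketch of how an incoming address might be turned into an instruction for a crawler. This is our own guess at the shape of the idea, not Queremos Saber's code: the `agency_for` and `crawler_instruction` functions and the fields of the resulting dictionary are all hypothetical.

```python
from email.utils import parseaddr

DOMAIN = "queremossaber.org.br"


def agency_for(recipient):
    """Return the agency slug encoded in a recipient address, or None."""
    _, address = parseaddr(recipient)
    local, _, domain = address.partition("@")
    if domain.lower() != DOMAIN or not local:
        return None
    return local.lower()


def crawler_instruction(recipient, subject, body):
    """Build a plain dict describing the request a crawler would submit."""
    agency = agency_for(recipient)
    if agency is None:
        raise ValueError("not a request address: {!r}".format(recipient))
    # The sender's address is deliberately not passed on, which is what
    # would make the service act as an anonymization layer.
    return {"agency": agency, "title": subject, "text": body}
```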

Philippines Projects

Labs data wrangler Mark Brough showed off a test project collecting data on aid activities in the Philippines. Mark’s small static site, updated each night, collects IATI aid data on projects in the Philippines and republishes it in a more browsable form.

Mark also discussed another data-mashup project, still in the planning stage, that would combine budget and aid data for Tanzania (or any other developing country)—similar to Publish What You Fund’s old Uganda project but based on a non-static dataset.

Global Economic Map

Alex Peek discussed his initiative to create the Global Economic Map, “a collection of standardized data set of economic statistics that can be applied to every country, region and city in the world”.

The GEM will draw data from sources like government publications and SEC filings and will cover eleven statistics that touch on GDP, employment, corporations, and budgets. The GEM aims to be fully integrated with Wikidata.

Frictionless data

Finally, Rufus Pollock discussed data.okfn.org and the mission of “frictionless data”: making it “as simple as possible to get the data you want into the tool of your choice.”

data.okfn.org aims to help achieve this goal by promoting, among other things, simple data standards and the tooling to support them. As reported in last week’s newsletter, this now includes a Data Package Manager based on npm, now working at a very basic level. It also includes the data.okfn.org Data Package Viewer, which provides a nice view on data packages hosted on GitHub, S3, or wherever else.

Improving the Labs site

The hangout wrapped up with a discussion of how to improve the Labs site. Besides some discussion of the possibility of a one-click creation system for Open Data Maker Nights, talk focused on improving the projects page.

Oleg, who has volunteered to take the lead in reforming the projects page, highlighted the need for a way to differentiate projects by their activity level and their need for more contributors. Mark agreed, suggesting also that it would be nice to be able to filter projects by the languages and technologies they use. Both ideas were proposed as a way to fill out Tod Robbins’s suggestion that the projects page needs categories.

See the Labs hangout notes for the full details of this discussion.

Get involved

As always, Labs wants you to join in and get involved! Read more about how you can join the community and participate by coding, wrangling data, or doing outreach and engagement, and have a look at the Ideas Page to see what other members have been thinking.

Bad Data: real-world examples of how not to do data

We’ve just started a mini-project called Bad Data. Bad Data provides real-world examples of how not to publish data. It showcases the poorly structured, the mis-formatted, and the just plain ugly.

This isn’t about being critical but about educating—providing examples of how not to do something may be one of the best ways of showing how to do it right. It also provides a source of good practice material for budding data wranglers!

Bad Data: ASCII spreadsheet

Each “bad” dataset gets a simple page that shows what’s wrong along with a preview or screenshot.

We’ve started to stock the site with some of the better examples of bad data that we’ve come across over the years. This includes machine-unreadable Transport for London passenger numbers from the London Datastore and a classic “ASCII spreadsheet” from the US Bureau of Labor Statistics.

We welcome contributions of new examples! Submit them here.

Labs newsletter: 14 November, 2013

Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post.

Data Pipes: lots of improvements

Data Pipes is a Labs project that provides a web API for a set of simple data-transforming operations that can be chained together in the style of Unix pipes.

This past week, Andy Lulham has made a huge number of improvements to Data Pipes. Just a few of the new features and fixes:

  • new operations: strip (removes empty rows), tail (truncates the dataset to its last rows)
  • new features: a range function and a “complement” switch for cut; options for grep
  • all operations in pipeline are now trimmed for whitespace
  • basic tests have been added

Have a look at the closed issues to see more of what Andy has been up to.

Webshot: new homepage and feature

Last week we introduced you to Webshot, a web API for screenshots of web pages.

Back then, Webshot’s home page was just a screenshot of GitHub. Now Webshot has a proper home page with a form interface to the API.

Webshot has also added support for full page screenshots. Now you can capture the whole page rather than just its visible portion.

On the blog: natural language processing with Python

Labs member Tarek Amr has contributed an awesome post on Python natural language processing with the NLTK toolkit to the Labs blog.

“The beauty of NLP,” Tarek says, “is that it enables computers to extract knowledge from unstructured data inside textual documents.” Read his post to learn how to do text normalization, frequency analysis, and text classification with Python.

Data Packages workflow à la Node

Wouldn’t it be nice to be able to initialize new Data Packages as easily as you can initialize a Node module with npm init?

Max Ogden started a discussion thread around this enticing idea, eventually leading to Rufus Pollock booting a new repo for dpm, the Data Package Manager. Check out dpm’s Issues to see what needs to happen next with this project.

Nomenklatura: looking forward

Nomenklatura is a Labs project that does data reconciliation, making it possible “to maintain a canonical list of entities such as persons, companies or even streets and to match messy input, such as their names, against that canonical list”.

Friedrich Lindenberg has noted on the Labs mailing list that Nomenklatura has some serious problems, and he has proposed “a fairly radical re-framing of the service”.

The conversation around what this re-framing should look like is still underway—check out the discussion thread and jump in with your ideas.

Data Issues: following issues

Last week, the idea of Data Issues was floated: using GitHub Issues to track problems with public datasets. The idea has generated a few comments, and we’d love to hear more.

Discussion on the Labs list highlighted another benefit of using GitHub. Alioune Dia suggested that Data Issues should let users register to be notified when a particular issue is fixed. But Chris Mear pointed out that GitHub already makes this possible: “Any GitHub user can ‘follow’ a specific issue by using the notification button at the bottom of the issue page.”

Get involved

Anyone can join the Labs community and get involved! Read more about how you can join the community and participate by coding, wrangling data, or doing outreach and engagement. Also check out the Ideas Page to see what’s cooking in the Labs.