CSV Conf 2014 – for Data Makers Everywhere

Announcing CSV,Conf - the conference for data makers everywhere, which takes place on 15 July 2014 in Berlin.

This one-day conference will focus on practical, real-world stories, examples and techniques of how to scrape, wrangle, analyze, and visualize data. Whether your data is big or small, tabular or spatial, graphs or rows, this event is for you.

Key Info

CSV,Conf is run in conjunction with the week-long Open Knowledge Festival.

What Is It About?

Building Community

We want to bring together data makers/doers/hackers from backgrounds like science, journalism, open government and the wider software industry to share tools and stories.

For those who love data

CSV Conf is a non-profit community conference run by some folks who really love data and sharing knowledge. If you are passionate about data and its application to society, you should join us!

Big and small

This isn’t a conference just about spreadsheets. We are curating content about advancing the art of data collaboration, from putting your CSV on GitHub to producing meaningful insight by running large scale distributed processing.

Colophon: Why CSV?

This conference isn’t just about CSV data. But we chose to call it CSV Conf because we think CSV embodies certain important qualities that set the tone for the event:

  • Simplicity: CSV is incredibly simple - perhaps the simplest structured data format there is
  • Openness: the CSV ‘standard’ is well-known and open - free for anyone to use
  • Easy to use: CSV is widely supported - practically every spreadsheet program, relational database and programming language in existence can handle CSV in some form or other
  • Hackable: CSV is text-based and therefore amenable to manipulation and access from a wide range of standard tools (including revision control systems such as git, mercurial and subversion)
  • Big or small: CSV files can range from under a kilobyte to gigabytes, and CSV’s line-oriented structure means it can be processed incrementally – you do not need to read an entire file to extract a single row (see the sketch below).
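To make that last point concrete, here is a minimal Python sketch of incremental processing. The file name and the value we search for are hypothetical, but the pattern works for any CSV:

import csv

# Stream rows one at a time - the whole file never has to fit in memory.
with open('weather.csv', newline='') as f:
    for row in csv.reader(f):
        # Stop as soon as the row we want turns up.
        if row and row[0] == '2011-01-02':
            print(row)
            break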

More informally:

CSV is the data Kalashnikov: not pretty, but many [data] wars have been fought with it and even kids can use it. @pudo (Friedrich Lindenberg)

CSV is the ultimate simple, standard data format - streamable, text-based, no need for proprietary tools etc @rufuspollock (Rufus Pollock)

[The above is adapted from the “Why CSV” section of the Tabular Data Package specification]

Candy Crush, King Digital Entertainment, Offshoring and Tax

Sifting through the King Digital Entertainment F-1 filing with the SEC for their IPO (Feb 18 2014), I noticed the following in their risk section:

The intended tax benefits of our corporate structure and intercompany arrangements may not be realized, which could result in an increase to our worldwide effective tax rate and cause us to change the way we operate our business. Our corporate structure and intercompany arrangements, including the manner in which we develop and use our intellectual property and the transfer pricing of our intercompany transactions, are intended to provide us worldwide tax efficiencies [ed: for this I read – significantly reduce our tax-rate by moving our profits to low-tax jurisdictions …]. The application of the tax laws of various jurisdictions to our international business activities is subject to interpretation and also depends on our ability to operate our business in a manner consistent with our corporate structure and intercompany arrangements. The taxing authorities of the jurisdictions in which we operate may challenge our methodologies for valuing developed technology or intercompany arrangements, including our transfer pricing, or determine that the manner in which we operate our business does not achieve the intended tax consequences, which could increase our worldwide effective tax rate and adversely affect our financial position and results of operations.

It is also interesting how they have set up their corporate structure going “offshore” first to Malta and then to Ireland (from the “Our Corporate Information and Structure” section):

We were originally incorporated as Midasplayer.com Limited in September 2002, a company organized under the laws of England and Wales. In December 2006, we established Midasplayer International Holding Company Limited, a limited liability company organized under the laws of Malta, which became the holding company of Midasplayer.com Limited and our other wholly-owned subsidiaries. The status of Midasplayer International Holding Company Limited changed to a public limited liability company in November 2013 and its name changed to Midasplayer International Holding Company p.l.c. Prior to completion of this offering, King Digital Entertainment plc, a company incorporated under the laws of Ireland and created for the purpose of facilitating the public offering contemplated hereby, will become our current holding company by way of a share-for-share exchange in which the existing shareholders of Midasplayer International Holding Company p.l.c. will exchange their shares in Midasplayer International Holding Company p.l.c. for shares having substantially the same rights in King Digital Entertainment plc. See “Corporate Structure.”

Here’s their corporate structure diagram from the “Corporate Structure” section (unfortunately barely readable in the original as well …). As I count it, there are 19 different entities, with a chain of length 6 or 7 from the base entities to the primary holding company.

Labs newsletter: 20 March, 2014

We’re back with a bumper crop of updates in this new edition of the now-monthly Labs newsletter!

Textus Viewer refactoring

The TEXTUS Viewer is an HTML + JS application for viewing texts in the format of TEXTUS, Labs’s open source platform for collaborating around collections of texts. The viewer has now been stripped down to its bare essentials, becoming a leaner and more streamlined beast that’s easier to integrate into your projects.

Check out the demo to see the new Viewer in action, and see the full usage instructions in the repo.

JSON Table Schema: foreign key support

The JSON Table Schema, Labs’s schema for tabular data, has just added an important new feature: support for foreign keys. This means that the schema now provides a method for linking entries in a table to entries in a separate resource.
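As a rough illustration, a foreign key declaration might look something like the following - the resource and field names here are made up, and the exact keywords follow the GitHub discussion, so consult the spec itself before relying on them:

{
  "fields": [
    {"id": "country_code", "type": "string"},
    {"id": "population", "type": "integer"}
  ],
  "foreignKeys": [
    {
      "fields": "country_code",
      "reference": {"resource": "countries", "fields": "code"}
    }
  ]
}

In this sketch, every value of country_code in the table is declared to match the code field of a separate countries resource.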

This update has been in the works for a long time, as you can see from the discussion thread on GitHub. Many thanks to everyone who participated in that year-long discussion, including Jeff Allen, David Miller, Gunnlaugur Thor Briem, Sebastien Ballesteros, James McKinney, Paul Fitzpatrick, Josh Ferguson, Tryggvi Björgvinsson, and Rufus Pollock.

Renaming of Data Explorer

Data Explorer is Labs’s in-browser data cleaning and visualization app—and it’s about to get a name change.

For the past four months, discussion around the new name has been bubbling. As of right now, Rufus Pollock is proposing to go with the new name DataDeck.

What do you think? If you object, now’s your chance to jump in the thread and re-open the issue!

On the blog: SEC EDGAR database

Rufus has been doing some work with the Securities and Exchange Commission (SEC) EDGAR database, “a rich source of data containing regulatory filings from publicly-traded US corporations including their annual and quarterly reports”. He has written up his initial findings on the blog and created a repo for the extracted data.

This is an interesting example of working with XBRL, the popular XML framework for financial reporting. You can find several good Python libraries for working with XBRL in Rufus’s message to the mailing list.

Labs Hangout: today!

Labs Hangouts are a fun and informal way for Labs members and friends to get together, discuss their work, and seek out new contributions—and the next one is happening today (20 March) at 1700-1800 GMT!

If you want to join in, visit the hangout Etherpad and record your name. The URL of the Hangout will be announced on the Labs mailing list as well as reported on the pad.

Get involved

Want to join in Labs activities? There’s lots to do! Leave an idea on the Ideas Page, or visit the Labs site to learn more about how you can join the community.

The SEC EDGAR Database

This post looks at the Securities and Exchange Commission (SEC) EDGAR database. EDGAR is a rich source of data containing regulatory filings from publicly-traded US corporations including their annual and quarterly reports:

All companies, foreign and domestic, are required to file registration statements, periodic reports, and other forms electronically through EDGAR. Anyone can access and download this information for free. [from the SEC website]

This post introduces the basic structure of the database and how to get access to filings via FTP. Subsequent posts will look at how to use the structured information in the form of XBRL files.

Note: an extended version of the notes here plus additional data and scripts can be found in this SEC EDGAR Data Package on Github.

Human Interface

See http://www.sec.gov/edgar/searchedgar/companysearch.html

Bulk Data

EDGAR provides bulk access via FTP: ftp://ftp.sec.gov/ - see the official documentation. We summarize the main points here.

Each company in EDGAR gets an identifier known as the CIK (Central Index Key), which is a 10-digit number. You can find the CIK by searching EDGAR using a company name or stock market ticker.

For example, searching for IBM by ticker shows us that the CIK is 0000051143.

Note that leading zeroes are often omitted (e.g. in the FTP paths), so this becomes 51143.

Next, each submission receives an ‘Accession Number’ (acc-no). For example, IBM’s quarterly financial filing (form 10-Q) in October 2013 had accession number 0000051143-13-000007.

FTP File Paths

Given a company with CIK (company ID) XXX (omitting leading zeroes) and document accession number YYY (acc-no on search results), file paths are of the form:

/edgar/data/XXX/YYY.txt

For example, for the IBM data above it would be:

ftp://ftp.sec.gov/edgar/data/51143/0000051143-13-000007.txt

Note: if you are looking for a nice HTML version, you can find it in the Archives section with a similar URL (just add -index.htm):

http://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm
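Putting the pieces together, here is a small Python sketch that builds both URLs from a CIK and an accession number (the helper names are mine, not an official API):

def filing_txt_url(cik, acc_no):
    # Leading zeroes in the CIK are dropped in FTP paths.
    return 'ftp://ftp.sec.gov/edgar/data/%d/%s.txt' % (int(cik), acc_no)

def filing_index_url(cik, acc_no):
    # The HTML index lives in a folder named after the accession number
    # without dashes, with -index.htm appended to the dashed form.
    return 'http://www.sec.gov/Archives/edgar/data/%d/%s/%s-index.htm' % (
        int(cik), acc_no.replace('-', ''), acc_no)

print(filing_txt_url('0000051143', '0000051143-13-000007'))
# ftp://ftp.sec.gov/edgar/data/51143/0000051143-13-000007.txt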

Indices

If you want to get a list of all filings you’ll want to grab an Index. As the help page explains:

The EDGAR indices are a helpful resource for FTP retrieval, listing the following information for each filing: Company Name, Form Type, CIK, Date Filed, and File Name (including folder path).

Four types of indexes are available:

  • company — sorted by company name
  • form — sorted by form type
  • master — sorted by CIK number
  • XBRL — list of submissions containing XBRL financial files, sorted by CIK number; these include Voluntary Filer Program submissions

URLs are like:

ftp://ftp.sec.gov/edgar/full-index/2008/QTR4/master.gz

That is, they have the following general form:

ftp://ftp.sec.gov/edgar/full-index/{YYYY}/QTR{1-4}/{index-name}.[gz|zip]

So for XBRL in the 3rd quarter of 2010 we’d do:

ftp://ftp.sec.gov/edgar/full-index/2010/QTR3/xbrl.gz
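That general form translates directly into a tiny helper - again just a sketch, with a function name of my own choosing:

def index_url(year, quarter, index_name='master', ext='gz'):
    # index_name is one of: company, form, master, xbrl
    return 'ftp://ftp.sec.gov/edgar/full-index/%d/QTR%d/%s.%s' % (
        year, quarter, index_name, ext)

print(index_url(2010, 3, 'xbrl'))
# ftp://ftp.sec.gov/edgar/full-index/2010/QTR3/xbrl.gz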

CIK lists and lookup

There’s a full list of all companies along with their CIK code here: http://www.sec.gov/edgar/NYU/cik.coleft.c

If you want to look up a CIK or company by its ticker you can do the following query against the normal search system:

http://www.sec.gov/cgi-bin/browse-edgar?CIK=ibm&Find=Search&owner=exclude&action=getcompany&output=atom

Then parse the atom to grab the CIK. (If you prefer HTML output, just omit output=atom.)
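Here is a quick-and-dirty sketch of that lookup in Python 3. A real implementation should parse the XML properly rather than using a regex, and the <CIK> element name is an assumption based on EDGAR’s atom output:

import re
from urllib.request import urlopen

url = ('http://www.sec.gov/cgi-bin/browse-edgar'
       '?CIK=ibm&Find=Search&owner=exclude&action=getcompany&output=atom')
atom = urlopen(url).read().decode('utf-8')

# Naive extraction: grab the first <CIK> element in the feed.
match = re.search(r'<CIK>(\d+)</CIK>', atom)
print(match.group(1) if match else 'CIK not found')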

There is also a full-text company name to CIK lookup here:

http://www.sec.gov/edgar/searchedgar/cik.htm

(Note this does a POST to a ‘text’ API at http://www.sec.gov/cgi-bin/cik.pl.c)

Labs newsletter: 20 February, 2014

The past few weeks have seen major improvements to the Labs website, another Open Data Maker Night in London, updates to the TimeMapper project, and more.

Labs Hangout: today

The next Labs online hangout is taking place today in just a few hours—now’s your chance to sign up on the hangout’s Etherpad!

Labs hangouts are informal online gatherings held on Google Hangout at which Labs members and friends get together to discuss their work and to set the agenda for Labs activities.

Today’s hangout will take place at 1700 - 1800 GMT. Check the hangout pad for more details, and watch the pad for notes from the meeting.

Crowdcrafting at Citizen Cyberscience Summit 2014

In today’s other news, Labs’s Daniel Lombraña González is presenting Crowdcrafting at the Citizen Cyberscience Summit 2014. You can read more about his presentation here.

Crowdcrafting is an open-source citizen science platform that “empowers citizens to become active players in scientific projects by donating their time in order to solve micro-task problems”. Crowdcrafting has been used by institutions including CERN, the United Nations, and the National Institute of Space Research of Brazil.

Labs site updates

Labs has been discussing improving the website for some time now, and the past weeks have seen many of those proposed improvements being put into action.

One of the biggest changes is a new projects page. Besides having a beautiful new layout, the new projects page implements filtering by tags, language, and more.

The site now also features reciprocal linking of users and projects. The projects page now shows projects’ maintainers (n.b. plural!), and user pages now show which projects users contribute to (e.g. Andy Lulham’s page highlights his Data Pipes contributions).

TimeMapper improvements

TimeMapper is a Labs project allowing you to create elegant timelines with inline maps from Google Spreadsheets in a matter of seconds.

A number of improvements have been made to TimeMapper.

Open Data Maker Night February

Two weeks ago today, the ninth Open Data Maker Night London was hosted by Andy Lulham. This edition was a mapping special, featuring OpenStreetMap contributor Harry Wood.

Open Data Maker Nights are informal, action-oriented get-togethers where things get made with open data. Visit the Labs website for more information on them, including info on how to host your own.

DataPackage + Bubbles

In last week’s newsletter, you heard about Štefan Urbánek’s abstract data processing framework Bubbles. Štefan has just notified the OKFN Labs list that he has created a demo of Bubbles using Data Packages, Labs’s simple standard for data publication.

“The example is artificial”, Štefan says, but it highlights the power of the Bubbles framework and the potential of the Data Package format.

Get involved

We’re always looking for new contributions at the Labs. Read about how you can join, and see the Ideas Page to get in on the ground floor of a Labs project—or just join the Labs mailing list to participate by offering feedback.

Labs newsletter: 30 January, 2014

From now on, the Labs newsletter will arrive through a special announce-only mailing list, newsletter@okfnlabs.org; more details can be found below.

Keep reading for other new developments including the fifth Labs Hangout, the launch of SayIt, and new developments in the vision of “Frictionless Data”.

New newsletter format

Not everyone who wants to know about Labs activities wants or needs to observe those activities unfolding on the main Labs list. For friends of Labs who just want occasional updates, we’ve created a new, Sendy-based announce-only list that will bring you a Labs newsletter every two weeks.

Everyone currently subscribed to okfn-labs@lists.okfn.org has been added to the new list. To join the new announce list, see the Labs Contact page, where there’s a form.

Labs Hangout no. 5

Last Thursday, Andy Lulham hosted the fifth OKFN Labs Hangout. The Labs Hangouts are a way for people curious about Labs projects to informally get together, share their work, and talk about the future of Labs.

For full details, check out the minutes from the hangout. Some of the highlights are covered below.

SayIt

SayIt, an open-source tool for publishing and sharing transcripts, has just been launched by Poplus. At last week’s Labs Hangout, Tom Steinberg of mySociety (one half of Poplus, alongside Ciudadano Inteligente) shared some of the motivations behind the creation of the tool, which was also discussed on the okfn-discuss mailing list.

As Tom explained, mySociety’s TheyWorkForYou has proven the popularity of transcript data. But making the transcripts available in a nice way (e.g. with a decent API) has so far called for bespoke software development. SayIt is designed to encourage “nice” publication as the starting-point—and to serve as a pedagogical example of what a good data publication tool looks like.

Frictionless data: vision, roadmap, composability

We’ve heard about Rufus’s vision for an ecosystem of “frictionless data” in the past. Now the discussion is starting to get serious. data.okfn.org now hosts two key documents generated through the conversation:

  • the vision: what will create a dynamic, productive, and attractive open data ecosystem?
  • the roadmap: what has to happen to bring this vision to life?

The new roadmap is a particularly lucid overview of how the frictionless data vision connects with concrete actions. Would-be creators of this new ecosystem should consult the roadmap to see where to join in.

Discussion on the Labs list has also generated some interesting insights. Data Unity’s Kev Kirkland discussed his work with Semantic Web formalization of composable data manipulation processes, and Štefan Urbánek made a connection with his work on “abstracting datasets and operations” in the ETL framework Bubbles.

On the blog: OLAP part two

Last week, Štefan Urbánek wrote us an introduction to Online Analytical Processing. Shortly afterwards, he followed up with a second post taking a closer look at how OLAP data is structured and why.

Check out Štefan’s post to learn about how OLAP represents data as multidimensional “cubes” that users can slice and dice to explore the data along its many dimensions.

TimeMapper improvements

Andy Lulham has started working on TimeMapper, Labs’s easy-to-use tool for the creation of interactive timelines linked to geomaps.

Some of the improvements he has made so far have been bugfixes (e.g. preventing overflowing form controls, fixing the template settings file), but one of them is a new user feature: a way to change the starting event on a timeline, so that timelines don’t always have to start at the beginning.

Get involved

Want to get involved with Labs’s projects? Now is a great time to join in! Check out the Ideas Page to see some of the many things you can do once you join Labs, or just jump on the Labs mailing list and take part in a conversation.

Labs newsletter: 16 January, 2014

Welcome back from the holidays! A new year of Labs activities is well underway, with long-discussed improvements to the Labs projects page, many new PyBossa developments, a forthcoming community hangout, and more.

Labs projects page

Getting the Labs projects page better organized has been high on the agenda for some time now, and in the past little while significant progress has been made on a number of improvements.

Oleg Lavrosky, Daniel Lombraña González, and Andy Lulham have all contributed to this development—and work is still ongoing, with further enhancements to attributes and more work on the UI still to come.

Lots of PyBossa milestones

PyBossa has achieved so many milestones since the last newsletter that it’s hard to know where to begin.

Daniel Lombraña González released PyBossa v0.2.1, making the service more robust through the inclusion of a new rate-limiting feature for API calls. Alongside rate limits, the new PyBossa has improved security through the addition of a secure cookie-based solution for posting task runs. Full details can be found in the documentation.

Daniel also released a new PyBossa template for annotating pictures. The template, which incorporates the Annotorious.JS JavaScript library, “allow[s] anyone to extract structured information from pictures or photos in a very simple way”.

The Enki package for analyzing PyBossa applications was also released over the break. Enki makes it possible to download completed PyBossa tasks and associated task runs, analyze them with Pandas, and share the result as an IPython Notebook. Check out Daniel’s blog post on Enki to see what it’s about.

New on the blog

We’ve had a couple of great new contributions on the Labs blog since the last newsletter.

Thomas Levine has written about how he parses PDF files, lovingly exploring a problem that all data wranglers will encounter and gnash their teeth over at least a few times in their lives.

Štefan Urbánek, meanwhile, has written an introduction to OLAP, “an approach to answering multi-dimensional analytical queries swiftly”, explaining what that means and why we should take notice.

Dānabox

Labs friend Darwin Peltan reached out to the list to point out that his friend’s project Dānabox is looking for testers and general feedback. Labs members are invited to pitch in by finding bugs and breaking it.

Dānabox is “Heroku but with public payment pages”, crowdsourcing the payment for an app’s hosting costs. Dānabox is open source and built on the Deis platform.

Community hangout

It’s almost time for the Labs community hangout. The Labs hangout is the regular event where Labs members meet up online to discuss their work, find ways to collaborate, and set the agenda for the weeks to come.

When will the hangout take place? Rufus proposes moving the hangout from the 21st to the 23rd. If you want to participate, leave a comment on the thread to let Labs know what time would work for you.

Get involved

Labs is the Labs community, no more and no less, and you’re invited to become a part of it! Join the community by coding, blogging, kicking around ideas on the Ideas Page, or joining the conversation on the Labs mailing list.

Convert data between formats with Data Converters

Data Converters is a command line tool and Python library making routine data conversion tasks easier. It helps data wranglers with everyday tasks like moving between tabular data formats—for example, converting an Excel spreadsheet to a CSV or a CSV to a JSON object.

The current release of Data Converters can convert between Excel spreadsheets, CSV data, and JSON tables, as well as some geodata formats (with additional requirements).

Its smart parser can guess the types of data, correctly recognizing dates, numbers, strings, and so on. It works as easily with URLs as with local files, and it is designed to handle very large files (bigger than memory) as easily as small ones.

Data Converters homepage

Converting data

Converting an Excel spreadsheet to a CSV or a JSON table with the Data Converters command line tool is easy. Data Converters is able to read XLS(X) and CSV files and to write CSV and JSON, and input files can be either local or remote.

dataconvert simple.xls out.csv
dataconvert out.csv out.json

# URLs also work
dataconvert https://github.com/okfn/dataconverters/raw/master/testdata/xls/simple.xls out.csv

Data Converters will try to guess the format of your input data, but you can also specify it manually.

dataconvert --format=xls input.spreadsheet out.csv

Instead of writing the converted output to a file, you can also send it to stdout (and then pipe it to other command-line utilities).

dataconvert simple.xls _.json  # JSON table to stdout
dataconvert simple.xls _.csv   # CSV to stdout

Converting data files can also be done within Python using the Data Converters library. The dataconvert convenience function shares the dataconvert command line utility’s file reading and writing functionality.

from dataconverters import dataconvert
dataconvert('simple.xls', 'out.csv')
dataconvert('out.csv', 'out.json')
dataconvert('input.spreadsheet', 'out.csv', format='xls')

Parsing data

Data Converters can do more than just convert data files. It can also parse tabular data into Python objects that capture the semantics of the source data.

Data Converters’ various parse functions each return an iterator over the records of the source data along with a metadata dictionary containing information about the data. The records returned by parse are not just (e.g.) split strings: they’re dictionaries representing the contents of each row, with column names and data types auto-detected.

import dataconverters.xls as xls
with open('simple.xls') as f:
    records, metadata = xls.parse(f)
    print metadata
    print [r for r in records]
=> {'fields': [{'type': 'DateTime', 'id': u'date'}, {'type': 'Integer', 'id': u'temperature'}, {'type': 'String', 'id': u'place'}]}
=> [{u'date': datetime.datetime(2011, 1, 1, 0, 0), u'place': u'Galway', u'temperature': 1.0}, {u'date': datetime.datetime(2011, 1, 2, 0, 0), u'place': u'Galway', u'temperature': -1.0}, {u'date': datetime.datetime(2011, 1, 3, 0, 0), u'place': u'Galway', u'temperature': 0.0}, {u'date': datetime.datetime(2011, 1, 1, 0, 0), u'place': u'Berkeley', u'temperature': 6.0}, {u'date': datetime.datetime(2011, 1, 2, 0, 0), u'place': u'Berkeley', u'temperature': 8.0}, {u'date': datetime.datetime(2011, 1, 3, 0, 0), u'place': u'Berkeley', u'temperature': 5.0}]

What’s next?

Excel spreadsheets and CSVs aren’t the only kinds of data that need converting.

Data Converters also supports geodata conversion, including converting between KML (the format for geographical data used in Google Maps and Google Earth), GeoJSON, and ESRI Shapefiles.

Data Converters’ ability to convert between tabular data formats may also grow, adding JSON support on the input side and XLS(X) support on the output side—as well as new conversions for XML, SQL dumps, and SPSS.

Visit the Data Converters home page to learn how to install Data Converters and its dependencies, and check out Data Converters on GitHub to see how you can contribute to the project.

Labs newsletter: 12 December, 2013

We’re back after taking a break last week with a bumper crop of updates. A few things have changed: Labs activities are now coordinated entirely through GitHub. Meanwhile, there have been some updates around the Nomenklatura, Annotator, and Data Protocols projects, and some new posts on the Labs blog.

Migration from Trello to GitHub

For some time now, Labs activities requiring coordination have been organized on Trello—but those days are now over. Labs has moved its organizational setup over to GitHub, coordinating actions and making plans by means of GitHub issues. This change comes as a big relief to the many Labs members who already use GitHub as their main platform for collaboration.

General Labs-related activities are now tracked on the Labs site’s issues, and activities around individual projects are managed (as before!) through those projects’ own issues.

New Bad Data

New examples of bad data continue to roll in—and we invite even more new submissions.

Bad datasets added since last newsletter include the UK’s Greater London Authority spend data (65+ files with 25+ different structures!), Nature Magazine’s supplementary data (an awful PDF jumble), and more.

Nomenklatura: new alpha

As we’ve previously noted, Labs member Friedrich Lindenberg has been thinking about producing “a fairly radical re-framing” of the Nomenklatura data reconciliation service.

Friedrich has now released an alpha version of the new Nomenklatura at nk-dev.pudo.org. The major changes in this alpha include:

  • A fully JavaScript-driven frontend
  • String matching now happens inside the PostgreSQL database
  • Better introductory text explaining what Nomenklatura does
  • “entity” and “alias” domain objects have been merged into “entity”

Friedrich is keen to hear what people think about this prototype—so jump in, give it a try, and leave your comments at the Nomenklatura repo.

Annotator v1.2.9

A new maintenance release of Annotator came out ten days ago. This new version is intended to be one of the last in the v1.2.x series—indeed, v1.2.8 itself was intended to be the last, but that version had some significant issues that this new release corrects.

Fixes in this version include:

  • Fixed a major packaging error in v1.2.8. Annotator no longer exports an excessive number of tokens to the page namespace.
  • Notification display bugfixes. Notification levels are now correctly removed after notifications are hidden.

The new Annotator is available, as always, from GitHub.

Data Protocols updates

Data Protocols is a project to develop simple protocols and formats for working with open data. Rufus Pollock wrote a cross-post to the list about several new developments with Data Protocols of interest to Labs. These included:

  • Close to final agreement on a spec for adding “primary keys” to the JSON Table Schema (discussion)
  • Close to consensus on spec for “foreign keys” (discussion)
  • Proposal for a JSON spec for views of data, e.g. graphs or maps (discussion)

For more, check out Rufus’s message and the Data Protocols issues.

On the blog

Labs members have added a couple of new posts to the blog since the last newsletter. Yours truly (with extensive help from Rufus) posted on using Data Pipes to view a CSV. Michael Bauer, meanwhile, wrote about the new Reconcile-CSV service he developed while working on education data in Tanzania. Look to the Labs blog for the full scoop.

Get involved

If you have some spare time this holiday season, why not spend it helping out with Labs? We’re always looking for new people to join the community—visit the Labs issues and the Ideas Page to get some ideas for how you can join in.

View a CSV (Comma Separated Values) in Your Browser

This post introduces one of the handiest features of Data Pipes: fast (pre)viewing of CSV files in your browser (and you can share the result by just copying a URL).

The Raw CSV

CSV files are frequently used for storing tabular data and are widely supported by spreadsheets and databases. However, you can’t usually look at a CSV file in your browser - most browsers will simply download the file instead. And even when you can see the raw CSV, it is not very pleasant to look at:

[Screenshot: the raw CSV]

The Result

But using the Data Pipes html feature, you can turn an online CSV into a pretty HTML table in a few seconds. For example, the CSV you’ve just seen becomes this table:

[Screenshot: the same CSV rendered as an HTML table]

Using it

To use this service, just visit http://datapipes.okfnlabs.org/html/ and paste your CSV’s URL into the form.

For power users (or for use from the command line or an API), you can just append your CSV URL to:

http://datapipes.okfnlabs.org/csv/html/?url=
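For instance, here is a small Python 3 sketch that builds a shareable preview link, URL-encoding the target CSV for safety (the sample CSV is the one used later in this post):

from urllib.parse import quote

csv_url = 'https://raw.github.com/okfn/datapipes/master/test/data/gla.csv'
print('http://datapipes.okfnlabs.org/csv/html/?url=' + quote(csv_url, safe=''))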

Previewing Just the First Part of a CSV File

You can also extend this basic previewing using other Data Pipes features. For example, suppose you have a big CSV file (say, more than a few thousand rows). If you tried to turn the whole thing into an HTML table and view it in your browser, it would probably crash the browser.

So what if you could see just part of the file? After all, you may well only be interested in what the CSV looks like, not in every row. Fortunately, Data Pipes can show just the first 10 lines of a CSV file via its head operation. To demonstrate, let’s extend the example above to use head. This gives us the following URL (click to see the live result):

http://datapipes.okfnlabs.org/csv/head/html/?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv

Colophon

Data Pipes is a free and open service run by Open Knowledge Foundation Labs. You can find the source code on GitHub at: https://github.com/okfn/datapipes. It is also available as a Node library and command-line tool.

If you like previewing CSV files in your browser, you might also be interested in the Recline CSV Viewer, a Chrome plugin that automatically turns CSVs into searchable HTML tables in your browser.