Category Archives: Open Knowledge Foundation

Archiving Twitter the Hacky Way

There are many circumstances where you want to archive tweets - maybe just from your own account or perhaps for a hashtag for an event or topic.

Unfortunately Twitter search queries do not give data more than 7 days old and for a given account you can only get approximately the last 3200 of your tweets and 800 items from your timeline. [Update: People have pointed out that Twitter released a feature to download an archive of your personal tweets at the end of December - this, of course, still doesn’t help with queries or hashtags]

Thus, if you want to archive twitter you’ll need to come up with another solution (or pay them, or a reseller, a bunch of money - see Appendix below!). Sadly, most of the online solutions have tended to disappear or be acquired over time (e.g. twapperkeeper). So a DIY solution would be attractive. After reading various proposals on the web I’ve found the following to work pretty well (but see also this excellent google spreadsheet based solution).

The proposed process involves 3 steps:

  1. Locate the Twitter Atom Feed for your Search
  2. Use Google Reader as your Archiver
  3. Get your data out of Google Reader (1000 items at a time!)

One current drawback of this solution is that each stage has to be done by hand. It could be possible to automate more of this, and especially the important third step, if I could work out how to do more with the Google Reader API. Contributions or suggestions here would be very welcome!

Note that the above method will become obsolete as of March 5 2013 when Twitter close down RSS and Atom feeds - continuing their long march to becoming a more closed and controlled ecosystem.

As you struggle, like me, to get precious archival information out of Twitter it may be worth reflecting on just how much information you’ve given to Twitter that you are now unable to retrieve (at least without paying) …

Twitter Atom Feed

Twitter still have Atom feeds for their search queries:

http://search.twitter.com/search.atom?q=my_search

Note that if you want to search for a hash tag like #OpenData or a user e.g. @someone you’ll need to escape the symbols:

http://search.twitter.com/search.atom?q=%23OpenData
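
As a minimal sketch of the escaping step, the standard `encodeURIComponent` function can build the feed URL for you. The helper name `twitterSearchFeedUrl` is made up for illustration; the base URL is the one given above.

```javascript
// Build the Atom feed URL for a Twitter search query.
// twitterSearchFeedUrl is a hypothetical helper name, not part of any library.
function twitterSearchFeedUrl(query) {
  // encodeURIComponent turns '#' into '%23' and '@' into '%40'
  return 'http://search.twitter.com/search.atom?q=' + encodeURIComponent(query);
}

console.log(twitterSearchFeedUrl('#OpenData'));
console.log(twitterSearchFeedUrl('@someone'));
```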

Unfortunately twitter atom queries are limited to only a few items (around 20) so we’ll need to continuously archive that feed to get full coverage.

Archiving in Google Reader

Just add the previous feed URL in your Google Reader account. It will then start archiving.

Aside: because the twitter atom feed is limited to a small number of items and the check in google reader only happens every 3 hours (1h if someone else is archiving the same feed) you can miss a lot of tweets. One option could be to use Topsy’s RSS feeds, e.g. http://otter.topsy.com/searchdate.rss?q=%23okfn (though it is not clear how to get more items from this feed either!)

Getting Data out of Google Reader

Google Reader offers a decent (though still beta) API. Unofficial docs for it can be found here: http://undoc.in/

The key URL we need is:

http://www.google.com/reader/atom/feed/[feed_address]?n=1000

Note that the feed is limited to a maximum of 1000 items and you can only access it for your account if you are logged in. This means:

  • If you have more than 1000 items you need to find the continuation token in each set of results and then add &c={continuation-token} to your query.
  • Because you need to be logged in via your browser you need to do this by hand :-( (it may be possible to automate via the API but I couldn’t get anything to work - any tips much appreciated!)

Here’s a concrete example (note, as you need to be logged in this won’t work for you):

http://www.google.com/reader/atom/feed/http://search.twitter.com/search.atom%3Fq%3D%2523OpenData?n=1000
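
The URL construction and paging described above can be sketched as follows. All three function names are hypothetical helpers, not part of any Google Reader client. The escaping is chosen simply to reproduce the example URL above, and the assumption that the continuation token sits in a `<gr:continuation>` element of the Atom response is mine, so check it against the feed you actually get back.

```javascript
// Hypothetical helpers (not part of any Google Reader client library).

// Escape the characters of the feed address that would otherwise be read as
// part of the Reader URL's own query string. '%' is escaped first so that an
// existing '%23' becomes '%2523', as in the example above.
function escapeFeedAddress(feedUrl) {
  return feedUrl
    .replace(/%/g, '%25')
    .replace(/\?/g, '%3F')
    .replace(/=/g, '%3D')
    .replace(/&/g, '%26');
}

// Build the Reader URL for one page of up to 1000 items. Pass the
// continuation token from the previous page to request the next one.
function readerPageUrl(feedUrl, continuationToken) {
  var url = 'http://www.google.com/reader/atom/feed/' +
    escapeFeedAddress(feedUrl) + '?n=1000';
  if (continuationToken) {
    url += '&c=' + encodeURIComponent(continuationToken);
  }
  return url;
}

// ASSUMPTION: the token appears in the Atom XML as a <gr:continuation> element.
// Returns the token string, or null when there is no further page.
function findContinuationToken(atomXml) {
  var match = /<gr:continuation>([^<]+)<\/gr:continuation>/.exec(atomXml);
  return match ? match[1] : null;
}

console.log(readerPageUrl('http://search.twitter.com/search.atom?q=%23OpenData'));
```

You would still need to be logged in to fetch each of these URLs, so this only saves you from assembling them by hand.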

And that’s it! You should now have a local archive of all your tweets!

Appendix

Increasingly, Twitter is selling access to the full Twitter archive and there are a variety of 3rd-party services (such as Gnip, DataSift, Topsy and possibly more) offering full or partial access for a fee.

Recline JS Search Demo

Recline JS

We’ve recently finished a demo for ReclineJS showing how it can be used to build JS-based (ajax-style) search interfaces in minutes (or even seconds!): http://reclinejs.com/demos/search/

Because of Recline’s pluggable backends you get out of the box support for data sources such as SOLR, Google Spreadsheet, ElasticSearch, or plain old JSON or CSV – see examples below for live examples of using different backends.

Interested in using this yourself? The (prettified) source JS for the demo is available (plus the raw version) and it shows how simple it is to build an app like this using Recline – plus it has tips on how to customize and extend.

More Examples

In addition to the simple example with local data there are several other examples showing how one can use this with other data sources including Google Docs and SOLR:

  1. A search example using a google docs listing Shell Oil spills in the Niger delta

  2. A search example running off the OpenSpending SOLR API – we suggest searching for something interesting like “Drugs” or “Nuclear power”!

Code

The full (prettified) source JS for the demo is available (plus the raw version) but here’s a key code sample to give a flavour:

// ## Simple Search View
//
// This is a simple bespoke Backbone view for the Search. It pulls together
// various Recline UI components and the central Dataset and Query (state)
// object
//
// It also provides simple support for customization e.g. of template for list of results
//
//      var view = new SearchView({
//        el: $('some-element'),
//        model: dataset,
//        // EITHER a mustache template (passed a JSON version of recline.Model.Record)
//        // OR a function which receives a record in JSON form and returns html
//        template: mustache-template-or-function
//      });
var SearchView = Backbone.View.extend({
  initialize: function(options) {
    this.el = $(this.el);
    _.bindAll(this, 'render');
    this.recordTemplate = options.template;
    // Every time we do a search the recline.Dataset.records Backbone
    // collection will get reset. We want to re-render each time!
    this.model.records.bind('reset', this.render);
    this.templateResults = options.template;
  },

  // overall template for this view
  template: ' \
    <div class="controls"> \
      <div class="query-here"></div> \
    </div> \
    <div class="total"><h2><span></span> records found</h2></div> \
    <div class="body"> \
      <div class="sidebar"></div> \
      <div class="results"> \
        {{{results}}} \
      </div> \
    </div> \
    <div class="pager-here"></div> \
  ',
 
  // render the view
  render: function() {
    var results = '';
    if (_.isFunction(this.templateResults)) {
      results = _.map(this.model.records.toJSON(), this.templateResults).join('\n');
    } else {
      // templateResults is just for one result, so wrap it in a section
      // that repeats it for every record
      var tmpl = '{{#records}}' + this.templateResults + '{{/records}}';
      results = Mustache.render(tmpl, {
        records: this.model.records.toJSON()
      });
    }
    var html = Mustache.render(this.template, {
      results: results
    });
    this.el.html(html);

    // Set the total records found info
    this.el.find('.total span').text(this.model.recordCount);

    // ### Now setup all the extra mini-widgets
    // 
    // Facets, Pager, QueryEditor etc

    var view = new recline.View.FacetViewer({
      model: this.model
    });
    view.render();
    this.el.find('.sidebar').append(view.el);

    var pager = new recline.View.Pager({
      model: this.model.queryState
    });
    this.el.find('.pager-here').append(pager.el);

    var queryEditor = new recline.View.QueryEditor({
      model: this.model.queryState
    });
    this.el.find('.query-here').append(queryEditor.el);
  }
});
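
To illustrate the function form of the `template` option, here is a small sketch that needs neither Backbone nor Recline. The record fields (`title`, `amount`) and the name `recordToHtml` are invented for the example; the map-and-join step mirrors the function branch of `render` above, with a plain array standing in for `this.model.records.toJSON()`.

```javascript
// Hypothetical template function: receives one record as plain JSON and
// returns an HTML string. The fields 'title' and 'amount' are made up.
// Note: values are inserted as-is, with no HTML escaping.
function recordToHtml(record) {
  return '<div class="record"><h3>' + record.title + '</h3>' +
    '<p>' + record.amount + '</p></div>';
}

// Stand-in for this.model.records.toJSON()
var records = [
  { title: 'First', amount: 10 },
  { title: 'Second', amount: 20 }
];

// Same idea as the function branch of render(): one HTML chunk per record
var results = records.map(function(record) {
  return recordToHtml(record);
}).join('\n');

console.log(results);
```

A function like this would be passed as `template` when constructing the view.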

WikipediaJS – accessing Wikipedia article data through Javascript

WikipediaJS is a simple JS library for accessing information in Wikipedia articles such as dates, places, abstracts etc.

The library is the work of Labs member Rufus Pollock. In essence, it is a small wrapper around the data and APIs of the DBPedia project and it is they who have done all the heavy lifting of extracting structured data from Wikipedia - huge credit and thanks to DBPedia folks!

Demo and Examples

A demo is included and you can see some examples of the library in action at the following links:

Colophon

One of the reasons for creating WikipediaJS is that we think it can be useful in Timeliner and other apps as a way to quickly add new items to your timeline.

State Budget Crisis Task Force Report

The State Budget Crisis Task Force was convened in June 2011 and issued its report in July 2012.

The top line quote from the main site states:

State finances are not transparent and often include hidden liabilities as well as rapidly growing responsibilities which are difficult to control. While state revenues are gradually recovering from the drastic decline of the Great Recession, they are not growing sufficiently to keep pace with the spending required by Medicaid costs, pensions, and other responsibilities and obligations. This has resulted in persistent and growing structural deficits in many states which threaten their fiscal sustainability. [emphasis added]

Full report (pdf)

Debt Does Not Equal Revenue Except in California

Striking quote on inability to understand that debt != revenue:

California is also confused about the meaning of the term “revenues”. Asked at a 2008 budget conference whether Schwarzenegger would consider raising revenues to balance the budget, Thomas Sheehy, deputy director of the Department of Finance, replied that the governor’s budget, in fact, already included new revenues: $3.3 billion from the sale of deficit bonds! A corporate executive who reports borrowed dollars as sales is angling for a bunk in federal prison. It doesn’t take much financial sophistication to understand that a cash advance on your credit card isn’t revenue. It is debt.

California Crack-up, p.95

The authors follow this with a comment which I think is of striking relevance to Open Spending:

The first, crucial step towards responsible and democratic budgeting is to present the state’s fiscal information to Californians honestly and clearly.

It also reminds me of Niall Ferguson’s statement quoted in a previous post:

The present system is, to put it bluntly, fraudulent. There are no regularly published and accurate official balance sheets. Huge liabilities are simply hidden from view.

Not even the current income and expenditure statements can be relied upon in some countries. No legitimate business could possibly carry on in this fashion.

Timeliner – Make Nice Timelines Fast

As part of the Recline launch I quickly put together some very simple demo apps, one of which was called Timeliner:

http://timeliner.reclinejs.com/

This uses the Recline timeline component (which itself is a relatively thin wrapper around the excellent Verite timeline) plus the Recline Google docs backend to provide an easy way for people to make timelines backed by a Google Docs spreadsheet.

As an example of use, I started work on a “spending stories” timeline about the bankruptcy of US cities (esp in California) as a result of the “Great Recession” (source spreadsheet). I’ve also created an example timeline of major wars, a screenshot of which I’ve inlined:

Code

Source code for the Timeliner is here: https://github.com/okfn/timeliner

If you have suggestions for improvements, want to see the ones that already exist, or, gasp, find a bug please see the issue tracker: https://github.com/okfn/timeliner/issues

The Data Transformer – Cleaning Up Data in the Browser

This is a brief post to announce an alpha prototype version of the Data Transformer, an app to let you clean up data in the browser using javascript:

http://transformer.datahub.io/

2m overview video:

What does this app do?

  1. You load a CSV file from github (fixed at the moment but soon to be customizable)
  2. You write simple javascript to edit this file (uses ReclineJS transform and grid views + CSV backends – here’s the original ReclineJS transform demo)
  3. You save this updated file back to github (via oauth login - this utilizes Michael’s great work in Prose!)
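
To give a flavour of step 2, here is a hedged sketch of the kind of per-record function one might write. It assumes each CSV row arrives as a plain object keyed by column name and that the function returns the edited row; the column names (`country`, `amount`) and the sample rows are invented, and the exact signature the app expects may differ.

```javascript
// Hypothetical transform: tidy one row and return it.
function transform(doc) {
  // trim stray whitespace and normalize case
  doc.country = doc.country.trim().toUpperCase();
  // parse a number out of a string such as "1,200"
  doc.amount = parseFloat(String(doc.amount).replace(/,/g, ''));
  return doc;
}

// Invented sample rows standing in for the parsed CSV file
var rows = [
  { country: ' uk ', amount: '1,200' },
  { country: 'de', amount: '350.5' }
];

// Apply the transform to every row
var cleaned = rows.map(function(row) {
  return transform(row);
});

console.log(cleaned);
```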

This prototype was hacked together in an afternoon a couple of weeks ago when I was fortunate enough to spend an afternoon with Michael Aufreiter, Chris Herwig, Mike Morris and others at the Development Seed offices. It builds on ReclineJS + oauth / github connectors borrowed from Prose.

It’s part of an ongoing plan to create a “Data Orchestra” of lightweight data services that can play nicely together with each other and connect to things like the DataHub (or GitHub …): http://notebook.okfn.org/2012/06/22/datahub-small-pieces-loosely-joined/

Public Debt, Public Finances and OpenSpending

Excerpts and commentary on Niall Ferguson’s first Reith Lecture. All emphasis added.

In reading this piece I thought constantly of the Open Spending project where we are endeavouring to collect together government (and other public) financial information from around the world and present it in an understandable way. In particular, it made me wonder whether we should try to do more beyond collection and presentation of the data to provide additional (necessarily somewhat speculative) computations such as proper financial balance sheets.

Fraudulent and inaccurate public finances

The present system is, to put it bluntly, fraudulent. There are no regularly published and accurate official balance sheets. Huge liabilities are simply hidden from view.

Not even the current income and expenditure statements can be relied upon in some countries. No legitimate business could possibly carry on in this fashion.

The last corporation to publish financial statements this misleading was Enron.

There is, in fact, a better way. Public sector balance sheets can – and should be – drawn up so that the liabilities of governments can be compared with their assets.

That would help clarify the difference between deficits to finance investment and deficits to finance current consumption. Governments should also follow the lead of business and adopt the Generally Accepted Accounting Principles.

And, above all, generational accounts should be prepared on a regular basis to make absolutely clear the inter-generational implications of current policy.

US liabilities

The most recent estimate for the difference between the net present value of federal government liabilities and the net present value of future federal revenues is $200 trillion, nearly thirteen times the debt as stated by the U.S. Treasury.

Notice that these figures, too, are incomplete, since they omit the unfunded liabilities of state and local governments, which are estimated to be around $38 trillion.

These mind-boggling numbers represent nothing less than a vast claim by the generation currently retired or about to retire on their children and grandchildren, who are obligated by current law to find the money in the future, by submitting either to substantial increases in taxation or to drastic cuts in other forms of public expenditure.

Scape-goating

As our economic difficulties have worsened, we voters have struggled to find the appropriate scapegoat.

We blame the politicians whose hard lot it is to bring public finances under control, but we also like to blame bankers and financial markets, as if their reckless lending was to blame for our reckless borrowing. [ed: but bankers often engaged in efforts to enable and prolong reckless borrowing, including heavy lobbying to prevent effective regulation. After all, one of the roles of the State is to help its citizens avoid bad decisions]

We bay for tougher regulation, though not of ourselves.

Weekly Update: Rufus Pollock – 2nd April 2012

Availability

  • All week

Last Week

This Week