Category Archives: Software

ANN: PyWordPress – Python WordPress Library using the WordPress XML-RPC API

Announcing pywordpress, a Python library that provides a pythonic interface to WordPress using the WordPress XML-RPC API.

Along with a wrapper for the main API functions, it also provides various helper methods, for example to create many pages at once. This is a somewhat belated announcement, as the first version was written almost a year ago!

Usage

Command line

Check out the commands::

wordpress.py -h 

Commands::

create_many_pages: Create many pages at once (and only create pages which do not already exist).
delete_all_pages: Delete all pages (i.e. delete_page for each page in instance).
delete_page: http://codex.wordpress.org/XML-RPC_wp#wp.deletePage
edit_page: http://codex.wordpress.org/XML-RPC_wp#wp.editPage
get_authors: http://codex.wordpress.org/XML-RPC_wp#wp.getAuthors
get_categories: http://codex.wordpress.org/XML-RPC_wp#wp.getCategories
get_page: http://codex.wordpress.org/XML-RPC_wp#wp.getPage
get_page_list: http://codex.wordpress.org/XML-RPC_wp#wp.getPageList
get_pages: http://codex.wordpress.org/XML-RPC_wp#wp.getPages
get_tags: http://codex.wordpress.org/XML-RPC_wp#wp.getTags
init_from_config: Class method to initialize a `Wordpress` instance from an ini file.
new_page: http://codex.wordpress.org/XML-RPC_wp#wp.newPage

You will need to create a config with the details (url, login) of the wordpress instance you want to work with::

cp config.ini.tmpl config.ini
# now edit away ...
vim config.ini

Python library

Read the code documentation::

>>> from pywordpress import Wordpress
>>> help(Wordpress)
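For the curious, the wrapper itself is thin: pywordpress just speaks XML-RPC to WordPress. Here's a rough sketch, using only the standard library, of what reading the config and talking to the endpoint looks like (the config key and section names here are my illustration, not necessarily those in config.ini.tmpl):

```python
# Sketch of what the pywordpress wrapper does under the hood, using only
# the standard library. The ini section/key names are assumptions for
# illustration, not the library's actual config schema.
import configparser
import xmlrpc.client

def wordpress_proxy_from_config(path):
    """Build an XML-RPC proxy plus credentials from an ini file."""
    config = configparser.ConfigParser()
    config.read(path)
    section = config["wordpress"]  # assumed section name
    proxy = xmlrpc.client.ServerProxy(section["xmlrpc_url"])
    return proxy, section["user"], section["password"]

def get_page_list(proxy, blog_id, user, password):
    # Corresponds to http://codex.wordpress.org/XML-RPC_wp#wp.getPageList
    return proxy.wp.getPageList(blog_id, user, password)
```

The real library wraps these calls in methods on a single class (see `init_from_config` above), which is rather more convenient than threading credentials through every call.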

License

MIT-licensed: http://www.opensource.org/licenses/mit-license.php

Tabular Data Formats

As part of recent work on the DataExplorer I’ve been looking into formats / schemas for tabular data and have just posted this info on the wiki:

http://wiki.ckan.org/Data_Formats#Formats_-_Tabular

The list is quite short and if anyone out there has useful links or comments I’d love to know more (as one example, I hear very positive things about R and its data frames but have not yet tracked down a really good overview of its interface or how it's designed).

Background: why are we looking at this? The immediate reason is that we want to define a lightweight intermediate format for DataExplorer (and possibly the Webstore) into which one can convert incoming data coming from different sources (e.g. Webstore, Google docs, OData etc) before exporting to formats needed for the display widgets (such as SlickGrid, flot, d3 etc).
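To make the idea concrete, here's a minimal sketch (in Python) of the kind of fields-plus-rows intermediate structure I have in mind; nothing here is a settled spec, just an illustration:

```python
# A sketch of a lightweight intermediate format for tabular data: field
# metadata plus row arrays, JSON-serializable so display widgets like
# SlickGrid or flot could consume it directly. Illustrative, not a spec.
import json

def to_tabular(records):
    """Convert a list of dicts (e.g. rows coming from Google Docs or
    OData) into a fields-plus-rows structure."""
    fields = sorted({key for record in records for key in record})
    rows = [[record.get(field) for field in fields] for record in records]
    return {"fields": fields, "rows": rows}

data = to_tabular([
    {"country": "UK", "spend": 100},
    {"country": "France", "spend": 90},
])
print(json.dumps(data))
```

The appeal of something this simple is that every incoming source only needs one converter into it, and every widget only needs one converter out of it.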

hg-git and pushing to git from mercurial

Documenting my experience pushing mercurial repos to git (and github specifically).

Install hg-git

Follow https://bitbucket.org/durin42/hg-git/src/tip/README.md

Install dulwich >= 0.6. On Ubuntu:

sudo apt-get install python-dulwich

Get the latest version of hg-git:

hg clone https://bitbucket.org/durin42/hg-git

Add it to your extensions

[extensions]
git = path/to/hg-git/hggit

Push an existing mercurial repo

Assuming you’ve got a git repo somewhere, e.g. for me (rgrp) on github:

 cd my-current-mercurial-repo
 hg push git+ssh://git@github.com/rgrp/myrepo

Really important note: do not change the ‘git’ before the @ sign to your username as you would in mercurial; leave it as ‘git’. This cost me around 20 minutes of googling, with errors like:

Permission denied (publickey).
abort: the remote end hung up unexpectedly

You may also want to check your ssh setup with github really is working (see http://help.github.com/troubleshooting-ssh/).

Datapkg 0.8 Released

A new release (v0.8) of datapkg, the tool for distributing, discovering and installing data, is out!

There’s a quick getting started section below (also see the docs).

About the release

This release brings substantial improvements to the download functionality of datapkg including support for extending the download system via plugins. The full changelog below has more details and here’s an example of the new download system being used to download material selectively from the COFOG package on CKAN.

# download metadata and all resources from cofog package to current directory
# Resources to retrieve will be selected interactively
download ckan://cofog .

# download all resources
# Note need to quote *
download ckan://name path-on-disk "*"

# download only those resources that have format 'csv' (or 'CSV')
download ckan://name path-on-disk csv

For more details see the documentation of the download command:

datapkg help download
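The resource filtering is essentially glob matching on the resource format. Here's a rough Python sketch of the idea (the resource dicts are hypothetical, loosely modelled on CKAN package resources; datapkg's real internals may well differ):

```python
# A sketch of the glob-style resource filtering the new download system
# supports. The resource dicts are hypothetical examples, loosely
# modelled on CKAN package resources.
import fnmatch

def select_resources(resources, pattern):
    """Select resources whose format matches a glob pattern
    (case-insensitively), as in: download ckan://name path-on-disk csv"""
    return [r for r in resources
            if fnmatch.fnmatch(r.get("format", "").lower(), pattern.lower())]

resources = [
    {"url": "http://example.com/cofog.csv", "format": "CSV"},
    {"url": "http://example.com/cofog.xls", "format": "XLS"},
]
print(select_resources(resources, "csv"))  # only the CSV resource
print(select_resources(resources, "*"))    # everything
```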

Get started fast

# 1. Install: (requires python and easy_install)
$ easy_install datapkg
# Or, if you don't like easy_install:
$ pip install datapkg
# (or even install from the raw source!)

# 2. [optional] Take a look at the manual
$ datapkg man

# 3. Search for something
$ datapkg search ckan:// gold
gold-prices -- Gold Prices in London 1950-2008 (Monthly)

# 4. Get some data
# This will result in a csv file at /tmp/gold-prices/data
$ datapkg download ckan://gold-prices /tmp

Find out more » — including how to create, register and distribute your own ‘data packages’.

Changelog

  • ResourceDownloader objects and plugin point (#964)
  • Refactor PackageDownloader to use ResourceDownloader and support Resource filtering
  • Retrieval options for package resources (#405). Support selection of resources to download (on the command line or via the API) via glob-style patterns or user interaction.

PyWordPress – Python Library for WordPress

Announcing pywordpress, a python interface to WordPress using the WordPress XML-RPC API.

Usage

Command line

Check out the commands::

wordpress.py -h 

You will need to create a config with the details (url, login) of the wordpress instance you want to work with::

cp config.ini.tmpl config.ini
# now edit away ...
vim config.ini

Python library

Read the code documentation::

>>> from pywordpress import Wordpress
>>> help(Wordpress)

CKAN v1.2 Released together with Datapkg v0.7

This is a cross-post of the release announcement originally put up on the OKFN Blog.


We’re delighted to announce CKAN v1.2, a new major release of the CKAN software. This is the largest iteration so far, with 146 tickets closed, and includes some really significant improvements, most importantly a new extension/plugin system, SOLR search integration, caching and INSPIRE support (more details below). The extension work is especially significant as it now means you can extend CKAN without having to delve into any core code.

In addition there are now over 20 CKAN instances running around the world and CKAN is being used in official government catalogues in the UK, Norway, Finland and the Netherlands. Furthermore, http://ckan.net/ — our main community catalogue — now has over 1500 data ‘packages’ and has become the official home for the LOD Cloud (see the lod group on ckan.net).

We’re also aiming to provide a much more integrated ‘datahub’ experience with CKAN. Key to this is the provision of a ‘storage’ component to complement the registry/catalogue component we already have. Integrated storage will support all kinds of important functionality from automated archival of datasets to dataset cleaning with google refine.

We’ve already been making progress on this front with the launch of a basic storage service at http://storage.ckan.net/ (back in September) and the development of the OFS bucket storage library. The functionality is still at an alpha stage and integration with CKAN is still limited so improving this area will be a big aim for the next release (v1.3).

Even in its alpha stage, we are already making use of the storage system, most significantly, in the latest release of datapkg, our tool for distributing, discovering and installing data (and content) ‘packages’. In particular, the v0.7 release (more detail below) includes upload support allowing you store (as well as register) your data ‘packages’.

Highlights of CKAN v1.2 release

  • Package edit form: attach package to groups (#652) & revealable help
  • Form API – Package/Harvester Create/New (#545)
  • Authorization extended: authorization groups (#647) and creation of packages (#648)
  • Extension / Plug-in interface classes (#741)
  • WordPress twentyten compatible theming (#797)
  • Caching support (ETag) (#693)
  • Harvesting GEMINI2 metadata records from OGC CSW servers (#566)

Minor:

  • New API key header (#466)
  • Group metadata now revisioned (#231)

All tickets

Datapkg Release Notes

A major new release (v0.7) of datapkg is out!

There’s a quick getting started section below (also see the docs).

About the release

This release brings major new functionality to datapkg especially in regard to its integration with CKAN. datapkg now supports uploading as well as downloading and can now be easily extended via plugins. See the full changelog below for more details.

Get started fast

# 1. Install: (requires python and easy_install)
$ easy_install datapkg
# Or, if you don't like easy_install:
$ pip install datapkg
# (or even install from the raw source!)

# 2. [optional] Take a look at the manual
$ datapkg man

# 3. Search for something
$ datapkg search ckan:// gold
gold-prices -- Gold Prices in London 1950-2008 (Monthly)

# 4. Get some data
# This will result in a csv file at /tmp/gold-prices/data
$ datapkg download ckan://gold-prices /tmp

# 5. Store some data
# Edit the gold prices csv making some corrections
$ cp gold-prices/data mynew.csv
$ edit mynew.csv
# Now upload back to storage
$ datapkg upload mynew.csv ckan://mybucket/ckan-gold-prices/mynew.csv

Find out more » — including how to create, register and distribute your own ‘data packages’.

Changelog

  • MAJOR: Support for uploading datapkgs (upload.py)
  • MAJOR: Much improved and extended documentation
  • MAJOR: New sqlite-based DB index giving support for a simple, central, ‘local’ index (ticket:360)
  • MAJOR: Make datapkg easily extendable

    • Support for adding new Index types with plugins
    • Support for adding new Commands with command plugins
    • Support for adding new Distributions with distribution plugins
  • Improved package download support (also now pluggable)

  • Reimplement url download using only the python std lib (removing the urlgrabber requirement and simplifying installation)
  • Improved spec: support for db type index + better documentation
  • Better configuration management (especially internally)
  • Reduce dependencies by removing usage of PasteScript and PasteDeploy
  • Various minor bugfixes and code improvements

Credits

A big hat-tip to Mike Chelen and Matthew Brett for beta-testing this release and to Will Waites for code contributions.

Open-Source Annotation Toolkit for Inline, Online Web Annotation

I’ve been working on web-annotation — inline, online annotation of web texts — for several years.

My original motivation was to support annotation of texts in http://openshakespeare.org/ so we can collaboratively build up critical notes but since then I’ve seen this need again and again — in drafting new open data licenses, with scholars working on medieval canon law, when taking my own notes on academic papers.

http://openshakespeare.org annotation

Open Shakespeare’s Hamlet in annotate mode

What’s surprised me is that there appears to be no good opensource tool out there to do this. There are several commercial offerings (including annotation in google docs), and there have been opensource attempts such as annotea, Stet (for GPLv3), marginalia, and co-ment, but none of these really seemed to work. My original implementation in 2006/2007 of annotation for http://openshakespeare.org/ used http://geof.net/’s (excellent) marginalia library, but I ultimately ran into performance and integration problems.

Thus, a year and a half ago, Nick Stenning and I started the annotator project to create a new, simple javascript (+ backend) library for web-annotation. Our main goals were and are:

  • Annotation of arbitrary text ranges
  • Annotate any web (html) document
  • Easy to use — 2 lines of javascript to insert this in your web page/app etc
  • Well-factored and library-structured — easy to integrate and easy to extend

Nick (who’s a great javascript and css developer) has been responsible for writing all of the frontend (i.e. the annotation stuff you actually see!) while I’ve developed the backend annotation store.

In the way of spare-time projects, development has been rather slower than we would have liked, but we now have a functioning alpha which has been running successfully on http://openshakespeare.org/ for the last 6 months.

Furthermore, the system is completely app-agnostic and is incredibly easy to use — adding annotation to your web page only requires one line of jquery javascript (assuming a backend is set up):

$('#your-element-id').annotator()

Interested? Below are links to project information including the source code and docs and mailing list. We’re especially eager to get feedback from those looking to integrate into other apps or who would like to help develop the library features.

Project Info

Source code

Features

  • Open JSON-REST annotation protocol – simple JSON and REST-based
  • Javascript (jquery-based) library for inserting inline annotations in a given document supporting this protocol
  • One or more backends implementing this protocol (emphasis on backends that are easy to deploy using standard tools e.g. using sql database or couchdb)
  • Really simple: just do (jquery-esque) $('myelement').annotator() to get up and running
  • Fast even on large documents
  • Support of multiple users
  • Pluggable backends
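To give a flavour of the protocol side, here's a minimal in-memory sketch (in Python) of an annotation store doing the JSON round-trip a REST backend would expose. The annotation fields shown (uri, ranges, text) are illustrative assumptions rather than the project's exact schema:

```python
# A minimal sketch of an annotation store speaking a JSON-based protocol.
# A real backend would sit behind REST endpoints (POST /annotations,
# GET /annotations/<id>, ...); the field names are illustrative only.
import json

class AnnotationStore:
    def __init__(self):
        self._annotations = {}
        self._next_id = 0

    def create(self, payload):
        """Store an annotation posted as a JSON string; return its id."""
        annotation = json.loads(payload)
        annotation["id"] = self._next_id
        self._annotations[self._next_id] = annotation
        self._next_id += 1
        return annotation["id"]

    def read(self, annotation_id):
        """Return the stored annotation as a JSON string."""
        return json.dumps(self._annotations[annotation_id])

store = AnnotationStore()
note_id = store.create(json.dumps({
    "uri": "http://openshakespeare.org/hamlet",
    "ranges": [{"start": "/p[2]", "offset": 10, "length": 5}],
    "text": "A critical note on this passage.",
}))
print(store.read(note_id))
```

Because the protocol is just JSON over REST, backends are easy to swap: the same frontend can talk to a sql-database store or a couchdb store.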

Where Does My Money Go? Spending Explorer using Protovis and jQuery

Over the last couple of months I’ve been playing around with Protovis in my spare time to create an interactive pure javascript Government Spending Explorer for Where Does My Money Go? (datastore api):

Warning: won’t work in IE (at the moment, due to lack of SVG support) and works best (i.e. fastest) in Chrome!

I’d be interested in any feedback, and in suggestions from anyone with experience of protovis or other javascript libraries (I’ve also used flot and thejit a bit). In particular, one thing currently lacking in protovis is animation (something that’s good in thejit …).

Features:

  • True ‘explorer’: you can choose any set of breakdown ‘keys’ to visualize
  • Primary ‘financial bubbles’ view with interactive navigation into bubbles
    • Support for arbitrary depth of data ‘tree’ so you can keep navigating down (though currently limited by user interface to select at most 3 levels)
  • Multiple other visualizations including treemap, sunburst, dendrogram and ‘icicle’
  • Time support
  • View the source data in table or as json
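The core of the 'choose any breakdown keys' idea is just recursive aggregation of flat spending rows into a tree, one level per key. A sketch in Python (the row fields here are made up for illustration; the real explorer does this client-side in javascript):

```python
# A sketch of the 'breakdown keys' idea behind the spending explorer:
# aggregate flat spending rows into a nested tree, one level per key,
# summing amounts at the leaves. Row fields are illustrative.
def breakdown(rows, keys):
    """Return a nested {child: subtree} dict, one level per key in
    `keys`, with summed amounts at the leaves."""
    if not keys:
        return sum(row["amount"] for row in rows)
    tree = {}
    for row in rows:
        tree.setdefault(row[keys[0]], []).append(row)
    return {child: breakdown(group, keys[1:]) for child, group in tree.items()}

rows = [
    {"dept": "Health", "region": "North", "amount": 10},
    {"dept": "Health", "region": "South", "amount": 5},
    {"dept": "Education", "region": "North", "amount": 7},
]
print(breakdown(rows, ["dept", "region"]))
```

Supporting an arbitrary depth of tree is then just a matter of how many keys the user selects (currently capped at 3 levels by the interface).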

I’m sure there’s tons to improve especially on the usability (e.g. should default labels have amounts in them?) so if you take a look please let me know any feedback.

Some specific limitations:

  • Does not work in IE — but hope to fix this using svg.js soon
  • Colours and general ‘look’ could be improved — help wanted!
  • Occasional bugs e.g. weird redraws — if you find one please let me know

Datapkg v0.7 Beta Released

I’ve just put out a beta of a major new version of datapkg (see changelog below):

There’s a quick getting started section below (also see docs).

About the release

This is a substantial release with a lot of new features. As this is a client app which will run on a variety of platforms, it's been released as a beta first, so there’s a chance to catch any of the cross-platform compatibility bugs that inevitably show up. (My favourite from last time was a variation between python 2.5 and 2.6 in the way urlparse functioned for non-standard schemes …)

I’d therefore really welcome any feedback especially regarding bugs and from people using platforms I don’t usually — such as windows!

Get started fast

# 1. Install: (requires python and easy_install)
$ easy_install datapkg
# Or, if you don't like easy_install:
$ pip install datapkg
# (or even install from the raw source!)

# 2. [optional] Take a look at the manual
$ datapkg man

# 3. Search for something
$ datapkg search ckan:// gold
gold-prices -- Gold Prices in London 1950-2008 (Monthly)

# 4. Get some data
# This will result in a csv file at /tmp/gold-prices/data
$ datapkg download ckan://gold-prices /tmp

Find out more » — including how to create, register and distribute your own ‘data packages’.

Changelog

  • (MAJOR) Support for uploading datapkgs (upload.py)
  • (MAJOR) Much improved and extended documentation
  • (MAJOR) Make datapkg easily extendable
    • Support for adding new Index types with plugins
    • Support for adding new Commands with command plugins
    • Support for adding new Distributions with distribution plugins
  • Improved package download support (also now pluggable)
  • New sqlite-based DB index (ticket:360)
  • Improved spec: support for db type index + better documentation
  • Better configuration management (especially internally)
  • Reduce dependencies by removing the dependency on PasteScript and PasteDeploy
  • Various minor bugfixes and code improvements

Versioning / Revisioning for Data, Databases and Domain Models: Copy-on-Write and Diffs

There are several ways to implement revisioning (versioning) of domain models, databases, and data generally:

  • Copy on write – so one has a ‘full’ copy of the model/DB at each version.
  • Diffs: store diffs between versions (plus, usually, a full version of the model at a given point in time e.g. store HEAD)

In both cases one will usually want an explicit Revision/Changeset object which records:

  • timestamp
  • author of change
  • log message

In more complex revisioning models this metadata may also be used to store key data relevant to the revisioning structure (e.g. revision parents).
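Sketched as code, such a Revision/Changeset object might look like this (a minimal illustration, not any particular system's implementation):

```python
# A minimal sketch of a Revision/Changeset object carrying the metadata
# every revision records: timestamp, author, log message, and (for more
# complex models) a parent pointer.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Revision:
    author: str
    message: str  # log message
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    parent: Optional["Revision"] = None  # revisioning-structure metadata

r1 = Revision(author="rgrp", message="initial import")
r2 = Revision(author="rgrp", message="fix typo", parent=r1)
print(r2.parent.message)
```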

Copy on write

In its simplest form copy-on-write (CoW) would copy the entire DB on each change. However, this is clearly very inefficient, and hence one usually restricts the copy-on-write to the relevant changed “objects”. The advantage of doing this is that it limits the changes we have to store (in essence, objects unchanged between revision X and revision Y get “merged” into a single object).

For example, if our domain model had Person, Address, Job, a change to Person X would only require a copy of the Person X record (an even more standard example is wiki pages). Obviously, for this to work, one needs to be able to partition the data (domain model). With a normal domain model this is trivial: pick the object types, e.g. Person, Address, Job etc. However, for a graph setup (as with RDF) this is not so trivial.

Why? In essence, for copy on write to work we need:

  1. a way to reference entities/records
  2. support for putting objects in a deleted state

The (RDF) graph model has no good way of referencing triples (we could use named graphs, quads or reification, but none are great). We could move to the object level and only work with groups of triples (e.g. those corresponding to a “Person”). You’d also need to add a state triple to every base entity (be that a triple or named graph) and add that to every query statement. This seems painful.
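For the partitionable (non-RDF) case, a toy copy-on-write store makes the idea concrete: each commit copies only the mapping from record ids to records, so unchanged records are shared between revisions. This is a minimal sketch, not vdm's actual implementation:

```python
# A toy copy-on-write store for a partitioned domain model. Each commit
# shallow-copies the id -> record mapping and replaces only the changed
# record; unchanged records are shared between revisions.
class CowStore:
    def __init__(self):
        self.revisions = [{}]  # revision 0: empty model

    def commit(self, record_id, record):
        head = dict(self.revisions[-1])  # shallow copy: shares records
        head[record_id] = record          # copy-on-write of one record
        self.revisions.append(head)

    def get(self, record_id, revision=-1):
        return self.revisions[revision][record_id]

store = CowStore()
store.commit("person:1", {"name": "Alice", "job": "editor"})
store.commit("person:1", {"name": "Alice", "job": "author"})
print(store.get("person:1"))               # HEAD
print(store.get("person:1", revision=1))   # as of revision 1
```

A deleted state (point 2 above) could be modelled by committing a record with a `state: deleted` flag rather than removing it, so that old revisions still resolve.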

Diffs

The diff model involves computing diffs (forward or backward) for each change. A given version of the model is then computed by composing diffs.

Usually for performance reasons full representations of the model/DB at a given version are cached — most commonly HEAD is kept available. It is also possible to cache more frequently and, like copy-on-write, to cache selectively (i.e. only cache items which have changed since the last cache period).

The disadvantage of the diff model is the need for (and cost of) creating and composing diffs (CoW is, generally, easier to implement and use). However, it is more efficient in storage terms and works better with general data (one can always compute diffs), especially data that doesn’t have such a clear domain model — e.g. the RDF case discussed above.
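A toy illustration of the diff model with HEAD cached and backward diffs (again a sketch, not how any particular system implements it; treating "previous value was None" as "key did not exist" is a simplification):

```python
# A toy diff-model store: HEAD is kept as a full copy, and each commit
# records a backward diff (the previous values of changed keys). Older
# versions are reconstructed by composing backward diffs from HEAD.
class DiffStore:
    def __init__(self, initial):
        self.head = dict(initial)
        self.backward_diffs = []  # diff i undoes commit i+1

    def commit(self, changes):
        # record, for each changed key, its previous value
        diff = {key: self.head.get(key) for key in changes}
        self.backward_diffs.append(diff)
        self.head.update(changes)

    def version(self, n):
        """Reconstruct version n (0 = initial) by composing backward
        diffs from HEAD."""
        state = dict(self.head)
        for diff in reversed(self.backward_diffs[n:]):
            for key, old_value in diff.items():
                if old_value is None:
                    state.pop(key, None)  # simplification: None = absent
                else:
                    state[key] = old_value
        return state

store = DiffStore({"page": "v1"})
store.commit({"page": "v2"})
store.commit({"page": "v3", "title": "New"})
print(store.version(0))  # back to the initial state
print(store.head)        # current state, cached in full
```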

Usage

  • Wikis: Many wikis implement a full copy-on-write model with a full copy of each page being made on each write.
  • Source control: diff model (usually with HEAD cached and backwards diffs)
  • vdm: copy-on-write using SQL tables as core ‘domain objects’
  • ordf: (RDF) diffs with HEAD caching