You are browsing the archive for April 2011.

Searching for raw data

Tim McNamara - April 30, 2011 in Technical

So you want data, but you would like to use tools that you are already familiar with. That way, you can focus on the area of most interest to you: analysis.

Let’s consider the example of someone who is looking for air pollution information in Sydney, Australia. Australia has a data catalogue, but it’s likely that the catalogue will be hard to search through, may be incomplete, and may only give us the phone number or email address of someone to talk to. Google spends hundreds of millions of dollars a year indexing the web. Let’s make use of that.

To start with, let’s see how close we get just with:

sydney air pollution

Actually, we get pretty close. The first link takes us to the relevant government department’s web visualisation tool for Sydney’s “Air Quality Index” (AQI). Within one click, I could get to a web table with live data in it. But it seems impossible to get to the raw data. Inspecting the HTML source reveals that the table lives within an iframe pointing to what seems to be a bespoke web application that doesn’t provide access to its raw data.

Maybe someone has created a dump of the AQI?

The first change we could make to our query is to add a filetype: operator. To look for Excel spreadsheets, we simply add:

filetype:xls
If we would like to include .xlsx files, we can use brackets and a pipe character. To Google, this says, “either is fine”.

(filetype:xls | filetype:xlsx)

The first link is a World Bank source, which provides information about countries’ air pollution. While we may not have the source that we wanted, we have some raw data to play with as a fallback.

The problem with Excel spreadsheets is that they tend to have lots of formatting issues. Columns are inconsistent. There may be images or explanatory text. These characteristics occur because spreadsheets are created by people, for people. They’re made easier for people to read, which tends to make them harder for machines to read. But there is something better: CSV.

CSV is a great format. It’s plain text. It’s easy. It’s readable by everything. Most importantly though, it’s almost always written by and for machines, which makes it very easy to analyse. So how do we find CSV files? Same as before:

filetype:csv
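Once you have a CSV file in hand, analysis needs nothing more than the standard library. A minimal sketch, assuming a hypothetical extract of the data (the column names here are invented, not taken from the real dataset):

```python
import csv
import io

# Hypothetical extract of an air-quality CSV; the real columns will differ.
raw = """site,date,aqi
Rozelle,2011-04-01,32
Rozelle,2011-04-02,47
Liverpool,2011-04-01,55
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Average AQI per monitoring site.
by_site = {}
for row in rows:
    by_site.setdefault(row["site"], []).append(int(row["aqi"]))
averages = {site: sum(vals) / len(vals) for site, vals in by_site.items()}

print(averages)  # {'Rozelle': 39.5, 'Liverpool': 55.0}
```

Because CSV carries no formatting, there are no merged cells or stray images to strip out first, which is exactly why it beats Excel for this kind of work.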
Another format that’s worth mentioning here is TSV. TSV files are essentially the same as CSV, but use tab characters as separators and tend to be slightly easier for people to read. You may as well look for both:

(filetype:csv | filetype:tsv)

This will return many results whose URLs end in a file extension, just as files do on your hard drive. Unfortunately, many dynamic websites do not use file extensions; instead, they generate download links on the fly. The filetype: operator will miss these ones, because the search engine doesn’t think they look like files. To improve our coverage, let’s introduce inurl:.

(filetype:csv | filetype:tsv | inurl:csv | inurl:tsv)

Magic. Our search queries are now able to find those files. Beware, though, that inurl: will produce some false positives. It will also return much more data, including data from several countries that we’re not really interested in. How do we change that?

Use the site: operator. The interesting thing about site: is that it’s useful for more than individual sites. We can also use it to restrict results to a particular top-level domain. If we were only interested in Australian results, we could ask Google to filter them for us:

site:au

We can actually get far more specific. If we’re interested in government sources, just ask for them:

site:gov.au

One final operator that is useful to mention is ext:. One of the problems with filetype: is that it tends to perform poorly with more specialised file formats. These include XML files, Atom feeds, ESRI Shapefiles and other industry-specific formats.

Let’s return to the problem at hand. We would like a machine-readable dump of the AQI data for Sydney, Australia. Ideally, we will get our data from a government source. That means we need to construct a query that combines many of the approaches that I’ve mentioned in this post.

sydney air pollution (filetype:csv | filetype:tsv | ext:xml)

The first result is a file stored at the New South Wales Ministry of Health. It’s labelled “env_airaqi.csv”. Further inspection reveals that it provides samples from the original data source that we could previously only access via a web page. Bingo!

A reminder about copyright

Remember, you probably shouldn’t be distributing the files that you find. There’s the possibility that you will be infringing copyright. Stick to publishing the analysis or visualisations of the data. Those are your creative works. They don’t count as derived works. Copyright protects the expression of facts, not the facts themselves.


In summary:

  1. filetype:csv is the most likely way to get raw data quickly
  2. inurl:csv will reveal even more sources, but will return false positives
  3. site:gov.uk restricts the results to websites from the British government
  4. ext:xml returns formats that are not indexed by the filetype: operator
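If you find yourself composing these operators often, it can be handy to build the query string programmatically. A small sketch (the function name is my own, not any standard API):

```python
def build_query(terms, filetypes=(), inurl=(), site=None):
    """Compose a Google query string from search terms and operators."""
    parts = [terms]
    # filetype: and inurl: alternatives go inside one (a | b) group.
    ops = [f"filetype:{ft}" for ft in filetypes] + [f"inurl:{u}" for u in inurl]
    if ops:
        parts.append("(" + " | ".join(ops) + ")")
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)

query = build_query("sydney air pollution",
                    filetypes=["csv", "tsv"],
                    inurl=["csv", "tsv"],
                    site="gov.au")
print(query)
# sydney air pollution (filetype:csv | filetype:tsv | inurl:csv | inurl:tsv) site:gov.au
```

The output is just a string to paste into the search box; nothing here talks to Google itself.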

Update: I have created a scraper that provides much of the Sydney air quality dataset in raw form.

Data Hubs, Data Management Systems and CKAN

Rufus Pollock - April 27, 2011 in CKAN, Design

Data Hub / Data Management System: Features

Data Hub / Data Management System: Architecture


  • Is a Data Hub different from a Data Management System?
    • ANS: probably. A hub is, well, more “hubby”, i.e. more social, more ‘there’s one of me’, more about sharing. A DMS could be a Data Hub if it had certain features, but it could also not be.

The OpenSpending Data Format for Importing Data into OpenSpending

Rufus Pollock - April 21, 2011 in OpenSpending

We are working to make it easier for others to get data into OpenSpending.

Here’s a diagram that outlines the process:

We’ve been working on improved documentation and a spec:

Our first, and current, version of a spec (and documentation):

25k Spending: ETL Process

Friedrich Lindenberg - April 18, 2011 in CKAN, OpenSpending, Technical

Notes on last week’s discussion. Also produced (working) code:

Thinking about the General Problem: Diagram

Tool requirements


  • Get a list of all the files (from DGU)
    • Register them on the CKAN package
  • Consolidate them on disk (cache locally)
  • For each file:
    • Set up a task object
    • Validate the file
    • Convert to OpenSpending format
    • Load
    • Upsert?
    • Invalidate relevant cache


  • Provenance = FileId (Checksum/URL) + RowID [ + SheetID ]
  • Need a fingerprint of each spending entry (perhaps just the row itself?)
    • DeptID + TransactionID
    • DeptID + Expense Area + TransactionID
    • DeptID + Date + TransactionID
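One way to realise the fingerprint idea above is to hash the candidate key fields. A sketch only; the field names follow the notes, but the helper itself is hypothetical, not OpenSpending code:

```python
import hashlib

def fingerprint(entry, fields=("dept_id", "date", "transaction_id")):
    """Derive a stable ID for a spending entry from its candidate key fields."""
    key = "|".join(str(entry[f]) for f in fields)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

entry = {"dept_id": "D123", "date": "2011-04-18", "transaction_id": "T-9"}
fp = fingerprint(entry)

# The same input always yields the same fingerprint, so re-imports
# of the same file can be detected and deduplicated (the upsert case).
assert fp == fingerprint(dict(entry))
```

Hashing the whole row would also work, but keying on DeptID + Date + TransactionID survives cosmetic edits to the other columns.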

Steps (Generalized)

  • List
  • Retrieve
    • Report errors
  • Validate
    • Parse
    • Schema matching, headers
    • Check for null and badly typed values
    • Report errors
  • Transform
    • Provenance [* Decide on Fingerprint]
    • Apply schema
    • Report errors
  • Load (Sync)
    • Upsert, ensure matching??
    • Compute aggregates
    • Solr index
    • Report errors
  • Report issues to source (Government)
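The generalized steps above might translate into a pipeline skeleton like this. This is a sketch under my own naming, not OpenSpending’s actual code; the stub steps stand in for real retrieval, validation, transformation and loading:

```python
def run_pipeline(sources, retrieve, validate, transform, load):
    """List -> retrieve -> validate -> transform -> load, reporting errors per source."""
    report = {"loaded": 0, "errors": []}
    for src in sources:
        try:
            raw = retrieve(src)                      # fetch, or read from local cache
            rows = validate(raw)                     # parse, match schema, check types
            entries = [transform(r) for r in rows]   # apply schema, attach provenance
            load(entries)                            # upsert, recompute aggregates, reindex
            report["loaded"] += len(entries)
        except ValueError as exc:
            # Keep processing the other files; errors are reported at the end,
            # so issues can be fed back to the source (government).
            report["errors"].append((src, str(exc)))
    return report

# Toy run with stub steps:
def validate(raw):
    if "bad" in raw:
        raise ValueError("badly typed values")
    return [raw]

report = run_pipeline(["a.csv", "bad.csv"],
                      retrieve=lambda s: s,
                      validate=validate,
                      transform=lambda row: {"source": row},
                      load=lambda entries: None)
```

The key design point from the notes is that every step reports errors rather than aborting the whole run, so one malformed file doesn’t block the rest of the batch.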

Process DB Schema

report {

error {

entry {

Other links

Talk at RE:PUBLICA 2011

Rufus Pollock - April 15, 2011 in Talks

Yesterday I was at RE:PUBLICA XI to give a talk on Open Government Data in the opening session of the “open” stream. The crammed, over-capacity room was a nice indicator of the growing attention and interest being generated by open data, and especially open government data. Slides online here and below.

OKFN Organizational Diagram v0.1

Rufus Pollock - April 11, 2011 in Governance

Open Data Manual: Updates and Next Steps

Rufus Pollock - April 8, 2011 in Open Data Manual, Projects

This post is about the Open Data Manual. It was also sent as an email to the OGD and EU Open Data lists.

Below I outline where we are with the manual and the next steps we should take. Please let me know your thoughts.

For those who don’t get that far, a new (work-in-progress) version of the manual using the Sphinx doc framework [1] is at:

Source (please fork or request commit rights if you want to contribute!) at:

Current Status

  1. The first version of the manual was completed last autumn during a book sprint in Berlin.
  2. There was a period of community review with some additions.
  3. The manual was posted, with some re-formatting and amendments, on a WordPress site in January.
  4. It is now close to a basic v1.0.

Next Steps

There is a lot we can do to make the manual even better, for example:

  1. Translate it
  2. Bulk out many of the sections, some of which are quite rudimentary
  3. Include lots of examples
  4. Include information on working with data – getting it, processing it, visualizing it, etc. This would move the manual towards a more data-wrangler / data-user audience

I think we can make a lot of progress on these quite quickly. The one thing slightly holding us up at the moment is our current (tech) framework and process for the documentation which I discuss next.

The Documentation Framework

So far we have used a combination of Google Docs and WordPress. While these were a great starting point, they have some severe problems, especially if we want more people to get involved:

  1. Limited ‘documentation’ features (e.g. references between pages, table of contents, indexes etc)
  2. Difficult to track changes (no source control) which makes it hard to have more contributions and contributors (e.g. if someone now updates the google docs it will be a nightmare to reintegrate that into wordpress)
  3. Difficult to build to other formats e.g. PDF

I therefore propose:

  1. Move to using the Sphinx documentation system [1]. Sphinx uses reStructuredText, whose basic syntax is close to Markdown but which has many more features that make it suitable for something like this.
  2. Store the documentation in a version control system (Mercurial). This way people can just fork to contribute.
  3. Possibly complement this with a free-form wiki (for additional material, early drafts etc.)

I’ve already made a start on this by:

  1. Moving all the source for the manual into a Mercurial repo:
  2. Converting it from markdown/HTML (we had a mixture) to the Sphinx documentation system. You can see the results here (temporary location):

Talk at UKSG 2011 Conference

Rufus Pollock - April 6, 2011 in Talks

Yesterday, I was up in Harrogate at the UKSG (UK Serials Group) annual conference to speak in a keynote session on Open Bibliography and Open Bibliographic Data.

I’ve posted the slides online and iframed below.


Over the past few years, there has been explosive growth in open data, with significant uptake in government, research and elsewhere.

Bibliographic records are a key part of our shared cultural heritage. They too should therefore be open, that is, made available to the public for access and re-use under an open license which permits use and reuse without restriction. Doing this promises a variety of benefits.

First, it would allow libraries and other managers of bibliographic data to share records more efficiently and improve quality more rapidly through better, easier feedback. Second, it would enable increased innovation in bibliographic services and applications, generating benefits for the producers and users of bibliographic data and the wider community.

This talk will cover the what, why and how of open bibliographic data, drawing on recent direct experience such as the development of the Open Biblio Principles and the work of the Bibliographica and JISC OpenBib projects to turn the 3 million records of the British Library’s British National Bibliography (BNB) into linked open data.

With a growing number of Government agencies and public institutions making data open, is it now time for the publishing and library community to do likewise?

Javascript Templating and Frameworks

Rufus Pollock - April 3, 2011 in Uncategorized

An ongoing and incomplete review of JavaScript templating systems and frameworks.


Unobtrusive (HTML + JSON)

‘Standard’ Templating Browser



  • nodeunit
  • qunit
  • jasmine
  • sinon.js (mocking) – integrates with qunit well



  • backbone – used quite a bit
  • knockout
  • (big) sproutcore


  • express
    • tags: nodejs
  • backbone now supported pretty well

Messaging and Job Queues



  • For mongo:
  • Backbone sort of includes one (though relationships are poorly handled at the moment)