You are browsing the archive for Friedrich Lindenberg.

Organisational IDs inside government: publicbodies.org

Friedrich Lindenberg - October 24, 2011 in Uncategorized

The final event of last week’s Open Government Data Camp in Warsaw was a workshop on organisational identifiers. While the bulk of the conversation revolved around IDs for companies and charities, this also prompted me to continue work on publicbodies.org. I had proposed and registered this site a few months ago after cleansing the UK’s departmental spending data.

While the 25k data is a great source of information, the data quality is very low. In particular, it has two fields called “Department Family” and “Entity” that reference the responsible department and public body disbursing the funds, respectively. The content of these fields is just text, leading to a large number of different spellings not only of the word “Health” but of department and entity names in general.

The solution to this is obviously Chris Taggart’s OpenCorporates.com: generate a unique REST resource for each public body and provide a reconciliation API to fuzzy match surface forms to these entities. So, let me present to you: publicbodies.org. The site currently contains data from the best sources on public body identities I could find: freedom of information request sites. With a quick bit of research, this gave me German public bodies, UK entities and EU institutions (with names in multiple languages).
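To make "fuzzy match surface forms" a bit more concrete, here is a minimal sketch using Python's standard difflib. The entity IDs and names are invented for the example and are not real publicbodies.org identifiers, and the actual service may well score matches differently.

```python
import difflib

# Hypothetical canonical entities: ID -> name (not real publicbodies.org IDs).
CANONICAL = {
    "gb-0001": "Department of Health",
    "gb-0002": "Department for Education",
    "gb-0003": "Ministry of Defence",
}


def normalize(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def reconcile(surface_form, cutoff=0.6):
    """Return (entity_id, score) for the closest canonical name, or None."""
    best_id, best_score = None, 0.0
    for entity_id, name in CANONICAL.items():
        score = difflib.SequenceMatcher(
            None, normalize(surface_form), normalize(name)
        ).ratio()
        if score > best_score:
            best_id, best_score = entity_id, score
    # Below the cutoff we prefer "no match" over a wrong match.
    if best_score < cutoff:
        return None
    return best_id, best_score


print(reconcile("Departmnet of  Health"))
print(reconcile("zzzz"))
```

The cutoff of 0.6 is an arbitrary starting point; for spending data it is probably worth reviewing anything that is not an exact match by hand.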

While the service is very usable as is (I’ve just reconciled the EC’s Financial Transparency System and a sample of the UK spending), there are still some tasks left:

  • Map national identifier systems to the scheme and support their resolution. Currently, the site is using random IDs prefixed with the country code of the entry.
  • Create redirects to similar resources, such as data.gov.uk’s reference project.
  • Make the site look nicer.

If you have public body information from another jurisdiction (or would be willing to merge additional information into the existing dataset), have a look at the main schema and the alias table to see how the data should be formatted. I’m happy to add other fields to the entity schema, though.

EU Spending Research

Friedrich Lindenberg - August 25, 2011 in Uncategorized

With the Utrecht workshop on EU spending data only about two weeks away, I’ve started to do some research on EU finance in general. While we’ve been exploring the budget, the Commission’s transparency portal and some parts of the regional development funds within OpenSpending for a while, I’d never really had a clear understanding of the proportions and amounts for each part of the EU’s intricate network of policies, objectives, programmes and funds.

And while there are many attempts on the part of the Commission (and whoever is writing the relevant Wikipedia pages) to explain individual funding schemes and overall budget priorities, it’s pretty hard to find a map. So – time to make one. On a very high level, here’s an overview of the EU funds system (click for Google Drawing):

This, as well as the underlying numbers research and the list of available data sources, is just a first attempt to collect knowledge: most figures are probably wrong (or outdated). But perhaps a few other spending cartographers will help fix and complete this over the following months.

Helmut the reconciliation server

Friedrich Lindenberg - July 29, 2011 in Uncategorized

Helmut Kohl is the German chancellor who famously claims to have single-handedly reunited Eastern and Western Germany. Given these impressive powers in reconnecting split entities based on their true identity, it’s a good thing we now have this for data as well.

Helmut the reconciliation server is a very simple service based on the webstore that provides four distinct services:

  • URIs for entities, e.g. MIME types, public bodies etc.
  • A reconciliation API to perform somewhat fuzzy matching on these entities, both based on a general query string and specified properties (follows Google Refine spec).
  • An alias normalization service where for each entity multiple aliases can be defined.
  • A web interface for creating alias mappings to manually refine matching.
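As a rough sketch of how the alias normalization and the reconciliation API fit together, the following answers a batch of queries from an in-memory alias table. The request and response shapes follow the Refine-style reconciliation convention (`q0`, `result`, `id`, `name`, `type`, `score`, `match`); the alias table, entity keys and the `reconcile_batch` function are hypothetical and not Helmut's actual code.

```python
import json

# Hypothetical alias table: normalized alias -> canonical entity key.
ALIAS_TABLE = {
    "dh": "department-of-health",
    "dept of health": "department-of-health",
    "department of health": "department-of-health",
}

# Hypothetical entity records keyed by the canonical key.
BODIES = {
    "department-of-health": {"name": "Department of Health", "type": "public-body"},
}


def reconcile_batch(queries):
    """Map {"q0": {"query": ...}} to {"q0": {"result": [...]}} via the alias table."""
    response = {}
    for key, query in queries.items():
        alias = " ".join(query["query"].lower().split())
        entity_key = ALIAS_TABLE.get(alias)
        results = []
        if entity_key is not None:
            body = BODIES[entity_key]
            results.append({
                "id": entity_key,
                "name": body["name"],
                "type": [body["type"]],
                "score": 100.0,  # exact alias hit; a fuzzy matcher would score lower
                "match": True,
            })
        response[key] = {"result": results}
    return response


batch = {"q0": {"query": "Dept  of Health"}, "q1": {"query": "Unknown Agency"}}
print(json.dumps(reconcile_batch(batch), indent=2))
```

Unknown names come back with an empty result list, which is where a manually maintained alias mapping would step in.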

Most of the idea is obviously stolen from the brilliant OpenCorporates/OpenlyLocal sites by Chris Taggart, but I hope that by making them into small and easy services, more things will have reconciliation APIs and become matchable. Using the webstore as a backend for this means users can easily upload an existing mapping spreadsheet they have and turn it into a URI scheme.

More info on the CKAN wiki.

datapatterns.org: let’s collect some tricks for data wrangling!

Friedrich Lindenberg - July 29, 2011 in Uncategorized

How do you scrape a massive online archive? How do you fix a broken CSV file? How do you normalize entity names in a large collection of records? There is a lot of practical skill in handling newly opened data, and the implicit promise of the open data movement is that we will help more people to access and re-use data. And while it would be desirable to be able to offer simple web-based tools for data wrangling, the truth is that what’s required is often a wild mix of web tools, desktop and command-line tools and programming skills. So what we need is the other half of the Open Data Manual.

datapatterns.org will be a collaborative attempt to collect specific tips on how to code, wrangle and hack your way through messy data. The site will not be the be-all and end-all of data literacy, but will rather adopt a focussed point of view:

  • We try to provide methods that are immediately useful for coders, data journalists, researchers etc. If it doesn’t solve a data acquisition, cleanup or use problem, it can probably wait a bit.
  • Assume basic knowledge of Python programming and web technologies. There are many ways to learn this, and we’d probably have a hard time trumping Zed Shaw.
  • Provide opinionated advice: it’s impossible to give a comprehensive overview of all tools, concerns or strategies relating to data and knowledge management. While it’s certainly interesting to discuss pros and cons of various technologies, it’s not always useful in practice. datapatterns.org will pick sides, and follow them through.
  • Link out. There’s no reason not to provide contextualized links instead of explaining things ourselves wherever possible.

So how will we create this? Luckily, we have at least two sources of information about data wrangling: the excellent questions on getthedata.org and our own attempts at making sense of data, e.g. in the OpenSpending project. Using these two sources of both questions and answers will probably mean we’ll start off with a slightly odd set of issues, but as with all OKF projects the answer is: bring your own! Either post questions to getthedata.org or write a chapter and commit it to the datapatterns repository on github.

OpenSpending/Solr Search Performance

Friedrich Lindenberg - May 11, 2011 in Uncategorized

The following is just a log of the observed performance characteristics of OpenSpending under load. All measurements have been performed against a local paster server.

Siege: 3 facets (limit: 500), full stats, benchmark mode
Availability: 100.00 %
Response time: 0.69 secs
Transaction rate: 7.18 trans/sec
Throughput: 0.12 MB/sec
Concurrency: 4.95
Longest transaction: 1.57
Shortest transaction: 0.21

(from: siege -c 5 -b -r 200 http://localhost:5000/dataset/ukdepartments, 1.8m entries)

Siege: no facets, no stats, benchmark mode
Availability: 100.00 %
Response time: 0.69 secs
Transaction rate: 7.21 trans/sec
Throughput: 0.12 MB/sec
Concurrency: 4.94
Longest transaction: 1.32
Shortest transaction: 0.23

(from: siege -c 5 -b -r 200 http://localhost:5000/dataset/ukdepartments, 1.8m entries)

Siege: no facets, no stats, benchmark mode, small dataset
Availability: 100.00 %
Response time: 0.94 secs
Transaction rate: 5.22 trans/sec
Throughput: 0.08 MB/sec
Concurrency: 4.90
Longest transaction: 1.65
Shortest transaction: 0.45

(from: siege -c 5 -b -r 200 http://localhost:5000/dataset/pwyf_uganda_budget, 12k entries)

Siege: no facets, no stats, benchmark mode, 10 instead of 5 concurrent clients
Availability: 100.00 %
Response time: 1.42 secs
Transaction rate: 7.00 trans/sec
Throughput: 0.11 MB/sec
Concurrency: 9.91
Longest transaction: 2.35
Shortest transaction: 0.35

(from: siege -c 10 -b -r 200 http://localhost:5000/dataset/ukdepartments, 1.8m entries)
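A quick sanity check on the four runs above: by Little's law, the number of requests in flight should be roughly the transaction rate multiplied by the response time. The sketch below just multiplies the figures from the log and prints them next to the concurrency that siege reported; the run labels are shorthand and the function name is made up.

```python
# (label, response time in secs, transaction rate in trans/sec, reported concurrency)
RUNS = [
    ("3 facets, full stats", 0.69, 7.18, 4.95),
    ("no facets, no stats", 0.69, 7.21, 4.94),
    ("small dataset", 0.94, 5.22, 4.90),
    ("10 clients", 1.42, 7.00, 9.91),
]


def estimated_concurrency(response_time, transaction_rate):
    """Little's law: requests in flight = throughput * time in system."""
    return response_time * transaction_rate


for label, secs, rate, reported in RUNS:
    estimate = estimated_concurrency(secs, rate)
    print(f"{label}: estimated {estimate:.2f}, reported {reported:.2f}")
```

The estimates stay within 0.05 of the reported values. Read this way, going from 5 to 10 clients roughly doubled the response time while the transaction rate stayed at about 7 per second, which would be consistent with the server already being saturated at 5 clients.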

Footnotes:
Latency between UAT FE Web Server and Solr Index:

--- eu4.okfn.org ping statistics ---
18 packets transmitted, 18 received, 0% packet loss, time 17026ms
rtt min/avg/max/mdev = 106.785/106.951/107.301/0.380 ms

25k Spending: ETL Process

Friedrich Lindenberg - April 18, 2011 in CKAN, OpenSpending, Technical

Notes on last week’s discussion re http://ckan.net/package/ukgov-25k-spending. Also produced (working) code:

https://bitbucket.org/okfn/ukgov-25k-spending/

Thinking about the General Problem: Diagram

tool requirements

Steps

  • Get a list of all the files (from DGU)
    • Register them on the CKAN package
  • Consolidate them on disk (cache locally)
  • For each file:
    • Set up a task object
    • Validate the file
    • Convert to OpenSpending format
    • Load
    • Upsert?
    • Invalidate relevant cache

Related

  • Provenance = FileId (Checksum/URL) + RowID [ + SheetID ]
  • Need a fingerprint of each spending entry (perhaps just the row itself?)
    • DeptID + TransactionID
    • DeptID + Expense Area + TransactionID
    • DeptID + Date + TransactionID
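One way the fingerprint and provenance ideas above could look in code, taking the DeptID + Date + TransactionID candidate as an example. This is only a sketch: the choice of SHA-1, the `|` separator, the normalization and both function names are assumptions, and which key is actually unique enough is still the open question.

```python
import hashlib


def entry_fingerprint(dept_id, date, transaction_id):
    """Hash the candidate key DeptID + Date + TransactionID into a stable ID."""
    # Normalize each part; "|" is assumed not to occur in the values themselves.
    parts = [str(part).strip().lower() for part in (dept_id, date, transaction_id)]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def provenance(file_checksum, row_id, sheet_id=None):
    """Provenance = FileId (checksum) + RowID [+ SheetID]."""
    record = {"file": file_checksum, "row": row_id}
    if sheet_id is not None:
        record["sheet"] = sheet_id
    return record


fp = entry_fingerprint("DH", "01/04/2011", "TX-42")
print(fp, provenance("abc123", 17))
```

Hashing the key rather than the whole row means a corrected amount in a re-published file still maps to the same entry, which is what an upsert would need.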

Steps (Generalized)

  • List
  • Retrieve
    • Report errors
  • Validate
    • Parse
    • Schema matching, headers
    • Check for null and badly typed values
    • Report errors
  • Transform
    • Provenance [* Decide on Fingerprint]
    • Apply schema
    • Report errors
  • Load (Sync)
    • Upsert, ensure matching??
    • Compute aggregates
    • Solr index
    • Report errors
  • Report issues to source (Government)
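The Validate step of the generalized pipeline could be sketched roughly as follows: parse the file, check the headers against a schema, flag null and badly typed values, and collect errors instead of failing on the first one. The header names are guessed from the entry schema, the DD/MM/YYYY date format is an assumption, and `validate` and `check_value` are hypothetical names, not functions from the ukgov-25k-spending repository.

```python
import csv
import io
from datetime import datetime

# Assumed column headers, derived from the entry schema in these notes.
EXPECTED_HEADERS = ["Department Family", "Entity", "Date", "Expense Area",
                    "Supplier", "Transaction Number", "Amount"]
DATE_FORMAT = "%d/%m/%Y"  # assumption about how dates are written


def check_value(header, value):
    """Return an error message for a bad value, or None if it looks fine."""
    if not value:
        return "empty value"
    try:
        if header == "Amount":
            float(value.replace(",", ""))
        elif header == "Date":
            datetime.strptime(value, DATE_FORMAT)
    except ValueError:
        return "badly typed value: %r" % value
    return None


def validate(text):
    """Parse CSV text; return (valid_rows, errors) without raising."""
    rows, errors = [], []
    reader = csv.DictReader(io.StringIO(text))
    missing = [h for h in EXPECTED_HEADERS if h not in (reader.fieldnames or [])]
    if missing:
        errors.append({"row": 0, "column": None, "fatal": True,
                       "message": "missing headers: " + ", ".join(missing)})
        return rows, errors
    for row_number, row in enumerate(reader, start=1):
        row_errors = []
        for header in EXPECTED_HEADERS:
            message = check_value(header, (row.get(header) or "").strip())
            if message:
                row_errors.append({"row": row_number, "column": header,
                                   "fatal": False, "message": message})
        if row_errors:
            errors.extend(row_errors)
        else:
            rows.append(row)
    return rows, errors
```

The error dictionaries deliberately resemble the error table in the process DB schema (affected row, affected column, message, fatal), so they could be written there and later reported back to the source.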

Process DB Schema

report {
  id,
  ckan_url,
  resource_id,
  resource,
  source_url,
  department_name,
  report_fingerprint,
  cache_file,
  file_checksum,
  ntries,
  }

error {
  id,
  report_id,
  report_fingerprint,
  entry_fingerprint,
  affected_row,
  affected_column,
  message,
  timestamp,
  fatal,
  }

entry {
  report_id,
  fingerprint,
  department,
  entity,
  date,
  expense_area,
  supplier,
  transaction_number,
  amount
  }
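For experimenting, the three tables above can be written down as an SQLite schema. The column names are copied verbatim from the notes (including `ntries`); the column types, the primary keys and the foreign key references are guesses, and `create_process_db` is a made-up helper.

```python
import sqlite3

# Column names come from the notes; all types and constraints are assumptions.
SCHEMA = """
CREATE TABLE report (
    id INTEGER PRIMARY KEY,
    ckan_url TEXT,
    resource_id TEXT,
    resource TEXT,
    source_url TEXT,
    department_name TEXT,
    report_fingerprint TEXT,
    cache_file TEXT,
    file_checksum TEXT,
    ntries INTEGER
);
CREATE TABLE error (
    id INTEGER PRIMARY KEY,
    report_id INTEGER REFERENCES report(id),
    report_fingerprint TEXT,
    entry_fingerprint TEXT,
    affected_row INTEGER,
    affected_column TEXT,
    message TEXT,
    timestamp TEXT,
    fatal INTEGER
);
CREATE TABLE entry (
    report_id INTEGER REFERENCES report(id),
    fingerprint TEXT,
    department TEXT,
    entity TEXT,
    date TEXT,
    expense_area TEXT,
    supplier TEXT,
    transaction_number TEXT,
    amount REAL
);
"""


def create_process_db(path=":memory:"):
    """Create the process database and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_process_db()
conn.execute("INSERT INTO report (source_url, department_name) VALUES (?, ?)",
             ("http://example.org/spend.csv", "Department of Health"))
print(conn.execute("SELECT id, department_name FROM report").fetchall())
```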

Other links