
DataHub – Small Pieces Loosely Joined

Rufus Pollock - June 22, 2012 in CKAN, DataHub

Open Data Workshop in Sofia, Bulgaria 4th June

Rufus Pollock - June 6, 2011 in CKAN, Events, Workshop

Notes from the workshop.

Initial Agenda

  1. Introductions
  2. Set agenda and outline for the day

People

  • Martin – software engineer. Interested in design and how government works.
  • Chrastian – Ontotext. Interested in open semantic data. http://www.ontotext.com/
  • Elena – from Sofia University (teaches Sociology). Teaches a course on content analysis. Excited that there is growing interest in public data. You can process a lot but need a purpose.
  • Martin – OpenStreetMap’er. How can we integrate with other data e.g. missing people
  • Ivo – works for Ontotext
  • Galia – just interested in open data
  • Bogdan – software author. Curiosity!
  • Peio – legal adviser by day, IT background. Curious.
  • Plamen – ex-software engineer. Aggregating data from the Bulgarian parliament.
  • Alex – interested in using new technologies, electronics and music!
  • Stoian Mishinev – IT Specialist
  • Yana Petrova – journalism student

Agenda

  1. Data (and problem) mapping
  2. Problems with getting data
  3. Tools for working with data and developing a community around it (using it)

Summary

Gov data mapping

  • Legislation
  • Finances
  • Civic info
  • Transport
  • Geodata
  • News / Gazettes

Government structure in Bulgaria

  • Central Gov – executive and parliament and courts
  • Regions (28)
  • Municipalities (cities are sometimes municipalities by themselves)
  • Districts (possibly)
  • (mayors in smallest villages)

Legal status for gov material (e.g. legislation): ЗАПСП (the Bulgarian Law on Copyright and Related Rights) http://lex.bg/bg/laws/ldoc/2133094401 Article 4, item 4: not subject to copyright

  • Not subject to copyright:

    1. normative and individual acts of state governing bodies, …
    2. news, facts, information and data.

Question: how far does this extend? Does it cover all government documents?

The Law

The law: in state gazette: mostly online (html and pdf)? http://dv.parliament.bg

Public procurement

4th tab link on: http://dv.parliament.bg/ (no direct link because no urls!)

Parliamentary

Committee debates: http://www.parliament.bg/bg/parliamentarycommittees/members/226/steno

Plenary sessions debates: http://www.parliament.bg/bg/plenaryst

Legal decisions

Local stuff

Finances

Have CKAN package: http://ckan.net/package/bg-budget

Transport

Civic Info (Health, Education etc)

Company Register

The company register was publicly available until 2011; at some point in 2011 it was closed, and access is now available only for a fee.

[ACTION: Peio - get old dump and analysis and add to relevant CKAN dataset]

Geodata and Cadastral

  • Land registry and mapping agency: http://www.cadastre.bg/
    • No data available as far as one can tell!
  • Postcodes: …

Problems getting data

  • Gov objections to giving out data (and what can you do about it).
  • Data format
  • Data persistence
  • Data quality

ACTION [Peio]: clarify scope of public domain provision for gov data (is this just legislation and gov documents or all gov data)

What do we do about PDF?

  • Ask – directly or via http://isitopendata.org/
  • Find a contact if you can
  • Find out what the worries are …
  • Transcribe
  • Find tools – http://getthedata.org/questions/339/excel-table-from-a-pdf

[ACTION: Rufus Pollock: ask Julian Todd to write up instructions on PDF parsing based on UNDemocracy experience]

Tools and Communities

Basic process:

  1. Extract
  2. Transform (clean and integrate)
  3. Load
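A minimal sketch of the three steps above, in Python. The column names, the sample CSV text and the in-memory list used as the load target are all invented for illustration; a real run would read a published file and load into a database or a CKAN dataset.

```python
import csv
import io

def extract(text):
    """Extract: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: strip stray whitespace and convert amounts to numbers."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "municipality": row["municipality"].strip(),
            # Remove thousands separators before converting to float.
            "amount": float(row["amount"].replace(",", "")),
        })
    return cleaned

def load(rows, store):
    """Load: append the cleaned rows to a target store (here, a plain list)."""
    store.extend(rows)
    return len(rows)

# Invented sample data, not taken from any real dataset.
RAW = 'municipality,amount\n Sofia ,"1,200.50"\nPlovdiv,300\n'

store = []
loaded = load(transform(extract(RAW)), store)
print(loaded, store)
```

Keeping each step a separate function means the messy part (usually the transform) can be rerun without fetching the source again.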

Tools:

Proprietary but free (in some form or other):

  • Google Docs and Google Fusion Tables
  • Google Refine
  • Tableau, Needlebase …

Ideas / Wanted

  • crowdsourcing the collection of all the Bulgarian legislative data
  • extract structured info from plenary and committee debates
  • list of municipalities
  • http://wiki.openspending.org/Countries – find volunteers to populate data for the Bulgarian budget
  • on-time stats for public transport
  • wifi locations
  • ‘Tell me about my area’ – on my phone (on Facebook even!)

Profiles: Friedrich Lindenberg

Lucy Chambers - May 20, 2011 in CKAN, OpenSpending, People, Profiles

Friedrich Lindenberg is a media scientist turned coder working on open government and transparency initiatives.

As a developer at the Open Knowledge Foundation, he is contributing to OpenSpending, an international effort to make financial data accessible. After presenting the German state budget on OffenerHaushalt.de in 2010, he is now working on technologies that allow the budgets and spending records of any state and region to be visualized and explored. In CKAN, a community-driven data catalogue project, Friedrich has helped to create data portals for a number of European administrations and is working on an effort to create a pan-European data catalogue at publicdata.eu. He is the author of Adhocracy, a collaborative drafting software used by the Internet commission of the German Bundestag and several political parties and organizations to enable citizens to contribute to policy documents and to vote on them.

Profiles: William Waites

Lucy Chambers - May 20, 2011 in CKAN, People, Profiles

William Waites is a network engineer and systems programmer with 15 years' experience in academia, industry and international development environments. He has contributed to many free software projects including, amongst others, 4store, rdflib, Asterisk and NetBSD. He is currently a visiting researcher at the School of Informatics, University of Edinburgh, as well as technical director of Okapi Consulting and a contributor to several projects of the Open Knowledge Foundation. He is a licensed amateur radio operator, violinist and shuttle-pipe player.

Thoughts on “Local” CKAN and the DataDeck

Rufus Pollock - May 20, 2011 in CKAN, Design

Some very rough thoughts on a Local CKAN and a DataDeck prompted by discussion with Pedro Markun at CONSEGI.

Why

People want to organize and work with data locally on their machine.

Wants:

  • List and view datasets they have (including all files contained in that dataset)
  • Work with the stuff directly (e.g. by putting in a directory)
  • Ability to pull stuff down easily from CKAN (dp pull)
  • Ability to sync semi-automatically with main CKAN (dp push)

Implementation

Options

  • datapkg based
    • rename datapkg to dp
    • By default the local index would be just a JSON file (index.json), with individual packages described in package.json
    • all the stuff talked about with Matthew
  • run ckan with sqlite locally (still have a lot of dependencies — can we strip these down)
  • use datadeck (i.e. pure js)
    • but how do we connect to local storage?

Suggested Approach

A combination of datapkg + datadeck (with a stripped down local python server) would be the KISS approach.

Aside on Debian Style Create Your Own Apt-Repository

The datapkg model is nice because it allows us to make it ultra-simple to publish your own mini-package repository. You just share a folder online with:

  • index.json at its base pointing to data package directories
  • each package directory contains a data package (i.e. package.json, manifest.json plus associated files)
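A rough sketch of how such a shared folder could be indexed, in Python. The post does not specify the contents of index.json, so the mapping used here (package name to relative directory) is an assumption, as is the helper name build_index; only the file names index.json and package.json come from the notes above.

```python
import json
import tempfile
from pathlib import Path

def build_index(repo_dir):
    """Write index.json listing every sub-directory that holds a package.json."""
    repo = Path(repo_dir)
    index = {}
    for pkg_file in sorted(repo.glob("*/package.json")):
        metadata = json.loads(pkg_file.read_text(encoding="utf-8"))
        # Assumed index layout: package name -> directory relative to the repo root.
        index[metadata["name"]] = pkg_file.parent.name + "/"
    (repo / "index.json").write_text(json.dumps(index, indent=2), encoding="utf-8")
    return index

# Demo in a throwaway directory containing a single package.
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "bg-budget"
    pkg_dir.mkdir()
    (pkg_dir / "package.json").write_text(
        json.dumps({"name": "bg-budget"}), encoding="utf-8"
    )
    index = build_index(tmp)
    written = json.loads((Path(tmp) / "index.json").read_text(encoding="utf-8"))

print(index)
```

Because the index is a static file, the folder can be served by any plain web server, which is what makes the apt-repository comparison attractive.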

Profiles: Adrià Mercader

Lucy Chambers - May 20, 2011 in CKAN, People, Profiles

Adrià Mercader is a software developer focused on the Web and Open technologies in general, and the geospatial field in particular. Before joining the CKAN team he built and managed several geo-related projects for different organizations, ranging from online map viewers to spatially enabled services and APIs. He is currently based in Newcastle upon Tyne, and more information about him can be found on his website.

Profiles: Seb Bacon

Lucy Chambers - May 20, 2011 in CKAN, People, Profiles

Seb Bacon is a software developer and business consultant. He does analysis, design and build for the charity sector. He works on various things in the team, including hacking on the CKAN software, project management, and troublemaking around processes and systems. He has also recently taken on responsibility for improving our documentation. He currently (May 2011) works 4 days a week for OKF.

Data Hubs, Data Management Systems and CKAN

Rufus Pollock - April 27, 2011 in CKAN, Design

Data Hub / Data Management System: Features

Data Hub / Data Management System: Architecture

Questions

  • Is a Data Hub different from a Data Management System?
    • ANS: probably. A hub is, well, more “hubby”, i.e. more social, more ‘there’s one of me’, more about sharing. A DMS could be a Data Hub if it had certain features, but it could also not be.

25k Spending: ETL Process

Friedrich Lindenberg - April 18, 2011 in CKAN, OpenSpending, Technical

Notes on last week's discussion re http://ckan.net/package/ukgov-25k-spending. Also produced (working) code:

https://bitbucket.org/okfn/ukgov-25k-spending/

Thinking about the General Problem: Diagram

tool requirements

Steps

  • Get a list of all the files (from DGU)
    • Register them on the CKAN package
  • Consolidate them on disk (cache locally)
  • For each file:
    • Set up a task object
    • Validate the file
    • Convert to OpenSpending format
    • Load
    • Upsert?
    • Invalidate relevant cache

Related

  • Provenance = FileId (Checksum/URL) + RowID [ + SheetID ]
  • Need a fingerprint of each spending entry (perhaps just the row itself?)
    • DeptID + TransactionID
    • DeptID + Expense Area + TransactionID
    • DeptID + Date + TransactionID
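One way the provenance and fingerprint ideas above could look in Python. This is a sketch under assumptions: SHA-1 as the checksum, the "|" separator, the normalisation (trim and lower-case) and the function names are all choices made here, and the fingerprint uses the DeptID + Date + TransactionID candidate purely as an example. The URL in the usage lines is a placeholder.

```python
import hashlib

def file_id(content, url):
    """FileId: checksum of the file contents combined with its source URL."""
    checksum = hashlib.sha1(content).hexdigest()
    return f"{checksum}@{url}"

def provenance(content, url, row_id, sheet_id=None):
    """Provenance = FileId + RowID, plus SheetID for multi-sheet files."""
    parts = [file_id(content, url), f"row={row_id}"]
    if sheet_id is not None:
        parts.append(f"sheet={sheet_id}")
    return "|".join(parts)

def fingerprint(dept_id, transaction_id, date=None):
    """Fingerprint a spending entry from its identifying fields."""
    key_fields = [dept_id, date or "", transaction_id]
    # Normalise so that cosmetic differences do not change the fingerprint.
    normalized = "|".join(str(field).strip().lower() for field in key_fields)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Placeholder values for illustration only.
prov = provenance(b"a,b\n1,2\n", "http://example.org/spend.csv", 7, sheet_id=0)
fp_a = fingerprint("DfT", "1001", "2011-01-31")
fp_b = fingerprint(" dft ", "1001 ", "2011-01-31")
fp_c = fingerprint("DfT", "1002", "2011-01-31")
print(prov)
print(fp_a == fp_b, fp_a == fp_c)
```

A stable fingerprint is what would let a reloaded file update existing entries instead of duplicating them.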

Steps (Generalized)

  • List
  • Retrieve
    • Report errors
  • Validate
    • Parse
    • Schema matching, headers
    • Check for null and badly typed values
    • Report errors
  • Transform
    • Provenance [* Decide on Fingerprint]
    • Apply schema
    • Report errors
  • Load (Sync)
    • Upsert, ensure matching??
    • Compute aggregates
    • Solr index
    • Report errors
  • Report issues to source (Government)
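The Validate step from the list above could be sketched as follows in Python. It covers parsing, a header check, and a check for null and badly typed values, collecting errors rather than stopping at the first one. The expected header list reuses field names from the entry table in these notes, but treating them as the file's columns is an assumption, and the sample rows are invented.

```python
import csv
import io

# Assumed column layout for an incoming file.
EXPECTED_HEADERS = ["department", "date", "supplier", "amount"]

def validate(text, errors):
    """Parse CSV text, check headers, and flag null or badly typed values."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADERS:
        errors.append({"row": None, "message": "unexpected headers", "fatal": True})
        return []
    valid = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        if not row["supplier"]:
            errors.append({"row": row_number, "message": "missing supplier", "fatal": False})
            continue
        try:
            row["amount"] = float(row["amount"])
        except (TypeError, ValueError):
            errors.append({"row": row_number, "message": "bad amount", "fatal": False})
            continue
        valid.append(row)
    return valid

# Invented sample: one good row, one with no supplier, one with a bad amount.
SAMPLE = (
    "department,date,supplier,amount\n"
    "DfT,2011-01-31,Acme Ltd,25000.00\n"
    "DfT,2011-02-01,,30000\n"
    "DfT,2011-02-02,Widgets plc,n/a\n"
)

errors = []
rows = validate(SAMPLE, errors)
print(len(rows), errors)
```

Recording the row number and a fatal flag with each error lines up with the error table in the process schema, and gives enough detail to report issues back to the source.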

Process DB Schema

report {
  id,
  ckan_url,
  resource_id,
  resource,
  source_url,
  department_name,
  report_fingerprint,
  cache_file,
  file_checksum,
  ntries,
  }

error {
  id,
  report_id,
  report_fingerprint,
  entry_fingerprint,
  affected_row,
  affected_column,
  message,
  timestamp,
  fatal,
  }

entry {
  report_id,
  fingerprint,
  department,
  entity,
  date,
  expense_area,
  supplier,
  transaction_number,
  amount
  }
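For concreteness, the schema above could be expressed as SQLite tables from Python. The notes give only field names, so every column type, the primary keys and the foreign-key references here are assumptions, as is the use of the entry fingerprint as the key for upserts. The inserted values are placeholders.

```python
import sqlite3

SCHEMA = """
CREATE TABLE report (
    id INTEGER PRIMARY KEY,
    ckan_url TEXT,
    resource_id TEXT,
    resource TEXT,
    source_url TEXT,
    department_name TEXT,
    report_fingerprint TEXT,
    cache_file TEXT,
    file_checksum TEXT,
    ntries INTEGER
);
CREATE TABLE error (
    id INTEGER PRIMARY KEY,
    report_id INTEGER REFERENCES report(id),
    report_fingerprint TEXT,
    entry_fingerprint TEXT,
    affected_row INTEGER,
    affected_column TEXT,
    message TEXT,
    timestamp TEXT,
    fatal INTEGER
);
CREATE TABLE entry (
    report_id INTEGER REFERENCES report(id),
    fingerprint TEXT PRIMARY KEY,  -- assumed key, so reloads replace rows
    department TEXT,
    entity TEXT,
    date TEXT,
    expense_area TEXT,
    supplier TEXT,
    transaction_number TEXT,
    amount REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Loading the same fingerprint twice replaces the row instead of duplicating it.
for amount in (25000.0, 26000.0):
    conn.execute(
        "INSERT OR REPLACE INTO entry (report_id, fingerprint, amount) VALUES (?, ?, ?)",
        (1, "placeholder-fingerprint", amount),
    )

tables = sorted(r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"))
entry_rows = conn.execute("SELECT fingerprint, amount FROM entry").fetchall()
print(tables, entry_rows)
```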

Other links