This weekend I spent some time converting 381,518 bibliographic records from the National Library of New Zealand into 9,353,243 triples. Thanks are due to the library for providing their catalogue as open data. I’ve learned a lot through the process, and thought that I would share some of that knowledge.
The code for the conversion is available. The data are awaiting upload.
Note: This is not related to Open Knowledge Foundation’s Working Group on Open Bibliography, although I hope to integrate this work somehow.
I had a few goals:
- let the National Library and others know that they don’t need expensive tools to implement a conversion system
- create a complete MARC to RDF system; the tools that I found would often filter out lots of information from the original MARC records
- learn more about bibliographic data
- learn more about linked data
- create a visualisation of the types of knowledge that New Zealanders generate [currently in progress]
My system is, in principle, fairly simple: I open a record, create a node in the graph, map the MARC tags, fields and subfields to predicates, and use the field values as the objects. I left out creating separate nodes for other entities, such as authors, because I wanted to get something up and running quickly.
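In code, the loop looks something like the sketch below. This is a simplified illustration rather than my actual conversion script: the tag-to-predicate table is a two-entry stand-in, the dcterms choices are examples, and it assumes the flat code/value subfield lists that the pymarc of the time returned.

```python
from pymarc import MARCReader
from rdflib import BNode, Graph, Literal, Namespace

DCTERMS = Namespace('http://purl.org/dc/terms/')

# Stand-in for the full MARC tag/subfield to predicate mapping.
MARC_TO_PREDICATE = {
    ('245', 'a'): DCTERMS.title,
    ('260', 'b'): DCTERMS.publisher,
}

graph = Graph()
with open('nznb_22feb1011.mrc', 'rb') as fh:
    for record in MARCReader(fh):
        item = BNode()  # one node per catalogue record
        for field in record.fields:
            if field.is_control_field():
                continue  # control fields have no subfields
            codes, values = field.subfields[::2], field.subfields[1::2]
            for code, value in zip(codes, values):
                predicate = MARC_TO_PREDICATE.get((field.tag, code))
                if predicate is not None:
                    graph.add((item, predicate, Literal(value)))
```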
Here are some of the challenges I faced and what I did to overcome them.
Creating a namespace for non-linked records
One of the things I discovered is that the National Library doesn’t seem to provide each of the items in its catalogue with a URL1. This makes life a bit of a pain when you want to use a specific URL for each resource. As a workaround, I decided to create identifiers as follows (a code sketch follows the list):
- where there is a National Library catalogue number, use
- where a WorldCat or RLIN number is known use
- otherwise create a blank node
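A sketch of that fallback logic is below. The URL templates are deliberately example.org placeholders, not the library’s or WorldCat’s real patterns.

```python
from rdflib import BNode, URIRef

# Placeholder templates only; not the real catalogue or WorldCat URL patterns.
EXAMPLE_NLNZ_TEMPLATE = 'http://example.org/nlnzcat/%s'
EXAMPLE_WORLDCAT_TEMPLATE = 'http://example.org/worldcat/%s'

def node_for_record(nlnz_number=None, worldcat_number=None):
    if nlnz_number:
        return URIRef(EXAMPLE_NLNZ_TEMPLATE % nlnz_number)
    if worldcat_number:
        return URIRef(EXAMPLE_WORLDCAT_TEMPLATE % worldcat_number)
    return BNode()  # no usable identifier: fall back to a blank node
```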
There are problems with this approach though.
- They’re brittle. The library hasn’t given any guarantee that it will keep the nlnzcat.natlib.govt.nz hostname, or that its vprimo application will stay where it is.
- They’re wrong. Using WorldCat’s search is ambiguous. Multiple results can come back when a specific number is given.
- They’re incomplete. A URL for each of the records probably exists somewhere. I just couldn’t find it while looking through the National Library’s site.
Matching MARC21 tags to RDF predicates
One of the more troublesome things was working out which RDF predicates the MARC tags should map to. I spent about two hours looking into different vocabularies and reading research published by other projects. Apart from physical descriptions, where I used Good Relations, I went with the recommendations of these three sources, which discuss Dublin Core and OWL:
In subsequent iterations, I think I will use a greater variety of more specific vocabularies. For example, using skos:Subject will make linkages to dbpedia slightly easier. I would also like to use a vocabulary that is specific to bibliographic data, but I didn’t come to the project with domain-specific knowledge, so I went with generics.
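To give a flavour of what the mapping looks like in code, here is an illustrative slice only; the tag/subfield pairs and predicate choices below are examples rather than my full table.

```python
from rdflib import Namespace

DCTERMS = Namespace('http://purl.org/dc/terms/')

# (MARC tag, subfield code) -> RDF predicate
MAPPING = {
    ('245', 'a'): DCTERMS.title,      # title statement
    ('260', 'b'): DCTERMS.publisher,  # name of publisher
    ('650', 'a'): DCTERMS.subject,    # topical subject heading
}
```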
Dublin Core/RDF questions
A question that popped into my mind several times was, “Is the dc: namespace completely deprecated?” This was often followed by, “But other people are referencing dc:; will I break something if I suddenly start talking about dcterms: instead?” Lastly, I didn’t know whether dc:creator and dc:Creator were equivalent. I assumed that they are not. Most people seem to use lower case, but the documentation has everything in TitleCase, which left me wondering, “Which should I use?”
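For anyone puzzling over the same thing, the two namespaces in question are distinct URIs, which is easy to see if you spell them out in rdflib:

```python
from rdflib import Namespace

DC = Namespace('http://purl.org/dc/elements/1.1/')  # the original DC elements
DCTERMS = Namespace('http://purl.org/dc/terms/')    # the newer DC terms

print(DC.creator)       # http://purl.org/dc/elements/1.1/creator
print(DCTERMS.creator)  # http://purl.org/dc/terms/creator
```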
Running out of memory
Once I had spent several hours on research and had developed a fairly complete mapping between MARC21 and RDF, I decided to fire it up. The graph quickly blew through all 4GB of RAM on my laptop. Oops. By that stage, it was 2am on Sunday morning.
The next morning, I switched to BerkeleyDB to store the graph on disk. This solved my memory problem, but soon introduced a new one.
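For reference, the on-disk setup is only a couple of lines. A sketch, with the caveat that the persistent store was registered under the name “Sleepycat” in the rdflib releases of that era (newer releases call it “BerkeleyDB”), and the path here is just an example:

```python
from rdflib import Graph

graph = Graph(store='Sleepycat')
graph.open('/tmp/nlnz-store', create=True)  # directory for the BerkeleyDB files
# ... add triples as usual ...
graph.close()
```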
I underestimated how long processing would take. Maybe I have been fooled into thinking that things should only take a few seconds to run, maybe a few minutes if something is inefficient. This is the first time I’ve had to deal with processing that takes hours. With rdflib’s BerkeleyDB RDF store implementation, I achieved a data import rate of about 1,000 triples per second. As a comparison, 4store is about 700 times faster.
I wrote my code in a fairly synchronous manner, which means lots of disc I/O blocking while I wait for the next section of the MARC file to be read and then the triples to be written. I expect that implementing things in Twisted would lead to a speed increase of several orders of magnitude.
Where to from here?
I want to extract more triples. I only have material that is directly related to the items that are held. The records hold much more. There is lots of scope for extracting information relating to authors, publishers, dates and locations.
Linking. I want to link this data with dbpedia, Freebase and the Open Library. Topics and subjects can be linked with dbpedia. Authors and publishers could link to Freebase. Items themselves would suit the Open Library. Much of the data suits all three.
The code is not yet open source. I have some tidying up to do before then. I would like to refactor things. I want to be able to do things in an async manner, probably using Twisted. I expect that the performance improvements will be extremely large, as my code is bound by I/O to a very large extent.
Some issues that are of less interest to the more general reader:
pymarc is a very good library. It works on chunks of data at a time, minimising the data that is sitting in RAM before processing begins.
Why are subfields in a list?
The thing I disliked is its treatment of MARC subfields. Rather than using a dict to map subfield codes to values, record.field.subfields comes back as a flat list in which codes and values alternate:

```python
['a', 'Scale 1:50,000 ;',
 'c', '(E 169\xb012\'17"-E 169\xb043\'31"/S 44\xb034\'04"-S 44\xb051\'10").',
 'b', 'New Zealand map grid proj.']
```
I used some code to express this more naturally as a mapping between keys and values:

```python
{'a': 'Scale 1:50,000 ;',
 'c': '(E 169\xb012\'17"-E 169\xb043\'31"/S 44\xb034\'04"-S 44\xb051\'10").',
 'b': 'New Zealand map grid proj.'}
```
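The conversion itself is small. A minimal sketch of the idea, assuming the alternating code/value list above (the helper name is mine, and note that repeated subfield codes would be collapsed into a single entry, which was acceptable for my purposes):

```python
def subfields_to_dict(flat_subfields):
    """Pair up the alternating codes and values into a dict."""
    return dict(zip(flat_subfields[::2], flat_subfields[1::2]))

subfields_to_dict(['a', 'Scale 1:50,000 ;', 'b', 'New Zealand map grid proj.'])
# {'a': 'Scale 1:50,000 ;', 'b': 'New Zealand map grid proj.'}
```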
MARCReader should support ordering and counting
One design decision that I dislike is that the MARCReader class shares the behaviour of Python’s set type. A set is an unordered collection, meaning you can’t ask for the twenty-second item. I would have liked to be able to do this so that I could pick out a single record while learning the API. Instead, I used this code as a workaround:
```python
from pymarc import MARCReader

records = MARCReader(open('nznb_22feb1011.mrc', 'rb'))
for record in records:
    break
```

Even though there is an immediate break statement, Python’s interpreter will assign record to the first record in records.
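What I actually wanted can be done with the standard library rather than anything pymarc provides; a sketch using itertools.islice to pull out, say, the twenty-second record:

```python
from itertools import islice
from pymarc import MARCReader

with open('nznb_22feb1011.mrc', 'rb') as fh:
    # islice skips the first 21 records and yields the 22nd, if it exists.
    twenty_second = next(islice(MARCReader(fh), 21, 22), None)
```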
Loading from compressed files
One final thing that might be handy would be a flag on the MARCReader class that says whether the file is compressed. This is only a minor consideration and probably wouldn’t have helped me: I was working with a .gzip file living in a .zip file.
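For a plain gzipped file the standard library already covers this, since gzip.open returns a file-like object that MARCReader can read from. A small sketch (the filename is illustrative):

```python
import gzip
from pymarc import MARCReader

with gzip.open('nznb.mrc.gz', 'rb') as fh:
    for record in MARCReader(fh):
        pass  # process each record as usual
```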
A very small fraction (perhaps ten dozen) of the records failed because of encoding issues. Here is an example:
(E 169\xb011\'14"-E 169\xb042\'37"/S 44\xb050\'15"-S 45\xb07\'22").
Notably, every one of them was a map. I will eventually get around to manually working out how to fix this. [In retrospect, I probably should have used pymarc’s ability to coerce data to UTF-8.]
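That coercion is just a couple of keyword arguments on the reader; if I remember the pymarc options correctly, something like:

```python
from pymarc import MARCReader

with open('nznb_22feb1011.mrc', 'rb') as fh:
    # Ask pymarc to convert each record to Unicode (UTF-8) as it is read,
    # instead of fixing encodings by hand afterwards.
    reader = MARCReader(fh, to_unicode=True, force_utf8=True)
```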
Fewer full stops
Full stops are important for sentences, but they are just added noise when appended to data. For example, a record might have a list of subjects that looks similar to this:
The upshot of this is that it makes the data harder to link to other sources. For example, dbpedia encodes these two concepts as
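Before trying to match such headings against dbpedia labels, the simplest fix is to strip the trailing full stop. A tiny normalisation sketch of my own (not from any library):

```python
def strip_trailing_full_stop(heading):
    # e.g. 'Something.' -> 'Something'
    return heading.rstrip('.').strip()
```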
The MARC21 format is an extremely compact way of expressing a huge volume of information. I can see why Linked Data antagonists are skeptical. However, I haven’t begun to actually use the data for anything yet. Its real power is the ability to link multiple datasets together. Hopefully I can add more soon.
Like many software projects, it took a lot longer than I expected. There were many more learning curves than I had anticipated. Even once I had an understanding of how MARC21 worked, I then needed to learn pymarc. Once that was done, I needed to get to grips with the terribly documented rdflib. RDFlib’s documentation should be excellent; it’s a library that is over a decade old.
I’m unhappy with some of the ugliness in the code. I would have preferred to use more efficient or elegant ways of expressing my ideas.
However, because I wanted to create a utility that could be shared, I wanted to minimise dependencies. This meant that I went for a synchronous programming style. This was a mistake. I should have used Twisted from the beginning.
Finally, I’m really happy with the final result. I have already thought of a few applications I would like to implement that make use of this data. Most importantly though, the learning experience has been very valuable. It’s been vindication that Linked Data doesn’t need to be expensive for public agencies to produce.
1 Yes, URI or IRI are more precise terms. However, some of the people reading may be confused by the (to them) new terminology.