A group of Cologne-based libraries has taken a big step towards open data. In a concerted action they have released their catalogue data for reuse on the web. Project manager Adrian Pohl comments on the initiative and the role the Semantic Web will play for libraries in the future.
In March 2010, several Cologne-based libraries opened their catalogue data under a CC0 license, following Tim Berners-Lee’s call for “Raw Data Now!”. What was the motivation behind this step?
The hbz (“Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen”, English: “North Rhine-Westphalian Library Service Centre”) has come to the conclusion that libraries need to participate in the development of the Semantic Web, and opening our catalog data followed as a necessary first step. Our intention is to show with this first legal-political step how important the legal/licensing dimension is when you publish data on the web, be it Linked Data or not. So for us at the hbz, the Open Data initiative is primarily the first step towards eventually publishing Linked Open Data, just as Tim Berners-Lee has called for.
Other participants in the Cologne Open Data initiative, like the Cologne University and City Library, focus more on the direct advantages that releasing raw bibliographic data brings: with other libraries and consortia following this example, it will be easy to enrich existing catalogs or other bibliographic services with subject headings, classification numbers, tags etc. Also, published raw data gets integrated into other web services, like Wikipedia, which point back to libraries’ services. Indeed, Open Data is an end in itself which should be pursued by more organizations in the library world and beyond.
The provided data is currently available in a proprietary but open format. Can you give us a technical description of the published data? Do you have plans to provide more structured datasets in the future?
“Opaque but open” would be the better description of the underlying format, because it isn’t proprietary at all. Alongside the data from the hbz union catalog there is data stemming from libraries’ local databases (see http://opendata.ub.uni-koeln.de/ and http://opendata.zbsport.de/), so we are using different internal formats. Generally, all of them are based on the MAB format (an acronym for “Maschinelles Austauschformat für Bibliotheken”, i.e. “Automatic Interchange Format for Libraries”), which is used only in the German and Austrian library world for data interchange between libraries, similar to the better-known MARC format (Machine-Readable Cataloging) of the Library of Congress. MAB was developed in the 1970s for storing data on magnetic tape; the format documentation can be viewed on the German National Library’s web pages. As the format is nearly 40 years old, processing MAB data on modern computers is very cumbersome.

Therefore, the hbz provides an encapsulation method called the “generic format”, in which the historic data records of the library catalogs are unwrapped into a more common, user-friendly scheme: each record is placed into a UTF-8 encoded file containing all the MAB fields, each of them separated by line feeds, and a library’s whole record set forms a “tar” archive, which is then compressed to save space. Those archives can be unpacked with any standard unpacking tool, available on all common Windows/Linux/Unix platforms, or with a simple Perl helper script provided by the hbz. More tools and scripts, in other programming languages as well, are being prepared for publication.

The opaqueness and the age of the standards used in the library world (the MARC standard used worldwide doesn’t differ from MAB in these respects) make it necessary to change to a more open and widely adopted standard.
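As a sketch of how such a dump can be processed: the snippet below builds a tiny stand-in archive in memory in the layout described above (one UTF-8 text file per record, one MAB field per line) and then unpacks it with Python’s standard tarfile module. The file name and field contents are invented for illustration and do not reflect the hbz’s actual dumps.

```python
import io
import tarfile

# Hypothetical record in the "generic format": one UTF-8 text file per
# record, one MAB field per line (the field layout here is invented).
sample_record = "100 Mustermann, Max\n331 Ein Beispieltitel\n425 2009\n"

# Build a small compressed tar archive in memory to stand in for a
# downloaded dump file.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = sample_record.encode("utf-8")
    info = tarfile.TarInfo(name="record-0001.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Reading the dump: iterate over the archive members and split each
# record into its MAB fields at the line feeds.
records = {}
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    for member in tar.getmembers():
        text = tar.extractfile(member).read().decode("utf-8")
        records[member.name] = [line for line in text.splitlines() if line]

print(records["record-0001.txt"][0])  # → "100 Mustermann, Max"
```

Any language with tar and gzip support can do the same; the point of the generic format is exactly that no library-specific tooling is required.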
That’s where Linked Data comes into play, as it is based on the accepted and widespread standards HTTP and URIs. Constructing RDF out of the raw library catalog data is a very sophisticated design task. Our plan is to convert the existing data to RDF using vocabularies that let us lose as little information as possible, and to give access to the data by providing a SPARQL endpoint.
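To make the idea concrete, here is a deliberately minimal sketch of what such a conversion could emit for a single record, using only Dublin Core terms and plain string formatting. The record identifier, field values and the resource URI are invented; the authority URI follows the d-nb.info/gnd pattern of the German National Library’s Linked Data service, but the number is illustrative. A real conversion would use an RDF library and the richer vocabularies discussed below.

```python
# Hypothetical parsed record (values invented for illustration).
record = {
    "id": "HT012345678",         # made-up record identifier
    "title": "Ein Beispieltitel",
    "creator_gnd": "123456789",  # made-up authority-file number
    "issued": "2009",
}

# Emit Turtle by string formatting; the base URI is a placeholder.
TURTLE_TEMPLATE = """\
@prefix dc: <http://purl.org/dc/terms/> .

<http://example.org/resource/{id}>
    dc:title "{title}" ;
    dc:creator <http://d-nb.info/gnd/{creator_gnd}> ;
    dc:issued "{issued}" .
"""

turtle = TURTLE_TEMPLATE.format(**record)
print(turtle)
```

Note how the author is not a literal string but a URI pointing at an authority record, which is what turns a flat catalog entry into linkable data.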
Currently the data you provide is open but not yet linked. What are your plans when it comes to contributing to the Linked Data Cloud?
I have to go into greater detail to answer this question properly. Viewed simply, the data of library institutions can be divided into two broad types: authority data and bibliographic data. Authority data splits into data about persons, about corporate entities and about subject headings. In Germany, authority data is maintained centrally by the German National Library in cooperation with the six German library consortia. Bibliographic databases consist of records about books, or rather editions of books. Authority data and bibliographic data are already heavily linked: for instance, a bibliographic record contains the author’s or editor’s authority number, which links to the corresponding authority record.

The German National Library is also working on migrating library data, especially authority data, into the Semantic Web, and recently made its Linked Data prototype for authority data publicly available. We have already taken first steps to cooperate and coordinate our efforts: as they take care of authority data, we focus on bibliographic data.

At the moment we are exploring the technology and vocabularies for publishing bibliographic data as Linked Data. That’s a demanding task: besides the known vocabularies like Dublin Core or the Bibliographic Ontology (Bibo), which don’t fully map to the density and structure of the information in the catalogs, there have been several years’ work on the new comprehensive cataloging standard RDA (Resource Description and Access), for which an RDF representation has been developed. However, RDA in RDF needs considerable modification before it can be applied to our bibliographic data. We are currently working on a vocabulary for the union catalog’s data based on existing vocabularies like Bibo and RDA.
Of course, as soon as we have published bibliographic data as linked data, we will start linking to hubs in the Linked Data Cloud like DBpedia or GeoNames.
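Such links typically take the form of owl:sameAs (or looser predicates like rdfs:seeAlso) triples between a locally minted URI and the hub’s URI. A small sketch, where the local URI is invented and the particular authority-number/DBpedia pairing is only illustrative:

```python
# Pairs of (local URI, external hub URI) to be linked; both the local
# URI scheme and the concrete pairing are illustrative assumptions.
links = [
    ("http://example.org/authority/118540238",
     "http://dbpedia.org/resource/Johann_Wolfgang_von_Goethe"),
]

# Emit one owl:sameAs triple per pair, in Turtle syntax.
turtle_lines = ["@prefix owl: <http://www.w3.org/2002/07/owl#> ."]
for local_uri, external_uri in links:
    turtle_lines.append(f"<{local_uri}> owl:sameAs <{external_uri}> .")

print("\n".join(turtle_lines))
```

The hard part is not emitting the triples but deciding, record by record, that two URIs really denote the same entity; that matching step is where authority data earns its keep.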
Publishing data to the LOD Cloud is one thing; consuming data is another. Do you have plans to integrate data from the LOD Cloud into your systems? Do you have policies for quality assurance?
Of course, the possibility of easily incorporating data from other sources is one major reason for us to publish Linked Data, besides the goal of making libraries’ data an integral part of the web. Enriching our data with other data and providing new services through and with mashups would be a main reason to link to other data. We are, however, not working on such projects yet, because we first need to convert our legacy data to RDF.
What role will the Semantic Web play for libraries in the future?
We believe the Semantic Web plays an important role for the future of libraries. Discussions about “Next Generation Catalogs” have been a recurring theme in the library world since the 1990s. It is time to finally act and move our data, imprisoned in opaque formats, to a new level by improving its structure and underlying technology, and by migrating to formats that can easily be consumed by people outside the library world. Joining the Linked Open Data community seems to us the best way to go. Also, the production, publication and dissemination of academic literature are subject to ongoing and fundamental changes which have far-reaching implications for the work of academic libraries and their role in research and education. We believe that semantic markup and interlinking will play an important role in the development of knowledge production and thus will indirectly have a great impact on libraries. Clearly, the Semantic Web can’t be written out of the future of libraries.
Moreover, turning your question around, libraries could play an important role for the future of the Semantic Web. Libraries are trusted institutions and deeply grounded in our culture. As indicated above, libraries have produced linked data (again: lower case) since the time of card catalogs. We undoubtedly have some practice in producing and curating linked data, which should be worth a lot to the Semantic Web community. We therefore think libraries are predestined to help continuously bring order to the messy place the Semantic Web will always be, and to ensure its trustworthiness and stability.
About Adrian Pohl
Adrian Pohl works at the Cologne-based North Rhine-Westphalian Library Service Centre (hbz) on Open Data and Linked Data and their conceptual, theoretical and legal implications. He regularly writes at Übertext: Blog about the internet, libraries and metadata, Linked Open Data, communication, epistemology and the like. He studied communication science and philosophy in Aachen and is currently studying Library and Information Science at the Cologne University of Applied Sciences. You can follow him on Twitter: http://twitter.com/acka47.