The most up-to-date version of the documentation can be found under LOD Mapping 201107.
Vocabularies
We have decided to use the Bibliographic Ontology for our first attempt to convert our catalog data to Linked Data. The main motivation was to create comprehensible data that lines up with existing Linked Bibliographic Data such as that published by LIBRIS, the OpenLibrary and Mannheim University Library. We are also planning to release the same data using the RDA vocabularies.
Namespaces used
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix geonames: <http://www.geonames.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
Note: There are several alternative FRBR vocabularies available. We are using the version by Ian Davis et al. because of a naming problem in the current IFLA version. Predicate names in that version have numbers as local name parts, which makes it impossible to serialize the data as RDF/XML.
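The restriction comes from XML itself: RDF/XML writes a predicate's local name as an element name, and XML element names must not begin with a digit. The following is only a minimal sketch of that constraint using Python's standard library parser. The namespace http://example.org/frbr/ and the local name 1001 are made up for the illustration and are not taken from the IFLA vocabulary.

```python
import xml.etree.ElementTree as ET


def is_serializable_as_element(local_name: str) -> bool:
    """Return True if an XML parser accepts local_name as an element name."""
    # Hypothetical namespace, used only to build a tiny test document.
    doc = f'<{local_name} xmlns="http://example.org/frbr/"/>'
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


print(is_serializable_as_element("exemplar"))  # a name that starts with a letter
print(is_serializable_as_element("1001"))      # a name that starts with a digit
```

Turtle and N-Triples write predicates as full or prefixed IRIs rather than element names, so they are not affected by this particular XML rule.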
Mapping of fields
We have mapped the fields from the record-centric RDF/ISO2709 format to a resource-centric BIBO description as follows. Note that the original field names used below may contain wildcards for single characters (a "." as used in regular expressions).
Resource-URI |
The URI of the resource that is to be described is derived from identifier of the record, to be found in |
dc:title |
The title of the resource, found in |
dc:language |
The language of the resource, found in |
dc:subject |
Subject-Links. These are derived from several fields:
|
bibo:isbn |
The ISBN of the resource, found in |
bibo:issn |
The ISSN of the resource, found in |
dc:extent |
The extent of the resource, usually the number of pages, as found in |
dcterms:issued |
The year the resource was issued, as found in |
rdf:type |
The type of a resource is derived from several fields, so the same resource may end up with multiple types. The current mapping is most likely over-simplified and will be analysed further for future releases:
|
bibo:volume |
The volume number of the resource, found in |
dc:isPartOf |
Fortunately, the original data already includes many links from subordinate to superordinate records which can be used to link the corresponding resources:
|
bibo:authorlist |
The |
dc:publisher |
The fields |
frbr:exemplar |
In the current state of the raw data, holding information is only implicitly available. Since the records are segmented into packages by institution, we know that an institution is the |
Complete documentation of the fields found in the RDF/ISO2709 version of the data is available. Unfortunately, the RDF/ISO2709 fields are not completely in line with this official documentation, because our data passes through a MARC21-based interface before it is published and some fields are renamed in the process. We are working on either documenting the differences or using the proper fields.
The resulting model
Infrastructure
The conversion results in 82,471,813 triples, which we have loaded into a 4store instance that provides a SPARQL endpoint. To serve the data as Linked Data, a Pubby Linked Data frontend is tied to that endpoint. You can also download the entire dump.
Conversion process
Although we have released the raw data in N-Triples format, native RDF tools such as rdflib for Python have proved far too slow to handle this amount of data. Regular expressions in Perl are much faster and are therefore used here. Because of the blank nodes explained above, the script outputs RDF in Turtle notation so that blank node identifiers do not have to be generated.
Simple Perl regex-based conversion script
#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl rdfmab2bibo.pl INDEXFILE
# INDEXFILE lists one record file per line; the Turtle output goes to INDEXFILE.ttl.
open my $index, '<', $ARGV[0] or die "Cannot read $ARGV[0]: $!";
open my $out, '>', "$ARGV[0].ttl" or die "Cannot write $ARGV[0].ttl: $!";
print $out
"\@prefix bibo: <http://purl.org/ontology/bibo/> .
\@prefix dcterms: <http://purl.org/dc/terms/> .
\@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
\@prefix foaf: <http://xmlns.com/foaf/0.1/> .
\@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
\@prefix geonames: <http://www.geonames.org/ontology#> .
\@prefix owl: <http://www.w3.org/2002/07/owl#> .
\@prefix frbr: <http://purl.org/vocab/frbr/core#> .
\n";
while (my $record = <$index>) {
    chomp $record;
    my $file;
    if (!open($file, '<', $record)) {
        warn "Skipping $record: $!\n";
        next;
    }
    # Slurp the whole record into $_ so the substitutions below apply to it.
    $_ = do { local $/; <$file> };
    close $file;
    # Drop the subject of every triple and end each statement with ";" instead of ".".
    s/<tag:hbz.metadata.mab.rdfmab#[^>]*> (.*) \./$1 ;/g;
    # The record identifier becomes the subject URI of the whole description.
    s/<rdfmab:field\/001__a> "(.*)" ;/<http:\/\/lobid.org\/resource\/$1>/g;
    s/<rdfmab:field\/010__a> "(.*)"/dcterms:isPartOf <http:\/\/lobid.org\/resource\/$1>/g;
    # Identifiers in 026__a that do not start with "HBZ" become owl:sameAs links.
    s/<rdfmab:field\/026__a> "((?!HBZ).*)"/owl:sameAs <urn:nbn:de:eki\/$1>/g;
    s/<rdfmab:field\/037b_a> /dcterms:language /g;
    s/<rdfmab:field\/050> "a.*"/rdf:type dcterms:BibliographicResource/g;
    s/<rdfmab:field\/051> "m.*"/rdf:type bibo:Book/g;
    # Prefer 090__a for the volume number and fall back to 089__a.
    if (m/<rdfmab:field\/090__a>/) {
        s/<rdfmab:field\/090__a>/bibo:volume/g;
    } else {
        s/<rdfmab:field\/089__a>/bibo:volume/g;
    }
    s/<rdfmab:field\/331._a>/dcterms:title/g;
    s/<rdfmab:field\/425a_a>/dcterms:issued/g;
    s/<rdfmab:field\/9..__9> "(.*)"/dcterms:subject <http:\/\/d-nb.info\/gnd\/$1>/g;
    s/<rdfmab:field\/453__a> "(.*)"/dcterms:isPartOf <http:\/\/lobid.org\/resource\/$1>/g;
    s/<rdfmab:field\/540._a>/bibo:isbn/g;
    s/<rdfmab:field\/542._a>/bibo:issn/g;
    s/<rdfmab:field\/433__a>/dcterms:extent/g;
    s/<rdfmab:field\/599__a> "(.*)"/dcterms:isPartOf <http:\/\/lobid.org\/resource\/$1>/g;
    # Only the first three characters of 700b_a are used for the Dewey class link.
    s/<rdfmab:field\/700b_a> "(...).*"/dcterms:subject <http:\/\/dewey.info\/class\/$1\/>/g;
    my $output = $_;
    # Collect all author identifiers into one RDF list.
    if (my @authors = (m/<rdfmab:field\/1..._9> "(.*)"/g)) {
        $output .= "bibo:authorlist (\n";
        for my $author (@authors) {
            # String comparison needs "eq"; "==" would compare numerically.
            if (substr($author, 0, 2) eq 'HP') {
                $output .= "<http://lobid.org/person/$author>\n";
            } else {
                $output .= "<http://d-nb.info/gnd/$author>\n";
            }
        }
        $output .= ");\n";
    }
    # Publisher and place of publication are written as nested blank nodes.
    if (my @publisher = (m/<rdfmab:field\/412__a> (.*) ;/g)) {
        $output .= "dcterms:publisher [\n";
        $output .= "rdf:type foaf:Organisation ;\n";
        $output .= "foaf:name $publisher[0] ;\n";
        if (my @place = (m/<rdfmab:field\/410__a> (.*) ;/g)) {
            $output .= "geo:location [\n";
            $output .= "rdf:type geo:SpatialThing ;\n";
            $output .= "geonames:name $place[0] ;\n";
            $output .= "]\n";
        }
        $output .= "];\n";
    }
    # Remove every field that was not mapped, then close the description.
    $output =~ s/<rdfmab:field.*\n//g;
    $output .= "rdf:type frbr:Manifestation .\n";
    print $out $output;
}
close $out;
close $index;
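To make the effect of the substitutions easier to follow, here is a small Python sketch that applies a handful of the same patterns to one record. It is not part of the actual conversion. The record content is invented, and the shape of the raw lines (subject, field predicate, literal, full stop) is an assumption inferred from the patterns in the script.

```python
import re

# Hypothetical raw record; the line layout is assumed from the Perl patterns.
RAW = (
    '<tag:hbz.metadata.mab.rdfmab#HT000000001> <rdfmab:field/001__a> "HT000000001" .\n'
    '<tag:hbz.metadata.mab.rdfmab#HT000000001> <rdfmab:field/331__a> "Ein Beispieltitel" .\n'
    '<tag:hbz.metadata.mab.rdfmab#HT000000001> <rdfmab:field/540a_a> "3-8164-0500-2" .\n'
    '<tag:hbz.metadata.mab.rdfmab#HT000000001> <rdfmab:field/999__x> "unmapped" .\n'
)

# Ordered (pattern, replacement) pairs mirroring a few of the Perl substitutions.
RULES = [
    # Drop the subject and end each statement with ";" instead of ".".
    (r'<tag:hbz\.metadata\.mab\.rdfmab#[^>]*> (.*) \.', r'\1 ;'),
    # The record identifier becomes the subject URI.
    (r'<rdfmab:field/001__a> "(.*)" ;', r'<http://lobid.org/resource/\1>'),
    # "." is the single-character wildcard: 331._a matches 331__a, 540._a matches 540a_a.
    (r'<rdfmab:field/331._a>', 'dcterms:title'),
    (r'<rdfmab:field/540._a>', 'bibo:isbn'),
    # Any field that is still unmapped is removed together with its line.
    (r'<rdfmab:field.*\n', ''),
]


def convert(raw: str) -> str:
    """Apply the rules in order and close the description with the final type."""
    text = raw
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text + "rdf:type frbr:Manifestation .\n"


print(convert(RAW))
```

The order of the rules matters: the subject URI rule relies on the trailing ";" written by the first rule, and the clean-up rule has to run last so that it only removes fields nothing else has claimed.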
Preliminary steps
We have released those parts of the union catalog in which participating institutions have holdings. For each institution, the corresponding subset of records was extracted from the union catalog and packaged independently of any other records. As a result, resources held by several institutions appear as duplicate records. The first step was therefore to generate a list of unique files. This list is then split up so that the data can be processed in parallel.
Preparation: Create list of unique files
$ find ./data/ -type f -printf %h -printf / -printf %f -printf \\t -printf %f\\n > index.txt
$ wc -l index.txt
7539743 index.txt
$ head index.txt
./data/20100728-DE-38M/0000/0047/HT007583895 HT007583895
./data/20100728-DE-38M/0000/0047/HT001644273 HT001644273
./data/20100728-DE-38M/0000/0047/HT004940272 HT004940272
./data/20100728-DE-38M/0000/0047/HT002031904 HT002031904
./data/20100728-DE-38M/0000/0047/HT003301706 HT003301706
./data/20100728-DE-38M/0000/0047/HT003003970 HT003003970
./data/20100728-DE-38M/0000/0047/HT008492141 HT008492141
./data/20100728-DE-38M/0000/0047/HT003747423 HT003747423
./data/20100728-DE-38M/0000/0047/HT003282525 HT003282525
./data/20100728-DE-38M/0000/0047/HT005010942 HT005010942
$ sort -u -k2 index.txt | cut -f1 > index_uniq.txt
$ wc -l index_uniq.txt
5542687 index_uniq.txt
$ head index_uniq.txt
./data/20100728-DE-929/0000/0060/BT000000626
./data/20100728-DE-929/0000/0019/BT000000628
./data/20100728-DE-38/0000/0019/BT000000887
./data/20100728-DE-107/0000/0003/BT000001724
./data/20100728-DE-38M/0000/0034/BT000003114
./data/20100728-DE-38/0000/0070/BT000003669
./data/20100728-DE-38/0000/0172/BT000003778
./data/20100728-DE-107/0000/0046/BT000004683
./data/20100728-DE-Zw1/0000/0001/BT000005415
./data/20100728-DE-Zw1/0000/0002/BT000006239
$ split -l 692836 index_uniq.txt part
$ wc -l parta*
692836 partaa
692836 partab
692836 partac
692836 partad
692836 partae
692836 partaf
692836 partag
692835 partah
5542687 total
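The logic of the shell session above can be sketched in Python as follows: keep one path per record ID, then cut the list into equally sized parts. This is only an illustration and not what was actually run. The second path in the example is a made-up duplicate, and which of several duplicate paths survives is an arbitrary choice in this sketch. The chunk size 692836 presumably comes from dividing the 5542687 unique files into eight parts.

```python
import math
import os


def unique_by_record_id(paths):
    """Keep one path per record ID (the file name), ordered by record ID."""
    chosen = {}
    for path in paths:
        record_id = os.path.basename(path)
        chosen.setdefault(record_id, path)  # the first path seen for an ID wins
    return [chosen[record_id] for record_id in sorted(chosen)]


def chunk(items, parts):
    """Split items into at most `parts` lists of equal size (the last may be shorter)."""
    size = math.ceil(len(items) / parts)
    return [items[i:i + size] for i in range(0, len(items), size)]


paths = [
    "./data/20100728-DE-38M/0000/0034/BT000003114",
    "./data/20100728-DE-38/0000/0001/BT000003114",  # hypothetical duplicate holding
    "./data/20100728-DE-929/0000/0060/BT000000626",
]
print(unique_by_record_id(paths))
print(math.ceil(5542687 / 8))  # lines per part for eight parallel jobs
```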
Invoking the script
$ for file in part*; do perl rdfmab2bibo.pl $file & done
$ head partaa.ttl
<http://lobid.org/resource/BT000000626>
dc:language"ger" ;
rdf:type dc:BibliographicResource ;
rdf:type bibo:Book ;
dc:title "Freizeitkarte, leisure map, carte loisirs Ennepe-Ruhr-Kreis" ;
dc:created "1993" ;
dc:extent "1 Kt. : mehrfarb. ; 47 x 55 cm, gefaltet" ;
bibo:isbn "3-8164-0500-2" ;
dc:subject <http://d-nb.info/gnd/4014819-1> ;
dc:subject <http://d-nb.info/gnd/4155353-6> ;
Generating holdings information
To produce the holdings information, we simply generate the corresponding triples from the file names in each institution's data package:
$ find data/20100728-DE-Zw1 -type f -printf "<http://lobid.org/resource/%f> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].\n" > holdings_DE-Zw1.ttl
$ head holdings_DE-Zw1.ttl
<http://lobid.org/resource/HT000543651> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/TT002230534> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/TT001091846> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/HT001266762> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/HT004029684> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/HT003848295> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/HT014029353> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
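The find command above hard-codes the institution ID for one package. As a rough sketch of how the same triple could be derived from a record path alone, the following Python function reads the institution from the package directory name. It assumes that package directories are always named with an eight-digit date, a hyphen and the institution ID, as in the listings above. The intermediate directories in the example path are made up.

```python
def holding_triple(record_path: str) -> str:
    """Build the frbr:exemplar statement for one record file in an institution package."""
    parts = record_path.split("/")
    record_id = parts[-1]
    # Assumed package naming: eight-digit date, hyphen, institution ID (e.g. 20100728-DE-Zw1).
    package = next(p for p in parts if p[:8].isdigit() and p[8:9] == "-")
    institution = package.split("-", 1)[1]
    return (
        f"<http://lobid.org/resource/{record_id}> frbr:exemplar "
        f"[rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/{institution}>]."
    )


# The sub-directories 0000/0001 are hypothetical.
print(holding_triple("data/20100728-DE-Zw1/0000/0001/HT000543651"))
```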
Importing into 4store
$ 4s-import -f turtle -v hbzlod -M http://lobid.org/resource/ hbzlod/hbzlod*.ttl
$ 4s-import -f turtle -v hbzlod -M http://lobid.org/resource/holdings/ hbzlod/holdings_DE-*

2 Comments
Owen Stephens says:
22.09.2010
I note that you've hung the place of publication off the publisher, and wonder if this is an issue. Clearly the 'place of publication' is linked to the publisher in some way - they'll have to have some kind of operating address I guess - but it also feels like this is a direct property of the published item as well, and having a direct link may well be beneficial. In the latest modelling from the British Library, they use the proposed isbd:hasPlaceOfPublicationProductionDistribution property. I'm not particularly keen on this - partly because it doesn't exist yet, but mainly because I'm not sure it is sensible to limit this to a 'bibliographic' type property (many things can have a place of production). Any thoughts on this?
Pascal Christoph says:
28.07.2011
Hi Owen, sorry for the "late" answer... we think it makes sense to make this a direct property of the manifestation itself, mainly because the place of the publisher may change but the place of publication cannot change. Now, imagine we had URIs for publishers (which would be very nice indeed!): the place of publication could then not be derived from the place of the publisher.
Do you know a better property for this? http://iflastandards.info/ns/isbd/elements does not work. Also, do we really need an rdfs:label and thus a bnode?