Converting the Open Data from the hbz to BIBO

The most up-to-date version of the documentation can be found under LOD Mapping 201107.

Vocabularies

We have decided to use the Bibliographic Ontology (BIBO) for our first attempt to convert our catalog data to Linked Data. The main motivation was to create comprehensible data that lines up with existing Linked Bibliographic Data, such as that published by LIBRIS, the OpenLibrary and Mannheim University Library. We also plan to release the same data using the RDA vocabularies.

Namespaces used

@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dc:   <http://purl.org/dc/terms/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix geonames: <http://www.geonames.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix frbr: <http://purl.org/vocab/frbr/core#> .

Note: There are several alternative FRBR vocabularies. We are using the version by Ian Davis et al. because of a naming problem in the current IFLA version: predicate names in that version have numbers as local name parts, which makes it impossible to serialize the data as RDF/XML.
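
To illustrate the serialization problem, here is a minimal Python sketch (not part of our conversion). RDF/XML writes each predicate as an XML element name, so the predicate URI must split into a namespace and a local part that is a valid XML name. The name check below is a simplified, ASCII-only approximation, and the second URI in the usage note is a made-up example, not an actual IFLA predicate.

```python
import re

# Simplified, ASCII-only approximation of an XML name without colon:
# it must start with a letter or underscore; digits may only follow.
LOCAL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*\Z")


def split_for_rdfxml(predicate_uri):
    """Return (namespace, local_name), or None if no valid split exists."""
    # Try the longest suffix first and shorten it step by step.
    for start in range(1, len(predicate_uri)):
        local_name = predicate_uri[start:]
        if LOCAL_NAME.match(local_name):
            return predicate_uri[:start], local_name
    return None
```

With this sketch, `split_for_rdfxml("http://purl.org/vocab/frbr/core#exemplar")` yields the namespace and the local name `exemplar`, while a hypothetical predicate such as `http://example.org/frbr/2001` has no usable split because every candidate local name starts with a digit.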

Mapping of fields

We have mapped the fields from the record-centric RDF/ISO2709 format to a resource-centric BIBO description as follows. Note that the original field names used below may contain wildcards for single characters (`.`, as used in regular expressions).

Resource-URI	The URI of the resource to be described is derived from the identifier of the record, found in `<rdfmab:field/001__a>`.
dc:title	The title of the resource, found in `<rdfmab:field/331._a>`.
dc:language	The language of the resource, found in `<rdfmab:field/037b_a>`.
dc:subject	Subject links. These are derived from several fields. The `<rdfmab:field/9..__9>` fields contain identifiers from the subject authority file of the German National Library (DNB), which has been available as Linked Data since April 2010. The `<rdfmab:field/700b_a>` fields contain DDC notations. In order to link to the Linked Data version of the classification, these numbers are truncated to the first three levels (the first three digits of the notation). If the full classification were available, we would be very happy to link to deeper levels.
bibo:isbn	The ISBN of the resource, found in `<rdfmab:field/540._a>`. The ISBN is deliberately provided as a string, not a URI, since it is the string that is the identifier, not some resource identified by <uri:ISBN:ISBN>. This conforms to the range defined in the BIBO.
bibo:issn	The ISSN of the resource, found in `<rdfmab:field/542._a>`. The ISSN is deliberately provided as a string, not a URI, since it is the string that is the identifier, not some resource identified by <uri:ISSN:ISSN>. This conforms to the range defined in the BIBO.
dc:extent	The extent of the resource, usually the number of pages, as found in `<rdfmab:field/433__a>`.
dc:issued	The year the resource was issued, as found in `<rdfmab:field/425a_a>`.
rdf:type	The type of a resource is derived from several fields, so a resource may end up with multiple types. The current mapping is most likely oversimplified and will be the subject of further analysis for future releases. If the value of `<rdfmab:field/050>` has an `a` in the first position, the resource is typed as `dc:BibliographicResource`. If the value of `<rdfmab:field/051>` has an `m` in the first position, the resource is typed as `bibo:Book`. In addition, all resources are typed as `frbr:Manifestation`.
bibo:volume	The volume number of the resource, found in `<rdfmab:field/090__a>`, which holds the sortable form. If this is not available, the descriptive form in field `<rdfmab:field/089__a>` is used.
dc:isPartOf	Fortunately, the original data already includes many links from subordinate to superordinate records, which can be used to link the corresponding resources: `<rdfmab:field/010__a>` contains the record ID of the direct superordinate record; `<rdfmab:field/453__a>` contains the record ID of the first series title; `<rdfmab:field/599__a>` contains the record ID of the record describing the journal in which this resource is published.
bibo:authorlist	The `<rdfmab:field/1..._9>` fields contain authority numbers of the authors of the resource. To preserve the order, an rdf-list is used instead of simply linking all authors directly via `dc:creator`. The downside of this is that currently the authorlists are blank nodes and thus not handled ideally by generic Linked-Data-Displays such as pubby. Note that there are basically two types of authority numbers in the data: those maintained by the DNB (which are available as Linked Data) and local hbz-numbers, which are not available as Linked Data. In the first case, the resulting link leads to the Linked Data Service of the DNB, in the latter case the link unfortunately leads nowhere.
dc:publisher	The fields `<rdfmab:field/412__a>` and `<rdfmab:field/410__a>` contain the name and the place of the publisher, respectively. To conform to the range of the `dc:publisher` predicate as defined in the DCMI Metadata Terms, we have introduced blank nodes for the publishers, typed as `foaf:Organisation`. The place of the publisher is attached as another blank node via `geo:location`. That blank node is typed `geo:SpatialThing` and has the name of the place attached by `geonames:name`, since we lack a mapping of the place names to geonames identifiers. We are aware that this seems overly complicated, but we are trying to identify and properly model the entities that are referenced in the original data, even if that results in blank nodes in the first run. As soon as an authority file for publishers is available, we will try to link there. We might even have a look at the resulting blank nodes and see if the information is clean enough to form the basis of such a file.
frbr:exemplar	In the current state of the raw data, holding information is only implicitly available. Since the records are segmented into packages by institution, we know that an institution is the `frbr:owner` of at least one `frbr:Item` of the described `frbr:Manifestation`. Since we currently do not have shelf mark (call number) information, those items are once again modelled as blank nodes.
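
Two of the rules in the table, the truncation of DDC notations and the derivation of types, can be restated as a small Python sketch. The function names and the sample values are hypothetical and only serve as illustration; the dewey.info URI pattern is the one used in the conversion script further down.

```python
def ddc_subject_uri(notation):
    """Link a DDC notation to dewey.info, truncated to its first three digits."""
    return "http://dewey.info/class/%s/" % notation[:3]


def resource_types(field_050, field_051):
    """Derive the rdf:type values from the first character of fields 050 and 051."""
    types = []
    if field_050.startswith("a"):
        types.append("dc:BibliographicResource")
    if field_051.startswith("m"):
        types.append("bibo:Book")
    # Every resource is additionally typed as a manifestation.
    types.append("frbr:Manifestation")
    return types
```

For a made-up notation such as `943.55`, `ddc_subject_uri` returns `http://dewey.info/class/943/`. A record whose fields 050 and 051 start with neither `a` nor `m` is typed only as `frbr:Manifestation`.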

There is a complete documentation of the fields found in the RDF/ISO2709-Version of the data. Unfortunately, the RDF/ISO2709-fields are not completely in line with this official documentation. This is due to the fact that our data passes through an interface that is based on MARC21 before it is published. Some fields are renamed in this process. We are working on either documenting the differences or using the proper fields.

The resulting model

Infrastructure

The conversion results in 82,471,813 triples, which we have loaded into a 4store instance that provides a SPARQL endpoint. To serve the data as Linked Data, a Pubby Linked Data frontend is tied to that endpoint. You can also download the entire dump.

Conversion process

Although we have released the raw data in N-Triples format, native RDF tools such as rdflib for Python have proved far too slow to handle this amount of data. Regular expressions in Perl are much faster and are therefore used here. Because of the blank nodes explained above, the script outputs RDF in Turtle notation, so that blank node identifiers don't have to be generated.
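
The point about blank node identifiers can be shown with a short Python sketch (an illustration only, not the script we ran). Turtle's square-bracket syntax lets the publisher and its place be written as nested anonymous nodes, so no `_:` labels are needed. The function names and the sample values in the usage note are hypothetical.

```python
def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and double quotes."""
    return '"%s"' % text.replace("\\", "\\\\").replace('"', '\\"')


def publisher_turtle(name, place=None):
    """Return a dcterms:publisher statement whose object is an anonymous node."""
    parts = ["rdf:type foaf:Organisation", "foaf:name " + turtle_literal(name)]
    if place is not None:
        # The place is a second anonymous node nested inside the first one.
        parts.append(
            "geo:location [ rdf:type geo:SpatialThing ; geonames:name %s ]"
            % turtle_literal(place)
        )
    return "dcterms:publisher [ %s ] ;" % " ; ".join(parts)
```

Calling `publisher_turtle("Springer", "Berlin")` produces one predicate-object pair that can be appended to the description of a resource. N-Triples would need a generated identifier for each of the two nodes.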

Simple perl-regex based conversion script

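# The first argument is a file listing one record file per line;
# the Turtle output is written next to it.
open(INDEX, '<', $ARGV[0]) or die "Cannot read $ARGV[0]: $!";
open(OUTPUT, '>', "$ARGV[0].ttl") or die "Cannot write $ARGV[0].ttl: $!";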

print OUTPUT
"\@prefix bibo:    <http://purl.org/ontology/bibo/> .
\@prefix dcterms:  <http://purl.org/dc/terms/> .
\@prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
\@prefix foaf:     <http://xmlns.com/foaf/0.1/> .
\@prefix geo:      <http://www.w3.org/2003/01/geo/wgs84_pos#> .
\@prefix geonames: <http://www.geonames.org/ontology#> .
\@prefix owl:      <http://www.w3.org/2002/07/owl#> .
\@prefix frbr:     <http://purl.org/vocab/frbr/core#> .
\n";

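while ($record = <INDEX>) {
chomp $record;
# Slurp the whole record file into $_ so that the substitutions below apply to it.
open(FILE, '<', $record) or do { warn "Cannot read $record: $!"; next; };
$_ = do { local $/; <FILE> };
close FILE;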

# Drop the record subject and turn each statement into a predicate-object pair.
s/<tag:hbz.metadata.mab.rdfmab#[^>]*> (.*) \./$1 ;/g;
# The record identifier becomes the subject URI.
s/<rdfmab:field\/001__a> "(.*)" ;/<http:\/\/lobid.org\/resource\/$1>/g;
s/<rdfmab:field\/010__a> "(.*)"/dcterms:isPartOf <http:\/\/lobid.org\/resource\/$1>/g;
s/<rdfmab:field\/026__a> "((?!HBZ).*)"/owl:sameAs <urn:nbn:de:eki\/$1>/g;
s/<rdfmab:field\/037b_a>/dcterms:language/g;
s/<rdfmab:field\/050> "a.*"/rdf:type dcterms:BibliographicResource/g;
s/<rdfmab:field\/051> "m.*"/rdf:type bibo:Book/g;
# Prefer the sortable volume number (090); fall back to the descriptive form (089).
if (m/<rdfmab:field\/090__a>/) {
    s/<rdfmab:field\/090__a>/bibo:volume/g;
} else {
    s/<rdfmab:field\/089__a>/bibo:volume/g;
}
s/<rdfmab:field\/331._a>/dcterms:title/g;
s/<rdfmab:field\/425a_a>/dcterms:issued/g;
s/<rdfmab:field\/9..__9> "(.*)"/dcterms:subject <http:\/\/d-nb.info\/gnd\/$1>/g;
s/<rdfmab:field\/453__a> "(.*)"/dcterms:isPartOf <http:\/\/lobid.org\/resource\/$1>/g;
s/<rdfmab:field\/540._a>/bibo:isbn/g;
s/<rdfmab:field\/542._a>/bibo:issn/g;
s/<rdfmab:field\/433__a>/dcterms:extent/g;
s/<rdfmab:field\/599__a> "(.*)"/dcterms:isPartOf <http:\/\/lobid.org\/resource\/$1>/g;
# DDC notations are truncated to their first three characters.
s/<rdfmab:field\/700b_a> "(...).*"/dcterms:subject <http:\/\/dewey.info\/class\/$1\/>/g;

$output = $_;

if (@authors = (m/<rdfmab:field\/1..._9> "(.*)"/g)) {
    $output .= "bibo:authorlist (\n";
    for $author (@authors) {
        # Local hbz authority numbers start with 'HP'; compare as strings (eq), not numbers (==).
        if (substr($author, 0, 2) eq 'HP') {
            $output .= "<http://lobid.org/person/$author>\n";
        } else {
            $output .= "<http://d-nb.info/gnd/$author>\n";
        }
    }
    $output .= ");\n";
}

if (@publisher = (m/<rdfmab:field\/412__a> (.*) ;/g)) {
    $output .= "dcterms:publisher [\n";
    $output .= "rdf:type foaf:Organisation ;\n";
    $output .= "foaf:name $publisher[0] ;\n";
    if (@place = (m/<rdfmab:field\/410__a> (.*) ;/g)) {
        $output .= "geo:location [\n";
        $output .= "rdf:type geo:SpatialThing ;\n";
        $output .= "geonames:name $place[0] ;\n";
        $output .= "]\n";
    }
    $output .= "];\n";
}

$output =~ s/<rdfmab:field.*\n//g;

$output .= "rdf:type frbr:Manifestation .\n";
print OUTPUT $output;

}

close INDEX;
close OUTPUT;
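
For readers less familiar with Perl, the author handling of the script can be restated as a Python sketch. This is an illustration only, not the code we ran. The function name is hypothetical, and the identifiers in the usage note are made-up sample values.

```python
def author_list_turtle(authority_ids):
    """Render authority numbers as an ordered Turtle collection."""
    uris = []
    for authority_id in authority_ids:
        if authority_id.startswith("HP"):
            # Local hbz number: not available as Linked Data.
            uris.append("<http://lobid.org/person/%s>" % authority_id)
        else:
            # Number maintained by the DNB: link to the DNB Linked Data Service.
            uris.append("<http://d-nb.info/gnd/%s>" % authority_id)
    # A Turtle collection "( ... )" keeps the order of its members.
    return "bibo:authorlist (\n%s\n) ;" % "\n".join(uris)
```

For example, `author_list_turtle(["HP00000001", "100000000"])` lists the lobid.org link first and the d-nb.info link second, in the order of the input.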

Preliminary steps

We have released those parts of the union catalog in which participating institutions have holdings. For each institution, the corresponding subset of records was extracted from the union catalog and packaged independently of any other records. This results in duplicate records for resources held by several institutions. Thus, the first step was to generate a list of unique files. This list is then split into parts so that the data can be processed in parallel.

Preparation: Create list of unique files

$ find ./data/ -type f -printf %h -printf / -printf %f -printf \\t -printf %f\\n > index.txt
$ wc -l index.txt
7539743 index.txt
$ head index.txt
./data/20100728-DE-38M/0000/0047/HT007583895    HT007583895
./data/20100728-DE-38M/0000/0047/HT001644273    HT001644273
./data/20100728-DE-38M/0000/0047/HT004940272    HT004940272
./data/20100728-DE-38M/0000/0047/HT002031904    HT002031904
./data/20100728-DE-38M/0000/0047/HT003301706    HT003301706
./data/20100728-DE-38M/0000/0047/HT003003970    HT003003970
./data/20100728-DE-38M/0000/0047/HT008492141    HT008492141
./data/20100728-DE-38M/0000/0047/HT003747423    HT003747423
./data/20100728-DE-38M/0000/0047/HT003282525    HT003282525
./data/20100728-DE-38M/0000/0047/HT005010942    HT005010942
$ sort -u -k2 index.txt | cut -f1 > index_uniq.txt
$ wc -l index_uniq.txt
5542687 index_uniq.txt
$ head index_uniq.txt
./data/20100728-DE-929/0000/0060/BT000000626
./data/20100728-DE-929/0000/0019/BT000000628
./data/20100728-DE-38/0000/0019/BT000000887
./data/20100728-DE-107/0000/0003/BT000001724
./data/20100728-DE-38M/0000/0034/BT000003114
./data/20100728-DE-38/0000/0070/BT000003669
./data/20100728-DE-38/0000/0172/BT000003778
./data/20100728-DE-107/0000/0046/BT000004683
./data/20100728-DE-Zw1/0000/0001/BT000005415
./data/20100728-DE-Zw1/0000/0002/BT000006239
$ split -l 692836 index_uniq.txt part
$ wc -l parta*
   692836 partaa
   692836 partab
   692836 partac
   692836 partad
   692836 partae
   692836 partaf
   692836 partag
   692835 partah
  5542687 total
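
The deduplication step (`sort -u -k2 ... | cut -f1`) can also be sketched in Python for readers who want to follow the logic. It assumes tab-separated lines of the form path, record ID, as produced by the `find` call. The function name is hypothetical, and which of several duplicate paths survives is treated as unimportant.

```python
def unique_paths(index_lines):
    """Keep one file path per record ID, ordered by record ID."""
    first_path = {}
    for line in index_lines:
        path, record_id = line.rstrip("\n").split("\t")
        # Remember only the first path seen for each record ID.
        first_path.setdefault(record_id, path)
    return [first_path[record_id] for record_id in sorted(first_path)]
```

Given two packages that both contain a record, only one of the two paths ends up in the result.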

Invoking the script

$ for file in part*; do perl rdfmab2bibo.pl $file & done
$ head partaa.ttl
<http://lobid.org/resource/BT000000626>
dc:language"ger" ;
rdf:type dc:BibliographicResource ;
rdf:type bibo:Book ;
dc:title "Freizeitkarte, leisure map, carte loisirs Ennepe-Ruhr-Kreis" ;
dc:created "1993" ;
dc:extent "1 Kt. : mehrfarb. ; 47 x 55 cm, gefaltet" ;
bibo:isbn "3-8164-0500-2" ;
dc:subject <http://d-nb.info/gnd/4014819-1> ;
dc:subject <http://d-nb.info/gnd/4155353-6> ;

Generating holdings information

To generate the holding information, we simply generate the corresponding triples based on the file names in the data packaged by institution:

$ find data/20100728-DE-Zw1 -type f -printf "<http://lobid.org/resource/%f> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].\n" > holdings_DE-Zw1.ttl
$ head holdings_DE-Zw1.ttl
<http://lobid.org/resource/HT000543651> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/TT002230534> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/TT001091846> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/HT001266762> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/HT004029684> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/HT003848295> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
<http://lobid.org/resource/HT014029353> frbr:exemplar [rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/DE-Zw1>].
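
The same statement can be built per record in Python, which may help when adapting the step to other tooling. This is a sketch with hypothetical function names. It assumes that package directories are named date, hyphen, institution code, as in `20100728-DE-Zw1`.

```python
def package_institution(package_dir):
    """Extract the institution code from a name like '20100728-DE-Zw1'."""
    return package_dir.split("-", 1)[1]


def holding_triple(record_id, institution):
    """Build the frbr:exemplar statement for one record held by an institution."""
    return (
        "<http://lobid.org/resource/%s> frbr:exemplar "
        "[rdf:type frbr:Item; frbr:owner <http://lobid.org/organisation/%s>]."
        % (record_id, institution)
    )
```

Calling `holding_triple("HT000543651", package_institution("20100728-DE-Zw1"))` reproduces the first line of the output shown above.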

Importing into 4store

$ 4s-import -f turtle -v hbzlod -M http://lobid.org/resource/ hbzlod/hbzlod*.ttl
$ 4s-import -f turtle -v hbzlod -M http://lobid.org/resource/holdings/ hbzlod/holdings_DE-*
