The RadioTV Datamodel in DOMS

The Radio TV collection is to be ingested into DOMS and made available through it. This document describes the datamodel used for the corresponding objects in DOMS.

Introduction

We chose a program-centric datamodel. The primary object is the Program object, which corresponds to one TV show, radio program or movie. The datamodel encodes no higher-level structure describing relations between programs. So, while the programs do contain information about when they were aired, they hold no links to the previous or next program. That structure must be created dynamically from an index.
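As an illustration of how the previous/next structure could be derived from an index rather than stored in the objects, here is a minimal sketch. The ProgramEntry fields, the use of a channel identifier and the representation of air time as seconds since the epoch are all assumptions made for the example; the actual index used with DOMS may look quite different.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ProgramEntry:
    pid: str      # hypothetical pid of the Program object
    channel: str  # hypothetical channel identifier
    start: int    # air time in seconds since the epoch (assumed representation)


def build_index(entries: Iterable[ProgramEntry]) -> Dict[str, List[ProgramEntry]]:
    """Group entries by channel and sort each group by air time."""
    index: Dict[str, List[ProgramEntry]] = {}
    for entry in entries:
        index.setdefault(entry.channel, []).append(entry)
    for channel_entries in index.values():
        channel_entries.sort(key=lambda e: e.start)
    return index


def neighbours(
    index: Dict[str, List[ProgramEntry]], entry: ProgramEntry
) -> Tuple[Optional[ProgramEntry], Optional[ProgramEntry]]:
    """Return (previous, next) on the same channel; None at either end.

    Assumes the entry is present in the index and that start times are
    unique within a channel.
    """
    channel_entries = index[entry.channel]
    starts = [e.start for e in channel_entries]
    i = bisect_left(starts, entry.start)
    previous = channel_entries[i - 1] if i > 0 else None
    following = channel_entries[i + 1] if i + 1 < len(channel_entries) else None
    return previous, following
```

The point of the sketch is only that the ordering lives in the index: the Program objects themselves carry nothing but their own air time.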

Besides the Program object, the datamodel consists of two other types of objects, Shard and File. A File object represents one of the recording files, which span multiple programs and channels. A Shard object represents the exact recording of one specific program. The trick here is that Shard objects do not refer to real files; the files they refer to are created dynamically when needed. Only the File objects refer to real files.

Details

As stated above, there are three kinds of objects: Program, Shard and File.

Program

The Program object contains all the bibliographic information we have about the specific aired program. The primary data is stored in the PBCORE datastream, unsurprisingly in the PBCore format. We use PBCore version 1.1.

The original bibliographic metadata is in the Ritzau and Gallup/TVMeter formats. At present we do not have access to the Gallup/TVMeter data. The original data is stored in the RITZAU_ORIGINAL and GALLUP_ORIGINAL datastreams of the Program object. These datastreams have no useful schema, as the data is not really XML.

The Program object contains exactly one relation to a Shard object, with the predicate "http://doms.statsbiblioteket.dk/relations/default/0/1/#hasShard". Program and Shard objects must be in a 1-to-1 relation.
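The constraint can be made concrete with a small check. This is only a sketch: it assumes the relations of each Program object have already been read out as (predicate, object pid) pairs, and the function names and pids are hypothetical. It does not show how DOMS itself validates objects.

```python
from typing import Dict, Iterable, Tuple

HAS_SHARD = "http://doms.statsbiblioteket.dk/relations/default/0/1/#hasShard"

Relation = Tuple[str, str]  # (predicate, object pid); assumed representation


def shard_of(program_pid: str, relations: Iterable[Relation]) -> str:
    """Return the single Shard pid of a Program, or raise ValueError."""
    shards = [obj for predicate, obj in relations if predicate == HAS_SHARD]
    if len(shards) != 1:
        raise ValueError(
            f"{program_pid}: expected exactly one hasShard relation, "
            f"found {len(shards)}"
        )
    return shards[0]


def check_one_to_one(programs: Dict[str, Iterable[Relation]]) -> Dict[str, str]:
    """Map each Program pid to its Shard pid, rejecting shared Shards."""
    owner_of_shard: Dict[str, str] = {}
    shard_of_program: Dict[str, str] = {}
    for program_pid, relations in programs.items():
        shard_pid = shard_of(program_pid, relations)
        if shard_pid in owner_of_shard:
            raise ValueError(
                f"{shard_pid} is used by both {owner_of_shard[shard_pid]} "
                f"and {program_pid}"
            )
        owner_of_shard[shard_pid] = program_pid
        shard_of_program[program_pid] = shard_pid
    return shard_of_program
```

The first function covers "exactly one relation per Program"; the second adds the other half of the 1-to-1 requirement, that no Shard is shared between Programs.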

Shard

There is one Shard object for each Program object. A Shard object is really a very special kind of File object. Because it is a File object, it has a CONTENTS datastream holding the URL of the data. As it is not a "real" File object, the URL does not refer to a real file. In fact, it does not refer to anything at the moment. It is always of the form "http://www.statsbiblioteket.dk/doms/shard/{shard-pid}", where {shard-pid} is the pid of the Shard object.
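The URL form is simple enough to express as a pair of helper functions. This is a minimal sketch: the function names are hypothetical, and it assumes the pid is inserted verbatim, without any URL-encoding, since the text above does not say otherwise.

```python
SHARD_URL_PREFIX = "http://www.statsbiblioteket.dk/doms/shard/"


def shard_url(shard_pid: str) -> str:
    """Build the CONTENTS URL for a Shard object from its pid."""
    if not shard_pid:
        raise ValueError("shard pid must not be empty")
    # Assumption: the pid is appended as-is, with no escaping.
    return SHARD_URL_PREFIX + shard_pid


def shard_pid_from_url(url: str) -> str:
    """Recover the Shard pid from a URL of the form described above."""
    if not url.startswith(SHARD_URL_PREFIX) or len(url) == len(SHARD_URL_PREFIX):
        raise ValueError(f"not a shard URL: {url!r}")
    return url[len(SHARD_URL_PREFIX):]
```

Since the URL currently resolves to nothing, helpers like these are useful mainly for recognising a Shard URL and mapping it back to the object it belongs to.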

Because it is a File object, the Shard object must also have a "CHARACTERISATION" datastream. As the virtual file has no useful characterisation information, the datastream is filled with placeholder values so that the object validates.

The Shard object has one more datastream, "SHARD_METADATA". It contains the information about which data file(s) hold the relevant material, and which offsets and cutoff values are used.

An example of the Shard metadata can be seen below. The file tag is repeatable; the other tags are not.

<shard_metadata>
  <file>
    <file_url>http://bitfinder.statsbiblioteket.dk/bart/mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_url>
    <channel_id>102</channel_id>
    <program_start_offset>2100</program_start_offset>
    <program_clip_length>1500</program_clip_length>
    <file_name>mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_name>
    <format_uri>info:pronom/x-fmt/386</format_uri>
  </file>
</shard_metadata>

This is all the information the ShardCutter/BroadcastExtraction service needs to extract and transcode the relevant part of the recording.
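To show how a consumer might read this datastream, here is a minimal parsing sketch using only the Python standard library. The class and function names are hypothetical, and this is not the ShardCutter/BroadcastExtraction implementation. The units of program_start_offset and program_clip_length are not stated in this document, so the sketch keeps them as plain integers without interpreting them.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

EXAMPLE = """<shard_metadata>
  <file>
    <file_url>http://bitfinder.statsbiblioteket.dk/bart/mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_url>
    <channel_id>102</channel_id>
    <program_start_offset>2100</program_start_offset>
    <program_clip_length>1500</program_clip_length>
    <file_name>mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_name>
    <format_uri>info:pronom/x-fmt/386</format_uri>
  </file>
</shard_metadata>"""


@dataclass(frozen=True)
class ShardFile:
    file_url: str
    channel_id: int
    program_start_offset: int  # unit not specified in the source document
    program_clip_length: int   # unit not specified in the source document
    file_name: str
    format_uri: str


def _required_text(node: ET.Element, tag: str) -> str:
    """Return the stripped text of a child element, or raise if it is absent."""
    value = node.findtext(tag)
    if value is None:
        raise ValueError(f"<file> is missing <{tag}>")
    return value.strip()


def parse_shard_metadata(xml_text: str) -> List[ShardFile]:
    """Parse a SHARD_METADATA document into one ShardFile per <file> tag."""
    root = ET.fromstring(xml_text)
    if root.tag != "shard_metadata":
        raise ValueError(f"unexpected root element <{root.tag}>")
    return [
        ShardFile(
            file_url=_required_text(node, "file_url"),
            channel_id=int(_required_text(node, "channel_id")),
            program_start_offset=int(_required_text(node, "program_start_offset")),
            program_clip_length=int(_required_text(node, "program_clip_length")),
            file_name=_required_text(node, "file_name"),
            format_uri=_required_text(node, "format_uri"),
        )
        for node in root.findall("file")
    ]
```

Calling parse_shard_metadata(EXAMPLE) yields a list with one ShardFile; a Shard that spans several recording files would yield one entry per file tag, in document order.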