Data Shards ***

Context

When creating a new collection, choices often have to be made about the actual structure of the data. It would, for instance, be unreasonably demanding technically to store a continuous stream of television data as a single unit. The stream must be broken into shards to allow efficient data storage and handling.

Description of pattern

This pattern describes the "low level" construction of a shard of data. It is generalised in order to span the multiple possible applications of the shard data model in the DOMS.

Problem description

The major problem is the generality of the sharded data, which is also the strength of the shard pattern. Shards are general enough to apply to most situations where data is split up. They can be used to model data objects that are naturally granulated, such as data that originates from pages, as well as data that is originally more complex, such as the video formats stored from a DVB-T stream.

Solution

There will be one Shard object for each Program object. A Shard object is a very special kind of File object. Because it is a File object, it has a CONTENTS datastream with the URL to the data. As it is not a "real" File object, the URL does not refer to a real file. In fact, it does not refer to anything at the moment. It will always be of the form "http://www.statsbiblioteket.dk/doms/shard/{shard-pid}", where {shard-pid} is the PID of the Shard object.
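
As an illustration of the URL convention above, the following is a minimal sketch of how a client might build and recognise the placeholder CONTENTS URL. The function names and the example PID "uuid:example" are hypothetical and not part of the DOMS; only the URL prefix comes from this page.

```python
# Prefix taken from this page; everything else here is illustrative.
SHARD_URL_PREFIX = "http://www.statsbiblioteket.dk/doms/shard/"


def shard_contents_url(shard_pid: str) -> str:
    """Build the placeholder CONTENTS URL for a Shard object (hypothetical helper)."""
    if not shard_pid:
        raise ValueError("shard_pid must be non-empty")
    return SHARD_URL_PREFIX + shard_pid


def shard_pid_from_url(url: str) -> str:
    """Recover the shard PID from a placeholder CONTENTS URL (hypothetical helper)."""
    if not url.startswith(SHARD_URL_PREFIX) or len(url) == len(SHARD_URL_PREFIX):
        raise ValueError("not a shard URL: " + url)
    return url[len(SHARD_URL_PREFIX):]


if __name__ == "__main__":
    # "uuid:example" is a made-up PID used only for demonstration.
    print(shard_contents_url("uuid:example"))
```

Since the URL does not resolve to anything, a helper like this is only useful for creating shard objects consistently and for telling shard URLs apart from real file URLs.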

Because it is a File object, the Shard object must also have a "CHARACTERISATION" datastream. As the virtual file has no useful characterisation information, the datastream is filled in with placeholder values so that the object validates.

The Shard object has one more datastream, "SHARD_METADATA". This contains the information about which data file(s) hold the relevant data, and which offsets and cutoff values are used.

An example of the Shard Metadata can be seen below. The file tag is repeatable. The rest is not.

<shard_metadata>
  <file>
    <file_url>http://bitfinder.statsbiblioteket.dk/bart/mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_url>
    <channel_id>102</channel_id>
    <program_start_offset>2100</program_start_offset>
    <program_clip_length>1500</program_clip_length>
    <file_name>mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_name>
    <format_uri>info:pronom/x-fmt/386</format_uri>
  </file>
</shard_metadata>

This is (all) the information the ShardCutter/BroadcastExtraction service needs to extract and transcode the relevant bit of the recording.
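
To make the structure concrete, here is a hedged sketch of how a consumer might read the example above with Python's standard library. The class and function names are hypothetical, the sketch assumes the datastream has no XML namespace (as in the example), and it makes no assumption about the unit of the offset and length values.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class ShardFile:
    """One <file> entry from a SHARD_METADATA datastream (illustrative type)."""
    file_url: str
    channel_id: int
    program_start_offset: int  # unit is not stated on this page
    program_clip_length: int   # unit is not stated on this page
    file_name: str
    format_uri: str


def _required_text(element: ET.Element, tag: str) -> str:
    """Return the stripped text of a child element, or fail if it is absent."""
    value = element.findtext(tag)
    if value is None or not value.strip():
        raise ValueError("missing <%s> in <file>" % tag)
    return value.strip()


def parse_shard_metadata(xml_text: str) -> List[ShardFile]:
    """Parse SHARD_METADATA XML into a list of ShardFile entries."""
    root = ET.fromstring(xml_text)
    if root.tag != "shard_metadata":
        raise ValueError("unexpected root element <%s>" % root.tag)
    return [
        ShardFile(
            file_url=_required_text(f, "file_url"),
            channel_id=int(_required_text(f, "channel_id")),
            program_start_offset=int(_required_text(f, "program_start_offset")),
            program_clip_length=int(_required_text(f, "program_clip_length")),
            file_name=_required_text(f, "file_name"),
            format_uri=_required_text(f, "format_uri"),
        )
        for f in root.findall("file")  # the file element is repeatable
    ]


# The example from this page, used as sample input.
EXAMPLE = """<shard_metadata>
  <file>
    <file_url>http://bitfinder.statsbiblioteket.dk/bart/mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_url>
    <channel_id>102</channel_id>
    <program_start_offset>2100</program_start_offset>
    <program_clip_length>1500</program_clip_length>
    <file_name>mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_name>
    <format_uri>info:pronom/x-fmt/386</format_uri>
  </file>
</shard_metadata>"""

if __name__ == "__main__":
    for entry in parse_shard_metadata(EXAMPLE):
        print(entry.file_name, entry.program_start_offset, entry.program_clip_length)
```

A program that spans several recordings would simply yield several entries in the returned list, one per file element.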
The Shard object has one or more relations to File objects, with the predicate "http://doms.statsbiblioteket.dk/relations/default/0/1/#consistsOf". Each file mentioned in the SHARD_METADATA must also be referenced via this relation (to the corresponding file object).
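
The rule that every file in the SHARD_METADATA must also be referenced through a consistsOf relation could be checked along the following lines. This is only a sketch: the function name, the representation of relations as (predicate, PID) pairs, and the lookup from File object PID to the URL in its CONTENTS datastream are all assumptions made for illustration, as is the idea of matching files by URL.

```python
from typing import Dict, Iterable, List, Tuple

# Predicate taken from this page.
CONSISTS_OF = "http://doms.statsbiblioteket.dk/relations/default/0/1/#consistsOf"


def unreferenced_files(
    metadata_file_urls: Iterable[str],
    relations: Iterable[Tuple[str, str]],
    file_object_urls: Dict[str, str],
) -> List[str]:
    """Return the file URLs from SHARD_METADATA that no consistsOf relation covers.

    metadata_file_urls: the file_url values found in SHARD_METADATA.
    relations: (predicate, object PID) pairs of the Shard object (assumed shape).
    file_object_urls: File object PID -> URL in its CONTENTS datastream
                      (hypothetical lookup, e.g. built from repository queries).
    """
    related_urls = {
        file_object_urls[pid]
        for predicate, pid in relations
        if predicate == CONSISTS_OF and pid in file_object_urls
    }
    return [url for url in metadata_file_urls if url not in related_urls]


if __name__ == "__main__":
    # Made-up PIDs and URLs, for demonstration only.
    relations = [(CONSISTS_OF, "uuid:file-a")]
    lookup = {"uuid:file-a": "http://example.org/a.ts"}
    print(unreferenced_files(["http://example.org/a.ts", "http://example.org/b.ts"],
                             relations, lookup))
```

An empty result would mean that the Shard object satisfies the rule under these assumptions.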

Consider next

The actual mapping from a shard to the file, or part of a file, in the file system. Also consider the possible requirement for a way to combine shard data into a complete file or document.

GuidelinesForNewDatamodel/PatternLanguage/Data_Shards (last edited 2010-12-07 12:11:01 by eab)