
Data Shards ***

Context

When creating a new collection, choices often have to be made about the actual structure of the data. For instance, storing a continuous stream of television data would be unreasonably demanding technically, so the stream must be broken into shards to allow efficient data storage and handling.

Description of pattern

This pattern describes the "low level" construction of a shard of data. It is generalised in order to span the multiple possible applications of the shard data model in the DOMS.

Problem description

The major problem is the generality of the sharded data, which is also the strength of the shard pattern. Shards are general enough to apply to most situations where data is split up. They can be used to model data objects that are naturally granulated, such as data that originates from pages, as well as data that is originally more complex, such as the video formats stored from a DVB-T stream.

Solution

There will be one Shard object for each Program object. A Shard object is really a very special kind of File object. Because it is a File object, it has a CONTENTS datastream with the URL to the data. As it is not a "real" File object, the URL does not refer to a real file. In fact, it does not refer to anything at the moment. It will always be of the form "http://www.statsbiblioteket.dk/doms/shard/{shard-pid}", where {shard-pid} is the PID of the Shard object.
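The URL form described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example PID `uuid:example` are hypothetical and do not come from the DOMS code base. Only the URL prefix is taken from the text.

```python
# Prefix taken from the pattern description; everything else is illustrative.
SHARD_URL_PREFIX = "http://www.statsbiblioteket.dk/doms/shard/"


def shard_contents_url(shard_pid: str) -> str:
    """Build the placeholder CONTENTS URL for a Shard object (hypothetical helper)."""
    if not shard_pid:
        raise ValueError("shard_pid must be non-empty")
    return SHARD_URL_PREFIX + shard_pid


def shard_pid_from_url(url: str) -> str:
    """Recover the shard PID from a placeholder CONTENTS URL (hypothetical helper)."""
    if not url.startswith(SHARD_URL_PREFIX) or len(url) == len(SHARD_URL_PREFIX):
        raise ValueError("not a shard CONTENTS URL: " + url)
    return url[len(SHARD_URL_PREFIX):]


if __name__ == "__main__":
    # "uuid:example" is a made-up PID used only for demonstration.
    print(shard_contents_url("uuid:example"))
```

Since the URL does not resolve to anything, such helpers would only be useful for generating and recognising the placeholder value.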

Because it is a File object, the Shard object must also have a "CHARACTERISATION" datastream. As the virtual file has no useful characterisation information, the datastream is filled in with placeholder values so that the object validates.

The Shard object has one more datastream, "SHARD_METADATA". This contains the information about which data file(s) hold the relevant data, and which offsets and cutoff values are used.

An example of the Shard Metadata can be seen below. The file element is repeatable. The other elements are not.

<shard_metadata>
  <file>
    <file_url>http://bitfinder.statsbiblioteket.dk/bart/mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_url>
    <channel_id>102</channel_id>
    <program_start_offset>2100</program_start_offset>
    <program_clip_length>1500</program_clip_length>
    <file_name>mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_name>
    <format_uri>info:pronom/x-fmt/386</format_uri>
  </file>
</shard_metadata>

This is all the information the ShardCutter/BroadcastExtraction service needs to extract and transcode the relevant part of the recording.
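As a rough sketch of how a consumer might read the SHARD_METADATA datastream, the following Python uses the standard library's `xml.etree.ElementTree`. The `ShardFile` class and `parse_shard_metadata` function are hypothetical names, not part of the ShardCutter/BroadcastExtraction service. Parsing the offset and clip length as integers is an assumption based on the example values, and their unit is not stated in this pattern, so the code does not interpret it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class ShardFile:
    """One <file> entry of SHARD_METADATA (hypothetical container class)."""
    file_url: str
    channel_id: str
    program_start_offset: int  # assumed to be an integer; unit not specified here
    program_clip_length: int   # assumed to be an integer; unit not specified here
    file_name: str
    format_uri: str


def _child_text(parent: ET.Element, tag: str) -> str:
    """Return the text of a required child element, or raise if it is absent."""
    child = parent.find(tag)
    if child is None or child.text is None:
        raise ValueError("missing <%s> in <file>" % tag)
    return child.text.strip()


def parse_shard_metadata(xml_text: str) -> List[ShardFile]:
    """Parse a SHARD_METADATA document into a list of ShardFile entries."""
    root = ET.fromstring(xml_text)
    if root.tag != "shard_metadata":
        raise ValueError("unexpected root element: " + root.tag)
    files = [
        ShardFile(
            file_url=_child_text(f, "file_url"),
            channel_id=_child_text(f, "channel_id"),
            program_start_offset=int(_child_text(f, "program_start_offset")),
            program_clip_length=int(_child_text(f, "program_clip_length")),
            file_name=_child_text(f, "file_name"),
            format_uri=_child_text(f, "format_uri"),
        )
        # The file element is repeatable, so collect every occurrence.
        for f in root.findall("file")
    ]
    if not files:
        raise ValueError("shard_metadata contains no <file> elements")
    return files


# The example document from this page.
EXAMPLE = """<shard_metadata>
  <file>
    <file_url>http://bitfinder.statsbiblioteket.dk/bart/mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_url>
    <channel_id>102</channel_id>
    <program_start_offset>2100</program_start_offset>
    <program_clip_length>1500</program_clip_length>
    <file_name>mux1.1256943600-2009-10-31-00.00.00_1256947200-2009-10-31-01.00.00_dvb1-1.ts</file_name>
    <format_uri>info:pronom/x-fmt/386</format_uri>
  </file>
</shard_metadata>"""

if __name__ == "__main__":
    for entry in parse_shard_metadata(EXAMPLE):
        print(entry.file_name, entry.program_start_offset, entry.program_clip_length)
```

Returning a list even for a single entry keeps the caller's code the same when a program spans several recorded files.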

The Shard object has one or more relations to File objects, with the predicate "http://doms.statsbiblioteket.dk/relations/default/0/1/#consistsOf". Each file mentioned in the SHARD_METADATA must also be referenced via this relation (to the corresponding file object).
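The consistency rule above (every file in SHARD_METADATA must also be reachable through a consistsOf relation) could be checked along the following lines. This is a sketch under stated assumptions: relations are modelled as plain `(predicate, target_pid)` pairs, and a File object is matched to a metadata entry by comparing its CONTENTS URL with `file_url`. Neither the data representation nor the matching rule is defined in this pattern, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Predicate taken from the pattern description.
CONSISTS_OF = "http://doms.statsbiblioteket.dk/relations/default/0/1/#consistsOf"


def missing_consists_of(
    metadata_file_urls: Iterable[str],
    relations: Iterable[Tuple[str, str]],
    file_object_urls: Dict[str, str],
) -> List[str]:
    """Return the metadata file URLs that no consistsOf relation covers.

    relations: (predicate, target_pid) pairs on the Shard object (assumed shape).
    file_object_urls: maps a File object PID to its CONTENTS URL (assumed lookup).
    """
    related_urls = {
        file_object_urls[pid]
        for predicate, pid in relations
        if predicate == CONSISTS_OF and pid in file_object_urls
    }
    return [url for url in metadata_file_urls if url not in related_urls]


if __name__ == "__main__":
    # All PIDs and URLs below are made up for demonstration.
    rels = [(CONSISTS_OF, "uuid:file-a")]
    lookup = {"uuid:file-a": "http://example.org/a.ts"}
    print(missing_consists_of(["http://example.org/a.ts"], rels, lookup))
```

An empty result means every file mentioned in the metadata has a corresponding relation.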

Consider next

The actual mapping from a shard to the file (or part of a file) in the file system. Also consider the possible requirement for a way to combine shard data into a complete file or document.

GuidelinesForNewDatamodel/PatternLanguage/Data_Shards (last edited 2010-12-07 12:11:01 by eab)