Diff for "DataModel/ContentModel File"

Differences between revisions 5 and 6

doms:ContentModel_File

Extends [:DataModel/ContentModel_DOMS: doms:ContentModel_DOMS]

In DOMS, we have found it beneficial to separate the abstract concept of "Image" or "Audio" from the concrete implementations such as "jpeg" and "mp3". The metadata about the image will be relevant no matter the manifestation of the image, and as such should not reside along with the technical metadata about the manifestation. To support this separation, we have introduced the concept of File objects.

A File object is an object that contains a link to the file (in Bitstorage) and the technical metadata about this file. Only File objects are allowed to reference a file in Bitstorage. All File objects must have a Content Model that extends ContentModel_File.

Objects of ContentModel_File can have "doms-relations:hasOriginal" relations to other objects of "doms:ContentModel_File". If a file A is the result of a migration of file B and both are in DOMS, the data object for File A will have a "doms-relations:hasOriginal" relation to the data object for File B.
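
As a rough illustration of how the "doms-relations:hasOriginal" relation can be followed from a migrated file back to its source, here is a minimal Python sketch. It assumes the relations have already been read into a plain dictionary and that each File object has at most one original; the function name and the PIDs are hypothetical and not part of DOMS.

```python
def migration_chain(pid, has_original):
    """Follow hasOriginal relations from pid back to the earliest original.

    has_original maps the PID of a File object to the PID of the File
    object it was migrated from (assumption: at most one original each).
    """
    chain = [pid]
    seen = {pid}
    while chain[-1] in has_original:
        original = has_original[chain[-1]]
        if original in seen:
            # A relation loop would indicate inconsistent data.
            raise ValueError("cycle in hasOriginal relations at " + original)
        chain.append(original)
        seen.add(original)
    return chain


# Hypothetical PIDs: file A was migrated from file B.
relations = {"doms:fileA": "doms:fileB"}
print(migration_chain("doms:fileA", relations))
```

The last element of the returned list is the file that has no recorded original.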

Objects of doms:ContentModel_File must have the datastreams "CHARACTERISATION", "CONTENTS", "ORIGIN" and "PRONOMID". Each of these deserves some description.

"CONTENTS" is where the actual file is. It will always have the "E" controlgroup, meaning that the file is externally referenced. The reference must always be to a file in Bitstorage, and only this datastream is ever allowed to reference files. The datastream itself is just a URL, but if you access it through API-A, you get the contents of the file.

PRONOMID

In order to perform proper digital preservation, we need to record the exact format of each file. The National Archives of the United Kingdom have developed the PRONOM scheme (http://www.nationalarchives.gov.uk/pronom). For each file format, or version thereof, in the registry, they have a signature file that enables their tool (DROID) to identify files of this type.

We selected PRONOM because it can currently identify all the relevant preservation file formats used in DOMS, and because it was the closest we could find to a uniformly accepted standard.

The schema for this datastream is:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            targetNamespace="http://doms.statsbiblioteket.dk/properties/pronomID/0/1/#"
            xmlns="http://doms.statsbiblioteket.dk/properties/pronomID/0/1/#"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">

    <xsd:element name="pronomID" type="extpropertiesType"/>

    <xsd:complexType name="extpropertiesType">
        <xsd:attribute name="value" type="xsd:string"/>
    </xsd:complexType>
</xsd:schema>
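
To make the shape of this datastream concrete, the following Python sketch reads the "value" attribute from a pronomID document. The namespace is the target namespace from the schema above; the sample content, the value "fmt/43" and the function name are illustrative assumptions only, not taken from a real DOMS object.

```python
import xml.etree.ElementTree as ET

# Target namespace of the pronomID schema.
PRONOM_NS = "http://doms.statsbiblioteket.dk/properties/pronomID/0/1/#"

# Hypothetical datastream content; "fmt/43" is only an illustrative value.
SAMPLE = '<pronomID xmlns="' + PRONOM_NS + '" value="fmt/43"/>'


def read_pronom_id(xml_text):
    """Return the value attribute of the pronomID root element, or None."""
    root = ET.fromstring(xml_text)
    expected_tag = "{" + PRONOM_NS + "}pronomID"
    if root.tag != expected_tag:
        raise ValueError("unexpected root element: " + root.tag)
    # attributeFormDefault="unqualified": the attribute has no namespace.
    # The schema does not mark it as required, so it may be absent.
    return root.get("value")


print(read_pronom_id(SAMPLE))  # prints fmt/43
```

Because the schema declares the attribute without use="required", a caller should be prepared for a missing value.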

ORIGIN

This section depends on input from Elsebeth, so it has intentionally been left blank.

CHARACTERISATION

When a file is uploaded to Bitstorage (see Bitstorage_API), it is automatically subjected to a series of characterisation tools. Aside from extracting the PRONOM ID, which should be stored in the "PRONOMID" datastream, they also extract a lot of other metadata. This metadata is highly dependent on the type of file in question, so parsing it in a general context is not feasible. Instead, a datastream has been made to hold it, so that it can be parsed at a later date if the particular file becomes interesting.

There is no schema for the actual metadata. We have only defined a schema for the containing structure. The metadata must be well-formed XML, but it does not need to be schema validated.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            targetNamespace="http://doms.statsbiblioteket.dk/types/characterisation/0/1/#"
            xmlns="http://doms.statsbiblioteket.dk/types/characterisation/0/1/#"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">

    <xsd:element name="characterisation" type="characterisationType"/>

    <xsd:complexType name="characterisationType">
        <xsd:sequence>
            <xsd:element name="characterisationRun"
                         minOccurs="1"
                         maxOccurs="unbounded"
                         type="characterisationRunType"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="characterisationRunType">
        <xsd:sequence>
            <xsd:element name="tool" type="xsd:string" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="output" type="outputType" maxOccurs="1"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="outputType">
        <xsd:sequence>
            <xsd:any namespace="##any" processContents="skip" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>
</xsd:schema>

The characterisation datastream could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<char:characterisation xsi:schemaLocation="http://doms.statsbiblioteket.dk/types/characterisation/0/1/# http://developer.statsbiblioteket.dk/DOMS/types/characterisation/0/1/characterisation/characterisation-0-1.xsd"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:char="http://doms.statsbiblioteket.dk/types/characterisation/0/1/#"
  xmlns:jhove="">
  <char:characterisationRun>
    <char:tool>JHove</char:tool>
    <char:output>
      <jhove:...>
      </jhove:...>
    </char:output>
  </char:characterisationRun>
</char:characterisation>
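
Since the tool output is stored unparsed, a consumer would typically only unpack the containing structure and hand the raw output on to a format-specific parser later. A minimal Python sketch of that step follows. It assumes the wrapper elements are in the target namespace of the schema above; the sample payload, its "urn:example:tool-output" namespace and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Target namespace of the characterisation schema.
CHAR_NS = "http://doms.statsbiblioteket.dk/types/characterisation/0/1/#"

# Hypothetical datastream content; the <report> payload is made up.
SAMPLE = """<char:characterisation xmlns:char="http://doms.statsbiblioteket.dk/types/characterisation/0/1/#">
  <char:characterisationRun>
    <char:tool>JHove</char:tool>
    <char:output>
      <report xmlns="urn:example:tool-output"><format>JPEG</format></report>
    </char:output>
  </char:characterisationRun>
</char:characterisation>"""


def list_runs(xml_text):
    """Return a list of (tool name, serialised output children) pairs."""
    root = ET.fromstring(xml_text)
    runs = []
    for run in root.findall("{%s}characterisationRun" % CHAR_NS):
        tool = run.findtext("{%s}tool" % CHAR_NS)
        output = run.find("{%s}output" % CHAR_NS)
        payload = []
        if output is not None:
            # Keep the tool output as opaque XML text; it is not interpreted here.
            payload = [ET.tostring(child, encoding="unicode") for child in output]
        runs.append((tool, payload))
    return runs


for tool, payload in list_runs(SAMPLE):
    print(tool, len(payload))
```

Keeping the output as serialised text mirrors the processContents="skip" setting in the schema: the wrapper is checked, the payload is not.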

-  ⇤ ← Revision 5 as of 2008-10-01 13:18:00 → 
  Size: 4165
  Editor: abr
  Comment:
+   ← Revision 6 as of 2008-10-02 14:05:57 → ⇥
  Size: 5682
  Editor: abr
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-=== doms:ContentModel_File ===
+== doms:ContentModel_File ==
 Line 3:
-Line 9:
+Line 8:
-It defines that objects of !ContentModel_File can have "doms-relations:hasOriginal" relations to other objects of "doms:!ContentModel_File". If a file A is the result of a migration of file B  and both are in Doms, the File A data object will have a "doms-relations:hasOriginal" relation to the data object for File B.
+Objects of !ContentModel_File can have "doms-relations:hasOriginal" relations to other objects of "doms:!ContentModel_File". If a file A is the result of a migration of file B  and both are in Doms, the File A data object will have a "doms-relations:hasOriginal" relation to the data object for File B.
-Line 11:
+Line 10:
-Data objects of doms:!ContentModel_File must have the datastream "CHARACTERISATION", "CONTENTS", "ORIGIN" and "PRONOMID". Each of these deserve some description.
+Objects of doms:!ContentModel_File must have the datastream "CHARACTERISATION", "CONTENTS", "ORIGIN" and "PRONOMID". Each of these deserve some description.
-Line 15:
+Line 14:
-If you get the datastream through the standard API, you get the contents of the file, not the link.
+The datastream itself is just an URL, but if you access it through the API-A, you get the contents of the file.
-Line 25:
+Line 23:
-The schema from the PRONOMID_SCHEMA datastream.
+The schema for this datastream is:
-Line 42:
+Line 40:
-Line 45:
+Line 44:
+This section depends on input from Elsebeth, so it has intentionally left blank.
-Line 47:
+Line 48:
+When a file is uploaded to Bitstorage (see Bitstorage_API), it is automatically subjected to a series of characterisation tools. Aside from extraction the PRONOM ID, which should be stored in the "PRONOMID" datastream, they also extract a lot of other metadata. This metadata is highly dependant on the type of file in question, so parsing it in a general context is not feasable. Instead, a datastream have been made to hold it, so that it can be parsed at a later date, if the particular file becomes interesting.
-Line 48:
+Line 50:
-Requirements for objects described by !ContentModel_File
 * ObjectProperties
  * External Properties
   * http://doms.statsbiblioteket.dk/extproperties/#pronomID : The pronom ID of the file
 * Datastreams
  * CHARACTERISATION: The output of the characterisation tools. Schema attachment:Characterisation.xsd
  * CONTENTS: Datastream containing the file
   * !ContentLocation URL = The file in Bitstorage
  * ORIGIN: Metadata about the creation of the file, in the Premis [http://www.loc.gov/standards/premis/v1/Event-v1-1.xsd schema]
+There is no schema for the actual metadata. We have defined a schema for the containing structure. The metadata must be valid xml, but it does not need to be schema validated.
-Line 58:
+Line 52:
+{{{
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            targetNamespace="http://doms.statsbiblioteket.dk/types/characterisation/0/1/#"
            xmlns="http://doms.statsbiblioteket.dk/types/characterisation/0/1/#"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">

    <xsd:element name="characterisation" type="characterisationType"/>

    <xsd:complexType name="characterisationType">
        <xsd:sequence>
            <xsd:element name="characterisationRun"
                         minOccurs="1"
                         maxOccurs="unbounded"
                         type="characterisationRunType"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="characterisationRunType">
        <xsd:sequence>
            <xsd:element name="tool" type="xsd:string" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="output" type="outputType" maxOccurs="1"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="outputType">
        <xsd:sequence>
            <xsd:any namespace="##any" processContents="skip" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>
</xsd:schema>
}}}