Diff for "DataModel/ContentModel File"

Differences between revisions 1 and 8 (spanning 7 versions)

== doms:ContentModel_File ==

Extends [:DataModel/ContentModel_DOMS: doms:ContentModel_DOMS]

In DOMS, we have found it beneficial to separate the abstract concept of "Image" or "Audio" from the concrete implementations such as "jpeg" and "mp3". The descriptive metadata about the image will be relevant no matter the manifestation of the image, and as such should not reside along with the technical metadata about the manifestation. To support this separation, we have introduced the concept of File objects. For more details see DomsFileHandling.

A File object is an object that contains a link to the file (in Bitstorage) and the technical metadata about this file. Only File objects are allowed to reference a file in Bitstorage. All File objects must have a Content Model that extends ContentModel_File.

Objects of ContentModel_File can have "doms-relations:hasOriginal" relations to other objects of "doms:ContentModel_File". If a file A is the result of a migration of file B, and both are in DOMS, the data object for File A will have a "doms-relations:hasOriginal" relation to the data object for File B.

Objects of doms:ContentModel_File must have the datastreams "CHARACTERISATION", "CONTENTS" and "ORIGIN". Each of these deserves some description.
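The datastream requirement above can be sketched as a simple set check. This is only an illustration: the function name `missing_datastreams` and the idea of passing in a plain list of datastream IDs are assumptions, not part of DOMS or Fedora.

```python
# Required datastream IDs, taken from the paragraph above.
REQUIRED_DATASTREAMS = {"CHARACTERISATION", "CONTENTS", "ORIGIN"}


def missing_datastreams(present_ids):
    """Return the required datastream IDs that are absent, sorted by name.

    `present_ids` is any iterable of datastream IDs found on a File object
    (hypothetical input; how the IDs are fetched is out of scope here).
    """
    return sorted(REQUIRED_DATASTREAMS - set(present_ids))


if __name__ == "__main__":
    # An object that lacks CHARACTERISATION would be reported as incomplete.
    print(missing_datastreams(["DC", "CONTENTS", "ORIGIN"]))
```

An empty result means the object carries every datastream the content model asks for; extra datastreams such as "DC" are ignored.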

=== CONTENTS ===

"CONTENTS" is where the actual file is. It will always have the "E" control group, meaning that the file is externally referenced. The reference must always be to a file in Bitstorage, and only this datastream is ever allowed to reference files. The datastream itself is just a URL, but if you access it through [:Fedora_3.0_API#anchor_getDatastreamDissemination:"getDatastreamDissemination"] in the Fedora API-A, you get the contents of the file.

The CONTENTS datastream must contain a <foxml:ContentDigest type="MD5" value=""/> element with the correct checksum for the referenced file. This lets us detect whether the referenced file has been altered by some other means.
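As a rough sketch of the check this enables, the following compares the MD5 of a file's bytes with a recorded digest. The helper names `md5_of_stream` and `digest_matches` are hypothetical, and fetching the file from Bitstorage or the digest from the datastream is assumed to happen elsewhere.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=65536):
    """Compute the hex MD5 digest of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def digest_matches(stream, recorded_digest):
    """Return True if the stream's MD5 equals the recorded hex digest.

    The comparison ignores case and surrounding whitespace in the
    recorded value (an assumption about how digests might be stored).
    """
    return md5_of_stream(stream) == recorded_digest.strip().lower()


if __name__ == "__main__":
    data = io.BytesIO(b"example file contents")
    recorded = hashlib.md5(b"example file contents").hexdigest()
    print(digest_matches(data, recorded))
```

Reading in chunks keeps memory use flat for large preservation files; a mismatch would indicate that the file in Bitstorage no longer corresponds to the recorded checksum.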

==== Format URI ====

In order to perform proper digital preservation, we need to store the exact format of each file somehow. The National Archives, Great Britain, have developed the PRONOM scheme, http://www.nationalarchives.gov.uk/pronom. For each file format, or version thereof, in the registry, they have a signature file that enables their tool (DROID) to identify files of this type.

We selected PRONOM because it can currently identify all the relevant preservation file formats used in DOMS, and because it was the closest we could find to a uniformly accepted standard.

Fedora provides a way to store the format URI of datastream contents in the datastream header, in a tag called "formatURI". There is, of course, a URI version of the PRONOM IDs, described in [http://www.nationalarchives.gov.uk/aboutapps/pronom/puid.htm] and [http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:pronom/].
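A minimal sketch of the mapping between a PRONOM identifier and its URI form, assuming the "info:pronom/" prefix seen in the example format URI "info:pronom/fmt/42" used later on this page. The function names are hypothetical.

```python
# Prefix of the info-URI form of PRONOM identifiers, as used on this page.
PRONOM_INFO_PREFIX = "info:pronom/"


def puid_to_format_uri(puid):
    """Turn a PRONOM identifier such as 'fmt/42' into its info URI form."""
    puid = puid.strip()
    if not puid or puid.startswith(PRONOM_INFO_PREFIX):
        raise ValueError("expected a bare PRONOM identifier, got %r" % puid)
    return PRONOM_INFO_PREFIX + puid


def format_uri_to_puid(format_uri):
    """Recover the bare PRONOM identifier from an info:pronom/ URI."""
    if not format_uri.startswith(PRONOM_INFO_PREFIX):
        raise ValueError("not an info:pronom/ URI: %r" % format_uri)
    return format_uri[len(PRONOM_INFO_PREFIX):]


if __name__ == "__main__":
    print(puid_to_format_uri("fmt/42"))
```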

As for validation, in the DS-COMPOSITE datastream (described in more detail in FedoraTypeChecking) you can place a number of <form FORMAT_URI="uri"/> tags. The validator then requires that the datastream has one of the format URIs thus specified.
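The rule can be illustrated with a small sketch that reads the allowed format URIs for one datastream out of a DS-COMPOSITE document. This is not the DOMS validator: the function names are hypothetical, the sample document and its two URIs are placeholders, and treating a datastream with no FORMAT_URI forms as unrestricted is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace of the Fedora DS-COMPOSITE model.
DS_COMPOSITE_NS = "info:fedora/fedora-system:def/dsCompositeModel#"

# Placeholder DS-COMPOSITE content for illustration only.
EXAMPLE_DS_COMPOSITE = """
<dsCompositeModel xmlns="info:fedora/fedora-system:def/dsCompositeModel#">
    <dsTypeModel ID="CONTENTS">
        <form FORMAT_URI="info:pronom/fmt/42"/>
        <form FORMAT_URI="info:pronom/fmt/43"/>
    </dsTypeModel>
</dsCompositeModel>
"""


def allowed_format_uris(ds_composite_xml, datastream_id):
    """Collect the FORMAT_URI values declared for the given datastream ID."""
    root = ET.fromstring(ds_composite_xml.strip())
    allowed = set()
    for type_model in root.findall(f"{{{DS_COMPOSITE_NS}}}dsTypeModel"):
        if type_model.get("ID") != datastream_id:
            continue
        for form in type_model.findall(f"{{{DS_COMPOSITE_NS}}}form"):
            uri = form.get("FORMAT_URI")
            if uri:
                allowed.add(uri)
    return allowed


def format_uri_is_allowed(ds_composite_xml, datastream_id, format_uri):
    """Return True if format_uri is among the declared ones.

    Assumption: when no FORMAT_URI forms are declared, any URI passes.
    """
    allowed = allowed_format_uris(ds_composite_xml, datastream_id)
    return not allowed or format_uri in allowed


if __name__ == "__main__":
    print(format_uri_is_allowed(EXAMPLE_DS_COMPOSITE, "CONTENTS", "info:pronom/fmt/42"))
```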

==== ORIGIN ====

This section depends on input from Elsebeth, so it is intentionally left blank.

==== CHARACTERISATION ====

When a file is uploaded to Bitstorage (see Bitstorage_API), it is automatically subjected to a series of characterisation tools. Aside from extracting the PRONOM ID, which should be stored in the "PRONOMID" datastream, they also extract a lot of other metadata. This metadata is highly dependent on the type of file in question, so parsing it in a general context is not feasible. Instead, a datastream has been made to hold it, so that it can be parsed at a later date if the particular file becomes interesting.

There is no schema for the actual metadata; we have only defined a schema for the containing structure. The metadata must be well-formed XML, but it does not need to be schema-validated.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            targetNamespace="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"
            xmlns="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">

    <xsd:element name="characterisation" type="characterisationType"/>

    <xsd:complexType name="characterisationType">
        <xsd:sequence>
            <xsd:element name="characterisationRun"
                         minOccurs="1"
                         maxOccurs="unbounded"
                         type="characterisationRunType"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="characterisationRunType">
        <xsd:sequence>
            <xsd:element name="tool" type="xsd:string" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="formatURI" type="xsd:anyURI" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="valid" type="xsd:boolean" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="output" type="outputType" minOccurs="0" maxOccurs="1"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="outputType">
        <xsd:sequence>
            <xsd:any namespace="##any" processContents="skip" minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>

</xsd:schema>

The characterisation datastream should look like this.

<c:characterisation xmlns:c="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#">
    <c:characterisationRun>
        <c:tool>JHove</c:tool>
        <c:formatURI>info:pronom/fmt/42</c:formatURI>
        <c:valid>true</c:valid>
        <c:output></c:output>
    </c:characterisationRun>
</c:characterisation>

This datastream declares that the file has been characterised by the tool JHove, that the format of the file is info:pronom/fmt/42, and that the file is valid with respect to this format. In addition, there is an empty block of output from the characterisation program.
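To show how such a datastream might be read back later, here is a minimal sketch using Python's standard XML parser. The function name `read_characterisation_runs` and the dictionary layout are assumptions for illustration; the tool-specific content of the output element is deliberately left unparsed, in line with the reasoning above.

```python
import xml.etree.ElementTree as ET

# Namespace of the characterisation container schema (version 0/2).
CHAR_NS = {"c": "http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"}

# The example datastream from this page.
EXAMPLE_CHARACTERISATION = """
<c:characterisation xmlns:c="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#">
    <c:characterisationRun>
        <c:tool>JHove</c:tool>
        <c:formatURI>info:pronom/fmt/42</c:formatURI>
        <c:valid>true</c:valid>
        <c:output></c:output>
    </c:characterisationRun>
</c:characterisation>
"""


def read_characterisation_runs(xml_text):
    """Return one dict per characterisationRun with tool, formatURI and valid."""
    root = ET.fromstring(xml_text.strip())
    runs = []
    for run in root.findall("c:characterisationRun", CHAR_NS):
        valid_text = run.findtext("c:valid", default="", namespaces=CHAR_NS)
        runs.append({
            "tool": run.findtext("c:tool", namespaces=CHAR_NS),
            "formatURI": run.findtext("c:formatURI", namespaces=CHAR_NS),
            # xsd:boolean allows the lexical forms true/false and 1/0.
            "valid": valid_text.strip() in ("true", "1"),
        })
    return runs


if __name__ == "__main__":
    for run in read_characterisation_runs(EXAMPLE_CHARACTERISATION):
        print(run)
```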

- Revision 1 as of 2008-10-01 10:44:18
  Size: 5851
  Editor: abr
  Comment:
+ Revision 8 as of 2008-10-15 09:10:20
  Size: 6210
  Editor: abr
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
+== doms:ContentModel_File ==
Extends [:DataModel/ContentModel_DOMS: doms:ContentModel_DOMS]
-Line 2:
+Line 4:
-=== doms:ContentModel_File ===
Extends ContentModel_DOMS
+In DOMS, we have found it beneficial to separate the abstract concept of "Image" or "Audio" from the concrete implementations such as "jpeg" and "mp3". The descriptive metadata about the image will be relevant no matter the manifestation of the image, and as such should not reside along with the technical metadata about the manifestation. To support this separation, we have introduced the concept of File objects. For more details see DomsFileHandling.
-Line 5:
+Line 6:
+A File object is an object, that contains a link to the file (in Bitstorage), and the technical metadata about this file. Only File objects are allowed to reference a file in Bitstorage. File objects must all have a Content Model that extends !ContentModel_File.
-Line 6:
+Line 8:
-In DOMS, we have found it beneficial to separate the abstract concept of "Image" or "Audio" from the concrete implementations such as "jpeg" and "mp3". The metadata about the image will be relevant no matter the manifestation of the image, and as such should not reside along with the technical metadata about the manifestation. To support this separation, we have introduced the concept of File objects.
+Objects of !ContentModel_File can have "doms-relations:hasOriginal" relations to other objects of "doms:!ContentModel_File". If a file A is the result of a migration of file B  and both are in Doms, the File A data object will have a "doms-relations:hasOriginal" relation to the data object for File B.
-Line 8:
+Line 10:
-A File object is an object, that contain a link to the file (in Bitstorage), and the technical metadata about this file. Only File objects are allowed to reference a file in Bitstorage. File objects must all have a Content Model that extends !ContentModel_File.
+Objects of doms:!ContentModel_File must have the datastreams "CHARACTERISATION", "CONTENTS" and "ORIGIN". Each of these deserve some description.
-Line 10:
+Line 12:
-The ONTOLOGY datastream from !ContentModel_File
{{{
<rdf:RDF
        xmlns:owl="http://www.w3.org/2002/07/owl#"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xml:base="http://doms.statsbiblioteket.dk/relations/default/0/1/#">
+=== CONTENTS ===
"CONTENTS" is where the actual file is. It will always have the the "E" controlgroup, meaning that the file is externally referenced. The reference must always be to a File in Bitstorage, and only this datastream is ever allowed to reference files.
The datastream itself is just an URL, but if you access it through [:Fedora_3.0_API#anchor_getDatastreamDissemination:"getDatastreamDissemination"] in the Fedora API-A, you get the contents of the file.
-Line 18:
+Line 16:
-    <owl:Class rdf:about="info:fedora/doms:ContentModel_File">
+The CONTENTS datastream must contain a "<foxml:ContentDigest type="MD5" value=""/> element, with the correct checksum for the referenced file. Given that, we can ensure that the referenced file has not been altered by some other means.
-Line 20:
+Line 18:
-        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="hasOriginal"/>
                <owl:allValuesFrom rdf:resource="info:fedora/doms:ContentModel_File"/>
            </owl:Restriction>
        </rdfs:subClassOf>

    </owl:Class>

    <owl:ObjectProperty rdf:about="hasOriginal"/>

</rdf:RDF>
}}}
In human readable format, it defines that objects of !ContentModel_File can have "doms-relations:hasOriginal" relations to other objects of "doms:!ContentModel_File". If a file A is the result of a migration of file B  and both are in Doms, the File A data object will have a "doms-relations:hasOriginal" relation to the data object for File B.


The DS-COMPOSITE datastream 
{{{
<dsCompositeModel
        xmlns="info:fedora/fedora-system:def/dsCompositeModel#"
        xmlns:schema="http://doms.statsbiblioteket.dk/types/dscompositeschema/0/1/#">

    <dsTypeModel ID="CHARACTERISATION">
        <form MIME="text/xml"/>
        <extensions name="DOMS">
            <schema:schema type="xsd" datastream="CHARACTERISATION_SCHEMA"/>
        </extensions>
    </dsTypeModel>

    <dsTypeModel ID="CONTENTS"/>

    <dsTypeModel ID="ORIGIN">
        <form MIME="text/xml"/>
        <extensions name="DOMS">
            <schema:schema type="xsd" datastream="ORIGIN_SCHEMA"/>
        </extensions>
    </dsTypeModel>

    <dsTypeModel ID="PRONOMID">
        <form MIME="text/xml"/>
        <extensions name="DOMS">
            <schema:schema type="xsd" datastream="PRONOMID_SCHEMA"/>
        </extensions>
    </dsTypeModel>

</dsCompositeModel>
}}}
This specifies that data objects of doms:!ContentModel_File must have the datastream "CHARACTERISATION", "CONTENTS", "ORIGIN" and "PRONOMID". Each of these deserve some description.

==== CONTENTS ====
"CONTENTS" is where the actual file is. It will always have the the "E" controlgroup, meaning that the file is externally referenced. The reference must allways be to a File in Bitstorage, and only this datastream is ever allowed to reference files.

If you get the datastream through the standard API, you get the contents of the file, not the link.


==== PRONOMID ====
In order to perform proper digital preservation, we need to store the exact format of each file somehow. The National Archives, Great Britain have developed the PRONOM scheme, http://www.nationalarchives.gov.uk/pronom
+==== Format URI ====
In order to perform proper digital preservation, we need to store the exact format of each file somehow. The National Archives, Great Britain, have developed the PRONOM scheme, http://www.nationalarchives.gov.uk/pronom
-Line 81:
+Line 24:
-The schema from the PRONOMID_SCHEMA datastream. 
{{{
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance#"
            targetNamespace="http://doms.statsbiblioteket.dk/properties/pronomID/0/1/#"
            xmlns="http://doms.statsbiblioteket.dk/properties/pronomID/0/1/#"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">
+Fedora provide a way to store the format uri of datastream contents in the datastream header, in a tag called "formatURI". There exist, of course, a URI version of the pronom IDs, described in [http://www.nationalarchives.gov.uk/aboutapps/pronom/puid.htm] and [http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:pronom/].
-Line 90:
+Line 26:
-    <xsd:element name="pronomID" type="extpropertiesType"/>

    <xsd:complexType name="extpropertiesType">
        <xsd:attribute name="value" type="xsd:string"/>
    </xsd:complexType>
</xsd:schema>
}}}
+As for validation, in the DS-COMPOSITE datastream, described in more detail in FedoraTypeChecking, you can place a number of <form FORMAT_URI="uri"/> tags. The validator requires that the datastream will have one of the thus specified format uris.
-Line 101:
+Line 31:
+This section depends on input from Elsebeth, so it is intentionally left blank.
-Line 103:
+Line 35:
+When a file is uploaded to Bitstorage (see Bitstorage_API), it is automatically subjected to a series of characterisation tools. Aside from extraction the PRONOM ID, which should be stored in the "PRONOMID" datastream, they also extract a lot of other metadata. This metadata is highly dependant on the type of file in question, so parsing it in a general context is not feasable. Instead, a datastream has been made to hold it, so that it can be parsed at a later date, if the particular file becomes interesting.
-Line 104:
+Line 37:
-Requirements for objects described by !ContentModel_File
 * ObjectProperties
  * External Properties
   * http://doms.statsbiblioteket.dk/extproperties/#pronomID : The pronom ID of the file
 * Datastreams
  * CHARACTERISATION: The output of the characterisation tools. Schema attachment:Characterisation.xsd
  * CONTENTS: Datastream containing the file
   * !ContentLocation URL = The file in Bitstorage
  * ORIGIN: Metadata about the creation of the file, in the Premis [http://www.loc.gov/standards/premis/v1/Event-v1-1.xsd schema]
+There is no schema for the actual metadata. We have defined a schema for the containing structure. The metadata must be valid xml, but it does not need to be schema validated.
-Line 114:
+Line 39:
+{{{
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            targetNamespace="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"
            xmlns="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">
-Line 115:
+Line 47:
-The characterisation datastream could look like this.
+    <xsd:element name="characterisation" type="characterisationType"/>

    <xsd:complexType name="characterisationType">
        <xsd:sequence>
            <xsd:element name="characterisationRun"
                         minOccurs="1"
                         maxOccurs="unbounded"
                         type="characterisationRunType"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="characterisationRunType">
        <xsd:sequence>
            <xsd:element name="tool" type="xsd:string" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="formatURI" type="xsd:anyURI" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="valid" type="xsd:boolean" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="output" type="outputType" minOccurs="0" maxOccurs="1"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="outputType">
        <xsd:sequence>
            <xsd:any namespace="##any" processContents="skip" minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>

</xsd:schema>
}}}

The characterisation datastream should look like this.
-Line 117:
+Line 78:
-<?xml version="1.0" encoding="UTF-8"?>
<char:characterisation xsi:schemaLocation="http://doms.statsbiblioteket.dk/types/characterisation/0/1/# http://developer.statsbiblioteket.dk/DOMS/types/characterisation/0/1/characterisation/characterisation-0-1.xsd"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:char="http://developer.statsbiblioteket.dk/DOMS/types/characterisation/0/1/#"
  xmlns:jhove="">
  <char:characterisationRun>
    <char:tool>JHove</char:tool>
    <char:output>
      <jhove:...>
      </jhove:...>
    </char:output>
  </char:characterisationRun>
</char:characterisation>
+<c:characterisation xmlns:c="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#">
    <c:characterisationRun>
        <c:tool>JHove</c:tool>
        <c:formatURI>info:pronom/fmt/42</c:formatURI>
        <c:valid>true</c:valid>
        <c:output></c:output>
    </c:characterisationRun>
</c:characterisation>
-Line 130:
+Line 87:
+This datastream declares that the file has been characterized by the tool JHove, that the format of the file is info:pronom/fmt/42, and that the file is valid in regards to this format. In addition, there is an empty block of output from the characterisation program.
