Diff for "DataModel/ContentModel File"

Differences between revisions 13 and 14

doms:ContentModel_File

In DOMS, we have found it beneficial to separate the abstract concept of "Image" or "Audio" from the concrete implementations such as "jpeg" and "mp3". The descriptive metadata about the image will be relevant no matter the manifestation of the image, and as such should not reside along with the technical metadata about the manifestation. To support this separation, we have introduced the concept of File objects. For more details see DomsFileHandling.

A File object contains a link to the file (in Bitstorage) and the technical metadata about that file. Only File objects are allowed to reference a file in Bitstorage. All File objects must have a Content Model that extends ContentModel_File.

Objects of "doms:ContentModel_File" can have "doms-relations:hasOriginal" relations to other objects of "doms:ContentModel_File". If a file A is the result of a migration of file B and both are in DOMS, the data object for File A will have a "doms-relations:hasOriginal" relation to the data object for File B.

Objects of doms:ContentModel_File must have the datastreams "CHARACTERISATION", "CONTENTS" and "ORIGIN". Each of these deserves some description.

"CONTENTS" is where the actual file is. It will always have the "E" control group, meaning that the file is externally referenced. The reference must always be to a file in Bitstorage, and only this datastream is ever allowed to reference files. The datastream itself is just a URL, but if you access it through "getDatastreamDissemination" in the Fedora API-A, you get the contents of the file.

The CONTENTS datastream must contain a <foxml:contentDigest TYPE="MD5" DIGEST=""/> element with the correct checksum for the referenced file. This lets us verify that the referenced file has not been altered by some other means.
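A minimal sketch of such a check, assuming the datastream version XML and the file bytes are already in hand. Only the foxml:contentDigest element with its TYPE and DIGEST attributes comes from the text above; the function name digest_matches, the sample file, and the example.org URL are hypothetical and not part of DOMS.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace of FOXML, the Fedora object XML format.
FOXML_NS = "info:fedora/fedora-system:def/foxml#"


def digest_matches(datastream_version_xml: str, file_bytes: bytes) -> bool:
    """Return True if the recorded MD5 digest equals the MD5 of file_bytes."""
    root = ET.fromstring(datastream_version_xml.strip())
    digest_el = root.find(f"{{{FOXML_NS}}}contentDigest")
    # A missing digest, or a digest of another type, counts as a failed check.
    if digest_el is None or digest_el.get("TYPE") != "MD5":
        return False
    recorded = (digest_el.get("DIGEST") or "").strip().lower()
    return recorded == hashlib.md5(file_bytes).hexdigest()


# Hypothetical sample: a tiny "file" and a datastream version that records its digest.
SAMPLE_FILE = b"hello"
SAMPLE_VERSION = f"""
<foxml:datastreamVersion xmlns:foxml="{FOXML_NS}" ID="CONTENTS.0" MIMETYPE="text/plain">
    <foxml:contentDigest TYPE="MD5" DIGEST="{hashlib.md5(SAMPLE_FILE).hexdigest()}"/>
    <foxml:contentLocation TYPE="URL" REF="http://bitstorage.example.org/hello.txt"/>
</foxml:datastreamVersion>
"""
```

Calling digest_matches(SAMPLE_VERSION, SAMPLE_FILE) succeeds, while any change to the bytes makes the comparison fail.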

Format URI

In order to perform proper digital preservation, we need to store the exact format of each file. The National Archives, Great Britain, have developed the PRONOM scheme (http://www.nationalarchives.gov.uk/pronom). For each file format, or version thereof, in the registry, they have a signature file that enables their tool (DROID) to identify files of this type.

We selected PRONOM because it can currently identify all the relevant preservation file formats used in DOMS, and because it was the closest we could find to a uniformly accepted standard.

Fedora provides a way to store the format URI of datastream contents in the datastream header, in an attribute called "FORMAT_URI". There is a URI version of the PRONOM IDs, described in http://www.nationalarchives.gov.uk/aboutapps/pronom/puid.htm and http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:pronom/.
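As an illustration, the mapping from a PRONOM ID to the value stored in "FORMAT_URI" can be sketched as below. The function name puid_to_format_uri is hypothetical, and the pattern for what counts as a PRONOM ID ("fmt/" or "x-fmt/" followed by digits) is an assumption; the info:pronom/ prefix matches the example URI used later on this page.

```python
import re

# Assumed shape of a PRONOM ID: "fmt/42" or "x-fmt/111".
_PUID_PATTERN = re.compile(r"(?:x-)?fmt/\d+")


def puid_to_format_uri(puid: str) -> str:
    """Turn a PRONOM ID such as "fmt/42" into an info: URI."""
    if not _PUID_PATTERN.fullmatch(puid):
        raise ValueError(f"not a recognised PRONOM ID: {puid!r}")
    return f"info:pronom/{puid}"
```

For example, puid_to_format_uri("fmt/42") gives "info:pronom/fmt/42", and a string that does not look like a PRONOM ID raises ValueError.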

As for validation, in the DS-COMPOSITE datastream, described in more detail in FedoraTypeChecking, you can place a number of <form FORMAT_URI="uri"/> tags. The validator then requires the datastream to have one of the format URIs specified there.
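The rule can be sketched roughly as follows. This is not the DOMS validator, only an illustration of the idea. The form element and its FORMAT_URI attribute come from the text above; the surrounding dsCompositeModel and dsTypeModel elements and their namespace are assumed here to follow the Fedora 3 composite model layout, and the function names and the second format URI are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Set

# Assumed namespace of the Fedora datastream composite model.
DS_COMPOSITE_NS = "info:fedora/fedora-system:def/dsCompositeModel#"

# Hypothetical composite model that allows two formats for CONTENTS.
SAMPLE_DS_COMPOSITE = """
<dsCompositeModel xmlns="info:fedora/fedora-system:def/dsCompositeModel#">
    <dsTypeModel ID="CONTENTS">
        <form FORMAT_URI="info:pronom/fmt/42"/>
        <form FORMAT_URI="info:pronom/fmt/43"/>
    </dsTypeModel>
</dsCompositeModel>
"""


def allowed_format_uris(ds_composite_xml: str, datastream_id: str) -> Set[str]:
    """Collect the FORMAT_URI values listed for one datastream ID."""
    root = ET.fromstring(ds_composite_xml.strip())
    allowed = set()
    for type_model in root.iter(f"{{{DS_COMPOSITE_NS}}}dsTypeModel"):
        if type_model.get("ID") != datastream_id:
            continue
        for form in type_model.findall(f"{{{DS_COMPOSITE_NS}}}form"):
            uri = form.get("FORMAT_URI")
            if uri:
                allowed.add(uri)
    return allowed


def format_uri_is_allowed(ds_composite_xml: str, datastream_id: str, format_uri: str) -> bool:
    """True if the datastream's format URI is one of the listed forms."""
    return format_uri in allowed_format_uris(ds_composite_xml, datastream_id)
```

With the sample above, a CONTENTS datastream whose FORMAT_URI is info:pronom/fmt/42 passes, while any URI not listed in a form tag is rejected.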

ORIGIN

This is a reserved datastream that should always be present in file objects, but its contents are expected to differ between the content models that implement specific file formats.

The datastream is reserved for describing the software and/or hardware that created this file.

We will generally use PREMIS for this; see http://www.loc.gov/standards/premis/

The schema for this datastream is thus expected to be a valid subset of http://www.loc.gov/standards/premis/premis.xsd

CHARACTERISATION

When a file is uploaded to Bitstorage (see Bitstorage_API), it is automatically subjected to a series of characterisation tools. Aside from extracting the PRONOM ID, which should be stored in the "FORMAT_URI" attribute of the "CONTENTS" datastream, they also extract a lot of other metadata. This metadata is highly dependent on the type of file in question, so parsing it in a general context is not feasible. Instead, a datastream has been made to hold it, so that it can be parsed at a later date if the particular file becomes interesting.

There is no schema for the actual metadata, but we have defined a schema for the containing structure. The contained metadata must be well-formed XML, but it does not need to be validated against a schema.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            targetNamespace="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"
            xmlns="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">

    <xsd:element name="characterisation" type="characterisationType"/>

    <xsd:complexType name="characterisationType">
        <xsd:sequence>
            <xsd:element name="characterisationRun"
                         minOccurs="1"
                         maxOccurs="unbounded"
                         type="characterisationRunType"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="characterisationRunType">
        <xsd:sequence>
            <xsd:element name="formatURI" type="xsd:anyURI" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="valid" type="xsd:boolean" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="output" type="outputType" minOccurs="0" maxOccurs="1"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="outputType">
        <xsd:sequence>
            <xsd:any namespace="##any" processContents="skip" minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>

</xsd:schema>

The characterisation datastream should look like this:

<c:characterisation xmlns:c="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#">
    <c:characterisationRun>
        <c:formatURI>info:pronom/fmt/42</c:formatURI>
        <c:valid>true</c:valid>
        <c:output></c:output>
    </c:characterisationRun>
</c:characterisation>

This datastream declares that the format of the file is info:pronom/fmt/42, and that the file is valid with regard to this format. In addition, there is an empty block of output from the characterisation program.
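A datastream of this shape could be read with a few lines of Python, sketched here under the assumption that only the containing structure matters and the output block is ignored. The function name read_characterisation is hypothetical. The checks mirror the schema above: at least one characterisationRun, each with a formatURI and a boolean valid element.

```python
import xml.etree.ElementTree as ET

# Target namespace of the characterisation schema.
CHAR_NS = "http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"

SAMPLE_CHARACTERISATION = """
<c:characterisation xmlns:c="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#">
    <c:characterisationRun>
        <c:formatURI>info:pronom/fmt/42</c:formatURI>
        <c:valid>true</c:valid>
        <c:output></c:output>
    </c:characterisationRun>
</c:characterisation>
"""


def read_characterisation(xml_text: str):
    """Return a list of (format_uri, is_valid) pairs, one per characterisationRun."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != f"{{{CHAR_NS}}}characterisation":
        raise ValueError("root element is not a characterisation")
    runs = []
    for run in root.findall(f"{{{CHAR_NS}}}characterisationRun"):
        format_uri = run.findtext(f"{{{CHAR_NS}}}formatURI")
        valid_text = run.findtext(f"{{{CHAR_NS}}}valid")
        if format_uri is None or valid_text is None:
            raise ValueError("characterisationRun needs formatURI and valid")
        valid_text = valid_text.strip()
        # xsd:boolean accepts "true", "false", "1" and "0".
        if valid_text not in ("true", "false", "1", "0"):
            raise ValueError(f"not a boolean: {valid_text!r}")
        runs.append((format_uri.strip(), valid_text in ("true", "1")))
    if not runs:
        raise ValueError("at least one characterisationRun is required")
    return runs
```

Applied to the sample, this yields a single run with the format URI info:pronom/fmt/42 marked as valid.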

-  ⇤ ← Revision 13 as of 2008-10-20 12:10:14 → 
  Size: 6520
  Editor: ThomasSkouHansen
  Comment:
+   ← Revision 14 as of 2010-03-17 13:12:54 → ⇥
  Size: 6526
  Editor: localhost
  Comment: converted to 1.6 markup
 Line 2:
-Extends [:DataModel/ContentModel_DOMS: doms:ContentModel_DOMS]
+Extends [[DataModel/ContentModel DOMS| doms:ContentModel_DOMS]]
 Line 14:
-The datastream itself is just an URL, but if you access it through [:Fedora_3.0_API#anchor_getDatastreamDissemination:"getDatastreamDissemination"] in the Fedora API-A, you get the contents of the file.
+The datastream itself is just an URL, but if you access it through [[Fedora 3.0 API#anchor_getDatastreamDissemination|"getDatastreamDissemination"]] in the Fedora API-A, you get the contents of the file.
 Line 24:
-Fedora provide a way to store the format uri of datastream contents in the datastream header, in an attribute called "FORMAT_URI". There exist, of course, a URI version of the pronom IDs, described in [http://www.nationalarchives.gov.uk/aboutapps/pronom/puid.htm] and [http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:pronom/].
+Fedora provide a way to store the format uri of datastream contents in the datastream header, in an attribute called "FORMAT_URI". There exist, of course, a URI version of the pronom IDs, described in [[http://www.nationalarchives.gov.uk/aboutapps/pronom/puid.htm]] and [[http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:pronom/]].
 Line 43:
-[[Anchor(CharacterizationSchema)]]
+<<Anchor(CharacterizationSchema)>>