Differences between revisions 7 and 8
Revision 7 as of 2008-10-02 14:15:34
Size: 5721
Editor: abr
Comment:
Revision 8 as of 2008-10-15 09:10:20
Size: 6210
Editor: abr
Comment:
Line 4: Line 4:
In DOMS, we have found it beneficial to separate the abstract concept of "Image" or "Audio" from the concrete implementations such as "jpeg" and "mp3". The metadata about the image will be relevant no matter the manifestation of the image, and as such should not reside along with the technical metadata about the manifestation. To support this separation, we have introduced the concept of File objects. For more details see DomsFileHandling. In DOMS, we have found it beneficial to separate the abstract concept of "Image" or "Audio" from the concrete implementations such as "jpeg" and "mp3". The descriptive metadata about the image will be relevant no matter the manifestation of the image, and as such should not reside along with the technical metadata about the manifestation. To support this separation, we have introduced the concept of File objects. For more details see DomsFileHandling.
Line 6: Line 6:
A File object is an object, that contain a link to the file (in Bitstorage), and the technical metadata about this file. Only File objects are allowed to reference a file in Bitstorage. File objects must all have a Content Model that extends !ContentModel_File. A File object is an object, that contains a link to the file (in Bitstorage), and the technical metadata about this file. Only File objects are allowed to reference a file in Bitstorage. File objects must all have a Content Model that extends !ContentModel_File.
Line 10: Line 10:
Objects of doms:!ContentModel_File must have the datastream "CHARACTERISATION", "CONTENTS", "ORIGIN" and "PRONOMID". Each of these deserve some description. Objects of doms:!ContentModel_File must have the datastreams "CHARACTERISATION", "CONTENTS" and "ORIGIN". Each of these deserve some description.
Line 12: Line 12:
==== CONTENTS ====
"CONTENTS" is where the actual file is. It will always have the the "E" controlgroup, meaning that the file is externally referenced. The reference must allways be to a File in Bitstorage, and only this datastream is ever allowed to reference files.
The datastream itself is just an URL, but if you access it through the API-A, you get the contents of the file.
=== CONTENTS ===
"CONTENTS" is where the actual file is. It will always have the the "E" controlgroup, meaning that the file is externally referenced. The reference must always be to a File in Bitstorage, and only this datastream is ever allowed to reference files.
The datastream itself is just an URL, but if you access it through [:Fedora_3.0_API#anchor_getDatastreamDissemination:"getDatastreamDissemination"] in the Fedora API-A, you get the contents of the file.
Line 16: Line 16:
The CONTENTS datastream must contain a "<foxml:ContentDigest type="MD5" value=""/> element, with the correct checksum for the referenced file. Given that, we can ensure that the referenced file has not been altered by some other means.
Line 17: Line 18:
==== PRONOMID ====
In order to perform proper digital preservation, we need to store the exact format of each file somehow. The National Archives, Great Britain have developed the PRONOM scheme, http://www.nationalarchives.gov.uk/pronom
==== Format URI ====
In order to perform proper digital preservation, we need to store the exact format of each file somehow. The National Archives, Great Britain, have developed the PRONOM scheme, http://www.nationalarchives.gov.uk/pronom
Line 23: Line 24:
The schema for this datastream is:
{{{
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance#"
            targetNamespace="http://doms.statsbiblioteket.dk/properties/pronomID/0/1/#"
            xmlns="http://doms.statsbiblioteket.dk/properties/pronomID/0/1/#"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">
Fedora provide a way to store the format uri of datastream contents in the datastream header, in a tag called "formatURI". There exist, of course, a URI version of the pronom IDs, described in [http://www.nationalarchives.gov.uk/aboutapps/pronom/puid.htm] and [http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:pronom/].
Line 32: Line 26:
    <xsd:element name="pronomID" type="extpropertiesType"/>

    <xsd:complexType name="extpropertiesType">
        <xsd:attribute name="value" type="xsd:string"/>
    </xsd:complexType>
</xsd:schema>
}}}
As for validation, in the DS-COMPOSITE datastream, described in more detail in FedoraTypeChecking, you can place a number of <form FORMAT_URI="uri"/> tags. The validator requires that the datastream will have one of the thus specified format uris.
Line 44: Line 31:
This section depends on input from Elsebeth, so it has intentionally left blank. This section depends on input from Elsebeth, so it is intentionally left blank.
Line 48: Line 35:
When a file is uploaded to Bitstorage (see Bitstorage_API), it is automatically subjected to a series of characterisation tools. Aside from extraction the PRONOM ID, which should be stored in the "PRONOMID" datastream, they also extract a lot of other metadata. This metadata is highly dependant on the type of file in question, so parsing it in a general context is not feasable. Instead, a datastream have been made to hold it, so that it can be parsed at a later date, if the particular file becomes interesting. When a file is uploaded to Bitstorage (see Bitstorage_API), it is automatically subjected to a series of characterisation tools. Aside from extraction the PRONOM ID, which should be stored in the "PRONOMID" datastream, they also extract a lot of other metadata. This metadata is highly dependant on the type of file in question, so parsing it in a general context is not feasable. Instead, a datastream has been made to hold it, so that it can be parsed at a later date, if the particular file becomes interesting.
Line 55: Line 42:
            targetNamespace="http://doms.statsbiblioteket.dk/types/characterisation/0/1/#"
            xmlns="http://doms.statsbiblioteket.dk/types/characterisation/0/1/#"
            targetNamespace="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"
            xmlns="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"
Line 74: Line 61:
            <xsd:element name="output" type="outputType" maxOccurs="1"/>             <xsd:element name="formatURI" type="xsd:anyURI" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="valid" type="xsd:boolean" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="output" type="outputType" minOccurs="0" maxOccurs="1"/>
Line 80: Line 69:
            <xsd:any namespace="##any" processContents="skip" maxOccurs="unbounded"/>             <xsd:any namespace="##any" processContents="skip" minOccurs="0" maxOccurs="unbounded"/>
Line 83: Line 72:
Line 86: Line 76:
The characterisation datastream could look like this. The characterisation datastream should look like this.
Line 88: Line 78:
<?xml version="1.0" encoding="UTF-8"?>
<char:characterisation xsi:schemaLocation="http://doms.statsbiblioteket.dk/types/characterisation/0/1/# http://developer.statsbiblioteket.dk/DOMS/types/characterisation/0/1/characterisation/characterisation-0-1.xsd"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:char="http://developer.statsbiblioteket.dk/DOMS/types/characterisation/0/1/#"
  xmlns:jhove="">
  <char:characterisationRun>
    <char:tool>JHove</char:tool>
    <char:output>
      <jhove:...>
      </jhove:...>
    </char:output>
  </char:characterisationRun>
</char:characterisation>
<c:characterisation xmlns:c="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#">
    <c:characterisationRun>
        <c:tool>JHove</c:tool>
        <c:formatURI>info:pronom/fmt/42</c:formatURI>
        <c:valid>true</c:valid>
        <c:output></c:output>
    </c:characterisationRun>
</c:characterisation>
Line 101: Line 87:
This datastream declares that the file has been characterized by the tool JHove, that the format of the file is info:pronom/fmt/42, and that the file is valid in regards to this format. In addition, there is an empty block of output from the characterisation program.

doms:ContentModel_File

Extends [:DataModel/ContentModel_DOMS: doms:ContentModel_DOMS]

In DOMS, we have found it beneficial to separate the abstract concept of "Image" or "Audio" from the concrete implementations such as "jpeg" and "mp3". The descriptive metadata about the image will be relevant no matter the manifestation of the image, and as such should not reside along with the technical metadata about the manifestation. To support this separation, we have introduced the concept of File objects. For more details see DomsFileHandling.

A File object is an object that contains a link to the file (in Bitstorage) and the technical metadata about this file. Only File objects are allowed to reference a file in Bitstorage. All File objects must have a Content Model that extends ContentModel_File.

Objects of ContentModel_File can have "doms-relations:hasOriginal" relations to other objects of "doms:ContentModel_File". If file A is the result of a migration of file B and both are in DOMS, the data object for file A will have a "doms-relations:hasOriginal" relation to the data object for file B.

Objects of doms:ContentModel_File must have the datastreams "CHARACTERISATION", "CONTENTS" and "ORIGIN". Each of these deserves some description.

CONTENTS

"CONTENTS" is where the actual file is. It will always have the "E" controlgroup, meaning that the file is externally referenced. The reference must always be to a file in Bitstorage, and only this datastream is ever allowed to reference files. The datastream itself is just a URL, but if you access it through [:Fedora_3.0_API#anchor_getDatastreamDissemination:"getDatastreamDissemination"] in the Fedora API-A, you get the contents of the file.

The CONTENTS datastream must contain a <foxml:ContentDigest type="MD5" value=""/> element with the correct checksum for the referenced file. With the checksum recorded, we can detect whether the referenced file has been altered by some other means.
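The checksum comparison can be sketched as follows. This is only an illustration using the Python standard library: the function names are hypothetical, and how the recorded value is read from the datastream is left out on purpose.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the lowercase hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def matches_recorded_digest(data: bytes, recorded: str) -> bool:
    """Compare a freshly computed MD5 with a recorded checksum.

    `recorded` stands for the value stored on the CONTENTS datastream
    (an assumption of this sketch); case and surrounding whitespace
    are ignored in the comparison.
    """
    return md5_hex(data) == recorded.strip().lower()
```

If the two values differ, the file in Bitstorage is no longer the file that was originally registered.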

Format URI

In order to perform proper digital preservation, we need to store the exact format of each file somehow. The National Archives, Great Britain, have developed the PRONOM scheme, http://www.nationalarchives.gov.uk/pronom. For each file format, or version thereof, in the registry, they have a signature file that enables their tool (DROID) to identify files of this type.

We selected PRONOM because it can currently identify all the relevant preservation file formats used in DOMS, and because it was the closest we could find to a uniformly accepted standard.

Fedora provides a way to store the format URI of datastream contents in the datastream header, in a tag called "formatURI". There exists, of course, a URI version of the PRONOM IDs, described in [http://www.nationalarchives.gov.uk/aboutapps/pronom/puid.htm] and [http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:pronom/].
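The mapping between a PRONOM ID and its URI form is a plain prefix, as the "info:pronom/fmt/42" value used in the CHARACTERISATION example suggests. A minimal sketch, assuming IDs of the form "fmt/N" or "x-fmt/N"; the function names are hypothetical and not part of DOMS:

```python
INFO_PREFIX = "info:pronom/"


def puid_to_uri(puid: str) -> str:
    """Turn a PRONOM ID such as 'fmt/42' into its info URI form."""
    puid = puid.strip()
    if not (puid.startswith("fmt/") or puid.startswith("x-fmt/")):
        raise ValueError("not a PRONOM ID: %r" % puid)
    return INFO_PREFIX + puid


def uri_to_puid(uri: str) -> str:
    """Turn an info URI such as 'info:pronom/fmt/42' back into 'fmt/42'."""
    if not uri.startswith(INFO_PREFIX):
        raise ValueError("not a PRONOM info URI: %r" % uri)
    return uri[len(INFO_PREFIX):]
```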

As for validation, the DS-COMPOSITE datastream, described in more detail in FedoraTypeChecking, can hold a number of <form FORMAT_URI="uri"/> tags. The validator requires the datastream to have one of the format URIs specified this way.
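The check described above could be sketched like this. It is not the DOMS validator: the wrapper elements in the sample document are assumptions made for illustration, and the sketch collects every form tag in the document without distinguishing which datastream it belongs to.

```python
import xml.etree.ElementTree as ET

# Illustrative DS-COMPOSITE content; only the <form FORMAT_URI="..."/>
# tags are taken from the description, the wrappers are assumed.
EXAMPLE = """<dsCompositeModel>
  <dsTypeModel ID="CONTENTS">
    <form FORMAT_URI="info:pronom/fmt/42"/>
    <form FORMAT_URI="info:pronom/fmt/43"/>
  </dsTypeModel>
</dsCompositeModel>"""


def allowed_format_uris(ds_composite_xml: str) -> set:
    """Collect the FORMAT_URI value of every form tag in the document."""
    root = ET.fromstring(ds_composite_xml)
    uris = set()
    for element in root.iter():
        # Strip a "{namespace}" prefix, if any, to get the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "form" and "FORMAT_URI" in element.attrib:
            uris.add(element.attrib["FORMAT_URI"])
    return uris


def format_is_allowed(format_uri: str, ds_composite_xml: str) -> bool:
    """Return True if the format URI is one of the listed ones."""
    return format_uri in allowed_format_uris(ds_composite_xml)
```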

ORIGIN

This section depends on input from Elsebeth, so it is intentionally left blank.

CHARACTERISATION

When a file is uploaded to Bitstorage (see Bitstorage_API), it is automatically subjected to a series of characterisation tools. Aside from extracting the PRONOM ID, which should be stored in the "PRONOMID" datastream, they also extract a lot of other metadata. This metadata is highly dependent on the type of file in question, so parsing it in a general context is not feasible. Instead, a datastream has been made to hold it, so that it can be parsed at a later date if the particular file becomes interesting.

There is no schema for the actual metadata, only for the structure that contains it. The metadata must be well-formed XML, but it does not need to be schema validated.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            targetNamespace="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"
            xmlns="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">

    <xsd:element name="characterisation" type="characterisationType"/>

    <xsd:complexType name="characterisationType">
        <xsd:sequence>
            <xsd:element name="characterisationRun"
                         minOccurs="1"
                         maxOccurs="unbounded"
                         type="characterisationRunType"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="characterisationRunType">
        <xsd:sequence>
            <xsd:element name="tool" type="xsd:string" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="formatURI" type="xsd:anyURI" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="valid" type="xsd:boolean" maxOccurs="1" minOccurs="1"/>
            <xsd:element name="output" type="outputType" minOccurs="0" maxOccurs="1"/>
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="outputType">
        <xsd:sequence>
            <xsd:any namespace="##any" processContents="skip" minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>

</xsd:schema>

The characterisation datastream should look like this.

<c:characterisation xmlns:c="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#">
    <c:characterisationRun>
        <c:tool>JHove</c:tool>
        <c:formatURI>info:pronom/fmt/42</c:formatURI>
        <c:valid>true</c:valid>
        <c:output></c:output>
    </c:characterisationRun>
</c:characterisation>

This datastream declares that the file has been characterised by the tool JHove, that the format of the file is info:pronom/fmt/42, and that the file is valid with regard to this format. In addition, there is an empty block of output from the characterisation program.
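Reading these values back out of the datastream could look roughly like the sketch below. It uses only the Python standard library; the Run type and the parse_runs function are hypothetical names, and the tool output block is ignored.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple

NS = {"c": "http://doms.statsbiblioteket.dk/types/characterisation/0/2/#"}

SAMPLE = """<c:characterisation xmlns:c="http://doms.statsbiblioteket.dk/types/characterisation/0/2/#">
    <c:characterisationRun>
        <c:tool>JHove</c:tool>
        <c:formatURI>info:pronom/fmt/42</c:formatURI>
        <c:valid>true</c:valid>
        <c:output></c:output>
    </c:characterisationRun>
</c:characterisation>"""


class Run(NamedTuple):
    tool: str
    format_uri: str
    valid: bool


def parse_runs(xml_text: str) -> List[Run]:
    """Extract tool, format URI and validity from each characterisationRun."""
    root = ET.fromstring(xml_text)
    runs = []
    for run in root.findall("c:characterisationRun", NS):
        valid_text = run.findtext("c:valid", default="", namespaces=NS).strip()
        runs.append(Run(
            tool=run.findtext("c:tool", default="", namespaces=NS),
            format_uri=run.findtext("c:formatURI", default="", namespaces=NS),
            # xsd:boolean accepts "true"/"1" as true and "false"/"0" as false.
            valid=valid_text in ("true", "1"),
        ))
    return runs
```

Since the schema allows several characterisationRun elements, the function returns a list with one entry per tool run.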

DataModel/ContentModel File (last edited 2010-03-17 13:12:54 by localhost)