Action Decide on Fedora Modules
- Assigned
- ABR: 2 JRG: 2
- Prev assigned
- Tasks addressed
- ["Tasks/3/3"]
- Time estimated
- 4md
- Time used
- 0.5md
- Priority
- ?
- Status
- In progress
- Iteration
- 22
- Notes
Problem
Decide on the layout of the final production system. Decide which supporting fedora modules should be used, and in what way.
Progress
Established a plan for the deployment of the system
Identified external modules
- Journaling
- Akubra
- OAI-Provider
- GSearch
- Triple store
- REST API
- Database
Identified internal modules
- Domsclient
- Bitstorage
- ECM
- GUI
- Summa
System Layout
Used modules
Triple Store
Fedora comes with a triple store, which is not enabled by default. Being able to query the relations between objects is a crucial feature of many DOMS operations. Fedora offers a choice between triple store implementations, namely Mulgara, Kowari and MPTStore. Following the performance recommendations of the Fedora developers, this group advises that the Mulgara triple store be used.
The triple store should not be accessible directly, but only through designated API calls in the DOMS webservice.
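The idea of designated calls in front of the triple store can be illustrated with a minimal sketch. It is not the DOMS webservice: the class name, the method names, the object identifiers and the "isPartOf" predicate are all hypothetical, and an in-memory list stands in for Mulgara. The point is only that callers get a few named relation queries and never the raw query interface.

```python
class RelationQueries:
    """Hypothetical facade: narrow, named queries instead of raw triple store access."""

    def __init__(self, triples):
        # Each triple is (subject, predicate, object). The list is kept private,
        # so callers cannot run arbitrary queries against it.
        self._triples = list(triples)

    def objects_of(self, subject, predicate):
        """Return everything that `subject` points to via `predicate`."""
        return [o for s, p, o in self._triples if s == subject and p == predicate]

    def subjects_with(self, predicate, obj):
        """Return everything that points to `obj` via `predicate`."""
        return [s for s, p, o in self._triples if p == predicate and o == obj]


# Illustrative data only; these identifiers are made up.
queries = RelationQueries([
    ("doms:page1", "isPartOf", "doms:book1"),
    ("doms:page2", "isPartOf", "doms:book1"),
])
parts_of_book = queries.subjects_with("isPartOf", "doms:book1")
```

Keeping the set of calls small means the triple store implementation could later be swapped without changing the callers.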
REST API
The REST API was previously an optional part of Fedora. With the newest version of Fedora, it became equivalent to the other API methods. As such, we cannot disable the REST API selectively. Rather, we must accept that there are numerous ways of invoking Fedora, which all boil down to the same API functions. It is these API functions that should be considered, not the different interfaces to them.
The advice of this group is to do nothing further with the REST API. The eventual DOMS webservice might use this API to speak to Fedora.
Database
Fedora already has a database, which it uses as a cache for faster lookups of crucial information. The search system in DOMS depends on fast lookup of which objects are in which views, and thus needs such a cache as well. We could establish a secondary database alongside the one bundled with Fedora, or use the Fedora database directly. There is a middle ground: Fedora comes with a database, but can easily be configured to use an external one. To ease maintenance, having one database server rather than two is preferable.
As such, the recommendation of this group is to establish one separate, external database server and to configure Fedora to use it. Which database system to use should be decided by the maintenance group, as they will be running it. All database-dependent applications in DOMS should use this database server.
Domsclient/Server
DomsClient/DomsServer is the interface for front end applications to talk to back end applications.
DomsServer will provide a single administrative webservice, with all the high level API functions that are possible in doms. As such, it will aggregate the interfaces from
- Fedora
- Bitstorage
- ECM
- Search webservice
Ideally, this webservice will provide the same methods over both REST and SOAP. We suggest starting with SOAP rather than REST, as that will make the Client integration easier, but as the popularity of REST is increasing, we will need to support both.
In addition, another webservice will provide a REST based service for user data retrieval from DOMS. All common user data retrieval points should be implemented, abstracting away the DOMS infrastructure. REST is chosen here for the ease with which other systems can then access data from DOMS, and because we do not plan on making a Java frontend for this service.
Authentication will happen at these interfaces.
DomsClient will be a Java based connector to the administrative DomsServer webservice. It will model the Fedora objects and their behaviours (search, ECM and bitstorage connections) in an object oriented way. The purpose of this client is to give Java applications a more productive way to interface with DOMS than a long list of methods. No behaviour should be enforced by the Client, as it should not act to ensure anything about the repository.
Bitstorage
We decided long ago to use a separate bitstorage, rather than managed Fedora datastreams. I am not entirely privy to the reasons for this decision, but it does ensure that we can hold the bitstorage to be immutable while updating metadata, which has been a design criterion for DOMS.
The Bitstorage is a system maintained by IT Maintenance. The precise nature of the underlying hardware is not known to this group, and not regarded as interesting. The basic workings are as follows:
- A file is uploaded to a stage dir, and md5 summed. If the sum does not match a sum provided by the client, the file is deleted again. From this stage dir the file will be internally accessible by a URL.
- If the file is to be preserved, it can be approved. To do this, the client must provide the md5 sum of the file, to ensure that the file to be approved is the one the client expected. The file is then moved to a storage dir on the server. The internally accessible URL to the file now becomes publicly accessible, and the URL becomes permanent. The file can no longer be modified, and is assumed to be bitpreserved for the long term.
- If the file should not be preserved, it can be deleted from the stage dir. Approved files cannot be deleted.
Further methods for getting information about files in bitstorage, and about the state of the bitstorage, are available.
Notes about the bitstorage implementation:
- Upon upload, the client must provide the size of the file. This size is then reserved in the stage storage, so that an upload will not fail midway. If there is not sufficient space, the upload will fail immediately.
- Upon approve, the file will be copied to two separate harddrives, to guard against data loss. When the file has been backed up to magnetic tape, one of the hard drive copies will be deleted. This way, an approved file will always exist on at least two different storage media.
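The upload, approve and delete rules above can be summarised in a small in-memory sketch. This is an illustration under stated assumptions, not the real bitstorage: the class name, method names and error types are hypothetical, dictionaries stand in for the stage and storage dirs, and the duplication to two drives and tape is left out.

```python
import hashlib


class BitstorageSketch:
    """Hypothetical in-memory stand-in for the stage/storage workflow."""

    def __init__(self, stage_capacity):
        self.stage_capacity = stage_capacity  # bytes available in the stage dir
        self.stage = {}    # name -> bytes; staged files may still be deleted
        self.storage = {}  # name -> bytes; approved files are immutable

    def _stage_used(self):
        return sum(len(data) for data in self.stage.values())

    def upload(self, name, data, declared_size, client_md5):
        # Fail immediately if the declared size cannot be reserved.
        if self._stage_used() + declared_size > self.stage_capacity:
            raise IOError("not enough space in stage dir")
        # Discard the file if its checksum differs from the client's.
        if hashlib.md5(data).hexdigest() != client_md5:
            raise ValueError("checksum mismatch on upload")
        self.stage[name] = data

    def approve(self, name, client_md5):
        # The client repeats the checksum to confirm which file is approved.
        if hashlib.md5(self.stage[name]).hexdigest() != client_md5:
            raise ValueError("checksum mismatch on approve")
        self.storage[name] = self.stage.pop(name)

    def delete(self, name):
        # Only staged files can be removed.
        if name in self.storage:
            raise PermissionError("approved files cannot be deleted")
        del self.stage[name]


store = BitstorageSketch(stage_capacity=100)
payload = b"example file contents"
digest = hashlib.md5(payload).hexdigest()
store.upload("example.txt", payload, len(payload), digest)
store.approve("example.txt", digest)
```

After the approve call the file exists only in the storage dictionary, and a later delete of the same name raises an error, mirroring the rule that approved files are permanent.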
Providing public authentication for users accessing the approved files is the province of IT Maintenance. The exact nature of their solution is not known to us.
Access to the bitstorage is through bash commands over an SSH connection. This interface is fronted by a webservice (lowlevelbitstorage) developed by us, which should be the only point of contact to the underlying storage.
A webservice (highlevelbitstorage) should be developed. It will provide the interface to high level file handling in DOMS, providing file characterisation as part of upload, and integrating Fedora File objects with bitstorage files. Further webservices might have to be developed, to distribute these functions, but these should work as slaves to highlevelbitstorage. Highlevelbitstorage should be the interface to file operations for DOMS.
ECM
Enhanced Content Models will be used in Fedora. We will deploy the ECM webservice alongside Fedora, to make use of the enhanced features of the new content models. The ECM webservice should be fronted by the DomsServer webservice, and few if any systems should speak directly to the ECM service. The ECM service is available through the REST protocol at the moment, but the nature of the communication between the DomsServer service and ECM is currently undecided.
GUI
The GUI, developed by Mjolner A/S and our library, will be the primary frontend for manipulating the contents of DOMS. Much has been said about the capabilities of the GUI, so I will not go into detail about that here.
The GUI is developed as a Seam based web application, meaning that it requires a Java EE engine to run. We have the option of running it in a JBoss, or running an embedded JBoss inside a Tomcat. We have decided on the second option, embedded JBoss in Tomcat.
IT Maintenance will provide a tomcat with embedded JBoss installed.
The GUI is a Java application, and will interact with the DOMS system through the DomsClient (see above). This will require rewriting the deeper layers of the GUI, which currently speak directly to Fedora.
Automated ingest scripts
Summa/Search webservice
Authentication
Unused modules
Journaling
Journaling gives us the option of always having an up-to-date mirror of the running repository. This mirror cannot be modified while mirroring, but it can be used for access. Possible scenarios:
- Primary server fails; we switch server and are up and running in no time.
- Primary server handles the GUI, and mirror handles public access requests.
The use of this functionality is heavily dependent on the uptime requirements for the final system.
We have received the requirements for uptime:
- Public access can be down for 4 hours
- GUI administration can be down for one working day
- One day of changes can be lost
It is believed that the 4 hour requirement on public access can be achieved by IT Maintenance without journaling. This leaves journaling as a means of load balancing. If we run the Fedora system inside a virtual machine, it can be duplicated to a number of systems by maintenance, and these can all be the receivers of public requests. They will be static, and will not receive updates from the master Fedora installation, but this is less of a concern.
After discussion with IT maintenance, we have agreed not to use journaling. System replication can be more easily achieved by replicating the machine running Fedora.
Akubra
Akubra blob storage lives on http://www.fedora-commons.org/confluence/display/AKUBRA/Akubra+Project. It is a storage layer abstraction. At the moment it is in version 0.1, but in time it will grow to become the storage layer of Fedora.
The basic idea is a BlobStore, storing Blobs. These can be retrieved, and new ones can be created. At the moment, the DOMS system has two very different storage mechanisms: the one in Bitstorage, and the FOXML storage. The Akubra project would naturally integrate on top of the FOXML storage. The specific system in Bitstorage is more of a problem, though. To use Akubra on the Bitstorage would require a fundamental redesign of how files are handled in DOMS.
It is the recommendation of this group that Akubra not be used in DOMS. Further investigations into managed datastreams in Fedora would be worthwhile and might change the decision.
OAI-Provider
http://fedora-commons.org/confluence/display/FCSVCS/OAI+Provider+Service+1.2
The OAI-Provider service adds proper OAI-PMH support to a Fedora repository. Unfortunately, like all non-DOMS systems, it cannot handle view blobs. A view blob is a collection of objects that together comprise a single record. A change to any of these objects is regarded as a change to the entire record, but the record can only be generated from one head object. As it is these records that should be disseminated, not the individual objects, a separate system is needed to keep track of which records have been updated.
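The bookkeeping this implies can be sketched as follows. This is only an illustration of the rule that a record counts as changed when any object in its view blob has changed; the function names, the data layout and the object identifiers are hypothetical and not part of DOMS or of the Fedora OAI provider.

```python
from datetime import datetime


def record_last_modified(member_ids, modified):
    """A record is as new as the most recently changed object in its view blob."""
    return max(modified[member] for member in member_ids)


def records_changed_since(views, modified, since):
    """Return the head objects whose records changed after `since`.

    views:    head object id -> ids of all objects in its view blob (head included)
    modified: object id -> last modification time of that single object
    """
    return sorted(
        head
        for head, members in views.items()
        if record_last_modified(members, modified) > since
    )


# Illustrative data only; these identifiers are made up.
views = {
    "doms:book1": ["doms:book1", "doms:page1", "doms:page2"],
    "doms:book2": ["doms:book2", "doms:page3"],
}
modified = {
    "doms:book1": datetime(2009, 1, 1),
    "doms:page1": datetime(2009, 1, 1),
    "doms:page2": datetime(2009, 3, 1),  # only this page changed recently
    "doms:book2": datetime(2009, 1, 1),
    "doms:page3": datetime(2009, 1, 1),
}
changed = records_changed_since(views, modified, datetime(2009, 2, 1))
```

Here the head object doms:book1 is itself unchanged, yet its record is reported because one of its pages changed. A provider that looks only at individual objects would miss this.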
It is the conclusion of this group that building an understanding of the view system into the OAI provider is more bother than it is worth. Instead, a separate OAI provider service that understands views should be developed on top of the DOMS interface. This could very possibly be based on http://proai.sourceforge.net/, which incidentally underlies the Fedora OAI provider.
GSearch
http://fedora-commons.org/confluence/display/FCSVCS/Generic+Search+Service+2.2
GSearch is a service that can produce a Lucene index of the contents of a Fedora repository. Like the OAI provider above, it does not understand the DOMS view blobs, and treats each object as a separate record. As with the OAI provider, this group finds that GSearch is not worth the bother. Instead, a module for Summa integration should be developed.
Conclusion
Checklist For Working On An Action
The Life Cycle of an Action:
- Assign people for action definition: Done at start of iteration status meeting. Fill out Assigned 
- Define the action: Describe information about what is to be done and how. Fill out Tasks Addressed and Time Estimated. 
- Review the definition: Get another project group member to review the action definition, and update it. 
- Assign people for action implementation: Done by project manager, usually the same persons who wrote the definition. Fill out Assigned and Prev assigned if new persons are assigned. 
- Implement the action: See details below 
- Review the action: Get another project group member to review what is implemented (code and documentation), and update it. 
- Finish the action: Change the status to "Finished" and update the "time used" field on the action page. 
Please make sure that you address the below issues, when working on an action:
- Update the state of the action to "In Progress" when you start working on it.
- Check if the tasks addressed by this action have their status set to "In Progress". If that is not the case, then change the state of them.
- Keep track of how much time has been spent working on the action. If it addresses more than one task, then make a note on the action page about how much of the elapsed time has been spent on the individual tasks. Hint: Continually updating the "Time used" field will make it easier for you. 
- Update the "Progress History" and documentation pages of each task addressed by this action when appropriate. This depends on the situation, but in general, the task pages should hold all important related information about the work done, experiences gathered, identified requirements and so on.