Deployment and Design of DOMS

System Layout

The deployed DOMS system will consist of a number of servers.

Server: Description

Fedora tomcat server: Runs the Fedora webapp and the webapps that should be allowed to reference Fedora as localhost.

Database: Acts as caching database for Fedora and the update tracker service and, optionally, as a cache for ECM.

Bitstorage: SSH server for file upload, and HTTP server for file access.

Gui server: (Embedded) JBoss server hosting the GUI.

Auth server: Tentative design: LDAP or similar user server. When a user logs in from another system, an account is created in this server.

(Optional) Tomcat: Server hosting the webapps that should not live in the Gui server or the Fedora server. Might not be needed.

Most/All of these servers will run in virtual environments. At the moment, the agreement with Maintenance is that they will provide two virtual servers (fedora server and Gui server), a PostgreSQL server and the Bitstorage.

Maintenance backup will be done by simply snapshotting the running virtual machines. A snapshot can later restore the state of DOMS EXACTLY, including the content of RAM. A system backed up this way could be restored very quickly, should the DOMS system fail.

Components.png

Used modules

Fedora Modules

Triple Store

Fedora comes with a triple store, which is not enabled by default. Being able to perform queries about relations between objects is a crucial feature of many DOMS operations. Fedora offers a choice between implementations of the triple store, namely Mulgara, Kowari and MPTStore. Following the performance recommendations of the Fedora developers, this group advises that the Mulgara triple store be used.

Access to the triple store should not be available directly, but through designated API calls in the DOMS webservice.

REST API

The REST API was previously an optional part of Fedora. With the newest version of Fedora, it became equivalent to the other API methods. As such, we cannot disable the REST API selectively. Rather, we must accept that there are numerous ways of invoking Fedora, but that they all boil down to the same API functions. It is these API functions that should be considered, not the different interfaces to them.

The advice of this group is to do nothing further with the REST API. The eventual DOMS webservice might use this API to speak to Fedora.

Database

Fedora already has a database, which is used as a cache for faster lookups of crucial information. The update tracker in DOMS depends on fast lookup of which objects are in which views, and which have been updated, and thus needs such a cache. We have the option of establishing a secondary database, or using the Fedora database. But there is some middle ground. Fedora does come with a database, but can easily be configured to use an external database. To ease maintenance, having just one database server, rather than two, is preferable.

As such, the recommendation of this group is to establish such a separate database. Which database system to use should be decided by the maintenance group, as they will be running it (PostgreSQL). All database-dependent applications in DOMS should use this database server.

JMS

Fedora should be configured to publish notifications of updates via JMS. This is standard Fedora functionality, but not enabled by default. Fedora does come with a message broker, but for performance purposes, it might be advisable to set up a separate system as message broker.

XACML and authentication

Fedora will know the Auth server, and just the Auth server. All users that want to access content in Fedora will use guest rights, or have an account on the Auth server.

We will use XACML to specify authorization. We have decided to build on the Fedora authorization system, rather than graft a new one onto the repository. As such, for metadata (i.e. stuff in Fedora), all policies are expressed as XACML documents.

We need some sort of service to authenticate users, and to get properties for authenticated users. This will probably be the WAYF system, hereafter known as the AUTH system. This service is the single point where all authentication will be performed, and all other modules will just forward authentication (user credentials or a token) along with any requests.

Fedora will use the AUTH system to authenticate users wanting to access data. We have admin users, which should have some property set, which will allow them to change objects in Fedora, and disregard all object policies (expressed as an exception in the policies). Users without this property will adhere to policies. This enables us to model user restrictions without cluttering the librarians with user rights, and at the same time ensure that only the correct users can ever modify DOMS.

DOMS components

Domsclient/Server

DomsServer Depends: Fedora, Bitstorage, ECM, Update tracker, Triple Store

DomsClient Depends: DomsServer

DomsClient/DomsServer is the interface for frontend applications to talk to backend applications.

DomsServer will provide a single administrative webservice, with all the high level API functions that are possible in DOMS. As such, it will aggregate the interfaces from

This webservice will provide the methods in the SOAP protocol. We suggest the use of SOAP over REST, as that will make the Client integration easier and avoid the problems with expressing methods in REST. Authentication is done on the Auth server, and Fedora determines authorization.

In addition to the previous webservice, optionally another webservice will provide a REST based service for user data retrieval from DOMS. All common user data retrieval points should be implemented, abstracting away the DOMS infrastructure. REST is chosen here for the ease with which other systems can then access data from DOMS, and because we do not plan on making a java frontend for this service. This webservice should not have any functions not available through the administrative interface. Access to this service should be authenticated against the AUTH server, as it could be used by any user wanting programmatic access to the DOMS data. Authorization for the user will be decided by the data objects or files.

DomsClient will be a java based connector to the administrative DomsServer webservice. It will model the Fedora objects and their behaviours (search, ECM and bitstorage connections) in an object oriented way. The purpose of this client is to give a more productive way to interface java applications with DOMS than a long list of methods. No behaviour should be enforced by the Client, as it should not act to ensure anything about the repository.

Bitstorage

Depends: Nothing

We decided long ago to use a separate bitstorage, rather than managed Fedora datastreams. I am not entirely privy to the reasons for this decision, but it does ensure that we can hold the bitstorage to be immutable while updating metadata, which has been a design criterion for DOMS.

The Bitstorage is a system maintained by IT Maintenance. The precise nature of the underlying hardware is not known to the DOMS group, and is not our responsibility. The basic workings are as follows:

  1. A file is uploaded to a stage dir, and md5 summed. If the sum does not match a sum provided by the client, the file is deleted again. From this stage dir the file will be internally accessible by a URL.
  2. If it is decided that the file should be preserved, it can be approved. To do this, the client must provide the md5 sum of the file, to ensure that the file to be approved is the one the client expected. The file is then moved to a storage dir on the server. The internally accessible URL to the file now becomes publicly accessible, and the URL becomes permanent. The file can no longer be modified, and is assumed to be bitpreserved for the long term.
  3. If the file should not be preserved, it can be deleted from the stage dir. Approved files cannot be deleted.
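
The three steps above can be sketched as a small in-memory model. This is only an illustration under assumptions: the class and method names (BitstorageSketch, upload, approve, delete) are hypothetical, and the real bitstorage is driven by shell commands over SSH rather than by Python objects.

```python
import hashlib


class BitstorageSketch:
    """In-memory stand-in for the stage/approve/delete workflow (hypothetical names)."""

    def __init__(self):
        self.staged = {}    # file name -> bytes, internally accessible only
        self.approved = {}  # file name -> bytes, public and immutable

    def upload(self, name, data, client_md5):
        # Step 1: checksum the upload and reject it if the sums do not match.
        if name in self.approved:
            return False  # approved files can no longer be modified
        if hashlib.md5(data).hexdigest() != client_md5:
            return False  # the file is not kept in the stage dir
        self.staged[name] = data
        return True

    def approve(self, name, client_md5):
        # Step 2: the client repeats the checksum to confirm which file is approved.
        data = self.staged.get(name)
        if data is None or hashlib.md5(data).hexdigest() != client_md5:
            return False
        self.approved[name] = self.staged.pop(name)
        return True

    def delete(self, name):
        # Step 3: only staged files can be deleted; approved files are permanent.
        return self.staged.pop(name, None) is not None
```

Note that delete only ever looks at the stage dir, so an approved file cannot be removed through this interface, which mirrors the immutability requirement.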

Further methods for getting information about files in bitstorage, and the state of the bitstorage, are available. See the notes about the bitstorage implementation.

Providing public authentication for users accessing the approved files is the province of IT maintenance. The exact nature of the solution is not decided yet, but see the Authentication section for more on that.

The access to the bitstorage is through bash commands over an SSH connection. This interface is fronted by a webservice (lowlevelbitstorage) developed by us, which should be the only point of contact to the underlying storage. This webservice should not use authentication, but rather only accept connections from specific IP addresses, i.e. the server where highlevelbitstorage lives. This service should provide the logging for bitstorage modifications.
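
A minimal sketch of the IP restriction, assuming the check is done in application code (it could equally be done in the servlet container or a firewall). The address 192.0.2.10 is a documentation placeholder standing in for the highlevelbitstorage host, and ALLOWED_NETWORKS and is_allowed are hypothetical names.

```python
import ipaddress

# Hypothetical allow list: the address of the host running highlevelbitstorage.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.10/32")]


def is_allowed(remote_addr):
    """Return True if the caller's address is inside one of the allowed networks."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # unparsable addresses are rejected
    return any(address in network for network in ALLOWED_NETWORKS)
```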

A webservice (highlevelbitstorage) should be developed. It will provide the interface to high level file handling in DOMS, providing file characterisation as part of upload, and integrating Fedora File objects with bitstorage files. Further webservices might have to be developed, to distribute these functions, but these should work as slaves to highlevelbitstorage. Highlevelbitstorage should be the interface to file operations for DOMS. The service will authorize the requests via Fedora.

ECM

Depends: Fedora, Database

Enhanced Content Models will be used in Fedora. We will deploy the ECM webservice alongside Fedora, to make use of the enhanced features of the new content models. The ECM webservice should be fronted by the DomsServer webservice, and only systems inside DOMS should speak directly to the ECM service. The ECM service is available through the REST protocol at the moment, but the nature of the communication between the DomsServer service and ECM is currently undecided.

The ECM object validator will be hooked into Fedora. Whenever an object is to be marked active, the validator will run. If the object is invalid, the change is prevented.

The ECM webservice should just forward whatever received authentication to Fedora, for Fedora to authenticate and authorize.

Optionally: ECM should precalculate the content model inheritance paths, and possibly the corresponding compound content models, and store these in the database. Constructing deeply inherited compound content models requires parsing very many Fedora objects, and is a very expensive procedure. This is behavior that Fedora will possibly provide at some future point. A JMS client should be developed for listening to changes to content models. Such changes should be rare, but they will require recalculation of the inheritance tree.
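
The precalculation could look roughly like the following sketch. It assumes the direct "extends" relations have already been read out of Fedora into a plain mapping; the function name inheritance_paths and the content model ids in the usage note are hypothetical.

```python
def inheritance_paths(parents):
    """Map each content model to all of its ancestors, in depth-first order.

    `parents` maps a content model id to the ids it directly extends.
    """
    cache = {}

    def ancestors(model, visiting):
        if model in cache:
            return cache[model]
        if model in visiting:
            raise ValueError("inheritance cycle at " + model)
        visiting = visiting | {model}
        result = []
        for parent in parents.get(model, []):
            # Add the parent itself, then everything the parent inherits from.
            for candidate in [parent] + ancestors(parent, visiting):
                if candidate not in result:
                    result.append(candidate)
        cache[model] = result
        return result

    return {model: ancestors(model, frozenset()) for model in parents}
```

For a mapping such as {"cm:Photo": ["cm:Image"], "cm:Image": ["cm:Base"], "cm:Base": []}, the path stored for cm:Photo would be cm:Image followed by cm:Base. The result is what would be written to the database and recalculated when the JMS client reports a content model change.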

GUI

Depends: DomsClient

The GUI, developed by Mjolner A/S and our library, will be the primary frontend for manipulating the contents of DOMS. Much has been said about the capabilities of the GUI, so I will not go into detail about that here.

The GUI is developed as a SEAM based webapplication, meaning that it requires a Java EE engine to run. We have the option of running it in a JBoss, or running an embedded JBoss inside a Tomcat. We have decided on the second option, embedded JBoss in tomcat.

IT Maintenance will provide a tomcat with embedded JBoss installed.

The GUI is a java application, and will interact with the DOMS system through the DomsClient (see above). This will require rewriting the deeper layers of the GUI, which at the moment speak directly to Fedora.

The GUI will forward authentication to Fedora.

Automated ingest scripts

Depends: DomsClient/Rest interface for DomsServer

An automated ingest procedure for assembly line digitalisation should be developed. The procedure will be something like this:

  1. A new record is made in DOMS, via the DOMS gui
  2. The ID of this record is input into the digitizer, along with the media to digitize
  3. The DOMS id is saved, along with the digital files
  4. The automated ingest script will pick up the newly digitized files (possibly from a hot-dir), and add them as file objects to the DOMS id
  5. Possibly, the automated ingest script will also publish the DOMS record
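
Steps 4 and 5 could be sketched as below. This is a sketch under assumptions, not a decided design: it assumes the digitizer writes the files for one record into a subdirectory of the hot-dir named after the DOMS id, and add_file and publish are hypothetical callbacks standing in for calls made through DomsClient.

```python
from pathlib import Path


def collect_ingest_jobs(hot_dir):
    """Return {doms_id: [file paths]} for every record directory in the hot-dir.

    Assumes (hypothetically) one subdirectory per record, named after the DOMS id.
    """
    jobs = {}
    for record_dir in sorted(Path(hot_dir).iterdir()):
        if record_dir.is_dir():
            files = sorted(p for p in record_dir.iterdir() if p.is_file())
            if files:
                jobs[record_dir.name] = files
    return jobs


def run_ingest(hot_dir, add_file, publish=None):
    """Attach each digitized file to its record, then optionally publish the record."""
    for doms_id, files in collect_ingest_jobs(hot_dir).items():
        for path in files:
            add_file(doms_id, path)  # step 4: add as file object to the DOMS id
        if publish is not None:
            publish(doms_id)         # step 5: optional publishing
```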

The nature of the ingest scripts has not yet been decided, but they will be written in either java or some scripting language.

The scripts will forward authentication.

Update tracker

Depends: Fedora JMS, ECM, Database

An update tracker service should be developed. It will act as a listener for Fedora JMS messages. It will maintain a list of references between all pids, and the entry objects they belong to, in the database module. This list will be constructed by queries to the ECM view system. When an object is modified, this list should be updated with the information that the corresponding entry has been modified.

The update tracker service will provide a webservice interface, so that queries about changed entries in regards to timestamps can be executed. This service will be fronted by the DomsServer service. The service will do no authorization, as the information it controls is not sensitive. If this is a problem, a list of allowed IPs will be used.
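
The bookkeeping described above can be sketched in memory as follows. All names are hypothetical; in DOMS the view membership would come from the ECM view system, the change notifications from Fedora JMS, and the two mappings would live in the database rather than in dictionaries.

```python
class UpdateTrackerSketch:
    """Tracks which entry objects changed, given changes to member objects."""

    def __init__(self):
        self.entries_for_pid = {}  # pid -> set of entry pids whose view contains it
        self.last_modified = {}    # entry pid -> latest modification timestamp

    def register_view(self, entry_pid, member_pids):
        # Record that every member object belongs to the view of this entry.
        for pid in member_pids:
            self.entries_for_pid.setdefault(pid, set()).add(entry_pid)

    def object_modified(self, pid, timestamp):
        # Called once per change notification for a single object.
        for entry_pid in self.entries_for_pid.get(pid, ()):
            previous = self.last_modified.get(entry_pid, timestamp)
            self.last_modified[entry_pid] = max(previous, timestamp)

    def entries_modified_since(self, timestamp):
        # The query the webservice interface would answer.
        return sorted(e for e, t in self.last_modified.items() if t > timestamp)
```

A change to any object in a view marks the whole entry as changed, so a client asking for entries modified since a given timestamp sees the entry even when only one of its file objects was touched.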

Summa storage

Depends: DomsClient (Update tracker, ECM)

We will develop a SummaStorage module, corresponding to DOMS. A summa storage is an abstraction over a set of changing records. The summa storage module will make use of the update tracker service to provide lists of changes, and the ecm service to construct the databundles for entries. This will happen through the DomsServer webservice, accessed through the DomsClient java implementation.

Whenever Summa is going to present a record (longview) for a user, it will query the corresponding storage. As bundling objects can be an expensive operation, we propose that Summa develop a caching storage as a frontend for the DOMS storage. The bundled objects will be held here, and requests can be served much more cheaply. This is the first optimization we will do if the performance of DOMS seems inadequate. This will be the work of the Summa crew, and will probably not use the DOMS database, although this is yet to be decided. Usage statistics for DOMS records will have to be logged at this level, as the caching prevents logging further along the request.
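
As an illustration of the caching frontend, here is a minimal sketch. It is not Summa's storage API: CachingStorageSketch, fetch_bundle and invalidate are hypothetical names, and a real implementation would persist the cache and bound its size.

```python
class CachingStorageSketch:
    """Caches bundled records in front of an expensive DOMS lookup."""

    def __init__(self, fetch_bundle):
        self._fetch_bundle = fetch_bundle  # expensive call into DOMS (assumed)
        self._cache = {}                   # record id -> bundled record
        self.hits = {}                     # record id -> number of requests

    def get(self, record_id):
        # Usage statistics are counted here, since cached requests never reach DOMS.
        self.hits[record_id] = self.hits.get(record_id, 0) + 1
        if record_id not in self._cache:
            self._cache[record_id] = self._fetch_bundle(record_id)
        return self._cache[record_id]

    def invalidate(self, record_id):
        # To be called when the update tracker reports the record as changed.
        self._cache.pop(record_id, None)
```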

Summa will access DOMS with rights to view all records (but without admin), or close to this. It is then the province of Summa to enforce that records are not shown to people not authorized to view these records. The records accessed will contain the full XACML policies for the record, to enable Summa to make these decisions.

OAI/PMH endpoint

Depends: DomsClient (Search Webservice, ECM)

A REST webservice that acts as an OAI/PMH endpoint. The purpose of this service is to allow other search engines and the like to index and access DOMS. The endpoint will only present records that are viewable without user credentials, as there is no way to forward user credentials.

It will function in much the same way as the SummaStorage mentioned above, but rather than providing a Summa interface, it will provide an OAI-PMH interface.

This service is intended as a separate webservice, using the DomsClient to communicate with Fedora, although it could be modelled as a webservice alongside the DomsServer service.

Possibly, we will establish other OAI-PMH services, each with a hardcoded set of credentials for Fedora access. These could be used for partnerships with other search engines.

Authentication

Preliminary authentication design.

Short post showings will always be visible in Summa, as they are read from the Index.

Long post showings will be queried from DOMS, and thus subject to user rights.

We implement this thus:

  1. There will be a wayf-webservice, underlying summa.
  2. The summa long post link will be to this webservice.
  3. Upon invoking this service, it will try to get the record from doms with guest rights
  4. If this fails, it will redirect the user to a login site
  5. There the user will log in
  6. The webservice will query the user properties, and store these in some location, like a LDAP (the Auth server)
  7. The call will proceed to Fedora, with the credentials of the temp user in the Auth server.
  8. Fedora will then use the properties of the temp user, and return the correct object.
  9. The record will now be displayed in Summa.
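
The flow in steps 3 to 8 can be sketched as a single function. Everything here is an assumption made for illustration: get_record, login and store_temp_user are hypothetical callbacks standing in for the DOMS call, the WAYF login and the Auth server, and AccessDenied is a hypothetical exception type.

```python
class AccessDenied(Exception):
    """Raised by the (hypothetical) DOMS call when the credentials do not suffice."""


def fetch_long_record(record_id, get_record, login, store_temp_user):
    """Try guest access first, and fall back to a login only when it is refused."""
    try:
        # Step 3: attempt the request with guest rights.
        return get_record(record_id, credentials=None)
    except AccessDenied:
        # Steps 4-5: send the user to the login site and get the user properties back.
        properties = login()
        # Step 6: store the properties as a temp user in the Auth server.
        credentials = store_temp_user(properties)
        # Steps 7-8: repeat the call with the credentials of the temp user.
        return get_record(record_id, credentials=credentials)
```

If the second call is also refused, AccessDenied propagates to the caller, which corresponds to the record not being shown.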

Bitstorage Auth

  1. The bitstorage link in summa will be to the file in bitstorage
  2. The Apache server hosting the file will have an Apache module that will call a webservice, with the credentials supplied with the request.
  3. Said webservice will call Fedora, to try to get the file object with the supplied credentials
  4. If successful, the webservice will return to the module, which will then allow the file to be accessed.
  5. If not successful, and no credentials are supplied, the user will be redirected to a WAYF login.
  6. Otherwise the file will not be accessed.

Bitstorage Gotchas:

  1. The apache module doing authentication against fedora should only be invoked if the request did not originate from Fedora itself.
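
The decision made for each file request, including the gotcha above, can be summarised in a small function. This is a sketch only: authorize_file_request and fedora_allows are hypothetical names, and the real check would live in an Apache module calling a webservice rather than in Python.

```python
def authorize_file_request(credentials, from_fedora, fedora_allows):
    """Return "allow", "redirect_to_login" or "deny" for one file request.

    `fedora_allows` is a hypothetical callback that asks Fedora whether the
    file object can be fetched with the supplied credentials.
    """
    if from_fedora:
        return "allow"  # gotcha: requests from Fedora itself skip the check
    if fedora_allows(credentials):
        return "allow"  # steps 3-4: Fedora returned the file object
    if credentials is None:
        return "redirect_to_login"  # step 5: anonymous user, offer a WAYF login
    return "deny"  # step 6: known user without access
```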

OAI-PMH

  1. Only publicly viewable records will be presented through OAI-PMH, as the OAI-PMH client is not a browser and thus cannot be authenticated with the system.

Summa storage speedup

  1. Only publicly viewable records can be cached by Summa, as user authentication is necessary for the others. The user authorization is done by the XACML engine in Fedora.

Licenses

  1. WAYF only gives information about where you are from and your relation to this organisation. We cannot add properties to the user accounts, so the original idea of licenses is dead. Instead, we must write the licenses in terms of organisations. When more organisations get access to a protected resource, we must update the license object.
  2. The above is not totally correct, but changing it does depend on the cooperation of the user's home organisation. The strategy is currently undecided.

Gui

  1. Gui users will not use WAYF for authentication. Rather, their user credentials will be stored permanently in the Auth server (or Fedora will use two Auth servers, one temp and one for GUI users). They will be given the attributes that enable them to access DOMS with admin rights.

Unused modules

Journaling

Journaling provides us with the option to always have an up-to-date mirror of the running repository. This mirror cannot be modified while mirroring, but it can be used for access. Possible scenarios:

  1. Primary server fails, we switch servers and are up and running in no time
  2. Primary server handles the GUI, and mirror handles public access requests.

The use of this functionality is heavily dependent on the uptime requirements for the final system.

We have received the requirements for uptime:

It is believed that the 4 hour requirement on public access can be achieved by IT maintenance without journaling. This leaves journaling as a means of load balancing. If we run the Fedora system inside a virtual system, it can be duplicated to a number of systems by maintenance, and these can all be the receivers of public requests. They will be static, and not receive updates from the master Fedora installation, but this is less of a concern.

After discussion with IT maintenance, we have agreed not to use journaling until we see a need. System replication can be more easily achieved by replicating the machine running Fedora.

Akubra

Akubra blob storage lives on http://www.Fedora-commons.org/confluence/display/AKUBRA/Akubra+Project. It is a storage layer abstraction. At the moment it is in version 0.1, but in time it will grow to become the storage layer of Fedora.

The basic idea is a BlobStore, storing Blobs. These can be retrieved, and new ones can be made. At the moment, the DOMS system has two very different storage mechanisms: the one in Bitstorage, and the foxml storage. The Akubra project would naturally integrate on top of the foxml storage. The specific system in Bitstorage is more of a problem, though. To use Akubra on the Bitstorage would require a fundamental redesign of how files are handled in DOMS.

It is the recommendation of this group that Akubra not be used in DOMS. Further investigations into managed datastreams in Fedora would be worthwhile and might change the decision.

OAI-Provider

http://Fedora-commons.org/confluence/display/FCSVCS/OAI+Provider+Service+1.2

The OAI-Provider service adds proper OAI-PMH support to a Fedora repository. Unfortunately, it, like all non-DOMS systems, cannot handle view blobs. A view blob is a collection of objects that together comprise a single record. Changes to any of these is regarded as a change to the entire record, but the record can only be made from one head object. As it is these records that should be disseminated, not the individual objects, a separate system is needed to keep track of which have been updated.

It is the conclusion of this group that building an understanding of the view system into the OAI provider is more bother than it is worth. Instead a separate OAI provider service, which understands views, should be developed on top of the DOMS interface. This could very possibly be based on http://proai.sourceforge.net/, which incidentally underlies the Fedora OAI provider.

GSearch

http://Fedora-commons.org/confluence/display/FCSVCS/Generic+Search+Service+2.2

Gsearch is a service that can produce a Lucene index of the contents of a Fedora repository. Like the OAI provider above, it does not understand the DOMS view blobs, and thinks each object is a separate record. As with the OAI-provider above, this group finds that Gsearch is not worth the bother. Instead, a module for Summa integration should be developed.

Deployment and Design of DOMS (last edited 2010-03-17 13:12:54 by localhost)