2009-10-07 IT Maintenance Meeting

Participants: KFC, JRG, ABR, JHLJ, TGC

Backup

The main backup problem is getting a consistent backup of the various parts of DOMS data.

DOMS contains:

  • Data from Fedora
    • Object files in small XML files
    • An SQL database
    • An RDF database
  • Configuration files
  • Applications

(and of course bit storage)

The only really important data are the Fedora object files. The databases can be rebuilt, the configuration files are under version control, and the application files are available both under version control and as release packages.

However, rebuilding the databases is time-consuming (days or weeks for a large repository), so a consistent backup would be preferable. Unfortunately, Fedora does not offer a checkpoint at which object files and databases are guaranteed to be consistent. This may even be a problem after non-clean shutdowns of Fedora.

Decisions:

  • KFC contacts the Fedora mailing list to confirm that there are no transaction checkpoints in Fedora for backup.
  • If possible: We run the system on a VMware server, and simply snapshot the entire state. (JHLJ, TGC)
  • If possible: We shut down Fedora every night and run the backup. DOMS services gracefully wait for Fedora to come up again. (KFC)
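
The "gracefully wait" behaviour in the last decision could look roughly like the following. This is a minimal sketch, not the DOMS implementation: the function name wait_for_fedora, the is_up probe and all timing defaults are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_fedora(is_up: Callable[[], bool],
                    timeout_s: float = 3600.0,
                    initial_delay_s: float = 1.0,
                    max_delay_s: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll is_up() with exponential backoff.

    Returns True as soon as the probe succeeds, or False once
    timeout_s seconds have passed without success.
    is_up is a hypothetical probe, e.g. a cheap request to Fedora.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        if is_up():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        # Double the delay up to a ceiling, so a long backup window
        # does not cause a flood of probes.
        delay = min(delay * 2, max_delay_s)
```

A service would call this before retrying a failed Fedora request, and report an error only if it returns False. The sleep and clock parameters exist so the behaviour can be tested without real waiting.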

Bitstorage SSH shell script

We discussed the behaviour of the SSH scripts that are the interface to DOMS storage. We also requested HTTP access to files in stage and storage.

Decisions:

  • Knowledge of the checksum should be required to approve a file in "stage". (JHLJ)
  • The file size should be given as an extra input to upload, to ensure that space is available and to fail gracefully if it is not. (JHLJ)
    • Note that uploads may be concurrent, so the space needs to be reserved.
  • Bitstorage should keep two disk copies until data is replicated to two backup tapes. (JHLJ)
    • If this is not done, we need a status method to tell us whether a file is replicated, so the uploading user or some intermediate buffer knows when to delete it.
  • JHLJ will consider whether "upload" and "approve" could be made idempotent; that is, they will report success if we try to upload a file already present, or approve a file already approved (only on correct checksum, of course).
  • One API update has been done only on the production scripts, not on the test script. JHLJ will fix this.
  • TGC will set up an Apache server that delivers files in stage and storage in-house.
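
The upload and approve decisions above can be illustrated with a small in-memory model. This is a sketch under stated assumptions and not the actual SSH scripts: the class Bitstorage, its method names, the return strings and the use of SHA-256 as the checksum are all hypothetical, since the minutes do not name a checksum algorithm or an API.

```python
import hashlib


class StageError(Exception):
    """Raised when an upload or approve request must be rejected."""


def checksum(data):
    # Assumption: SHA-256; the minutes do not specify the algorithm.
    return hashlib.sha256(data).hexdigest()


class Bitstorage:
    """In-memory stand-in for the stage and storage areas (hypothetical)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.reserved = 0   # bytes promised to uploads in progress
        self.stage = {}     # name -> bytes, uploaded but not approved
        self.storage = {}   # name -> bytes, approved

    def _used(self):
        stored = sum(len(d) for d in self.stage.values())
        stored += sum(len(d) for d in self.storage.values())
        return stored + self.reserved

    def _find(self, name):
        if name in self.stage:
            return self.stage[name]
        return self.storage.get(name)

    def upload(self, name, data, declared_size):
        existing = self._find(name)
        if existing is not None:
            # Idempotent: re-uploading identical content succeeds.
            if existing == data:
                return "already present"
            raise StageError("a different file with this name exists")
        # Reserve space before accepting data, so a concurrent upload
        # cannot be promised the same free space.
        if self._used() + declared_size > self.capacity:
            raise StageError("not enough space")
        self.reserved += declared_size
        try:
            if len(data) != declared_size:
                raise StageError("size does not match declared size")
            self.stage[name] = data
        finally:
            self.reserved -= declared_size
        return "uploaded"

    def approve(self, name, expected_checksum):
        if name in self.storage:
            # Idempotent, but only for a caller who knows the checksum.
            if checksum(self.storage[name]) == expected_checksum:
                return "already approved"
            raise StageError("checksum mismatch")
        if name not in self.stage:
            raise StageError("no such file in stage")
        if checksum(self.stage[name]) != expected_checksum:
            raise StageError("checksum mismatch")
        self.storage[name] = self.stage.pop(name)
        return "approved"
```

With idempotent operations, a client that loses its connection mid-request can simply repeat the request without first having to find out whether the earlier attempt went through.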

Requested hardware and software

DOMS consists of Fedora and a number of web applications. Everything runs in tomcat instances, although two applications have special requirements for the tomcat environment (one requires special libraries and settings in server.xml, one requires an environment variable to be set when starting tomcat). Furthermore, Fedora requires an SQL database, and works best if the SQL database, the RDF database and the object files are on different disks.

Decisions:

  • Maintenance department will maintain production tomcat. We will aim for tomcat 6 if possible.
  • We can have different disks for different parts of the system.
  • If possible, we should use Postgres for the database.
  • We will try to run on a VMware server, and migrate to a different server if this is not feasible.

Configuration

DOMS requires configuration of tomcat and the various webapps. We want the production system to be configured by the maintenance department.

Decisions:

  • No configuration files should be packed in the WAR-files, unless overridable in context.xml
  • Maintenance will configure the production environment. Configuration files are kept under version control.
  • Development group will point out configuration files, and what settings are relevant, where applicable.

Test/Production deploy

Maintenance has good experience with running a test and a production environment. The test environment is identical in setup to the production environment. During development, the test environment is the domain of the developers. When things go into production, the maintenance department takes over the test environment and sets up the system there. After a successful test, this setup is migrated to the production system.

Decisions:

  • Maintenance department sets up a TEST system for developers. (JHLJ, TGC)
  • When things need to go into production, developers deliver WAR-files and configuration files + descriptions of what needs to be done to tomcats.

System monitoring

Maintenance will set up Big Sister monitoring, which will poll a status page. The status page may report OK, or an error message describing the problem in case of trouble. The status page may have "yellow" or "red" light indications.

Decisions:

  • Developers will create a status page.
  • Information about status page is delivered with production data.
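
One possible shape for the status page logic is sketched below. It is an illustration only: the function render_status, the level names and the output layout are assumptions, and the format that Big Sister would actually expect is not covered by these minutes.

```python
from typing import Callable, Dict, Tuple

# Levels ordered from least to most severe.
LEVELS = ("OK", "YELLOW", "RED")

# A check returns (level, message); the check names are hypothetical.
Check = Callable[[], Tuple[str, str]]


def render_status(checks: Dict[str, Check]) -> str:
    """Run every check and return the status page body.

    The first line is the worst level seen. Each following line
    describes one check that did not report OK.
    """
    worst = 0
    lines = []
    for name, check in checks.items():
        try:
            level, message = check()
        except Exception as exc:
            # A check that crashes is itself treated as a red light.
            level, message = "RED", f"check failed: {exc}"
        worst = max(worst, LEVELS.index(level))
        if level != "OK":
            lines.append(f"{name}: {level}: {message}")
    return "\n".join([LEVELS[worst]] + lines)
```

Keeping the overall level on the first line lets the monitoring side decide on an alarm from that line alone, while the remaining lines carry the error messages for whoever investigates.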

Replicated systems

Fedora supports running as a replicated system, for uptime and performance.

Decisions:

  • For a start we will not use this capability
  • In case of performance trouble we will consider adding this or migrating to different hardware
  • We may consider this option if our backup strategies as mentioned above do not allow fast recovery

2009-10-07 IT Maintenance Meeting (last edited 2010-03-17 13:09:05 by localhost)