A rough estimate of the number of objects in a repository for Statsbiblioteket is currently 4-5 million. The bulk of these objects is a collection of Radioavisen manuscripts comprising 3 million TIFF files.
The number of relations between objects will be much higher. Related projects (primarily Wales National Library) set the number of relations per object at 50-100. This number includes general metadata such as Dublin Core.
Fedora creates a file for each object in the repository. This does not present a problem for standard access, as the objects are cached in a database, but it is potentially a showstopper for backup and replication.
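Combining the two estimates above gives an order of magnitude for the total number of relations. The following is only a back-of-the-envelope sketch; the function name is made up for illustration, and it assumes the 50-100 range applies to every object.

```python
# Rough estimate of the total number of relations, using the figures above.
OBJECTS = (4_000_000, 5_000_000)   # estimated number of objects (low, high)
RELATIONS_PER_OBJECT = (50, 100)   # relations per object (low, high)


def relation_range(objects, per_object):
    """Return (low, high) total relations for the given (low, high) ranges."""
    return objects[0] * per_object[0], objects[1] * per_object[1]


low, high = relation_range(OBJECTS, RELATIONS_PER_OBJECT)
print(f"{low:,} to {high:,} relations")  # 200,000,000 to 500,000,000 relations
```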
Quick test
The test was run on RedHat with ext3 and on Windows XP with NTFS5, both using a single hard disk with a block size of 4 KB. The ext3 filesystem had 7.5 million free inodes (as reported by dumpe2fs).
System | Files | Creation timing | ZIP timing |
RedHat, ext3 | 200,000 | 3.3 min | 1.5 min |
RedHat, ext3 | 1,000,000 | 13.2 min | 9.3 min |
RedHat, ext3 | 5,000,000 | 72.9 min ¹ | 103.1 min ² |
Windows XP, NTFS 5 | 200,000 | 10.5 min | 10.5 min |
Windows XP, NTFS 5 | 1,000,000 | 51.2 min | 53.5 min |
Windows XP, NTFS 5 | 5,000,000 | 243.1 min | 246.6 min |
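A test of this kind can be approximated with a short script. The sketch below is not the script used for the measurements above. The file size (100 bytes), the directory layout (1000 files per subdirectory), and all function names are assumptions made for illustration.

```python
import os
import tempfile
import time
import zipfile


def create_files(directory, count, payload=b"x" * 100, per_dir=1000):
    """Create `count` small files, spread over subdirectories of `per_dir` files."""
    for i in range(count):
        sub = os.path.join(directory, f"{i // per_dir:06d}")
        if i % per_dir == 0:
            os.makedirs(sub, exist_ok=True)
        with open(os.path.join(sub, f"{i:08d}.xml"), "wb") as f:
            f.write(payload)


def zip_tree(directory, archive_path):
    """Pack every file below `directory` into a ZIP archive; return the file count."""
    packed = 0
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, directory))
                packed += 1
    return packed


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def run_test(count):
    """Create `count` files in a temporary directory, ZIP them, and report timings."""
    with tempfile.TemporaryDirectory() as work:
        data = os.path.join(work, "data")
        os.makedirs(data)
        _, create_s = timed(create_files, data, count)
        # The archive is written outside `data`, so it is not packed into itself.
        packed, zip_s = timed(zip_tree, data, os.path.join(work, "backup.zip"))
        return packed, create_s, zip_s


if __name__ == "__main__":
    packed, create_s, zip_s = run_test(10_000)
    print(f"{packed} files: creation {create_s:.1f} s, ZIP {zip_s:.1f} s")
```

Timings from such a script depend heavily on hardware, filesystem settings, and file size, so they are not directly comparable with the table.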
Test conclusion
A backup time below two hours on a single hard disk, with a filesystem that is not optimized for small files, is acceptable (Toke). A bigger problem is the inconsistencies that arise when files change during the backup.
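One way to at least detect such inconsistencies is to record the size and modification time of every file before and after the backup and compare the two snapshots. This is a minimal sketch under that assumption. The function names are hypothetical, and the approach only detects changes; it does not prevent them.

```python
import os


def snapshot(directory):
    """Map each relative file path to its (size, modification time in ns)."""
    state = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            st = os.stat(full)
            state[os.path.relpath(full, directory)] = (st.st_size, st.st_mtime_ns)
    return state


def changed_during(before, after):
    """Return the sorted paths that were added, removed, or modified between snapshots."""
    paths = before.keys() | after.keys()
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

Taking a snapshot before the backup starts and another when it finishes, then passing both to `changed_during`, gives the list of files whose backed-up copy may be inconsistent.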
Windows XP with NTFS5 performs markedly worse than RedHat with ext3: file creation took roughly three to four times as long in every test. One possibility is to use another filesystem on Windows. There is a freeware ext2/3 driver for Windows XP at http://fs-driver.org/
Problem
There are political and technical problems with backing up millions of small files. The web development group ran into this problem with Horizon and implemented a database-driven file handling system to reduce the number of files. Talk to Hans about this.
The Royal Library solves the problem by using XMLTapes for metadata (the XML files are concatenated and indexed), but that solution works best for very static metadata.
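How the Horizon solution is implemented is not described here. The general idea of keeping many small files in a database instead of on the filesystem can be sketched as follows. SQLite, the table layout, and the function names are assumptions chosen for illustration, not the web development group's design.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store that keeps small files as rows in a single database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "name TEXT PRIMARY KEY, content BLOB NOT NULL)"
    )
    return conn


def put(conn, name, content):
    """Store or overwrite the file `name` with the given bytes."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO files (name, content) VALUES (?, ?)",
            (name, content),
        )


def get(conn, name):
    """Return the bytes stored under `name`, or None if it does not exist."""
    row = conn.execute(
        "SELECT content FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else bytes(row[0])
```

With this layout a backup has to copy one database file rather than millions of small files.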
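The concatenate-and-index idea can be illustrated with a simplified sketch. This is not the actual XMLTape format or the Royal Library's implementation. The JSON index, the use of base file names as keys, and the function names are assumptions.

```python
import json
import os


def write_tape(xml_paths, tape_path, index_path):
    """Concatenate the XML files into one tape file and record (offset, length) per file."""
    index = {}
    with open(tape_path, "wb") as tape:
        for path in xml_paths:
            with open(path, "rb") as src:
                data = src.read()
            # Key by base name; assumes base names are unique across xml_paths.
            index[os.path.basename(path)] = (tape.tell(), len(data))
            tape.write(data)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index


def read_record(tape_path, index, name):
    """Read one record back from the tape using its indexed offset and length."""
    offset, length = index[name]
    with open(tape_path, "rb") as tape:
        tape.seek(offset)
        return tape.read(length)
```

The sketch also shows why the approach suits static metadata: changing one record to a different length shifts every later offset, so the tape and index have to be rewritten.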
See also
Scalability tests on several different machines: http://fedora.statsbiblioteket.dk/fedoraWiki/Fedora_performance