Hardware
Quick notes... More to come.
See also IndexBuilding.
Appetizer
The test:
- Logged queries are used sequentially for Lucene 2.1 searches on the index used at Statsbiblioteket.
- The test includes query parsing and extraction of a single field of size ~= ½ KB for the first 20 hits for a given query.
The index:
- The index at Statsbiblioteket currently holds about 9 million records taking up 37GB of disk space.
- Searching with the logged queries gives an average of 2500 hits per query.
Machines:
- metis: Server. Dual Core Xeon with 8GB fast RAM, 2 * 32GB Samsung SSDs in RAID 0 + 2 * 15,000 RPM conventional hard drives in RAID 1. Linux, 64bit, ext2, noatime.
- pc990: Workstation. Single core Pentium 3.2GHz with 2GB of standard RAM, 7200 RPM conventional hard disk. Linux, 64bit, ext3. Note: The workstation used a reduced index of only 26GB (~7 million records) for this test.
Legend:
- 8GB/4GB/2GB refers to the amount of RAM available to the whole system. The test code itself uses just under 1GB of that.
- i37 refers to the index-size 37GB.
- q/s is queries/second averaged over all queries.
Threading
Performance of threaded searches with a single searcher shared between threads. The difference between this test and the one above is a larger index and the use of threading. t1/t2/t3/t4 signify the number of concurrent threads used for searching.
Performance of threaded searches with a unique searcher for each thread.
Observations
An individual searcher for each thread performs significantly better than a single searcher shared between all threads, at least in the long run.
The sweet spot for the number of threads seems to be the number of CPU-cores or one more than the number of CPU-cores.
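The two observations above can be combined in a small sketch: a fixed pool of worker threads, each lazily creating its own searcher. This is only an illustration under stated assumptions, not the test code used for the measurements. The `Searcher` interface, `PerThreadSearch` and `searchAll` are hypothetical names introduced here; in a real setup the factory would open a Lucene searcher on the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

final class PerThreadSearch {
    /** Hypothetical stand-in for an index searcher; not a Lucene type. */
    interface Searcher {
        int search(String query); // returns the number of hits
    }

    /** Rule of thumb from the observations: number of cores, or one more. */
    static int suggestedThreads() {
        return Runtime.getRuntime().availableProcessors() + 1;
    }

    /** Runs all queries on a fixed pool where every worker thread owns its own searcher. */
    static List<Integer> searchAll(List<String> queries, Supplier<Searcher> factory, int threads) {
        // Each worker thread calls the factory once, on its first query.
        ThreadLocal<Searcher> perThread = ThreadLocal.withInitial(factory);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String query : queries) {
                Callable<Integer> task = () -> perThread.get().search(query);
                pending.add(pool.submit(task));
            }
            // Collect hit counts in the same order as the queries.
            List<Integer> hits = new ArrayList<>();
            for (Future<Integer> result : pending) {
                try {
                    hits.add(result.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return hits;
        } finally {
            pool.shutdown();
        }
    }
}
```

The trade-off is memory and warm-up: every thread pays for its own searcher, which matches the slower start noted under "Warming up".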
Warming up
Warming the searchers means running realistic queries before providing access to the searchers. Lucene is known to benefit a lot from warming. This is partly due to Lucene's internal structures being initialized and values being cached, and partly due to the system disk cache being populated.
- The performance of threads with individual searchers is lower than that of threads with a shared searcher until the system has been properly warmed.
- Solid State Drives can be warmed to approximately 2/3 of peak performance with only 1000 queries. This takes 8 seconds with two threads with shared searchers and 18 seconds with three threads with separate searchers.
- Hard drives can be warmed to approximately 2/3 of peak performance with about 15000-30000 queries. This takes 5-7 minutes with two threads with shared searchers and 10-15 minutes with three threads with separate searchers.
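As a rough illustration of warming, the sketch below replays a capped number of logged queries against a searcher before it is handed to users. `SearcherWarmer`, `warm` and the `Searcher` interface are hypothetical names, not part of Lucene or of the test code. The cap would be chosen from the figures above, for example around 1000 queries for Solid State Drives.

```java
final class SearcherWarmer {
    /** Hypothetical stand-in for an index searcher; not a Lucene type. */
    interface Searcher {
        int search(String query); // returns the number of hits
    }

    /**
     * Replays up to maxQueries logged queries and discards the results.
     * Returns the number of queries actually run.
     */
    static int warm(Searcher searcher, Iterable<String> loggedQueries, int maxQueries) {
        int run = 0;
        for (String query : loggedQueries) {
            if (run >= maxQueries) {
                break;
            }
            searcher.search(query); // result ignored; only the side effects on caches matter
            run++;
        }
        return run;
    }
}
```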
RAM vs SSD vs Harddisks
Turns out we could cram 14GB of index into RAM on our machine (see the page history for the previous test with 9GB). We tried loading the index into RAM and pitted it against the same index on SSD and conventional hard disks, with the available memory reduced to 3GB.
Observations
Using SSD with multiple independent searchers and just 3 GB of RAM can give performance nearly on par with the pure RAM setup using 24 GB of RAM. It's very interesting to see that the RAM-based searchers also need some time to ramp up in speed, which shows that a substantial part of the warm-up time is due to Lucene's internal structure initialization - faster storage doesn't help here.
CPU-core scaling
We got 3 new machines in mid-August 2008. They are quad-core Xeon machines with 6MB of level 2 cache, 16GB RAM and 4 * 64GB SSD in RAID 0. We're in love.
In the graph below, metis is our "old" dual-core machine with 2 * 32GB MTRON SSDs in RAID 0, while prod is one of the new machines. As can be seen, metis is a bit faster than prod for 1-2 threads, after which prod pulls ahead. Not surprising. What's interesting is that speed continues to increase at a great pace up to 4-5 threads - the CPU is the bottleneck, not the SSDs.
Looking at the performance increase, we can calculate how many more raw queries/sec we can deliver for each extra thread:
- 2 threads is 192% ~ 92% extra / extra thread
- 3 threads is 257% ~ 79% extra / extra thread
- 4 threads is 308% ~ 69% extra / extra thread
- 5 threads is 329% ~ 57% extra / extra thread
- 6 threads is 313% ~ 43% extra / extra thread
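The figures in the list are consistent with taking throughput relative to a single thread (100%) and dividing the gain by the number of extra threads, i.e. (relative% - 100) / (threads - 1). A minimal sketch of that arithmetic, with `ThreadScaling` and `extraPerExtraThread` as names made up for this example:

```java
final class ThreadScaling {
    /**
     * Average gain per extra thread, in percent of single-thread throughput.
     * relativePercent is throughput relative to one thread, e.g. 192 for 2 threads.
     */
    static double extraPerExtraThread(double relativePercent, int threads) {
        if (threads < 2) {
            throw new IllegalArgumentException("need at least 2 threads");
        }
        return (relativePercent - 100.0) / (threads - 1);
    }

    public static void main(String[] args) {
        // Relative throughput for 2..6 threads, taken from the list above.
        double[] relative = {192, 257, 308, 329, 313};
        for (int i = 0; i < relative.length; i++) {
            int threads = i + 2;
            System.out.println(threads + " threads: "
                    + Math.round(extraPerExtraThread(relative[i], threads))
                    + "% extra per extra thread");
        }
    }
}
```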
Closer to the real world
Work in progress
What we're aiming for with Summa is running updates of the index. Luckily the low warm-up time for Solid State Drives means that sub-minute update times are possible with a plain index. Lucene 2.3 promises better performance when re-opening an index though, so the lead by Solid State Drives might not be that huge in The Real World.
Scenario 1: An index of approximately the same size and layout as our current one. Additions are made every minute and we accept 10 seconds from a commit until the changes are reflected (the warm-up time). The warm-up will take place in the background, affecting the performance of the active searcher, but we'll ignore that for now. What we can't ignore is that it takes approximately 5 seconds just to open the index (pending experiments with re-open).
Scenario 2: An index of approximately the same size and layout as our current one. Additions are made every 10 minutes and we accept that there is 10 minutes from a commit until the changes are reflected (the warm-up time). As before, the warm-up will take place in the background.
Scenario 3: An index of approximately the same size and layout as our current one. Additions are made every day and we accept that there is 1 hour from a commit until the changes are reflected (the warm-up time).
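All three scenarios share the same pattern: open a new searcher after a commit, warm it in the background, then switch over. The sketch below shows one possible way to do the switch with an atomic reference. It is an assumption about how this could be structured, not a description of Summa. `SearcherSwap`, `refresh` and the `Searcher` interface are hypothetical names, and closing the old searcher is left out.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class SearcherSwap {
    /** Hypothetical stand-in for an index searcher; not a Lucene type. */
    interface Searcher {
        int search(String query); // returns the number of hits
    }

    // The searcher currently serving user queries.
    private final AtomicReference<Searcher> active = new AtomicReference<>();

    SearcherSwap(Searcher initial) {
        active.set(initial);
    }

    /** User queries always go to whichever searcher is active right now. */
    int search(String query) {
        return active.get().search(query);
    }

    /**
     * Opens a fresh searcher, warms it, then publishes it.
     * Intended to be called from a background thread after a commit;
     * the old searcher keeps serving queries until the swap.
     */
    void refresh(Supplier<Searcher> opener, Iterable<String> warmQueries) {
        Searcher fresh = opener.get();   // the slow "open the index" step
        for (String query : warmQueries) {
            fresh.search(query);         // warm-up; results are discarded
        }
        active.set(fresh);               // changes become visible from here on
    }
}
```

The delay from commit to visibility is then the open time plus the warm-up time, which is what the scenario budgets (10 seconds, 10 minutes, 1 hour) have to cover.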
Upcoming production system
We got 3 quad-core Xeon machines with 6MB cache and 16GB RAM. Each one is equipped with:
- 4 * 64GB Samsung MCCOE64G5MPP-0VA00 Flash drives in RAID 0.
- 700 GB of standard speed storage for input files, logs, record storage and such.
Test-code
Due to legal issues, the package below only contains a subset of the files needed for actually compiling and running the test. However, it should be enough for a quick review.
