Following some comments on the linux-raid mailing list about poor throughput for raid5 in 2.6, I did some systematic measurement on my test machine to look for patterns. The results were illuminating.
I compared raid0 with raid5 in 2.4.27 and 2.6.9. The results showed that reading from raid5 is significantly slower in 2.6 (as reported) and also that writes were quite a bit slower too, even for raid0 (raid0 reads improved in 2.6). However, I didn't get the same degree of slowdown that has been reported elsewhere.
UPDATE: more results are available in a subsequent article, showing that the raid0 write throughput shown here is misleading.
Configuration
These tests were performed on a dual 3.20GHz Xeon with 4 Gig RAM. The 14 MAXTOR ATLAS15K_36SCA drives (15000 rpm, 36gig, SCSI) were connected, 7 each, to the two channels of an Adaptec AHA 3960 U160 SCSI card (I really should get a U320 SCSI card...).
Each array was created with the drives shared as evenly as possible between the two channels. RAID5 and RAID6 arrays were allowed to complete any resync before tests were performed.
Testing was done using "bonnie" on an ext3 filesystem created using all default settings. The md stripe size was also the default for mdadm (64K).
The Tests
Each test was a simple run of "bonnie" on a freshly created filesystem, on raid0 and raid5 arrays of 2 to 14 drives. Identical tests were run on 2.4.27 and 2.6.9. Also, raid6 was tested on 2.6.9 with arrays of 4 to 14 drives.
Ideal Results
The ideal expected results for raid0 would be a linear increase in throughput for both read and write as the array size increases, until a bus is saturated. At that point, the throughput should stay steadily at slightly under the theoretical limit of the bus.
For raid5, the expected read throughput would be the same as for a raid0 array with one fewer drive. The write throughput could well be less, but as bonnie does large sequential writes, this should translate into lots of full-stripe writes which don't require pre-reading, and so we should be able to approach the raw drive and bus speed.
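To illustrate why a full-stripe write needs no pre-reading, here is a minimal sketch, assuming plain XOR parity over equal-sized chunks. The function names are hypothetical and this is not the md driver's code; it only shows which inputs each kind of write needs.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def full_stripe_parity(new_data_chunks):
    # Every data chunk in the stripe is being replaced, so parity is
    # computed from the new data alone: nothing has to be read from disk.
    return xor_blocks(new_data_chunks)

def partial_write_parity(old_chunk, new_chunk, old_parity):
    # Only one chunk changes, so the old chunk and the old parity must be
    # read first (read-modify-write) before the new parity can be computed.
    return xor_blocks([old_parity, old_chunk, new_chunk])
```

Both paths give the same parity; the difference is that the partial write has to read two blocks before it can write anything.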
For the system being used, the drive speed is about 70MB/sec and the bus speed is 320MB/sec (though with an odd number of drives, the bus imbalance will cause a drop below this). Thus we might expect linear speed improvements up to 4 drives, then a leveling out.
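The expectation above can be written out as a small model. This is only a sketch of the idealised figures (70MB/sec per drive, 320MB/sec combined bus limit, raid5 reads equal to raid0 with one fewer drive); it ignores bus imbalance and any software overhead, and the function names are made up for illustration.

```python
DRIVE_MB_S = 70   # per-drive sequential speed quoted in the text
BUS_MB_S = 320    # combined limit of the two SCSI channels quoted in the text

def ideal_raid0(n_drives):
    """Linear scaling until the combined bus limit is reached."""
    return min(n_drives * DRIVE_MB_S, BUS_MB_S)

def ideal_raid5_read(n_drives):
    """Expected to match raid0 with one fewer drive."""
    return ideal_raid0(n_drives - 1)

for n in range(2, 15):
    print(n, ideal_raid0(n), ideal_raid5_read(n))
```

Four drives give 4 x 70 = 280MB/sec, still under the bus limit, while five would need 350MB/sec, which is why the curve is expected to flatten after 4 drives.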
Actual Results
You can see the measured results of the tests in the graphs.
Read
raid0 in 2.6 behaves almost as expected. There is a linear increase in speed from 2 to 4 drives followed by a leveling out a little below the combined SCSI bus speed of 320MB/sec. The dip at 5 drives would be due to the bus imbalance. The bus with three drives will effectively limit those drives to about 50MB/sec each (instead of 70), and the nature of sequential reads over raid0 will effectively limit all drives to this speed, so the aggregate would be expected to be about 250MB/sec, as is seen. However, a similar calculation for 7 drives would suggest an aggregate of 260MB/sec, which is less than what is seen, so there must be something else affecting the result.
raid0 in 2.4 shows similar behaviour, though the throughput from each drive seems less, and the effect of unbalanced buses is more pronounced. This would suggest that general improvements in the block device, mm, fs, and SCSI subsystems have made 2.6 work more efficiently with disc drives than 2.4.
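The bus-imbalance estimates for 5 and 7 drives can be reproduced with a short calculation. This is a sketch under stated assumptions: the effective per-channel rate of 150MB/sec is inferred from the "about 50MB/sec each" figure for three drives on one bus (it is not a measured value), drives are split as evenly as possible, and every drive in the raid0 array is held to the speed of the slowest one.

```python
DRIVE_MB_S = 70      # per-drive sequential speed quoted in the text
CHANNEL_MB_S = 150   # ASSUMED effective rate of one channel (3 drives x ~50MB/sec)

def raid0_read_estimate(n_drives):
    """Estimate aggregate raid0 read speed with drives spread over two channels."""
    busiest = (n_drives + 1) // 2                  # drives on the fuller channel
    per_drive = min(DRIVE_MB_S, CHANNEL_MB_S / busiest)
    return n_drives * per_drive                    # slowest drive paces the array

for n in (4, 5, 6, 7):
    print(n, raid0_read_estimate(n))
```

With these assumptions 5 drives come out at 250MB/sec and 7 drives at about 262MB/sec, matching the figures used in the discussion.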
Against these raid0 results, we can look at raid5, and be unimpressed. In 2.4, raid5 behaves close to expectation. It is one drive slower than raid0 until the bus is saturated, and then performs quite close to raid0 speeds.
2.6 is slightly faster than 2.4 for small arrays, but flattens out at a much lower throughput. The difference between raid5 and raid0 for larger arrays is around 80MB/sec. Either some extra data is being passed over the SCSI bus (though even if the parity block were always being read it wouldn't be this bad) or the bus is spending 25% of its time idle.
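As a rough check on both halves of that statement, the following sketch assumes a saturated 320MB/sec bus and an n-drive raid5 in which one block in every n is parity. The helper names are hypothetical, and the model deliberately ignores everything except the share of bus traffic taken by parity.

```python
BUS_MB_S = 320  # combined bus limit quoted in the text

def useful_read_if_parity_read(n_drives, bus_mb_s=BUS_MB_S):
    """Data throughput if parity were read too: (n-1)/n of bus traffic is data."""
    return bus_mb_s * (n_drives - 1) / n_drives

def idle_fraction(shortfall_mb_s, bus_mb_s=BUS_MB_S):
    """Fraction of bus capacity left unused for a given shortfall."""
    return shortfall_mb_s / bus_mb_s

for n in (8, 14):
    print(n, round(BUS_MB_S - useful_read_if_parity_read(n)), "MB/sec lost to parity")
print(idle_fraction(80))
```

Under these assumptions, reading parity would cost about 40MB/sec at 8 drives and about 23MB/sec at 14 drives, well short of 80MB/sec, while an 80MB/sec shortfall on a 320MB/sec bus is exactly a quarter of its capacity.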
Part of this slowdown could be explained by the fact that raid5 in 2.6 copies all data an extra time. Rather than reading directly into the request buffer, raid5 reads into a cache, and then copies from the cache into the request buffer. It is hard to imagine this accounting for all of the 80MB/sec drop in throughput though.
raid6 shows very similar figures to raid5, which is not surprising considering the similarity of the code.
Write
raid0 writes in 2.4 show a very nice graph. It is perhaps not as high a throughput as we would like, but it is linear to 4 drives, and then flattens out at around 220MB/sec (interestingly similar to the read speed for raid5 in 2.6, but that is unlikely to be significant).
raid0 writes in 2.6 are a real mess!! They match 2.4 up to 3 drives and then seem to be all over the place, with a sweet spot at 11 drives. A retest is really needed here to see if the apparently random pattern is repeatable.
raid5 in 2.4 also shows a nice pattern, though it is predictably well below raid0 for speed. raid5 in 2.6 is even slower, which isn't what we want to see.
It seems interesting that raid6 is one-drive slower than raid5 at 5-7 drives, then 2 drives slower up to 13 drives. However it is hard to see that this is significant.
Some of this drop in write speed must be attributed to the same cause that is affecting raid0 so much. The rest, if there is any more, could be related to the read slowdown.
Where next
Some directions that could be taken to understand this better include:
- re-test raid0 writes in 2.6
- test throughput on a degraded raid5 array, as there have been reports that this is better
- test with larger read-ahead settings
- test with a different I/O scheduler. Currently the anticipatory scheduler is being used, and that may not be the best, particularly for raid5 writes due to the write-after-read pattern (this scheduler expects read after read).
- implement cache-bypass for reads in raid5 (and raid6) in 2.6 and see how much it helps.
- pore over the code and try to figure out what is happening.
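For the read-ahead and scheduler experiments, it helps to record which settings were in force for each run. The sketch below assumes a kernel that exposes `/sys/block/<dev>/queue/scheduler` and `/sys/block/<dev>/queue/read_ahead_kb`; whether the kernels tested here provide these files is an assumption, and the device name `sda` is only a placeholder. The helper names are hypothetical.

```python
from pathlib import Path

def active_scheduler(sysfs_text):
    """Return the bracketed (active) name from text such as
    'noop [anticipatory] deadline cfq', or None if nothing is bracketed."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None

def queue_settings(dev):
    """Read scheduler and read-ahead for one block device, if sysfs provides them."""
    queue = Path("/sys/block") / dev / "queue"
    sched = queue / "scheduler"
    ra = queue / "read_ahead_kb"
    return {
        "scheduler": active_scheduler(sched.read_text()) if sched.exists() else None,
        "read_ahead_kb": int(ra.read_text()) if ra.exists() else None,
    }

if __name__ == "__main__":
    print(queue_settings("sda"))  # 'sda' is a placeholder device name
```

Logging this alongside each bonnie run would make it easier to tell later which scheduler and read-ahead value produced which curve.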