Having benchmarked different ways to zero an array, I turned to the related question of copying lumps of floating-point data from one place to another, which can be done in a similar range of ways. I benchmarked this in the same way as in my first note, using the analogous approach in each case (except for method 9, which has no analogue here):
| Method | Mac PPC | Linux Intel |
|---|---|---|
| 1, sc3 | 21 % | 69 % |
| 2, for, array | 40 % | 75 % |
| 3, for, post | 38 % | 51 % |
| 4, for, pre | 38 % | 75 % |
| 5, do-while | 39 % | 75 % |
| 6, duff's, post | 40 % | 56 % |
| 7, duff's, pre | 40 % | 75 % |
| 8, memcpy | 13 % | 39 % |
| 10, unrolled-for | 39 % | 47 % |
(These results are for copying aligned blocks of data. I also ran a test using unaligned blocks, but there were no differences worth reporting.)
For PPC Mac it's a very consistent story: all of the loopy methods take essentially the same amount of effort (38 % to 40 %). JMC's crafty use of doubles is a clever optimisation here (21 %), but (as in the zeroing test) there's a definite outright winner, and it's simpler: memcpy (13 %).
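For readers who haven't seen the first note, the "loopy" methods are variations on the shapes below. This is only a sketch: the benchmark code itself isn't reproduced here, so the function names and the exact form of each loop are my assumptions rather than the code that was timed.

```c
#include <assert.h>
#include <stddef.h>

/* Array indexing, in the spirit of method 2 (hypothetical name). */
void copy_index(float *dst, const float *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Post-incremented pointers, in the spirit of method 3 (hypothetical name). */
void copy_postinc(float *dst, const float *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    while (n--)
        *dst++ = *src++;
}

/* Pre-incremented pointers, in the spirit of method 4 (hypothetical name).
   The first element is copied up front so that neither pointer ever
   has to be stepped back to before the start of its array. */
void copy_preinc(float *dst, const float *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    if (n == 0)
        return;
    *dst = *src;
    while (--n)
        *++dst = *++src;
}
```

All three do the same work per element, which fits with the near-identical PPC figures.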
For Intel Linux there's more variation in the results. For some reason postincremented pointers do better than their alternatives (51 % and 56 % against 75 %), and the unrolling in method 10 helps noticeably (47 %). But again, memcpy is the outright winner (39 %).
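The unrolling that helps on Intel Linux can be sketched like this. The unroll factor of 4 and the name `copy_unrolled4` are my assumptions, since method 10's actual code isn't shown in this note.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n floats, four per loop iteration, then mop up the remainder.
   Hypothetical illustration of an unrolled for loop. */
void copy_unrolled4(float *dst, const float *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    size_t i = 0;
    size_t unrolled_end = n - (n % 4);   /* largest multiple of 4 <= n */

    for (; i < unrolled_end; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Up to three leftover elements. */
    for (; i < n; ++i)
        dst[i] = src[i];
}
```

The idea is simply to pay the loop-counter and branch overhead once per four elements rather than once per element.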
So it looks like the recommendation is a direct parallel of the first test: memcpy() please, in this kind of circumstance. YMMV.
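For completeness, here is roughly what the memcpy version looks like for a block of floats. The wrapper name `copy_memcpy` is hypothetical; the only real API used is the standard `memcpy` from `<string.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy n floats from src to dst using the standard library.
   memcpy takes a size in bytes, hence the multiplication by the
   element size. */
void copy_memcpy(float *dst, const float *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    if (n > 0)
        memcpy(dst, src, n * sizeof *src);
}
```

One caveat: memcpy requires that the source and destination blocks don't overlap. If they might, memmove is the standard alternative.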