(My CPU is an Intel Haswell 4770K.)
So I was testing my Mandelbrot program yesterday and was experiencing intermittent, shockingly fast performance, seemingly without rhyme or reason. It would go really fast for a little while, then start going slower. On the fast runs I could hear my CPU fan spin up rather loud (it usually doesn't spin up at all); in fact, this was the first time I've heard my new CPU fan spin up at all since I got this monstrous new HSF a few months ago. I'd figured the heatsink is so big that the fan doesn't need to spin up much (it IS always spinning, just at a low speed), though that seemed kind of strange, since I'd still expect the fan to spin up under load while running Mandelbrot renders all the time. Anyway, the whole thing has me wondering what exactly is going on, and whether anything can be configured differently to coax out that high performance.
The difference in performance really is shocking, too: on one of the fast runs, the render completed in 27 seconds. So I started re-running the exact same render to see what kind of variance I'd get: the second run took 2m33s, the third 3m00s, the fourth 2m00s. Successive runs after that would seemingly randomly take 2m, 2.5m, or 3m. And I know my code takes the same path and does the exact same work every time, so it's not that my program itself is doing something different. I even have timers and progress bars for each stage of the computation (reference iteration, delta initialization, perturbation iteration, and rendering), and you can clearly see every stage uniformly going faster or slower across the board.
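For reference, the timing is nothing fancy, just wall-clock timing wrapped around each stage, roughly like this (a simplified sketch, not my actual code; the stage name and callable are stand-ins):

```cpp
#include <chrono>
#include <cstdio>

// Wrap one stage of the render (reference iteration, delta init,
// perturbation iteration, rendering) and report its wall-clock time.
template <typename Stage>
double time_stage(const char* name, Stage&& stage) {
    auto t0 = std::chrono::steady_clock::now();
    stage();
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%-24s %8.3f s\n", name, secs);
    return secs;
}
```

So a slow run isn't one stage misbehaving: every stage's reported time scales up together.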
That is upward of a 6x performance difference! That is positively insane. I simply can't picture the normal expected operating behavior of either turbo boost or thermal throttling accounting for that huge a difference.
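One thing that should settle whether clock speed is even the variable: log the actual core frequency during a fast run versus a slow run. A minimal sketch, assuming Linux with cpufreq (the sysfs path is the usual one but may vary by kernel/distro; on other OSes you'd need a different tool):

```cpp
#include <chrono>
#include <cstdio>
#include <fstream>
#include <thread>

// Poll the kernel's reported frequency for core 0 once a second.
// Run this alongside the render and watch for a drop on slow runs.
int main() {
    const char* path =
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
    for (;;) {
        std::ifstream f(path);
        long khz = 0;
        if (!(f >> khz)) break;   // file missing: no cpufreq, give up
        std::printf("core0: %.2f GHz\n", khz / 1e6);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```

If the slow runs show a much lower frequency than the fast ones, that points at throttling (or a stuck power-management state) rather than anything in the code.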
Possibly related (my program is explicitly vectorized), possibly not: I recently noticed a discussion among some HPC people about the black-box voodoo that goes on inside modern CPUs, specifically Intel chips with AVX. One interesting thing they mentioned was that when the CPU first starts getting hit with AVX instructions, it executes them at reduced speed for a short warm-up period before it fully powers up the 256-bit vector units. A more concerning thing mentioned was the existence of AVX-specific frequency throttling, where the CPU drops to a lower base clock under sustained AVX load. Googling for AVX throttling turns up discussions mostly about Xeon chips, and I found some claims that AVX throttling isn't supposed to happen on desktop chips at all. Not really sure what to make of this.
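If the warm-up story is real, it should at least be cheap to rule out: issue a burst of 256-bit instructions immediately before the timed region so the vector units are already powered up when the real work starts. A sketch, assuming x86 with AVX (untested; the iteration count is a guess, and _mm256_add_ps is just a convenient 256-bit op):

```cpp
#include <immintrin.h>   // compile with AVX enabled, e.g. -mavx

// Spin on 256-bit adds so the CPU powers up the upper vector lanes
// before the timed render begins. The warm-up period is reportedly
// on the order of microseconds, so 100k dependent adds is plenty.
static void avx_warmup() {
    __m256 a = _mm256_set1_ps(1.0f);
    const __m256 b = _mm256_set1_ps(1e-9f);
    for (int i = 0; i < 100000; ++i)
        a = _mm256_add_ps(a, b);               // dependent chain, won't be hoisted
    volatile float sink = _mm256_cvtss_f32(a); // defeat dead-code elimination
    (void)sink;
}
```

That said, warm-up should only cost microseconds to low milliseconds, so it can't explain a 27s-vs-3m difference by itself; AVX frequency throttling (if it applies to desktop parts at all) seems like the more plausible suspect of the two.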
So anyhow, does anybody have any experience with, or thoughts about, any of this? Not only is it maddening to think there could be that much latent performance in my system that I'm rarely if ever accessing, it also makes it a ridiculous prospect to optimize my code when system performance can vary this wildly. I don't even know how I'm supposed to tell whether a change made my code better or worse under these conditions...
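In the meantime, the least-bad way I can think of to compare two versions of the code is best-of-N timing: run each version several times and compare the minimums, since throttling and other interference only ever make a run slower, never faster. A sketch (render_once() is a stand-in for one full render):

```cpp
#include <algorithm>
#include <chrono>

void render_once();   // stand-in: one full render of the same scene

// Time n runs and keep the fastest; the minimum is the estimate
// least contaminated by throttling, background load, etc.
double best_of(int n) {
    double best = 1e300;
    for (int i = 0; i < n; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        render_once();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best,
            std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

Still, with swings this big that only helps so much, so I'd love to hear from anyone who has actually pinned down what causes this.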