Btw, you said that you can understand why someone would write ASM rather than Delphi code, and that Delphi is slow. Do you have experience with this (I would be really interested in exchanging serious facts on this topic), or is this just an "opinion"?
It's not just an opinion. I am speaking from my experience writing Delphi code about 10 years ago. Back then, 32-bit computers and 32-bit Delphi were still relevant. As anyone who has programmed in Delphi knows, Delphi compilers use their own Object Pascal dialect of Pascal. Although it is not an interpreted language (Delphi does generate native code), I did regard it as inferior to C. Why? I found that Delphi generated very slow, inefficient code for floating-point arithmetic. So most of the time I rewrote everything dealing with floating-point math in ASM. I did profile, and the difference was very noticeable. Unfortunately, I no longer have the software or the actual profiling results.
Here is some interesting reading material on this very topic:
Here are some quotes:
For floating-point arithmetic, the Delphi compiler is nowadays deprecated. For instance, the current Delphi compiler is outperformed by latest Javascript engines using on the fly compilation into SSE: you'll have to code SSE by hand for acceptable results. I hope that SSE code in the upcoming 64 bit compiler will change the results here.
According to your code, what is slow with the 32 bit Delphi compiler is the floating point arithmetic support, which is far from optimized, and copy a lot of content on/to the FPU stack.
In respect to floating point arithmetic, not only Java JITted code will be faster. Even modern JavaScript JIT compilers can be much better than Delphi!
XE2 32bit compiler still uses the old FPU code
XE2 64bit compiler get a nice boost from using SSE2
Still those are mostly nitpickings compared to the massive issues of the old FPU code compilation (which, alas XE2 – Win32 still suffers from).
It looks to me that the overall best speeds go to the SSE2 optimized code, but especially the Double code for SSE2.
Unfortunately, a quick search of the Internet suggests the Delphi compiler doesn't have an option for using SSE floating-point math in 32-bit code.
Anyways
I believe that before optimizing anything by converting code to assembly, one first needs to make sure the algorithm itself is optimal. It's also very important to profile, so that you actually measure how much more efficient your code becomes after some "optimization trick".
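To make the "measure it" point concrete, here is a minimal sketch of how a before/after comparison could look. It is in C rather than Delphi, and everything in it is my own illustration: the names `kernel_fn`, `sum_squares_baseline`, `sum_squares_unrolled` and `time_kernel` are hypothetical, and the "optimized" variant is just a stand-in for whatever trick (hand-written ASM, SSE, and so on) you want to evaluate. It uses only the standard `clock()` function, which measures CPU time and has fairly coarse resolution, so the input has to be large enough for the timings to mean anything.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel type: any function that reduces an array to one value. */
typedef double (*kernel_fn)(const double *data, size_t n);

/* Baseline version: a straightforward sum of squares. */
double sum_squares_baseline(const double *data, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += data[i] * data[i];
    return acc;
}

/* Candidate "optimization trick" (an assumption, for illustration only):
 * two independent accumulators, which may or may not be faster on a
 * given compiler and CPU. That is exactly what the timing should decide. */
double sum_squares_unrolled(const double *data, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += data[i] * data[i];
        acc1 += data[i + 1] * data[i + 1];
    }
    if (i < n)                      /* leftover element when n is odd */
        acc0 += data[i] * data[i];
    return acc0 + acc1;
}

/* Runs `fn` `repeats` times and returns the best (smallest) CPU time in
 * seconds. The value computed by the last run is stored in *result so the
 * two variants can be checked against each other.
 * If repeats <= 0, nothing is run, -1.0 is returned and *result is untouched. */
double time_kernel(kernel_fn fn, const double *data, size_t n,
                   int repeats, double *result)
{
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        clock_t start = clock();
        double value = fn(data, n);
        clock_t stop = clock();
        double elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
        *result = value;
    }
    return best;
}
```

The idea is to call `time_kernel` once with each variant on the same large array, first check that both results agree (for floating-point sums, within a small tolerance, since the order of additions differs), and only then compare the two timings. Taking the best of several runs is one simple way to reduce noise from the rest of the system; a real profiler will tell you more.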