Author Topic: Branchless maximum/principle axis  (Read 11201 times)
Description: Compute principle axis without branching
laser blaster
Iterator
*
Posts: 178


« Reply #45 on: June 04, 2015, 10:22:14 PM »

I think my point was that eiffie's versions were actually slower than my naive version using three conditional branches and a normalize function. So don't be too afraid of branches.

And while abs is very likely to be a trivial function, I think it was unexpected that 'sign' turned out to be slower than 'normalize'. On my GPU 'sign' is compiled into the following intermediate code:
Code:
SLT.F R3.xyz, R1, {0, 0, 0, 0}.x;
SGT.F R1.xyz, R1, {0, 0, 0, 0}.x;
TRUNC.U R3.xyz, R3; // <- Truncate to unsigned integer
TRUNC.U R1.xyz, R1;
I2F.U R3.xyz, R3; // <- Convert back to float. Really?
I2F.U R1.xyz, R1;
ADD.F R1.xyz, R1, -R3;

while the 'normalize' is compiled into
Code:
MUL.F R1.xyz, R2, R1;
DP3.F R2.x, R1, R1; // <- a dot product
RSQ.F R2.x, R2.x;  // <- an inverse square root

which was quite unexpected for me.

So I'd advise simply measuring the performance instead of reasoning about it.

The normalize may be faster in practice, but the intermediate code isn't a very good indicator of speed, because it doesn't map directly to modern GPU machine code. A 3-component dot product is implemented as 3 scalar multiply-adds on current GPUs, so that's 3 instructions. The 3-component multiply will also compile to 3 instructions. Only very old GPUs still use native vector instructions. There is no way that I know of to view the actual native assembly code for any modern GPU.
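To illustrate the scalar expansion described above, here is a rough sketch in plain C (not shader code). The `vec3` type and the function names are made up for illustration, and the claim that each line corresponds to one scalar instruction is an assumption about a scalar GPU architecture, not something taken from a real disassembly.

```c
/* Hypothetical 3-component vector type, for illustration only. */
typedef struct { float x, y, z; } vec3;

/* A 3-component dot product written as scalar steps:
   one multiply followed by two multiply-adds (or three
   multiply-adds if the accumulator starts at zero). */
static float dot3(vec3 a, vec3 b) {
    float acc = a.x * b.x;   /* scalar multiply     */
    acc = a.y * b.y + acc;   /* scalar multiply-add */
    acc = a.z * b.z + acc;   /* scalar multiply-add */
    return acc;
}

/* A 3-component multiply is three independent scalar multiplies. */
static vec3 mul3(vec3 a, vec3 b) {
    vec3 r = { a.x * b.x, a.y * b.y, a.z * b.z };
    return r;
}
```

So a single DP3 or MUL line in the intermediate code can stand for several native instructions, which is why counting intermediate instructions can be misleading.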

And abs is definitely very fast. I've heard that on some GPUs it's effectively free, as it can be combined into other instructions. But sign() being so slow is quite a shocker. It should be implemented as (f>=0) ? 1 : -1, which shouldn't be more than 2 instructions: a compare and a conditional move instruction. I don't know why they did it in such a complicated way.
Syntopia
Fractal Molossus
**
Posts: 681



« Reply #46 on: June 05, 2015, 04:51:21 PM »

Quote
The normalize may be faster in practice, but the intermediate code isn't a very good indicator of speed, because it doesn't map directly to modern GPU machine code. A 3-component dot product is implemented as 3 scalar multiply-adds on current GPUs, so that's 3 instructions. The 3-component multiply will also compile to 3 instructions. Only very old GPUs still use native vector instructions.

Yes, it will only hint at what is happening. Note that even with the machine code you would still need to know how many cycles each instruction takes.

Quote
There is no way that I know of to view the actual native assembly code for any modern GPU.
And abs is definitely very fast. I've heard that on some GPUs it's effectively free, as it can be combined into other instructions. But sign() being so slow is quite a shocker. It should be implemented as (f>=0) ? 1 : -1, which shouldn't be more than 2 instructions: a compare and a conditional move instruction. I don't know why they did it in such a complicated way.

The native assembly for ATI cards can be viewed using their "GPU ShaderAnalyzer" (you can choose between all their architectures and see the machine code). For instance, sign(x) on ATI compiles to:

Code:
      0  y: SETGT       ____,  0.0f,  KC0[0].x     
         z: SETGT       ____,  KC0[0].x,  0.0f     
      1  x: ADD         R0.x,  PV0.z, -PV0.y

while (f>=0) ? 1 : -1 compiles into

Code:
      0  y: SETGT_DX10  ____,  KC0[0].x,  0.0f     
      1  x: CNDE_INT    R0.x,  PV0.y,  -1082130432,  1065353216     

which is exactly as expected (the reason for the difference is that sign(x) is required to have sign(0)=0).
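To make the difference concrete, here is a rough sketch in plain C (not shader code; the function names are made up for illustration) of the two variants. It also includes a small helper that reinterprets the two integer constants from the CNDE_INT line as IEEE 754 single-precision bit patterns: 1065353216 is 0x3F800000 (1.0) and -1082130432 is 0xBF800000 (-1.0).

```c
#include <stdint.h>
#include <string.h>

/* sign(x) as two compares and a subtract, mirroring the
   SETGT / SETGT / ADD listing. Returns 0 for x == 0. */
static float sign_compare_sub(float x) {
    float gt = (x > 0.0f) ? 1.0f : 0.0f;  /* x > 0 */
    float lt = (0.0f > x) ? 1.0f : 0.0f;  /* 0 > x */
    return gt - lt;
}

/* The two-instruction alternative from the thread:
   one compare and one select. Returns 1 for x == 0. */
static float sign_select(float x) {
    return (x >= 0.0f) ? 1.0f : -1.0f;
}

/* Reinterpret a 32-bit integer as a float bit pattern,
   to decode the integer operands of CNDE_INT. */
static float bits_to_float(int32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The two functions agree everywhere except at zero, where `sign_compare_sub(0.0f)` gives 0 and `sign_select(0.0f)` gives 1, so the extra instruction is the price of the sign(0)=0 requirement.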

I have also heard that abs (and saturate) should be free instructions, but I think that may depend on the architecture. On ATI archs abs is translated into a "MAX_DX10    ____,  KC0[0].y, -KC0[0].y" instruction.
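In other words, abs(x) is computed there as max(x, -x). A minimal C sketch of that identity (the function names are hypothetical, and this ignores how the hardware treats NaN or signed zero):

```c
/* Plain two-operand maximum. */
static float max2(float a, float b) {
    return (a > b) ? a : b;
}

/* abs(x) expressed as max(x, -x), as in the MAX_DX10 line. */
static float abs_via_max(float x) {
    return max2(x, -x);
}
```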


eiffie
Guest
« Reply #47 on: June 05, 2015, 05:22:21 PM »

Thanks for the info Syntopia - very helpful as always.