Author Topic: Branchless maximum/principal axis  (Read 12418 times)
Description: Compute the principal axis without branching
TruthSerum
Guest
« Reply #30 on: May 30, 2015, 01:21:11 AM »

Interesting results; I always knew there was something to branching on the GPU. I think the adjacent pixels' UVs must always be calculated for texture fetches, so that the mipmap LOD can be selected. Correct me if I am wrong about that.
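
On that note, a minimal GLSL sketch of one way this bites in practice (my own aside, not something from this thread): implicit-LOD fetches in non-uniform control flow have undefined derivatives, so an explicit-LOD fetch avoids depending on the adjacent UVs. The sampler name 'tex' is just a placeholder.
Code:
// Sketch only: texture() in divergent control flow relies on implicit
// derivatives (neighbouring UVs); textureLod() avoids that by forcing the LOD.
uniform sampler2D tex; // placeholder sampler

vec4 fetchInBranch(vec2 uv, bool taken) {
    if (taken) {
        return textureLod(tex, uv, 0.0); // explicit LOD, no derivatives needed
    }
    return vec4(0.0);
}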

I think the penalty on the CPU is smaller; I'm not sure it causes a full pipeline flush, as dom767 said. I mean, consider how many background applications are taking branches even while your system is idle. The CPU is designed to execute bad code very well :)

I like the shader, dom767; I left a comment on ShaderToy for it :)
Syntopia
« Reply #31 on: May 30, 2015, 01:48:53 PM »

Quote from: TruthSerum on May 30, 2015, 01:21:11 AM
I think the penalty on the CPU is smaller; I'm not sure it causes a full pipeline flush, as dom767 said. I mean, consider how many background applications are taking branches even while your system is idle. The CPU is designed to execute bad code very well :)

An idle background process would probably have its branches correctly predicted most of the time, so there would be no penalty. Besides, the pipeline on modern CPUs is not that long: roughly 12-14 stages on the Core 2 architecture. This is completely negligible compared to the cost of switching between background processes, which involves storing the CPU state and switching virtual address spaces. The context-switch cost on a Core 2 is roughly 5-7 microseconds, or some 10,000 instructions.

For very simple conditional statements, the compiler may also opt to use 'predicated' instructions, like the x86 CMOV instruction, whose result only takes effect if a condition flag is set. This avoids any pipeline flush. (Predication is used extensively on the ARM architecture, where most instructions can be predicated.)

Btw, this is a great resource for CPU optimizations: http://www.agner.org/optimize/
lycium
« Reply #32 on: May 30, 2015, 02:43:28 PM »

An amazing amount of discussion here for fabs() :)

I know computer architecture is a complex thing and one shouldn't oversimplify or make assumptions, but I'm going to do exactly that now :D

Straight up, I don't see how anything can be faster than *unconditionally* setting a single (sign) bit. If you think about the amount of logic that goes into branching on modern parallel and superscalar architectures, there's just no way you want to do anything else. As mentioned, it's even worse for GPUs due to the serialisation of branch divergence.

So just use fabs, right? :D In general, always avoid a branch if it's cheap to do so - that's why eiffie's method works well: abs, grabbing the sign bit and selecting between two constant literals are all more or less free on a latency-hiding, branch-fearing GPU.
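
For concreteness, a minimal sketch of that "select between two constant literals" idiom; the function and parameter names are placeholders of mine, not eiffie's actual code:
Code:
// Sketch only: branchlessly pick one of two constant literals from the sign of x.
float pickBySign(float x, float ifNegative, float ifNonNegative) {
    // step(0.0, x) is 0.0 for x < 0.0 and 1.0 otherwise - no branch involved
    return mix(ifNegative, ifNonNegative, step(0.0, x));
}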
« Last Edit: May 30, 2015, 02:49:32 PM by lycium »

Syntopia
« Reply #33 on: May 30, 2015, 04:21:40 PM »

I think my point was that eiffie's versions were actually slower than my naive version using three conditional branches and a normalize function. So don't be too afraid of branches.

And while abs is very likely to be a trivial function, I think it was unexpected that 'sign' turned out to be slower than 'normalize'. On my GPU 'sign' is compiled into the following intermediate code:
Code:
SLT.F R3.xyz, R1, {0, 0, 0, 0}.x;
SGT.F R1.xyz, R1, {0, 0, 0, 0}.x;
TRUNC.U R3.xyz, R3; // <- Truncate to unsigned integer
TRUNC.U R1.xyz, R1;
I2F.U R3.xyz, R3; // <- Convert back to float. Really?
I2F.U R1.xyz, R1;
ADD.F R1.xyz, R1, -R3;

while the 'normalize' is compiled into
Code:
MUL.F R1.xyz, R2, R1;
DP3.F R2.x, R1, R1; // <- a dot product
RSQ.F R2.x, R2.x;  // <- an inverse square root

which was quite unexpected for me.

So I'd advise simply measuring the performance instead of reasoning about it.
eiffie
Guest
« Reply #34 on: May 30, 2015, 05:25:41 PM »

Hey Syntopia, did you by chance try Knighty's version where he converts between float and bool? I have always wondered what that compiles to. Does it leave them untouched?

...or sign(p)*max(vec3(greaterThan(a,max(a.yzx,a.zxy))),0.0);

@lycium I think bit checking is just too low an occupation for modern GPUs to be bothered with.
Syntopia
« Reply #35 on: May 30, 2015, 06:57:21 PM »

Code:
vec3 v = abs(vec);
float m = max(v.x, max(v.y, v.z));
v = vec3(float(v.x == m), float(v.y == m && v.x != m), float(v.z == m && v.x != m && v.y != m));

// restore original sign
return normalize(v*vec);

compiles to

Code:
MAX.F R2.x, |R1.y|, |R1.z|;
MAX.F R2.x, |R1|, R2;
SNE.F R2.z, |R1.x|, R2.x;
SEQ.F R2.y, |R1|, R2.x;
SEQ.F R3.x, |R1.z|, R2;
SNE.F R3.y, |R1|, R2.x;
SEQ.F R2.x, |R1|, R2;
TRUNC.U R2.x, R2;
TRUNC.U R2.z, R2;
TRUNC.U R2.y, R2;
AND.U R2.y, R2, R2.z;
TRUNC.U R3.x, R3;
AND.U R2.z, R2, R3.x;
TRUNC.U R3.x, R3.y;
AND.U R2.z, R2, R3.x;
I2F.U R2.x, R2;
I2F.U R2.y, R2;
I2F.U R2.z, R2;
MUL.F R1.xyz, R2, R1;
DP3.F R2.x, R1, R1;
RSQ.F R2.x, R2.x;
MUL.F R1.xyz, R2.x, R1;

so it seems booleans are represented as floats (SNE.F/SEQ.F), converted to ints (TRUNC.U), AND'ed, and converted back to floats (I2F).
lycium
« Reply #36 on: May 30, 2015, 07:48:36 PM »

Quote from: Syntopia on May 30, 2015, 04:21:40 PM
So I'd advise simply measuring the performance instead of reasoning about it.

I agree in general, but I don't think we're referring to the same thing - I was specifically referring to fabs, which just cannot be slower than anything else because it's a single OR-operation with 0x800000... the top (sign) bit of an IEEE number.

I have to say I'm surprised that a normalize does better than that other code, too - but that again is a more complex story of data dependencies limiting pipelining, etc.

Quote from: eiffie on May 30, 2015, 05:25:41 PM
@lycium I think bit checking is just too low an occupation for modern GPUs to be bothered with.

Not at all... I also don't really know what you mean by this... so if I have "x = y ^ z" somewhere in my code, the GPU goes "how dare you molest me with such trivial tasks?!!?" :D

GPUs are mostly like other CPUs, working on binary IEEE floating-point data. In terms of raw throughput, you can't do better than a simple, unconditional bit operation.
« Last Edit: May 30, 2015, 07:59:55 PM by lycium »

Syntopia
« Reply #37 on: May 30, 2015, 11:19:23 PM »

Quote from: lycium on May 30, 2015, 07:48:36 PM
I agree in general, but I don't think we're referring to the same thing - I was specifically referring to fabs, which just cannot be slower than anything else because it's a single OR-operation with 0x800000... the top (sign) bit of an IEEE number.

Indeed - I was talking about branching, not abs.

Btw, I think your OR operation would make the number negative (IEEE 754 numbers are multiplied by (-1)^sign). You'd want to AND with 0x7fffffff instead. And AND'ing floats is valid in neither C nor GLSL, so you have to do a reinterpretive cast. In GLSL this would be:

Code:
// Needs #version 330
float fabs(float f) {
    return intBitsToFloat(floatBitsToInt(f) & 0x7fffffff);
}

This *is* slower than using the built-in abs function (I tested).

The equivalent code in C would be (this snippet was found on the net):
Code:
// Clears the sign bit by reinterpreting the float's bits through a pointer cast.
float fastabs(float f)
{
    int i = ((*(int*)&f) & 0x7fffffff);
    return (*(float*)&i);
}

lycium
« Reply #38 on: May 30, 2015, 11:35:36 PM »

Ahhh yes, of course the sign bit should not be set for a positive number :)
Syntopia
« Reply #39 on: May 30, 2015, 11:55:48 PM »

Interestingly, this can be used to build a 'sign' function which is faster than what the built-in function produces:
Code:
// Notice: returns +1 for 0 as input.
float fastSign(float f) {
   // Bitwise OR of the number '1' with the sign-bit of f
   return intBitsToFloat(0x3F800000 | (0x80000000 & floatBitsToInt(f) ));
}

vec3 fastSign3(vec3 f) {
   return vec3(fastSign(f.x),fastSign(f.y),fastSign(f.z));
}

Using this makes eiffie's function faster than my branchy version.
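
For reference, a sketch of how the pieces fit together, assembled by me from eiffie's one-liner earlier in the thread (the function name is a placeholder, and I haven't benchmarked this exact form):
Code:
// Sketch only: eiffie-style principal-axis select with fastSign3 replacing sign().
vec3 principalAxisFast(vec3 p) {
    vec3 a = abs(p);
    // 1.0 where a component strictly beats both rotated copies, else 0.0
    vec3 m = vec3(greaterThan(a, max(a.yzx, a.zxy)));
    return fastSign3(p) * m; // restore the original sign
}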
« Last Edit: May 31, 2015, 12:32:28 AM by Syntopia »
lycium
« Reply #40 on: May 31, 2015, 12:00:19 AM »

maybe just floatBitsToInt(f) >> 31?

TruthSerum
Guest
« Reply #41 on: May 31, 2015, 12:02:18 AM »

Also, a proper sign function should handle the edge case sign(0) == 0, which I don't think can be achieved with a single bit shift.
Syntopia
« Reply #42 on: May 31, 2015, 12:40:24 AM »

Quote from: lycium on May 31, 2015, 12:00:19 AM
maybe just floatBitsToInt(f) >> 31?

I don't think that will work: for a negative number, it will produce all binary 1's (which is -1 as an integer); for a positive number, it will produce all binary 0's (which is 0 as an integer).

You could salvage it as:
Code:
float fastSign(float f) {
    return float(((floatBitsToInt(f) >> 31) << 1) + 1); // "-1 << 1" is -2
}
But now you need a conversion from integer to float (instead of a reinterpretive cast, which is free) plus some additional operations, which makes this function somewhat slower than the one using AND and OR.
lycium
« Reply #43 on: May 31, 2015, 12:42:39 AM »

Sorry again, I really should have checked the rest of the code for context :)

Code:
float fastSign(float f) { return (floatBitsToInt(f) >> 31) != 0 ? -1.0 : 1.0; }

float fastSignWithZero(float f) { return (f == 0.0) ? 0.0 : ((floatBitsToInt(f) >> 31) != 0 ? -1.0 : 1.0); }

Edit: Actually, this seems silly, it should just be (f >= 0) ? 1 : -1.
« Last Edit: May 31, 2015, 12:53:38 AM by lycium »

TruthSerum
Guest
« Reply #44 on: June 04, 2015, 06:54:26 AM »

By the way, I noticed that 0.5 + 0.5*sign(x) (1) behaves slightly differently from max(sign(x), 0.0) (2).

With (1), if x = 0.0 it returns 0.5, which might not be desirable. That is not the case with (2), which returns 0.0 for x = 0.0.

The point of this was to create a mask out of the sign of the value, presumably with the intention of using mix (or similar) afterwards to select between two values without branching.
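
A minimal sketch of that mask-and-mix idiom; the function and parameter names here are placeholders of mine, not from the thread:
Code:
// Sketch only: build a 0/1 mask from the sign of x and select with mix().
// Uses variant (2), so x == 0.0 yields mask 0.0 and selects 'a'.
float selectBySign(float x, float a, float b) {
    float mask = max(sign(x), 0.0); // 1.0 if x > 0.0, else 0.0
    return mix(a, b, mask);         // returns a when mask is 0.0, b when 1.0
}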