Welcome to Fractal Forums

Fractal Software => Programming => Topic started by: TruthSerum on May 24, 2015, 06:52:40 PM




Title: Branchless maximum/principle axis
Post by: TruthSerum on May 24, 2015, 06:52:40 PM
How can I compute the principle axis of a vector without branching. By this I mean a function takes as input an arbitrary vector, and returns a unit vector that points along the main axis of the vector.

For example, if I have the vector (2.0, -4.0, 1.0), then the function I desire would return (0.0, -1.0, 0.0), since this input vector is "mostly" pointing in the negative Y direction.

Handling for corner cases has to be consistent, but never having more than a single element set to \pm1.0. So for example, (3.0, 3.0, 1.0) could be (1.0, 0.0, 0.0), or (0.0, 1.0, 0.0), but not (1.0, 1.0, 0.0).

I'm interested in the idea itself more than an implementation in any particular language. I'm sure this has been solved before.


Title: Re: Branchless maximum/principle axis
Post by: cKleinhuis on May 24, 2015, 07:15:41 PM
for this particular case i have no concrete solution, BUT any if query can be stated in terms of greater zero or smaller zero,

for example

if(a>1.0)then a=1.0
could be written (without branching)
a=min(a,1.0);

;)


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on May 24, 2015, 07:42:41 PM
Thanks, I tried subtracting the smallest component, which should set it to zero.

Code:
av = abs(v);
v -= sign(v) * min(av.x,min(av.y,av.z));

This will not work for corner vectors though: (1,1,1) and others will all become (0,0,0).


Title: Re: Branchless maximum/principle axis
Post by: hobold on May 24, 2015, 10:37:40 PM
Thanks, I tried subtracting the smallest component, which should set it to zero.

Code:
av = abs(v);
v -= sign(v) * min(av.x,min(av.y,av.z));

This will not work for corner vectors though: (1,1,1) and others will all become (0,0,0).
There is a standard trick to tweak algorithms which require distinct (non-equal) values. The basic idea is to append extra least significant digit(s) to the original input values.


A simplified illustrative example, with no regard to implementation:
Let's say the input vector's coordinates are stored with two fractional digits. So they can assume the values -1.00, -0.99. -0.98 ... and so on in increments of 0.01.

In order to make coordinate values unique, we append one extra digit, distinct and specific to each coordinate axis. Let's say "1" for the X axis, "2" for Y, and "3" for Z.

So, the problematic input vector (1, 1, 1) would turn into (1.001, 1.002, 1.003). All values are now unique.


In practice, with floating point values, it is probably easier to replace a few least significant bits uniquely instead of appending extra bits. So you would mask out the two least significant bits, and replace them uniquely for each coordinate with "00", "01", or "10".

This does make the algorithm imprecise. Depending on where the input data is coming from, the imprecision may be lost in the noise.


Title: Re: Branchless maximum/principle axis
Post by: dom767 on May 24, 2015, 11:58:13 PM
Love hobold's suggestion here. Very neat.

I got down to two compares robustly...

Code:
vec /= max(x,y,z)

if (abs(x)==1)
 return vec(x,0,0)
if (abs(y)==1)
 return vec(0,y,0)
return vec(0,0,z)

Losing a bit of precision is probably a good payoff for the conditionals!


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on May 25, 2015, 01:03:16 AM
Thanks, this routine might do for now actually. I think a==b can be performed using a mask:

Code:
float m = abs(sign(a-b)); // m=0.0 if a==b, m=1.0 otherwise

But then you'd need to mask out the other values because you can't return, as in your routine. It probably just works out faster to branch anyway.

The offset (adding a small fraction) might not be necessary after all. I think that I tend to avoid epsilon/tolerance values if I can help it.


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on May 26, 2015, 06:22:44 PM
Thanks to eiffie for his improved formula. Again, using two branches, but saves on the division for normalization.

Code:
vec3 paxis(vec3 p)
{
    vec3 a=abs(p),r = vec3(1.0,0.0,0.0);
    if(a.z>=max(a.x,a.y))r=r.yzx;
    else if(a.y>=a.x)r=r.zxy;
    return r*sign(p);
}


Title: Re: Branchless maximum/principle axis
Post by: dom767 on May 26, 2015, 07:28:23 PM
Eiffie's definitely going to win out on HLSL. Be careful if you're doing this on CPU using C++ because a lot of people do sign() using compares anyway to avoid having to understand their systems FP format, and vector swizzling isn't free on CPU...

The code I wrote is bugged btw, you'd need something like this...

Code:
vec /= max(abs(vec.x), abs(vec.y), abs(vec.z));

if (abs(vec.x)==1) vec = vector(vec.x,0,0);
else if (abs(vec.y==1) vec = vector(0,vec.y,0);
else vec = vector(0,0,vec.z);

return vec;

And anyway, the standard implementation of abs() in C++ uses a compare... (doh!)

Assuming HLSL has a floor function, how about this ugly little beast?

Code:
av = abs(vec)
vec /= max(av.x, max(av.y,av.z));
return floor(abs(vec))*vec;


Title: Re: Branchless maximum/principle axis
Post by: dom767 on May 26, 2015, 07:35:06 PM
Ahh ignore me. Same old problem with the vec(1,1,1) argument. :( So we'd need to use hobold's trick to solve that one...


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on May 26, 2015, 08:28:42 PM
The (1,1,1) edge case is annoying indeed. I wonder about the floor of negative -1<x<0 values of x, because I think floor(-0.5) = -1, not 0.0 as you would need here. I don't think there is a function for that, trunc(), or similar, in GLSL.

Also, I had problems with x/x being exactly 1 (so the abs(x)==1.0 comparisons failed sometimes). This seems strange because I thought the FPU guaranteed this.

Thanks for the advice about CPU implementations. I don't think branching is as bad though on PC (x86/x64), or so I have heard.


Title: Re: Branchless maximum/principle axis
Post by: dom767 on May 26, 2015, 09:59:24 PM
Floor (abs (vec))×vec takes care of the rounding down issue with negative numbers.

Branch cost on CPU is a pipeline flush multiplied by the probability of misprediction. Probably 5-10 cycles average although i lose track of how deep the pipelines are these days...

On GPU the control logic is usually shared between multiple pipelines so 2 branches which split 50/50 reduces execution speed by a factor of 4.


Title: Re: Branchless maximum/principle axis
Post by: dom767 on May 26, 2015, 11:49:35 PM
And just in case anyone was after the branchless principle axis algorithm...

Code:
av = abs(vec)
vec /= max(av.x, max(av.y,av.z));
av = floor(abs(vec)) * vector(1,0.9,0.8)
av /= max(av.x, max(av.y, av.z));
vec = floor(av) * vec;

branch-free, untested.

The *(1, 0.9, 0.8 ) simulates the trick hobold mentioned.
If a single value is set, then the /= max() returns the value to 1.0.
If multiple values are set this prioritises x over y over z.

I doubt very much that this has any useful purpose other than making me feel satisfied. :)


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on May 27, 2015, 12:40:35 AM
I tested it and you will be glad to know that, with some small syntax fixes, it works perfectly. I will leave it here for completeness:

Code:
vec3 paxis(vec3 vec) {
    vec3 av = abs(vec);
    vec /= max(av.x, max(av.y,av.z));
    av = floor(abs(vec)) * vec3(1.0,0.9,0.8);
    av /= max(av.x, max(av.y, av.z));
    vec = floor(av) * vec;
    return vec;
}

I would say that the branching version that eiffie supplied seems better for practical purposes though, so I expect I will continue to use that.


Title: Re: Branchless maximum/principle axis
Post by: dom767 on May 27, 2015, 12:57:01 AM
Heh, thanks for testing it out. Using Eiffie's far simpler to comprehend (and probably most often faster) solution sounds reasonable to me.

Thanks for the challenge though, enjoyed that one. :)


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on May 27, 2015, 12:35:36 PM
I should mention that it seems to be slightly different to the other routine, maybe in the use of the small "epsilon" offsets.

(http://i.imgur.com/3TYWtOkb.png) (how it should look)

(http://i.imgur.com/4BoYFgpb.png) (artifacts visible)

If you revisit it you will have to double check those values. I didn't try playing with them to try and fix it, but I just thought I should mention that I am seeing artifacts under some circumstances.

Annoying because I thought it worked well too.


Title: Re: Branchless maximum/principle axis
Post by: dom767 on May 27, 2015, 12:57:29 PM
Damn. Try this...

Code:
vec3 paxis(vec3 vec) {
    vec3 av = abs(vec);
    vec /= 0.99*max(av.x, max(av.y,av.z));
    av = floor(abs(vec)) * vec3(1.0,0.9,0.8);
    av /= 0.99*max(av.x, max(av.y, av.z));
    vec = floor(av) * vec;
    return vec;
}


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on May 27, 2015, 02:53:43 PM
Yes, that fixes the artifacts! :)

Too bad you had to introduce more tolerance/magic numbers though.


Title: Re: Branchless maximum/principle axis
Post by: hobold on May 27, 2015, 03:26:05 PM
Hm, GLSL seems to be more powerful than I thought. My first GLSL program:

Code:
  vec3 paxis(vec3 vec) {
    vec3 v = abs(vec);
    float m = max(v.x, max(v.y, v.z));
    v = vec3(float(v.x == m), float(v.y == m), float(v.z == m));

    // disambiguate multiple 1.0 values
    v = v * vec3(1.0, 0.9, 0.8);
    m = max(v.x, max(v.y, v.z));
    v = vec3(float(v.x == m), float(v.y == m), float(v.z == m));

    // restore original sign
    vec3 s = vec3(-float(vec.x < 0.0), -float(vec.y < 0.0), -float(vec.z < 0.0));

    return v*s;
  }

Does this work?

The approach is based on my experience with other SIMD machines, where vectors of boolean masks were the way to go. Special comparison instructions directly produced such masks.


Title: Re: Branchless maximum/principle axis
Post by: knighty on May 27, 2015, 04:52:14 PM
Using glsl built in functions gives:
Code:
 vec3 paxis(vec3 vec) {
    vec3 v = abs(vec);
    float m = max(v.x, max(v.y, v.z));
    v = vec3(equal(v,vec3(m)));

    // disambiguate multiple 1.0 values
    v = v * vec3(1.0, 0.9, 0.8);
    m = max(v.x, max(v.y, v.z));
    v = vec3(equal(v,vec3(m)));

    // restore original sign
    vec3 s = sign(vec);

    return v*s;
  }

You can also do:
Code:
 vec3 paxis(vec3 vec) {
    vec3 v = abs(vec);
    float m = max(v.x, max(v.y, v.z));
    v = vec3(float(v.x == m), float(v.y == m && v.x !=m), float(v.z == m && v.x !=m && v.y !=m));

    // restore original sign
    return v*sign(vec);
  }
:)
Does omitting disambiguation give artifacts?


Title: Re: Branchless maximum/principle axis
Post by: eiffie on May 27, 2015, 05:32:32 PM
I forgot to look here before posting on shadertoy "sphereflakes" but you can also do this (similar to knighty's)
Code:
vec3 a=abs(p);
a-=max(a.yzx,a.zxy);
return sign(p)*(sign(a)*0.5+0.5);


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on May 27, 2015, 10:15:27 PM
I wonder why he prefers 0.5+0.5*sign(x) over max(sign(x),0.0), seems like more arithmetic than is necessary.

It's also a nice routine though. Hard to test for performance differences unless this paxis routine happens to be the bottleneck.

Actually, the "sphere flake" shader doesn't make use of paxis() at all, which is why I incorrectly reported dom767's first routine as being well behaved for all inputs. I guess I left it in from testing.

Here are the results of testing the masking paxis routine as presented by hobold:

(http://i.imgur.com/p6Ao6xKb.png) (with hobold paxis) (http://i.imgur.com/p6Ao6xK.png)
(http://i.imgur.com/rr8IVndb.png) (with eiffie paxis) (http://i.imgur.com/rr8IVnd.png)

On the masking and == comparisons on floats to create masks: I'd be worried about that being supported properly by lots of GPU's. It should be, just like pow(-abs(x),y) should be consistently NaN or something. It comes in the category of "funny business" for me, and I'd be reluctant to use it - especially when (as we have seen) there are branchless and even very small routines that do branch, but that can do it without making use of the boolean vectors of floats. Specifically the cast from float(true) and float(false) seems a bit suspicious. Maybe someone knows?


Title: Re: Branchless maximum/principle axis
Post by: hobold on May 27, 2015, 11:34:19 PM
Specifically the cast from float(true) and float(false) seems a bit suspicious. Maybe someone knows?
Admittedly that looks weird. In ordinary C (and derived languages like C++ etc.), the rules are as follows:

1. int can be typecast to bool, where
    0 means false, and every other value means true

2. bool can be cast to int, where
    false means 0, and true means 1

3. int is then converted to float, thus
    false means 0.0 and true means 1.0

I only used this correspondence because I found documentation on GLSL claiming that the same conversion was legal there, but requires an explicit cast from bool to float.


BTW, eiffie's version makes use of several nifty "obscure truths". For example, a comparison to an arbitrary value is really a subtraction of that arbitrary value followed by a comparison to zero.

The performance of 0.5+0.5*sign(x) versus max(sign(x),0.0) is probably hardware specific. The former might be computed by a "fused multiply add", on any of the many floating point pipelines of a GPU, while the latter might be executed on a special function unit, of which there may be fewer. On a GPU that has sufficiently many hardware resources for either, the max() probably burns less energy and is therefor preferable (higher potential performance due to lower chances of thermal throttling).


As to the bug in my version: looks like my handling of signs isn't right.


Title: Re: Branchless maximum/principle axis
Post by: knighty on May 28, 2015, 02:10:10 PM
Conversions  between booleans, integers and floats are well defined. See glsl specifications document (available at opengl.org).
Quote
When a constructor is used to convert an int or a float to bool, 0 and 0.0 are converted to false, and nonzero
values are converted to true. When a constructor is used to convert a bool to an int or float, false is
converted to 0 or 0.0, and true is converted to 1 or 1.0.

For one, I find Eiffie's solution the most elegant so far, even if it's maybe not the most efficient.


Title: Re: Branchless maximum/principle axis
Post by: Syntopia on May 29, 2015, 09:01:10 PM
Reasoning about GPU performance is hard.

Example:

Code:
#include "2D.frag"

// Eiffie
vec3 paxis(vec3 p) {
vec3 a=abs(p);
a-=max(a.yzx,a.zxy);
return sign(p)*(sign(a)*0.5+0.5);
}

// Branched approach
vec3 paxis2(vec3 v) {
vec3 a = abs(v);
return normalize(
(a.x>a.y) ? (a.x>a.z ? vec3(v.x,0,0) : vec3(0,0,v.z))
:  (a.z>a.y ? vec3(0,0,v.z) : vec3(0,v.y,0)));
}

vec3 color(vec2 c) {
vec3 v =  vec3(cos(c.x),sin(c.y),sin(c.x*c.y));
for (int i = 0; i < 2000; i++) {
v=paxis(v);
}
return v;
}

The Eiffie approach gives me a constant 5.9fps (on a Nvidia 850M) on all zoom scales.

But the branched approach gives between 20.3fps and 13.5fps depending on the zoom.

Branching cost on the GPU is somewhat of a myth: it is true that for warp divergence both paths must be executed for the threads on a multiprocessor, but if both paths are small this is not really a issue. At least for Nvidia - ATI may be different.
(The 20.3fps corresponds to when most of the screen is same color, and the same path is taken, the 13.5fps to the case where all warps diverge).

So at least on my Nvidia GPU branching is 2x-3x times faster.

Another example: changing from 'normalize' to 'sign' in paxis2, lowers performance from 20.3fps to 12.5fps. This would have been very difficult to predict from pure reasoning.

Finally, I also tried this on the integrated HD 4600 in my laptop, which gave:
Eiffie: 3.0 to 3.8 FPS
Branches: 3.8 to 5.9 FPS
(Interestingly, the Eiffie method also depends on zoom level here).


Title: Re: Branchless maximum/principle axis
Post by: Syntopia on May 29, 2015, 09:25:15 PM
I just saw Eiffie's branched version - which has ~12fps (roughly indendently of zoom).

Interesting it can be made faster by getting rid of the 'sign' function:
Code:
vec3 paxis3(vec3 p)
{
    vec3 a=abs(p),r = vec3(1.0,0.0,0.0);
    if(a.z>=max(a.x,a.y))r=r.yzx;
   else if(a.y>=a.x)r=r.zxy;
   // return r*sign(p);
   return normalize(r*p);
}

This runs at 18fps (roughly independent of zoom) on my Nvidia.


Title: Re: Branchless maximum/principle axis
Post by: dom767 on May 29, 2015, 09:27:08 PM
Now that's very interesting. It's probably also worth pointing out that the slower and smaller snippet doesn't correctly handle the edge cases for vec(1,1,1) inputs etc. (I think).

Given that we're not bothering about the edge cases too much, how does this do?

Code:
av = abs(vec)
vec /= max(av.x, max(av.y,av.z));
return trunc(vec);

It might need the following bit of fudge...

Code:
av = abs(vec)
vec /= 0.999*max(av.x, max(av.y,av.z));
return trunc(vec);

The cost of branching will depend on the length of the fragment won't it? If paxis() is line 1 of a 200 line program then the branching will be a lot more costly I'd have thought. I guess this is just proving the value of good old fashioned profiling. :)


Title: Re: Branchless maximum/principle axis
Post by: Syntopia on May 29, 2015, 10:23:03 PM
Your snippets are both running at ~32 FPS, so they are very fast. The first one has a lot of visual artifacts, but the second seems to be fine.

The cost of branching will depend on the length of the fragment won't it?

No. It will only depend on the length of the code inside the conditionals. A multiprocessor needs to execute the same statements for all threads in a warp (32 threads on modern Nvidia GPU's), so if the threads are taking different paths, it will mask out (ignore) some of the conditional code for the threads, see e.g.:

(http://15418.courses.cs.cmu.edu/spring2013content/lectures/02_basicarch/images/slide_030.png)

But after the conditional, code will execute as usual.


Title: Re: Branchless maximum/principle axis
Post by: dom767 on May 29, 2015, 11:45:22 PM
Ahh yeah, that makes sense. They're getting clever these GPUs. :)


Title: Re: Branchless maximum/principle axis
Post by: eiffie on May 30, 2015, 12:28:45 AM
Thanks I never would have guessed that sign was a bottleneck.


Title: Re: Branchless maximum/principle axis
Post by: dom767 on May 30, 2015, 12:49:24 AM
So this thread finally inspired me enough to go and try shadertoy...

https://www.shadertoy.com/view/ltS3W3

Kind of sad how much my GPU whips my CPU. :-/


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on May 30, 2015, 01:21:11 AM
Interesting results, I always knew there was something about branching on GPU. I think for texture fetches the adjacent pixel UV's must always be calculated, so it is able to select the mipmap LOD. Correct me if I am wrong about that.

I think the penalty on CPU is less, I'm not sure about a full pipeline flush, as dom767 said. I mean, consider how many background applications are doing branches even while your system is idle. CPU is designed to execute bad code very well :)

I like the shader dom767, I left a comment on ShaderToy for it :)


Title: Re: Branchless maximum/principle axis
Post by: Syntopia on May 30, 2015, 01:48:53 PM
I think the penalty on CPU is less, I'm not sure about a full pipeline flush, as dom767 said. I mean, consider how many background applications are doing branches even while your system is idle. CPU is designed to execute bad code very well :)

An idle background application process would be probably have its branches correctly predicted most of the time, so there would be no penalty. Besides, the pipeline on modern CPUs is not that long: 12-14 cycles for the Core 2 architecture. This is completely neglible compared to the cost of switching between the background processes: something which involved storing CPU state and switching virtual adress spaces. The context switch cost on a Core 2 is roughly 5-7 microseconds or some 10000 instructions.

For very simple conditional statements, the compiler may also opt for using 'predicated' instructions, like the CMOV x86 instruction, which is only executed if a flag is set. This ensures no pipeline flushing is needed. (This is widely used on the ARM architecture, where most instructions are predicated.)

Btw, this is a great resource for CPU optimizations: http://www.agner.org/optimize/


Title: Re: Branchless maximum/principle axis
Post by: lycium on May 30, 2015, 02:43:28 PM
Amazing amount of discussion here for fabs() :)

I know computer architecture is a complex thing and one shouldn't oversimplify / make assumptions, but I'm going to do exactly that now :D

Straight up, I don't see how anything can be faster than *unconditionally * setting a single (sign) bit. If you think about the amount of logic that goes into branching on modern parallel and superscalar archs, there's just no way you want to do anything else. As mentioned it's even worse for GPUs due to serialisation of branch divergence.

So just use fabs, right? :D Always avoid a branch if it's cheap to do so, in general - that's why eiffie's method works well: abs, getting sign bit and selecting between 2 constant literals is all sort of free on a latency hiding, branch fearing GPU.


Title: Re: Branchless maximum/principle axis
Post by: Syntopia on May 30, 2015, 04:21:40 PM
I think my point was that Eiffies versions were actually slower than my naive version using three conditional branches and a normalize function. So don't be too afraid of branches.

And while abs is very likely to be a trivial function, I think it was unexpected that 'sign' turned out to be slower than 'normalize'. On my GPU 'sign' is compiled into the following intermediate code:
Code:
SLT.F R3.xyz, R1, {0, 0, 0, 0}.x;
SGT.F R1.xyz, R1, {0, 0, 0, 0}.x;
TRUNC.U R3.xyz, R3; // <- Truncate to unsigned integer
TRUNC.U R1.xyz, R1;
I2F.U R3.xyz, R3; // <- Convert back to float. Really?
I2F.U R1.xyz, R1;
ADD.F R1.xyz, R1, -R3;

while the 'normalize' is compiled into
Code:
MUL.F R1.xyz, R2, R1;
DP3.F R2.x, R1, R1; // <- a dot product
RSQ.F R2.x, R2.x;  // <- an inverse square root

which was quite unexpected for me.

So I'd advise to simply just measure the performance, instead of reasoning about it.


Title: Re: Branchless maximum/principle axis
Post by: eiffie on May 30, 2015, 05:25:41 PM
Hey Syntopia did you by chance try Knighty's version where he converts float/bool I have always wondered what that compiles to. Does it leave them untouched?

...or sign(p)*max(vec3(greaterThan(a,max(a.yzx,a.zxy))),0.0);

@lycium I think bit checking is just too low an occupation for modern GPUs to be bothered with.


Title: Re: Branchless maximum/principle axis
Post by: Syntopia on May 30, 2015, 06:57:21 PM
Code:
vec3 v = abs(vec);
float m = max(v.x, max(v.y, v.z));
v = vec3(float(v.x == m), float(v.y == m && v.x !=m), float(v.z == m && v.x !=m && v.y !=m));

// restore original sign
return normalize(v*vec);
   

compiles to

Code:
MAX.F R2.x, |R1.y|, |R1.z|;
MAX.F R2.x, |R1|, R2;
SNE.F R2.z, |R1.x|, R2.x;
SEQ.F R2.y, |R1|, R2.x;
SEQ.F R3.x, |R1.z|, R2;
SNE.F R3.y, |R1|, R2.x;
SEQ.F R2.x, |R1|, R2;
TRUNC.U R2.x, R2;
TRUNC.U R2.z, R2;
TRUNC.U R2.y, R2;
AND.U R2.y, R2, R2.z;
TRUNC.U R3.x, R3;
AND.U R2.z, R2, R3.x;
TRUNC.U R3.x, R3.y;
AND.U R2.z, R2, R3.x;
I2F.U R2.x, R2;
I2F.U R2.y, R2;
I2F.U R2.z, R2;
MUL.F R1.xyz, R2, R1;
DP3.F R2.x, R1, R1;
RSQ.F R2.x, R2.x;
MUL.F R1.xyz, R2.x, R1;

so it seems booleans are represented as floats (SNE.F/SEQ.F), converted to ints (TRUNC.U), AND'ed and converted back to floats (I2F)


Title: Re: Branchless maximum/principle axis
Post by: lycium on May 30, 2015, 07:48:36 PM
So I'd advise to simply just measure the performance, instead of reasoning about it.

I agree in general, but I don't think we're referring to the same thing - I was specifically referring to fabs, which just cannot be slower than anything else because it's a single OR-operation with 0x800000... the top (sign) bit of an IEEE number.

I have to say I'm surprised that a normalize does better than that other code, too - but that again is more complex story of data dependencies limiting pipelining etc.

@lycium I think bit checking is just too low an occupation for modern GPUs to be bothered with.

Not at all... I also don't really know what you mean by this... so if I have "x = y ^ z" somewhere in my code, the GPU goes "how dare you molest me with such trivial tasks?!!?"  :D

They are mostly like other CPUs, working on binary IEEE floating point data. In terms of raw throughput, you can't do better than an unconditional simple bit-operation.


Title: Re: Branchless maximum/principle axis
Post by: Syntopia on May 30, 2015, 11:19:23 PM
I agree in general, but I don't think we're referring to the same thing - I was specifically referring to fabs, which just cannot be slower than anything else because it's a single OR-operation with 0x800000... the top (sign) bit of an IEEE number.

Indeed - I was talking about branching, not abs.

Btw, I think your OR operation would make the number negative (IEEE754 numbers are multipled with (-1)^sign). You'd want to AND with 0x7fffffff instead. And AND'ing floats is not valid in neither C nor GLSL. So you have to do a reinterpretive cast. In GLSL this would be:

Code:
// Needs #version 330
float fabs(float f) {
return intBitsToFloat(floatBitsToInt(f) & 0x7fffffff);
}

Which *is* slower than using the builtin abs command (I tested)

The similar code in C would be (this snippet was found on the net):
Code:
float fastabs(float f) 
{int i=((*(int*)&f)&0x7fffffff);return (*(float*)&i);}



Title: Re: Branchless maximum/principle axis
Post by: lycium on May 30, 2015, 11:35:36 PM
Ahhh yes, of course the sign bit should not be set for positive number :)


Title: Re: Branchless maximum/principle axis
Post by: Syntopia on May 30, 2015, 11:55:48 PM
Interestingly, this can be used to build a 'sign' function which is faster than the builtin function produces:
Code:
// Notice: returns +1 for 0 as input.
float fastSign(float f) {
   // Bitwise OR of the number '1' with the sign-bit of f
   return intBitsToFloat(0x3F800000 | (0x80000000 & floatBitsToInt(f) ));
}

vec3 fastSign3(vec3 f) {
   return vec3(fastSign(f.x),fastSign(f.y),fastSign(f.z));
}

Using this makes Eiffies function faster than the my branchy version.


Title: Re: Branchless maximum/principle axis
Post by: lycium on May 31, 2015, 12:00:19 AM
maybe just floatBitsToInt(f) >> 31?


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on May 31, 2015, 12:02:18 AM
Also a proper sign function should handle the edge case sign(0)==0, which I don't think can be achieved with a single bit shift.


Title: Re: Branchless maximum/principle axis
Post by: Syntopia on May 31, 2015, 12:40:24 AM
maybe just floatBitsToInt(f) >> 31?

I don't think it will work: for a negative number, this will produce all binary 1's. (which is -1 as an integer). For a positive number, this will produce all binary 0's (which is 0 as an integer).

You could salvage it as:
Code:
float fastSign(float f) {
return float((floatBitsToInt(f) >> 31)<<1)+1; // "-1 << 1" is -2
}
But now you need a conversion from integer to float (instead of a reinterpretive cast which is free) and some additional operations. This makes this function somewhat slower than the one using AND and OR.


Title: Re: Branchless maximum/principle axis
Post by: lycium on May 31, 2015, 12:42:39 AM
Sorry again, I really should have checked the rest of the code for context :)

Code:
float fastSign(float f) { return (floatBitsToInt(f) >> 31) ? -1.0f : 1.0f; }

float fastSignWithZero(float f) { return (f == 0) ? 0 : ((floatBitsToInt(f) >> 31) ? -1.0f : 1.0f); }

Edit: Actually, this seems silly, it should just be (f >= 0) ? 1 : -1.


Title: Re: Branchless maximum/principle axis
Post by: TruthSerum on June 04, 2015, 06:54:26 AM
By the way, I noticed that 0.5+0.5*sign(x) (1) is slightly different from max(sign(x),0.0) (2)

With (1), if x=0.0, then it returns 0.5, which might not be desirable. This is not the case with (2), which returns 0.0 for x=0.0.

The point of this was to create a mask out of the sign of the value, presumably with the intention of using mix later, or similar, to select between two values without branching.


Title: Re: Branchless maximum/principle axis
Post by: laser blaster on June 04, 2015, 10:22:14 PM
I think my point was that Eiffies versions were actually slower than my naive version using three conditional branches and a normalize function. So don't be too afraid of branches.

And while abs is very likely to be a trivial function, I think it was unexpected that 'sign' turned out to be slower than 'normalize'. On my GPU 'sign' is compiled into the following intermediate code:
Code:
SLT.F R3.xyz, R1, {0, 0, 0, 0}.x;
SGT.F R1.xyz, R1, {0, 0, 0, 0}.x;
TRUNC.U R3.xyz, R3; // <- Truncate to unsigned integer
TRUNC.U R1.xyz, R1;
I2F.U R3.xyz, R3; // <- Convert back to float. Really?
I2F.U R1.xyz, R1;
ADD.F R1.xyz, R1, -R3;

while the 'normalize' is compiled into
Code:
MUL.F R1.xyz, R2, R1;
DP3.F R2.x, R1, R1; // <- a dot product
RSQ.F R2.x, R2.x;  // <- an inverse square root

which was quite unexpected for me.

So I'd advise to simply just measure the performance, instead of reasoning about it.

The normalize may be faster in practice, but the intermediate code isn't very good indicator of speed, because it doesn't doesn't map directly to modern GPU machine code. A 3-component dot product is implemented as 3 scalar multiply-adds on current GPU's- so that's 3 instructions. The 3-component multiply will also compile to 3 instructions. Only very old GPU's still use native vector instructions. There is no way that I know of to view the actual native assembly code for any modern GPU's.

And abs is definitely very fast- I've heard that on some GPU's it's effectively free, as it can be combined into other instructions. But sign() being so slow is quite a shocker. It should be implemented as (f>=0) ? 1 : -1, which shouldn't be more than 2 instructions: a compare, and a conditional move instruction. I don't know why they did it in such a complicated way.


Title: Re: Branchless maximum/principle axis
Post by: Syntopia on June 05, 2015, 04:51:21 PM
The normalize may be faster in practice, but the intermediate code isn't very good indicator of speed, because it doesn't doesn't map directly to modern GPU machine code. A 3-component dot product is implemented as 3 scalar multiply-adds on current GPU's- so that's 3 instructions. The 3-component multiply will also compile to 3 instructions. Only very old GPU's still use native vector instructions.

Yes, it will only hint at what is happening. Notice, that even with machine codes you would still need to know how many cycles each instruction uses.

Quote
There is no way that I know of to view the actual native assembly code for any modern GPU's.
And abs is definitely very fast- I've heard that on some GPU's it's effectively free, as it can be combined into other instructions. But sign() being so slow is quite a shocker. It should be implemented as (f>=0) ? 1 : -1, which shouldn't be more than 2 instructions: a compare, and a conditional move instruction. I don't know why they did it in such a complicated way.

The native assembly for ATI cards can be viewed using their "GPU ShaderAnalyzer" (you can choose between all their architectures and see the machine code). For instance. sign(x) on ATI compiles to:

Code:
      0  y: SETGT       ____,  0.0f,  KC0[0].x      
         z: SETGT       ____,  KC0[0].x,  0.0f     
      1  x: ADD         R0.x,  PV0.z, -PV0.y

while (f>=0) ? 1 : -1 compiles into

Code:
      0  y: SETGT_DX10  ____,  KC0[0].x,  0.0f      
      1  x: CNDE_INT    R0.x,  PV0.y,  -1082130432,  1065353216     

which is exactly as expected (the reason for difference is that sign(x) is required to have sign(0)=0).

I have also heard that abs (and saturate) should be free instructions, but I think that may depend on architecture. On ATI archs abs translated into a "MAX_DX10    ____,  KC0[0].y, -KC0[0].y" instruction.




Title: Re: Branchless maximum/principle axis
Post by: eiffie on June 05, 2015, 05:22:21 PM
Thanks for the info Syntopia - very helpful as always.