TruthSerum
Guest
« Reply #30 on: May 30, 2015, 01:21:11 AM »
Interesting results - I always knew there was something about branching on GPU. I think for texture fetches the adjacent pixels' UVs must always be calculated, so that the hardware can select the mipmap LOD. Correct me if I am wrong about that. I think the penalty on CPU is less; I'm not sure about a full pipeline flush, as dom767 said. I mean, consider how many background applications are doing branches even while your system is idle. The CPU is designed to execute bad code very well. I like the shader, dom767 - I left a comment on ShaderToy for it.
Syntopia
« Reply #31 on: May 30, 2015, 01:48:53 PM »
> I think the penalty on CPU is less, I'm not sure about a full pipeline flush, as dom767 said. I mean, consider how many background applications are doing branches even while your system is idle. CPU is designed to execute bad code very well

An idle background process would probably have its branches correctly predicted most of the time, so there would be no penalty. Besides, the pipeline on modern CPUs is not that long: 12-14 cycles for the Core 2 architecture. This is completely negligible compared to the cost of switching between the background processes, something which involves storing the CPU state and switching virtual address spaces. The context-switch cost on a Core 2 is roughly 5-7 microseconds, or some 10000 instructions.

For very simple conditional statements, the compiler may also opt for 'predicated' instructions, like the CMOV x86 instruction, which only takes effect if a condition flag is set. This ensures no pipeline flushing is needed. (This is widely used on the ARM architecture, where most instructions are predicated.)

Btw, this is a great resource for CPU optimizations: http://www.agner.org/optimize/
lycium
« Reply #32 on: May 30, 2015, 02:43:28 PM »
Amazing amount of discussion here for fabs()! I know computer architecture is a complex thing and one shouldn't oversimplify / make assumptions, but I'm going to do exactly that now. Straight up, I don't see how anything can be faster than *unconditionally* setting a single (sign) bit. If you think about the amount of logic that goes into branching on modern parallel and superscalar archs, there's just no way you want to do anything else. As mentioned, it's even worse for GPUs due to serialisation of branch divergence. So just use fabs, right? Always avoid a branch if it's cheap to do so, in general - that's why eiffie's method works well: abs, getting the sign bit and selecting between 2 constant literals are all sort of free on a latency-hiding, branch-fearing GPU.
« Last Edit: May 30, 2015, 02:49:32 PM by lycium »
Syntopia
« Reply #33 on: May 30, 2015, 04:21:40 PM »
I think my point was that eiffie's versions were actually slower than my naive version using three conditional branches and a normalize function. So don't be too afraid of branches. And while abs is very likely a trivial function, I think it was unexpected that 'sign' turned out to be slower than 'normalize'. On my GPU 'sign' is compiled into the following intermediate code:

SLT.F R3.xyz, R1, {0, 0, 0, 0}.x;
SGT.F R1.xyz, R1, {0, 0, 0, 0}.x;
TRUNC.U R3.xyz, R3; // <- Truncate to unsigned integer
TRUNC.U R1.xyz, R1;
I2F.U R3.xyz, R3; // <- Convert back to float. Really?
I2F.U R1.xyz, R1;
ADD.F R1.xyz, R1, -R3;

while 'normalize' is compiled into:

MUL.F R1.xyz, R2, R1;
DP3.F R2.x, R1, R1; // <- a dot product
RSQ.F R2.x, R2.x; // <- an inverse square root

which was quite unexpected for me. So I'd advise to simply measure the performance, instead of reasoning about it.
eiffie
Guest
« Reply #34 on: May 30, 2015, 05:25:41 PM »
Hey Syntopia, did you by chance try knighty's version where he converts float/bool? I have always wondered what that compiles to. Does it leave them untouched?

...or sign(p)*max(vec3(greaterThan(a,max(a.yzx,a.zxy))),0.0);

@lycium I think bit checking is just too low an occupation for modern GPUs to be bothered with.
Syntopia
« Reply #35 on: May 30, 2015, 06:57:21 PM »
vec3 v = abs(vec);
float m = max(v.x, max(v.y, v.z));
v = vec3(float(v.x == m),
         float(v.y == m && v.x != m),
         float(v.z == m && v.x != m && v.y != m));
// restore original sign
return normalize(v*vec);

compiles to:

MAX.F R2.x, |R1.y|, |R1.z|;
MAX.F R2.x, |R1|, R2;
SNE.F R2.z, |R1.x|, R2.x;
SEQ.F R2.y, |R1|, R2.x;
SEQ.F R3.x, |R1.z|, R2;
SNE.F R3.y, |R1|, R2.x;
SEQ.F R2.x, |R1|, R2;
TRUNC.U R2.x, R2;
TRUNC.U R2.z, R2;
TRUNC.U R2.y, R2;
AND.U R2.y, R2, R2.z;
TRUNC.U R3.x, R3;
AND.U R2.z, R2, R3.x;
TRUNC.U R3.x, R3.y;
AND.U R2.z, R2, R3.x;
I2F.U R2.x, R2;
I2F.U R2.y, R2;
I2F.U R2.z, R2;
MUL.F R1.xyz, R2, R1;
DP3.F R2.x, R1, R1;
RSQ.F R2.x, R2.x;
MUL.F R1.xyz, R2.x, R1;

so it seems booleans are represented as floats (SNE.F/SEQ.F), converted to ints (TRUNC.U), AND'ed, and converted back to floats (I2F).
lycium
« Reply #36 on: May 30, 2015, 07:48:36 PM »
> So I'd advise to simply just measure the performance, instead of reasoning about it.

I agree in general, but I don't think we're referring to the same thing - I was specifically referring to fabs, which just cannot be slower than anything else because it's a single OR-operation with 0x800000... the top (sign) bit of an IEEE number. I have to say I'm surprised that a normalize does better than that other code, too - but that again is a more complex story of data dependencies limiting pipelining etc.

> @lycium I think bit checking is just too low an occupation for modern GPUs to be bothered with.

Not at all... I also don't really know what you mean by this... so if I have "x = y ^ z" somewhere in my code, does the GPU go "how dare you molest me with such trivial tasks?!" GPUs are mostly like other CPUs, working on binary IEEE floating-point data. In terms of raw throughput, you can't do better than an unconditional simple bit-operation.
« Last Edit: May 30, 2015, 07:59:55 PM by lycium »
Syntopia
« Reply #37 on: May 30, 2015, 11:19:23 PM »
> I agree in general, but I don't think we're referring to the same thing - I was specifically referring to fabs, which just cannot be slower than anything else because it's a single OR-operation with 0x800000... the top (sign) bit of an IEEE number.

Indeed - I was talking about branching, not abs. Btw, I think your OR operation would make the number negative (IEEE 754 numbers are multiplied by (-1)^sign). You'd want to AND with 0x7fffffff instead. And AND'ing floats is valid in neither C nor GLSL, so you have to do a reinterpretive cast. In GLSL this would be:

// Needs #version 330
float fabs(float f) {
    return intBitsToFloat(floatBitsToInt(f) & 0x7fffffff);
}

which *is* slower than using the builtin abs command (I tested). The similar code in C would be (this snippet was found on the net):

float fastabs(float f) {
    int i = ((*(int*)&f) & 0x7fffffff);
    return (*(float*)&i);
}
lycium
« Reply #38 on: May 30, 2015, 11:35:36 PM »
Ahhh yes, of course the sign bit should not be set for a positive number.
Syntopia
« Reply #39 on: May 30, 2015, 11:55:48 PM »
Interestingly, this can be used to build a 'sign' function which is faster than the builtin:

// Notice: returns +1 for 0 as input.
float fastSign(float f) {
    // Bitwise OR of the number '1' with the sign bit of f
    return intBitsToFloat(0x3F800000 | (0x80000000 & floatBitsToInt(f)));
}

vec3 fastSign3(vec3 f) {
    return vec3(fastSign(f.x), fastSign(f.y), fastSign(f.z));
}

Using this makes eiffie's function faster than my branchy version.
« Last Edit: May 31, 2015, 12:32:28 AM by Syntopia »
lycium
« Reply #40 on: May 31, 2015, 12:00:19 AM »
maybe just floatBitsToInt(f) >> 31?
TruthSerum
Guest
« Reply #41 on: May 31, 2015, 12:02:18 AM »
Also a proper sign function should handle the edge case sign(0)==0, which I don't think can be achieved with a single bit shift.
Syntopia
« Reply #42 on: May 31, 2015, 12:40:24 AM »
> maybe just floatBitsToInt(f) >> 31?

I don't think that will work: for a negative number, this produces all binary 1's (which is -1 as an integer). For a positive number, it produces all binary 0's (which is 0 as an integer). You could salvage it as:

float fastSign(float f) {
    return float((floatBitsToInt(f) >> 31) << 1) + 1; // "-1 << 1" is -2
}

But now you need a conversion from integer to float (instead of a reinterpretive cast, which is free) and some additional operations. This makes this function somewhat slower than the one using AND and OR.
lycium
« Reply #43 on: May 31, 2015, 12:42:39 AM »
Sorry again, I really should have checked the rest of the code for context.

float fastSign(float f) {
    return (floatBitsToInt(f) >> 31) ? -1.0f : 1.0f;
}

float fastSignWithZero(float f) {
    return (f == 0) ? 0 : ((floatBitsToInt(f) >> 31) ? -1.0f : 1.0f);
}

Edit: Actually, this seems silly, it should just be (f >= 0) ? 1 : -1.
« Last Edit: May 31, 2015, 12:53:38 AM by lycium »
TruthSerum
Guest
« Reply #44 on: June 04, 2015, 06:54:26 AM »
By the way, I noticed that 0.5+0.5*sign(x) (1) behaves slightly differently from max(sign(x),0.0) (2).

With (1), if x=0.0, it returns 0.5, which might not be desirable. This is not the case with (2), which returns 0.0 for x=0.0.

The point of this was to create a mask out of the sign of the value, presumably with the intention of using mix later, or similar, to select between two values without branching.