Logo by Trifox - Contribute your own Logo!

END OF AN ERA, FRACTALFORUMS.COM IS CONTINUED ON FRACTALFORUMS.ORG

it was a great time but no longer maintainable by c.Kleinhuis contact him for any data retrieval,
thanks and see you perhaps in 10 years again

this forum will stay online for reference
News: Check out the originating "3d Mandelbulb" thread here
 
*
Welcome, Guest. Please login or register. March 28, 2024, 09:11:46 PM


Login with username, password and session length


The All New FractalForums is now in Public Beta Testing! Visit FractalForums.org and check it out!


Pages: 1 [2] 3 4   Go Down
  Print  
Share this topic on DiggShare this topic on FacebookShare this topic on GoogleShare this topic on RedditShare this topic on StumbleUponShare this topic on Twitter
Author Topic: Branchless maximum/principle axis  (Read 9501 times)
Description: Compute principle axis without branching
0 Members and 1 Guest are viewing this topic.
dom767
Explorer
****
Posts: 40


dom767
WWW
« Reply #15 on: May 27, 2015, 12:57:29 PM »

Damn. Try this...

Code:
vec3 paxis(vec3 vec) {
    vec3 av = abs(vec);
    vec /= 0.99*max(av.x, max(av.y,av.z));
    av = floor(abs(vec)) * vec3(1.0,0.9,0.8);
    av /= 0.99*max(av.x, max(av.y, av.z));
    vec = floor(av) * vec;
    return vec;
}
Logged
TruthSerum
Guest
« Reply #16 on: May 27, 2015, 02:53:43 PM »

Yes, that fixes the artifacts! smiley

Too bad you had to introduce more tolerance/magic numbers though.
Logged
hobold
Fractal Bachius
*
Posts: 573


« Reply #17 on: May 27, 2015, 03:26:05 PM »

Hm, GLSL seems to be more powerful than I thought. My first GLSL program:

Code:
 vec3 paxis(vec3 vec) {
    vec3 v = abs(vec);
    float m = max(v.x, max(v.y, v.z));
    v = vec3(float(v.x == m), float(v.y == m), float(v.z == m));

    // disambiguate multiple 1.0 values
    v = v * vec3(1.0, 0.9, 0.8);
    m = max(v.x, max(v.y, v.z));
    v = vec3(float(v.x == m), float(v.y == m), float(v.z == m));

    // restore original sign
    vec3 s = vec3(-float(vec.x < 0.0), -float(vec.y < 0.0), -float(vec.z < 0.0));

    return v*s;
  }

Does this work?

The approach is based on my experience with other SIMD machines, where vectors of boolean masks were the way to go. Special comparison instructions directly produced such masks.
Logged
knighty
Fractal Iambus
***
Posts: 819


« Reply #18 on: May 27, 2015, 04:52:14 PM »

Using glsl built in functions gives:
Code:
vec3 paxis(vec3 vec) {
    vec3 v = abs(vec);
    float m = max(v.x, max(v.y, v.z));
    v = vec3(equal(v,vec3(m)));

    // disambiguate multiple 1.0 values
    v = v * vec3(1.0, 0.9, 0.8);
    m = max(v.x, max(v.y, v.z));
    v = vec3(equal(v,vec3(m)));

    // restore original sign
    vec3 s = sign(vec);

    return v*s;
  }

You can also do:
Code:
vec3 paxis(vec3 vec) {
    vec3 v = abs(vec);
    float m = max(v.x, max(v.y, v.z));
    v = vec3(float(v.x == m), float(v.y == m && v.x !=m), float(v.z == m && v.x !=m && v.y !=m));

    // restore original sign
    return v*sign(vec);
  }
smiley
Does omitting disambiguation give artifacts?
Logged
eiffie
Guest
« Reply #19 on: May 27, 2015, 05:32:32 PM »

I forgot to look here before posting on shadertoy "sphereflakes" but you can also do this (similar to knighty's)
Code:
vec3 a=abs(p);
a-=max(a.yzx,a.zxy);
return sign(p)*(sign(a)*0.5+0.5);
Logged
TruthSerum
Guest
« Reply #20 on: May 27, 2015, 10:15:27 PM »

I wonder why he prefers 0.5+0.5*sign(x) over max(sign(x),0.0), seems like more arithmetic than is necessary.

It's also a nice routine though. Hard to test for performance differences unless this paxis routine happens to be the bottleneck.

Actually, the "sphere flake" shader doesn't make use of paxis() at all, which is why I incorrectly reported dom767's first routine as being well behaved for all inputs. I guess I left it in from testing.

Here are the results of testing the masking paxis routine as presented by hobold:

(with hobold paxis)
(with eiffie paxis)

On the masking and == comparisons on floats to create masks: I'd be worried about that being supported properly by lots of GPU's. It should be, just like pow(-abs(x),y) should be consistently NaN or something. It comes in the category of "funny business" for me, and I'd be reluctant to use it - especially when (as we have seen) there are branchless and even very small routines that do branch, but that can do it without making use of the boolean vectors of floats. Specifically the cast from float(true) and float(false) seems a bit suspicious. Maybe someone knows?
« Last Edit: May 27, 2015, 10:52:45 PM by TruthSerum » Logged
hobold
Fractal Bachius
*
Posts: 573


« Reply #21 on: May 27, 2015, 11:34:19 PM »

Specifically the cast from float(true) and float(false) seems a bit suspicious. Maybe someone knows?
Admittedly that looks weird. In ordinary C (and derived languages like C++ etc.), the rules are as follows:

1. int can be typecast to bool, where
    0 means false, and every other value means true

2. bool can be cast to int, where
    false means 0, and true means 1

3. int is then converted to float, thus
    false means 0.0 and true means 1.0

I only used this correspondence because I found documentation on GLSL claiming that the same conversion was legal there, but requires an explicit cast from bool to float.


BTW, eiffie's version makes use of several nifty "obscure truths". For example, a comparison to an arbitrary value is really a subtraction of that arbitrary value followed by a comparison to zero.

The performance of 0.5+0.5*sign(x) versus max(sign(x),0.0) is probably hardware specific. The former might be computed by a "fused multiply add", on any of the many floating point pipelines of a GPU, while the latter might be executed on a special function unit, of which there may be fewer. On a GPU that has sufficiently many hardware resources for either, the max() probably burns less energy and is therefor preferable (higher potential performance due to lower chances of thermal throttling).


As to the bug in my version: looks like my handling of signs isn't right.
Logged
knighty
Fractal Iambus
***
Posts: 819


« Reply #22 on: May 28, 2015, 02:10:10 PM »

Conversions  between booleans, integers and floats are well defined. See glsl specifications document (available at opengl.org).
Quote
When a constructor is used to convert an int or a float to bool, 0 and 0.0 are converted to false, and nonzero
values are converted to true. When a constructor is used to convert a bool to an int or float, false is
converted to 0 or 0.0, and true is converted to 1 or 1.0.

For one, I find Eiffie's solution the most elegant so far, even if it's maybe not the most efficient.
Logged
Syntopia
Fractal Molossus
**
Posts: 681



syntopiadk
WWW
« Reply #23 on: May 29, 2015, 09:01:10 PM »

Reasoning about GPU performance is hard.

Example:

Code:
#include "2D.frag"

// Eiffie
vec3 paxis(vec3 p) {
vec3 a=abs(p);
a-=max(a.yzx,a.zxy);
return sign(p)*(sign(a)*0.5+0.5);
}

// Branched approach
vec3 paxis2(vec3 v) {
vec3 a = abs(v);
return normalize(
(a.x>a.y) ? (a.x>a.z ? vec3(v.x,0,0) : vec3(0,0,v.z))
:  (a.z>a.y ? vec3(0,0,v.z) : vec3(0,v.y,0)));
}

vec3 color(vec2 c) {
vec3 v =  vec3(cos(c.x),sin(c.y),sin(c.x*c.y));
for (int i = 0; i < 2000; i++) {
v=paxis(v);
}
return v;
}

The Eiffie approach gives me a constant 5.9fps (on a Nvidia 850M) on all zoom scales.

But the branched approach gives between 20.3fps and 13.5fps depending on the zoom.

Branching cost on the GPU is somewhat of a myth: it is true that for warp divergence both paths must be executed for the threads on a multiprocessor, but if both paths are small this is not really a issue. At least for Nvidia - ATI may be different.
(The 20.3fps corresponds to when most of the screen is same color, and the same path is taken, the 13.5fps to the case where all warps diverge).

So at least on my Nvidia GPU branching is 2x-3x times faster.

Another example: changing from 'normalize' to 'sign' in paxis2, lowers performance from 20.3fps to 12.5fps. This would have been very difficult to predict from pure reasoning.

Finally, I also tried this on the integrated HD 4600 in my laptop, which gave:
Eiffie: 3.0 to 3.8 FPS
Branches: 3.8 to 5.9 FPS
(Interestingly, the Eiffie method also depends on zoom level here).
« Last Edit: May 29, 2015, 09:07:21 PM by Syntopia » Logged
Syntopia
Fractal Molossus
**
Posts: 681



syntopiadk
WWW
« Reply #24 on: May 29, 2015, 09:25:15 PM »

I just saw Eiffie's branched version - which has ~12fps (roughly indendently of zoom).

Interesting it can be made faster by getting rid of the 'sign' function:
Code:
vec3 paxis3(vec3 p)
{
    vec3 a=abs(p),r = vec3(1.0,0.0,0.0);
    if(a.z>=max(a.x,a.y))r=r.yzx;
   else if(a.y>=a.x)r=r.zxy;
   // return r*sign(p);
   return normalize(r*p);
}

This runs at 18fps (roughly independent of zoom) on my Nvidia.
Logged
dom767
Explorer
****
Posts: 40


dom767
WWW
« Reply #25 on: May 29, 2015, 09:27:08 PM »

Now that's very interesting. It's probably also worth pointing out that the slower and smaller snippet doesn't correctly handle the edge cases for vec(1,1,1) inputs etc. (I think).

Given that we're not bothering about the edge cases too much, how does this do?

Code:
av = abs(vec)
vec /= max(av.x, max(av.y,av.z));
return trunc(vec);

It might need the following bit of fudge...

Code:
av = abs(vec)
vec /= 0.999*max(av.x, max(av.y,av.z));
return trunc(vec);

The cost of branching will depend on the length of the fragment won't it? If paxis() is line 1 of a 200 line program then the branching will be a lot more costly I'd have thought. I guess this is just proving the value of good old fashioned profiling. smiley
Logged
Syntopia
Fractal Molossus
**
Posts: 681



syntopiadk
WWW
« Reply #26 on: May 29, 2015, 10:23:03 PM »

Your snippets are both running at ~32 FPS, so they are very fast. The first one has a lot of visual artifacts, but the second seems to be fine.

The cost of branching will depend on the length of the fragment won't it?

No. It will only depend on the length of the code inside the conditionals. A multiprocessor needs to execute the same statements for all threads in a warp (32 threads on modern Nvidia GPU's), so if the threads are taking different paths, it will mask out (ignore) some of the conditional code for the threads, see e.g.:



But after the conditional, code will execute as usual.
Logged
dom767
Explorer
****
Posts: 40


dom767
WWW
« Reply #27 on: May 29, 2015, 11:45:22 PM »

Ahh yeah, that makes sense. They're getting clever these GPUs. smiley
Logged
eiffie
Guest
« Reply #28 on: May 30, 2015, 12:28:45 AM »

Thanks I never would have guessed that sign was a bottleneck.
Logged
dom767
Explorer
****
Posts: 40


dom767
WWW
« Reply #29 on: May 30, 2015, 12:49:24 AM »

So this thread finally inspired me enough to go and try shadertoy...

https://www.shadertoy.com/view/ltS3W3

Kind of sad how much my GPU whips my CPU. :-/
Logged
Pages: 1 [2] 3 4   Go Down
  Print  
 
Jump to:  

Related Topics
Subject Started by Replies Views Last post
Maximum Zoom Factor Mandelbulb 3d The Rev 2 2227 Last post October 17, 2010, 08:36:39 PM
by The Rev
Maximum Security Prison Images Showcase (Rate My Fractal) thom 0 859 Last post June 13, 2012, 04:20:27 AM
by thom
ability to set maximum render time per frame Feature Requests erstwhile 4 2660 Last post December 08, 2013, 10:30:52 PM
by erstwhile
principle of diminishing marginal productivity Images Showcase (Rate My Fractal) thom 0 997 Last post April 28, 2015, 03:44:20 AM
by thom
Maximum render size Kalles Fraktaler Dinkydau 4 1537 Last post February 16, 2017, 11:32:12 PM
by Dinkydau

Powered by MySQL Powered by PHP Powered by SMF 1.1.21 | SMF © 2015, Simple Machines

Valid XHTML 1.0! Valid CSS! Dilber MC Theme by HarzeM
Page created in 0.224 seconds with 24 queries. (Pretty URLs adds 0.009s, 2q)