Welcome to Fractal Forums

Fractal Software => Programming => Topic started by: TruthSerum on June 08, 2015, 08:24:35 PM




Title: shader function or instruction cost (performance)
Post by: TruthSerum on June 08, 2015, 08:24:35 PM
I was thinking of making a tool to calculate the computational cost of a line of shader code. The tool would then output the cost value for each line of code, possibly colouring the line of code to represent how expensive it is. As we have discussed previously (http://www.fractalforums.com/programming/branchless-maximumprinciple-axis/), it is difficult to gauge exactly how computationally expensive a particular shader function (referring to the intrinsic functions min, max, sin, cos, etc.) is going to be.

But I believe you can make reasonable assumptions about the cost of a function. Actually, what I believe is important here is the ordering of costs w.r.t. other functions. It should be reasonable to assume that evaluating the sine of an angle is more computational work than adding two scalars.

Here are some example costs that would be the basis for the calculation, which would just be a simple walk over the operations, adding each operation's cost to a total for that line number in the code. I expect that the numbers would be configurable in the tool anyhow:

Operation                 Cost       Notes
scalar add,sub,mult,div   1
vec4 add/mul/div/sub      4*1
scalar fract()            2
scalar sqrt()             2
vec3 dot()                3
vec3 normalize()          5          (dot + sqrt)
scalar sin()              2+3        (fract + dot)
vec4 sin()                4*(3+2)    (4 * sin)
mat4 * vec3               4*3        (4 * dot)
mat4 * mat4               4*4*3      (4 * 4 * dot)

In general, the cost of an n-vector operation is simply n * (cost of the scalar operation), with some assumptions made about when these operations can be reduced to dot or fma operations. Also, for a branch, only the operations that make up the condition would count towards the cost.
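To make that concrete, the walk I have in mind would be something like this (just a rough sketch - the operation names and widths are placeholders for whatever the parser actually produces, and the numbers come from the table above):

Code:
#include <map>
#include <string>
#include <utility>
#include <vector>

// Base (scalar) costs, taken from the table above; these would be configurable.
static const std::map<std::string, double> kScalarCost = {
    {"add", 1}, {"sub", 1}, {"mul", 1}, {"div", 1},
    {"fract", 2}, {"sqrt", 2}, {"dot3", 3}, {"sin", 5},  // sin = fract + dot
};

// Rule from above: an n-wide vector op costs n * (cost of the scalar op).
double opCost(const std::string& op, int width) {
    return width * kScalarCost.at(op);
}

// Walk the (op, width) pairs found on one source line and total them up.
double lineCost(const std::vector<std::pair<std::string, int>>& ops) {
    double total = 0.0;
    for (const auto& op : ops)
        total += opCost(op.first, op.second);
    return total;
}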

Do you think these numbers make sense? Is there a better formula that could be applied to compute the cost of an individual line of code, as a function of the cost of its individual operations?


Title: Re: shader function or instruction cost (performance)
Post by: cKleinhuis on June 08, 2015, 09:24:34 PM
hmm
if you ask me, it should theoretically be possible to extract the concrete asm code for the snippet line you are examining - e.g. mark the part just before it in some arbitrary way so you can identify it in the asm output. With that you can then use the mnemonic cost definitions, which should be lying around somewhere ;)


Title: Re: shader function or instruction cost (performance)
Post by: laser blaster on June 09, 2015, 01:11:27 AM
The cost of some operations can vary depending on the GPU manufacturer and the architecture. This is what I could dig up. I might be wrong about some of the Nvidia stuff- I tried to guess as best as I could from the architectural details Nvidia has released. AMD was dead simple, though.

For both Nvidia and AMD:
Add, Multiply, Subtract and MAD/FMA - 1

For AMD:
1/x, sin(x), cos(x), log2(x), exp2(x), 1/sqrt(x) - 4
x/y - 5 (one reciprocal, one mul)
sqrt(x) - probably 5 (one reciprocal square root, one mul)

For Nvidia Fermi architecture (GeForce 400 and 500 series):
1/x, sin(x), cos(x), log2(x), exp2(x), 1/sqrt(x) - 8
Edit: not quite sure how large the dispatch overhead is for these operations on Fermi... anyway, throughput for these functions is 1/8 the speed of normal ops, so try to limit these functions to 1/9 of all total ops on average.

For Nvidia Kepler (GeForce 700 series) and Maxwell architecture (800M and 900 series):
1/x, sin(x), cos(x), log2(x), exp2(x), 1/sqrt(x) - 0 or close to 0, as long as they are limited to 1/9 of all total ops (can go up to 1/5 for Maxwell).
x/y - 1 (one special op, one mul)
sqrt(x) - 1 (one special op, one mul)

(obviously, for vector versions of the above functions, multiply by the width of the vector)

For all GPU's:
vec3 dot - 3
vec3 normalize - 6 + cost of a "1/sqrt(x)" op

asin, acos, atan - LOTS, probably around 25 (no native support on either Nvidia or AMD, so they emulate them in software).

Also, I think a conditional branch does come with some fixed overhead, but I don't know how much.

Reasons for my Nvidia numbers:
The reason for the "0" for various "special functions" (1/x, sin x, etc.) on Nvidia Kepler/Maxwell is that those functions are computed on separate hardware (the special function unit, or SFU) in parallel with standard operations. Throughput is 1/8 that of standard ops (1/4 on Maxwell), so if your code has too many of those ops, they'll become a bottleneck. Hence the 1/9 and 1/5 guidelines. On Fermi, there is still a separate SFU for those functions, but due to the way instruction dispatching works, it looks like there might be some non-negligible overhead; I'm not sure how much.
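If it helps the tool, one way to encode those per-architecture numbers is as a small profile per GPU family. Just an illustration of the data layout - the figures are the rough ones above, not exact hardware costs:

Code:
#include <string>
#include <unordered_map>

// Rough per-architecture cost profile for the figures listed above.
struct CostProfile {
    double mad;       // add / mul / sub / MAD / FMA
    double special;   // 1/x, sin, cos, log2, exp2, 1/sqrt
    double divide;    // x/y  (reciprocal + mul)
    double sqrt;      // sqrt(x)
};

std::unordered_map<std::string, CostProfile> makeProfiles() {
    return {
        {"amd",           {1.0, 4.0, 5.0, 5.0}},
        // Kepler/Maxwell: special ops are ~free while they stay under ~1/9
        // (~1/5 on Maxwell) of all ops in the inner loop.
        {"nvidia_kepler", {1.0, 0.0, 1.0, 1.0}},
    };
}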


Title: Re: shader function or instruction cost (performance)
Post by: eiffie on June 09, 2015, 02:35:01 PM
This would be awesome if you can put it all together! For webgl (where you don't know the hardware) it would be nice to even just have the average.


Title: Re: shader function or instruction cost (performance)
Post by: TruthSerum on June 09, 2015, 05:42:04 PM
Thanks for the support. I've started working on it and got some preliminary results. I only added cost counts for a few operations so far:

(http://i.imgur.com/hgca2fbl.png)

This is basically how it will look, with costs on the right, line numbers on the left, source in-between. The output is HTML.

Issues that I am currently solving:

  • Operation cost in loops needs to be multiplied by the loop iteration count.
  • Operation cost in user-defined functions needs to be multiplied by the invocation count.

Here is a list of all the operators that the parser supports that need costs assigning to them. I expect I'll make this a JSON file or something that is loaded by the program. This would make it possible to have nvidia.json, intel.json, etc.

Code:
//
// Operators used by the high-level (parse tree) representation.
//
enum TOperator {
    EOpNull,            // if in a node, should only mean a node is still being built
    EOpSequence,        // denotes a list of statements, or parameters, etc.
    EOpLinkerObjects,   // for aggregate node of objects the linker may need, if not reference by the rest of the AST
    EOpFunctionCall,    
    EOpFunction,        // For function definition
    EOpParameters,      // an aggregate listing the parameters to a function

    //
    // Unary operators
    //
    
    EOpNegative,
    EOpLogicalNot,
    EOpVectorLogicalNot,
    EOpBitwiseNot,

    EOpPostIncrement,
    EOpPostDecrement,
    EOpPreIncrement,
    EOpPreDecrement,

    EOpConvIntToBool,
    EOpConvUintToBool,
    EOpConvFloatToBool,
    EOpConvDoubleToBool,
    EOpConvBoolToFloat,
    EOpConvIntToFloat,
    EOpConvUintToFloat,
    EOpConvDoubleToFloat,
    EOpConvUintToInt,
    EOpConvFloatToInt,
    EOpConvBoolToInt,
    EOpConvDoubleToInt,
    EOpConvIntToUint,
    EOpConvFloatToUint,
    EOpConvBoolToUint,
    EOpConvDoubleToUint,
    EOpConvIntToDouble,
    EOpConvUintToDouble,
    EOpConvFloatToDouble,
    EOpConvBoolToDouble,

    //
    // binary operations
    //

    EOpAdd,
    EOpSub,
    EOpMul,
    EOpDiv,
    EOpMod,
    EOpRightShift,
    EOpLeftShift,
    EOpAnd,
    EOpInclusiveOr,
    EOpExclusiveOr,
    EOpEqual,
    EOpNotEqual,
    EOpVectorEqual,
    EOpVectorNotEqual,
    EOpLessThan,
    EOpGreaterThan,
    EOpLessThanEqual,
    EOpGreaterThanEqual,
    EOpComma,

    EOpVectorTimesScalar,
    EOpVectorTimesMatrix,
    EOpMatrixTimesVector,
    EOpMatrixTimesScalar,

    EOpLogicalOr,
    EOpLogicalXor,
    EOpLogicalAnd,

    EOpIndexDirect,
    EOpIndexIndirect,
    EOpIndexDirectStruct,

    EOpVectorSwizzle,

    EOpMethod,

    //
    // Built-in functions potentially mapped to operators
    //

    EOpRadians,
    EOpDegrees,
    EOpSin,
    EOpCos,
    EOpTan,
    EOpAsin,
    EOpAcos,
    EOpAtan,
    EOpSinh,
    EOpCosh,
    EOpTanh,
    EOpAsinh,
    EOpAcosh,
    EOpAtanh,

    EOpPow,
    EOpExp,
    EOpLog,
    EOpExp2,
    EOpLog2,
    EOpSqrt,
    EOpInverseSqrt,

    EOpAbs,
    EOpSign,
    EOpFloor,
    EOpTrunc,
    EOpRound,
    EOpRoundEven,
    EOpCeil,
    EOpFract,
    EOpModf,
    EOpMin,
    EOpMax,
    EOpClamp,
    EOpMix,
    EOpStep,
    EOpSmoothStep,

    EOpIsNan,
    EOpIsInf,

    EOpFloatBitsToInt,
    EOpFloatBitsToUint,
    EOpIntBitsToFloat,
    EOpUintBitsToFloat,
    EOpPackSnorm2x16,
    EOpUnpackSnorm2x16,
    EOpPackUnorm2x16,
    EOpUnpackUnorm2x16,
    EOpPackHalf2x16,
    EOpUnpackHalf2x16,

    EOpLength,
    EOpDistance,
    EOpDot,
    EOpCross,
    EOpNormalize,
    EOpFaceForward,
    EOpReflect,
    EOpRefract,

    EOpDPdx,            // Fragment only
    EOpDPdy,            // Fragment only
    EOpFwidth,          // Fragment only
    EOpDPdxFine,        // Fragment only
    EOpDPdyFine,        // Fragment only
    EOpFwidthFine,      // Fragment only
    EOpDPdxCoarse,      // Fragment only
    EOpDPdyCoarse,      // Fragment only
    EOpFwidthCoarse,    // Fragment only

    EOpMatrixTimesMatrix,
    EOpOuterProduct,
    EOpDeterminant,
    EOpMatrixInverse,
    EOpTranspose,

    EOpEmitVertex,           // geometry only
    EOpEndPrimitive,         // geometry only
    EOpEmitStreamVertex,     // geometry only
    EOpEndStreamPrimitive,   // geometry only

    EOpBarrier,
    EOpMemoryBarrier,
    EOpMemoryBarrierAtomicCounter,
    EOpMemoryBarrierBuffer,
    EOpMemoryBarrierImage,
    EOpMemoryBarrierShared,  // compute only
    EOpGroupMemoryBarrier,   // compute only

    EOpAtomicAdd,            // TODO: AST functionality: hook these up
    EOpAtomicMin,
    EOpAtomicMax,
    EOpAtomicAnd,
    EOpAtomicOr,
    EOpAtomicXor,
    EOpAtomicExchange,
    EOpAtomicCompSwap,

    EOpAny,
    EOpAll,

    //
    // Branch
    //

    EOpKill,            // Fragment only
    EOpReturn,
    EOpBreak,
    EOpContinue,
    EOpCase,
    EOpDefault,

    //
    // Constructors
    //

    EOpConstructGuardStart,
    EOpConstructInt,          // these first scalar forms also identify what implicit conversion is needed
    EOpConstructUint,
    EOpConstructBool,
    EOpConstructFloat,
    EOpConstructDouble,
    EOpConstructVec2,
    EOpConstructVec3,
    EOpConstructVec4,
    EOpConstructDVec2,
    EOpConstructDVec3,
    EOpConstructDVec4,
    EOpConstructBVec2,
    EOpConstructBVec3,
    EOpConstructBVec4,
    EOpConstructIVec2,
    EOpConstructIVec3,
    EOpConstructIVec4,
    EOpConstructUVec2,
    EOpConstructUVec3,
    EOpConstructUVec4,
    EOpConstructMat2x2,
    EOpConstructMat2x3,
    EOpConstructMat2x4,
    EOpConstructMat3x2,
    EOpConstructMat3x3,
    EOpConstructMat3x4,
    EOpConstructMat4x2,
    EOpConstructMat4x3,
    EOpConstructMat4x4,
    EOpConstructDMat2x2,
    EOpConstructDMat2x3,
    EOpConstructDMat2x4,
    EOpConstructDMat3x2,
    EOpConstructDMat3x3,
    EOpConstructDMat3x4,
    EOpConstructDMat4x2,
    EOpConstructDMat4x3,
    EOpConstructDMat4x4,
    EOpConstructStruct,
    EOpConstructGuardEnd,

    //
    // moves
    //
    
    EOpAssign,
    EOpAddAssign,
    EOpSubAssign,
    EOpMulAssign,
    EOpVectorTimesMatrixAssign,
    EOpVectorTimesScalarAssign,
    EOpMatrixTimesScalarAssign,
    EOpMatrixTimesMatrixAssign,
    EOpDivAssign,
    EOpModAssign,
    EOpAndAssign,
    EOpInclusiveOrAssign,
    EOpExclusiveOrAssign,
    EOpLeftShiftAssign,
    EOpRightShiftAssign,

    //
    // Array operators
    //

    EOpArrayLength,      // "Array" distinguishes from length(v) built-in function, but it applies to vectors and matrices as well.
};
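For reference, loading such a file could be as simple as this. A sketch only - it assumes the nlohmann/json single-header library and a flat { "EOpSin": 1, ... } layout, and none of the .json files exist yet:

Code:
#include <fstream>
#include <map>
#include <string>

#include <nlohmann/json.hpp>   // assumption: single-header JSON library

// Load a flat { "EOpSin": 1, "EOpDiv": 2, ... } file into an operator->cost map.
std::map<std::string, double> loadCosts(const std::string& path) {
    std::ifstream in(path);
    nlohmann::json j;
    in >> j;

    std::map<std::string, double> costs;
    for (auto it = j.begin(); it != j.end(); ++it)
        costs[it.key()] = it.value().get<double>();
    return costs;
}

// Usage: auto nvidiaCosts = loadCosts("nvidia.json");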


Title: Re: shader function or instruction cost (performance)
Post by: laser blaster on June 09, 2015, 09:52:35 PM
Finally cleared up the confusion about special functions on Fermi.

"Special Function Units (SFUs) execute transcendental instructions such as sin, cosine,
reciprocal, and square root. Each SFU executes one instruction per thread, per clock; a warp
executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the
dispatch unit to issue to other execution units while the SFU is occupied."

So it seems that SFU ops have negligible overhead. Just follow the 1/9 rule, and you don't have to worry about them. Note that you don't need to wait 8 instructions between SFU ops; as long as there is a good balance between normal and SFU ops within your innermost loop, it'll be fine (the GPU will just switch to another warp while waiting on the SFUs).

So a normalize would be worth just 6 points on Nvidia, but 10 on AMD.


Title: Re: shader function or instruction cost (performance)
Post by: Syntopia on June 09, 2015, 10:08:21 PM
As LB points out, this will be difficult: Nvidia has Scalar Processors, Special Function Units, Double Precision Units, and Load/Store Units, each of which can become a bottleneck - for instance, the Mandelbulb uses trigonometric functions extensively in its inner loop.

Also, loops may not have fixed iteration counts, and conditionals can be hard to predict, because their computational cost depends on whether all scalar processors in a given multiprocessor follow the same path. And some functions (such as sign) may turn into very different low-level instructions on different platforms.

Other tricky parts: in some cases making 'uniforms' constant may alter rendering speed by as much as a factor of 4x (http://blog.hvidtfeldts.net/index.php/2011/07/optimizing-glsl-code/) - this is hard to see from a static analysis.

This paper has some measures on the older GT200 architecture: http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

Btw, the AST parser looks nice and could be useful - for instance for converting a shader into emulated double precision. What language is it written in? Will you be sharing the code?


Title: Re: shader function or instruction cost (performance)
Post by: TruthSerum on June 09, 2015, 10:28:10 PM
So with these SFUs, are you suggesting that sin, cos, tan and the other transcendentals be assigned a cost of 1? It sounds kind of suspicious; I'm sure I've seen sine and cosine expanded into their Taylor series by HLSL (admittedly, not GLSL) constant folding.

For simplicity's sake, and to keep things meaningful across a lot of machines, the operation cost could instead be defined in terms of the number of vector elements operated on. That would be consistent with your proposal, where a scalar sine is 1.

Under this scheme, the example costs from the table become:

Operation                 Cost
scalar add,sub,mult,div   1
vec4 add/mul/div/sub      4
scalar fract()            1
scalar sqrt()             1
vec3 dot()                3
vec3 normalize()          3
scalar sin()              1
vec4 sin()                4
mat3 * vec3               3*3
mat4 * mat4               4*4*4
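A sketch of that element-count rule, in case the intent isn't clear (TypeShape is a made-up stand-in for whatever type information the glslang parser exposes):

Code:
// Element-count rule: an operation costs as many units as scalar elements it
// touches. TypeShape is a hypothetical stand-in for the parser's type info.
struct TypeShape { int rows; int cols; };   // vec3 -> {3,1}, mat4 -> {4,4}

// Component-wise ops (add, sin, fract, ...): one unit per element.
int componentCost(const TypeShape& t) {
    return t.rows * t.cols;
}

// Matrix products: one multiply-add per element of the left operand, per
// column of the right operand. mat3 * vec3 -> 3*3, mat4 * mat4 -> 4*4*4.
int productCost(const TypeShape& lhs, const TypeShape& rhs) {
    return lhs.rows * lhs.cols * rhs.cols;
}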

Interesting points about the inability to predict the behaviour of loops. I think GLSL version 100 requires that loops be predictable in some way, though I've not yet looked at it in detail. But there would be no way to know that a loop exits early every time.

As for the code, I expect to make it available at some point, if the thing turns out to be at all useful. I've been told there are tools by nvidia that already do static analysis, which will be better than anything I can put together.


Title: Re: shader function or instruction cost (performance)
Post by: laser blaster on June 09, 2015, 11:16:26 PM
Yes, I think a cost of 1 is accurate for sin, cos, 1/x, 1/sqrt(x), 2^x, and log2(x).
And a vec4 sin is equivalent to 4 scalar sin's, so, yeah that would have a cost of 4.

Dispatching one of these SFU ops only takes one cycle, same as an add or a multiplication. Since the SFUs are independent, the GPU is free to work on other things while the SFUs are working. But there's a catch: the SFUs work at 1/8 the speed of the ALUs (which handle add, mul, sub, fma). The GPU will do its best to distribute the workload evenly, but if your code is dominated by SFU operations, they will become a bottleneck. All that means in practice is that SFU operations should constitute no more than 1/9 of the total operations within your code's innermost loop. If you exceed that, the cost will jump all the way up to 8.

e^x and ln(x) have a cost of 2 - they are computed with one mul, plus either a log2(x) or a 2^(x).
Division has a cost of 2. One reciprocal, one mul.
Sqrt also has a cost of 2. One reciprocal square root, one reciprocal.
tan has a cost of 4- one sin, one cos, one reciprocal, and one mul.
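To put the 1/9 guideline in tool terms, something like this would do. A rough sketch - the 1 and 8 are just the throughput figures above, not an exact hardware model:

Code:
// If SFU-type ops (sin, cos, 1/x, 1/sqrt, exp2, log2) stay under ~1/9 of the
// ops in the innermost loop they hide behind the ALU work and cost ~1 each;
// past that they become the bottleneck and cost the full 1/8-throughput, i.e. 8.
double effectiveSpecialCost(int specialOps, int totalOps) {
    const double kSfuRatio = 1.0 / 9.0;   // ~1/5 on Maxwell
    if (totalOps == 0)
        return 0.0;
    double ratio = static_cast<double>(specialOps) / totalOps;
    return (ratio <= kSfuRatio) ? 1.0 : 8.0;
}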

If the Taylor expansion was just done at compile time to evaluate a constant expression, then it should have no bearing on what actually happens on the GPU. Keep in mind that for CUDA, the compiler will actually replace sin and cos with a software-based polynomial approximation. But GLSL has looser accuracy requirements than CUDA, so it can use the faster hardware instructions by default. I don't know about HLSL.


Title: Re: shader function or instruction cost (performance)
Post by: TruthSerum on June 12, 2015, 12:00:56 AM
I've got it to the point where it's producing some sensible output. If you want to test it, let me know and I can make a build, or post the source code somewhere.

Here is the output after giving it the code to eiffie's paxis2 shader (https://www.shadertoy.com/view/XlSGDc):

(http://i.imgur.com/reFR8wTh.png) (http://i.imgur.com/reFR8wT.png)


Title: Re: shader function or instruction cost (performance)
Post by: eiffie on June 12, 2015, 12:31:57 AM
I would love to have this code. Partly as Syntopia said, for the parser. :-)


Title: Re: shader function or instruction cost (performance)
Post by: TruthSerum on June 12, 2015, 01:10:15 AM
Since people only seem to be interested in the code, I won't bother building a Windows executable.

I've pushed it to github, which you can find here (https://github.com/sigint9/shadertune).


Title: Re: shader function or instruction cost (performance)
Post by: cKleinhuis on June 12, 2015, 02:48:46 AM
lol, you need a "sum total" row ;)
... ideally one for each method separately  :tease:


Title: Re: shader function or instruction cost (performance)
Post by: Syntopia on June 12, 2015, 08:13:47 PM
Thanks for posting the source - I wasn't aware of the GLSLang tools.

I'm not sure I get these numbers. For instance, line 46 says '193', but surely it should be a multiple of 8 given the loop it is in? Line 36 says 0, even though there are several operations?


Title: Re: shader function or instruction cost (performance)
Post by: TruthSerum on June 12, 2015, 08:32:20 PM
It doesn't make complete sense. I was just trying to derive some meaningful numbers from the operation count. In the particular place you point out, where it shows a 0 cost for a line with some operations, I think the cost is being displaced onto the if-statement somehow.

But I don't plan to develop it further, since there was not much interest in the thing, beyond revealing how I was parsing the input.


Title: Re: shader function or instruction cost (performance)
Post by: eiffie on June 12, 2015, 08:49:39 PM
The glslang tool seems like it would have many uses (not just the validation it was intended for). Thanks for putting this together.


Title: Re: shader function or instruction cost (performance)
Post by: eiffie on June 14, 2015, 09:01:27 PM
Even though Truth-ila isn't planning on developing it further, it does work nicely. I'm just going to document some compile issues in case anyone is building this or glslang on VS. I used the 2010 versions of StandAlone and glslang and had to rip out the Bison precompile step - just run it manually once. You also have to add a SPIRV project and build that lib. Once that compiles, create the ShaderTune project exactly the same as StandAlone. I only had a couple of issues with ShaderTune. If you have .frag files with \r\n line endings you need to adjust the code in loadAndCalc, because size and read will not be equal. I also had to move the code from ShaderParser.cpp to CostCalculator::CostCalculator, or that shader data would simply go out of scope.


Title: Re: shader function or instruction cost (performance)
Post by: TruthSerum on June 15, 2015, 12:27:50 AM
Whoops, that was a bad error. I should have simply streamed small chunks of the file into an std::string, instead of trying to allocate the whole thing up-front based on the byte size of the file, which no longer matches what is read once text mode ("r") converts the line endings. Perhaps the glslang parser supports \r\n line endings, in which case simply changing the fopen() mode to "rb" will fix it. As for the scoping issue, I'm not sure exactly what you meant by that.
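For what it's worth, reading the file like this sidesteps the size mismatch entirely - a sketch using iostreams rather than the fread() code that's actually in the repo:

Code:
#include <fstream>
#include <sstream>
#include <string>

// Open in binary so no line-ending translation happens, and let the string
// grow to whatever was actually read instead of pre-sizing it from the file's
// byte count on disk.
std::string loadFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}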


Title: Re: shader function or instruction cost (performance)
Post by: eiffie on June 15, 2015, 03:57:04 PM
I think "rb" would work fine or just use if(read>0)buffer[read]=0. In StandAlone they find the character length reading the whole file char at a time then repositioning to the beginning. It is all the same - these are small files.

Here is a better explanation of the other error.

In CostCalculator...
Shader shader=Shader::parsefile(blah blah); //I don't remember the exact syntax

In ShaderParser.cpp ...
Shader::parsefile(blah blah){
  Shader shader;
  ...
  shader.parse(blah blah);//everything is fine after this line, shader has been loaded and the InfoSink and Tree are loaded.
  return shader;
}//everything is lost here as the variable shader goes out of scope with the end of the function

I tried ...

static Shader shader; //and that helped a bit but I still got an error later in the cost calcs so I just dumped the code into CostCalculator and everything was fine. ???


Title: Re: shader function or instruction cost (performance)
Post by: TruthSerum on June 15, 2015, 05:48:17 PM
I see - it looks like I forgot to implement copy/assignment members for the Shader object. It worked for me with the compiler-generated copy methods.
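For reference, something along these lines would avoid the out-of-scope problem. A sketch only - the member names are hypothetical and this is not the actual ShaderTune code:

Code:
#include <memory>
#include <string>

// If Shader holds handles into glslang state, the compiler-generated copy made
// when returning it by value can leave the caller with dangling data. Handing
// ownership back through a unique_ptr (or giving Shader real copy/move members)
// avoids any copy at all.
class Shader {
public:
    static std::unique_ptr<Shader> parseFile(const std::string& path);
    Shader(const Shader&) = delete;               // forbid shallow copies
    Shader& operator=(const Shader&) = delete;
private:
    Shader() = default;
    // ... parse tree / InfoSink members would live here ...
};

std::unique_ptr<Shader> Shader::parseFile(const std::string& path) {
    std::unique_ptr<Shader> s(new Shader());      // make_unique can't reach the private ctor
    // ... run the glslang parse and store the results in *s ...
    (void)path;
    return s;                                     // ownership moves out; nothing is copied
}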