Author Topic: shader function or instruction cost (performance)  (Read 24975 times)
TruthSerum
Guest
« on: June 08, 2015, 08:24:35 PM »

I was thinking of making a tool to calculate the computational cost of each line of shader code. The tool would output a cost value for each line, possibly colouring the line to show how expensive it is. As we have discussed previously, it is difficult to gauge exactly how computationally expensive a particular shader function (referring to the intrinsic functions: min, max, sin, cos, etc.) is going to be.

But I believe you can make reasonable assumptions about the cost of a function. What matters here, I think, is the ordering of costs relative to other functions: it should be reasonable to assume that evaluating the sine of an angle is more computational work than adding two scalars.

Here are some example costs that would be the basis for the calculation, which would just be a simple walk over the operations, adding each operation's cost to a running total for that line of code. I expect the numbers would be configurable in the tool anyhow:

scalar add, sub, mult, div   1
vec4 add/mul/div/sub         4 * 1
scalar fract()               2
scalar sqrt()                2
vec3 dot()                   3
vec3 normalize()             5           (dot + sqrt)
scalar sin()                 2 + 3       (fract + dot)
vec4 sin()                   4 * (3+2)   (4 * sin)
mat4 * vec3                  4 * 3       (4 * dot)
mat4 * mat4                  4 * 4 * 3   (4 * 4 * dot)

In general, the cost of an n-vector operation is simply n * (cost of the scalar operation), with some assumptions about when these operations can be reduced to dot or fma operations. Also, for a branch, only the operations that make up the condition would count towards the cost.
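For illustration, the per-line walk could look something like this minimal C++ sketch. The cost table and function names are hypothetical, and the numbers just mirror the table above:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical per-operation cost table; in the real tool these numbers
// would be loaded from a configurable file rather than hard-coded.
static const std::map<std::string, int> kOpCost = {
    {"add", 1}, {"mul", 1}, {"fract", 2}, {"sqrt", 2},
    {"dot3", 3}, {"normalize3", 5}, {"sin", 5},  // sin = fract + dot = 2 + 3
};

// Sum the cost of every operation the parser attributes to one source line.
int lineCost(const std::vector<std::string>& ops) {
    int total = 0;
    for (const auto& op : ops) total += kOpCost.at(op);
    return total;
}
```

The tool would then emit one such total per source line, which is what the colouring would be based on.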

Do you think these numbers make sense? Is there a better formula that could be applied to compute the cost of an individual line of code, as a function of the costs of its individual operations?
cKleinhuis
Administrator
« Reply #1 on: June 08, 2015, 09:24:34 PM »

Hmm, if you ask me, it should theoretically be possible to extract the concrete ASM code for the snippet line you are examining, e.g. mark the part beforehand with an arbitrary marker so you can identify it in the ASM output. With that you can then use the mnemonic cost definitions, which should be lying around somewhere ;)

---

divide and conquer - iterate and rule - chaos is No random!
laser blaster
« Reply #2 on: June 09, 2015, 01:11:27 AM »

The cost of some operations can vary depending on the GPU manufacturer and the architecture. This is what I could dig up. I might be wrong about some of the Nvidia stuff; I tried to guess as best as I could from the architectural details Nvidia has released. AMD was dead simple, though.

For both Nvidia and AMD:
Add, Multiply, Subtract and MAD/FMA - 1

For AMD:
1/x, sin(x), cos(x), log2(x), exp2(x), 1/sqrt(x) - 4
x/y - 5 (one reciprocal, one mul)
sqrt(x) - probably 5 (one reciprocal square root, one mul)

For Nvidia Fermi architecture (GeForce 400 and 500 series):
1/x, sin(x), cos(x), log2(x), exp2(x), 1/sqrt(x) - 8
Edit: not quite sure how large the dispatch overhead is for these operations on Fermi... anyway, throughput for these functions is 1/8 the speed of normal ops, so try to limit these functions to 1/9 of all total ops on average.

For Nvidia Kepler (GeForce 700 series) and Maxwell architecture (800M and 900 series):
1/x, sin(x), cos(x), log2(x), exp2(x), 1/sqrt(x) - 0 or close to 0, as long as they are limited to 1/9 of all total ops (can go up to 1/5 for Maxwell).
x/y - 1 (one special op, one mul)
sqrt(x) - 1 (one special op, one mul)

(obviously, for vector versions of the above functions, multiply by the width of the vector)

For all GPU's:
vec3 dot - 3
vec3 normalize - 6 + cost of a "1/sqrt(x)" op

asin, acos, atan - LOTS, probably around 25 (no native support on either Nvidia or AMD, so they are emulated in software).

Also, I think a conditional branch does come with some fixed overhead, but I don't know how much.

Reasons for my Nvidia numbers:
The reason for the "0" cost of the various "special functions" (1/x, sin x, etc.) on Nvidia Kepler/Maxwell is that those functions are computed on separate hardware (the special function unit, or SFU) in parallel with standard operations. Throughput is 1/8 that of standard ops (1/4 on Maxwell), so if your code has too many of those ops, they'll become a bottleneck. Hence the 1/9 and 1/5 guidelines. On Fermi, there is still a separate SFU for those functions, but due to the way instruction dispatching works, it looks like there might be some non-negligible overhead; I'm not sure how much.
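A rough way to picture the 1/9 guideline is the following toy throughput model (a sketch of the reasoning above, not vendor documentation):

```cpp
#include <algorithm>

// Toy throughput model: the ALUs retire one normal op per cycle, the SFU
// pipe retires one special op per 8 cycles, and the two run in parallel,
// so the slower pipe bounds the total time for a given instruction mix.
int estimatedCycles(int normalOps, int sfuOps) {
    return std::max(normalOps, sfuOps * 8);
}
```

With 8 normal ops per SFU op (the 1/9 mix), both pipes finish together; push the SFU share any higher and the special functions become the bottleneck.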
« Last Edit: June 09, 2015, 04:53:27 AM by laser blaster »
eiffie
Guest
« Reply #3 on: June 09, 2015, 02:35:01 PM »

This would be awesome if you can put it all together! For WebGL (where you don't know the hardware) it would be nice to just have an average.
TruthSerum
Guest
« Reply #4 on: June 09, 2015, 05:42:04 PM »

Thanks for the support. I've started working on it and got some preliminary results. I only added cost counts for a few operations so far:

[screenshot: shader source annotated with per-line costs]
This is basically how it will look, with costs on the right, line numbers on the left, source in-between. The output is HTML.

Issues that I am currently solving:

  • Operation cost in loops needs to be multiplied by the loop iteration count.
  • Operation cost in user-defined functions needs to be multiplied by the invocation count.
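Both items boil down to scaling a body's cost by a statically known repeat count, e.g. (hypothetical helpers):

```cpp
// Hypothetical helpers: scale a loop body's per-iteration cost by its
// (statically known) trip count, and a user-defined function's body cost
// by the number of call sites that invoke it.
int loopCost(int bodyCost, int tripCount)     { return bodyCost * tripCount; }
int functionCost(int bodyCost, int callCount) { return bodyCost * callCount; }
```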

Here is a list of all the operators the parser supports that need costs assigned to them. I expect I'll make this a JSON file or something that is loaded by the program. This would make it possible to have nvidia.json, intel.json, etc.
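A per-vendor cost file along those lines might look like this (names and numbers purely illustrative):

```json
{
  "add": 1,
  "mul": 1,
  "fract": 2,
  "sqrt": 2,
  "dot": 3,
  "normalize": 5,
  "sin": 5
}
```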

Code:
//
// Operators used by the high-level (parse tree) representation.
//
enum TOperator {
    EOpNull,            // if in a node, should only mean a node is still being built
    EOpSequence,        // denotes a list of statements, or parameters, etc.
    EOpLinkerObjects,   // for aggregate node of objects the linker may need, if not referenced by the rest of the AST
    EOpFunctionCall,    
    EOpFunction,        // For function definition
    EOpParameters,      // an aggregate listing the parameters to a function

    //
    // Unary operators
    //
    
    EOpNegative,
    EOpLogicalNot,
    EOpVectorLogicalNot,
    EOpBitwiseNot,

    EOpPostIncrement,
    EOpPostDecrement,
    EOpPreIncrement,
    EOpPreDecrement,

    EOpConvIntToBool,
    EOpConvUintToBool,
    EOpConvFloatToBool,
    EOpConvDoubleToBool,
    EOpConvBoolToFloat,
    EOpConvIntToFloat,
    EOpConvUintToFloat,
    EOpConvDoubleToFloat,
    EOpConvUintToInt,
    EOpConvFloatToInt,
    EOpConvBoolToInt,
    EOpConvDoubleToInt,
    EOpConvIntToUint,
    EOpConvFloatToUint,
    EOpConvBoolToUint,
    EOpConvDoubleToUint,
    EOpConvIntToDouble,
    EOpConvUintToDouble,
    EOpConvFloatToDouble,
    EOpConvBoolToDouble,

    //
    // binary operations
    //

    EOpAdd,
    EOpSub,
    EOpMul,
    EOpDiv,
    EOpMod,
    EOpRightShift,
    EOpLeftShift,
    EOpAnd,
    EOpInclusiveOr,
    EOpExclusiveOr,
    EOpEqual,
    EOpNotEqual,
    EOpVectorEqual,
    EOpVectorNotEqual,
    EOpLessThan,
    EOpGreaterThan,
    EOpLessThanEqual,
    EOpGreaterThanEqual,
    EOpComma,

    EOpVectorTimesScalar,
    EOpVectorTimesMatrix,
    EOpMatrixTimesVector,
    EOpMatrixTimesScalar,

    EOpLogicalOr,
    EOpLogicalXor,
    EOpLogicalAnd,

    EOpIndexDirect,
    EOpIndexIndirect,
    EOpIndexDirectStruct,

    EOpVectorSwizzle,

    EOpMethod,

    //
    // Built-in functions potentially mapped to operators
    //

    EOpRadians,
    EOpDegrees,
    EOpSin,
    EOpCos,
    EOpTan,
    EOpAsin,
    EOpAcos,
    EOpAtan,
    EOpSinh,
    EOpCosh,
    EOpTanh,
    EOpAsinh,
    EOpAcosh,
    EOpAtanh,

    EOpPow,
    EOpExp,
    EOpLog,
    EOpExp2,
    EOpLog2,
    EOpSqrt,
    EOpInverseSqrt,

    EOpAbs,
    EOpSign,
    EOpFloor,
    EOpTrunc,
    EOpRound,
    EOpRoundEven,
    EOpCeil,
    EOpFract,
    EOpModf,
    EOpMin,
    EOpMax,
    EOpClamp,
    EOpMix,
    EOpStep,
    EOpSmoothStep,

    EOpIsNan,
    EOpIsInf,

    EOpFloatBitsToInt,
    EOpFloatBitsToUint,
    EOpIntBitsToFloat,
    EOpUintBitsToFloat,
    EOpPackSnorm2x16,
    EOpUnpackSnorm2x16,
    EOpPackUnorm2x16,
    EOpUnpackUnorm2x16,
    EOpPackHalf2x16,
    EOpUnpackHalf2x16,

    EOpLength,
    EOpDistance,
    EOpDot,
    EOpCross,
    EOpNormalize,
    EOpFaceForward,
    EOpReflect,
    EOpRefract,

    EOpDPdx,            // Fragment only
    EOpDPdy,            // Fragment only
    EOpFwidth,          // Fragment only
    EOpDPdxFine,        // Fragment only
    EOpDPdyFine,        // Fragment only
    EOpFwidthFine,      // Fragment only
    EOpDPdxCoarse,      // Fragment only
    EOpDPdyCoarse,      // Fragment only
    EOpFwidthCoarse,    // Fragment only

    EOpMatrixTimesMatrix,
    EOpOuterProduct,
    EOpDeterminant,
    EOpMatrixInverse,
    EOpTranspose,

    EOpEmitVertex,           // geometry only
    EOpEndPrimitive,         // geometry only
    EOpEmitStreamVertex,     // geometry only
    EOpEndStreamPrimitive,   // geometry only

    EOpBarrier,
    EOpMemoryBarrier,
    EOpMemoryBarrierAtomicCounter,
    EOpMemoryBarrierBuffer,
    EOpMemoryBarrierImage,
    EOpMemoryBarrierShared,  // compute only
    EOpGroupMemoryBarrier,   // compute only

    EOpAtomicAdd,            // TODO: AST functionality: hook these up
    EOpAtomicMin,
    EOpAtomicMax,
    EOpAtomicAnd,
    EOpAtomicOr,
    EOpAtomicXor,
    EOpAtomicExchange,
    EOpAtomicCompSwap,

    EOpAny,
    EOpAll,

    //
    // Branch
    //

    EOpKill,            // Fragment only
    EOpReturn,
    EOpBreak,
    EOpContinue,
    EOpCase,
    EOpDefault,

    //
    // Constructors
    //

    EOpConstructGuardStart,
    EOpConstructInt,          // these first scalar forms also identify what implicit conversion is needed
    EOpConstructUint,
    EOpConstructBool,
    EOpConstructFloat,
    EOpConstructDouble,
    EOpConstructVec2,
    EOpConstructVec3,
    EOpConstructVec4,
    EOpConstructDVec2,
    EOpConstructDVec3,
    EOpConstructDVec4,
    EOpConstructBVec2,
    EOpConstructBVec3,
    EOpConstructBVec4,
    EOpConstructIVec2,
    EOpConstructIVec3,
    EOpConstructIVec4,
    EOpConstructUVec2,
    EOpConstructUVec3,
    EOpConstructUVec4,
    EOpConstructMat2x2,
    EOpConstructMat2x3,
    EOpConstructMat2x4,
    EOpConstructMat3x2,
    EOpConstructMat3x3,
    EOpConstructMat3x4,
    EOpConstructMat4x2,
    EOpConstructMat4x3,
    EOpConstructMat4x4,
    EOpConstructDMat2x2,
    EOpConstructDMat2x3,
    EOpConstructDMat2x4,
    EOpConstructDMat3x2,
    EOpConstructDMat3x3,
    EOpConstructDMat3x4,
    EOpConstructDMat4x2,
    EOpConstructDMat4x3,
    EOpConstructDMat4x4,
    EOpConstructStruct,
    EOpConstructGuardEnd,

    //
    // moves
    //
    
    EOpAssign,
    EOpAddAssign,
    EOpSubAssign,
    EOpMulAssign,
    EOpVectorTimesMatrixAssign,
    EOpVectorTimesScalarAssign,
    EOpMatrixTimesScalarAssign,
    EOpMatrixTimesMatrixAssign,
    EOpDivAssign,
    EOpModAssign,
    EOpAndAssign,
    EOpInclusiveOrAssign,
    EOpExclusiveOrAssign,
    EOpLeftShiftAssign,
    EOpRightShiftAssign,

    //
    // Array operators
    //

    EOpArrayLength,      // "Array" distinguishes from length(v) built-in function, but it applies to vectors and matrices as well.
};
laser blaster
« Reply #5 on: June 09, 2015, 09:52:35 PM »

Finally cleared up the confusion about special functions on Fermi.

"Special Function Units (SFUs) execute transcendental instructions such as sin, cosine,
reciprocal, and square root. Each SFU executes one instruction per thread, per clock; a warp
executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the
dispatch unit to issue to other execution units while the SFU is occupied."

So it seems that SFU ops have very little overhead. Just follow the 1/9 rule and you don't have to worry about them. Note that you don't need to wait 8 instructions between SFU ops; as long as there is a good balance between normal and SFU ops within your innermost loop, it'll be fine (the GPU will just switch to another warp while waiting on the SFUs).

So a normalize would be worth just 6 points on Nvidia, but 10 on AMD.
Syntopia
« Reply #6 on: June 09, 2015, 10:08:21 PM »

As LB points out, this will be difficult: Nvidia has Scalar Processors, Special Function Units, Double Precision Units, and Load/Store Units, each of which can become a bottleneck. For instance, the Mandelbulb uses trigonometric functions extensively in its inner loop.

Also, loops may not have a fixed iteration count, and conditionals can be hard to predict, because their computational cost depends on whether all scalar processors in a given multiprocessor follow the same path. And some functions (such as sign) may turn into very different low-level instructions on different platforms.

Other tricky parts: in some cases making 'uniforms' constant may alter rendering speed by as much as a factor of 4x (http://blog.hvidtfeldts.net/index.php/2011/07/optimizing-glsl-code/) - this is hard to see from a static analysis.

This paper has some measures on the older GT200 architecture: http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

Btw, the AST parser looks nice and could be useful - for instance for converting a shader into emulated double precision. What language is it written in? Will you be sharing the code?
TruthSerum
Guest
« Reply #7 on: June 09, 2015, 10:28:10 PM »

So with these SFUs, are you suggesting that sin, cos, tan and the transcendentals be assigned a cost of 1? It sounds kind of suspicious; I'm sure I've seen sine and cosine expanded into their Taylor series by HLSL (admittedly, not GLSL) constant folding.

For simplicity's sake, and to keep things meaningful on a lot of machines, the operation cost could instead be defined in terms of the number of vector elements operated on. That would be consistent with your proposal, where scalar sine is 1.

Under this scheme, the example costs from the table become:

scalar add, sub, mult, div   1
vec4 add/mul/div/sub         4
scalar fract()               1
scalar sqrt()                1
vec3 dot()                   3
vec3 normalize()             3
scalar sin()                 1
vec4 sin()                   4
mat3 * vec3                  3 * 3
mat4 * mat4                  4 * 4 * 4
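In this scheme the calculation degenerates to counting components, something like (sketch, hypothetical names):

```cpp
// Width-based costs: an elementwise op costs the number of components it
// touches; a matN * vecN multiply is N dot products of width N; a
// matN * matN multiply is N * N such dot products.
int elemCost(int n)   { return n; }
int matVecCost(int n) { return n * n; }
int matMatCost(int n) { return n * n * n; }
```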

Interesting points about the inability to predict the behaviour of loops. I think GLSL version 100 requires that loops be predictable in some way; I've not yet looked at it in detail. But there would be no way to know that a loop will exit early every time.

As for the code, I expect to make it available at some point, if the thing turns out to be at all useful. I've been told there are tools by nvidia that already do static analysis, which will be better than anything I can put together.
laser blaster
« Reply #8 on: June 09, 2015, 11:16:26 PM »

Yes, I think a cost of 1 is accurate for sin, cos, 1/x, 1/sqrt(x), 2^x, and log2(x).
And a vec4 sin is equivalent to 4 scalar sin's, so, yeah that would have a cost of 4.

Dispatching one of these SFU ops takes only one cycle, same as an add or a multiply. Since the SFUs are independent, the GPU is free to work on other things while they are busy. But there's a catch: the SFUs run at 1/8 the speed of the ALUs (which handle add, mul, sub, fma). The GPU will do its best to distribute the workload evenly, but if your code is dominated by SFU operations, they will become a bottleneck. All that means in practice is that SFU operations should constitute no more than 1/9 of the total operations within your code's innermost loop. If you exceed that, the cost jumps all the way up to 8.

e^x and ln(x) have a cost of 2 - they are computed with one mul, plus either a log2(x) or a 2^(x).
Division has a cost of 2. One reciprocal, one mul.
Sqrt also has a cost of 2. One reciprocal square root, one reciprocal.
tan has a cost of 4- one sin, one cos, one reciprocal, and one mul.
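Written out, those composite costs are just sums of 1-cycle primitives (a sketch of the arithmetic above; the function names are made up):

```cpp
// Each SFU primitive (rcp, rsqrt, sin, cos, log2, exp2) counts as 1.
int divCost()  { return 1 + 1; }          // rcp + mul
int sqrtCost() { return 1 + 1; }          // rsqrt + one more op
int expCost()  { return 1 + 1; }          // exp2 + mul
int tanCost()  { return 1 + 1 + 1 + 1; }  // sin + cos + rcp + mul
```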

If the Taylor expansion was just done at compile time to evaluate a constant expression, then it has no bearing on what actually happens on the GPU. Keep in mind that for CUDA, the compiler will actually replace sin and cos with a software-based polynomial approximation. But GLSL has looser accuracy standards than CUDA, so it can use the faster hardware instructions by default. I don't know about HLSL.
TruthSerum
Guest
« Reply #9 on: June 12, 2015, 12:00:56 AM »

I've got it to the point where it's producing some sensible output. If you want to test it, let me know and I can make a build, or post the source code somewhere.

Here is the output after giving it the code to eiffie's paxis2 shader:

[screenshot: per-line cost output for eiffie's paxis2 shader]
eiffie
Guest
« Reply #10 on: June 12, 2015, 12:31:57 AM »

I would love to have this code. Partly as Syntopia said, for the parser. :-)
TruthSerum
Guest
« Reply #11 on: June 12, 2015, 01:10:15 AM »

Since people only seem to be interested in the code, I won't bother building a Windows executable.

I've pushed it to github, which you can find here.
cKleinhuis
« Reply #12 on: June 12, 2015, 02:48:46 AM »

lol, you need a "sum total" row ;)
... at best one for each method separately :P
Syntopia
« Reply #13 on: June 12, 2015, 08:13:47 PM »

Thanks for posting the source - I wasn't aware of the GLSLang tools.

I'm not sure I get these numbers. For instance, line 46 says '193', but surely it should be a multiple of 8 given the loop it is in? Line 36 says 0, even though there are several operations?
TruthSerum
Guest
« Reply #14 on: June 12, 2015, 08:32:20 PM »

It doesn't make complete sense. I was just trying to derive some meaningful numbers from the operation count. In the particular place you point out, where a line with some operations has a 0 cost, I think the cost is being displaced by the if-statement somehow.

But I don't plan to develop it further, since there was not much interest in the thing beyond revealing how I was parsing the input.