Author Topic: shader function or instruction cost (performance)  (Read 24975 times)
TruthSerum
Guest
« on: June 08, 2015, 08:24:35 PM »

I was thinking of making a tool to calculate the computational cost of each line of shader code. The tool would output a cost value for each line, possibly colouring the line to show how expensive it is. As we have discussed previously, it is difficult to gauge exactly how computationally expensive a particular shader function (referring to the intrinsic functions: min, max, sin, cos, etc.) is going to be.

But I believe you can make reasonable assumptions about the cost of a function. What matters here, I think, is the ordering of costs relative to other functions: it should be reasonable to assume that evaluating the sine of an angle is more computational work than adding two scalars.

Here are some example costs that would be the basis for the calculation, which would just be a simple walk over the operations, adding each operation's cost to a running total for that line of code. I expect the numbers would be configurable in the tool anyhow:

scalar add, sub, mult, div   1
vec4 add/mul/div/sub         4 * 1
scalar fract()               2
scalar sqrt()                2
vec3 dot()                   3
vec3 normalize()             5           (dot + sqrt)
scalar sin()                 2 + 3       (fract + dot)
vec4 sin()                   4 * (3+2)   (4 * sin)
mat4 * vec3                  4 * 3       (4 * dot)
mat4 * mat4                  4 * 4 * 3   (4 * 4 * dot)

In general, the cost of an n-vector operation is simply n * (cost of the scalar operation), with some assumptions about when these operations can be reduced to dot or fma operations. Also, for a branch, only the operations that make up the condition would count towards the cost.
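For illustration, the per-line walk could look something like this minimal C++ sketch. The cost table and function names are hypothetical, and the numbers just mirror the table above:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical per-operation cost table; in the real tool these numbers
// would be loaded from a configurable file rather than hard-coded.
static const std::map<std::string, int> kOpCost = {
    {"add", 1}, {"mul", 1}, {"fract", 2}, {"sqrt", 2},
    {"dot3", 3}, {"normalize3", 5}, {"sin", 5},  // sin = fract + dot = 2 + 3
};

// Sum the cost of every operation the parser attributes to one source line.
int lineCost(const std::vector<std::string>& ops) {
    int total = 0;
    for (const auto& op : ops) total += kOpCost.at(op);
    return total;
}
```

The tool would then emit one such total per source line, which is what the colouring would be based on.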

Do you think these numbers make sense? Is there a better formula that could be applied to compute the cost of an individual line of code, as a function of the costs of its individual operations?
cKleinhuis
Administrator
« Reply #1 on: June 08, 2015, 09:24:34 PM »

Hmm, if you ask me, it should theoretically be possible to extract the concrete ASM code for the snippet line you are examining, e.g. mark the part beforehand with an arbitrary marker so you can identify it in the ASM output. With that you can then use the mnemonic cost definitions, which should be lying around somewhere ;)

---

divide and conquer - iterate and rule - chaos is No random!
laser blaster
« Reply #2 on: June 09, 2015, 01:11:27 AM »

The cost of some operations can vary depending on the GPU manufacturer and the architecture. This is what I could dig up. I might be wrong about some of the Nvidia stuff; I tried to guess as best as I could from the architectural details Nvidia has released. AMD was dead simple, though.

For both Nvidia and AMD:
Add, Multiply, Subtract and MAD/FMA - 1

For AMD:
1/x, sin(x), cos(x), log2(x), exp2(x), 1/sqrt(x) - 4
x/y - 5 (one reciprocal, one mul)
sqrt(x) - probably 5 (one reciprocal square root, one mul)

For Nvidia Fermi architecture (GeForce 400 and 500 series):
1/x, sin(x), cos(x), log2(x), exp2(x), 1/sqrt(x) - 8
Edit: not quite sure how large the dispatch overhead is for these operations on Fermi... anyway, throughput for these functions is 1/8 the speed of normal ops, so try to limit these functions to 1/9 of all total ops on average.

For Nvidia Kepler (GeForce 700 series) and Maxwell architecture (800M and 900 series):
1/x, sin(x), cos(x), log2(x), exp2(x), 1/sqrt(x) - 0 or close to 0, as long as they are limited to 1/9 of all total ops (can go up to 1/5 for Maxwell).
x/y - 1 (one special op, one mul)
sqrt(x) - 1 (one special op, one mul)

(obviously, for vector versions of the above functions, multiply by the width of the vector)

For all GPU's:
vec3 dot - 3
vec3 normalize - 6 + cost of a "1/sqrt(x)" op

asin, acos, atan - LOTS, probably around 25 (no native support on either Nvidia or AMD, so they are emulated in software).

Also, I think a conditional branch does come with some fixed overhead, but I don't know how much.

Reasons for my Nvidia numbers:
The reason for the "0" cost of the various "special functions" (1/x, sin x, etc.) on Nvidia Kepler/Maxwell is that those functions are computed on separate hardware (the special function unit, or SFU) in parallel with standard operations. Throughput is 1/8 that of standard ops (1/4 on Maxwell), so if your code has too many of those ops, they'll become a bottleneck. Hence the 1/9 and 1/5 guidelines. On Fermi, there is still a separate SFU for those functions, but due to the way instruction dispatching works, it looks like there might be some non-negligible overhead; I'm not sure how much.
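A rough way to picture the 1/9 guideline is the following toy throughput model (a sketch of the reasoning above, not vendor documentation):

```cpp
#include <algorithm>

// Toy throughput model: the ALUs retire one normal op per cycle, the SFU
// pipe retires one special op per 8 cycles, and the two run in parallel,
// so the slower pipe bounds the total time for a given instruction mix.
int estimatedCycles(int normalOps, int sfuOps) {
    return std::max(normalOps, sfuOps * 8);
}
```

With 8 normal ops per SFU op (the 1/9 mix), both pipes finish together; push the SFU share any higher and the special functions become the bottleneck.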
« Last Edit: June 09, 2015, 04:53:27 AM by laser blaster »
eiffie
Guest
« Reply #3 on: June 09, 2015, 02:35:01 PM »

This would be awesome if you can put it all together! For WebGL (where you don't know the hardware) it would be nice to just have an average.
TruthSerum
Guest
« Reply #4 on: June 09, 2015, 05:42:04 PM »

Thanks for the support. I've started working on it and got some preliminary results. I only added cost counts for a few operations so far:

[screenshot: shader source annotated with per-line costs]
This is basically how it will look, with costs on the right, line numbers on the left, source in-between. The output is HTML.

Issues that I am currently solving:

  • Operation cost in loops needs to be multiplied by the loop iteration count.
  • Operation cost in user-defined functions needs to be multiplied by the invocation count.
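Both items boil down to scaling a body's cost by a statically known repeat count, e.g. (hypothetical helpers):

```cpp
// Hypothetical helpers: scale a loop body's per-iteration cost by its
// (statically known) trip count, and a user-defined function's body cost
// by the number of call sites that invoke it.
int loopCost(int bodyCost, int tripCount)     { return bodyCost * tripCount; }
int functionCost(int bodyCost, int callCount) { return bodyCost * callCount; }
```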

Here is a list of all the operators the parser supports that need costs assigned to them. I expect I'll make this a JSON file or something that is loaded by the program. This would make it possible to have nvidia.json, intel.json, etc.
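A per-vendor cost file along those lines might look like this (names and numbers purely illustrative):

```json
{
  "add": 1,
  "mul": 1,
  "fract": 2,
  "sqrt": 2,
  "dot": 3,
  "normalize": 5,
  "sin": 5
}
```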

Code:
//
// Operators used by the high-level (parse tree) representation.
//
enum TOperator {
    EOpNull,            // if in a node, should only mean a node is still being built
    EOpSequence,        // denotes a list of statements, or parameters, etc.
    EOpLinkerObjects,   // for aggregate node of objects the linker may need, if not referenced by the rest of the AST
    EOpFunctionCall,    
    EOpFunction,        // For function definition
    EOpParameters,      // an aggregate listing the parameters to a function

    //
    // Unary operators
    //
    
    EOpNegative,
    EOpLogicalNot,
    EOpVectorLogicalNot,
    EOpBitwiseNot,

    EOpPostIncrement,
    EOpPostDecrement,
    EOpPreIncrement,
    EOpPreDecrement,

    EOpConvIntToBool,
    EOpConvUintToBool,
    EOpConvFloatToBool,
    EOpConvDoubleToBool,
    EOpConvBoolToFloat,
    EOpConvIntToFloat,
    EOpConvUintToFloat,
    EOpConvDoubleToFloat,
    EOpConvUintToInt,
    EOpConvFloatToInt,
    EOpConvBoolToInt,
    EOpConvDoubleToInt,
    EOpConvIntToUint,
    EOpConvFloatToUint,
    EOpConvBoolToUint,
    EOpConvDoubleToUint,
    EOpConvIntToDouble,
    EOpConvUintToDouble,
    EOpConvFloatToDouble,
    EOpConvBoolToDouble,

    //
    // binary operations
    //

    EOpAdd,
    EOpSub,
    EOpMul,
    EOpDiv,
    EOpMod,
    EOpRightShift,
    EOpLeftShift,
    EOpAnd,
    EOpInclusiveOr,
    EOpExclusiveOr,
    EOpEqual,
    EOpNotEqual,
    EOpVectorEqual,
    EOpVectorNotEqual,
    EOpLessThan,
    EOpGreaterThan,
    EOpLessThanEqual,
    EOpGreaterThanEqual,
    EOpComma,

    EOpVectorTimesScalar,
    EOpVectorTimesMatrix,
    EOpMatrixTimesVector,
    EOpMatrixTimesScalar,

    EOpLogicalOr,
    EOpLogicalXor,
    EOpLogicalAnd,

    EOpIndexDirect,
    EOpIndexIndirect,
    EOpIndexDirectStruct,

    EOpVectorSwizzle,

    EOpMethod,

    //
    // Built-in functions potentially mapped to operators
    //

    EOpRadians,
    EOpDegrees,
    EOpSin,
    EOpCos,
    EOpTan,
    EOpAsin,
    EOpAcos,
    EOpAtan,
    EOpSinh,
    EOpCosh,
    EOpTanh,
    EOpAsinh,
    EOpAcosh,
    EOpAtanh,

    EOpPow,
    EOpExp,
    EOpLog,
    EOpExp2,
    EOpLog2,
    EOpSqrt,
    EOpInverseSqrt,

    EOpAbs,
    EOpSign,
    EOpFloor,
    EOpTrunc,
    EOpRound,
    EOpRoundEven,
    EOpCeil,
    EOpFract,
    EOpModf,
    EOpMin,
    EOpMax,
    EOpClamp,
    EOpMix,
    EOpStep,
    EOpSmoothStep,

    EOpIsNan,
    EOpIsInf,

    EOpFloatBitsToInt,
    EOpFloatBitsToUint,
    EOpIntBitsToFloat,
    EOpUintBitsToFloat,
    EOpPackSnorm2x16,
    EOpUnpackSnorm2x16,
    EOpPackUnorm2x16,
    EOpUnpackUnorm2x16,
    EOpPackHalf2x16,
    EOpUnpackHalf2x16,

    EOpLength,
    EOpDistance,
    EOpDot,
    EOpCross,
    EOpNormalize,
    EOpFaceForward,
    EOpReflect,
    EOpRefract,

    EOpDPdx,            // Fragment only
    EOpDPdy,            // Fragment only
    EOpFwidth,          // Fragment only
    EOpDPdxFine,        // Fragment only
    EOpDPdyFine,        // Fragment only
    EOpFwidthFine,      // Fragment only
    EOpDPdxCoarse,      // Fragment only
    EOpDPdyCoarse,      // Fragment only
    EOpFwidthCoarse,    // Fragment only

    EOpMatrixTimesMatrix,
    EOpOuterProduct,
    EOpDeterminant,
    EOpMatrixInverse,
    EOpTranspose,

    EOpEmitVertex,           // geometry only
    EOpEndPrimitive,         // geometry only
    EOpEmitStreamVertex,     // geometry only
    EOpEndStreamPrimitive,   // geometry only

    EOpBarrier,
    EOpMemoryBarrier,
    EOpMemoryBarrierAtomicCounter,
    EOpMemoryBarrierBuffer,
    EOpMemoryBarrierImage,
    EOpMemoryBarrierShared,  // compute only
    EOpGroupMemoryBarrier,   // compute only

    EOpAtomicAdd,            // TODO: AST functionality: hook these up
    EOpAtomicMin,
    EOpAtomicMax,
    EOpAtomicAnd,
    EOpAtomicOr,
    EOpAtomicXor,
    EOpAtomicExchange,
    EOpAtomicCompSwap,

    EOpAny,
    EOpAll,

    //
    // Branch
    //

    EOpKill,            // Fragment only
    EOpReturn,
    EOpBreak,
    EOpContinue,
    EOpCase,
    EOpDefault,

    //
    // Constructors
    //

    EOpConstructGuardStart,
    EOpConstructInt,          // these first scalar forms also identify what implicit conversion is needed
    EOpConstructUint,
    EOpConstructBool,
    EOpConstructFloat,
    EOpConstructDouble,
    EOpConstructVec2,
    EOpConstructVec3,
    EOpConstructVec4,
    EOpConstructDVec2,
    EOpConstructDVec3,
    EOpConstructDVec4,
    EOpConstructBVec2,
    EOpConstructBVec3,
    EOpConstructBVec4,
    EOpConstructIVec2,
    EOpConstructIVec3,
    EOpConstructIVec4,
    EOpConstructUVec2,
    EOpConstructUVec3,
    EOpConstructUVec4,
    EOpConstructMat2x2,
    EOpConstructMat2x3,
    EOpConstructMat2x4,
    EOpConstructMat3x2,
    EOpConstructMat3x3,
    EOpConstructMat3x4,
    EOpConstructMat4x2,
    EOpConstructMat4x3,
    EOpConstructMat4x4,
    EOpConstructDMat2x2,
    EOpConstructDMat2x3,
    EOpConstructDMat2x4,
    EOpConstructDMat3x2,
    EOpConstructDMat3x3,
    EOpConstructDMat3x4,
    EOpConstructDMat4x2,
    EOpConstructDMat4x3,
    EOpConstructDMat4x4,
    EOpConstructStruct,
    EOpConstructGuardEnd,

    //
    // moves
    //
    
    EOpAssign,
    EOpAddAssign,
    EOpSubAssign,
    EOpMulAssign,
    EOpVectorTimesMatrixAssign,
    EOpVectorTimesScalarAssign,
    EOpMatrixTimesScalarAssign,
    EOpMatrixTimesMatrixAssign,
    EOpDivAssign,
    EOpModAssign,
    EOpAndAssign,
    EOpInclusiveOrAssign,
    EOpExclusiveOrAssign,
    EOpLeftShiftAssign,
    EOpRightShiftAssign,

    //
    // Array operators
    //

    EOpArrayLength,      // "Array" distinguishes from length(v) built-in function, but it applies to vectors and matrices as well.
};
laser blaster
« Reply #5 on: June 09, 2015, 09:52:35 PM »

Finally cleared up the confusion about special functions on Fermi.

"Special Function Units (SFUs) execute transcendental instructions such as sin, cosine,
reciprocal, and square root. Each SFU executes one instruction per thread, per clock; a warp
executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the
dispatch unit to issue to other execution units while the SFU is occupied."

So it seems that SFU ops have very little overhead. Just follow the 1/9 rule and you don't have to worry about them. Note that you don't need to wait 8 instructions between SFU ops; as long as there is a good balance between normal and SFU ops within your innermost loop, it'll be fine (the GPU will just switch to another warp while waiting on the SFUs).

So a normalize would be worth just 6 points on Nvidia, but 10 on AMD.
Syntopia
« Reply #6 on: June 09, 2015, 10:08:21 PM »

As LB points out, this will be difficult: Nvidia has Scalar Processors, Special Function Units, Double Precision Units, and Load/Store Units, each of which can become a bottleneck. For instance, the Mandelbulb uses trigonometric functions extensively in its inner loop.

Also, loops may not have a fixed iteration count, and conditionals can be hard to predict, because their computational cost depends on whether all scalar processors in a given multiprocessor follow the same path. And some functions (such as sign) may turn into very different low-level instructions on different platforms.

Other tricky parts: in some cases making 'uniforms' constant may alter rendering speed by as much as a factor of 4x (http://blog.hvidtfeldts.net/index.php/2011/07/optimizing-glsl-code/) - this is hard to see from a static analysis.

This paper has some measures on the older GT200 architecture: http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

Btw, the AST parser looks nice and could be useful - for instance for converting a shader into emulated double precision. What language is it written in? Will you be sharing the code?
TruthSerum
Guest
« Reply #7 on: June 09, 2015, 10:28:10 PM »

So with these SFUs, are you suggesting that sin, cos, tan and the transcendentals be assigned a cost of 1? It sounds kind of suspicious; I'm sure I've seen sine and cosine expanded into their Taylor series by HLSL (admittedly, not GLSL) constant folding.

For simplicity's sake, and to keep things meaningful on a lot of machines, the operation cost could instead be defined in terms of the number of vector elements operated on. That would be consistent with your proposal, where scalar sine is 1.

Under this scheme, the example costs from the table become:

scalar add, sub, mult, div   1
vec4 add/mul/div/sub         4
scalar fract()               1
scalar sqrt()                1
vec3 dot()                   3
vec3 normalize()             3
scalar sin()                 1
vec4 sin()                   4
mat3 * vec3                  3 * 3
mat4 * mat4                  4 * 4 * 4
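In this scheme the calculation degenerates to counting components, something like (sketch, hypothetical names):

```cpp
// Width-based costs: an elementwise op costs the number of components it
// touches; a matN * vecN multiply is N dot products of width N; a
// matN * matN multiply is N * N such dot products.
int elemCost(int n)   { return n; }
int matVecCost(int n) { return n * n; }
int matMatCost(int n) { return n * n * n; }
```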

Interesting points about the inability to predict the behaviour of loops. I think GLSL version 100 requires that loops be predictable in some way; I've not yet looked at it in detail. But there would be no way to know that a loop will exit early every time.

As for the code, I expect to make it available at some point, if the thing turns out to be at all useful. I've been told there are tools by nvidia that already do static analysis, which will be better than anything I can put together.
laser blaster
« Reply #8 on: June 09, 2015, 11:16:26 PM »

Yes, I think a cost of 1 is accurate for sin, cos, 1/x, 1/sqrt(x), 2^x, and log2(x).
And a vec4 sin is equivalent to 4 scalar sin's, so, yeah that would have a cost of 4.

Dispatching one of these SFU ops takes only one cycle, same as an add or a multiply. Since the SFUs are independent, the GPU is free to work on other things while they are busy. But there's a catch: the SFUs run at 1/8 the speed of the ALUs (which handle add, mul, sub, fma). The GPU will do its best to distribute the workload evenly, but if your code is dominated by SFU operations, they will become a bottleneck. All that means in practice is that SFU operations should constitute no more than 1/9 of the total operations within your code's innermost loop. If you exceed that, the cost jumps all the way up to 8.

e^x and ln(x) have a cost of 2 - they are computed with one mul, plus either a log2(x) or a 2^(x).
Division has a cost of 2. One reciprocal, one mul.
Sqrt also has a cost of 2. One reciprocal square root, one reciprocal.
tan has a cost of 4- one sin, one cos, one reciprocal, and one mul.
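Written out, those composite costs are just sums of 1-cycle primitives (a sketch of the arithmetic above; the function names are made up):

```cpp
// Each SFU primitive (rcp, rsqrt, sin, cos, log2, exp2) counts as 1.
int divCost()  { return 1 + 1; }          // rcp + mul
int sqrtCost() { return 1 + 1; }          // rsqrt + one more op
int expCost()  { return 1 + 1; }          // exp2 + mul
int tanCost()  { return 1 + 1 + 1 + 1; }  // sin + cos + rcp + mul
```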

If the Taylor expansion was just done at compile time to evaluate a constant expression, then it has no bearing on what actually happens on the GPU. Keep in mind that for CUDA, the compiler will actually replace sin and cos with a software-based polynomial approximation. But GLSL has looser accuracy standards than CUDA, so it can use the faster hardware instructions by default. I don't know about HLSL.
TruthSerum
Guest
« Reply #9 on: June 12, 2015, 12:00:56 AM »

I've got it to the point where it's producing some sensible output. If you want to test it, let me know and I can make a build, or post the source code somewhere.

Here is the output after giving it the code to eiffie's paxis2 shader:

[screenshot: per-line cost output for eiffie's paxis2 shader]
eiffie
Guest
« Reply #10 on: June 12, 2015, 12:31:57 AM »

I would love to have this code. Partly as Syntopia said, for the parser. :-)
TruthSerum
Guest
« Reply #11 on: June 12, 2015, 01:10:15 AM »

Since people only seem to be interested in the code, I won't bother building a Windows executable.

I've pushed it to github, which you can find here.
cKleinhuis
« Reply #12 on: June 12, 2015, 02:48:46 AM »

lol, you need a "sum total" row ;)
... at best one for each method separately :P
Syntopia
« Reply #13 on: June 12, 2015, 08:13:47 PM »

Thanks for posting the source - I wasn't aware of the GLSLang tools.

I'm not sure I get these numbers. For instance, line 46 says '193', but surely it should be a multiple of 8 given the loop it is in? Line 36 says 0, even though there are several operations?
TruthSerum
Guest
« Reply #14 on: June 12, 2015, 08:32:20 PM »

It doesn't make complete sense. I was just trying to derive some meaningful numbers from the operation count. In the particular place you point out, where a line with some operations has a 0 cost, I think the cost is being displaced by the if-statement somehow.

But I don't plan to develop it further, since there was not much interest in the thing beyond revealing how I was parsing the input.