This gives a small performance improvement. Don't enable NVfp for it though,
because the NVfp penalty is bigger than the gain from this patch. But if NVfp
is enabled anyway, make use of it.
This reverts patch ba35760f9f.
The original patch did not achive its goal, because CMP is a macro that is
expanded to SLT, SGE, MUL, MAD, at least on nvidia hardware. To make matters
worse, it uses a temporary register, and the assembler usually is not clever
enough to find a free temporary from the shader code. If we generate the code
outselves we can pick one of our temps for this job.
Many 2.0 and 3.0 shaders end with a "mov oC0, rx". If sRGB writing is enabled,
the ARB backend writes to a TMP_COLOR temporary, and at the end of the shader
writes the sRGB corrected color to result.color. If oC0 is not partially
rewritten after the mov, we can ignore the mov, not declare TMP_COLOR at all,
and just use the rx register as input for the sRGB correction code. This saves
a temporary and an instruction.
This reduces the number of methods in the shader backend(the instr
modifiers can be handled in that wrapper) and it will help flow
control emulation in the ARB backend.
Once upon a time this was used for creating fake vertex shader
attribute semantics for d3d8 shaders. We don't need this anymore since
device_stream_info_from_declaration() will use the vertex
declaration's output slot to load the data, if present. That also
avoids the potentially expensive matching of attribute semantics
between vertex shader and declaration for d3d8.
SCS is unfortunately a fragment program only instruction. If we have the NV
extensions we can use SIN and COS. Otherwise we have to approximate sine and
cosine with a taylor series. Luckily we're provided with the necessary
constants by the application.
TMP_POS is only used in vertex shaders, declare it in the vshader
specific code. The sRGB constants are only used by pixel shaders, so
move them to the ps specific code, and avoid reading the stateblock.
In D3D10 shaders input/output semantics are strings rather than predefined
types. Unfortunately, the code in vshader_get_input() can be performance
critical, depending on application behaviour. Since vshader_get_input() is
only relevant for d3d9 shaders anyway, just store the usage and usage_idx for
these shaders.
This isn't completely right, since as far as I'm aware SM4 doesn't have shader
limits in the same sense as previous shader models, but this should do for now.