| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reimplement the buffer cache using cached bindings and page level
granularity for modification tracking. This also drops the usage of
shared pointers and virtual functions from the cache.
- Bindings are cached, allowing to skip work when the game changes few
bits between draws.
- OpenGL Assembly shaders no longer copy when a region has been modified
from the GPU to emulate constant buffers, instead GL_EXT_memory_object
is used to alias sub-buffers within the same allocation.
- OpenGL Assembly shaders stream constant buffer data using
glProgramBufferParametersIuivNV, from NV_parameter_buffer_object. In
theory this should save one hash table resolve inside the driver
compared to glBufferSubData.
- A new OpenGL stream buffer is implemented based on fences for drivers
that are not Nvidia's proprietary, due to their low performance on
partial glBufferSubData calls synchronized with 3D rendering (that
some games use a lot).
- Most optimizations are shared between APIs now, allowing Vulkan to
cache more bindings than before, skipping unnecesarry work.
This commit adds the necessary infrastructure to use Vulkan object from
OpenGL. Overall, it improves performance and fixes some bugs present on
the old cache. There are still some edge cases hit by some games that
harm performance on some vendors, this are planned to be fixed in later
commits.
|
|
|
|
|
| |
This reverts #4713. The implementation in that PR is not accurate.
It does not reflect the behavior seen in hardware.
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The current texture cache has several points that hurt maintainability
and performance. It's easy to break unrelated parts of the cache
when doing minor changes. The cache can easily forget valuable
information about the cached textures by CPU writes or simply by its
normal usage.The current texture cache has several points that hurt
maintainability and performance. It's easy to break unrelated parts
of the cache when doing minor changes. The cache can easily forget
valuable information about the cached textures by CPU writes or simply
by its normal usage.
This commit aims to address those issues.
|
| |
| |
| |
| |
| | |
Cleans out the rest of the occurrences of variable shadowing and makes
any further occurrences of shadowing compiler errors.
|
| | |
|
|/ |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Games using D3D idioms can join images and samplers when a shader
executes, instead of baking them into a combined sampler image. This is
also possible on Vulkan.
One approach to this solution would be to use separate samplers on
Vulkan and leave this unimplemented on OpenGL, but we can't do this
because there's no consistent way of determining which constant buffer
holds a sampler and which one an image. We could in theory find the
first bit and if it's in the TIC area, it's an image; but this falls
apart when an image or sampler handle use an index of zero.
The used approach is to track for a LOP.OR operation (this is done at an
IR level, not at an ISA level), track again the constant buffers used as
source and store this pair. Then, outside of shader execution, join
the sample and image pair with a bitwise or operation.
This approach won't work on games that truly use separate samplers in a
meaningful way. For example, pooling textures in a 2D array and
determining at runtime what sampler to use.
This invalidates OpenGL's disk shader cache :)
- Used mostly by D3D ports to Switch
|
| |
|
|\
| |
| | |
shader/texture: Support multiple unknown sampler properties
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This allows deducing some properties from the texture instruction before
asking the runtime. By doing this we can handle type mismatches in some
instructions from the renderer instead of the shader decoder.
Fixes texelFetch issues with games using 2D texture instructions on a 1D
sampler.
|
| | |
|
|/
|
|
|
|
|
|
| |
Deduplicate code shared between vk_pipeline_cache and gl_shader_cache as
well as shader decoder code.
While we are at it, fix a bug in gl_shader_cache where compute shaders
had an start offset of a stage shader.
|
|\
| |
| | |
shader/video: Partially implement VMNMX
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Implements the common usages for VMNMX. Inputs with a different size
than 32 bits are not supported and sign mismatches aren't supported
either.
VMNMX works as follows:
It grabs Ra and Rb and applies a maximum/minimum on them (this is
defined by .MX), having in mind the input sign. This result can then be
saturated. After the intermediate result is calculated, it applies
another operation on it using Rc. These operations are merges,
accumulations or another min/max pass.
This instruction allows to implement with a more flexible approach GCN's
min3 and max3 instructions (for instance).
|
| | |
|
| | |
|
|/ |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
| |
Using the same technique we used for u8 on LDG, implement u16.
In the case of STG, load memory and insert the value we want to set
into it with bitfieldInsert. Then set that value.
|
| |
|
|
|
|
|
|
|
| |
This commit introduces a mechanism by which shader IR code can be
amended and extended. This useful for track algorithms where certain
information can derived from before the track such as indexes to array
samplers.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
Instead of specializing shaders to separate texture buffers from 1D
textures, use the locker to deduce them while they are being decoded.
|
| |
|
|\
| |
| | |
shader/node: Unpack bindless texture encoding
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Bindless textures were using u64 to pack the buffer and offset from
where they come from. Drop this in favor of separated entries in the
struct.
Remove the usage of std::set in favor of std::list (it's not std::vector
to avoid reference invalidations) for samplers and images.
|
| |
| |
| |
| |
| |
| | |
Originally on the last commit I thought TLD4 acted the same as TLD4S and
didn't have a mask. It actually does have a component mask. This commit
corrects that.
|
|/
|
|
|
|
| |
This commit fixes an issue where not all 4 results of tld4 were being
written, the color component was defaulted to red, among other things.
It also implements the bindless variant.
|
|\
| |
| | |
Implement Fast BRX, fix TXQ and addapt the Shader Cache for it
|
| | |
|
| | |
|
| | |
|
| | |
|
|\ \
| |/
|/| |
Shader_Ir: Fix TLD4S from using a component mask.
|
| |
| |
| |
| |
| |
| | |
TLD4S always outputs 4 values, the previous code checked a component
mask and omitted those values that weren't part of it. This commit
corrects that and makes sure all 4 values are set.
|
|/
|
|
|
|
|
|
|
|
|
| |
Ignore global memory operations instead of invoking undefined behaviour
when constant buffer tracking fails and we are blasting through asserts,
ignore the operation.
In the case of LDG this means filling the destination registers with
zeroes; for STG this means ignore the instruction as a whole.
The default behaviour is still to abort execution on failure.
|
| |
|
| |
|
| |
|
| |
|
|\
| |
| | |
shader/image: Implement SULD and fix SUATOM
|
| |
| |
| |
| |
| |
| | |
In the process remove implementation of SUATOM.MIN and SUATOM.MAX as
these require a distinction between U32 and S32. These have to be
implemented with imageCompSwap loop.
|
|/ |
|
|\
| |
| | |
shader_ir: Implement shared memory
|
| |
| |
| |
| |
| | |
This instruction writes to a memory buffer shared with threads within
the same work group. It is known as "shared" memory in GLSL.
|
| | |
|
|/ |
|
|\
| |
| | |
shader/decode: Implement S2R Tic
|
| | |
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implement VOTE using Nvidia's intrinsics. Documentation about these can
be found here
https://developer.nvidia.com/reading-between-threads-shader-intrinsics
Instead of using portable ARB instructions I opted to use Nvidia
intrinsics because these are the closest we have to how Tegra X1
hardware renders.
To stub VOTE on non-Nvidia drivers (including nouveau) this commit
simulates a GPU with a warp size of one, returning what is meaningful
for the instruction being emulated:
* anyThreadNV(value) -> value
* allThreadsNV(value) -> value
* allThreadsEqualNV(value) -> true
ballotARB, also known as "uint64_t(activeThreadsNV())", emits
VOTE.ANY Rd, PT, PT;
on nouveau's compiler. This doesn't match exactly to Nvidia's code
VOTE.ALL Rd, PT, PT;
Which is emulated with activeThreadsNV() by this commit. In theory this
shouldn't really matter since .ANY, .ALL and .EQ affect the predicates
(set to PT on those cases) and not the registers.
|
|
|
|
|
|
| |
This is more accurate in terms of describing what the functions are
actually doing. Temporal relates to time, not the setting of a temporary
itself.
|
|
|
|
| |
Removes unnecessary header dependencies.
|
|\
| |
| | |
shader/track: Track indirect buffers
|
| |
| |
| |
| |
| |
| | |
While changing this code, simplify tracking code to allow returning
the base address node, this way callers don't have to manually rebuild
it on each invocation.
|
|\ \
| |/
|/| |
gl_shader_decompiler: Implement gl_ViewportIndex and gl_Layer in vertex shaders
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This commit implements gl_ViewportIndex and gl_Layer in vertex and
geometry shaders. In the case it's used in a vertex shader, it requires
ARB_shader_viewport_layer_array. This extension is available on AMD and
Nvidia devices (mesa and proprietary drivers), but not available on
Intel on any platform. At the moment of writing this description I don't
know if this is a hardware limitation or a driver limitation.
In the case that ARB_shader_viewport_layer_array is not available,
writes to these registers on a vertex shader are ignored, with the
appropriate logging.
|
| |
| |
| |
| | |
Also shows Nvidia's address space on comments.
|
| | |
|
| | |
|
| | |
|
|/ |
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
Analysis passes do not have a good reason to depend on shader_ir.h to
work on top of nodes. This splits node-related declarations to their own
file and leaves the IR in shader_ir.h
|
|
|
|
|
|
|
|
|
| |
Instead of having a vector of unique_ptr stored in a vector and
returning star pointers to this, use shared_ptr. While changing
initialization code, move it to a separate file when possible.
This is a first step to allow code analysis and node generation beyond
the ShaderIR class.
|
|\
| |
| | |
shader: Implement S2R Tid{XYZ} and CtaId{XYZ}
|
| | |
|
|\ \
| | |
| | | |
shader/memory: Implement generic memory stores and loads (ST and LD)
|
| |/ |
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This allows for forming comment nodes without making unnecessary copies
of the std::string instance.
e.g. previously:
Comment(fmt::format("Base address is c[0x{:x}][0x{:x}]",
cbuf->GetIndex(), cbuf_offset));
Would result in a copy of the string being created, as CommentNode()
takes a std::string by value (a const ref passed to a value parameter
results in a copy).
Now, only one instance of the string is ever moved around. (fmt::format
returns a std::string, and since it's returned from a function by value,
this is a prvalue (which can be treated like an rvalue), so it's moved
into Comment's string parameter), we then move it into the CommentNode
constructor, which then moves the string into its member variable).
|
|\
| |
| | |
shader: Implement AL2P and ALD.PHYS
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
| |
| |
| |
| |
| | |
constexpr internally links by default, so the inline specifier is
unnecessary.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Many of these constructors don't even need to be templated. The only
ones that need to be templated are the ones that actually make use of
the parameter pack.
Even then, since std::vector accepts an initializer list, we can supply
the parameter pack directly to it instead of creating our own copy of
the list, then copying it again into the std::vector.
|
| |
| |
| |
| |
| | |
These overloads don't actually make use of the parameter pack, so they
can be turned into regular non-template function overloads.
|
| |
| |
| |
| |
| | |
These don't actually modify instance state, so they can be marked as
const member functions
|
|/
|
|
|
|
| |
Given the class contains quite a lot of non-trivial types, place the
constructor and destructor within the cpp file to avoid inlining
construction and destruction code everywhere the class is used.
|
| |
|
| |
|
|\
| |
| | |
shader_ir/decode: Miscellaneous fixes to half-float decompilation
|
| |
| |
| |
| |
| |
| |
| | |
Operations done before the main half float operation (like HAdd) were
managing a packed value instead of the unpacked one. Adding an unpacked
operation allows us to drop the per-operand MetaHalfArithmetic entry,
simplifying the code overall.
|
| | |
|
| | |
|
|\ \
| | |
| | | |
Implement Bindless Textures on Shader Decompiler and GL backend
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| |/ |
|
|/ |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
This was originally included because texture operations returned a vec4.
These operations now return a single float and the F4 prefix doesn't
mean anything.
|
|
|
|
|
|
|
|
|
| |
Previous code relied on GLSL parameter order (something that's always
ill-formed on an IR design). This approach passes spatial coordiantes
through operation nodes and array and depth compare values in the the
texture metadata. It still contains an "extra" vector containing generic
nodes for bias and component index (for example) which is still a bit
ill-formed but it should be better than the previous approach.
|
|\
| |
| | |
shader/track: Add a more permissive global memory tracking
|
| |
| |
| |
| | |
It's not always used as a basic block. Rename it for consistency.
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Some games call LDG at the top of a basic block, making the tracking
heuristic to fail. This commit lets the heuristic the decoded nodes as a
whole instead of per basic blocks.
This may lead to some false positives but allows it the heuristic to
track cases it previously couldn't.
|
|/ |
|
|
|
|
|
|
|
| |
Constant buffer values on the shader IR were using different offsets if
the access direct or indirect. cbuf34 has a non-multiplied offset while
cbuf36 does. On shader decoding this commit multiplies it by four on
cbuf34 queries.
|
| |
|
|
|
|
|
| |
Given we're in the area, these are three trivial typos that can be
corrected.
|
|
|
|
|
| |
Orders the class members in the same order that they would actually be
initialized in. Gets rid of two compiler warnings.
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
| |
Co-Authored-By: ReinUsesLisp <reinuseslisp@airmail.cc>
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|