Note that the code above can still give correct results as long as the CTAs are ordered linearly: there will be multiple TMA stores to each output location, but the last one should be the correct one. This is still inefficient, though.
On Hopper, when both split-K and the smem epilogue are enabled, we generate code like the following:
The predicate `b28` is predicating the stmatrix call, which is good, but it should also guard the TMA store. Otherwise, all CTAs in the grid will be writing the output instead of just the last CTA in each split-K segment, which is both slow and incorrect.
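To make the failure mode concrete, here is a rough host-side sketch in Python. It is a toy model, not the generated kernel. It assumes that the split-K reduction is serial, so CTA `i` of a segment holds the sum of partials `0..i`, and that stores can land in an arbitrary order. The names `simulate_segment`, `store_order`, and `guard_store` are hypothetical, and `is_last` only stands in for the role `b28` plays in the generated code.

```python
from itertools import accumulate


def simulate_segment(partials, store_order, guard_store):
    """Toy model of one split-K segment (assumption: serial reduction).

    partials[i]  : partial result computed by CTA i of the segment
    store_order  : order in which the CTAs' output stores land
    guard_store  : if True, only the last CTA issues the store
    Returns (final output value, number of stores issued).
    """
    num_splits = len(partials)
    # After the serial reduction, CTA i holds sum(partials[0..i]).
    running = list(accumulate(partials))
    output = None
    stores = 0
    for cta in store_order:
        is_last = cta == num_splits - 1  # stand-in for the b28 predicate
        if guard_store and not is_last:
            continue  # guarded: non-final CTAs skip the store
        output = running[cta]  # a later store overwrites an earlier one
        stores += 1
    return output, stores


partials = [1.0, 2.0, 3.0, 4.0]
linear = [0, 1, 2, 3]    # stores land in CTA order
shuffled = [0, 3, 1, 2]  # the last CTA's store is overwritten by a partial sum

unguarded_linear = simulate_segment(partials, linear, guard_store=False)
unguarded_shuffled = simulate_segment(partials, shuffled, guard_store=False)
guarded_shuffled = simulate_segment(partials, shuffled, guard_store=True)
```

Under these assumptions, the unguarded variant issues one store per CTA and is only correct when the final CTA's store lands last, whereas the guarded variant issues a single store per segment and does not depend on store ordering.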