[ET-VK] Better work group sizes for matmul #13185

SS-JIA · 2025-08-07T16:05:03Z

Stack from ghstack (oldest at bottom):

Context

Currently default_pick_local_wg_size() (which internally calls ComputeGraph::create_local_wg_size) is used to select the local work group size for matrix multiplication ops. However, these functions bias the local work group size towards the largest dim of the global work group, producing local wg sizes like

shader                                                    global wg size    local wg size
======================================================    ==============    =============    ====
linear_qga4w_tiled_texture3d_texture3d_texture2d_float    {256, 29, 1}      {32, 2, 1}       1487
matmul_naive_texture3d_float                              {29, 115, 32}     {4, 2, 8}         712

for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.

However, experimental testing shows that a "square" work group size of {8, 8, 1} works a lot better for matrix multiplication shaders. The theoretical explanation is that the local work group size determines which memory locations must be loaded to compute the outputs of the whole work group. For a work group of size {W, H, 1}, the data required to compute the output is W * OUTPUT_TILE_W columns of the weight tensor and H * OUTPUT_TILE_H rows of the input tensor. Note that all work group items with the same W index request the same columns from the weight tensor, and all work group items with the same H index request the same rows from the input tensor.

If H == W, this "balances" the amount of data that needs to be loaded from each input tensor and may result in better data sharing among the work group items. Assuming OUTPUT_TILE_W == OUTPUT_TILE_H == 1, a local work group of size {64, 1, 1} would require 1 unique row from the input tensor and 64 unique columns from the weight tensor, resulting in (1 + 64) * K = 65K elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size {8, 8, 1} would require 8 unique rows and 8 unique columns, resulting in only (8 + 8) * K = 16K unique elements to be loaded.
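The arithmetic above can be checked with a small helper. This is only a sketch of the counting argument: `unique_elements_loaded` and its parameter names are hypothetical and do not appear in the PR, and the model ignores caching, texture layout, and packing.

```cpp
#include <cstdint>

// Hypothetical helper, not part of ExecuTorch: counts the unique input
// elements that one local work group of size {wg_w, wg_h, 1} has to read,
// following the model in the PR description:
//   weight columns = wg_w * OUTPUT_TILE_W
//   input rows     = wg_h * OUTPUT_TILE_H
// Each row or column holds k elements (k is the shared reduction dim).
constexpr std::uint64_t unique_elements_loaded(
    std::uint32_t wg_w,
    std::uint32_t wg_h,
    std::uint64_t k,
    std::uint32_t output_tile_w = 1,
    std::uint32_t output_tile_h = 1) {
  return (static_cast<std::uint64_t>(wg_w) * output_tile_w +
          static_cast<std::uint64_t>(wg_h) * output_tile_h) *
      k;
}

// With k = 1 the results are the coefficients of K quoted in the text.
static_assert(unique_elements_loaded(64, 1, 1) == 65, "{64, 1, 1} loads 65K");
static_assert(unique_elements_loaded(8, 8, 1) == 16, "{8, 8, 1} loads 16K");
```

Both shapes contain 64 invocations, so under this model the square shape reads roughly a quarter of the data for the same amount of output.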

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

Changes

Introduce pick_hw_square_wg_size
Use the new local work group size determination function for Quantized Linear, Matmul, and Linear
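As a rough illustration of what a dedicated picker could look like, the sketch below prefers {8, 8, 1} and only skews the shape when one of the two dims is too small to fill a square. It is not the `pick_hw_square_wg_size` implementation from this PR: the `UVec3` type, the function name, the threshold of 8, and the {4, 16, 1} / {16, 4, 1} fallbacks are all assumptions made for the example (C++14 or later).

```cpp
#include <cstdint>

// Hypothetical stand-in for a 3-component unsigned vector type.
struct UVec3 {
  std::uint32_t x;
  std::uint32_t y;
  std::uint32_t z;
};

// Sketch only: picks a 64-invocation local work group that is square in
// width/height whenever the global work group is large enough in both dims.
// The threshold and the fallback shapes are assumptions, not taken from the PR.
constexpr UVec3 pick_square_wg_size_sketch(const UVec3& global_wg) {
  constexpr std::uint32_t kMinDimForSquare = 8;  // assumed threshold
  if (global_wg.x >= kMinDimForSquare && global_wg.y >= kMinDimForSquare) {
    return {8u, 8u, 1u};
  }
  // Width is small: bias towards height to limit idle invocations.
  if (global_wg.x < kMinDimForSquare) {
    return {4u, 16u, 1u};
  }
  // Otherwise height is small: bias towards width.
  return {16u, 4u, 1u};
}
```

For the two global work group sizes listed in the table, {256, 29, 1} and {29, 115, 32}, this sketch returns {8, 8, 1} in both cases.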

Differential Revision: D79813236


pytorch-bot · 2025-08-07T16:05:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13185

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 4 Unrelated Failures

As of commit 665483a with merge base b36d6b6:

NEW FAILURES - The following jobs have failed:

Build documentation / build (buck2) / Build doc (gh)
pull / unittest-arm-backend-with-no-fvp (test_pytest_ops) / linux-job (gh)
RuntimeError: Command docker exec -t 9f2b72765033913ff7643ca8967990709b6155a306df63d5a89e136bc068ac06 /exec failed with exit code 1
pull / unittest-editable / macos / macos-job (gh)
backends/xnnpack/test/recipes/test_xnnpack_recipes.py::TestXnnpackRecipes::test_8a4w_recipe

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / linux / linux-job (gh) (trunk failure)
examples/models/llama/tests/test_ring_attention.py::TestRingAttention::test_single_token_processing_quantized
pull / unittest / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-arm-backend-with-no-fvp (test_pytest_models) / linux-job (gh) (trunk failure)
backends/arm/test/models/stable_diffusion/test_vae_AutoencoderKL.py::TestAutoencoderKL::test_AutoencoderKL_tosa_MI
pull / unittest-editable / linux / linux-job (gh) (trunk failure)
examples/models/llama/tests/test_ring_attention.py::TestRingAttention::test_single_token_processing_quantized

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 301415132
Pull Request resolved: #13185

facebook-github-bot · 2025-08-07T16:05:25Z

This pull request was exported from Phabricator. Differential Revision: D79813236

github-actions · 2025-08-07T16:06:00Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

backends/vulkan/runtime/graph/ops/impl/Common.cpp


facebook-github-bot · 2025-08-11T13:54:26Z

This pull request was exported from Phabricator. Differential Revision: D79813236


facebook-github-bot · 2025-08-13T01:18:38Z

This pull request was exported from Phabricator. Differential Revision: D79813236

andreanicastro

lgtm


facebook-github-bot · 2025-08-13T13:54:23Z

This pull request was exported from Phabricator. Differential Revision: D79813236

SS-JIA mentioned this pull request Aug 7, 2025

[ET-VK] Add mechanism to trigger command buffer re-encode only when necessary #13184

Merged

meta-cla bot added the CLA Signed label Aug 7, 2025

facebook-github-bot added the fb-exported label Aug 7, 2025

msluszniak reviewed Aug 7, 2025

backends/vulkan/runtime/graph/ops/impl/Common.cpp (outdated)

andreanicastro closed this Aug 13, 2025

andreanicastro had a problem deploying to cherry-pick-bot August 13, 2025 07:25 — with GitHub Actions Failure

andreanicastro reopened this Aug 13, 2025

andreanicastro approved these changes Aug 13, 2025


facebook-github-bot merged commit bb27ab3 into gh/SS-JIA/272/base Aug 13, 2025
98 of 106 checks passed

facebook-github-bot deleted the gh/SS-JIA/272/head branch August 13, 2025 17:52

facebook-github-bot temporarily deployed to cherry-pick-bot August 13, 2025 17:52 — with GitHub Actions Inactive

pytorchbot mentioned this pull request Aug 13, 2025

[ET-VK] Better work group sizes for matmul #13378

Merged
