Message-Id: <20250601232937.3510379-51-sashal@kernel.org>
Date: Sun,  1 Jun 2025 19:28:43 -0400
From: Sasha Levin <sashal@...nel.org>
To: patches@...ts.linux.dev,
	stable@...r.kernel.org
Cc: Arvind Yadav <Arvind.Yadav@....com>,
	Christian König <Christian.Koenig@....com>,
	Alex Deucher <alexander.deucher@....com>,
	Christian König <christian.koenig@....com>,
	Shashank Sharma <shashank.sharma@....com>,
	Arvind Yadav <arvind.yadav@....com>,
	Sasha Levin <sashal@...nel.org>,
	airlied@...il.com,
	simona@...ll.ch,
	Jack.Xiao@....com,
	srinivasan.shanmugam@....com,
	Prike.Liang@....com,
	Jiadong.Zhu@....com,
	mukul.joshi@....com,
	shaoyun.liu@....com,
	Arunpravin.PaneerSelvam@....com,
	Hawking.Zhang@....com,
	michael.chen@....com,
	sunil.khatri@....com,
	jesse.zhang@....com,
	amd-gfx@...ts.freedesktop.org,
	dri-devel@...ts.freedesktop.org,
	linux-kernel@...r.kernel.org
Subject: [PATCH AUTOSEL 6.14 051/102] drm/amdgpu: fix MES GFX mask

From: Arvind Yadav <Arvind.Yadav@....com>

[ Upstream commit 9d3afcb7b9f950b9b7c58ceeeb9e71f3476e69ed ]

Current MES GFX mask prevents FW to enable oversubscription. This patch
does the following:
- Fixes the mask values and adds a description for the same
- Removes the central mask setup and makes it IP specific, as it would
  be different when the number of pipes and queues are different.

v2: squash in fix from Shashank

Cc: Christian König <Christian.Koenig@....com>
Cc: Alex Deucher <alexander.deucher@....com>
Acked-by: Christian König <christian.koenig@....com>
Signed-off-by: Shashank Sharma <shashank.sharma@....com>
Signed-off-by: Arvind Yadav <arvind.yadav@....com>
Signed-off-by: Alex Deucher <alexander.deucher@....com>
Signed-off-by: Sasha Levin <sashal@...nel.org>
---

**YES** This commit should be backported to stable kernel trees. Here's
my analysis:

## Bug Fix Classification

This is a **functional bug fix** that addresses a hardware
configuration issue. The commit message explicitly states the problem:
"Current MES GFX mask prevents FW to enable oversubscription."

## Detailed Code Change Analysis

### 1. Removal of Incorrect Central Mask Setup (amdgpu_mes.c)

```c
// REMOVED - This was the source of the bug:
for (i = 0; i < AMDGPU_MES_MAX_GFX_PIPES; i++)
	adev->mes.gfx_hqd_mask[i] = i ? 0 : 0xfffffffe;
```

The original code set a generic `0xfffffffe` mask (all bits set except
bit 0) on GFX pipe 0, and 0 on every other pipe, for all hardware. That
was incorrect and prevented the MES firmware from enabling
oversubscription because:

- It didn't account for hardware-specific queue topologies
- Different GPU generations have different numbers of available queues
- Some pipes have hardware limitations that weren't respected

### 2. Hardware-Specific Mask Implementation

**For MES v11.0 (mes_v11_0.c):**

```c
static void mes_v11_0_set_gfx_hqd_mask(union MESAPI_SET_HW_RESOURCES *pkt)
{
	// GFX pipe 0 queue 0 is being used by Kernel queue.
	// Set GFX pipe 0 queue 1 for MES scheduling (mask = 10b)
	// GFX pipe 1 can't be used for MES due to HW limitation.
	pkt->gfx_hqd_mask[0] = 0x2;	// Only queue 1 available
	pkt->gfx_hqd_mask[1] = 0;	// Pipe 1 disabled due to HW limitation
}
```

**For MES v12.0 (mes_v12_0.c):**

```c
static void mes_v12_0_set_gfx_hqd_mask(union MESAPI_SET_HW_RESOURCES *pkt)
{
	// GFX V12 has only one GFX pipe, but 8 queues in it.
	// GFX pipe 0 queue 0 is being used by Kernel queue.
	// Set GFX pipe 0 queue 1-7 for MES scheduling (mask = 1111 1110b)
	pkt->gfx_hqd_mask[0] = 0xFE;	// Queues 1-7 available
}
```

### 3. Impact on Functionality

**Functional impact:**

- **Oversubscription disabled**: The incorrect mask prevented MES
  firmware from enabling oversubscription, a core GPU efficiency
  feature
- **Resource underutilization**: GPUs couldn't schedule their GFX
  queues as intended
- **Performance degradation**: Without oversubscription, GPU throughput
  is likely reduced under load

## Comparison with Similar Commits

Looking at the historical examples:

- **Similar Commit #2 & #3** (Status: NO/YES): "drm/amdgpu/mes: fix mes
  ring buffer overflow" - This was a clear bug fix that got backported
  with `Fixes:` tags and `Cc: stable@...r.kernel.org`
- **Similar Commits #1, #4, #5** (Status: NO): These were feature
  additions or code improvements, not functional bug fixes

## Backport Suitability Assessment

**✅ Meets stable tree criteria:**

1. **Fixes important bug**: Prevents MES firmware from enabling
   oversubscription
2. **Small and contained**: Only changes mask values and where they are
   set, no architectural changes
3. **Clear side effects**: Minimal - only corrects queue configuration
4. **No architectural changes**: Moves mask setup from generic to
   hardware-specific code without changing core logic
5. **Critical subsystem**: GPU scheduling affects system performance
6. **Low regression risk**: Only corrects wrong mask values

**✅ Stable tree rules compliance:**

- **Important bugfix**: ✅ Fixes core GPU scheduling functionality
- **Minimal risk**: ✅ Only corrects configuration values
- **No new features**: ✅ Fixes existing functionality rather than
  adding new capabilities
- **Confined to subsystem**: ✅ Changes only affect AMD GPU MES
  subsystem

## Conclusion

This commit should be backported because it fixes a functional bug
where MES firmware cannot enable oversubscription due to incorrect
hardware queue masks. The fix is small, well-understood, and addresses
a clear performance issue without introducing architectural changes or
new features. Users with affected AMD GPUs should see improved GPU
utilization.
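
## Illustration: Mask Arithmetic

The mask values discussed above can be cross-checked with a small
standalone sketch. It is not part of the patch: both helper names are
hypothetical, and treating the MES v11 pipe 0 as a two-queue pipe is an
assumption inferred only from the "mask = 10b" comment. The sketch
builds a mask covering every queue of a pipe except queue 0 (held by
the kernel queue) and counts how many queues a mask exposes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: return a mask with one bit
 * per queue for a pipe with num_queues queues, leaving bit 0 clear
 * because queue 0 is used by the kernel queue.
 */
static uint32_t hqd_mask_without_queue0(unsigned int num_queues)
{
	uint32_t all;

	assert(num_queues >= 1 && num_queues <= 32);

	/* Shifting a 32-bit value by 32 is undefined, so special-case it. */
	all = (num_queues == 32) ? UINT32_MAX
				 : ((UINT32_C(1) << num_queues) - 1);

	return all & ~UINT32_C(1);
}

/* Hypothetical helper: count the queues a mask exposes for scheduling. */
static unsigned int hqd_mask_queue_count(uint32_t mask)
{
	unsigned int n = 0;

	/* Clear the lowest set bit on each iteration. */
	for (; mask; mask &= mask - 1)
		n++;

	return n;
}
```

Under these assumptions, 2 queues give `0x2`, 8 queues give `0xFE`, and
the removed generic value `0xfffffffe` corresponds to a 32-queue pipe,
that is, 31 queues advertised to the firmware.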

 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c |  3 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 +-
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 15 +++++++++++++--
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c  | 15 ++++++++++++---
 4 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index e4251d0691c9c..3077e3918dd4a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -150,9 +150,6 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
 		adev->mes.compute_hqd_mask[i] = 0xc;
 	}
 
-	for (i = 0; i < AMDGPU_MES_MAX_GFX_PIPES; i++)
-		adev->mes.gfx_hqd_mask[i] = i ? 0 : 0xfffffffe;
-
 	for (i = 0; i < AMDGPU_MES_MAX_SDMA_PIPES; i++) {
 		if (amdgpu_ip_version(adev, SDMA0_HWIP, 0) <
 		    IP_VERSION(6, 0, 0))
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index e98ea7ede1bab..6dbe32f8aff3e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -111,8 +111,8 @@ struct amdgpu_mes {
 
 	uint32_t                        vmid_mask_gfxhub;
 	uint32_t                        vmid_mask_mmhub;
-	uint32_t                        compute_hqd_mask[AMDGPU_MES_MAX_COMPUTE_PIPES];
 	uint32_t                        gfx_hqd_mask[AMDGPU_MES_MAX_GFX_PIPES];
+	uint32_t                        compute_hqd_mask[AMDGPU_MES_MAX_COMPUTE_PIPES];
 	uint32_t                        sdma_hqd_mask[AMDGPU_MES_MAX_SDMA_PIPES];
 	uint32_t                        aggregated_doorbells[AMDGPU_MES_PRIORITY_NUM_LEVELS];
 	uint32_t                        sch_ctx_offs[AMDGPU_MAX_MES_PIPES];
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index ec7ef8763f932..8dd92389fc5a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -666,6 +666,18 @@ static int mes_v11_0_misc_op(struct amdgpu_mes *mes,
 			offsetof(union MESAPI__MISC, api_status));
 }
 
+static void mes_v11_0_set_gfx_hqd_mask(union MESAPI_SET_HW_RESOURCES *pkt)
+{
+	/*
+	 * GFX pipe 0 queue 0 is being used by Kernel queue.
+	 * Set GFX pipe 0 queue 1 for MES scheduling
+	 * mask = 10b
+	 * GFX pipe 1 can't be used for MES due to HW limitation.
+	 */
+	pkt->gfx_hqd_mask[0] = 0x2;
+	pkt->gfx_hqd_mask[1] = 0;
+}
+
 static int mes_v11_0_set_hw_resources(struct amdgpu_mes *mes)
 {
 	int i;
@@ -690,8 +702,7 @@ static int mes_v11_0_set_hw_resources(struct amdgpu_mes *mes)
 		mes_set_hw_res_pkt.compute_hqd_mask[i] =
 			mes->compute_hqd_mask[i];
 
-	for (i = 0; i < MAX_GFX_PIPES; i++)
-		mes_set_hw_res_pkt.gfx_hqd_mask[i] = mes->gfx_hqd_mask[i];
+	mes_v11_0_set_gfx_hqd_mask(&mes_set_hw_res_pkt);
 
 	for (i = 0; i < MAX_SDMA_PIPES; i++)
 		mes_set_hw_res_pkt.sdma_hqd_mask[i] = mes->sdma_hqd_mask[i];
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
index 53d059a2a42e0..01f7f4b7f452a 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
@@ -692,6 +692,17 @@ static int mes_v12_0_set_hw_resources_1(struct amdgpu_mes *mes, int pipe)
 			offsetof(union MESAPI_SET_HW_RESOURCES_1, api_status));
 }
 
+static void mes_v12_0_set_gfx_hqd_mask(union MESAPI_SET_HW_RESOURCES *pkt)
+{
+	/*
+	 * GFX V12 has only one GFX pipe, but 8 queues in it.
+	 * GFX pipe 0 queue 0 is being used by Kernel queue.
+	 * Set GFX pipe 0 queue 1-7 for MES scheduling
+	 * mask = 1111 1110b
+	 */
+	pkt->gfx_hqd_mask[0] = 0xFE;
+}
+
 static int mes_v12_0_set_hw_resources(struct amdgpu_mes *mes, int pipe)
 {
 	int i;
@@ -714,9 +725,7 @@ static int mes_v12_0_set_hw_resources(struct amdgpu_mes *mes, int pipe)
 			mes_set_hw_res_pkt.compute_hqd_mask[i] =
 				mes->compute_hqd_mask[i];
 
-		for (i = 0; i < MAX_GFX_PIPES; i++)
-			mes_set_hw_res_pkt.gfx_hqd_mask[i] =
-				mes->gfx_hqd_mask[i];
+		mes_v12_0_set_gfx_hqd_mask(&mes_set_hw_res_pkt);
 
 		for (i = 0; i < MAX_SDMA_PIPES; i++)
 			mes_set_hw_res_pkt.sdma_hqd_mask[i] =
-- 
2.39.5

