Message-Id: <20250601232937.3510379-64-sashal@kernel.org>
Date: Sun, 1 Jun 2025 19:28:56 -0400
From: Sasha Levin <sashal@...nel.org>
To: patches@...ts.linux.dev,
stable@...r.kernel.org
Cc: Amber Lin <Amber.Lin@....com>,
Alex Deucher <alexander.deucher@....com>,
Sasha Levin <sashal@...nel.org>,
Felix.Kuehling@....com,
christian.koenig@....com,
airlied@...il.com,
simona@...ll.ch,
amd-gfx@...ts.freedesktop.org,
dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Subject: [PATCH AUTOSEL 6.14 064/102] drm/amdkfd: Set SDMA_RLCx_IB_CNTL/SWITCH_INSIDE_IB

From: Amber Lin <Amber.Lin@....com>

[ Upstream commit ab9fcc6362e0699fc1150aa1d8503c40fce2c1e1 ]

When submitting MQD to CP, set SDMA_RLCx_IB_CNTL/SWITCH_INSIDE_IB bit so
it'll allow SDMA preemption if there is a massive command buffer of
long-running SDMA commands.

Signed-off-by: Amber Lin <Amber.Lin@....com>
Acked-by: Alex Deucher <alexander.deucher@....com>
Signed-off-by: Alex Deucher <alexander.deucher@....com>
Signed-off-by: Sasha Levin <sashal@...nel.org>
---

Based on my comprehensive analysis of both the commit and the kernel
codebase context, here is my assessment:

**YES** This commit should be backported to stable kernel trees. Here's my
extensive analysis:

## Code Analysis

The commit makes a minimal but important change to
`drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c` in the
`update_mqd_sdma()` function:

```c
m->sdmax_rlcx_ib_cntl |= SDMA0_GFX_IB_CNTL__SWITCH_INSIDE_IB_MASK;
```

## Why This Should Be Backported

### 1. **Fixes a Clear Functional Bug**

- The commit addresses missing register configuration that prevents proper
  SDMA preemption
- Without this bit set, SDMA engines cannot perform context switches inside
  Indirect Buffers (IBs)
- This creates a scheduling fairness issue where long-running SDMA commands
  can starve other queues

### 2. **Minimal Risk, High Confidence Change**

- **Single functional line**: Only adds one register bit setting, plus a
  three-line comment
- **Well-understood hardware feature**: The `SWITCH_INSIDE_IB` bit is a
  documented hardware capability
- **Conservative fix**: Simply enables existing hardware functionality that
  should have been enabled
- **No API changes**: Doesn't modify interfaces or data structures

### 3. **Broad Impact on Modern AMD Hardware**

- Affects AMD GPUs whose SDMA queues are set up by the v9 MQD manager
  (`kfd_mqd_manager_v9.c`), i.e. the GFX9 family such as Vega
- Impacts compute workloads using ROCm, OpenCL, and HIP
- The issue affects system responsiveness under heavy SDMA usage patterns

### 4. **Consistent with Kernel Patterns**

From examining similar commits, this follows the pattern of commits marked
"Backport Status: YES":

- **Similar Commit #5** was backported for implementing missing register
  state control in SDMA
- Both commits involve configuring SDMA control registers that were
  previously unset
- Both address functional correctness rather than adding new features

### 5. **Meets Stable Tree Criteria**

- **Important bugfix**: ✅ Fixes SDMA scheduling fairness
- **Small and contained**: ✅ One functional line in one function
- **No architectural changes**: ✅ Uses existing register and bit definitions
- **Minimal regression risk**: ✅ Only enables documented hardware feature
- **Critical subsystem**: ✅ GPU scheduling affects system performance

### 6. **Clear User Impact**

The commit message and the added code comment state the problem: a "massive
command buffer of long-running SDMA commands" can starve other processes.
This directly impacts:

- Multi-process compute workloads
- Real-time or interactive applications that need timely SDMA access
- System responsiveness during heavy GPU compute usage

### 7. **Evidence from Codebase Analysis**

My kernel tree analysis revealed:

- The `SWITCH_INSIDE_IB` functionality exists across all SDMA generations
- Other SDMA drivers (amdgpu) properly configure this register
- KFD was missing this configuration, creating an inconsistency
- No security implications found - this is purely a scheduling correctness
  issue

This commit represents exactly the type of targeted bugfix that stable
trees are designed to include: it's small, safe, fixes a real
user-impacting issue, and has minimal risk of regression.

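As a rough illustration of why the change uses `|=` rather than a plain
assignment, here is a minimal user-space sketch. Everything prefixed
`example_` / `EXAMPLE_` is hypothetical: the struct is a cut-down stand-in
for the SDMA MQD, and the bit positions are placeholders, not the values
from the kernel's SDMA register headers.

```c
#include <stdint.h>

/* Placeholder bit positions, for illustration only. The real masks are
 * defined in the kernel's SDMA register headers. */
#define EXAMPLE_IB_ENABLE_MASK        (UINT32_C(1) << 0)
#define EXAMPLE_SWITCH_INSIDE_IB_MASK (UINT32_C(1) << 8)

/* Cut-down stand-in for the SDMA MQD: only the field touched here. */
struct example_sdma_mqd {
	uint32_t sdmax_rlcx_ib_cntl;
};

/* OR the bit in, so bits that are already set in the field survive.
 * Calling this twice leaves the field unchanged after the first call. */
static void example_allow_switch_inside_ib(struct example_sdma_mqd *m)
{
	m->sdmax_rlcx_ib_cntl |= EXAMPLE_SWITCH_INSIDE_IB_MASK;
}
```

Because the bit is OR-ed in, repeated calls (as would happen on each MQD
update) are idempotent and do not clear any other field in the same word.
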
drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
index 3014925d95ffc..6666705211f0a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c
@@ -495,6 +495,10 @@ static void update_mqd_sdma(struct mqd_manager *mm, void *mqd,
 	m->sdma_engine_id = q->sdma_engine_id;
 	m->sdma_queue_id = q->sdma_queue_id;
 	m->sdmax_rlcx_dummy_reg = SDMA_RLC_DUMMY_DEFAULT;
+	/* Allow context switch so we don't cross-process starve with a massive
+	 * command buffer of long-running SDMA commands
+	 */
+	m->sdmax_rlcx_ib_cntl |= SDMA0_GFX_IB_CNTL__SWITCH_INSIDE_IB_MASK;
 
 	q->is_active = QUEUE_IS_ACTIVE(*q);
 }
--
2.39.5