[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20250814032153.227285-1-jeffbai@aosc.io>
Date: Thu, 14 Aug 2025 11:21:36 +0800
From: Mingcong Bai <jeffbai@...c.io>
To: linux-kernel@...r.kernel.org
Cc: Wentao Guan <guanwentao@...ontech.com>,
WangYuli <wangyuli@...ontech.com>,
Huacai Chen <chenhuacai@...nel.org>,
Kexy Biscuit <kexybiscuit@...c.io>,
Mingcong Bai <jeffbai@...c.io>,
Zhang Yuhao <xinmu@...mu.moe>,
Felix Kuehling <Felix.Kuehling@....com>,
Alex Deucher <alexander.deucher@....com>,
Christian König <christian.koenig@....com>,
David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>,
amd-gfx@...ts.freedesktop.org,
dri-devel@...ts.freedesktop.org
Subject: [RFC PATCH] drm/amdkfd: disable HSA_AMD_SVM on LoongArch and AArch64
While testing my ROCm port for LoongArch and AArch64 (patches pending) on
the following platforms:
- LoongArch ...
- Loongson AC612A0_V1.1 (Loongson 3C6000/S) + AMD Radeon RX 6800
- AArch64 ...
- FD30M51 (Phytium FT-D3000) + AMD Radeon RX 7600
- Huawei D920S10 (Huawei Kunpeng 920) + AMD Radeon RX 7600
When HSA_AMD_SVM is enabled, amdgpu would fail to initialise at all on
LoongArch (no output):
amdgpu 0000:0d:00.0: amdgpu: kiq ring mec 2 pipe 1 q 0
CPU 0 Unable to handle kernel paging request at virtual address ffffffffff800034, era == 9000000001058044, ra == 9000000001058660
Oops[#1]:
CPU: 0 UID: 0 PID: 202 Comm: kworker/0:3 Not tainted 6.16.0+ #103 PREEMPT(full)
Hardware name: To be filled by O.E.M.To be fill To be filled by O.E.M.To be fill/To be filled by O.E.M.To be fill, BIOS Loongson-UDK2018-V4.0.
Workqueue: events work_for_cpu_fn
pc 9000000001058044 ra 9000000001058660 tp 9000000101500000 sp 9000000101503aa0
a0 ffffffffff800000 a1 0000000ffffe0000 a2 0000000000000000 a3 90000001207c58e0
a4 9000000001a4c310 a5 0000000000000001 a6 0000000000000000 a7 0000000000000001
t0 000003ffff800000 t1 0000000000000001 t2 0000040000000000 t3 03ffff0000002000
t4 0000000000000000 t5 0001010101010101 t6 ffff800000000000 t7 0001000000000000
t8 000000000000002f u0 0000000000800000 s9 9000000002026000 s0 90000001207c58e0
s1 0000000000000001 s2 9000000001935c40 s3 0000001000000000 s4 0000000000000001
s5 0000000ffffe0000 s6 0000000000000040 s7 0001000000000001 s8 0001000000000000
ra: 9000000001058660 memmap_init_zone_device+0x120/0x1b0
ERA: 9000000001058044 __init_zone_device_page.constprop.0+0x4/0x1a0
CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
PRMD: 00000004 (PPLV0 +PIE -PWE)
EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
ESTAT: 00020000 [PIS] (IS= ECode=2 EsubCode=0)
BADV: ffffffffff800034
PRID: 0014d010 (Loongson-64bit, Loongson-3C6000/S)
Modules linked in: amdgpu(+) vfat fat cfg80211 rfkill 8021q garp stp mrp llc snd_hda_codec_atihdmi snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic drm_client_lib drm_ttm_helper syscopyarea ttm sysfillrect sysimgblt fb_sys_fops drm_panel_backlight_quirks video drm_exec drm_suballoc_helper amdxcp mfd_core drm_buddy gpu_sched drm_display_helper drm_kms_helper cec snd_hda_intel ipmi_ssif snd_intel_dspcfg snd_hda_codec snd_hda_core acpi_ipmi snd_hwdep snd_pcm fb loongson3_cpufreq lcd igc snd_timer ipmi_si spi_loongson_pci spi_loongson_core snd ipmi_devintf soundcore ipmi_msghandler binfmt_misc fuse drm drm_panel_orientation_quirks backlight dm_mod dax nfnetlink
Process kworker/0:3 (pid: 202, threadinfo=00000000eb7cd5d6, task=000000004ca22b1b)
Stack : 0000000000001440 0000000000000000 ffffffffff800000 0000000000000001
90000000020b5978 9000000101503b38 0000000000000001 0000000000000001
0000000000000000 90000000020b5978 90000000020b3f48 0000000000001440
0000000000000000 90000001207c58e0 90000001207c5970 9000000000575e20
90000000010e2e00 90000000020b3f48 900000000205c238 0000000000000000
00000000000001d3 90000001207c58e0 9000000001958f28 9000000120790848
90000001207b3510 0000000000000000 9000000120780000 9000000120780010
90000001207d6000 90000001207c58e0 90000001015660c8 9000000120780000
0000000000000000 90000000005763a8 90000001207c58e0 00000003ff000000
9000000120780000 ffff80000296b820 900000012078f968 90000001207c6000
...
Call Trace:
[<9000000001058044>] __init_zone_device_page.constprop.0+0x4/0x1a0
[<900000000105865c>] memmap_init_zone_device+0x11c/0x1b0
[<9000000000575e1c>] memremap_pages+0x24c/0x7b0
[<90000000005763a4>] devm_memremap_pages+0x24/0x80
[<ffff80000296b81c>] kgd2kfd_init_zone_device+0x11c/0x220 [amdgpu]
[<ffff80000265d09c>] amdgpu_device_init+0x27dc/0x2bf0 [amdgpu]
[<ffff80000265ece8>] amdgpu_driver_load_kms+0x18/0x90 [amdgpu]
[<ffff800002651fbc>] amdgpu_pci_probe+0x22c/0x890 [amdgpu]
[<9000000000916adc>] local_pci_probe+0x3c/0xb0
[<90000000002976c8>] work_for_cpu_fn+0x18/0x30
[<900000000029aeb4>] process_one_work+0x164/0x320
[<900000000029b96c>] worker_thread+0x37c/0x4a0
[<90000000002a695c>] kthread+0x12c/0x220
[<9000000001055b64>] ret_from_kernel_thread+0x24/0xc0
[<9000000000237524>] ret_from_kernel_thread_asm+0xc/0x88
Code: 00000000 00000000 0280040d <2980d08d> 02bffc0e 2980c08e 02c0208d 29c0208d 1400004f
---[ end trace 0000000000000000 ]---
Or lock up and/or driver reset during computate tasks, such as when
running llama.cpp over ROCm, at which point the compute process must be
killed before the reset could complete:
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
amdgpu 0000:0a:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1202
amdgpu 0000:0a:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
amdgpu 0000:0a:00.0: amdgpu: Failed to evict queue 3
amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
amdgpu 0000:0a:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1004
amdgpu 0000:0a:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
amdgpu 0000:0a:00.0: amdgpu: Failed to evict queue 2
amdgpu 0000:0a:00.0: amdgpu: Failed to evict queue 1
amdgpu 0000:0a:00.0: amdgpu: Failed to evict queue 0
amdgpu: Failed to quiesce KFD
amdgpu 0000:0a:00.0: amdgpu: Dumping IP State
amdgpu 0000:0a:00.0: amdgpu: Dumping IP State Completed
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
amdgpu 0000:0a:00.0: amdgpu: MES failed to respond to msg=REMOVE_QUEUE
[drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
amdgpu 0000:0a:00.0: amdgpu: MODE1 reset
amdgpu 0000:0a:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:0a:00.0: amdgpu: GPU smu mode1 reset
amdgpu 0000:0a:00.0: amdgpu: GPU reset succeeded, trying to resume
Disabling the aforementioned option makes the issue go away, though it is
unclear whether this is a platform-specific issue or one that lies within
the amdkfd code.
This patch has been tested on all the aforementioned platform
combinations, and sent as an RFC to encourage discussion.
Signed-off-by: Zhang Yuhao <xinmu@...mu.moe>
Signed-off-by: Mingcong Bai <jeffbai@...c.io>
Tested-by: Mingcong Bai <jeffbai@...c.io>
---
drivers/gpu/drm/amd/amdkfd/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig b/drivers/gpu/drm/amd/amdkfd/Kconfig
index 16e12c9913f94..5d2fa86f60bf8 100644
--- a/drivers/gpu/drm/amd/amdkfd/Kconfig
+++ b/drivers/gpu/drm/amd/amdkfd/Kconfig
@@ -14,7 +14,7 @@ config HSA_AMD
config HSA_AMD_SVM
bool "Enable HMM-based shared virtual memory manager"
- depends on HSA_AMD && DEVICE_PRIVATE
+ depends on HSA_AMD && DEVICE_PRIVATE && !LOONGARCH && !ARM64
default y
select HMM_MIRROR
select MMU_NOTIFIER
--
2.50.1
Powered by blists - more mailing lists