linux-kernel - Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <942c02f2-6496-4406-a73b-941d096aadfb@amd.com>
Date: Tue, 7 Jan 2025 15:08:45 +0100
From: Christian König <christian.koenig@....com>
To: Philipp Reisner <philipp.reisner@...bit.com>,
 dri-devel@...ts.freedesktop.org
Cc: linux-kernel@...r.kernel.org, Nirmoy Das <nirmoy.das@....com>,
 Simona Vetter <simona@...ll.ch>
Subject: Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

Am 07.01.25 um 15:02 schrieb Philipp Reisner:
> The following OOPS plagues me on about every 10th suspend and resume:
>
> [160640.791304] BUG: kernel NULL pointer dereference, address: 0000000000000008
> [160640.791309] #PF: supervisor read access in kernel mode
> [160640.791311] #PF: error_code(0x0000) - not-present page
> [160640.791313] PGD 0 P4D 0
> [160640.791316] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> [160640.791320] CPU: 12 UID: 1001 PID: 648526 Comm: kscreenloc:cs0 Tainted: G           OE      6.11.7-300.fc41.x86_64 #1
> [160640.791324] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> [160640.791325] Hardware name: Micro-Star International Co., Ltd. MS-7A38/B450M PRO-VDH MAX (MS-7A38), BIOS B.B0 02/03/2021
> [160640.791327] RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
> [160640.791337] Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 53 48 8b 6f 60 48 85 ed 74 3f 48 89 fb 48 89 ef e8 31 39 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 5c b8 01 00 00 00 f0 48 0f
> [160640.791340] RSP: 0018:ffffb2ef5e6cb9b8 EFLAGS: 00010206
> [160640.791342] RAX: 0000000000000000 RBX: ffff9d804cc62800 RCX: ffff9d784020f0d0
> [160640.791344] RDX: 0000000000000000 RSI: ffff9d784d3b9cd0 RDI: ffff9d784020f638
> [160640.791345] RBP: ffff9d784020f610 R08: ffff9d78414e4268 R09: 2072656c75646568
> [160640.791346] R10: 686373205d6d7264 R11: 632072656c756465 R12: 0000000000000000
> [160640.791347] R13: 0000000000000001 R14: ffffb2ef5e6cba38 R15: 0000000000000000
> [160640.791349] FS:  00007f8f30aca6c0(0000) GS:ffff9d873ec00000(0000) knlGS:0000000000000000
> [160640.791351] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [160640.791352] CR2: 0000000000000008 CR3: 000000069de82000 CR4: 0000000000350ef0
> [160640.791354] Call Trace:
> [160640.791357]  <TASK>
> [160640.791360]  ? __die_body.cold+0x19/0x27
> [160640.791367]  ? page_fault_oops+0x15a/0x2f0
> [160640.791372]  ? exc_page_fault+0x7e/0x180
> [160640.791376]  ? asm_exc_page_fault+0x26/0x30
> [160640.791380]  ? drm_sched_job_arm+0x23/0x60 [gpu_sched]
> [160640.791384]  ? drm_sched_job_arm+0x1f/0x60 [gpu_sched]
> [160640.791390]  amdgpu_cs_ioctl+0x170c/0x1e40 [amdgpu]
> [160640.792011]  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> [160640.792341]  drm_ioctl_kernel+0xb0/0x100
> [160640.792346]  drm_ioctl+0x28b/0x540
> [160640.792349]  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> [160640.792673]  amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
> [160640.792994]  __x64_sys_ioctl+0x94/0xd0
> [160640.792999]  do_syscall_64+0x82/0x160
> [160640.793006]  ? __count_memcg_events+0x75/0x130
> [160640.793009]  ? count_memcg_events.constprop.0+0x1a/0x30
> [160640.793014]  ? handle_mm_fault+0x21b/0x330
> [160640.793016]  ? do_user_addr_fault+0x55a/0x7b0
> [160640.793020]  ? exc_page_fault+0x7e/0x180
> [160640.793023]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The OOPS happens because the rq member of entity is NULL in
> drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
>
> In drm_sched_entity_select_rq(), the code considers that
> drb_sched_pick_best() might return a NULL value. When NULL, it assigns
> NULL to entity->rq even if it had a non-NULL value before.
>
> drm_sched_job_arm() does not deal with entities having a rq of NULL.
>
> Fix this by leaving the entity on the engine it was instead of
> assigning a NULL to its run queue member.

Well that is clearly not the correct approach to fixing this. So clearly 
a NAK from my side.

The real question is why is amdgpu_cs_ioctl() called when all of 
userspace should be frozen?

Regards,
Christian.

>
> Link: https://retrace.fedoraproject.org/faf/reports/1038619/
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3746
> Signed-off-by: Philipp Reisner <philipp.reisner@...bit.com>
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c | 10 ++++++----
>   1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index a75eede8bf8d..495bc087588b 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -557,10 +557,12 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>   
>   	spin_lock(&entity->rq_lock);
>   	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
> -	rq = sched ? sched->sched_rq[entity->priority] : NULL;
> -	if (rq != entity->rq) {
> -		drm_sched_rq_remove_entity(entity->rq, entity);
> -		entity->rq = rq;
> +	if (sched) {
> +		rq = sched->sched_rq[entity->priority];
> +		if (rq != entity->rq) {
> +			drm_sched_rq_remove_entity(entity->rq, entity);
> +			entity->rq = rq;
> +		}
>   	}
>   	spin_unlock(&entity->rq_lock);
>