linux-cve-announce - CVE-2023-53351: drm/sched: Check scheduler work queue before calling timeout handling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [day] [month] [year] [list]

Message-ID: <2025091720-CVE-2023-53351-7b67@gregkh>
Date: Wed, 17 Sep 2025 16:56:50 +0200
From: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
To: linux-cve-announce@...r.kernel.org
Cc: Greg Kroah-Hartman <gregkh@...nel.org>
Subject: CVE-2023-53351: drm/sched: Check scheduler work queue before calling timeout handling

From: Greg Kroah-Hartman <gregkh@...nel.org>

Description
===========

In the Linux kernel, the following vulnerability has been resolved:

drm/sched: Check scheduler work queue before calling timeout handling

During an IGT GPU reset test we see again oops despite of
commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
timeout handling).

It uses ready condition whether to call drm_sched_fault which unwind
the TDR leads to GPU reset.
However it looks the ready condition is overloaded with other meanings,
for example, for the following stack is related GPU reset :

0  gfx_v9_0_cp_gfx_start
1  gfx_v9_0_cp_gfx_resume
2  gfx_v9_0_cp_resume
3  gfx_v9_0_hw_init
4  gfx_v9_0_resume
5  amdgpu_device_ip_resume_phase2

does the following:
	/* start the ring */
	gfx_v9_0_cp_gfx_start(adev);
	ring->sched.ready = true;

The same approach is for other ASICs as well :
gfx_v8_0_cp_gfx_resume
gfx_v10_0_kiq_resume, etc...

As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
and then drm_sched_fault. However now it depends on whether the interrupt service routine
drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
field of the scheduler to true even  for uninitialized schedulers and causes oops vs
no fault or when ISR  drm_sched_fault is completed prior  gfx_v9_0_cp_gfx_start and
NULL pointer dereference does not occur.

Use the field timeout_wq  to prevent oops for uninitialized schedulers.
The field could be initialized by the work queue of resetting the domain.

v1: Corrections to commit message (Luben)

The Linux kernel CVE team has assigned CVE-2023-53351 to this issue.

Affected and fixed versions
===========================

	Issue introduced in 6.3 with commit 11b3b9f461c5c4f700f6c8da202fcc2fd6418e1f and fixed in 6.3.4 with commit c43a96fc00b662cef1ef0eb22d40441ce2abae8f
	Issue introduced in 6.3 with commit 11b3b9f461c5c4f700f6c8da202fcc2fd6418e1f and fixed in 6.4 with commit 2da5bffe9eaa5819a868e8eaaa11b3fd0f16a691

Please see https://www.kernel.org for a full list of currently supported
kernel versions by the kernel community.

Unaffected versions might change over time as fixes are backported to
older supported kernel versions.  The official CVE entry at
	https://cve.org/CVERecord/?id=CVE-2023-53351
will be updated if fixes are backported, please check that for the most
up to date information about this issue.

Affected files
==============

The file(s) affected by this issue are:
	drivers/gpu/drm/scheduler/sched_main.c

Mitigation
==========

The Linux kernel CVE team recommends that you update to the latest
stable kernel version for this, and many other bugfixes.  Individual
changes are never tested alone, but rather are part of a larger kernel
release.  Cherry-picking individual commits is not recommended or
supported by the Linux kernel community at all.  If however, updating to
the latest release is impossible, the individual changes to resolve this
issue can be found at these commits:
	https://git.kernel.org/stable/c/c43a96fc00b662cef1ef0eb22d40441ce2abae8f
	https://git.kernel.org/stable/c/2da5bffe9eaa5819a868e8eaaa11b3fd0f16a691