Message-ID: <05603d39-0aeb-493e-a1ed-8051a99dfc41@amd.com>
Date: Mon, 10 Nov 2025 16:14:04 +0100
From: Christian König <christian.koenig@....com>
To: phasta@...nel.org, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, Alex Deucher <alexander.deucher@....com>,
Andrey Grodzovsky <Andrey.Grodzovsky@....com>, dakr@...nel.org,
Matthew Brost <matthew.brost@...el.com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH] drm/sched: Fix UB in spsc_queue
On 11/10/25 15:20, Philipp Stanner wrote:
> On Mon, 2025-11-10 at 15:07 +0100, Christian König wrote:
>> On 11/10/25 13:27, Philipp Stanner wrote:
>>> Please don't top-post :(
>>> FDFY:
>>>
>>>
>>>> On 11/10/25 09:19, Philipp Stanner wrote:
>>>>> The spsc_queue is an unlocked, highly asynchronous piece of
>>>>> infrastructure. Its inline function spsc_queue_peek() obtains the head
>>>>> entry of the queue.
>>>>>
>>>>> This access is performed without READ_ONCE() and is, therefore,
>>>>> undefined behavior. In order to prevent the compiler from ever
>>>>> reordering that access, or even optimizing it away, a READ_ONCE() is
>>>>> strictly necessary. This is easily proven by the fact that
>>>>> spsc_queue_pop() uses this very pattern to access the head.
>>>>>
>>>>> Add READ_ONCE() to spsc_queue_peek().
>>>>>
>>>>> Cc: stable@...r.kernel.org # v4.16+
>>>>> Fixes: 27105db6c63a ("drm/amdgpu: Add SPSC queue to scheduler.")
>>>>> Signed-off-by: Philipp Stanner <phasta@...nel.org>
>>>>
>>>>
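For reference, the helper under discussion in include/drm/spsc_queue.h is (roughly, from memory) a one-line accessor; the patch just wraps the load:

static inline struct spsc_node *spsc_queue_peek(struct spsc_queue *queue)
{
        /* was: return queue->head; - a plain load the compiler may reorder or fold */
        return READ_ONCE(queue->head);
}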
>
> […]
>
>>>> Are illegal since you don't have the correct memory barriers any more.
>>>
>>> I can't follow. Are you saying that spsc_queue_peek() is correct
>>> because you know for sure that when it's used no one else might be
>>> changing that pointer?
>>
>> Correct, yes. That's the whole idea. I mean SPSC stands for single producer single consumer.
>>
>
> I know that it means that (although it's not documented and, funnily
> enough, I one day realized the meaning while standing under the shower
> after work).
>
> Anyways, it's not documented that a user must not call
> drm_sched_entity_push_job() in parallel. It currently basically just
> works by the coincidence that no one does that, or rather that no one
> runs into the race.
Yeah, completely agree.
>
>>>
>>> Even if that were true this design is highly fragile.
>>>
>>>>
>>>> Took me an eternity to understand that as well, so bear with me that I didn't previously explain that.
>>>
>>> s/explain/document :)
>>>
>>> As discussed few weeks ago with Sima and Tvrtko, what we really need to
>>> move to in all of DRM is this development pattern:
>>>
>>> 1. For parallel code, at first by default use a boring, horribly
>>> slow (sarcasm) spinlock. BTW I'm not even convinced that a
>>> spinlock is slower than lockless tricks. Paul's book states that
>>> a CAS atomic instruction takes about 60 cycles, and taking a lock
>>> takes 100.
>>
>> The problem isn't the burned CPU cycles, but rather the cache lines moved between CPUs.
>
> Which cache lines? The spinlock's?
>
> The queue data needs to move from one CPU to the other in either case.
> It's the same data that is being moved with spinlock protection.
>
> A spinlock doesn't lead to more cache line moves as long as there's
> still just a single consumer / producer.
Looking at a couple of examples:

1. spinlock + doubly linked list (which is what the scheduler used initially):
You have to touch 3-4 different cache lines: the lock, the previous, the current and the next element (next and prev usually sit in the same cache line as the lock).

2. kfifo (attempt #2):
3 cache lines: one for the array, one for the rptr/wptr and one for the element.
Plus the problem that you need to come up with some upper bound for its size.

3. spsc (attempt #3):
2-3 cache lines: one for the queue (head/tail), one for the element and one for the previous element (but it is quite likely that this one is prefetched).

Saying this, I just realized we could probably trivially replace the spsc with a singly linked list + pointer to the end + spinlock and get the same efficiency. We don't need all the lockless stuff for that at all. Rough sketch below.
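Something like this, purely hypothetical and untested, all names made up, struct spsc_node reused only for its ->next pointer:

#include <linux/spinlock.h>
#include <drm/spsc_queue.h>

struct locked_job_queue {
        spinlock_t lock;
        struct spsc_node *head;
        struct spsc_node **tail;        /* &head while empty, else &last->next */
};

static inline void job_queue_init(struct locked_job_queue *q)
{
        spin_lock_init(&q->lock);
        q->head = NULL;
        q->tail = &q->head;
}

static inline void job_queue_push(struct locked_job_queue *q,
                                  struct spsc_node *node)
{
        node->next = NULL;
        spin_lock(&q->lock);
        *q->tail = node;                /* append at the end */
        q->tail = &node->next;
        spin_unlock(&q->lock);
}

static inline struct spsc_node *job_queue_pop(struct locked_job_queue *q)
{
        struct spsc_node *node;

        spin_lock(&q->lock);
        node = q->head;
        if (node) {
                q->head = node->next;
                if (!q->head)
                        q->tail = &q->head;     /* queue is empty again */
        }
        spin_unlock(&q->lock);
        return node;
}

Lock, head and tail pointer share one cache line, so the footprint should be roughly the same as the spsc, just with obvious semantics.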
>>
>> Keep in mind that you can do a fused multiply-add on a full 4x4 matrix in less time than a single cache line miss costs.
>>
>>> 2. If you want to do parallelism without locks, you need to justify
>>> it really well. "rmb() so list_empty() works without a lock"
>>> doesn't qualify, but real use case speedups.
>>> 3. If you write lockless infrastructure, you need to document it
>>> really well. In particular you need to state:
>>> 1. How it works
>>> 2. What the rules are
>>>
>>> See llist.h as an example. It clearly states when you need a lock and
>>> when you don't.
>>
>> llist.h is actually pretty similar to the SPSC. I'm wondering why it doesn't have the same issues? E.g. is it xchg() that provides the memory barriers?
>
> I think that some atomic instructions contain barriers. But I'd have to
> check.
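If I remember correctly, the value-returning atomics (xchg(), cmpxchg() when it succeeds, ...) are documented as fully ordered, and llist builds both sides on exactly those. Hypothetical usage sketch (my_job, my_push, my_drain are made up; the llist API is the real one):

#include <linux/llist.h>
#include <linux/slab.h>

struct my_job {
        struct llist_node node;
        /* ... payload ... */
};

static LLIST_HEAD(job_list);

/* producer side: llist_add() is a cmpxchg() loop on the head */
static void my_push(struct my_job *job)
{
        llist_add(&job->node, &job_list);
}

/* consumer side: llist_del_all() is an xchg() of the head with NULL */
static void my_drain(void)
{
        struct llist_node *first = llist_del_all(&job_list);
        struct my_job *job, *tmp;

        /* note: entries come back newest-first */
        llist_for_each_entry_safe(job, tmp, first, node) {
                /* process and release the job */
                kfree(job);
        }
}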
>
>>
>>
>>> Or RCU. No one could use it without such good
>>> documentation.
>>>
>>> I have no idea whether spsc_queue is correctly implemented (I doubt
>>> it), and if even a highly experienced dev like you takes "an eternity"
>>> (quote) to understand it, one is really tempted to dream of spinlock_t,
>>> which has very clear semantics and is easily understood even by
>>> beginners.
>>>
>>>>
>>>> Question is what should we do?
>>>
>>> Easy!
>>>
>>> Mid-term, we should replace spsc_queue with a boring, locked, super-
>>> slow linked list ^_^
>>
>> That's what the scheduler started with, and the reason that linked list was replaced, first with a KFIFO and then with the SPSC, was lacking performance.
>
> Such performance measurements must be included in the commit message,
> since they justify the entire commit.
>
> Yet this is the entire justification given:
>
> commit 27105db6c63a571b91d01e749d026105a1e63bcf
> Author: Andrey Grodzovsky <Andrey.Grodzovsky@....com>
> Date: Thu Oct 12 16:41:39 2017 -0400
>
> drm/amdgpu: Add SPSC queue to scheduler.
>
> It is intended to sabstitute the bounded fifo we are currently
> using.
>
> Signed-off-by: Andrey Grodzovsky <Andrey.Grodzovsky@....com>
> Reviewed-by: Christian König <christian.koenig@....com>
> Signed-off-by: Alex Deucher <alexander.deucher@....com>
>
>
>>
>> We could go back to the KFIFO design again, but a (double?) linked list is clearly going to show the same performance problems which originally triggered moving away from it.
>>
>>>
>>> The spsc_queue was designed and – perhaps – perf-measured when RR was
>>> the only scheduling policy.
>>>
>>> Since the FIFO rework, where FIFO became the default policy, we now
>>> access our super efficient super fast lockless queue most of the time
>>> with the spinlock being taken immediately afterwards anyways. So we
>>> almost have a locked lockless queue now.
>>>
>>> https://elixir.bootlin.com/linux/v6.18-rc4/source/drivers/gpu/drm/scheduler/sched_entity.c#L502
>>
>> That is not that relevant.
>
> If the lock being there is not relevant, then why can't we just take it
> and get rid of all those barriers and READ_ONCE's once and for all?
I think so, yes.
>>
>>> Only push_job() often (?) still runs locklessly, but I'm not at all
>>> convinced that this is worth it. Not even performance-wise.
>>
>> That is what is relevant. And yes, the difference was totally measurable; that's why this was originally implemented.
>>
>>>
>>> If you look at spsc_queue_push() you see that it
>>> 1. disables preemption,
>>> 2. uses atomic instructions and
>>> 3. has a ton of memory barriers
>>>
>>> – in other words, this is almost literally a manual re-implementation
>>> of a spinlock, just without mutual exclusion…
>>
>> The problem is really to separate the push from the pop side so that as few cache lines as possible are transferred from one CPU to another.
>
> With a doubly linked list you can attach at the front and pull from the
> tail. How is that transferring many cache lines?
See above.
We have some tests for old and trivial use cases (e.g. GLmark2) which by today's standards pretty much only depend on how fast you can push things to the HW.
We could just extend the scheduler test cases to see how many submissions per second we can pump through a dummy implementation where both producer and consumer are nailed to separate CPUs.
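Hypothetical sketch of such a micro-benchmark (producer_fn, consumer_fn and spsc_bench are made-up names, not actual scheduler test code; spsc_queue_push()/spsc_queue_pop() are the helpers from drm/spsc_queue.h; a real test would recycle a fixed pool of nodes so the allocator stays out of the measurement, and would add proper error handling and cleanup):

#include <drm/spsc_queue.h>
#include <linux/atomic.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/slab.h>

static atomic64_t popped;

static int producer_fn(void *data)
{
        struct spsc_queue *q = data;

        while (!kthread_should_stop()) {
                struct spsc_node *node = kzalloc(sizeof(*node), GFP_KERNEL);

                if (!node)
                        break;
                spsc_queue_push(q, node);
        }
        return 0;
}

static int consumer_fn(void *data)
{
        struct spsc_queue *q = data;

        while (!kthread_should_stop()) {
                struct spsc_node *node = spsc_queue_pop(q);

                if (!node) {
                        cpu_relax();
                        continue;
                }
                kfree(node);
                atomic64_inc(&popped);
        }
        return 0;
}

static void spsc_bench(struct spsc_queue *q)
{
        struct task_struct *prod, *cons;

        /* error checking omitted for brevity */
        prod = kthread_create(producer_fn, q, "spsc-prod");
        cons = kthread_create(consumer_fn, q, "spsc-cons");
        kthread_bind(prod, 0);          /* nail producer to CPU 0 */
        kthread_bind(cons, 1);          /* nail consumer to CPU 1 */
        wake_up_process(prod);
        wake_up_process(cons);

        msleep(1000);                   /* measure for one second */

        kthread_stop(prod);
        kthread_stop(cons);
        pr_info("spsc bench: %lld pops/s\n", atomic64_read(&popped));
}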
Regards,
Christian.
>
>
> P.