Message-ID: <987527ead1fe93877139a9ee8b6d2ee55eefa1ee.camel@mailbox.org>
Date: Mon, 10 Nov 2025 15:20:31 +0100
From: Philipp Stanner <phasta@...lbox.org>
To: Christian König <christian.koenig@....com>, 
 phasta@...nel.org, David Airlie <airlied@...il.com>, Simona Vetter
 <simona@...ll.ch>, Alex Deucher <alexander.deucher@....com>, Andrey
 Grodzovsky <Andrey.Grodzovsky@....com>, dakr@...nel.org, Matthew Brost
 <matthew.brost@...el.com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org, 
	stable@...r.kernel.org
Subject: Re: [PATCH] drm/sched: Fix UB in spsc_queue

On Mon, 2025-11-10 at 15:07 +0100, Christian König wrote:
> On 11/10/25 13:27, Philipp Stanner wrote:
> > Please don't top-post :(
> > FDFY:
> > 
> > 
> > > On 11/10/25 09:19, Philipp Stanner wrote:
> > > > The spsc_queue is an unlocked, highly asynchronous piece of
> > > > infrastructure. Its inline function spsc_queue_peek() obtains the head
> > > > entry of the queue.
> > > > 
> > > > This access is performed without READ_ONCE() and is, therefore,
> > > > undefined behavior. In order to prevent the compiler from ever
> > > > reordering that access, or even optimizing it away, a READ_ONCE() is
> > > > strictly necessary. This is easily proven by the fact that
> > > > spsc_queue_pop() uses this very pattern to access the head.
> > > > 
> > > > Add READ_ONCE() to spsc_queue_peek().
> > > > 
> > > > Cc: stable@...r.kernel.org # v4.16+
> > > > Fixes: 27105db6c63a ("drm/amdgpu: Add SPSC queue to scheduler.")
> > > > Signed-off-by: Philipp Stanner <phasta@...nel.org>
> > > 
> > > 
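For readers without the tree open: spsc_queue_peek() is a one-liner.
Roughly – sketched from memory, so the exact declaration in
include/drm/spsc_queue.h may differ – the change under discussion
turns

	/* Before: plain load of a pointer the consumer side may be
	 * rewriting concurrently. */
	static inline struct spsc_node *spsc_queue_peek(struct spsc_queue *queue)
	{
		return queue->head;
	}

into

	/* After: READ_ONCE() forces a single, non-torn, non-reordered
	 * load – the same pattern spsc_queue_pop() already uses. */
	static inline struct spsc_node *spsc_queue_peek(struct spsc_queue *queue)
	{
		return READ_ONCE(queue->head);
	}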

[…]

> > > Are illegal since you don't have the correct memory barriers any more.
> > 
> > I can't follow. Are you saying that spsc_queue_peek() is correct
> > because you know for sure that when it's used no one else might be
> > changing that pointer?
> 
> Correct, yes. That's the whole idea. I mean SPSC stands for single producer single consumer.
> 

I know that's what it means (although it's not documented and, funnily
enough, I only realized the meaning one day while standing under the
shower after work).

Anyways, it's not documented that a user must not call
drm_sched_entity_push_job() in parallel. Currently it basically works
by coincidence: no one does that, or rather, no one has run into the
race yet.
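If we keep the lockless design, the bare minimum would be spelling
that contract out. Something along these lines (hypothetical wording,
just to illustrate what is missing today, not a patch):

	/*
	 * drm_sched_entity_push_job():
	 *
	 * The entity's job queue is an spsc_queue and assumes a single
	 * producer. Callers must not invoke this function concurrently
	 * for the same entity; they have to provide their own
	 * serialization.
	 */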

> > 
> > Even if that were true this design is highly fragile.
> > 
> > > 
> > > Took me an eternity to understand that as well, so bear with me that I didn't previously explain that.
> > 
> > s/explain/document :)
> > 
> > As discussed few weeks ago with Sima and Tvrtko, what we really need to
> > move to in all of DRM is this development pattern:
> > 
> >    1. For parallel code, at first by default use a boring, horribly
> >       slow (sarcasm) spinlock. BTW I'm not even convinced that a
> >       spinlock is slower than lockless tricks. Paul's book states that
> >       a CAS atomic instruction takes about 60 cycles, and taking a lock
> >       takes 100.
> 
> The problem isn't the burned CPU cycles, but rather the cache lines moved between CPUs.

Which cache lines? The spinlock's?

The queue data needs to move from one CPU to the other in either case.
It's the same data that is being moved with spinlock protection.

A spinlock doesn't lead to more cache line moves as long as there's
still just a single consumer / producer.

> 
> Keep in mind that you can do a fused multiply-add for a full 4x4 matrix before you take a single cache line miss.
> 
> >    2. If you want to do parallelism without locks, you need to justify
> >       it really well. "rmb() so list_empty() works without a lock"
> >       doesn't qualify, but real use case speedups.
> >    3. If you write lockless infrastructure, you need to document it
> >       really well. In particular you need to state:
> >          1. How it works
> >          2. What the rules are
> > 
> > See llist.h as an example. It clearly states when you need a lock and
> > when you don't.
> 
> llist.h is actually pretty similar to the SPSC. I'm wondering why it doesn't have the same issues. E.g., is xchg() providing the memory barriers?

I think that some atomic instructions imply barriers – IIRC
Documentation/atomic_t.txt says that value-returning RMW atomics such
as xchg() are fully ordered, i.e. they act like an smp_mb() before and
after the access. But I'd have to check.
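A sketch of why that matters for a push (made-up names, not the actual
llist.h or spsc_queue code, just the pattern):

	struct ex_node {
		struct ex_node *next;
	};

	struct ex_queue {
		struct ex_node *head;
		struct ex_node **tail;	/* last ->next, or &head when empty */
	};

	static inline void ex_queue_push(struct ex_queue *q, struct ex_node *node)
	{
		struct ex_node **old_tail;

		node->next = NULL;
		/* xchg() is fully ordered: the initialization of node
		 * above cannot be reordered past the tail swap, and the
		 * swap cannot be reordered past the publication below.
		 * No explicit smp_wmb() needed. */
		old_tail = xchg(&q->tail, &node->next);
		WRITE_ONCE(*old_tail, node);
	}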

> 
> 
> > Or RCU. No one could use it without such good
> > documentation.
> > 
> > I have no idea whether spsc_queue is correctly implemented (I doubt
> > it), and if even a highly experienced dev like you takes "an eternity"
> > (quote) to understand it, one is really tempted to dream of spinlock_t,
> > which has very clear semantics and is easily understood even by
> > beginners.
> > 
> > > 
> > > Question is what should we do?
> > 
> > Easy!
> > 
> > Mid-term, we should replace spsc_queue with a boring, locked, super-
> > slow linked list ^_^
> 
> That's what the scheduler started with, and the reason why that linked list was replaced, first with a KFIFO and then with the SPSC, was lacking performance.

Such performance measurements must be included in the commit message,
since they justify the entire commit.

Yet this is the entire justification given:

commit 27105db6c63a571b91d01e749d026105a1e63bcf
Author: Andrey Grodzovsky <Andrey.Grodzovsky@....com>
Date:   Thu Oct 12 16:41:39 2017 -0400

    drm/amdgpu: Add SPSC queue to scheduler.
    
    It is intended to sabstitute the bounded fifo we are currently
    using.
    
    Signed-off-by: Andrey Grodzovsky <Andrey.Grodzovsky@....com>
    Reviewed-by: Christian König <christian.koenig@....com>
    Signed-off-by: Alex Deucher <alexander.deucher@....com>


> 
> We could go back to the KFIFO design again, but a (double?) linked list is clearly going to show the same performance problems which originally triggered moving away from it.
> 
> > 
> > The spsc_queue was designed and – perhaps – perf-measured when RR was
> > the only scheduling policy.
> > 
> > Since the FIFO rework, where FIFO became the default policy, we now
> > access our super efficient super fast lockless queue most of the time
> > with the spinlock being taken immediately afterwards anyways. So we
> > almost have a locked lockless queue now.
> > 
> > https://elixir.bootlin.com/linux/v6.18-rc4/source/drivers/gpu/drm/scheduler/sched_entity.c#L502
> 
> That is not that relevant.

If the lock being there is not relevant, then why can't we just take
it and get rid of all those barriers and READ_ONCE()s once and for
all?

> 
> > Only push_job() often (?) still runs locklessly, but I'm not at all
> > convinced that this is worth it. Not even performance-wise.
> 
> That is what is relevant. And yes, the difference was totally measurable; that's why this was originally implemented.
> 
> > 
> > If you look at spsc_queue_push() you see that it
> >    1. disables preemption,
> >    2. uses atomic instructions and
> > >    3. has a ton of memory barriers
> > 
> > – in other words, this is almost literally a manual re-implementation
> > of a spinlock, just without mutual exclusion…
> 
> The problem is really to separate the push from the pop side so that as few cache lines as possible are transferred from one CPU to another. 

With a doubly linked list you can attach at the front and pull from the
tail. How is that transferring many cache lines?
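For comparison, the boring variant would be something like this
(sketch, names made up; note that push and pop touch opposite ends of
the list, so apart from the lock word it's the same nodes moving
between CPUs as with the spsc_queue):

	struct locked_queue {
		spinlock_t lock;
		struct list_head items;
	};

	static void locked_queue_push(struct locked_queue *q, struct list_head *item)
	{
		spin_lock(&q->lock);
		list_add(item, &q->items);	/* attach at the front */
		spin_unlock(&q->lock);
	}

	static struct list_head *locked_queue_pop(struct locked_queue *q)
	{
		struct list_head *item = NULL;

		spin_lock(&q->lock);
		if (!list_empty(&q->items)) {
			item = q->items.prev;	/* pull from the tail */
			list_del(item);
		}
		spin_unlock(&q->lock);

		return item;
	}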


P.
