Message-ID: <589a1be140f3c8623a2647b107a1130289eb00ba.camel@mailbox.org>
Date: Mon, 10 Nov 2025 16:55:01 +0100
From: Philipp Stanner <phasta@...lbox.org>
To: Christian König <christian.koenig@....com>, 
 phasta@...nel.org, David Airlie <airlied@...il.com>, Simona Vetter
 <simona@...ll.ch>, Alex Deucher <alexander.deucher@....com>, Andrey
 Grodzovsky <Andrey.Grodzovsky@....com>, dakr@...nel.org, Matthew Brost
 <matthew.brost@...el.com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org, 
	stable@...r.kernel.org
Subject: Re: [PATCH] drm/sched: Fix UB in spsc_queue

On Mon, 2025-11-10 at 16:14 +0100, Christian König wrote:
> On 11/10/25 15:20, Philipp Stanner wrote:
> > On Mon, 2025-11-10 at 15:07 +0100, Christian König wrote:
> > > On 11/10/25 13:27, Philipp Stanner wrote:
> > > The problem isn't the burned CPU cycles, but rather the cache lines moved between CPUs.
> > 
> > Which cache lines? The spinlock's?
> > 
> > The queue data needs to move from one CPU to the other in either case.
> > It's the same data that is being moved with spinlock protection.
> > 
> > A spinlock doesn't lead to more cache line moves as long as there's
> > still just a single consumer / producer.
> 
> Looking at a couple of examples:
> 
> 1. spinlock + double linked list (which is what the scheduler used initially).
> 
>    You have to touch 3-4 different cache lines: the lock, the previous, the current and the next element (next and prev are usually on the same cache line as the lock).

The doubly linked list when pushing:

Lock + head (same cache line) + head->next + head->next->next

and when popping:

Lock + head + head->previous + head->previous->previous

I don't see why you need a "current" element when you're always only
touching head or tail.
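
To make sure we're counting the same things, here's roughly what I have
in mind; just an untested sketch with made-up names, pushing at the
front and popping at the tail:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/stddef.h>

struct job {
	struct list_head node;
};

struct job_queue {
	spinlock_t lock;	/* most likely shares a cache line with head */
	struct list_head head;
};

static void job_queue_init(struct job_queue *q)
{
	spin_lock_init(&q->lock);
	INIT_LIST_HEAD(&q->head);
}

/* Producer side: writes the lock, the list head, the old first element
 * and the new element. */
static void job_queue_push(struct job_queue *q, struct job *job)
{
	spin_lock(&q->lock);
	list_add(&job->node, &q->head);
	spin_unlock(&q->lock);
}

/* Consumer side: writes the lock, the list head, the tail element and
 * the element in front of the tail. */
static struct job *job_queue_pop(struct job_queue *q)
{
	struct job *job = NULL;

	spin_lock(&q->lock);
	if (!list_empty(&q->head)) {
		job = list_last_entry(&q->head, struct job, node);
		list_del(&job->node);
	}
	spin_unlock(&q->lock);

	return job;
}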


> 
> 2. kfifo (attempt #2):
> 
>    3 cache lines, one for the array, one for the rptr/wptr and one for the element.
>    Plus the problem that you need to come up with some upper bound for it.
> 
> 3. spsc (attempt #3)
> 
>    2-3 cache lines, one for the queue (head/tail), one for the element and one for the previous element (but it is quite likely that this is pre-fetched).
> 
> Saying this, I just realized we could potentially trivially replace the spsc with a singly linked list + pointer to the end + spinlock and have the same efficiency. We don't need all the lockless stuff for that at all.
> 
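
Side note on the kfifo variant above: the capacity indeed has to be
baked in when the fifo is defined, and it has to be a power of two.
Roughly, with made-up names:

#include <linux/kfifo.h>
#include <linux/types.h>

struct job;

/* Fixed, power-of-two capacity chosen at definition time. */
static DEFINE_KFIFO(job_fifo, struct job *, 64);

static bool push_job(struct job *job)
{
	return kfifo_put(&job_fifo, job);	/* 0 (false) when full */
}

static struct job *pop_job(void)
{
	struct job *job;

	return kfifo_get(&job_fifo, &job) ? job : NULL;
}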

Now we're speaking mostly the same language :]

If you could RB my DRM TODO patches, we'd have a section for drm/sched,
and there we could then soonish add an item for getting rid of the spsc.

https://lore.kernel.org/dri-devel/20251107135701.244659-2-phasta@kernel.org/
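
FWIW, I'd picture such a replacement roughly like the following; again
just an untested sketch with made-up names. Push and pop then basically
touch the queue struct (lock + head + tail, probably a single cache
line), the element itself, and on push additionally the old tail's next
pointer:

#include <linux/spinlock.h>
#include <linux/stddef.h>

struct job {
	struct job *next;
};

struct job_queue {
	spinlock_t lock;
	struct job *head;	/* consumer pops here */
	struct job *tail;	/* producer appends here */
};

static void job_queue_init(struct job_queue *q)
{
	spin_lock_init(&q->lock);
	q->head = NULL;
	q->tail = NULL;
}

static void job_queue_push(struct job_queue *q, struct job *job)
{
	job->next = NULL;

	spin_lock(&q->lock);
	if (q->tail)
		q->tail->next = job;
	else
		q->head = job;
	q->tail = job;
	spin_unlock(&q->lock);
}

static struct job *job_queue_pop(struct job_queue *q)
{
	struct job *job;

	spin_lock(&q->lock);
	job = q->head;
	if (job) {
		q->head = job->next;
		if (!q->head)
			q->tail = NULL;
	}
	spin_unlock(&q->lock);

	return job;
}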

> > > 

[…]

> > > The problem is really to separate the push from the pop side so that as few cache lines as possible are transferred from one CPU to another. 
> > 
> > With a doubly linked list you can attach at the front and pull from the
> > tail. How is that transferring many cache lines?
> 
> See above.
> 
> We have some tests for old and trivial use cases (e.g. GLmark2) which by today's standards pretty much only depend on how fast you can push things to the HW.
> 
> We could just extend the scheduler test cases to see how many submissions per second we can pump through a dummy implementation where both producer and consumer are nailed to separate CPUs.
> 

I disagree. That would be a microbenchmark for a very narrow use case.
It would only tell us that a specific patch slows things down for the
microbenchmark, and we could only detect that if a developer runs the
unit tests with and without his patches.

The few major reworks that touch such essentials have good realistic
tests anyway; see Tvrtko's CFS series.


Lockless magic should always be justified by real-world use cases.

By the way, back when spsc_queue was implemented, how large were the
real-world performance gains you measured by saving that one cache line?


P.
