linux-kernel - Re: [PATCH] drm/sched: Fix UB in spsc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <a3eefc87-2678-4a4d-82c8-f6aedf74be75@amd.com>
Date: Mon, 10 Nov 2025 17:08:11 +0100
From: Christian König <christian.koenig@....com>
To: phasta@...nel.org, David Airlie <airlied@...il.com>,
 Simona Vetter <simona@...ll.ch>, Alex Deucher <alexander.deucher@....com>,
 Andrey Grodzovsky <Andrey.Grodzovsky@....com>, dakr@...nel.org,
 Matthew Brost <matthew.brost@...el.com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
 stable@...r.kernel.org
Subject: Re: [PATCH] drm/sched: Fix UB in spsc_queue

On 11/10/25 16:55, Philipp Stanner wrote:
> On Mon, 2025-11-10 at 16:14 +0100, Christian König wrote:
>> On 11/10/25 15:20, Philipp Stanner wrote:
>>> On Mon, 2025-11-10 at 15:07 +0100, Christian König wrote:
>>>> On 11/10/25 13:27, Philipp Stanner wrote:
>>>> The problem isn't the burned CPU cycles, but rather the cache lines moved between CPUs.
>>>
>>> Which cache lines? The spinlock's?
>>>
>>> The queue data needs to move from one CPU to the other in either case.
>>> It's the same data that is being moved with spinlock protection.
>>>
>>> A spinlock doesn't lead to more cache line moves as long as there's
>>> still just a single consumer / producer.
>>
>> Looking at a couple of examples:
>>
>> 1. spinlock + double linked list (which is what the scheduler used initially).
>>
>>    You have to touch 3-4 different cache lines, the lock, the previous, the current and the next element (next and prev are usually the same with the lock).
> 
> list when pushing:
> 
> Lock + head (same cache line) + head->next
> head->next->next
> 
> when popping:
> 
> Lock + head + head->previous
> head->previous->previous
> 
> I don't see why you need a "current" element when you're always only
> touching head or tail.

The current element is the one you insert or remove.

>>
>> 2. kfifo (attempt #2):
>>
>>    3 cache lines, one for the array, one for the rptr/wptr and one for the element.
>>    Plus the problem that you need to come up with some upper bound for it.
>>
>> 3. spsc (attempt #3)
>>
>>    2-3 cache lines, one for the queue (head/tail), one for the element and one for the previous element (but it is quite likely that this is pre-fetched).
>>
>> Saying this I just realized we could potentially trivially replace the spsc with an single linked list+pointer to the end+spinlock and have the same efficiency. We don't need all the lockless stuff for that at all.
>>
> 
> Now we're speaking mostly the same language :]
> 
> If you could RB my DRM TODO patches we'd have a section for drm/sched,
> and there we could then soonish add an item for getting rid of spsc.
> 
> https://lore.kernel.org/dri-devel/20251107135701.244659-2-phasta@kernel.org/

I can't find that in my inbox anywhere. Can you send it out one more with my AMD mail address on explicit CC? Thanks in advance.

>>>> The problem is really to separate the push from the pop side so that as few cache lines as possible are transferred from one CPU to another. 
>>>
>>> With a doubly linked list you can attach at the front and pull from the
>>> tail. How is that transferring many cache lines?
>>
>> See above.
>>
>> We have some tests for old and trivial use cases (e.g. GLmark2) which on todays standards pretty much only depend on how fast you can push things to the HW.
>>
>> We could just extend the scheduler test cases to see how many submissions per second we can pump through a dummy implementation where both producer and consumer are nailed to separate CPUs.
>>
> 
> I disagree. That would be a microbenchmark for a very narrow use case.

That is actually a rather common use case (unfortunately).

> It would only tell us that a specific patch slows things done for the
> microbenchmark, and we could only detect that if a developer runs the
> unit tests with and without his patches.

I could trigger adding that to AMDs CI systems.

> 
> The few major reworks that touch such essentials have good realistic
> tests anyways, see Tvrtko's CFS series.
> 
> 
> Lockless magic should always be justified by real world use cases.
> 
> By the way, back when spsc_queue was implemented, how large were the
> real world performance gains you meassured by saving that 1 cache line?

That was actually quite a bit. If you want a real world test case use glMark2 on any modern HW.

And yeah I know how ridicules that is, the problem is that we still have people using this as indicator for the command submission overhead.

Regards,
Christian.

> 
> 
> P.