Message-ID: <24173faf-c2f0-4d08-93db-587891dc8b5d@igalia.com>
Date: Fri, 16 May 2025 12:35:52 +0100
From: Tvrtko Ursulin <tvrtko.ursulin@...lia.com>
To: Danilo Krummrich <dakr@...nel.org>
Cc: Philipp Stanner <phasta@...nel.org>, Lyude Paul <lyude@...hat.com>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 Matthew Brost <matthew.brost@...el.com>,
 Christian König <ckoenig.leichtzumerken@...il.com>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
 Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
 dri-devel@...ts.freedesktop.org, nouveau@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org, Danilo Krummrich <dakr@...hat.com>
Subject: Re: [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking
 too long


On 16/05/2025 11:54, Danilo Krummrich wrote:
> On Fri, May 16, 2025 at 11:19:50AM +0100, Tvrtko Ursulin wrote:
>>
>> On 16/05/2025 10:53, Danilo Krummrich wrote:
>>> On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
>>>> On 24/04/2025 10:55, Philipp Stanner wrote:
>>>>> +	 * @kill_fence_context: kill the fence context belonging to this scheduler
>>>>
>>>> Which fence context would that be? ;)
>>>
>>> There's one per ring and a scheduler instance represents a single ring. So,
>>> what should be specified here?
>>
>> I was pointing out the fact that not all drivers have a 1:1 sched:entity
>> relationship.
> 
> I'm well aware, but how is that relevant? Entities don't have an associated
> fence context, but a GPU ring (either hardware or software) does, and that
> is what a scheduler instance represents.

Aha! Well.. as to how it is relevant, and whether entities have an 
associated fence context: entity->fence_context was my first association 
this whole time. It never crossed my mind this was talking about the 
hardware fence context. Proof in the pudding that the naming should be 
improved.

But I also don't think there is a requirement for the fences returned 
from ->run_job() to have a single context. Which again makes it not the 
best name.
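
To illustrate both points (a rough sketch - entity->fence_context, 
dma_fence_context_alloc() and dma_fence_init() are from the existing 
code, the my_*/ring bits are made up):

	/* The software context I had in mind, allocated per entity: */
	entity->fence_context = dma_fence_context_alloc(2);

	/*
	 * Versus the hardware fence context the kerneldoc apparently means,
	 * i.e. the context of the fences a driver returns from ->run_job().
	 * Nothing stops these from spanning more than one context:
	 */
	struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
	{
		struct my_job *job = to_my_job(sched_job);
		struct my_fence *f = my_ring_emit(job); /* made up helper */

		dma_fence_init(&f->base, &my_fence_ops, &f->lock,
			       job->ring->fence_context, ++job->ring->seqno);
		return &f->base;
	}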

>> Thought it would be obvious from the ";)".
> 
> I should read from the ";)" that you're referring to a 1:N sched:entity
> relationship (which doesn't seem to be related)?
> 
>>>> Also, "fence context" would be a new terminology in gpu_scheduler.h API
>>>> level. You could call it ->sched_fini() or similar to signify at which point
>>>> in the API it gets called and then the fact it takes sched as parameter
>>>> would be natural.
>>>
>>> The driver should tear down the fence context in this callback, not the while
>>> scheduler. ->sched_fini() would hence be misleading.
>>
>> Not the while what? Not while drm_sched_fini()?
> 
> *whole
> 
>> Could call it sched_kill()
>> or anything. My point is that we don't have "fence context" in the API,
>> but entities, so adding a new term sounds sub-optimal.
> 
> In the callback the driver should neither tear down an entity, nor the whole
> scheduler, hence we shouldn't call it like that. sched_kill() is therefore
> misleading as well.

  ->sched_exit()? ->sched_stop()? ->sched_cleanup()?

> It should be named after what it actually does (or should do). Feel free to
> propose a different name that conforms with that.
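
For reference, the vfunc we are naming looks roughly like this (signature 
guessed from the context - only the kerneldoc line is quoted above):

	struct drm_sched_backend_ops {
		...
		/**
		 * @kill_fence_context: kill the fence context belonging to
		 * this scheduler
		 */
		void (*kill_fence_context)(struct drm_gpu_scheduler *sched);
	};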
> 
>>>> We also probably want some commentary on the topic of blocking a thread's
>>>> exit / SIGINT/TERM/KILL handling indefinitely (or at least for a very
>>>> long time).
>>>
>>> You mean in case the driver does implement the callback, but does *not*
>>> properly tear down the fence context? So you're asking us to describe the
>>> potential consequences of drivers having bugs in their implementation of
>>> the callback? Or something else?
>>
>> I was proposing that the kerneldoc for the vfunc should document that the
>> callback must not block, or, if blocking is unavoidable, give a guideline
>> on how long is acceptable. Maybe even enforce a limit in the scheduler
>> core itself.
> 
> Killing the fence context shouldn't block.

Cool. And maybe convert the wait_event() to wait_event_timeout() with a 
warning, to be robust.
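
Something along these lines (a sketch only - the waitqueue name and the 
condition are made up, not taken from the actual patch):

	if (!wait_event_timeout(sched->teardown_waitque,
				list_empty(&sched->pending_list),
				msecs_to_jiffies(10000))) /* arbitrary limit */
		WARN(1, "drm_sched_fini() timed out waiting for teardown\n");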

Mind you, I still think a solution which does not add new state 
machinery should be preferred.

Regards,

Tvrtko

