lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <287b6d12-2a65-3195-7305-0255a4762778@amd.com>
Date:   Wed, 2 Aug 2023 10:12:53 -0400
From:   Luben Tuikov <luben.tuikov@....com>
To:     Matthew Brost <matthew.brost@...el.com>
Cc:     Asahi Lina <lina@...hilina.net>, David Airlie <airlied@...il.com>,
        Daniel Vetter <daniel@...ll.ch>,
        Sumit Semwal <sumit.semwal@...aro.org>,
        Christian König <christian.koenig@....com>,
        Faith Ekstrand <faith.ekstrand@...labora.com>,
        linux-kernel@...r.kernel.org, dri-devel@...ts.freedesktop.org,
        asahi@...ts.linux.dev, Alyssa Rosenzweig <alyssa@...enzweig.io>,
        linux-media@...r.kernel.org
Subject: Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is
 torn down.

On 2023-08-02 00:06, Matthew Brost wrote:
> On Mon, Jul 17, 2023 at 01:40:38PM -0400, Luben Tuikov wrote:
>> On 2023-07-16 03:51, Asahi Lina wrote:
>>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>>> causes segfaults and other badness when job completion fences are
>>>>> signaled after the scheduler is torn down.
>>>>
>>>> If there are pending jobs, ideally we want to call into the driver,
>>>> so that it can release resources it may be holding for those.
>>>> The idea behind "pending" is that they are pending in the hardware
>>>> and we don't know their state until signalled/the callback called.
>>>> (Or unless the device is reset and we get a notification of that fact.)
>>>
>>> That's what the job->free_job() callback does, then the driver is free 
>>> to do whatever it wants with those jobs. A driver could opt to 
>>> synchronously kill those jobs (if it can) or account for them 
>>> separately/asynchronously.
>>>
>>> What this patch basically says is that if you destroy a scheduler with 
>>> pending jobs, it immediately considers them terminated with an error, 
>>> and returns ownership back to the driver for freeing. Then the driver 
>>> can decide how to handle the rest and whatever the underlying hardware 
>>> state is.
>>>
>>>>> Explicitly detach all jobs from their completion callbacks and free
>>>>> them. This makes it possible to write a sensible safe abstraction for
>>>>> drm_sched, without having to externally duplicate the tracking of
>>>>> in-flight jobs.
>>>>>
>>>>> This shouldn't regress any existing drivers, since calling
>>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>>> be a no-op if there are no pending jobs.
>>>>
>>>> While this statement is true on its own, it kind of contradicts
>>>> the premise of the first paragraph.
>>>
>>> I mean right *now* it's broken, before this patch. I'm trying to make it 
>>> safe, but it shouldn't regress any exiting drivers since if they trigger 
>>> this code path they are broken today.
>>
>> Not sure about other drivers--they can speak for themselves and the CC list
>> should include them--please use "dim add-missing-cc" and make sure
>> that the Git commit description contains the Cc tags--then git send-email
>> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
>>
> 
> Xe doesn't need this as our reference counting scheme doesn't allow
> drm_sched_fini to be called when jobs are pending. If we want to
> teardown a drm_sched we set TDR timeout to zero and all pending jobs
> gets cleaned up that way, the ref of sched will go to zero, and
> drm_sched_fini is called. The caveat here being I think we need a worker
> to call drm_sched_fini as the last ref to scheduler might be dropped
> from within scheduler main thread.
> 
> That being said, I doubt this patch breaks anything in Xe so do not a
> real strong opinion on this.

Yes, that's my understanding as well. If the drivers call drm_sched_fini()
then they are responsible for cleaning up _before_ calling drm_sched_fini().
The Xe driver seems to be doing the right thing. All in all, since
drm_sched_fini() is called by the driver, the driver is supposed to have
cleaned up before calling it, so I don't see much need for this patch
after all.
-- 
Regards,
Luben

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ