linux-kernel - Re: [PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <585fa052-4eff-940e-b307-2415c315686a@amd.com>
Date:   Wed, 8 Mar 2023 11:03:34 +0100
From:   Christian König <christian.koenig@....com>
To:     Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
        Asahi Lina <lina@...hilina.net>,
        Maxime Ripard <mripard@...nel.org>,
        Thomas Zimmermann <tzimmermann@...e.de>,
        David Airlie <airlied@...il.com>,
        Daniel Vetter <daniel@...ll.ch>,
        Miguel Ojeda <ojeda@...nel.org>,
        Alex Gaynor <alex.gaynor@...il.com>,
        Wedson Almeida Filho <wedsonaf@...il.com>,
        Boqun Feng <boqun.feng@...il.com>, Gary Guo <gary@...yguo.net>,
        Björn Roy Baron <bjorn3_gh@...tonmail.com>,
        Sumit Semwal <sumit.semwal@...aro.org>,
        Luben Tuikov <luben.tuikov@....com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>
Cc:     Alyssa Rosenzweig <alyssa@...enzweig.io>,
        Karol Herbst <kherbst@...hat.com>,
        Ella Stanforth <ella@...unix.org>,
        Faith Ekstrand <faith.ekstrand@...labora.com>,
        Mary <mary@...y.zone>, linux-kernel@...r.kernel.org,
        dri-devel@...ts.freedesktop.org, rust-for-linux@...r.kernel.org,
        linux-media@...r.kernel.org, linaro-mm-sig@...ts.linaro.org,
        linux-sgx@...r.kernel.org, asahi@...ts.linux.dev
Subject: Re: [PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler
 is torn down

Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
>
> On 2023-03-07 15:25, Asahi Lina wrote:
>> drm_sched_fini() currently leaves any pending jobs dangling, which
>> causes segfaults and other badness when job completion fences are
>> signaled after the scheduler is torn down.
>>
>> Explicitly detach all jobs from their completion callbacks and free
>> them. This makes it possible to write a sensible safe abstraction for
>> drm_sched, without having to externally duplicate the tracking of
>> in-flight jobs.
>>
>> This shouldn't regress any existing drivers, since calling
>> drm_sched_fini() with any pending jobs is broken and this change should
>> be a no-op if there are no pending jobs.
>>
>> Signed-off-by: Asahi Lina <lina@...hilina.net>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 27 
>> +++++++++++++++++++++++++--
>>   1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 5c0add2c7546..0aab1e0aebdd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
>>   void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   {
>>       struct drm_sched_entity *s_entity;
>> +    struct drm_sched_job *s_job, *tmp;
>>       int i;
>>   -    if (sched->thread)
>> -        kthread_stop(sched->thread);
>> +    if (!sched->thread)
>> +        return;
>> +
>> +    /*
>> +     * Stop the scheduler, detaching all jobs from their hardware 
>> callbacks
>> +     * and cleaning up complete jobs.
>> +     */
>> +    drm_sched_stop(sched, NULL);
>> +
>> +    /*
>> +     * Iterate through the pending job list and free all jobs.
>> +     * This assumes the driver has either guaranteed jobs are 
>> already stopped, or that
>> +     * otherwise it is responsible for keeping any necessary data 
>> structures for
>> +     * in-progress jobs alive even when the free_job() callback is 
>> called early (e.g. by
>> +     * putting them in its own queue or doing its own refcounting).
>> +     */
>> +    list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>> +        spin_lock(&sched->job_list_lock);
>> +        list_del_init(&s_job->list);
>> +        spin_unlock(&sched->job_list_lock);
>> +        sched->ops->free_job(s_job);
>> +    }
>
> I would stop the kthread first, then delete all jobs without spinlock 
> since nothing else can race against sched_fini?
>
> If you do need the spinlock, It would need to guard 
> list_for_each_entry too.

Well this case here actually should not happen in the first place.

Jobs depend on their device, so as long as there are jobs there should 
also be a reference to the scheduler.

What could be is that you have allocated a scheduler instance 
dynamically, but even then you should first tear down all entities and 
then the scheduler.

Regards,
Christian.

>
>> +
>> +    kthread_stop(sched->thread);
>>         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= 
>> DRM_SCHED_PRIORITY_MIN; i--) {
>>           struct drm_sched_rq *rq = &sched->sched_rq[i];
>>