linux-kernel - Re: [PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <e18500b5-21a0-77fd-8434-86258cefce5a@amd.com>
Date:   Wed, 8 Mar 2023 19:12:00 +0100
From:   Christian König <christian.koenig@....com>
To:     Asahi Lina <lina@...hilina.net>,
        Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
        Maxime Ripard <mripard@...nel.org>,
        Thomas Zimmermann <tzimmermann@...e.de>,
        David Airlie <airlied@...il.com>,
        Daniel Vetter <daniel@...ll.ch>,
        Miguel Ojeda <ojeda@...nel.org>,
        Alex Gaynor <alex.gaynor@...il.com>,
        Wedson Almeida Filho <wedsonaf@...il.com>,
        Boqun Feng <boqun.feng@...il.com>, Gary Guo <gary@...yguo.net>,
        Björn Roy Baron <bjorn3_gh@...tonmail.com>,
        Sumit Semwal <sumit.semwal@...aro.org>,
        Luben Tuikov <luben.tuikov@....com>,
        Jarkko Sakkinen <jarkko@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>
Cc:     Alyssa Rosenzweig <alyssa@...enzweig.io>,
        Karol Herbst <kherbst@...hat.com>,
        Ella Stanforth <ella@...unix.org>,
        Faith Ekstrand <faith.ekstrand@...labora.com>,
        Mary <mary@...y.zone>, linux-kernel@...r.kernel.org,
        dri-devel@...ts.freedesktop.org, rust-for-linux@...r.kernel.org,
        linux-media@...r.kernel.org, linaro-mm-sig@...ts.linaro.org,
        linux-sgx@...r.kernel.org, asahi@...ts.linux.dev
Subject: Re: [PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler
 is torn down

Am 08.03.23 um 18:32 schrieb Asahi Lina:
> [SNIP]
> Yes but... none of this cleans up jobs that are already submitted by the
> scheduler and in its pending list, with registered completion callbacks,
> which were already popped off of the entities.
>
> *That* is the problem this patch fixes!

Ah! Yes that makes more sense now.

>> We could add a warning when users of this API doesn't do this
>> correctly, but cleaning up incorrect API use is clearly something we
>> don't want here.
> It is the job of the Rust abstractions to make incorrect API use that
> leads to memory unsafety impossible. So even if you don't want that in
> C, it's my job to do that for Rust... and right now, I just can't
> because drm_sched doesn't provide an API that can be safely wrapped
> without weird bits of babysitting functionality on top (like tracking
> jobs outside or awkwardly making jobs hold a reference to the scheduler
> and defer dropping it to another thread).

Yeah, that was discussed before but rejected.

The argument was that upper layer needs to wait for the hw to become 
idle before the scheduler can be destroyed anyway.

>>> Right now, it is not possible to create a safe Rust abstraction for
>>> drm_sched without doing something like duplicating all job tracking in
>>> the abstraction, or the above backreference + deferred cleanup mess, or
>>> something equally silly. So let's just fix the C side please ^^
>> Nope, as far as I can see this is just not correctly tearing down the
>> objects in the right order.
> There's no API to clean up in-flight jobs in a drm_sched at all.
> Destroying an entity won't do it. So there is no reasonable way to do
> this at all...

Yes, this was removed.

>> So you are trying to do something which is not supposed to work in the
>> first place.
> I need to make things that aren't supposed to work impossible to do in
> the first place, or at least fail gracefully instead of just oopsing
> like drm_sched does today...
>
> If you're convinced there's a way to do this, can you tell me exactly
> what code sequence I need to run to safely shut down a scheduler
> assuming all entities are already destroyed? You can't ask me for a list
> of pending jobs (the scheduler knows this, it doesn't make any sense to
> duplicate that outside), and you can't ask me to just not do this until
> all jobs complete execution (because then we either end up with the
> messy deadlock situation I described if I take a reference, or more
> duplicative in-flight job count tracking and blocking in the free path
> of the Rust abstraction, which doesn't make any sense either).

Good question. We don't have anybody upstream which uses the scheduler 
lifetime like this.

Essentially the job list in the scheduler is something we wanted to 
remove because it causes tons of race conditions during hw recovery.

When you tear down the firmware queue how do you handle already 
submitted jobs there?

Regards,
Christian.

>
> ~~ Lina