linux-kernel - Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cedc57a5-779a-3ec6-cfbd-f51dfb17f2dc@igalia.com>
Date:   Mon, 3 Jul 2023 12:00:22 -0300
From:   André Almeida <andrealmeid@...lia.com>
To:     Pekka Paalanen <ppaalanen@...il.com>
Cc:     Marek Olšák <maraeo@...il.com>,
        pierre-eric.pelloux-prayer@....com,
        Sebastian Wick <sebastian.wick@...hat.com>,
        Michel Dänzer <michel.daenzer@...lbox.org>,
        amd-gfx@...ts.freedesktop.org,
        Timur Kristóf <timur.kristof@...il.com>,
        Randy Dunlap <rdunlap@...radead.org>,
        linux-kernel@...r.kernel.org, dri-devel@...ts.freedesktop.org,
        alexander.deucher@....com,
        Samuel Pitoiset <samuel.pitoiset@...il.com>,
        kernel-dev@...lia.com, christian.koenig@....com
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations



Em 03/07/2023 05:49, Pekka Paalanen escreveu:
> On Mon, 3 Jul 2023 09:12:29 +0200
> Michel Dänzer <michel.daenzer@...lbox.org> wrote:
> 
>> On 6/30/23 22:32, Marek Olšák wrote:
>>> On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@...lbox.org <mailto:michel.daenzer@...lbox.org>> wrote:
>>>> On 6/30/23 16:59, Alex Deucher wrote:
>>>>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>>>>> <sebastian.wick@...hat.com <mailto:sebastian.wick@...hat.com>> wrote:
>>>>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@...lia.com <mailto:andrealmeid@...lia.com>> wrote:
>>>>>>>
>>>>>>> +Robustness
>>>>>>> +----------
>>>>>>> +
>>>>>>> +The only way to try to keep an application working after a reset is if it
>>>>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>>>>> +
>>>>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>>>>>> +there is no guarantee that the app will use such features correctly, and the
>>>>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>>>>> +the user interface from being correctly displayed. This should be done even if
>>>>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>>>>
>>>>>> I still don't think it's good to let the kernel arbitrarily kill
>>>>>> processes that it thinks are not well-behaved based on some heuristics
>>>>>> and policy.
>>>>>>
>>>>>> Can't this be outsourced to user space? Expose the information about
>>>>>> processes causing a device and let e.g. systemd deal with coming up
>>>>>> with a policy and with killing stuff.
>>>>>
>>>>> I don't think it's the kernel doing the killing, it would be the UMD.
>>>>> E.g., if the app is guilty and doesn't support robustness the UMD can
>>>>> just call exit().
>>>>
>>>> It would be safer to just ignore API calls[0], similarly to what
>>>> is done until the application destroys the context with
>>>> robustness. Calling exit() likely results in losing any unsaved
>>>> work, whereas at least some applications might otherwise allow
>>>> saving the work by other means.
>>>
>>> That's a terrible idea. Ignoring API calls would be identical to a
>>> freeze. You might as well disable GPU recovery because the result
>>> would be the same.
>>
>> No GPU recovery would affect everything using the GPU, whereas this
>> affects only non-robust applications.
>>
>>
>>> - non-robust contexts: call exit(1) immediately, which is the best
>>> way to recover
>>
>> That's not the UMD's call to make.
>>
>>
>>>>      [0] Possibly accompanied by a one-time message to stderr along
>>>> the lines of "GPU reset detected but robustness not enabled in
>>>> context, ignoring OpenGL API calls".
>>
> 
> Hi,
> 
> Michel does have a point. It's not just games and display servers that
> use GPU, but productivity tools as well. They may have periodic
> autosave in anticipation of crashes, but being able to do the final
> save before quitting would be nice. UMD killing the process would be
> new behaviour, right? Previously either application's GPU thread hangs
> or various API calls return errors, but it didn't kill the process, did
> it?
> 

In Intel's Iris, UMD may call abort() for the reset guilty application:

https://elixir.bootlin.com/mesa/mesa-23.0.4/source/src/gallium/drivers/iris/iris_batch.c#L1063

I was pretty sure this was the same for RadeonSI, but I failed to find 
the code for this, so I might be wrong.

> If an application freezes, that's "no problem"; the end user can just
> continue using everything else. Alt-tab away etc. if the app was
> fullscreen. I do that already with games on even Xorg.
> 
> If a display server freezes, that's a desktop-wide problem, but so is
> killing it.
> 

Interesting, what GPU do you use? In my experience (AMD RX 5600 XT), 
hanging the GPU usually means that the rest of applications/compositor 
can't use the GPU either, freezing all user interactions. So killing the 
guilty app is one effective solution currently, but ignoring calls may 
help as well.

> OTOH, if UMD really does need to terminate the process, then please do
> it in a way that causes a crash report to be recorded. _exit() with an
> error code is not it.
> 

In the "Reporting causes of resets" subsection of this document I can 
add something for UMD as well.

> 
> Thanks,
> pq