linux-kernel - Re: [PATCH v7] drm/doc: Document DRM device reset expectations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4b85a79e-bfb3-4e38-bb10-212539e45126@igalia.com>
Date:   Wed, 23 Aug 2023 15:07:21 -0300
From:   André Almeida <andrealmeid@...lia.com>
To:     Rodrigo Vivi <rodrigo.vivi@...el.com>
Cc:     dri-devel@...ts.freedesktop.org, amd-gfx@...ts.freedesktop.org,
        linux-kernel@...r.kernel.org, pierre-eric.pelloux-prayer@....com,
        Randy Dunlap <rdunlap@...radead.org>,
        'Marek Olšák' <maraeo@...il.com>,
        Michel Dänzer <michel.daenzer@...lbox.org>,
        Timur Kristóf <timur.kristof@...il.com>,
        Pekka Paalanen <ppaalanen@...il.com>,
        Samuel Pitoiset <samuel.pitoiset@...il.com>,
        kernel-dev@...lia.com, alexander.deucher@....com,
        christian.koenig@....com
Subject: Re: [PATCH v7] drm/doc: Document DRM device reset expectations

Hi Rodrigo,

Em 23/08/2023 14:31, Rodrigo Vivi escreveu:
> On Fri, Aug 18, 2023 at 05:06:42PM -0300, André Almeida wrote:
>> Create a section that specifies how to deal with DRM device resets for
>> kernel and userspace drivers.
>>
>> Signed-off-by: André Almeida <andrealmeid@...lia.com>
>>
>> ---
>>
>> v7 changes:
>>   - s/application/graphical API contex/ in the robustness part (Michel)
>>   - Grammar fixes (Randy)
>>
>> v6: https://lore.kernel.org/lkml/20230815185710.159779-1-andrealmeid@igalia.com/
>>
>> v6 changes:
>>   - Due to substantial changes in the content, dropped Pekka's Acked-by
>>   - Grammar fixes (Randy)
>>   - Add paragraph about disabling device resets
>>   - Add note about integrating reset tracking in drm/sched
>>   - Add note that KMD should return failure for contexts affected by
>>     resets and UMD should check for this
>>   - Add note about lack of consensus around what to do about non-robust
>>     apps
>>
>> v5: https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealmeid@igalia.com/
>> ---
>>   Documentation/gpu/drm-uapi.rst | 77 ++++++++++++++++++++++++++++++++++
>>   1 file changed, 77 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 65fb3036a580..3694bdb977f5 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -285,6 +285,83 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>>   mmapped regular files. Threads cause additional pain with signal
>>   handling as well.
>>   
>> +Device reset
>> +============
>> +
>> +The GPU stack is really complex and is prone to errors, from hardware bugs,
>> +faulty applications and everything in between the many layers. Some errors
>> +require resetting the device in order to make the device usable again. This
>> +section describes the expectations for DRM and usermode drivers when a
>> +device resets and how to propagate the reset status.
>> +
>> +Device resets can not be disabled without tainting the kernel, which can lead to
>> +hanging the entire kernel through shrinkers/mmu_notifiers. Userspace role in
>> +device resets is to propagate the message to the application and apply any
>> +special policy for blocking guilty applications, if any. Corollary is that
>> +debugging a hung GPU context require hardware support to be able to preempt such
>> +a GPU context while it's stopped.
>> +
>> +Kernel Mode Driver
>> +------------------
>> +
>> +The KMD is responsible for checking if the device needs a reset, and to perform
>> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
>> +should keep track of resets, because userspace can query any time about the
>> +reset status for a specific context. This is needed to propagate to the rest of
>> +the stack that a reset has happened. Currently, this is implemented by each
>> +driver separately, with no common DRM interface. Ideally this should be properly
>> +integrated at DRM scheduler to provide a common ground for all drivers. After a
>> +reset, KMD should reject new command submissions for affected contexts.
> 
> is there any consensus around what exactly 'affected contexts' might mean?
> I see i915 pin-point only the context that was at execution with head pointing
> at it and doesn't blame the queued ones, while on Xe it looks like we are
> blaming all the queued context. Not sure what other drivers are doing for the
> 'affected contexts'.
> 

"Affected contexts" is a generic term indeed, giving the differences 
from each driver as you already pointed out. amdgpu also tends to affect 
all queued contexts during a reset. This wording was used to fit how 
different drivers works.