lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADnq5_MNVdtdcWKSz6dgmsjg+kEu8p5FVE+fkw_5BaXeG3QGow@mail.gmail.com>
Date:   Fri, 30 Jun 2023 10:59:20 -0400
From:   Alex Deucher <alexdeucher@...il.com>
To:     Sebastian Wick <sebastian.wick@...hat.com>
Cc:     André Almeida <andrealmeid@...lia.com>,
        pierre-eric.pelloux-prayer@....com,
        Samuel Pitoiset <samuel.pitoiset@...il.com>,
        Pekka Paalanen <pekka.paalanen@...labora.com>,
        Marek Olšák <maraeo@...il.com>,
        Michel Dänzer <michel.daenzer@...lbox.org>,
        Randy Dunlap <rdunlap@...radead.org>,
        linux-kernel@...r.kernel.org, amd-gfx@...ts.freedesktop.org,
        Pekka Paalanen <ppaalanen@...il.com>,
        Timur Kristóf <timur.kristof@...il.com>,
        dri-devel@...ts.freedesktop.org, kernel-dev@...lia.com,
        alexander.deucher@....com, christian.koenig@....com
Subject: Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
<sebastian.wick@...hat.com> wrote:
>
> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@...lia.com> wrote:
> >
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen <pekka.paalanen@...labora.com>
> > Signed-off-by: André Almeida <andrealmeid@...lia.com>
> > ---
> >
> > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> >
> > Changes:
> >  - Grammar fixes (Randy)
> >
> >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> >  mmapped regular files. Threads cause additional pain with signal
> >  handling as well.
> >
> > +Device reset
> > +============
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +------------------
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +----------------
> > +
> > +The UMD should check before submitting new commands to the KMD if the device has
> > +been reset, and this can be checked more often if the UMD requires it. After
> > +detecting a reset, UMD will then proceed to report it to the application using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
>
> I still don't think it's good to let the kernel arbitrarily kill
> processes that it thinks are not well-behaved based on some heuristics
> and policy.
>
> Can't this be outsourced to user space? Expose the information about
> processes causing a device and let e.g. systemd deal with coming up
> with a policy and with killing stuff.

I don't think it's the kernel doing the killing, it would be the UMD.
E.g., if the app is guilty and doesn't support robustness the UMD can
just call exit().

Alex

>
> > +
> > +OpenGL
> > +~~~~~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > +interface tells if a reset has happened, and if so, all the context state is
> > +considered lost and the app proceeds by creating new ones. If it is possible to
> > +determine that robustness is not in use, the UMD will terminate the app when a
> > +reset is detected, giving that the contexts are lost and the app won't be able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~~~~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > +This error code means, among other things, that a device reset has happened and
> > +it needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--------------------------
> > +
> > +Apart from propagating the reset through the stack so apps can recover, it's
> > +really useful for driver developers to learn more about what caused the reset in
> > +first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
> > +
> >  .. _drm_driver_ioctl:
> >
> >  IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ