lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z8KgwswQQyGxhsR1@black.fi.intel.com>
Date: Sat, 1 Mar 2025 07:53:06 +0200
From: Raag Jadav <raag.jadav@...el.com>
To: André Almeida <andrealmeid@...lia.com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
	kernel-dev@...lia.com, amd-gfx@...ts.freedesktop.org,
	intel-xe@...ts.freedesktop.org, intel-gfx@...ts.freedesktop.org,
	Alex Deucher <alexander.deucher@....com>,
	'Christian König' <christian.koenig@....com>,
	siqueira@...lia.com, airlied@...il.com, simona@...ll.ch,
	rodrigo.vivi@...el.com, jani.nikula@...ux.intel.com,
	lucas.demarchi@...el.com
Subject: Re: [PATCH 1/2] drm: Create an app info option for wedge events

On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> Hi Raag,
> 
> On 2/28/25 11:20, Raag Jadav wrote:
> > Cc: Lucas
> > 
> > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > When a device get wedged, it might be caused by a guilty application.
> > > For userspace, knowing which app was the cause can be useful for some
> > > situations, like for implementing a policy, logs or for giving a chance
> > > for the compositor to let the user know what app caused the problem.
> > > This is an optional argument, when `PID=-1` there's no information about
> > > the app caused the problem, or if any app was involved during the hang.
> > > 
> > > Sometimes just the PID isn't enough giving that the app might be already
> > > dead by the time userspace will try to check what was this PID's name,
> > > so to make the life easier also notify what's the app's name in the user
> > > event.
> > > 
> > > Signed-off-by: André Almeida <andrealmeid@...lia.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
> > >   drivers/gpu/drm/drm_drv.c                  | 16 +++++++++++++---
> > >   drivers/gpu/drm/i915/gt/intel_reset.c      |  3 ++-
> > >   drivers/gpu/drm/xe/xe_device.c             |  3 ++-
> > >   include/drm/drm_device.h                   |  8 ++++++++
> > >   include/drm/drm_drv.h                      |  3 ++-
> > >   7 files changed, 29 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 24ba52d76045..00b9b87dafd8 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6124,7 +6124,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > >   	atomic_set(&adev->reset_domain->reset_res, r);
> > >   	if (!r)
> > > -		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > > +		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > >   	return r;
> > >   }
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > index ef1b77f1e88f..3ed9cbcab1ad 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > @@ -150,7 +150,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> > >   			amdgpu_fence_driver_force_completion(ring);
> > >   			if (amdgpu_ring_sched_ready(ring))
> > >   				drm_sched_start(&ring->sched, 0);
> > > -			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > > +			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > >   			dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
> > >   			goto exit;
> > >   		}
> > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > index 17fc5dc708f4..48faafd82a99 100644
> > > --- a/drivers/gpu/drm/drm_drv.c
> > > +++ b/drivers/gpu/drm/drm_drv.c
> > > @@ -522,6 +522,7 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >    * drm_dev_wedged_event - generate a device wedged uevent
> > >    * @dev: DRM device
> > >    * @method: method(s) to be used for recovery
> > > + * @info: optional information about the guilty app
> > >    *
> > >    * This generates a device wedged uevent for the DRM device specified by @dev.
> > >    * Recovery @method\(s) of choice will be sent in the uevent environment as
> > > @@ -534,13 +535,14 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >    *
> > >    * Returns: 0 on success, negative error code otherwise.
> > >    */
> > > -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > > +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> > > +			 struct drm_wedge_app_info *info)
> > >   {
> > >   	const char *recovery = NULL;
> > >   	unsigned int len, opt;
> > >   	/* Event string length up to 28+ characters with available methods */
> > > -	char event_string[32];
> > > -	char *envp[] = { event_string, NULL };
> > > +	char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN];
> > > +	char *envp[] = { event_string, pid_string, comm_string, NULL };
> > >   	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > >   	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> > >   		 "but recovered through reset" : "needs recovery");
> > > +	if (info) {
> > > +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> > > +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> > > +	} else {
> > > +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> > > +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> > > +	}
> > This is not much use for wedge cases that needs recovery, since at that point
> > the userspace will need to clean house anyway.
> > 
> > Which leaves us with only 'none' case and perhaps the need for standardization
> > of "optional telemetry collection".
> > 
> > Thoughts?
> 
> I had the feeling that 'none' was already meant to be used for that. Do you
> think we should move to another naming? Given that we didn't reach the merge
> window yet we could potentially change that name without much damage.

No, I meant thoughts on possible telemetry data that the drivers might
think is useful for userspace (along with PID) and can be presented in
a vendor agnostic manner (just like wedged event).

Raag

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ