linux-kernel - Re: [PATCH] drm/amdgpu: Enable runtime modification of gpu

Message-ID: <563b1797-5524-44c5-89b0-754f245e6b8f@amd.com>
Date: Sun, 29 Dec 2024 21:11:38 +0100
From: Christian König <christian.koenig@....com>
To: Shuai Xue <xueshuai@...ux.alibaba.com>, alexander.deucher@....com,
 Xinhui.Pan@....com, airlied@...il.com, simona@...ll.ch, lijo.lazar@....com,
 le.ma@....com, hamza.mahfooz@....com, tzimmermann@...e.de,
 shaoyun.liu@....com, Jun.Ma2@....com
Cc: amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] drm/amdgpu: Enable runtime modification of gpu_recovery
 parameter with validation

On 28.12.24 at 07:32, Shuai Xue wrote:
> It's observed that most GPU jobs utilize less than one server, typically
> with each GPU being used by an independent job. If a job consumes poisoned
> data, a SIGBUS signal is sent to terminate it. Meanwhile, because the
> gpu_recovery parameter is set to -1 by default, the amdgpu driver resets
> all GPUs on the server. As a result, all jobs are terminated. Setting
> gpu_recovery to 0 provides an opportunity to preemptively evacuate other
> jobs and subsequently manually reset all GPUs.

*BIG* NAK to this whole approach!

Setting gpu_recovery to 0 in a production environment is *NOT* supported 
at all and should never be done.

This is a pure debugging feature for JTAG debugging and can result in 
random crashes and/or compromised data.

Please don't tell me that you tried to use this in a production environment.

Regards,
Christian.
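
For reference, the range check in the quoted patch below can be sketched
in userspace C. This is only an illustration under stated assumptions:
parse_gpu_recovery() is a hypothetical name, strtol() stands in for the
kernel's kstrtol(), and every failure is collapsed to -EINVAL. It is not
the driver's code and says nothing about whether gpu_recovery=0 is safe
to use.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the setter in the quoted patch.
 * Accepts only "-1", "0" or "1" and stores the value in *out.
 * Returns 0 on success and -EINVAL on any parse or range failure;
 * *out is left untouched on failure.
 */
static int parse_gpu_recovery(const char *buf, int *out)
{
	char *end;
	long val;	/* signed, so the comparison with -1 is well defined */

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -EINVAL;

	/* Tolerate one trailing newline, as sysfs writes usually carry one. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (val != -1 && val != 0 && val != 1)
		return -EINVAL;

	*out = (int)val;
	return 0;
}
```

Unlike kstrtol(), strtol() skips leading whitespace, so the sketch is
slightly more permissive than the kernel helper on that point.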

>   However, this parameter is
> read-only, necessitating correct settings at driver load. And reloading the
> GPU driver in a production environment can be challenging due to reference
> counts maintained by various monitoring services.
>
> Set the gpu_recovery parameter with read-write permission to enable runtime
> modification. This enables users to dynamically manage GPU recovery
> mechanisms based on real-time requirements or conditions.
>
> Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 ++++++++++++++++++++++++-
>   1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 38686203bea6..03dd902e1cec 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -563,12 +563,36 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
>   MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>   module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>   
> +static int amdgpu_set_gpu_recovery(const char *buf,
> +				   const struct kernel_param *kp)
> +{
> +	long val;
> +	int ret;
> +
> +	ret = kstrtol(buf, 10, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (val != 1 && val != 0 && val != -1) {
> +		pr_err("Invalid value for gpu_recovery: %ld, expected 0, 1 or -1\n",
> +		       val);
> +		return -EINVAL;
> +	}
> +
> +	return param_set_int(buf, kp);
> +}
> +
> +static const struct kernel_param_ops amdgpu_gpu_recovery_ops = {
> +	.set = amdgpu_set_gpu_recovery,
> +	.get = param_get_int,
> +};
> +
>   /**
>    * DOC: gpu_recovery (int)
>    * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
>    */
>   MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
> -module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
> +module_param_cb(gpu_recovery, &amdgpu_gpu_recovery_ops, &amdgpu_gpu_recovery, 0644);
>   
>   /**
>    * DOC: emu_mode (int)