Message-ID: <Y1fevNkLmG3+5lWt@kroah.com>
Date: Tue, 25 Oct 2022 15:03:56 +0200
From: Greg KH <gregkh@...uxfoundation.org>
To: Kai Ye <yekai13@...wei.com>
Cc: linux-accelerators@...ts.ozlabs.org, linux-kernel@...r.kernel.org,
linuxarm@...wei.com, zhangfei.gao@...aro.org,
wangzhou1@...ilicon.com
Subject: Re: [PATCH v9 2/3] Documentation: add a isolation strategy sysfs node for uacce
On Tue, Oct 25, 2022 at 12:39:30PM +0000, Kai Ye wrote:
> Update documentation describing sysfs node that could help to
> configure isolation strategy for users in the user space. And
> describing sysfs node that could read the device isolated state.
>
> Signed-off-by: Kai Ye <yekai13@...wei.com>
> ---
> Documentation/ABI/testing/sysfs-driver-uacce | 27 ++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
> index 08f2591138af..50737c897ba3 100644
> --- a/Documentation/ABI/testing/sysfs-driver-uacce
> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
> @@ -19,6 +19,33 @@ Contact: linux-accelerators@...ts.ozlabs.org
> Description: Available instances left of the device
> Return -ENODEV if uacce_ops get_available_instances is not provided
>
> +What: /sys/class/uacce/<dev_name>/isolate_strategy
> +Date: Oct 2022
> +KernelVersion: 6.1
> +Contact: linux-accelerators@...ts.ozlabs.org
> +Description: (RW) Configure the frequency size for the hardware error
> + isolation strategy. This unit is the number of times. Number
Number of times what?
> + of occurrences in a period, also means threshold. If the number
> + of device pci AER error exceeds the threshold in a time window,
What is the time window?
> + the device is isolated. This size is a configured integer value.
> + The default is 0. The maximum value is 65535.
> +
> + In the hisilicon accelerator engine, first we will
> + time-stamp every slot AER error. Then check the AER error log
> + when the device AER error occurred. if the device slot AER error
> + count exceeds the preset the number of times in one hour, the
> + isolated state will be set to true. So the device will be
> + isolated. And the AER error log that exceed one hour will be
> + cleared.
This seems like a very hardware-specific implementation here. And this
is supposed to be a generic class?
I feel this is getting really messy :(
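For reference, the hisilicon behavior quoted above reads like a sliding one-hour window over error timestamps. Below is a minimal userspace C sketch of that reading, not the driver code. Every name in it (err_log, err_log_record, ERR_LOG_MAX) is hypothetical, timestamps are assumed to be non-decreasing seconds, and treating a threshold of 0 as "never isolate" is an assumption the quoted text does not state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ERR_WINDOW_SECS 3600u /* "one hour", per the quoted description */
#define ERR_LOG_MAX     64u   /* hypothetical log capacity for this sketch */

/* Hypothetical per-device state; not taken from the uacce code. */
struct err_log {
	uint64_t stamps[ERR_LOG_MAX]; /* timestamps (seconds) of logged errors */
	size_t count;                 /* number of valid entries in stamps[] */
	uint32_t threshold;           /* value written to isolate_strategy */
	bool isolated;                /* sticky once set */
};

/* Drop entries that are more than one hour older than 'now'. */
static void err_log_prune(struct err_log *log, uint64_t now)
{
	size_t kept = 0;

	for (size_t i = 0; i < log->count; i++) {
		if (now - log->stamps[i] <= ERR_WINDOW_SECS)
			log->stamps[kept++] = log->stamps[i];
	}
	log->count = kept;
}

/*
 * Time-stamp one error, then check the log against the threshold.
 * Returns the isolated state after this error.
 */
static bool err_log_record(struct err_log *log, uint64_t now)
{
	err_log_prune(log, now);

	/*
	 * Sketch limitation: when the log is full, the oldest entry is
	 * discarded, so thresholds >= ERR_LOG_MAX can never be exceeded.
	 */
	if (log->count == ERR_LOG_MAX) {
		memmove(&log->stamps[0], &log->stamps[1],
			(ERR_LOG_MAX - 1) * sizeof(log->stamps[0]));
		log->count--;
	}
	log->stamps[log->count++] = now;

	/* Assumption: threshold 0 disables isolation entirely. */
	if (log->threshold != 0 && log->count > log->threshold)
		log->isolated = true;

	return log->isolated;
}
```

With a threshold of 2, a third error inside the same hour sets the isolated state, while errors spaced more than an hour apart never accumulate. Note that both the one-hour window and the pruning rule live in this per-device logic, not in anything the sysfs attribute itself exposes.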
thanks,
greg k-h