Date:   Thu, 20 Oct 2022 01:24:54 -0700
From:   Aru <aru.kolappan@...cle.com>
To:     Leon Romanovsky <leon@...nel.org>
Cc:     jgg@...pe.ca, saeedm@...dia.com, linux-rdma@...r.kernel.org,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
        manjunath.b.patil@...cle.com, rama.nichanamatlu@...cle.com
Subject: Re: [PATCH 1/1] net/mlx5: add dynamic logging for mlx5_dump_err_cqe

On 10/18/22 12:47 AM, Leon Romanovsky wrote:
> On Fri, Oct 14, 2022 at 12:12:36PM -0700, Aru wrote:
>> Hi Leon,
>>
>> Thank you for reviewing the patch.
>>
>> The method you mentioned disables the dump permanently for the kernel.
>> We thought vendor might have enabled it for their consumption when needed.
>> Hence we made it dynamic, so that it can be enabled/disabled at run time.
>>
>> Especially, in a production environment, having the option to turn this log
>> on/off at runtime will be helpful.
> While you are interested in turning this specific warning on/off, your
> change will hide all syndromes, as it is unlikely that anyone runs in
> production with debug prints enabled.
>
>   -   mlx5_ib_warn(dev, "dump error cqe\n");
>   +   mlx5_ib_dbg(dev, "dump error cqe\n");
>
> Something like this will do the trick without affecting the others.
>
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index 457f57b088c6..966206085eb3 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -267,10 +267,29 @@ static void handle_responder(struct ib_wc *wc, struct mlx5_cqe64 *cqe,
>   	wc->wc_flags |= IB_WC_WITH_NETWORK_HDR_TYPE;
>   }
>   
> -static void dump_cqe(struct mlx5_ib_dev *dev, struct mlx5_err_cqe *cqe)
> +static void dump_cqe(struct mlx5_ib_dev *dev, struct mlx5_err_cqe *cqe,
> +		     struct ib_wc *wc, int dump)
>   {
> -	mlx5_ib_warn(dev, "dump error cqe\n");
> -	mlx5_dump_err_cqe(dev->mdev, cqe);
> +	const char *level;
> +
> +	if (!dump)
> +		return;
> +
> +	mlx5_ib_warn(dev, "WC error: %d, Message: %s\n", wc->status,
> +		     ib_wc_status_msg(wc->status));
> +
> +	if (dump == 1) {
> +		mlx5_ib_warn(dev, "dump error cqe\n");
> +		level = KERN_WARNING;
> +	}
> +
> +	if (dump == 2) {
> +		mlx5_ib_dbg(dev, "dump error cqe\n");
> +		level = KERN_DEBUG;
> +	}
> +
> +	print_hex_dump(level, "", DUMP_PREFIX_OFFSET, 16, 1, cqe, sizeof(*cqe),
> +		       false);
>   }
Hi Leon,

Thank you for the reply and your suggested method to handle this debug 
logging.

We set 'dump = 2' for the syndromes applicable to our scenario:
MLX5_CQE_SYNDROME_REMOTE_ACCESS_ERR, MLX5_CQE_SYNDROME_REMOTE_OP_ERR and
MLX5_CQE_SYNDROME_LOCAL_PROT_ERR. We verified the change: by default,
the CQE dump is not printed to syslog until the log level is changed to
KERN_DEBUG. This works as expected.
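
To make the mapping concrete, here is a minimal userspace sketch of the
selection logic. It is not the driver code. The sketch_* names and the string
levels are hypothetical stand-ins for the MLX5_CQE_SYNDROME_* values and for
KERN_WARNING/KERN_DEBUG. The default of 1 for every other syndrome is an
assumption made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's KERN_WARNING / KERN_DEBUG. */
#define SKETCH_WARNING "warning"
#define SKETCH_DEBUG   "debug"

/* Hypothetical subset of syndromes; names only mirror those in the thread. */
enum sketch_syndrome {
	SKETCH_LOCAL_LENGTH_ERR,
	SKETCH_LOCAL_PROT_ERR,
	SKETCH_REMOTE_ACCESS_ERR,
	SKETCH_REMOTE_OP_ERR,
};

/*
 * Dump mode per syndrome: 0 = no dump, 1 = dump at warning level,
 * 2 = dump at debug level. Only the three syndromes named above are
 * moved to mode 2; the default of 1 is an assumption of this sketch.
 */
static int sketch_dump_mode(enum sketch_syndrome s)
{
	switch (s) {
	case SKETCH_LOCAL_PROT_ERR:
	case SKETCH_REMOTE_ACCESS_ERR:
	case SKETCH_REMOTE_OP_ERR:
		return 2;
	default:
		return 1;
	}
}

/* Map a dump mode to a log level; NULL means "do not dump at all". */
static const char *sketch_level(int dump)
{
	assert(dump >= 0 && dump <= 2);

	switch (dump) {
	case 1:
		return SKETCH_WARNING;
	case 2:
		return SKETCH_DEBUG;
	default:
		return NULL;
	}
}
```

Returning NULL for mode 0 lets the caller skip the dump without ever reading
an unset level.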

I will send out another email with the patch using your method.

Is it fine with you if I add your name in the 'Suggested-by' field in
the new patch?

Thanks
Aru

>   
>   static void mlx5_handle_error_cqe(struct mlx5_ib_dev *dev,
> @@ -300,6 +319,7 @@ static void mlx5_handle_error_cqe(struct mlx5_ib_dev *dev,
>   		wc->status = IB_WC_BAD_RESP_ERR;
>   		break;
>   	case MLX5_CQE_SYNDROME_LOCAL_ACCESS_ERR:
> +		dump = 2;
>   		wc->status = IB_WC_LOC_ACCESS_ERR;
>   		break;
>   	case MLX5_CQE_SYNDROME_REMOTE_INVAL_REQ_ERR:
> @@ -328,11 +348,7 @@ static void mlx5_handle_error_cqe(struct mlx5_ib_dev *dev,
>   	}
>   
>   	wc->vendor_err = cqe->vendor_err_synd;
> -	if (dump) {
> -		mlx5_ib_warn(dev, "WC error: %d, Message: %s\n", wc->status,
> -			     ib_wc_status_msg(wc->status));
> -		dump_cqe(dev, cqe);
> -	}
> +	dump_cqe(dev, cqe, wc, dump);
>   }
>   
>   static void handle_atomics(struct mlx5_ib_qp *qp, struct mlx5_cqe64 *cqe64,
>
>> Feel free to share your thoughts.
> And please don't top-post.
>
> Thanks
>> Thanks,
>> Aru
>>
>> On 10/13/22 3:43 AM, Leon Romanovsky wrote:
>>> On Wed, Oct 12, 2022 at 04:52:52PM -0700, Aru Kolappan wrote:
>>>> From: Arumugam Kolappan <aru.kolappan@...cle.com>
>>>>
>>>> Presently, mlx5 driver dumps error CQE by default for few syndromes. Some
>>>> syndromes are expected due to application behavior[Ex: REMOTE_ACCESS_ERR
>>>> for revoking rkey before RDMA operation is completed]. There is no option
>>>> to disable the log if the application decided to do so. This patch
>>>> converts the log into dynamic print and by default, this debug print is
>>>> disabled. Users can enable/disable this logging at runtime if needed.
>>>>
>>>> Suggested-by: Manjunath Patil <manjunath.b.patil@...cle.com>
>>>> Signed-off-by: Arumugam Kolappan <aru.kolappan@...cle.com>
>>>> ---
>>>>    drivers/infiniband/hw/mlx5/cq.c | 2 +-
>>>>    include/linux/mlx5/cq.h         | 4 ++--
>>>>    2 files changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
>>>> index be189e0..890cdc3 100644
>>>> --- a/drivers/infiniband/hw/mlx5/cq.c
>>>> +++ b/drivers/infiniband/hw/mlx5/cq.c
>>>> @@ -269,7 +269,7 @@ static void handle_responder(struct ib_wc *wc, struct mlx5_cqe64 *cqe,
>>>>    static void dump_cqe(struct mlx5_ib_dev *dev, struct mlx5_err_cqe *cqe)
>>>>    {
>>>> -	mlx5_ib_warn(dev, "dump error cqe\n");
>>>> +	mlx5_ib_dbg(dev, "dump error cqe\n");
>>> This path should be handled in switch<->case of mlx5_handle_error_cqe()
>>> by skipping dump_cqe for MLX5_CQE_SYNDROME_REMOTE_ACCESS_ERR.
>>>
>>> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
>>> index be189e0525de..2d75c3071a1e 100644
>>> --- a/drivers/infiniband/hw/mlx5/cq.c
>>> +++ b/drivers/infiniband/hw/mlx5/cq.c
>>> @@ -306,6 +306,7 @@ static void mlx5_handle_error_cqe(struct mlx5_ib_dev *dev,
>>>                   wc->status = IB_WC_REM_INV_REQ_ERR;
>>>                   break;
>>>           case MLX5_CQE_SYNDROME_REMOTE_ACCESS_ERR:
>>> +               dump = 0;
>>>                   wc->status = IB_WC_REM_ACCESS_ERR;
>>>                   break;
>>>           case MLX5_CQE_SYNDROME_REMOTE_OP_ERR:
>>>
>>> Thanks
