lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Wed, 18 Apr 2018 13:46:18 +0800
From:   Yanjun Zhu <yanjun.zhu@...cle.com>
To:     Tariq Toukan <tariqt@...lanox.com>, netdev@...r.kernel.org,
        linux-rdma@...r.kernel.org, haakon.bugge@...cle.com
Subject: Re: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an
 offline device



On 2018/4/17 23:37, Tariq Toukan wrote:
>
>
> On 16/04/2018 4:02 AM, Zhu Yanjun wrote:
>> While a faulty cable is used or HCA firmware error, HCA device will
>> be offline. When the driver is accessing this offline device, the
>> following call trace will pop out.
>>
>> "
>> ...
>>    [<ffffffff816e4842>] dump_stack+0x63/0x81
>>    [<ffffffff816e459e>] panic+0xcc/0x21b
>>    [<ffffffffa03e5f8a>] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
>>    [<ffffffffa03e7298>] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
>>    [<ffffffffa03e7381>] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
>>    [<ffffffffa03e9f00>] __mlx4_cmd+0xb0/0x160 [mlx4_core]
>>    [<ffffffffa0406934>] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
>>    [<ffffffffa03f5f54>] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
>> ...
>> "
>> In the above call trace, the function mlx4_cmd_poll calls the function
>> mlx4_cmd_post to access the HCA while HCA is offline. Then mlx4_cmd_post
>> returns an error -EIO. Per -EIO, the function mlx4_cmd_poll calls
>> mlx4_cmd_reset_flow to reset HCA. And the above call trace pops out.
>>
>> This is not reasonable. Since HCA device is offline when it is being
>> accessed, it should not be reset again.
>>
>> In this patch, since HCA is offline, the function mlx4_cmd_post returns
>> an error -EINVAL. Per -EINVAL, the function mlx4_cmd_poll directly 
>> returns
>> instead of resetting HCA.
>>
>> CC: Srinivas Eeda <srinivas.eeda@...cle.com>
>> CC: Junxiao Bi <junxiao.bi@...cle.com>
>> Suggested-by: Håkon Bugge <haakon.bugge@...cle.com>
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@...cle.com>
>> ---
>>   drivers/net/ethernet/mellanox/mlx4/cmd.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
>> b/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> index 6a9086d..f1c8c42 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> @@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, 
>> u64 in_param, u64 out_param,
>>            * Device is going through error recovery
>>            * and cannot accept commands.
>>            */
>> +        mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
>> +        ret = -EINVAL;
>>           goto out;
>>       }
>>   @@ -657,6 +659,9 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, 
>> u64 in_param, u64 *out_param,
>>       }
>>     out_reset:
>> +    if (err == -EINVAL)
>> +        goto out;
>> +
>
> See below.
>
>>       if (err)
>>           err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
>>   out:
>> @@ -766,6 +771,9 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, 
>> u64 in_param, u64 *out_param,
>>           *out_param = context->out_param;
>>     out_reset:
>> +    if (err == -EINVAL)
>> +        goto out;
>> +
>>       if (err)
>
> Instead, just do here: if (err && err != -EINVAL)
>
>>           err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
>>   out:
>>
>
> I am not sure this does not mistakenly cover other cases that already 
> exist and have (err == -EINVAL).
>
> For example, this line is hard to predict:
> err = mlx4_status_to_errno
> and later on, we might get into
> if (mlx4_closing_cmd_fatal_error(op, stat))
> which leads to out_reset.
Thanks a lot.
Sure. I agree with you that "err = mlx4_status_to_errno" and "if 
(mlx4_closing_cmd_fatal_error(op, stat))" will also make "err=-EINVAL".
This will mistakenly go to out instead of resetting HCA device.

I will make a new patch to avoid the above error.

Zhu Yanjun
>
> We must have a deeper look at this.
> But a better option is, change the error indication to uniquely 
> indicate "already in error recovery".
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ