Date:   Wed, 25 Aug 2021 14:49:56 -0300
From:   Jason Gunthorpe <jgg@...dia.com>
To:     Håkon Bugge <haakon.bugge@...cle.com>
Cc:     Doug Ledford <dledford@...hat.com>,
        Leon Romanovsky <leon@...nel.org>, linux-rdma@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH for-next v2] RDMA/core/sa_query: Retry SA queries

On Thu, Aug 12, 2021 at 06:12:35PM +0200, Håkon Bugge wrote:
> A MAD packet is sent as an unreliable datagram (UD). SA requests are
> sent as MAD packets. As such, SA requests or responses may be silently
> dropped.
> 
> IB Core's MAD layer has a timeout and retry mechanism, which is used
> by, among others, RDMA CM. But it is not used by SA queries. Without
> retries, an SA query waits out the long specified timeout and returns
> an error in case of packet loss. The ULP or user-land process
> has to perform the retry.
> 
> Fix this by taking advantage of the MAD layer's retry mechanism.
> 
> First, a check against a zero timeout is added in
> rdma_resolve_route(). In send_mad(), we set the MAD layer timeout to
> one tenth of the specified timeout and the number of retries to
> 10. The special case when timeout is less than 10 is handled.
> 
> With this fix:
> 
>  # ucmatose -c 1000 -S 1024 -C 1
> 
> runs stably on an InfiniBand fabric. Without this fix, we see
> intermittent failures and it errors out with:
> 
> cmatose: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -110
> 
> (110 is ETIMEDOUT)
> 
> Fixes: f75b7a529494 ("[PATCH] IB: Add automatic retries to MAD layer")
> Signed-off-by: Håkon Bugge <haakon.bugge@...cle.com>
> ---
>  drivers/infiniband/core/cma.c      | 3 +++
>  drivers/infiniband/core/sa_query.c | 9 ++++++++-
>  2 files changed, 11 insertions(+), 1 deletion(-)

I'm nervous about this, mostly because the mad layer is very
complicated, but it does seem aligned with the spec.

However, it seems quite wrong that the timeout comes in from outside;
the SA timeout should be integral to the SA layer.

Anyhow, applied to for-next

Jason
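
For readers following along, the timeout split described in the quoted commit message (per-attempt timeout of one tenth of the caller's timeout, 10 retries, special handling below 10) can be sketched in plain C as below. This is a standalone illustration, not the code from the patch: `SA_QUERY_RETRIES`, `struct mad_timing` and `split_timeout()` are hypothetical names, and the way the "less than 10" case is resolved here (1 ms per attempt, one retry per millisecond) is an assumption, since the message only says that the case is handled.

```c
/* Hypothetical constant; the commit message only states "10". */
#define SA_QUERY_RETRIES 10

/* Hypothetical container for the two values handed to the MAD layer. */
struct mad_timing {
	unsigned long timeout_ms;	/* per-attempt timeout */
	unsigned int retries;		/* number of retries requested */
};

/*
 * Split the caller's overall timeout into a per-attempt timeout and a
 * retry count. A zero total_ms is assumed to be rejected earlier by the
 * caller, as the commit message describes for rdma_resolve_route().
 */
static struct mad_timing split_timeout(unsigned long total_ms)
{
	struct mad_timing t;

	t.timeout_ms = total_ms / SA_QUERY_RETRIES;
	t.retries = SA_QUERY_RETRIES;

	if (t.timeout_ms == 0) {
		/*
		 * total_ms < 10: integer division yields 0. Assumed
		 * handling: fall back to 1 ms per attempt and let the
		 * retry count carry the remaining budget.
		 */
		t.timeout_ms = 1;
		t.retries = (unsigned int)total_ms;
	}
	return t;
}
```

With this sketch, a caller timeout of 2000 ms becomes 200 ms per attempt with 10 retries, and a caller timeout of 7 ms becomes 1 ms per attempt with 7 retries.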
