Date:   Thu, 13 Jun 2019 10:25:23 -0400
From:   Doug Ledford <dledford@...hat.com>
To:     Håkon Bugge <haakon.bugge@...cle.com>,
        Jason Gunthorpe <jgg@...pe.ca>,
        Leon Romanovsky <leon@...nel.org>,
        Parav Pandit <parav@...lanox.com>,
        Steve Wise <swise@...ngridcomputing.com>
Cc:     linux-rdma@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] RDMA/cma: Make CM response timeout and # CM retries
 configurable

On Tue, 2019-02-26 at 08:57 +0100, Håkon Bugge wrote:
> During certain workloads, the default CM response timeout is too
> short, leading to excessive retries. Hence, make it configurable
> through sysctl. While at it, also make number of CM retries
> configurable.
> 
> The defaults are not changed.
> 
> Signed-off-by: Håkon Bugge <haakon.bugge@...cle.com>
> ---
> v1 -> v2:
>    * Added unregister_net_sysctl_table() in cma_cleanup()
> ---
>  drivers/infiniband/core/cma.c | 52 ++++++++++++++++++++++++++++++---
> --
>  1 file changed, 45 insertions(+), 7 deletions(-)
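
The shape of what the commit message describes is, roughly, a sysctl
table registered against init_net at module init and torn down via the
unregister_net_sysctl_table() call mentioned in the v1 -> v2 note.  The
following is only a rough sketch of that shape, not the actual hunk
from this patch; the names, defaults, bounds and sysctl path here are
made up for illustration:

#include <linux/errno.h>
#include <linux/sysctl.h>
#include <net/net_namespace.h>

static int cma_cm_response_timeout = 20;	/* placeholder default (exponent) */
static int cma_max_cm_retries = 15;		/* placeholder default */
static int cma_timeout_min = 8, cma_timeout_max = 31;
static int cma_retries_min = 0, cma_retries_max = 15;

static struct ctl_table_header *cma_sysctl_hdr;

static struct ctl_table cma_sysctl_table[] = {
	{
		.procname	= "cma_cm_response_timeout",
		.data		= &cma_cm_response_timeout,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &cma_timeout_min,
		.extra2		= &cma_timeout_max,
	},
	{
		.procname	= "cma_max_cm_retries",
		.data		= &cma_max_cm_retries,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &cma_retries_min,
		.extra2		= &cma_retries_max,
	},
	{ }	/* terminator */
};

static int cma_sysctl_init(void)
{
	cma_sysctl_hdr = register_net_sysctl(&init_net, "net/rdma_cm",
					     cma_sysctl_table);
	return cma_sysctl_hdr ? 0 : -ENOMEM;
}

static void cma_sysctl_cleanup(void)
{
	/* mirrors the cleanup call the v1 -> v2 note adds to cma_cleanup() */
	unregister_net_sysctl_table(cma_sysctl_hdr);
}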

This has been sitting on patchworks since forever, presumably because
neither Jason nor I really wanted it, but neither of us could justify
flat out refusing it either.  Well, I've made up my mind, so unless
Jason wants to argue the other side, I'm rejecting this patch.  Here's
why.  The whole concept of a timeout is to help recovery in a
situation that overloads one end of the connection.  There is a
relationship between the max queue backlog on the one host and the
timeout on the other host.  Generally, for a request to get dropped
and need to be retransmitted, the queue must already have a full
backlog.  So, how long does it take a heavily loaded system to process
a full backlog?  That, plus some fuzz as a margin of error, should be
our timeout.  We shouldn't be asking users to configure it.
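
To make that reasoning concrete, here is a back-of-the-envelope sketch
using the usual IB CM encoding of the timeout as an exponent t, i.e.
4.096 us * 2^t.  The backlog depth and per-request service time are
inputs that would have to be measured; the numbers below are
placeholders, not measurements:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Placeholder inputs -- measure these on the real system. */
	double backlog = 128;		/* assumed max queue backlog (requests) */
	double service_us = 50000;	/* assumed per-request service time, us */
	double fuzz = 1.25;		/* 25% margin of error */

	double drain_us = backlog * service_us * fuzz;

	/* Smallest exponent t such that 4.096 us * 2^t >= drain time. */
	int t = (int)ceil(log2(drain_us / 4.096));

	printf("drain %.1fs -> timeout exponent %d (%.1fs per attempt)\n",
	       drain_us / 1e6, t, 4.096e-6 * pow(2, t));
	return 0;
}

With these example numbers (8 s to drain the backlog including the
fuzz), the smallest sufficient exponent is 21, giving about 8.6 s per
attempt.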

However, if users change the default backlog queue on their systems,
*then* it would make sense for them to also change the timeout here,
but I think guidance on how to do that would be helpful.

So, to revive this patch, what I'd like to see is some attempt to
actually quantify a reasonable timeout for the default backlog depth.
The patch should then change the default to that value, and add the
ability to adjust the timeout along with some sort of doc guidance on
how to calculate a reasonable one from the configured backlog depth.
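
As a hedged worked example of what that guidance could look like: if
the current cma.c defaults are an exponent of 20 and 15 retries (my
recollection, worth verifying against the source), the implied waits
and a possible scaling rule would be:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Assumed cma.c defaults -- verify before relying on them. */
	int t = 20, retries = 15;
	double per_attempt = 4.096e-6 * pow(2, t);	/* IB CM encoding */

	printf("per attempt %.1fs, worst case %.1fs over %d attempts\n",
	       per_attempt, per_attempt * (retries + 1), retries + 1);

	/* One possible rule of thumb for the docs: if the backlog is
	 * raised k-fold, raise the exponent by ceil(log2(k)) so the
	 * per-attempt timeout grows at least as fast as the drain time. */
	int k = 4;
	printf("%dx backlog -> exponent %d\n", k, t + (int)ceil(log2(k)));
	return 0;
}

With those assumed defaults that works out to roughly 4.3 s per attempt
and about 68.7 s worst case across 16 attempts, and a 4x larger backlog
would suggest bumping the exponent from 20 to 22.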

-- 
Doug Ledford <dledford@...hat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD
