netdev - Re: [PATCH v2 2/3] nvme-tcp: support specifying the congestion-control


Message-ID: <20220331112613.0000063e@tom.com>
Date:   Thu, 31 Mar 2022 11:26:13 +0800
From:   Mingbao Sun <sunmingbao@....com>
To:     Sagi Grimberg <sagi@...mberg.me>
Cc:     Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...com>,
        Christoph Hellwig <hch@....de>,
        Chaitanya Kulkarni <kch@...dia.com>,
        linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org,
        Eric Dumazet <edumazet@...gle.com>,
        "David S . Miller" <davem@...emloft.net>,
        Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
        David Ahern <dsahern@...nel.org>,
        Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
        tyler.sun@...l.com, ping.gan@...l.com, yanxiu.cai@...l.com,
        libin.zhang@...l.com, ao.sun@...l.com
Subject: Re: [PATCH v2 2/3] nvme-tcp: support specifying the
 congestion-control

On Tue, 29 Mar 2022 10:46:08 +0300
Sagi Grimberg <sagi@...mberg.me> wrote:

> >> As I said, TCP can be tuned in various ways, congestion being just one
> >> of them. I'm sure you can find a workload where rmem/wmem will make
> >> a difference.  
> > 
> > Agreed.
> > But the rmem/wmem knob is different:
> > we can enlarge rmem/wmem for NVMe/TCP via sysctl
> > without any downside for other sockets whose
> > rmem/wmem are not explicitly specified.  
> 
> It can most certainly affect them, positively or negatively, depends
> on the use-case.

Agreed.
Your statement is the more rigorous one.

> >> In addition, based on my knowledge, application specific TCP level
> >> tuning (like congestion) is not really a common thing to do. So why in
> >> nvme-tcp?
> >>
> >> So to me at least, it is not clear why we should add it to the driver.  
> > 
> > As mentioned in the commit message, we can specify the
> > congestion-control of NVMe/TCP via sysctl or by writing to
> > '/proc/sys/net/ipv4/tcp_congestion_control', but this also
> > changes the congestion-control of all future TCP sockets on
> > the same host that have not been explicitly assigned a
> > congestion-control, and so may affect their
> > performance.
> > 
> > For example:
> > 
> > A server in a data-center with the following 2 NICs:
> > 
> >      - NIC_front-end, for interacting with clients through WAN
> >        (high latency, ms-level)
> > 
> >      - NIC_back-end, for interacting with NVMe/TCP target through LAN
> >        (low latency, ECN-enabled, ideal for dctcp)
> > 
> > This server interacts with clients (handling requests) via the front-end
> > network and accesses the NVMe/TCP storage via the back-end network.
> > This is a normal use case, right?
> > 
> > For the client devices, we can’t determine their congestion-control.
> > But normally it’s cubic by default (per the CONFIG_DEFAULT_TCP_CONG).
> > So if we change the default congestion control on the server to dctcp
> > for the sake of the NVMe/TCP traffic on the LAN side, we also
> > change the congestion-control of the front-end sockets
> > to dctcp, while the congestion-control on the client side is cubic.
> > This is an unintended side effect.
> > 
> > In addition, distributed storage products like the following also have
> > the above problem:
> > 
> >      - The product consists of a cluster of servers.
> > 
> >      - Each server serves clients via its front-end NIC
> >       (WAN, high latency).
> > 
> >      - All servers interact with each other via NVMe/TCP over the back-end NIC
> >       (LAN, low latency, ECN-enabled, ideal for dctcp).  
> 
> Separate networks are still not application (nvme-tcp) specific and as
> mentioned, we have a way to control that. IMO, this still does not
> qualify as solid justification to add this to nvme-tcp.
> 
> What do others think?

Well, given that the approach (‘ip route …’) proposed
by Jakub largely covers the per-link requirement on
congestion-control, this patchset is really
not that useful.

So I am dropping this patchset and closing all of its threads here.

Finally, many thanks to all of you for reviewing this patchset.
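
For reference, the difference discussed in this thread between the host-wide
default and an explicit per-socket choice can be seen from user space. The
following is only a minimal sketch, assuming a Linux host and a Python build
that exposes socket.TCP_CONGESTION; the helper names are hypothetical and
this is not how the nvme-tcp patch itself works (that is kernel code).

```python
import socket

# Host-wide default; writing to this file changes the algorithm for every
# future TCP socket that does not pick one explicitly.
DEFAULT_PATH = "/proc/sys/net/ipv4/tcp_congestion_control"


def read_default_congestion_control(path=DEFAULT_PATH):
    """Return the host-wide default algorithm name (e.g. 'cubic')."""
    with open(path) as f:
        return f.read().strip()


def get_socket_congestion_control(sock):
    """Return the algorithm currently attached to one TCP socket."""
    # The kernel returns a NUL-padded name of at most 16 bytes.
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    return raw.split(b"\0", 1)[0].decode()


def set_socket_congestion_control(sock, name):
    """Select an algorithm for this socket only; other sockets are untouched."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, name.encode())


if __name__ == "__main__":
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # 'reno' is used here only because it is normally built in; 'dctcp'
    # would need the module loaded and may need extra privileges.
    set_socket_congestion_control(a, "reno")
    print("socket a:", get_socket_congestion_control(a))
    print("socket b:", get_socket_congestion_control(b))
    a.close()
    b.close()
```

Socket b keeps whatever the host default is, while socket a uses the
explicitly assigned algorithm. That is the per-socket behaviour the patchset
aimed to give nvme-tcp; the per-route approach reaches a similar result
without touching the driver.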