netdev - Re: [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path

Message-ID: <20220512124608.452d3300@kernel.org>
Date:   Thu, 12 May 2022 12:46:08 -0700
From:   Jakub Kicinski <kuba@...nel.org>
To:     Joe Damato <jdamato@...tly.com>
Cc:     netdev@...r.kernel.org, davem@...emloft.net,
        linux-kernel@...r.kernel.org, x86@...nel.org
Subject: Re: [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write
 path

On Wed, 11 May 2022 18:01:54 -0700 Joe Damato wrote:
> > Is there a practical use case?  
> 
> Yes; for us there seems to be - especially with AMD Zen2. I'll try to
> describe such a setup and my synthetic HTTP benchmark results.
> 
> Imagine a program, call it storageD, which is responsible for storing and
> retrieving data from a data store. Other programs can request data from
> storageD via communicating with it on a Unix socket.
> 
> One such program that could request data via the Unix socket is an HTTP
> daemon. For some client connections that the HTTP daemon receives, the
> daemon may determine that responses can be sent in plain text.
> 
> In this case, the HTTP daemon can use splice to move data from the unix
> socket connection with storageD directly to the client TCP socket via a
> pipe. splice saves CPU cycles and avoids incurring any memory access
> latency since the data itself is not accessed.
> 
> Because we'll use splice (instead of accessing the data and potentially
> affecting the CPU cache) it is advantageous for storageD to use NT copies
> when it writes to the Unix socket to avoid evicting hot data from the CPU
> cache. After all, once the data is copied into the kernel on the unix
> socket write path, it won't be touched again; only spliced.
> 
> In my synthetic HTTP benchmarks for this setup, we've been able to increase
> network throughput of the HTTP daemon by roughly 30% while reducing
> the system time of storageD. We're still collecting data on production
> workloads.
> 
> The motivation, IMHO, is very similar to the motivation for
> NETIF_F_NOCACHE_COPY, as far as I understand.
> 
> In some cases, when an application writes to a network socket the data
> written to the socket won't be accessed again once it is copied into the
> kernel. In these cases, NETIF_F_NOCACHE_COPY can improve performance and
> helps to preserve the CPU cache and avoid evicting hot data.
> 
> We get a sizable benefit from this option, too, in situations where we
> can't use splice and have to call write to transmit data to client
> connections. We want to get the same benefit of NETIF_F_NOCACHE_COPY, but
> when writing to Unix sockets as well.
> 
> Let me know if that makes it more clear.

Makes sense, thanks for the explainer.

> > The patches look like a lot of extra indirect calls.  
> 
> Yup. As I mentioned in the cover letter this was mostly a PoC that seems to
> work and increases network throughput in a real world scenario.
> 
> If this general line of thinking (NT copies on write to a Unix socket) is
> acceptable, I'm happy to refactor the code however you (and others) would
> like to get it to an acceptable state.

My only concern is that in a post-Spectre world the indirect calls are
going to be more expensive than a branch would be. But I'm not really
a micro-optimization expert :)
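
To make the trade-off concrete, here is a small user-space sketch of the two dispatch styles. It is an assumption-laden illustration, not the code from the patches: struct copy_ops, copy_cached, copy_nocache, copy_via_pointer and copy_via_branch are hypothetical names, and copy_nocache is a plain memcpy placeholder where a real implementation would use non-temporal stores. With retpoline-style mitigations enabled, the call through the function pointer is the form that tends to get more expensive, while the flag test compiles to a conditional branch and direct calls.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t len);

/* Ordinary copy that goes through the CPU cache. */
static void *copy_cached(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

/* Placeholder only: a real NT copy would use non-temporal stores. */
static void *copy_nocache(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

/* Style 1: per-socket ops table, dispatched with an indirect call. */
struct copy_ops {
	copy_fn copy;
};

static void *copy_via_pointer(const struct copy_ops *ops, void *dst,
			      const void *src, size_t len)
{
	return ops->copy(dst, src, len);
}

/* Style 2: per-socket flag, dispatched with a branch and direct calls. */
static void *copy_via_branch(bool nocache, void *dst, const void *src,
			     size_t len)
{
	if (nocache)
		return copy_nocache(dst, src, len);
	return copy_cached(dst, src, len);
}
```

Both variants copy the same bytes; they differ only in how the copy routine is selected, which is the cost being questioned here.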