Message-ID: <20220512010153.GA74055@fastly.com>
Date:   Wed, 11 May 2022 18:01:54 -0700
From:   Joe Damato <jdamato@...tly.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     netdev@...r.kernel.org, davem@...emloft.net,
        linux-kernel@...r.kernel.org, x86@...nel.org
Subject: Re: [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write
 path

On Wed, May 11, 2022 at 04:25:20PM -0700, Jakub Kicinski wrote:
> On Tue, 10 May 2022 20:54:21 -0700 Joe Damato wrote:
> > Initial benchmarks are extremely encouraging. I wrote a simple C program to
> > benchmark this patchset, the program:
> >   - Creates a unix socket pair
> >   - Forks a child process
> >   - The parent process writes to the unix socket using MSG_NTCOPY - or not -
> >     depending on the command line flags
> >   - The child process uses splice to move the data from the unix socket to
> >     a pipe buffer, followed by a second splice call to move the data from
> >     the pipe buffer to a file descriptor opened on /dev/null.
> >   - taskset is used when launching the benchmark to ensure the parent and
> >     child run on appropriate CPUs for various scenarios
> 
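(For reference, the core of that benchmark program is roughly the sketch
below. Timing, error handling, and the taskset invocation are omitted,
and MSG_NTCOPY is assumed to come from uapi headers updated by this
series; the flag is the only part specific to these patches.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define CHUNK (1 << 20)

int main(int argc, char *argv[])
{
	static char buf[CHUNK];
	int use_ntcopy = argc > 1 && !strcmp(argv[1], "--ntcopy");
	int sv[2], p[2];

	socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
	pipe(p);

	if (fork() == 0) {
		/* Child: unix socket -> pipe -> /dev/null; the data is
		 * never touched by userspace. */
		int null_fd = open("/dev/null", O_WRONLY);
		ssize_t n;

		close(sv[0]);
		while ((n = splice(sv[1], NULL, p[1], NULL, CHUNK,
				   SPLICE_F_MOVE)) > 0)
			splice(p[0], NULL, null_fd, NULL, n,
			       SPLICE_F_MOVE);
		exit(0);
	}

	/* Parent: write with or without nontemporal copies. */
	close(sv[1]);
	for (int i = 0; i < 1024; i++)
		send(sv[0], buf, sizeof(buf), use_ntcopy ? MSG_NTCOPY : 0);
	close(sv[0]);
	return 0;
}
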
> Is there a practical use case?

Yes; for us there seems to be one, especially on AMD Zen 2. I'll try to
describe such a setup and my synthetic HTTP benchmark results.

Imagine a program, call it storageD, which is responsible for storing and
retrieving data from a data store. Other programs can request data from
storageD by communicating with it over a Unix socket.

One such program that could request data via the Unix socket is an HTTP
daemon. For some client connections that the HTTP daemon receives, the
daemon may determine that responses can be sent in plain text.

In this case, the HTTP daemon can use splice to move data from the Unix
socket connection with storageD directly to the client TCP socket via a
pipe. splice saves CPU cycles and avoids memory access latency, since the
daemon never touches the data itself.
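
Roughly, the daemon's relay path looks like the sketch below (the fd
names are illustrative and error handling is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/* Move a response from the storageD unix socket to the client TCP
 * socket through a pipe, without ever mapping or reading the data. */
static ssize_t relay(int storaged_fd, int client_fd, int pipe_fds[2])
{
	ssize_t n, total = 0;

	while ((n = splice(storaged_fd, NULL, pipe_fds[1], NULL,
			   65536, SPLICE_F_MOVE | SPLICE_F_MORE)) > 0) {
		splice(pipe_fds[0], NULL, client_fd, NULL, n,
		       SPLICE_F_MOVE | SPLICE_F_MORE);
		total += n;
	}
	return total;
}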

Because we'll use splice (instead of accessing the data and thereby
touching the CPU cache), it is advantageous for storageD to use NT copies
when it writes to the Unix socket, so that hot data isn't evicted from
the cache. After all, once the data has been copied into the kernel on
the Unix socket write path, it won't be touched again, only spliced.

In my synthetic HTTP benchmarks for this setup, we've been able to
increase the network throughput of the HTTP daemon by roughly 30% while
reducing the system time of storageD. We're still collecting data on
production workloads.

The motivation, IMHO, is very similar to the motivation for
NETIF_F_NOCACHE_COPY, as far as I understand it.

In some cases, when an application writes to a network socket, the data
won't be accessed again once it has been copied into the kernel. In those
cases, NETIF_F_NOCACHE_COPY can improve performance by preserving the CPU
cache and avoiding the eviction of hot data.

We get a sizable benefit from this option, too, in situations where we
can't use splice and have to call write() to transmit data to client
connections. We want the same benefit as NETIF_F_NOCACHE_COPY, but when
writing to Unix sockets.
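
Concretely, with this series the change on storageD's write path would be
a single flag; a sketch, with illustrative fd and iovec names:

#include <sys/socket.h>
#include <sys/uio.h>

/* Send a reply on the unix socket, requesting nontemporal copies so
 * the copy into the kernel doesn't evict storageD's hot cache lines.
 * MSG_NTCOPY is the flag proposed by this series. */
static ssize_t store_reply(int fd, struct iovec *iov, int iovcnt)
{
	struct msghdr msg = {
		.msg_iov = iov,
		.msg_iovlen = iovcnt,
	};

	return sendmsg(fd, &msg, MSG_NTCOPY);
}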

Let me know if that makes it clearer.

> The patches look like a lot of extra indirect calls.

Yup. As I mentioned in the cover letter, this was mostly a PoC that seems
to work and increases network throughput in a real-world scenario.

If this general line of thinking (NT copies on write to a Unix socket) is
acceptable, I'm happy to refactor the code however you (and others) would
like to get it to an acceptable state.

Thanks for taking a look,
Joe
