Message-ID: <aMwU4YPF+ERN9qxc@devvm11784.nha0.facebook.com>
Date: Thu, 18 Sep 2025 07:19:13 -0700
From: Bobby Eshleman <bobbyeshleman@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>,
Willem de Bruijn <willemb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
David Ahern <dsahern@...nel.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, Stanislav Fomichev <sdf@...ichev.me>,
Bobby Eshleman <bobbyeshleman@...a.com>
Subject: Re: [PATCH net-next v2 2/3] net: devmem: use niov array for token
management
On Wed, Sep 17, 2025 at 04:55:34PM -0700, Mina Almasry wrote:
> On Thu, Sep 11, 2025 at 10:28 PM Bobby Eshleman <bobbyeshleman@...il.com> wrote:
> >
> > From: Bobby Eshleman <bobbyeshleman@...a.com>
> >
> > Improve CPU performance of devmem token management by using page offsets
> > as dmabuf tokens, so that lookups become direct array accesses instead of
> > xarray lookups. Consequently, the xarray can be removed. The result is an
> > average 5% reduction in CPU cycles spent by devmem RX user threads.
> >
> > This patch changes the meaning of tokens. Tokens previously referred to
> > unique fragments of pages. In this patch, tokens instead represent
> > references to pages, not fragments. Because of this, multiple tokens may
> > refer to the same page and so have identical values (e.g., two small
> > fragments may coexist on the same page). The token and offset pair that
> > the user receives still uniquely identifies a fragment if needed. This
> > assumes that the user is not attempting to sort / uniq the token list
> > using tokens alone.
> >
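To make the new token semantics concrete, here is a rough user-space
sketch (struct and function names are made up for illustration, not the
kernel code): the token is just the page index of the frag within the
dmabuf, so two frags on the same page share a token and are only told
apart by the (token, offset) pair.

  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_SHIFT 12

  struct frag {
          uint64_t dmabuf_off;    /* byte offset of the frag in the dmabuf */
          uint32_t len;
  };

  static uint32_t frag_token(const struct frag *f)
  {
          /* dmabuf token == page index within the dmabuf */
          return (uint32_t)(f->dmabuf_off >> PAGE_SHIFT);
  }

  int main(void)
  {
          /* two small frags sharing page 1 of the dmabuf */
          struct frag a = { .dmabuf_off = 4096 + 100, .len = 200 };
          struct frag b = { .dmabuf_off = 4096 + 700, .len = 300 };

          /* same token, distinguished only by the offset */
          printf("a: token=%u off=%u\n", frag_token(&a),
                 (uint32_t)(a.dmabuf_off & 4095));
          printf("b: token=%u off=%u\n", frag_token(&b),
                 (uint32_t)(b.dmabuf_off & 4095));
          return 0;
  }
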
> > A new restriction is added to the implementation: devmem RX sockets
> > cannot switch dmabuf bindings. In practice, this is a symptom of invalid
> > configuration as a flow would have to be steered to a different queue or
> > device where there is a different binding, which is generally bad for
> > TCP flows. This restriction is necessary because the 32-bit dmabuf token
> > does not have enough bits to represent both the pages in a large dmabuf
> > and also a binding or dmabuf ID. For example, a system with 8 NICs and
> > 32 queues requires 8 bits for a binding / queue ID (8 NICs * 32 queues
> > == 256 queues total == 2^8), which leaves only 24 bits for dmabuf pages
> > (2^24 * 4096 / (1<<30) == 64GB). This is insufficient: many current
> > systems have device and queue counts that need more than 8 bits, and
> > dmabufs can already exceed 64GB (for a concrete upper bound, my current
> > H100 has 80GB of GPU memory per device).
> >
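For anyone following along, the 32-bit budget above boils down to this
back-of-the-envelope calculation (plain user-space C, purely
illustrative):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          /* 8 NICs * 32 queues = 256 bindings -> 8 bits of binding/queue ID */
          unsigned int binding_bits = 8;
          unsigned int page_bits = 32 - binding_bits;  /* left for the page index */
          uint64_t max_dmabuf = (1ULL << page_bits) * 4096;

          /* prints 64 GB -- already smaller than one 80GB H100 */
          printf("max addressable dmabuf: %llu GB\n",
                 (unsigned long long)(max_dmabuf >> 30));
          return 0;
  }
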
> > Using kperf[1] with 4 flows and workers, this patch improves receive
> > worker CPU util by ~4.9% with slightly better throughput.
> >
> > Before, mean cpu util for rx workers ~83.6%:
> >
> > Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > Average: 4 2.30 0.00 79.43 0.00 0.65 0.21 0.00 0.00 0.00 17.41
> > Average: 5 2.27 0.00 80.40 0.00 0.45 0.21 0.00 0.00 0.00 16.67
> > Average: 6 2.28 0.00 80.47 0.00 0.46 0.25 0.00 0.00 0.00 16.54
> > Average: 7 2.42 0.00 82.05 0.00 0.46 0.21 0.00 0.00 0.00 14.86
> >
> > After, mean cpu util % for rx workers ~78.7%:
> >
> > Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > Average: 4 2.61 0.00 73.31 0.00 0.76 0.11 0.00 0.00 0.00 23.20
> > Average: 5 2.95 0.00 74.24 0.00 0.66 0.22 0.00 0.00 0.00 21.94
> > Average: 6 2.81 0.00 73.38 0.00 0.97 0.11 0.00 0.00 0.00 22.73
> > Average: 7 3.05 0.00 78.76 0.00 0.76 0.11 0.00 0.00 0.00 17.32
> >
> > Mean throughput improves, but falls within a standard deviation (~45GB/s
> > for 4 flows on a 50GB/s NIC, one hop).
> >
> > This patch adds an array of atomics for counting the tokens returned to
> > the user for a given page. There is a 4-byte atomic per page in the
> > dmabuf per socket. Given a 2GB dmabuf, this array is 2MB.
> >
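In case it helps review, here is a rough user-space mock of that
bookkeeping (names invented for illustration, not the actual patch): one
4-byte atomic per dmabuf page, bumped when tcp_recvmsg_dmabuf hands a
token to the user and dropped when the user returns it via
SO_DEVMEM_DONTNEED. That is where the 2GB / 4096 * 4 bytes = 2MB figure
comes from.

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdlib.h>

  struct sock_tokens {
          size_t npages;          /* dmabuf size / PAGE_SIZE */
          atomic_uint *urefs;     /* tokens outstanding per page */
  };

  static int sock_tokens_init(struct sock_tokens *st, size_t dmabuf_bytes)
  {
          st->npages = dmabuf_bytes / 4096;
          st->urefs = calloc(st->npages, sizeof(*st->urefs));  /* 2MB for 2GB */
          return st->urefs ? 0 : -1;
  }

  /* recvmsg path: a token (page index) is handed to the user */
  static void token_get(struct sock_tokens *st, uint32_t token)
  {
          atomic_fetch_add(&st->urefs[token], 1);
  }

  /* SO_DEVMEM_DONTNEED path: the user returns the token */
  static int token_put(struct sock_tokens *st, uint32_t token)
  {
          if (atomic_load(&st->urefs[token]) == 0)
                  return -1;      /* token was never outstanding */
          atomic_fetch_sub(&st->urefs[token], 1);
          return 0;
  }
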
>
> I think this may be an issue. A typical devmem application doing real
> work will probably use a dmabuf around this size and will have
> thousands of connections. For algorithms like all-to-all I believe
> every node needs a number of connections to each other node, and it's
> common to see 10K devmem connections while a training is happening or
> what not.
>
> Having (2MB * 10K) = 20GB of extra memory required just for this
> book-keeping is a bit hard to swallow. Do you know what the existing
> memory footprint of the xarrays is? Were they large anyway
> (we're not actually adding more memory), or is the 2MB entirely new?
>
> If it's entirely new, I think we may need to resolve that somehow. One
> option is to implement a resizable array... IDK if that would be more
> efficient, especially since we need to lock it in the
> tcp_recvmsg_dmabuf and in the setsockopt.
>
I can measure the xarray's footprint on some workloads and see. My guess
is it'll be quite a bit smaller than the aggregate of the per-socket arrays.
> Another option is to track the userrefs per-binding, not per socket.
> If we do that, we can't free user refs the user leaves behind when
> they close the socket (or crash). We can only clear refs on dmabuf
> unbind. We have to trust the user to do the right thing. I'm finding
> it hard to verify that our current userspace is careful about not
> leaving refs behind. We'd have to run thorough tests and stuff against
> your series.
>
I can give this a try and test it on our end too; this would work for us.
Thanks!
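
For concreteness, roughly what I'm picturing for the per-binding
approach (all names are made up to illustrate the idea, not a real
implementation):

  #include <stdatomic.h>
  #include <stddef.h>
  #include <stdlib.h>

  struct binding_tokens {
          size_t npages;          /* dmabuf size / PAGE_SIZE */
          atomic_uint *urefs;     /* shared by every socket on this binding */
  };

  static int binding_tokens_init(struct binding_tokens *bt, size_t dmabuf_bytes)
  {
          bt->npages = dmabuf_bytes / 4096;
          bt->urefs = calloc(bt->npages, sizeof(*bt->urefs));
          return bt->urefs ? 0 : -1;
  }

  /* dmabuf unbind: whatever refs the user left behind are dropped here */
  static void binding_tokens_release(struct binding_tokens *bt)
  {
          free(bt->urefs);
          bt->urefs = NULL;
          bt->npages = 0;
  }

The trade-off is exactly what you describe: there is no per-socket
cleanup path anymore, so refs left behind by a closed or crashed socket
sit there until unbind, and we have to trust userspace in the meantime.
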
Best,
Bobby