Message-ID: <aMwU4YPF+ERN9qxc@devvm11784.nha0.facebook.com>
Date: Thu, 18 Sep 2025 07:19:13 -0700
From: Bobby Eshleman <bobbyeshleman@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>,
Willem de Bruijn <willemb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
David Ahern <dsahern@...nel.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, Stanislav Fomichev <sdf@...ichev.me>,
Bobby Eshleman <bobbyeshleman@...a.com>
Subject: Re: [PATCH net-next v2 2/3] net: devmem: use niov array for token
management
On Wed, Sep 17, 2025 at 04:55:34PM -0700, Mina Almasry wrote:
> On Thu, Sep 11, 2025 at 10:28 PM Bobby Eshleman <bobbyeshleman@...il.com> wrote:
> >
> > From: Bobby Eshleman <bobbyeshleman@...a.com>
> >
> > Improve CPU performance of devmem token management by using page offsets
> > as dmabuf tokens, so that lookups become direct array accesses instead of
> > xarray lookups. Consequently, the xarray can be removed. The result is an
> > average 5% reduction in CPU cycles spent by devmem RX user threads.
> >
> > This patch changes the meaning of tokens. Tokens previously referred to
> > unique fragments of pages. In this patch, tokens instead represent
> > references to pages, not fragments. Because of this, multiple tokens may
> > refer to the same page and so have identical values (e.g., two small
> > fragments may coexist on the same page). The token and offset pair that
> > the user receives still uniquely identifies a fragment if needed. This
> > assumes that the user is not attempting to sort / uniq the token list
> > using tokens alone.
> >
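To make the new token semantics concrete, here is a rough user-space
sketch (struct and function names are made up for illustration, not the
kernel code): the token is just the page index of the frag within the
dmabuf, so two frags on the same page share a token and are only told
apart by the (token, offset) pair.

  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_SHIFT 12

  struct frag {
          uint64_t dmabuf_off;    /* byte offset of the frag in the dmabuf */
          uint32_t len;
  };

  static uint32_t frag_token(const struct frag *f)
  {
          /* dmabuf token == page index within the dmabuf */
          return (uint32_t)(f->dmabuf_off >> PAGE_SHIFT);
  }

  int main(void)
  {
          /* two small frags sharing page 1 of the dmabuf */
          struct frag a = { .dmabuf_off = 4096 + 100, .len = 200 };
          struct frag b = { .dmabuf_off = 4096 + 700, .len = 300 };

          /* same token, distinguished only by the offset */
          printf("a: token=%u off=%u\n", frag_token(&a),
                 (uint32_t)(a.dmabuf_off & 4095));
          printf("b: token=%u off=%u\n", frag_token(&b),
                 (uint32_t)(b.dmabuf_off & 4095));
          return 0;
  }
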
> > A new restriction is added to the implementation: devmem RX sockets
> > cannot switch dmabuf bindings. In practice, this is a symptom of invalid
> > configuration as a flow would have to be steered to a different queue or
> > device where there is a different binding, which is generally bad for
> > TCP flows. This restriction is necessary because the 32-bit dmabuf token
> > does not have enough bits to represent both the pages in a large dmabuf
> > and also a binding or dmabuf ID. For example, a system with 8 NICs and
> > 32 queues requires 8 bits for a binding / queue ID (8 NICs * 32 queues
> > == 256 queues total == 2^8), which leaves only 24 bits for dmabuf pages
> > (2^24 * 4096 / (1<<30) == 64GB). This is insufficient: many current
> > systems have device and queue counts that need more than 8 bits, and
> > dmabufs can already exceed 64GB (for a concrete upper bound, my current
> > H100 has 80GB of GPU memory per device).
> >
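For anyone following along, the 32-bit budget above boils down to this
back-of-the-envelope calculation (plain user-space C, purely
illustrative):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          /* 8 NICs * 32 queues = 256 bindings -> 8 bits of binding/queue ID */
          unsigned int binding_bits = 8;
          unsigned int page_bits = 32 - binding_bits;  /* left for the page index */
          uint64_t max_dmabuf = (1ULL << page_bits) * 4096;

          /* prints 64 GB -- already smaller than one 80GB H100 */
          printf("max addressable dmabuf: %llu GB\n",
                 (unsigned long long)(max_dmabuf >> 30));
          return 0;
  }
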
> > Using kperf[1] with 4 flows and workers, this patch improves receive
> > worker CPU util by ~4.9% with slightly better throughput.
> >
> > Before, mean cpu util for rx workers ~83.6%:
> >
> > Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > Average: 4 2.30 0.00 79.43 0.00 0.65 0.21 0.00 0.00 0.00 17.41
> > Average: 5 2.27 0.00 80.40 0.00 0.45 0.21 0.00 0.00 0.00 16.67
> > Average: 6 2.28 0.00 80.47 0.00 0.46 0.25 0.00 0.00 0.00 16.54
> > Average: 7 2.42 0.00 82.05 0.00 0.46 0.21 0.00 0.00 0.00 14.86
> >
> > After, mean cpu util % for rx workers ~78.7%:
> >
> > Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> > Average: 4 2.61 0.00 73.31 0.00 0.76 0.11 0.00 0.00 0.00 23.20
> > Average: 5 2.95 0.00 74.24 0.00 0.66 0.22 0.00 0.00 0.00 21.94
> > Average: 6 2.81 0.00 73.38 0.00 0.97 0.11 0.00 0.00 0.00 22.73
> > Average: 7 3.05 0.00 78.76 0.00 0.76 0.11 0.00 0.00 0.00 17.32
> >
> > Mean throughput improves, but falls within a standard deviation (~45GB/s
> > for 4 flows on a 50GB/s NIC, one hop).
> >
> > This patch adds an array of atomics for counting the tokens returned to
> > the user for a given page. There is a 4-byte atomic per page in the
> > dmabuf per socket. Given a 2GB dmabuf, this array is 2MB.
> >
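In case it helps review, here is a rough user-space mock of that
bookkeeping (names invented for illustration, not the actual patch): one
4-byte atomic per dmabuf page, bumped when tcp_recvmsg_dmabuf hands a
token to the user and dropped when the user returns it via
SO_DEVMEM_DONTNEED. That is where the 2GB / 4096 * 4 bytes = 2MB figure
comes from.

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdlib.h>

  struct sock_tokens {
          size_t npages;          /* dmabuf size / PAGE_SIZE */
          atomic_uint *urefs;     /* tokens outstanding per page */
  };

  static int sock_tokens_init(struct sock_tokens *st, size_t dmabuf_bytes)
  {
          st->npages = dmabuf_bytes / 4096;
          st->urefs = calloc(st->npages, sizeof(*st->urefs));  /* 2MB for 2GB */
          return st->urefs ? 0 : -1;
  }

  /* recvmsg path: a token (page index) is handed to the user */
  static void token_get(struct sock_tokens *st, uint32_t token)
  {
          atomic_fetch_add(&st->urefs[token], 1);
  }

  /* SO_DEVMEM_DONTNEED path: the user returns the token */
  static int token_put(struct sock_tokens *st, uint32_t token)
  {
          if (atomic_load(&st->urefs[token]) == 0)
                  return -1;      /* token was never outstanding */
          atomic_fetch_sub(&st->urefs[token], 1);
          return 0;
  }
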
>
> I think this may be an issue. A typical devmem application doing real
> work will probably use a dmabuf around this size and will have
> thousands of connections. For algorithms like all-to-all I believe
> every node needs a number of connections to each other node, and it's
> common to see 10K devmem connections while a training is happening or
> what not.
>
> Having (2MB * 10K) = 20GB of extra memory required just for this
> book-keeping is a bit hard to swallow. Do you know what the existing
> memory footprint of the xarrays is? Were they large anyway
> (we're not actually adding more memory), or is the 2MB entirely new?
>
> If it's entirely new, I think we may need to resolve that somehow. One
> option is to implement a resizable array... IDK if that would be more
> efficient, especially since we need to lock it in the
> tcp_recvmsg_dmabuf and in the setsockopt.
>
I can measure the xarray's footprint on some workloads and see. My guess
is it'll be quite a bit smaller than the aggregate of the per-socket arrays.
> Another option is to track the userrefs per-binding, not per socket.
> If we do that, we can't free user refs the user leaves behind when
> they close the socket (or crash). We can only clear refs on dmabuf
> unbind. We have to trust the user to do the right thing. I'm finding
> it hard to verify that our current userspace is careful about not
> leaving refs behind. We'd have to run thorough tests and stuff against
> your series.
>
I can give this a try and test it on our end too; this would work for us.
Thanks!
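
For concreteness, roughly what I'm picturing for the per-binding
approach (all names are made up to illustrate the idea, not a real
implementation):

  #include <stdatomic.h>
  #include <stddef.h>
  #include <stdlib.h>

  struct binding_tokens {
          size_t npages;          /* dmabuf size / PAGE_SIZE */
          atomic_uint *urefs;     /* shared by every socket on this binding */
  };

  static int binding_tokens_init(struct binding_tokens *bt, size_t dmabuf_bytes)
  {
          bt->npages = dmabuf_bytes / 4096;
          bt->urefs = calloc(bt->npages, sizeof(*bt->urefs));
          return bt->urefs ? 0 : -1;
  }

  /* dmabuf unbind: whatever refs the user left behind are dropped here */
  static void binding_tokens_release(struct binding_tokens *bt)
  {
          free(bt->urefs);
          bt->urefs = NULL;
          bt->npages = 0;
  }

The trade-off is exactly what you describe: there is no per-socket
cleanup path anymore, so refs left behind by a closed or crashed socket
sit there until unbind, and we have to trust userspace in the meantime.
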
Best,
Bobby