Message-ID: <CAHS8izPNC65OUr4j1AjWFwRi-kNFobQ=cS4UtwNSVJsZrw19nQ@mail.gmail.com>
Date: Wed, 17 Sep 2025 16:55:34 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Bobby Eshleman <bobbyeshleman@...il.com>
Cc: "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Kuniyuki Iwashima <kuniyu@...gle.com>, Willem de Bruijn <willemb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>, David Ahern <dsahern@...nel.org>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, Stanislav Fomichev <sdf@...ichev.me>,
Bobby Eshleman <bobbyeshleman@...a.com>
Subject: Re: [PATCH net-next v2 2/3] net: devmem: use niov array for token management
On Thu, Sep 11, 2025 at 10:28 PM Bobby Eshleman <bobbyeshleman@...il.com> wrote:
>
> From: Bobby Eshleman <bobbyeshleman@...a.com>
>
> Improve CPU performance of devmem token management by using page offsets
> as dmabuf tokens and using them for direct array access lookups instead
> of xarray lookups. Consequently, the xarray can be removed. The result
> is an average 5% reduction in CPU cycles spent by devmem RX user
> threads.
>
> This patch changes the meaning of tokens. Tokens previously referred to
> unique fragments of pages; with this patch they instead refer to pages.
> Because of this, multiple tokens may refer to the same page and so have
> identical values (e.g., two small fragments may coexist on the same
> page). The token and offset pair that the user receives still uniquely
> identifies a fragment if needed. This assumes that the user is not
> attempting to sort / uniq the token list using tokens alone.
>
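The page-offset token scheme described above can be sketched as follows. This is a minimal illustration only; the `frag` struct and `frag_token()` helper are hypothetical stand-ins for the kernel's net_iov machinery, not the patch's actual identifiers:

```c
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* Hypothetical stand-in for a net_iov: a fragment's byte range
 * within the dmabuf. */
struct frag {
	size_t offset;	/* byte offset of the fragment in the dmabuf */
	size_t len;
};

/* The token is the page index, so it can be used for a direct array
 * lookup. Two fragments on the same page produce the same token. */
static unsigned int frag_token(const struct frag *f)
{
	return (unsigned int)(f->offset >> PAGE_SHIFT);
}

/* Only the (token, in-page offset) pair uniquely identifies a fragment. */
static size_t frag_page_offset(const struct frag *f)
{
	return f->offset % PAGE_SIZE;
}
```

For example, fragments at offsets 8192 and 8292 both yield token 2 and are distinguished only by their in-page offsets.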
> A new restriction is added to the implementation: devmem RX sockets
> cannot switch dmabuf bindings. In practice, this is a symptom of invalid
> configuration as a flow would have to be steered to a different queue or
> device where there is a different binding, which is generally bad for
> TCP flows. This restriction is necessary because the 32-bit dmabuf token
> does not have enough bits to represent both the pages in a large dmabuf
> and also a binding or dmabuf ID. For example, a system with 8 NICs and
> 32 queues requires 8 bits for a binding / queue ID (8 NICs * 32 queues
> == 256 queues total == 2^8), which leaves only 24 bits for dmabuf pages
> (2^24 * 4096 / (1<<30) == 64GB). This is insufficient for the device
> and queue counts on many current systems, or for systems that need
> larger GPU dmabufs (as a hard-limit example, my current H100 has 80GB
> of GPU memory per device).
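As a sanity check on that bit-budget arithmetic, a standalone calculation (not kernel code):

```c
#include <stdint.h>

/* If a 32-bit token were split into a binding/queue ID part and a
 * page-index part, the addressable dmabuf size would be
 * 2^(32 - id_bits) pages of 4 KiB each. */
static uint64_t max_dmabuf_bytes(unsigned id_bits)
{
	unsigned page_bits = 32 - id_bits;

	return (uint64_t)1 << page_bits << 12;	/* pages * 4096 */
}
```

With 8 NICs * 32 queues = 256 IDs, 8 ID bits are needed, leaving 24 page-index bits: `max_dmabuf_bytes(8)` is 64 GiB, below the 80GB of a single H100.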
>
> Using kperf[1] with 4 flows and workers, this patch improves receive
> worker CPU util by ~4.9% with slightly better throughput.
>
> Before, mean cpu util for rx workers ~83.6%:
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 4 2.30 0.00 79.43 0.00 0.65 0.21 0.00 0.00 0.00 17.41
> Average: 5 2.27 0.00 80.40 0.00 0.45 0.21 0.00 0.00 0.00 16.67
> Average: 6 2.28 0.00 80.47 0.00 0.46 0.25 0.00 0.00 0.00 16.54
> Average: 7 2.42 0.00 82.05 0.00 0.46 0.21 0.00 0.00 0.00 14.86
>
> After, mean cpu util % for rx workers ~78.7%:
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: 4 2.61 0.00 73.31 0.00 0.76 0.11 0.00 0.00 0.00 23.20
> Average: 5 2.95 0.00 74.24 0.00 0.66 0.22 0.00 0.00 0.00 21.94
> Average: 6 2.81 0.00 73.38 0.00 0.97 0.11 0.00 0.00 0.00 22.73
> Average: 7 3.05 0.00 78.76 0.00 0.76 0.11 0.00 0.00 0.00 17.32
>
> Mean throughput improves, but falls within a standard deviation (~45GB/s
> for 4 flows on a 50GB/s NIC, one hop).
>
> This patch adds an array of atomics for counting the tokens returned to
> the user for a given page. There is a 4-byte atomic per page in the
> dmabuf per socket. Given a 2GB dmabuf, this array is 2MB.
>
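The bookkeeping cost quoted above follows directly from keeping one 4-byte atomic per 4 KiB page; a back-of-the-envelope sketch (again, not the patch's actual code):

```c
#include <stdint.h>

/* Size of the per-socket userref array: one 4-byte atomic counter
 * for each 4 KiB page of the dmabuf. */
static uint64_t userref_array_bytes(uint64_t dmabuf_bytes)
{
	uint64_t pages = dmabuf_bytes >> 12;	/* dmabuf_bytes / 4096 */

	return pages * sizeof(uint32_t);
}
```

A 2 GiB dmabuf has 524288 pages, so `userref_array_bytes(2ULL << 30)` is 2 MiB, and since the array is per socket it multiplies with the connection count.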
I think this may be an issue. A typical devmem application doing real
work will probably use a dmabuf of around this size and will have
thousands of connections. For algorithms like all-to-all, I believe
every node needs a number of connections to each other node, and it's
common to see ~10K devmem connections while a training job is running.

Having (2MB * 10K) = 20GB of extra memory now required just for this
book-keeping is a bit hard to swallow. Do you know what the existing
memory footprint of the xarrays is? Were they large anyway (so we're
not actually adding more memory), or is the 2MB entirely new? If it's
entirely new, I think we need to resolve that somehow. One option is to
implement a resizable array... I don't know if that would be more
efficient, especially since we'd need to lock it in tcp_recvmsg_dmabuf
and in the setsockopt.

Another option is to track the userrefs per-binding rather than
per-socket. If we do that, we can't free refs the user leaves behind
when they close the socket (or crash); we can only clear them on dmabuf
unbind, so we'd have to trust the user to do the right thing. I'm
finding it hard to verify that our current userspace is careful about
not leaving refs behind, so we'd have to run thorough tests against
your series.
--
Thanks,
Mina