netdev - Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf pools

Message-ID: <aAq2y_awPoGqhjdp@mini-arch>
Date: Thu, 24 Apr 2025 15:10:19 -0700
From: Stanislav Fomichev <stfomichev@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: Cosmin Ratiu <cratiu@...dia.com>, netdev@...r.kernel.org,
	Jason Gunthorpe <jgg@...dia.com>,
	Leon Romanovsky <leonro@...dia.com>,
	Andrew Lunn <andrew+netdev@...n.ch>,
	"David S . Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	Simon Horman <horms@...nel.org>, Saeed Mahameed <saeedm@...dia.com>,
	Tariq Toukan <tariqt@...dia.com>,
	Dragos Tatulea <dtatulea@...dia.com>,
	linux-kselftest@...r.kernel.org
Subject: Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf
 pools

On 04/24, Mina Almasry wrote:
> On Wed, Apr 23, 2025 at 1:15 PM Stanislav Fomichev <stfomichev@...il.com> wrote:
> >
> > On 04/23, Mina Almasry wrote:
> > > On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@...dia.com> wrote:
> > > >
> > > > Drivers that are told to allocate RX buffers from pools of DMA memory
> > > > should have enough memory in the pool to satisfy projected allocation
> > > > requests (a function of ring size, MTU & other parameters). If there's
> > > > not enough memory, RX ring refill might fail later at inconvenient times
> > > > (e.g. during NAPI poll).
> > > >
> > >
> > > My understanding is that if the RX ring refill fails, the driver will
> > > post the buffers it was able to allocate data for, and will not post
> > > other buffers. So it will run with a degraded performance but nothing
> > > overly bad should happen. This should be the same behavior if the
> > > machine is under memory pressure.
> > >
> > > In general I don't know about this change. If the user wants to use
> > > very small dmabufs, they should be able to, without going through
> > > hoops reducing the number of rx ring slots the driver has (if it
> > > supports configuring that).
> > >
> > > I think maybe printing an error or warning that the dmabuf is too
> > > small for the pool_size may be fine. But outright failing this
> > > configuration? I don't think so.
> > >
> > > > This commit adds a check at dmabuf pool init time that compares the
> > > > amount of memory in the underlying chunk pool (configured by the user
> > > > space application providing dmabuf memory) with the desired pool size
> > > > (previously set by the driver) and fails with an error message if chunk
> > > > memory isn't enough.
> > > >
> > > > Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> > > > Signed-off-by: Cosmin Ratiu <cratiu@...dia.com>
> > > > ---
> > > >  net/core/devmem.c | 11 +++++++++++
> > > >  1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > index 6e27a47d0493..651cd55ebb28 100644
> > > > --- a/net/core/devmem.c
> > > > +++ b/net/core/devmem.c
> > > > @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> > > >  int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > >  {
> > > >         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> > > > +       size_t size;
> > > >
> > > >         if (!binding)
> > > >                 return -EINVAL;
> > > > @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > >         if (pool->p.order != 0)
> > > >                 return -E2BIG;
> > > >
> > > > +       /* Validate that the underlying dmabuf has enough memory to satisfy
> > > > +        * requested pool size.
> > > > +        */
> > > > +       size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> > > > +       if (size < pool->p.pool_size) {
> > >
> > > pool_size seems to be the number of ptr_ring slots in the page_pool,
> > > not some upper or lower bound on the amount of memory the page_pool
> > > can provide. So this check seems useless? The page_pool can still not
> > > provide this amount of memory with dmabuf (if the netmems aren't being
> > > recycled fast enough) or with normal memory (under memory pressure).
> >
> > I read this check more as "are there enough chunks in the binding to
> > fully fill the page pool". User controls the size of rx ring
> 
> Only on drivers that support ethtool -G, and where it will let you
> configure -G to what you want.

gve is the minority here; any major NIC (brcm/mlx/intel) supports resizing
the rings.

> > which
> > controls the size of the page pool which somewhat dictates the minimal
> > size of the binding (maybe).
> 
> See the test I ran in the other thread. Seems at least GVE is fine
> with dmabuf size < ring size. I don't know what other drivers do, but
> generally speaking I think specific driver limitations should not
> limit what others can do with their drivers. Sure for the GPU mem
> applications you're probably looking at the dmabufs are huge and
> supporting small dmabufs is not a concern, but someone somewhere may
> want to run with 1 MB dmabuf for some use case and if their driver is
> fine with it, core should not prevent it, I think.
> 
> > So it's more of a sanity check.
> >
> > Maybe having better defaults in ncdevmem would've been a better option? It
> > allocates (16000*4096) bytes (slightly less than 64MB, why? to fit into
> > default /sys/module/udmabuf/parameters/size_limit_mb?) and on my setup
> > PP wants to get 64MB at least..
> 
> Yeah, udmabuf has a limitation that it only supports 64MB max size
> last I looked.

We can use /sys/module/udmabuf/parameters/size_limit_mb to allocate
more than 64MB; ncdevmem can change it. Or warn the user, similar
to what kperf does: https://github.com/facebookexperimental/kperf/blob/main/devmem.c#L308

So either having a kernel warning or tuning 63MB up to something sensible
(1G?) should prevent people from going through the pain.

> I added devmem TCP support with udmabuf selftests to demonstrate that
> the feature is non-proprietary, not to advertise that devmem tcp +
> udmabuf is a great combination. udmabuf is actually terrible for
> devmem TCP. The 64MB limit is way too small for anyone to do anything
> performant on it and by dmaing into host memory you lose many of the
> benefits of devmem TCP (lower mem bw + pcie bw utilization).

It would still be nice to have udmabuf as a properly supported option.
This can drive the UAPI performance conversions: for example, comparing
existing TCP rx zerocopy vs MSG_SOCK_DEVMEM. So let's not completely
dismiss it. We've played internally with doing 2MB udmabuf huge-pages,
might post it at some point.

> If you're running real experiments with devmem TCP I suggest moving to
> real dmabufs as soon as possible, or at least hack udmabuf to give you
> large sizes. We've open sourced our production devmem TCP userspace:
> 
> https://github.com/google/tcpgpudmarxd
> https://github.com/google/nccl-plugin-gpudirecttcpx
> 
> Porting it to upstream APIs + your dmabuf provider will have you run
> much more interesting tests than anything you do with udmabuf I think,
> unless you hack the udmabuf size.

I found these a bit too late, so I reimplemented the plugin over
upstream APIs :-[ Plus, you yourself have acked [0], guess why
I sent this patch :-D Once the tx part is accepted, we'll upstream
kperf cuda support as well..

0: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=8b9049af8066b4705d83bb7847ee3c960fc58d09