Message-ID: <aAq94Zw69XRs45T4@mini-arch>
Date: Thu, 24 Apr 2025 15:40:33 -0700
From: Stanislav Fomichev <stfomichev@...il.com>
To: Mina Almasry <almasrymina@...gle.com>
Cc: Cosmin Ratiu <cratiu@...dia.com>, netdev@...r.kernel.org,
	Jason Gunthorpe <jgg@...dia.com>,
	Leon Romanovsky <leonro@...dia.com>,
	Andrew Lunn <andrew+netdev@...n.ch>,
	"David S . Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	Simon Horman <horms@...nel.org>, Saeed Mahameed <saeedm@...dia.com>,
	Tariq Toukan <tariqt@...dia.com>,
	Dragos Tatulea <dtatulea@...dia.com>,
	linux-kselftest@...r.kernel.org
Subject: Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf
 pools

On 04/24, Mina Almasry wrote:
> On Thu, Apr 24, 2025 at 3:10 PM Stanislav Fomichev <stfomichev@...il.com> wrote:
> >
> > On 04/24, Mina Almasry wrote:
> > > On Wed, Apr 23, 2025 at 1:15 PM Stanislav Fomichev <stfomichev@...il.com> wrote:
> > > >
> > > > On 04/23, Mina Almasry wrote:
> > > > > On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@...dia.com> wrote:
> > > > > >
> > > > > > Drivers that are told to allocate RX buffers from pools of DMA memory
> > > > > > should have enough memory in the pool to satisfy projected allocation
> > > > > > requests (a function of ring size, MTU & other parameters). If there's
> > > > > > not enough memory, RX ring refill might fail later at inconvenient times
> > > > > > (e.g. during NAPI poll).
> > > > > >
> > > > >
> > > > > My understanding is that if the RX ring refill fails, the driver will
> > > > > post the buffers it was able to allocate data for, and will not post
> > > > > other buffers. So it will run with degraded performance, but nothing
> > > > > overly bad should happen. This should be the same behavior if the
> > > > > machine is under memory pressure.
> > > > >
> > > > > In general I don't know about this change. If the user wants to use
> > > > > very small dmabufs, they should be able to, without jumping through
> > > > > hoops to reduce the number of rx ring slots the driver has (if it
> > > > > supports configuring that).
> > > > >
> > > > > I think maybe printing an error or warning that the dmabuf is too
> > > > > small for the pool_size may be fine. But outright failing this
> > > > > configuration? I don't think so.
> > > > >
> > > > > > This commit adds a check at dmabuf pool init time that compares the
> > > > > > amount of memory in the underlying chunk pool (configured by the user
> > > > > > space application providing dmabuf memory) with the desired pool size
> > > > > > (previously set by the driver) and fails with an error message if chunk
> > > > > > memory isn't enough.
> > > > > >
> > > > > > Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider")
> > > > > > Signed-off-by: Cosmin Ratiu <cratiu@...dia.com>
> > > > > > ---
> > > > > >  net/core/devmem.c | 11 +++++++++++
> > > > > >  1 file changed, 11 insertions(+)
> > > > > >
> > > > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > > > index 6e27a47d0493..651cd55ebb28 100644
> > > > > > --- a/net/core/devmem.c
> > > > > > +++ b/net/core/devmem.c
> > > > > > @@ -299,6 +299,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> > > > > >  int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > > > >  {
> > > > > >         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> > > > > > +       size_t size;
> > > > > >
> > > > > >         if (!binding)
> > > > > >                 return -EINVAL;
> > > > > > @@ -312,6 +313,16 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> > > > > >         if (pool->p.order != 0)
> > > > > >                 return -E2BIG;
> > > > > >
> > > > > > +       /* Validate that the underlying dmabuf has enough memory to satisfy
> > > > > > +        * requested pool size.
> > > > > > +        */
> > > > > > +       size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
> > > > > > +       if (size < pool->p.pool_size) {
> > > > >
> > > > > pool_size seems to be the number of ptr_ring slots in the page_pool,
> > > > > not some upper or lower bound on the amount of memory the page_pool
> > > > > can provide. So this check seems useless? The page_pool can still fail to
> > > > > provide this amount of memory with dmabuf (if the netmems aren't being
> > > > > recycled fast enough) or with normal memory (under memory pressure).
> > > >
> > > > I read this check more as "are there enough chunks in the binding to
> > > > fully fill the page pool". The user controls the size of the rx ring
> > >
> > > Only on drivers that support ethtool -G, and where it will let you
> > > configure -G to what you want.
> >
> > gve is in the minority here; any major NIC (brcm/mlx/intel) supports
> > resizing the rings.
> >
> 
> GVE supports resizing rings. Other drivers may not. Even on drivers
> that support resizing rings, some users may have a use case for a
> dmabuf smaller than the minimum ring size their driver accepts.
> 
> > > > which
> > > > controls the size of the page pool which somewhat dictates the minimal
> > > > size of the binding (maybe).
> > >
> > > See the test I ran in the other thread. Seems at least GVE is fine
> > > with dmabuf size < ring size. I don't know what other drivers do, but
> > > generally speaking I think specific driver limitations should not
> > > limit what others can do with their drivers. Sure, for the GPU mem
> > > applications you're probably looking at, the dmabufs are huge and
> > > supporting small dmabufs is not a concern, but someone somewhere may
> > > want to run with a 1 MB dmabuf for some use case, and if their driver
> > > is fine with it, core should not prevent it, I think.
> > >
> > > > So it's more of a sanity check.
> > > >
> > > > Maybe having better defaults in ncdevmem would've been a better option? It
> > > > allocates (16000*4096) bytes (slightly less than 64MB, why? to fit into
> > > > default /sys/module/udmabuf/parameters/size_limit_mb?) and on my setup
> > > > PP wants to get 64MB at least..
> > >
> > > Yeah, udmabuf has a limitation that it only supports 64MB max size
> > > last I looked.
> >
> > We can use /sys/module/udmabuf/parameters/size_limit_mb to allocate
> > more than 64MB, ncdevmem can change it.
> 
> The udmabuf limit is hardcoded in udmabuf.c or configured on module
> load, and ncdevmem doesn't load udmabuf. I guess it could be changed
> to do that, but currently ncdevmem works with CONFIG_UDMABUF=y.

You don't need to load/reload the module to change module params:

# id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys)
# cat /sys/module/udmabuf/parameters/size_limit_mb
64
# echo 128 > /sys/module/udmabuf/parameters/size_limit_mb
# cat /sys/module/udmabuf/parameters/size_limit_mb
128
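
If we wanted ncdevmem to take care of this itself, a minimal sketch could
look like the helper below (hypothetical, not in the current selftest; it
assumes root and the module parameter path shown above):

#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: bump the udmabuf allocation cap so the selftest
 * can create a dmabuf large enough to cover the page pool. Writes the
 * module parameter shown above; needs root.
 */
static int udmabuf_set_size_limit_mb(unsigned int mb)
{
	FILE *f = fopen("/sys/module/udmabuf/parameters/size_limit_mb", "w");

	if (!f)
		return -errno;
	if (fprintf(f, "%u\n", mb) < 0) {
		fclose(f);
		return -EIO;
	}
	return fclose(f) ? -errno : 0;
}

Calling it with, say, 1024 before creating the udmabuf would also cover the
"tune it up to something sensible (1G?)" option below.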

> > Or warn the user similar
> > to what kperf does: https://github.com/facebookexperimental/kperf/blob/main/devmem.c#L308
> >
> > So either having a kernel warn or tuning 63MB up to something sensible
> > (1G?) should prevent people from going through the pain..
> >
> 
> Agreed with both. Another option is updating the devmem.rst docs:
> 
> "Some drivers may struggle to run devmem TCP when the dmabuf size is
> too small, especially when it's smaller than the number of rx ring
> slots. Look for this warning in dmesg." etc.
> 
> But I don't see the need to outright disable this "dmabuf size < ring
> size" use case for everyone to solve this.

Agreed. The fact that mlx5 has issues with small pp should be fixed;
I'm not arguing with that. I'm trying to understand whether giving
a hint to the user about dmabuf < pp size is helpful or not (because
it will most likely lead to poor perf, and perf is the main point
of devmem).
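
To make that concrete, a minimal sketch of such a hint (reusing the size
computation from this patch in mp_dmabuf_devmem_init(), but demoting the
failure to a warning; untested, message wording is mine):

	/* Hypothetical variant: warn when the dmabuf cannot cover the
	 * whole pool instead of rejecting the binding, so small-dmabuf
	 * setups keep working, just with a hint about likely refill
	 * failures.
	 */
	size = gen_pool_size(binding->chunk_pool) >> PAGE_SHIFT;
	if (size < pool->p.pool_size)
		pr_warn("%s: dmabuf (%zu pages) is smaller than pool size (%u), expect refill failures and degraded performance\n",
			__func__, size, pool->p.pool_size);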

> > > I added devmem TCP support with udmabuf selftests to demonstrate that
> > > the feature is non-proprietary, not to advertise that devmem tcp +
> > > udmabuf is a great combination. udmabuf is actually terrible for
> > > devmem TCP. The 64MB limit is way too small for anyone to do anything
> > > performant with it, and by DMAing into host memory you lose many of the
> > > benefits of devmem TCP (lower mem bw + pcie bw utilization).
> >
> > It would still be nice to have udmabuf as a properly supported option.
> > This can drive the UAPI performance conversations: for example, comparing
> > existing tcp rx zerocopy vs MSG_SOCK_DEVMEM.. So let's not completely
> > dismiss it. We've played internally with doing 2MB udmabuf huge-pages,
> > might post it at some point..
> >
> > > If you're running real experiments with devmem TCP I suggest moving to
> > > real dmabufs as soon as possible, or at least hack udmabuf to give you
> > > large sizes. We've open sourced our production devmem TCP userspace:
> > >
> > > https://github.com/google/tcpgpudmarxd
> > > https://github.com/google/nccl-plugin-gpudirecttcpx
> > >
> > > Porting it to upstream APIs + your dmabuf provider will let you run
> > > much more interesting tests than anything you can do with udmabuf,
> > > I think, unless you hack the udmabuf size.
> >
> > I found these a bit too late, so I reimplemented the plugin over
> > upstream APIs :-[
> 
> Oh, where? Is it open source?

No, but depending on where we end up I think it should be possible
to make it open source. There is no secret sauce in there..
