Message-ID: <9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com>
Date: Thu, 24 Apr 2025 08:47:36 +0000
From: Cosmin Ratiu <cratiu@...dia.com>
To: "almasrymina@...gle.com" <almasrymina@...gle.com>
CC: Jason Gunthorpe <jgg@...dia.com>, "andrew+netdev@...n.ch"
<andrew+netdev@...n.ch>, "davem@...emloft.net" <davem@...emloft.net>, Tariq
Toukan <tariqt@...dia.com>, "linux-kselftest@...r.kernel.org"
<linux-kselftest@...r.kernel.org>, Dragos Tatulea <dtatulea@...dia.com>,
Saeed Mahameed <saeedm@...dia.com>, "pabeni@...hat.com" <pabeni@...hat.com>,
Leon Romanovsky <leonro@...dia.com>, "edumazet@...gle.com"
<edumazet@...gle.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"horms@...nel.org" <horms@...nel.org>, "kuba@...nel.org" <kuba@...nel.org>
Subject: Re: [PATCH net 1/2] net/devmem: Reject insufficiently large dmabuf
pools
On Wed, 2025-04-23 at 10:30 -0700, Mina Almasry wrote:
> On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@...dia.com>
> wrote:
> >
> > Drivers that are told to allocate RX buffers from pools of DMA
> > memory
> > should have enough memory in the pool to satisfy projected
> > allocation
> > requests (a function of ring size, MTU & other parameters). If
> > there's
> > not enough memory, RX ring refill might fail later at inconvenient
> > times
> > (e.g. during NAPI poll).
> >
>
> My understanding is that if the RX ring refill fails, the driver will
> post the buffers it was able to allocate data for, and will not post
> other buffers. So it will run with a degraded performance but nothing
> overly bad should happen. This should be the same behavior if the
> machine is under memory pressure.
What motivated this change was a failure in how mlx5 refills rings
today. For efficiency, ring refill triggered by NAPI polling only
releases old buffers just before allocating new ones, so it effectively
has a built-in assumption that the ring can be filled. Commit
4c2a13236807 ("net/mlx5e: RX, Defer page release in striding rq for
better recycling") is an interesting read here.
For more details, see the do { } while loop in mlx5e_post_rx_mpwqes,
where mlx5e_free_rx_mpwqe and then mlx5e_alloc_rx_mpwqe are called to
free the old buffer and allocate a new one at the same position. This
has excellent cache locality, and the pages returned to the pool are
reused by the new descriptor.
The bug in mlx5 is that with a large MTU & ring size, the ring cannot
be fully populated with rx descriptors because the pool doesn't have
enough memory, and since the empty slots never release anything back to
the pool, the shortfall never recovers. Eventually, rx descriptors are
exhausted and traffic stops.
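To make that failure mode concrete, here is a minimal standalone sketch
(not the actual mlx5 code; the ring/pool sizes and helper names are
made up) of a free-then-alloc refill over a pool smaller than the ring.
Slots that never got a buffer never release one either, so the ring
stays short forever:

/*
 * Simplified, self-contained sketch (not the real driver code) of a
 * deferred-release refill loop: each slot frees its old buffer back to
 * the pool immediately before allocating a replacement. With a pool
 * smaller than the ring, slots that were never filled also never
 * release anything, so the shortfall is permanent.
 */
#include <stdio.h>
#include <stdbool.h>

#define RING_SIZE 8
#define POOL_SIZE 5 /* fewer buffers than ring slots */

static int pool_avail = POOL_SIZE;
static bool slot_filled[RING_SIZE];

static void release_buffer(void) { pool_avail++; }

static bool alloc_buffer(void)
{
	if (pool_avail == 0)
		return false;
	pool_avail--;
	return true;
}

/* Refill in the free-then-alloc style described above. */
static void refill_ring(void)
{
	for (int i = 0; i < RING_SIZE; i++) {
		if (slot_filled[i]) {
			release_buffer();        /* release the old buffer  */
			slot_filled[i] = false;
		}
		slot_filled[i] = alloc_buffer(); /* allocate at same slot   */
	}
}

int main(void)
{
	for (int round = 1; round <= 3; round++) {
		int filled = 0;

		refill_ring();
		for (int i = 0; i < RING_SIZE; i++)
			filled += slot_filled[i];
		printf("round %d: %d/%d descriptors posted\n",
		       round, filled, RING_SIZE);
	}
	return 0;
}

Every round reports 5/8 posted descriptors; the ring never becomes
full, no matter how often refill runs.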
> In general I don't know about this change. If the user wants to use
> very small dmabufs, they should be able to, without going through
> hoops reducing the number of rx ring slots the driver has (if it
> supports configuring that).
>
> I think maybe printing an error or warning that the dmabuf is too
> small for the pool_size may be fine. But outright failing this
> configuration? I don't think so.
For regular memory-backed page pools, there's no hard limit on how big
they can become (other than available physical memory), so this issue
was not seen before.
I haven't looked at other drivers yet, but is it expected that drivers
operate with incompletely filled rings? I assumed that since the user
configured a specific ring size, they expect drivers to be able to use
that size and not silently operate in degraded mode with a smaller
effective ring.
If you think drivers should work in degraded mode, we can look at
improving the ring population code to work better in this scenario.
> pool_size seems to be the number of ptr_ring slots in the page_pool,
> not some upper or lower bound on the amount of memory the page_pool
> can provide. So this check seems useless? The page_pool can still not
> provide this amount of memory with dmabuf (if the netmems aren't
> being
> recycled fast enough) or with normal memory (under memory pressure).
I think pool_size is usually set by drivers based on their parameters,
and it's the max size of pool->ring. The opportunistic check I added
compares this demand with the supply (available chunk memory) and fails
the configuration, on the assumption that there should be enough memory
in the pool to satisfy the driver's needs.
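For illustration, here is a standalone sketch of the kind of
demand-vs-supply comparison I mean (the names, the fixed chunk size and
the exact accounting are hypothetical, not the devmem or page_pool
code):

/*
 * Hypothetical sketch of the supply-vs-demand check discussed above;
 * names and accounting are illustrative only.
 */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_SIZE 4096UL /* assume PAGE_SIZE-sized chunks */

/*
 * Demand: pool_size, the number of ptr_ring slots the driver asked for.
 * Supply: how many CHUNK_SIZE buffers the bound dmabuf can provide.
 */
static bool dmabuf_pool_large_enough(size_t dmabuf_bytes,
				     unsigned int pool_size)
{
	size_t supply = dmabuf_bytes / CHUNK_SIZE;

	return supply >= pool_size;
}

int main(void)
{
	/* e.g. a 4 MiB dmabuf cannot back a 2048-slot pool of 4K chunks */
	size_t dmabuf_bytes = 4UL * 1024 * 1024;
	unsigned int pool_size = 2048;

	if (!dmabuf_pool_large_enough(dmabuf_bytes, pool_size))
		printf("reject: dmabuf provides %zu chunks, pool wants %u\n",
		       dmabuf_bytes / CHUNK_SIZE, pool_size);
	return 0;
}

The point is only the comparison itself: fail the binding up front when
the dmabuf cannot possibly supply what the driver will ask for, rather
than discovering it later during refill.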
Please let me know your thoughts and how to proceed.
Cosmin.