Message-ID: <CAHS8izNe1pttBsZWneDcWjtPEug5fgTGQpASLdv3BeRf37Y_hA@mail.gmail.com>
Date: Wed, 26 Mar 2025 18:37:38 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Saeed Mahameed <saeedm@...dia.com>
Cc: Toke Høiland-Jørgensen <toke@...hat.com>,
"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Jesper Dangaard Brouer <hawk@...nel.org>, Leon Romanovsky <leon@...nel.org>, Tariq Toukan <tariqt@...dia.com>,
Andrew Lunn <andrew+netdev@...n.ch>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Simon Horman <horms@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>,
Yonglong Liu <liuyonglong@...wei.com>, Yunsheng Lin <linyunsheng@...wei.com>,
Pavel Begunkov <asml.silence@...il.com>, Matthew Wilcox <willy@...radead.org>, netdev@...r.kernel.org,
bpf@...r.kernel.org, linux-rdma@...r.kernel.org, linux-mm@...ck.org,
Qiuling Ren <qren@...hat.com>, Yuying Ma <yuma@...hat.com>
Subject: Re: [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and
unmap them when destroying the pool
On Wed, Mar 26, 2025 at 5:29 PM Saeed Mahameed <saeedm@...dia.com> wrote:
>
> On 26 Mar 13:02, Mina Almasry wrote:
> >On Wed, Mar 26, 2025 at 11:22 AM Saeed Mahameed <saeedm@...dia.com> wrote:
> >>
> >> On 25 Mar 16:45, Toke Høiland-Jørgensen wrote:
> >> >When enabling DMA mapping in page_pool, pages are kept DMA mapped until
> >> >they are released from the pool, to avoid the overhead of re-mapping the
> >> >pages every time they are used. This causes resource leaks and/or
> >> >crashes when there are pages still outstanding while the device is torn
> >> >down, because page_pool will attempt an unmap through a non-existent DMA
> >> >device on the subsequent page return.
> >> >
> >>
> >> Why dynamically track when it is guaranteed the page_pool consumer (driver)
> >> will return all outstanding pages before disabling the DMA device.
> >> When a page pool is destroyed by the driver, just mark it as "DMA-inactive",
> >> and on page_pool_return_page() if DMA-inactive don't recycle those pages
> >> and immediately DMA unmap and release them.
> >
> >That doesn't work, AFAIU. DMA unmapping after page_pool_destroy has
> >been called is what's causing the very bug this series is trying to
> >fix. What happens is:
> >
> >1. Driver calls page_pool_destroy,
>
> Here the driver should already have returned all inflight pages to the
> pool. The problem is that those returned pages are recycled back instead
> of being DMA unmapped; all we need to do is mark the page_pool as "don't
> recycle" after the driver has destroyed it.
>
> >2. Driver removes the net_device (and I guess the associated iommu
> >structs go away with it).
>
> If the driver had not yet released those pages at this point, then there is
> a more serious leak than just a DMA leak. If the pool has those pages, which
> is probably the case, then they were already released by the driver; the
> problem is that they were recycled back.
>
> >3. Page-pool tries to unmap after page_pool_destroy is called, trying
> >to fetch iommu resources that have been freed due to the netdevice
> >gone away = bad stuff.
> >
> >(but maybe I misunderstood your suggestion)
>
> Yes, see above. But I just double-checked the code: I thought that
> page_pool_destroy would wait for all inflight pages to be released before
> returning to the caller, but apparently I was mistaken. If the pages are
> still being held by the stack/user-space then they will still be considered
> "inflight" although the driver is done with them :/.
>
Right, I think we're on the same page now. page_pool_destroy doesn't
(currently) wait for all the inflight pages to be returned before it
returns to the driver calling it, even if the driver is fully done
with all the pages. There could be pages still held by the
userspace/net stack.
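
To sketch it (a simplified illustration of the current destroy path, not
the literal code in net/core/page_pool.c; the helpers marked below are
illustrative names, not real kernel symbols):

	/* Simplified sketch: page_pool_destroy() returns to the driver
	 * right away. If pages are still inflight, a delayed work item
	 * re-checks periodically; nothing ever blocks on them.
	 */
	void page_pool_destroy(struct page_pool *pool)
	{
		page_pool_disable_recycling(pool);	/* illustrative helper */
		page_pool_release_held_pages(pool);	/* illustrative helper */

		if (!page_pool_inflight(pool)) {
			page_pool_free(pool);
			return;
		}

		/* Pages still held by the stack/userspace: don't wait for
		 * them, just retry the release from a workqueue later.
		 */
		schedule_delayed_work(&pool->release_dw, DEFER_TIME);
	}
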
> So yes, tracking "ALL" pages is one way to solve it, but I still think that
> the correct way to deal with this is to hold the driver/netdev while there
> are inflight pages. No strong opinion, though; if we expect pages to remain
> in userspace for "too" long, then I admit tracking is the only way.
>
AFAICT, the amount of time userspace can hold onto an inflight page is
actually indefinite. The pathological edge case is that userspace opens a
receive socket, never closes it, and never recvmsg()'s it. In that case
skbs will wait in the socket's receive queue forever.
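
Something like this toy program (purely illustrative, not from any real
report) is enough to pin pages for the lifetime of the process:

	/* Toy illustration: a socket that receives traffic but is never
	 * read. Every skb that lands in the receive queue keeps its
	 * page_pool pages "inflight" until the socket is closed.
	 */
	#include <netinet/in.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main(void)
	{
		struct sockaddr_in addr;
		int fd = socket(AF_INET, SOCK_DGRAM, 0);

		memset(&addr, 0, sizeof(addr));
		addr.sin_family = AF_INET;
		addr.sin_port = htons(9999);
		addr.sin_addr.s_addr = htonl(INADDR_ANY);
		bind(fd, (struct sockaddr *)&addr, sizeof(addr));

		/* Never recvmsg(): inbound datagrams sit in the receive
		 * queue, holding their pages, until we exit.
		 */
		for (;;)
			pause();
	}
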
FWIW Jakub did suggest a fix where the page_pool would stall the
netdevice removal while there are inflight pages. I never provided a
Reviewed-by because I was nervous about GVE failing to soft-reset or
unload or something because some userspace is holding onto a page_pool
page, but maybe you'll like it better:
https://lore.kernel.org/netdev/20240809205717.0c966bad@kernel.org/T/#mbca7f9391ba44840444e747c9038ef6617142048
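
Roughly, as I understand it, the shape was (my paraphrase of the idea,
not the actual patch; list and field names here are approximate):

	/* Paraphrase of the "stall unregister" idea: before the netdev
	 * (and with it the DMA device) goes away, block until every
	 * page_pool page has come back.
	 */
	static void netdev_wait_for_page_pools(struct net_device *dev)
	{
		struct page_pool *pool;

		list_for_each_entry(pool, &dev->page_pools, user.list)
			while (page_pool_inflight(pool) > 0)
				msleep(100);	/* stalls unregister forever
						 * if userspace never lets go */
	}
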
My personal feeling is that adding overhead to the slow page_alloc +
dma mapping path is preferable to blocking netdev unregister.
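
i.e. roughly this shape (an illustrative sketch of the tracking idea;
field and function names approximate the series, not the final API):

	/* Sketch: remember every page we DMA map in an xarray at map
	 * time, so that destroy can unmap whatever is still outstanding
	 * while pool->p.dev is guaranteed to still exist.
	 */
	static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
	{
		dma_addr_t dma;
		u32 id;

		dma = dma_map_page_attrs(pool->p.dev, page, 0,
					 PAGE_SIZE << pool->p.order,
					 pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
		if (dma_mapping_error(pool->p.dev, dma))
			return false;

		/* Slow path only: track the mapping so destroy can find it. */
		if (xa_alloc(&pool->dma_mapped, &id, page, xa_limit_32b,
			     GFP_ATOMIC)) {
			dma_unmap_page_attrs(pool->p.dev, dma,
					     PAGE_SIZE << pool->p.order,
					     pool->p.dma_dir,
					     DMA_ATTR_SKIP_CPU_SYNC);
			return false;
		}

		page_pool_set_dma_addr(page, dma);	/* existing helper */
		return true;
	}

	/* At page_pool_destroy() time, unmap everything still tracked. */
	static void page_pool_unmap_all(struct page_pool *pool)
	{
		struct page *page;
		unsigned long id;

		xa_for_each(&pool->dma_mapped, id, page)
			dma_unmap_page_attrs(pool->p.dev,
					     page_pool_get_dma_addr(page),
					     PAGE_SIZE << pool->p.order,
					     pool->p.dma_dir,
					     DMA_ATTR_SKIP_CPU_SYNC);
		xa_destroy(&pool->dma_mapped);
	}

(The normal page-return path would erase its xarray entry before
unmapping; that side is omitted here for brevity.)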
--
Thanks,
Mina