netdev - Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE

Message-ID: <CAHS8izNB0qNaU8OTcwDYmeVPtCrEjTTOhwCHtVsLiyhXmPLsXQ@mail.gmail.com>
Date: Wed, 5 Jul 2023 18:17:39 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Jason Gunthorpe <jgg@...pe.ca>
Cc: David Ahern <dsahern@...nel.org>, Jakub Kicinski <kuba@...nel.org>, 
	Jesper Dangaard Brouer <jbrouer@...hat.com>, brouer@...hat.com, 
	Alexander Duyck <alexander.duyck@...il.com>, Yunsheng Lin <linyunsheng@...wei.com>, davem@...emloft.net, 
	pabeni@...hat.com, netdev@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Lorenzo Bianconi <lorenzo@...nel.org>, Yisen Zhuang <yisen.zhuang@...wei.com>, 
	Salil Mehta <salil.mehta@...wei.com>, Eric Dumazet <edumazet@...gle.com>, 
	Sunil Goutham <sgoutham@...vell.com>, Geetha sowjanya <gakula@...vell.com>, 
	Subbaraya Sundeep <sbhatta@...vell.com>, hariprasad <hkelam@...vell.com>, 
	Saeed Mahameed <saeedm@...dia.com>, Leon Romanovsky <leon@...nel.org>, Felix Fietkau <nbd@....name>, 
	Ryder Lee <ryder.lee@...iatek.com>, Shayne Chen <shayne.chen@...iatek.com>, 
	Sean Wang <sean.wang@...iatek.com>, Kalle Valo <kvalo@...nel.org>, 
	Matthias Brugger <matthias.bgg@...il.com>, 
	AngeloGioacchino Del Regno <angelogioacchino.delregno@...labora.com>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, Ilias Apalodimas <ilias.apalodimas@...aro.org>, 
	linux-rdma@...r.kernel.org, linux-wireless@...r.kernel.org, 
	linux-arm-kernel@...ts.infradead.org, linux-mediatek@...ts.infradead.org, 
	Jonathan Lemon <jonathan.lemon@...il.com>
Subject: Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5]
 page_pool: remove PP_FLAG_PAGE_FRAG flag)

On Mon, Jul 3, 2023 at 2:43 PM Jason Gunthorpe <jgg@...pe.ca> wrote:
>
> On Sun, Jul 02, 2023 at 11:22:33PM -0700, Mina Almasry wrote:
> > On Sun, Jul 2, 2023 at 9:20 PM David Ahern <dsahern@...nel.org> wrote:
> > >
> > > On 6/29/23 8:27 PM, Mina Almasry wrote:
> > > >
> > > > Hello Jakub, I'm looking into device memory (peer-to-peer) networking
> > > > actually, and I plan to pursue using the page pool as a front end.
> > > >
> > > > Quick description of what I have so far:
> > > > current implementation uses device memory with struct pages; I am
> > > > putting all those pages in a gen_pool, and we have written an
> > > > allocator that allocates pages from the gen_pool. In the driver, we
> > > > use this allocator instead of alloc_page() (the driver in question is
> > > > gve which currently doesn't use the page pool). When the driver is
> > > > done with the p2p page, it simply decrements the refcount on it and
> > > > the page is freed back to the gen_pool.
> >
> > Quick update here, I was able to get my implementation working with
> > the page pool as a front end with the memory provider API Jakub wrote
> > here:
> > https://github.com/kuba-moo/linux/tree/pp-providers
> >
> > The main complication indeed was the fact that my device memory pages
> > are ZONE_DEVICE pages, which are incompatible with the page_pool due
> > to the union in struct page. I thought of a couple of approaches to
> > resolve that.
> >
> > 1. Make my device memory pages non-ZONE_DEVICE pages.
>
> Hard no on this from a mm perspective.. We need P2P memory to be
> properly tagged and have the expected struct pages to be DMA mappable
> and otherwise, you totally break everything if you try to do this..
>
> > 2. Convert the pages from ZONE_DEVICE pages to page_pool pages and
> > vice versa as they're being inserted and removed from the page pool.
>
> This is kind of scary, it is very, very, fragile to rework the pages
> like this. Eg what happens when the owning device unplugs and needs to
> revoke these pages? I think it would likely crash..
>
> I think it also technically breaks the DMA API as we may need to look
> into the pgmap to do cache ops on some architectures.
>
> I suggest you try to work with 8k folios and then the tail page's
> struct page is empty enough to store the information you need..

Hi Jason, sorry for the late reply,

I think this could work, and the page pool already supports
allocations of order > 0. It may end up being a big change to the GVE
driver, which, as I understand it, currently deals with order 0
allocations exclusively.

Another issue is that in networks with low MTU, we could be DMAing
1400/1500 bytes into each allocation, which is problematic if the
allocation is 8K+. I would need to investigate a bit to see if/how to
solve that, and we may end up having to split the page and again run
into the 'not enough room in struct page' problem.
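
To put rough numbers on that concern, here is a minimal userspace
sketch (not kernel code) of how much of a receive buffer one frame
fills. The helper names are made up for illustration, and the model
assumes exactly one frame is DMAed into each allocation with no
splitting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: percentage of the buffer filled by one frame,
 * rounded down. Assumes one frame per allocation. */
static unsigned int rx_buf_utilization_pct(size_t frame_bytes, size_t buf_bytes)
{
	assert(buf_bytes > 0 && frame_bytes <= buf_bytes);
	return (unsigned int)(frame_bytes * 100 / buf_bytes);
}

/* Hypothetical helper: bytes of the allocation left unused by the frame. */
static size_t rx_buf_wasted_bytes(size_t frame_bytes, size_t buf_bytes)
{
	assert(frame_bytes <= buf_bytes);
	return buf_bytes - frame_bytes;
}
```

With a 1500 byte frame, rx_buf_utilization_pct(1500, 8192) comes out
at 18 and rx_buf_wasted_bytes(1500, 8192) at 6692, versus 36 percent
for a 4096 byte buffer. That gap is why splitting the 8K allocation
looks tempting despite the struct page space problem.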

> Or allocate per page memory and do a memdesc like thing..
>

I need to review memdesc more closely. Do you imagine that I would add
a pointer in struct page that points to the memdesc, or implement a
page to memdesc mapping in the page_pool? Either approach could work.
I think the concern would be that accessing the memdesc entries may
incur a cache miss, which is unacceptable in fast paths, but I think I
already dereference page->pgmap in a few places and it doesn't seem to
be an issue.
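
The two options can be sketched in plain userspace C as below. This is
only a toy model: struct mock_page, struct mock_memdesc and
struct mock_region are hypothetical stand-ins, not the kernel's
struct page or any existing memdesc API, and the fields are invented
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-page descriptor holding state that does not fit in
 * the page structure itself. */
struct mock_memdesc {
	uint64_t dma_addr;
	uint32_t pp_refcount;
};

/* Option A: the page carries a pointer to its descriptor. */
struct mock_page {
	struct mock_memdesc *desc;
};

/* Option B: the pool keeps a side table indexed by page position. */
struct mock_region {
	struct mock_page *pages;
	struct mock_memdesc *descs;
	size_t nr_pages;
};

/* Option A lookup: one extra pointer dereference per access. */
static struct mock_memdesc *desc_via_pointer(const struct mock_page *page)
{
	return page->desc;
}

/* Option B lookup: derive the index from the page's offset within the
 * region, then index the side table. */
static struct mock_memdesc *desc_via_table(const struct mock_region *r,
					   const struct mock_page *page)
{
	ptrdiff_t idx = page - r->pages;

	assert(idx >= 0 && (size_t)idx < r->nr_pages);
	return &r->descs[idx];
}
```

Either lookup touches memory outside the page structure, which is
where the possible cache miss comes from. Option B avoids needing a
free field in the page at the cost of knowing which region a page
belongs to.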

> Though overall, you won't find devices creating struct pages for their
> P2P memory today, so I'm not sure what the purpose is. Jonathan
> already got highly slammed for proposing code to the kernel that was
> unusable. Please don't repeat that. Other than a special NVMe use case
> the interface for P2P is DMABUF right now and it is not struct page
> backed.
>

Our approach is actually to extend DMABUF to provide struct page
backed attachment mappings, which as far as I understand sidesteps the
issues Jonathan ran into. Our code is fully functional with any device
that supports dmabuf and in fact a lot of my tests use udmabuf to
minimize the dependencies. The RFC may come with a udmabuf selftest to
showcase that any dmabuf, even a mocked one, would be supported.

> Even if we did get to struct pages for device memory, it is highly
> likely cases you are interested in will be using larger than 4k
> folios, so page pool would need to cope with this nicely as well.
>

--
Thanks,
Mina