[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z4d4AaLGrhRa5KLJ@phenom.ffwll.local>
Date: Wed, 15 Jan 2025 09:55:29 +0100
From: Simona Vetter <simona.vetter@...ll.ch>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Xu Yilun <yilun.xu@...ux.intel.com>,
Christian König <christian.koenig@....com>,
Christoph Hellwig <hch@....de>, Leon Romanovsky <leonro@...dia.com>,
kvm@...r.kernel.org, dri-devel@...ts.freedesktop.org,
linux-media@...r.kernel.org, linaro-mm-sig@...ts.linaro.org,
sumit.semwal@...aro.org, pbonzini@...hat.com, seanjc@...gle.com,
alex.williamson@...hat.com, vivek.kasireddy@...el.com,
dan.j.williams@...el.com, aik@....com, yilun.xu@...el.com,
linux-coco@...ts.linux.dev, linux-kernel@...r.kernel.org,
lukas@...ner.de, yan.y.zhao@...el.com, leon@...nel.org,
baolu.lu@...ux.intel.com, zhenzhong.duan@...el.com,
tao1.su@...el.com
Subject: Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked()
kAPI
On Tue, Jan 14, 2025 at 01:31:03PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 14, 2025 at 03:44:04PM +0100, Simona Vetter wrote:
>
> > E.g. if a compositor gets a dma-buf it assumes that by just binding that
> > it will not risk gpu context destruction (unless you're out of memory and
> > everything is on fire anyway, and it's ok to die). But if a nasty client
> > app supplies a revocable dma-buf, then it can shot down the higher
> > priviledged compositor gpu workload with precision. Which is not great, so
> > maybe existing dynamic gpu importers should reject revocable dma-buf.
> > That's at least what I had in mind as a potential issue.
>
> I see, so it is not that they can't handle a non-present fault it is
> just that the non-present effectively turns into a crash of the
> context and you want to avoid the crash. It makes sense to me to
> negotiate this as part of the API.
Yup.
> > > This is similar to the structure BIO has, and it composes nicely with
> > > a future pin_user_pages() and memfd_pin_folios().
> >
> > Since you mention pin here, I think that's another aspect of the revocable
> > vs dynamic question. Dynamic buffers are expected to sometimes just move
> > around for no reason, and importers must be able to cope.
>
> Yes, and we have importers that can tolerate dynamic and those that
> can't. Though those that can't tolerate it can often implement revoke.
>
> I view your list as a cascade:
> 1) Fully pinned can never be changed so long as the attach is present
> 2) Fully pinned, but can be revoked. Revoked is a fatal condition and
> the importer is allowed to experience an error
> 3) Fully dynamic and always present. Support for move, and
> restartable fault, is required
>
> Today in RDMA we ask the exporter if it is 1 or 3 and allow different
> things. I've seen the GPU side start to offer 1 more often as it has
> significant performance wins.
I'm not entirely sure a cascade is the right thing or whether we should
have importers just specify bitmask of what is acceptable to them. But
that little detail aside this sounds like what I was thinking of.
> > For recovable exporters/importers I'd expect that movement is not
> > happening, meaning it's pinned until the single terminal revocation. And
> > maybe I read the kvm stuff wrong, but it reads more like the latter to me
> > when crawling through the pfn code.
>
> kvm should be fully faultable and it should be able handle move. It
> handles move today using the mmu notifiers after all.
>
> kvm would need to interact with the dmabuf reservations on its page
> fault path.
>
> iommufd cannot be faultable and it would only support revoke. For VFIO
> revoke would not be fully terminal as VFIO can unrevoke too
> (sigh). If we make revoke special I'd like to eventually include
> unrevoke for this reason.
>
> > Once we have the lifetime rules nailed then there's the other issue of how
> > to describe the memory, and my take for that is that once the dma-api has
> > a clear answer we'll just blindly adopt that one and done.
>
> This is what I hope, we are not there yet, first Leon's series needs
> to get merged then we can start on making the DMA API P2P safe without
> any struct page. From there it should be clear what direction things
> go in.
>
> DMABUF would return pfns annotated with whatever matches the DMA API,
> and the importer would be able to inspect the PFNs to learn
> information like their P2Pness, CPU mappability or whatever.
I think for 90% of exporters pfn would fit, but there's some really funny
ones where you cannot get a cpu pfn by design. So we need to keep the
pfn-less interfaces around. But ideally for the pfn-capable exporters we'd
have helpers/common code that just implements all the other interfaces.
I'm not worried about the resulting fragmentation, because dma-buf is the
"everythig is optional" api. We just need to make sure we have really
clear semantic api contracts, like with the revocable vs dynamic/moveable
distinction. Otherwise things go boom when an importer gets an unexpected
dma-buf fd, and we pass this across security boundaries a lot.
> I'm pushing for the extra struct, and Christoph has been thinking
> about searching a maple tree on the PFN. We need to see what works best.
>
> > And currently with either dynamic attachments and dma_addr_t or through
> > fishing the pfn from the cpu pagetables there's some very clearly defined
> > lifetime and locking rules (which kvm might get wrong, I've seen some
> > discussions fly by where it wasn't doing a perfect job with reflecting pte
> > changes, but that was about access attributes iirc).
>
> Wouldn't surprise me, mmu notifiers are very complex all around. We've
> had bugs already where the mm doesn't signal the notifiers at the
> right points.
>
> > If we add something
> > new, we need clear rules and not just "here's the kvm code that uses it".
> > That's how we've done dma-buf at first, and it was a terrible mess of
> > mismatched expecations.
>
> Yes, that would be wrong. It should be self defined within dmabuf and
> kvm should adopt to it, move semantics and all.
Ack.
I feel like we have a plan here. Summary from my side:
- Sort out pin vs revocable vs dynamic/moveable semantics, make sure
importers have no surprises.
- Adopt whatever new dma-api datastructures pops out of the dma-api
reworks.
- Add pfn based memory access as yet another optional access method, with
helpers so that exporters who support this get all the others for free.
I don't see a strict ordering between these, imo should be driven by
actual users of the dma-buf api.
Already done:
- dmem cgroup so that we can resource control device pinnings just landed
in drm-next for next merge window. So that part is imo sorted and we can
charge ahead with pinning into device memory without all the concerns
we've had years ago when discussing that for p2p dma-buf support.
But there might be some work so that we can p2p pin without requiring
dynamic attachments only, I haven't checked whether that needs
adjustment in dma-buf.c code or just in exporters.
Anything missing?
I feel like this is small enough that m-l archives is good enough. For
some of the bigger projects we do in graphics we sometimes create entries
in our kerneldoc with wip design consensus and things like that. But
feels like overkill here.
> My general desire is to move all of RDMA's MR process away from
> scatterlist and work using only the new DMA API. This will save *huge*
> amounts of memory in common workloads and be the basis for non-struct
> page DMA support, including P2P.
Yeah a more memory efficient structure than the scatterlist would be
really nice. That would even benefit the very special dma-buf exporters
where you cannot get a pfn and only the dma_addr_t, altough most of those
(all maybe even?) have contig buffers, so your scatterlist has only one
entry. But it would definitely be nice from a design pov.
Aside: A way to more efficiently create compressed scatterlists would be
neat too, because a lot of drivers hand-roll that and it's a bit brittle
and kinda silly to duplicate. With compressed I mean just a single entry
for a contig range, in practice thanks to huge pages/folios and allocators
trying to hand out contig ranges if there's plenty of memory that saves a
lot of memory too. But currently it's a bit a pain to construct these
efficiently, mostly it's just a two-pass approach and then trying to free
surplus memory or krealloc to fit. Also I don't have good ideas here, but
dma-api folks might have some from looking at too many things that create
scatterlists.
-Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Powered by blists - more mailing lists