Message-ID: <20240904155854.471ff2a8@collabora.com>
Date: Wed, 4 Sep 2024 15:58:54 +0200
From: Boris Brezillon <boris.brezillon@...labora.com>
To: Steven Price <steven.price@....com>
Cc: Adrián Larumbe <adrian.larumbe@...labora.com>,
Christian König <christian.koenig@....com>, Mihail Atanassov
<mihail.atanassov@....com>, linux-kernel@...r.kernel.org, Liviu Dudau
<liviu.dudau@....com>, dri-devel@...ts.freedesktop.org, David Airlie
<airlied@...il.com>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann
<tzimmermann@...e.de>, Alex Deucher <alexander.deucher@....com>, Xinhui Pan
<Xinhui.Pan@....com>, Shashank Sharma <shashank.sharma@....com>, Ketil
Johnsen <ketil.johnsen@....com>, Akash Goel <akash.goel@....com>
Subject: Re: [RFC PATCH 00/10] drm/panthor: Add user submission
On Wed, 4 Sep 2024 14:35:12 +0100
Steven Price <steven.price@....com> wrote:
> On 04/09/2024 14:20, Boris Brezillon wrote:
> > + Adrian, who has been looking at the shrinker stuff for Panthor
> >
> > On Wed, 4 Sep 2024 13:46:12 +0100
> > Steven Price <steven.price@....com> wrote:
> >
> >> On 04/09/2024 12:34, Christian König wrote:
> >>> Hi Boris,
> >>>
> >>> Am 04.09.24 um 13:23 schrieb Boris Brezillon:
> >>>>>>>> Please read up here on why that stuff isn't allowed:
> >>>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> >>>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which means
> >>>>>>> memory management easy mode.
> >>>>>> Ok, that at least makes things work for the moment.
> >>>>> Ah, perhaps this should have been spelt out more clearly ;)
> >>>>>
> >>>>> The VM_BIND mechanism that's already in place jumps through some hoops
> >>>>> to ensure that memory is preallocated when the memory operations are
> >>>>> enqueued. So any memory required should have been allocated before any
> >>>>> sync object is returned. We're aware of the issue with memory
> >>>>> allocations on the signalling path and trying to ensure that we don't
> >>>>> have that.
> >>>>>
> >>>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
> >>>>> memory with our design.
> >>>> That's actually what we were planning to do: the panthor shrinker was
> >>>> about to rely on fences attached to GEM objects to know if it can
> >>>> reclaim the memory. This design relies on each job attaching its fence
> >>>> to the GEMs mapped to the VM at the time the job is submitted, such that
> >>>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
> >>>> is done.
> >>
> >> How progressed is this shrinker?
> >
> > We don't have code yet. All we know is that we want to re-use Dmitry's
> > generic GEM-SHMEM shrinker implementation [1], and adjust it to match
> > the VM model, which means not tracking things at the BO granularity,
> > but at the VM granularity. Actually it has to be a hybrid model, where
> > shared BOs (those imported/exported) are tracked individually, while
> > all private BOs are checked simultaneously (since they all share the VM
> > resv object).
> >
> >> It would be good to have an RFC so that
> >> we can look to see how user submission could fit in with it.
> >
> > Unfortunately, we don't have that yet :-(. All we have is a rough idea
> > of how things will work, which is basically how TTM reclaim works, but
> > adapted to GEM.
>
> Fair enough, thanks for the description. We might need to coordinate to
> get this looked at sooner if it's going to be blocking user submission.
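Agreed. To give a more concrete idea of what we have in mind in the
meantime, the reclaim check would look roughly like this (not even
compile-tested, and the panthor_vm/panthor_gem_object fields used
below are invented for the sake of the example):

/* Completely untested sketch. vm->base_resv and vm->shared_bos are
 * hypothetical fields, only there to illustrate the hybrid VM/BO
 * tracking idea described above.
 */
static bool panthor_vm_is_reclaimable(struct panthor_vm *vm)
{
	struct panthor_gem_object *bo;

	/* Private BOs all share the VM resv, so a single check covers
	 * them. BOOKKEEP returns all fences, including the job fences
	 * we attach at submission time.
	 */
	if (!dma_resv_test_signaled(vm->base_resv,
				    DMA_RESV_USAGE_BOOKKEEP))
		return false;

	/* Shared (imported/exported) BOs have their own resv and must
	 * be checked individually.
	 */
	list_for_each_entry(bo, &vm->shared_bos, shared_node) {
		if (!dma_resv_test_signaled(bo->base.base.resv,
					    DMA_RESV_USAGE_BOOKKEEP))
			return false;
	}

	return true;
}

The job submission path would add its done fence to the VM resv (and
to the individual resvs of shared BOs) with DMA_RESV_USAGE_BOOKKEEP,
which is what makes the checks above meaningful.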
>
> >>
> >>> Yeah and exactly that doesn't work any more when you are using user
> >>> queues, because the kernel has no opportunity to attach a fence for each
> >>> submission.
> >>
> >> User submission requires a cooperating user space[1]. So obviously user
> >> space would need to ensure that any BOs it expects to be accessed are
> >> pinned in some way. Since the expectation of user space submission is
> >> that we're reducing kernel involvement, I'd also expect these to be
> >> fairly long-term pins.
> >>
> >> [1] Obviously with a timer to kill things from a malicious user space.
> >>
> >> The (closed) 'kbase' driver has a shrinker, but it is only used on a
> >> subset of memory and it's up to user space to ensure that it keeps the
> >> relevant parts pinned (or more specifically doesn't mark them as
> >> discardable if there's memory pressure). Not that I think we should be
> >> taking its model as a reference here.
> >>
> >>>>> Memory which user space thinks the GPU might
> >>>>> need should be pinned before the GPU work is submitted. APIs which
> >>>>> require any form of 'paging in' of data would need to be implemented by
> >>>>> the GPU work completing and being resubmitted by user space after the
> >>>>> memory changes (i.e. there could be a DMA fence pending on the GPU work).
> >>>> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> >>>> that means we can't really transparently swap out GPU memory, or we
> >>>> have to constantly pin/unpin around each job, which means even more
> >>>> ioctl()s than we have now. Another option would be to add the XGS fence
> >>>> to the BOs attached to the VM, assuming it's created before the job
> >>>> submission itself, but you're no longer reducing the number of user <->
> >>>> kernel round trips if you do that, because you now have to create an
> >>>> XGS job for each submission, so you basically get back to one ioctl()
> >>>> per submission.
> >>
> >> As you say the granularity of pinning has to be fairly coarse for user
> >> space submission to make sense. My assumption (could be wildly wrong)
> >> was that most memory would be pinned whenever a context is rendering.
> >
> > The granularity of pinning (in terms of which regions are pinned) is not
> > really the problem, we can just assume anything that's mapped to the VM
> > will be used by the GPU (which is what we're planning to do for kernel
> > submission BTW). The problem is making the timeslice during
> > which VM memory is considered unreclaimable as short as possible, such
> > that the system can reclaim memory under mem pressure. Ideally, you want
> > to pin memory as long as you have jobs queued/running, and allow for
> > reclaim when the GPU context is idle.
> >
> > We might be able to involve the panthor_scheduler for usermode queues,
> > such that a context that's eligible for scheduling first gets its VM
> > mappings pinned (fence creation + assignment to the VM/BO resvs), and
> > things get reclaimable again when the group is evicted from the CSG
> > slot. That implies evicting idle groups more aggressively than we do
> > now, but there's probably a way around it.
>
> Can we evict idle groups from the shrinker? User space already has to do
> a little dance to work out whether it needs to "kick" the kernel to do
> the submission. So it would be quite reasonable to extend the kick to
> also pin the VM(s). The shrinker could then proactively evict idle
> groups, and at that point it would be possible to unpin the memory.
>
> I'm probably missing something, but that seems like a fairly solid
> solution to me.
The locking might be tricky to get right, but other than that, it looks
like an option that could work. It's definitely worth investigating.
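Hand-waving a bit, I'd expect the "kick" path to look something like
the following (all the names are invented, and the locking is exactly
the part I'd expect to be tricky):

/* Hypothetical sketch of the kick/pin path: userspace rings the kernel
 * when it starts queueing work, we pin the VM mappings and attach a
 * fence that only signals when the group gets evicted from its CSG
 * slot. The shrinker can then evict idle groups and drop the pin.
 * panthor_group_kick(), panthor_vm_pin_mappings() and the fields used
 * here don't exist, they just illustrate the idea.
 */
static int panthor_group_kick(struct panthor_group *group)
{
	struct panthor_vm *vm = group->vm;
	int ret;

	if (group->active)
		return 0;

	/* Make all VM mappings unreclaimable while the group is runnable. */
	ret = panthor_vm_pin_mappings(vm);
	if (ret)
		return ret;

	ret = dma_resv_lock(vm->base_resv, NULL);
	if (ret)
		goto err_unpin;

	ret = dma_resv_reserve_fences(vm->base_resv, 1);
	if (ret)
		goto err_unlock;

	/* active_fence signals when the group leaves its CSG slot, so
	 * the shrinker considers the VM busy until then.
	 */
	dma_resv_add_fence(vm->base_resv, group->active_fence,
			   DMA_RESV_USAGE_BOOKKEEP);
	dma_resv_unlock(vm->base_resv);
	group->active = true;
	return 0;

err_unlock:
	dma_resv_unlock(vm->base_resv);
err_unpin:
	panthor_vm_unpin_mappings(vm);
	return ret;
}

On the other end, the shrinker would evict idle groups, signal
group->active_fence and unpin the VM mappings, which is where I expect
the lock ordering fun to be.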
>
> >>
> >>> For AMDGPU we are currently working on the following solution for
> >>> memory management with user queues:
> >>>
> >>> 1. User queues are created through a kernel IOCTL, submissions work by
> >>> writing into a ring buffer and ringing a doorbell.
> >>>
> >>> 2. Each queue can request the kernel to create fences for the work
> >>> currently pushed to that queue, which can then be attached to BOs,
> >>> syncobjs, syncfiles etc...
> >>>
> >>> 3. In addition to that we have an eviction/preemption fence attached to
> >>> all BOs, page tables, and whatever resources we need.
> >>>
> >>> 4. When these eviction fences are requested to signal, they first wait
> >>> for all submission fences, then suspend the user queues and block the
> >>> creation of new submission fences until the queues are restarted again.
> >>>
> >>> This way you can still do your memory management inside the kernel (e.g.
> >>> move BOs from local to system memory) or even completely suspend and
> >>> resume applications without their interaction, but as Sima said it is
> >>> just horribly complicated to get right.
> >>>
> >>> We have been working on this for like two years now and it still could
> >>> be that we missed something since it is not in production testing yet.
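Let me try to rephrase point 4 in pseudo-code, to make sure I
understand the sequence correctly (obviously not the real amdgpu code,
just my reading of what you describe, with made-up helper names):

/* My (possibly wrong) reading of the eviction flow described above.
 * Not actual amdgpu code.
 */
static void evict_user_queue_resources(struct user_ctx *ctx)
{
	/* Wait for the submission fences created in step 2. */
	wait_for_all_submission_fences(ctx);

	/* Suspend the user queues and block the creation of new
	 * submission fences until the queues are restarted.
	 */
	suspend_user_queues(ctx);
	block_new_submission_fences(ctx);

	/* Only then can the eviction/preemption fence signal, allowing
	 * BOs, page tables, etc. to be moved or swapped out.
	 */
	dma_fence_signal(ctx->eviction_fence);
}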
> >>
> >> I'm not entirely sure I follow how this doesn't create a dependency
> >> cycle. From your description it sounds like you create a fence from the
> >> user space queue which is then used to prevent eviction of the BOs needed.
> >>
> >> So to me it sounds like:
> >>
> >> 1. Attach fence to BOs to prevent eviction.
> >>
> >> 2. User space submits work to the ring buffer, rings doorbell.
> >>
> >> 3. Call into the kernel to create the fence for step 1.
> >>
> >> Which is obviously broken. What am I missing?
> >>
> >> One other thing to note is that Mali doesn't have local memory - so the
> >> only benefit of unpinning is if we want to swap to disk (or zram etc).
> >
> > Which would be good to have, IMHO. If we don't do the implicit swapout
> > based on some sort of least-recently-used-VM, we have to rely on
> > userspace freeing its buffers (or flagging them as reclaimable) as soon as
> > they are no longer used by the GPU.
>
> It's an awkward one because really you do want user space to free
> buffers - generally the user space driver has a fairly large number of
> buffers which are for temporary storage (and kept around to avoid the
> overhead of freeing/allocating each frame). I know we've resorted to
> idle timers in user space in an attempt to do this in the past - but
> it's ugly.
Fair enough.
>
> Actual swapout can work, but it incurs a heavy penalty when the
> application becomes active again. But I agree we should probably be
> aiming to add support for this.
I guess the other case is a gazillion GPU contexts, each keeping
buffers around for good reasons. If most of those contexts are idle,
swapping out the mostly inactive ones is better than having the OOM
killer kick in to reclaim memory. And yes, there's a hit any time an
inactive context gets active again, but only if the system was under
memory pressure in the meantime, so I guess the slowdown is acceptable
in that case.
Just to mention that, IIRC, MSM already moved away from explicit BO
reclaim (BOs flagged reclaimable by userspace when they're back in the
userspace BO cache) in favor of an LRU-based swapout. Dmitry's work is
also centered around this LRU-based swapout model, even though it also
supports the model we have in panfrost, lima, vc4, etc. So it looks
like the new trend for embedded GPU drivers is to defer this memory
reclaim responsibility entirely to the kernel rather than letting
userspace decide which bits are reclaimable.
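For reference, that model boils down to a shrinker whose scan callback
walks an LRU and evicts whatever is idle, something like the grossly
simplified sketch below (the real thing would be built on top of
Dmitry's generic shmem shrinker helpers, and all the panthor_* names
and fields are invented):

static unsigned long
panthor_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
{
	struct panthor_device *ptdev = shrinker->private_data;
	struct panthor_gem_object *bo, *tmp;
	unsigned long freed = 0;

	/* We're called from reclaim, so never block on the lock. */
	if (!mutex_trylock(&ptdev->shrinker_lock))
		return SHRINK_STOP;

	list_for_each_entry_safe(bo, tmp, &ptdev->lru, lru_node) {
		if (freed >= sc->nr_to_scan)
			break;

		/* Skip objects that are still busy on the GPU. */
		if (!dma_resv_test_signaled(bo->base.base.resv,
					    DMA_RESV_USAGE_BOOKKEEP))
			continue;

		/* Least recently used first: swap out or purge the
		 * pages (resv locking elided here).
		 */
		if (panthor_gem_evict(bo)) {
			list_del_init(&bo->lru_node);
			freed += bo->base.base.size >> PAGE_SHIFT;
		}
	}
	mutex_unlock(&ptdev->shrinker_lock);

	return freed ?: SHRINK_STOP;
}

The nice part is that userspace doesn't have to flag anything: idle
objects simply age out of the LRU.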