Message-ID: <20140724184739.GA6177@gmail.com>
Date: Thu, 24 Jul 2014 14:47:41 -0400
From: Jerome Glisse <j.glisse@...il.com>
To: Alex Deucher <alexdeucher@...il.com>
Cc: Oded Gabbay <oded.gabbay@....com>,
"Bridgman, John" <John.Bridgman@....com>,
Jesse Barnes <jbarnes@...tuousgeek.org>,
"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
Christian König <deathsimple@...afone.de>,
"Lewycky, Andrew" <Andrew.Lewycky@....com>,
David Airlie <airlied@...ux.ie>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
On Thu, Jul 24, 2014 at 01:35:53PM -0400, Alex Deucher wrote:
> On Thu, Jul 24, 2014 at 11:44 AM, Jerome Glisse <j.glisse@...il.com> wrote:
> > On Thu, Jul 24, 2014 at 01:01:41AM +0300, Oded Gabbay wrote:
> >> On 24/07/14 00:46, Bridgman, John wrote:
> >> >
> >> >> -----Original Message----- From: dri-devel
> >> >> [mailto:dri-devel-bounces@...ts.freedesktop.org] On Behalf Of Jesse
> >> >> Barnes Sent: Wednesday, July 23, 2014 5:00 PM To:
> >> >> dri-devel@...ts.freedesktop.org Subject: Re: [PATCH v2 00/25]
> >> >> AMDKFD kernel driver
> >> >>
> >> >> On Mon, 21 Jul 2014 19:05:46 +0200 daniel at ffwll.ch (Daniel
> >> >> Vetter) wrote:
> >> >>
> >> >>> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
> >> >>>> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
> >> >>>>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König
> >> >>>>> wrote:
> >> >>>>>> On 21.07.2014 14:36, Oded Gabbay wrote:
> >> >>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >> >>
> >> >> [snip!!]
> >> > My BlackBerry thumb thanks you ;)
> >> >>
> >> >>>>>>
> >> >>>>>> The main questions here are if it's avoidable to pin down
> >> >>>>>> the memory and if the memory is pinned down at driver load,
> >> >>>>>> by request from userspace or by anything else.
> >> >>>>>>
> >> >>>>>> As far as I can see only the "mqd per userspace queue"
> >> >>>>>> might be a bit questionable, everything else sounds
> >> >>>>>> reasonable.
> >> >>>>>
> >> >>>>> Aside, i915 perspective again (i.e. how we solved this):
> >> >>>>> When scheduling away from contexts we unpin them and put them
> >> >>>>> into the lru. And in the shrinker we have a last-ditch
> >> >>>>> callback to switch to a default context (since you can't ever
> >> >>>>> have no context once you've started) which means we can evict
> >> >>>>> any context object if it's
> >> >> getting in the way.
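Roughly, the last-ditch shrinker described above could be wired up as in the
sketch below; the helpers (switch_to_default_context() and friends) are
invented placeholders for driver internals, not the actual i915 code.

    #include <linux/shrinker.h>

    /* Invented helpers standing in for driver internals. */
    unsigned long nr_idle_contexts(void);
    unsigned long evict_idle_contexts(unsigned long nr);
    void switch_to_default_context(void);

    static unsigned long ctx_shrink_count(struct shrinker *s,
                                          struct shrink_control *sc)
    {
            return nr_idle_contexts();
    }

    static unsigned long ctx_shrink_scan(struct shrinker *s,
                                         struct shrink_control *sc)
    {
            /* Last resort: park the hardware on a default context so that
             * every user context object becomes evictable. */
            switch_to_default_context();
            return evict_idle_contexts(sc->nr_to_scan);
    }

    static struct shrinker ctx_shrinker = {
            .count_objects  = ctx_shrink_count,
            .scan_objects   = ctx_shrink_scan,
            .seeks          = DEFAULT_SEEKS,
    };
    /* register_shrinker(&ctx_shrinker) at driver init. */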
> >> >>>>
> >> >>>> So Intel hardware reports through some interrupt or some channel
> >> >>>> when it is not using a context? I.e. the kernel side gets a
> >> >>>> notification when some user context is done executing?
> >> >>>
> >> >>> Yes, as long as we do the scheduling with the cpu we get
> >> >>> interrupts for context switches. The mechanism is already
> >> >>> published in the execlist patches currently floating around. We
> >> >>> get a special context switch interrupt.
> >> >>>
> >> >>> But we have this unpin logic already in the current code where
> >> >>> we switch contexts through in-line cs commands from the kernel.
> >> >>> There we obviously use the normal batch completion events.
> >> >>
> >> >> Yeah and we can continue that going forward. And of course if your
> >> >> hw can do page faulting, you don't need to pin the normal data
> >> >> buffers.
> >> >>
> >> >> Usually there are some special buffers that need to be pinned for
> >> >> longer periods though, anytime the context could be active. Sounds
> >> >> like in this case the userland queues, which makes some sense. But
> >> >> maybe for smaller systems the size limit could be clamped to
> >> >> something smaller than 128M. Or tie it into the rlimit somehow,
> >> >> just like we do for mlock() stuff.
> >> >>
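As a rough illustration of the rlimit idea, an mlock()-style check could look
something like the following; check_queue_pin_limit() and the per-process
accounting structure are assumptions for the sketch, not existing KFD code.

    #include <linux/sched.h>
    #include <linux/capability.h>
    #include <linux/kernel.h>

    /* Hypothetical per-process accounting; not a real KFD structure. */
    struct kfd_pin_account {
            unsigned long pinned_pages;
    };

    /* Refuse to pin more queue memory than RLIMIT_MEMLOCK allows,
     * mirroring the way mlock() accounting works. */
    static int check_queue_pin_limit(struct kfd_pin_account *acct,
                                     unsigned long bytes)
    {
            unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
            unsigned long pages = DIV_ROUND_UP(bytes, PAGE_SIZE);

            if (!capable(CAP_IPC_LOCK) && acct->pinned_pages + pages > limit)
                    return -ENOMEM;

            acct->pinned_pages += pages; /* undone when the queue is destroyed */
            return 0;
    }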
> >> > Yeah, even the queues are in pageable memory; it's just a ~256-byte
> >> > structure per queue (the Memory Queue Descriptor) that describes the
> >> > queue to the hardware, plus a couple of pages for each process using
> >> > HSA to hold things like doorbells. Current thinking is to limit the
> >> > number of processes using HSA to ~256 and the number of queues per
> >> > process to ~1024 by default in the initial code, although my guess is
> >> > that we could take the queues-per-process default limit even lower.
> >> >
> >>
> >> So my mistake: struct cik_mqd is actually 604 bytes, and it is allocated
> >> on a 256-byte boundary.
> >> I had in mind to reserve 64MB of GART by default, which translates to
> >> 512 queues per process with 128 processes. Add two kernel module
> >> parameters, max-queues-per-process and max-processes (defaults, as I
> >> said, of 512 and 128), to give the system admin better control.
> >>
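A sketch of what those two knobs could look like as plain module parameters;
the names are invented, not the actual KFD parameters, and the reservation
size would then follow from the limits rather than being an independent
magic number.

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    /* Sketch only: parameter names are invented. */
    static int max_processes = 128;
    module_param(max_processes, int, 0444);
    MODULE_PARM_DESC(max_processes,
                     "Maximum number of HSA processes (default 128)");

    static int max_queues_per_process = 512;
    module_param(max_queues_per_process, int, 0444);
    MODULE_PARM_DESC(max_queues_per_process,
                     "Maximum number of queues per HSA process (default 512)");

    /* The GART reservation can then be derived as
     * max_processes * max_queues_per_process * per-queue footprint
     * (sizeof(struct cik_mqd) rounded up to its 256-byte alignment,
     * plus the per-process doorbell pages). */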
> >
> > So as I said somewhere else in this thread, this should not be reserved
> > but should use a special allocation. Any HSA GPU uses a virtual address
> > space for userspace, so the only issue is the kernel-side GTT.
> >
> > What I would like to see is radeon's pinned GTT allocations at the bottom
> > of the GTT space (i.e. all the ring buffers and the IB pool buffer). Then
> > have an allocator that allocates new queues from the top of the GTT
> > address space and grows toward the bottom.
> >
> > It should not statically reserve 64MB or anything. When doing an
> > allocation, it should move any TTM buffers that are in the region it
> > wants to allocate to a different location.
> >
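Purely as an illustration of that layout (kernel buffers pinned at the
bottom, queues carved from the top), a naive top-down cursor allocator could
look like the sketch below; gtt_move_range() stands in for whatever TTM
migration would actually be needed, and none of this is real radeon code.

    #include <linux/kernel.h>
    #include <linux/types.h>

    /* Illustrative only, not radeon/TTM code. */
    struct gtt_topdown {
            u64 gtt_size;    /* total GTT aperture size */
            u64 top_cursor;  /* lowest offset already handed out from the top */
    };

    /* Placeholder: migrate whatever TTM buffers currently sit in this range. */
    void gtt_move_range(u64 start, u64 end);

    /* Rings and the IB pool stay pinned at the bottom of GTT; queue objects
     * are carved from the top and grow downward, keeping the middle of the
     * aperture contiguous for ordinary buffers. */
    static u64 gtt_alloc_topdown(struct gtt_topdown *a, u64 size, u64 align)
    {
            u64 start = round_down(a->top_cursor - size, align);

            gtt_move_range(start, a->top_cursor);
            a->top_cursor = start;
            return start;
    }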
> >
> > As this needs some work, I am not against reserving some small amount
> > (a couple of MB) as a first stage, but anything more would need a proper
> > solution like the one I just described.
>
> It's still a trade-off. Even if we reserve a couple of megs, it'll be
> wasted if we are not running HSA apps. And even today, if we run a
> compute job using the current interfaces, we could end up in the same
> situation. So while I think it's definitely a good goal to come up with
> some solution for fragmentation, I don't think it should be a
> show-stopper right now.
>
It seems I am having a hard time expressing myself. I am not saying it is a
showstopper; I am saying that until a proper solution is implemented, KFD
should limit its number of queues so they consume at most a couple of MB,
i.e. not 64MB or more, but 2MB or 4MB, something in that ballpark.
> A better solution to deal with fragmentation of GTT and provide a
> better way to allocate larger buffers in vram would be to break up
> vram <-> system pool transfers into multiple transfers depending on
> the available GTT size. Or use GPUVM dynamically for vram <-> system
> transfers.
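For what it's worth, the chunked-transfer idea could be as simple as the loop
below; gtt_window_size() and copy_chunk_via_gtt() are invented placeholders
for "how much GTT is currently free" and "one staged copy through GTT".

    #include <linux/kernel.h>
    #include <linux/types.h>

    u64 gtt_window_size(void);                          /* free GTT for staging */
    int copy_chunk_via_gtt(u64 src, u64 dst, u64 size); /* one staged copy */

    /* Split a large VRAM <-> system copy into pieces no bigger than the GTT
     * window actually available, instead of requiring one huge pinned range. */
    static int copy_vram_sysram_chunked(u64 src, u64 dst, u64 size)
    {
            u64 window = gtt_window_size();
            u64 off;
            int r;

            if (!window)
                    return -ENOMEM;

            for (off = 0; off < size; off += window) {
                    u64 chunk = min(window, size - off);

                    r = copy_chunk_via_gtt(src + off, dst + off, chunk);
                    if (r)
                            return r;
            }
            return 0;
    }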
Isn't the UVD engine still using the main GTT? I have not looked at
UVD much in a while.
Yes, there are ways to fix buffer migration, but I would also like to keep
address space fragmentation to a minimum, which is the main reason I
utterly hate any design that forbids the kernel from taking over and doing
its thing. Buffer pinning should really be only for the front buffer and
things like rings, i.e. buffers whose lifetime is bound to the driver's
lifetime.
Cheers,
Jérôme
>
> Alex