[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5eeefadd-7804-3876-c8da-3e6f1bcb9dc0@gmail.com>
Date: Mon, 4 Jul 2022 15:48:03 +0200
From: Christian König <ckoenig.leichtzumerken@...il.com>
To: Christian König <christian.koenig@....com>,
Daniel Stone <daniel@...ishbar.org>,
Pekka Paalanen <ppaalanen@...il.com>,
"Sharma, Shashank" <Shashank.Sharma@....com>,
lkml <linux-kernel@...r.kernel.org>,
dri-devel <dri-devel@...ts.freedesktop.org>,
Nicolas Dufresne <nicolas@...fresne.ca>,
linaro-mm-sig@...ts.linaro.org,
Sumit Semwal <sumit.semwal@...aro.org>,
linux-media <linux-media@...r.kernel.org>
Subject: Re: [Linaro-mm-sig] Re: DMA-buf and uncached system memory
Hi Daniel,
Am 25.06.22 um 00:02 schrieb Daniel Vetter:
> On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
>> Am 23.06.22 um 13:27 schrieb Daniel Stone:
>>> [SNIP]
>>> If it's really your belief that dmabuf requires universal snooping, I
>>> recommend you send the patch to update the documentation, as well as
>>> to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
>> Well, to be honest I think that would indeed be necessary.
>>
>> What we have created are essentially two different worlds, one for PCI
>> devices and one for the rest.
>>
>> This was indeed not the intention, but it's a fact that basically all
>> DMA-buf based PCI drivers assume coherent access.
> dma-buf does not require universal snooping.
>
> It does defacto require that all device access is coherent with all other
> device access, and consistent with the exporters notion of how cpu
> coherency is achieved. Not that coherent does not mean snooping, as long
> as all devices do unsnooped access and the exporter either does wc/uc or
> flushes caches that's perfectly fine, and how all the arm soc dma-buf
> sharing works.
We should probably start documenting that better.
> We did originally have the wording in there that you have to map/unamp
> around every device access, but that got dropped because no one was doing
> that anyway.
>
> Now where this totally breaks down is how we make this work, because the
> idea was that dma_buf_attach validates this all. Where this means all the
> hilarious reasons buffer sharing might not work:
> - wrong coherency mode (cpu cached or not)
> - not contiguous (we do check that, but only once we get the sg from
> dma_buf_attachment_map, which strictly speaking is a bit too late but
> most drivers do attach&map as one step so not that bad in practice)
> - whether the dma api will throw in bounce buffers or not
> - random shit like "oh this is in the wrong memory bank", which I think
> never landed in upstream
>
> p2p connectivity is about the only one that gets this right, yay. And the
> only reason we can even get it right is because all the information is
> exposed to drivers fully.
Yeah, that's why I designed P2P that way :)
I also don't think it's that bad, at least for radeon, nouveau and
amdgpu all the migration restrictions are actually handled correctly.
In other words when a DMA-buf is about to be used by another device we
use TTM to move the buffer around so that it can actually be accessed by
that device.
What I haven't foreseen in here is that we need to deal with different
caching behaviors between exporter and importer.
> The issue is that the device dma api refuses to share this information
> because it would "leak". Which sucks, because we have defacto build every
> single cross-device use-case of dma-buf on the assumption we can check
> this (up to gl/vk specs), but oh well.
>
> So in practice this gets sorted out by endless piles of hacks to make
> individual use-cases work.
>
> Oh and: This is definitely not limited to arm socs. x86 socs with intel
> at least have exactly all the same issues, and they get solved by adding
> various shitty hacks to the involved drivers (like i915+amdgpu). Luckily
> the intel camera driver isn't in upstream yet, since that would break a
> bunch of the hacks since suddently there will be now 2 cpu cache
> incoherent devices in an x86 system.
>
> Ideally someone fixes this, but I'm not hopeful.
>
> I recommend pouring more drinks.
>
> What is definitely not correct is claiming that dma-buf wasn't meant for
> this. We discussed cache coherency issues endless in budapest 12 or so
> years ago, I was there. It's just that the reality of the current
> implementation is falling short, and every time someone tries to fix it we
> get shouted down by dma api maintainers for looking behind their current.
Well that explains this, I've joined the party a year later and haven't
witnessed all of this.
> tldr; You have to magically know to not use cpu cached allocators on these
> machines.
Or reject the attachment. As far as I can see that is still the cleanest
option.
Regards,
Christian.
>
> Aside: This is also why vgem alloates wc memory on x86. It's the least
> common denominator that works. arm unfortunately doesn't allow you to
> allocate wc memory, so there stuff is simply somewhat broken.
> -Daniel
Powered by blists - more mailing lists