Message-ID: <a3d783b4-4d38-c5c1-12d1-80496c1138c0@amd.com>
Date: Fri, 24 Jun 2022 08:54:18 +0200
From: Christian König <christian.koenig@....com>
To: Lucas Stach <l.stach@...gutronix.de>,
Pekka Paalanen <ppaalanen@...il.com>
Cc: "Sharma, Shashank" <Shashank.Sharma@....com>,
lkml <linux-kernel@...r.kernel.org>,
dri-devel <dri-devel@...ts.freedesktop.org>,
Nicolas Dufresne <nicolas@...fresne.ca>,
linaro-mm-sig@...ts.linaro.org,
Sumit Semwal <sumit.semwal@...aro.org>,
linux-media <linux-media@...r.kernel.org>
Subject: Re: DMA-buf and uncached system memory
Am 23.06.22 um 17:26 schrieb Lucas Stach:
> Am Donnerstag, dem 23.06.2022 um 14:52 +0200 schrieb Christian König:
>> Am 23.06.22 um 14:14 schrieb Lucas Stach:
>>> Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
>>>> Am 23.06.22 um 13:29 schrieb Lucas Stach:
>>>> [SNIP]
>>>> I mean, I even had somebody from ARM who told me that this is not going
>>>> to work with our GPUs on a specific SoC. That there are ARM-internal use
>>>> cases which just seem to work because all the devices are non-coherent
>>>> is completely new to me.
>>>>
>>> Yes, trying to hook up a peripheral that assumes cache snooping in some
>>> design details to a non coherent SoC may end up exploding in various
>>> ways. On the other hand you can work around most of those assumptions
>>> by marking the memory as uncached to the CPU, which may tank
>>> performance, but will work from a correctness PoV.
>> Yeah, and exactly that's what I meant with "DMA-buf is not the framework
>> for this".
>>
>> See we do support using uncached/not snooped memory in DMA-buf, but only
>> for the exporter side.
>>
>> For example the AMD and Intel GPUs have a per buffer flag for this.
>>
>> The importer on the other hand needs to be able to handle whatever the
>> exporter provides.
>>
> I fail to construct a case where you want the Vulkan/GL "no domain
> transition" coherent semantic without the allocator knowing about this.
> If you need this and the system is non-snooping, surely the allocator
> will choose uncached memory.
No, it won't. The allocator in the exporter is independent of the importer.
That is an important and intentional design decision, because otherwise
you wouldn't have exporters/importers in the first place, but rather a
centralized allocation pool like what dma-heap implements.
See the purpose of DMA-buf is to expose the buffers in the way the
exporter wants to expose them. So when the exporting driver wants to
allocate normal cached system memory then that is perfectly fine and
completely fits into this design.
Otherwise we would need to adjust all exporters to the importers, which
is potentially not even possible.
> I agree that you absolutely need to fail the usage when someone imports
> a CPU cached buffer and then tries to use it as GL coherent on a non-
> snooping system. That simply will not work.
Exactly that, yes. That's what the attach callback is good for.
See, we already have tons of cases where buffers can't be shared because
they weren't initially allocated in a way the importer can deal with.
But that's perfectly ok and intentional.
For example just take a configuration where a dedicated GPU clones the
display with an integrated GPU. The dedicated GPU needs the image in
local memory for scanout which is usually not accessible to the
integrated GPU.
So either attaching the DMA-buf or creating the KMS framebuffer config
will fail and we are running into the fallback path which involves an
extra copy. And that is perfectly fine and intentional since this
configuration is not supported by the hardware.
>>>> [SNIP]
>>>>>> You can of course use DMA-buf in an incoherent environment, but then you
>>>>>> can't expect that this works all the time.
>>>>>>
>>>>>> This is documented behavior and so far we have bluntly rejected any of
>>>>>> the complaints that it doesn't work on most ARM SoCs, and I don't really
>>>>>> see a way to do this differently.
>>>>> Can you point me to that part of the documentation? A quick grep for
>>>>> "coherent" didn't immediately turn something up within the DMA-buf
>>>>> dirs.
>>>> Search for "cache coherency management". It's quite a while ago, but I
>>>> do remember helping to review that stuff.
>>>>
>>> That only turns up the lines in the DMA_BUF_IOCTL_SYNC doc, which say
>>> the exact opposite of "DMA-buf is always coherent".
>> Sounds like I'm not making clear what I want to say here: For the
>> exporter using cache coherent memory is optional, for the importer it isn't.
>>
>> For the exporter it is perfectly valid to use kmalloc, get_free_page,
>> etc. on its buffers as long as it uses the DMA API to give the
>> importer access to them.
>>
> And here is where our line of thought diverges: the DMA API allows
> snooping and non-snooping devices to work together just fine, as it has
> explicit domain transitions, which are no-ops if both devices are
> snooping, but will do the necessary cache maintenance when one of them
> is non-snooping but the memory is CPU cached.
>
> I don't see why DMA-buf should be any different here. Yes, you cannot
> support the "no domain transition" sharing when the memory is CPU
> cached and one of the devices is non-snooping, but you can support 99%
> of real use-cases like the non-snooped scanout or the UVC video import.
Well, I didn't say we couldn't do it that way. What I'm saying is that it
was intentionally decided against.
We could revisit that decision, but this would mean that all existing
exporters would now need to provide additional functionality.
>> The importer on the other hand needs to be able to deal with that. When
>> this is not the case then the importer somehow needs to work around that.
>>
> Why? The importer maps the dma-buf via dma_buf_map_attachment, which in
> most cases triggers a map via the DMA API on the exporter side. This
> map via the DMA API will already do the right thing in terms of cache
> management, it's just that we explicitly disable it via
> DMA_ATTR_SKIP_CPU_SYNC in DRM because we know that the mapping will be
> cached, which violates the DMA API explicit domain transition anyway.
Why doesn't the importer simply call dma_sync_sg_for_device() as
necessary? See, the importer already knows when it needs to access
the buffer and, as far as I can see, has all the necessary variables to do
the sync.
The exporter on the other hand doesn't know that. So we would need to
transport this information.
Another fundamental problem is that the DMA API isn't designed for
device to device transitions. In other words you have CPU->device and
device->CPU transitions, but not device->device. As far as I can see the
DMA API should already have the necessary information about whether things
like cache flushes are necessary or not.
>> Either by flushing the CPU caches or by rejecting using the imported
>> buffer for this specific use case (like AMD and Intel drivers should be
>> doing).
>>
>> If the Intel or ARM display drivers need non-cached memory and don't
>> reject buffers where they don't know this, then that's certainly a bug in
>> those drivers.
> It's not just display drivers, video codec accelerators and most GPUs
> in this space are also non-snooping. In the ARM SoC world everyone just
> assumes you are non-snooping, which is why things work for most cases
> and only a handful like the UVC video import is broken.
That is really interesting to know, but I still think that DMA-buf was
absolutely not designed for this use case.
From that point of view, the primary use case for this was laptops with
both dedicated and integrated GPUs, webcams etc...
That you have a huge number of ARM-specific devices which can interoperate
among themselves, but not with devices outside of their domain, is not
something foreseen here.
Regards,
Christian.
>> Otherwise we would need to change all DMA-buf exporters to use a special
>> function for allocating non-coherent memory, and that is certainly not
>> going to fly.
>>
>>> I also don't see why you think that both world views are so totally
>>> different. We could just require explicit domain transitions for non-
>>> snoop access, which would probably solve your scanout issue and would
>>> not be a problem for most ARM systems, where we could no-op this if the
>>> buffer is already in uncached memory and at the same time keep the "x86
>>> assumes cached + snooped access by default" semantics.
>> Well, the key point is that we intentionally rejected that design previously
>> because it created all kinds of trouble as well.
>>
> I would really like to know what issues popped up there. Moving the
> dma-buf attachment to work more like a buffer used with the DMA API
> seems like a good thing to me.
>
>> For this limited use case of doing a domain transition right before
>> scanout it might make sense, but that's just one use case.
>>
> The only case I see that we still couldn't support with a change in
> that direction is the GL coherent access to an imported buffer that has
> been allocated from CPU cached memory on a system with non-snooping
> agents. Which to me sounds like a pretty niche use-case, but I would be
> happy to be proven wrong.
>
> Regards,
> Lucas
>