Message-ID: <4ea37684-5dda-94e4-a544-74d3812e8d9d@amd.com>
Date:   Thu, 23 Jun 2022 14:52:26 +0200
From:   Christian König <christian.koenig@....com>
To:     Lucas Stach <l.stach@...gutronix.de>,
        Pekka Paalanen <ppaalanen@...il.com>
Cc:     "Sharma, Shashank" <Shashank.Sharma@....com>,
        lkml <linux-kernel@...r.kernel.org>,
        dri-devel <dri-devel@...ts.freedesktop.org>,
        Nicolas Dufresne <nicolas@...fresne.ca>,
        linaro-mm-sig@...ts.linaro.org,
        Sumit Semwal <sumit.semwal@...aro.org>,
        linux-media <linux-media@...r.kernel.org>
Subject: Re: DMA-buf and uncached system memory

On 23.06.22 at 14:14, Lucas Stach wrote:
> On Thursday, 23.06.2022 at 13:54 +0200, Christian König wrote:
>> On 23.06.22 at 13:29, Lucas Stach wrote:
>> [SNIP]
>> I mean I even had somebody from ARM who told me that this is not going
>> to work with our GPUs on a specific SoC. That there are ARM internal use
>> cases which just seem to work because all the devices are non-coherent
>> is completely new to me.
>>
> Yes, trying to hook up a peripheral that assumes cache snooping in some
> design details to a non coherent SoC may end up exploding in various
> ways. On the other hand you can work around most of those assumptions
> by marking the memory as uncached to the CPU, which may tank
> performance, but will work from a correctness PoV.

Yeah, and that's exactly what I meant by "DMA-buf is not the framework
for this".

See, we do support using uncached/non-snooped memory in DMA-buf, but only
on the exporter side.

For example, the AMD and Intel GPUs have a per-buffer flag for this.

The importer on the other hand needs to be able to handle whatever the 
exporter provides.

>> [SNIP]
>>> Non coherent access, including your non-snoop scanout, and no domain
>>> transition signal just doesn't go together when you want to solve
>>> things in a generic way.
>> Yeah, that's the stuff I totally agree on.
>>
>> See we absolutely do have the requirement of implementing coherent
>> access without domain transitions for Vulkan and OpenGL+extensions.
>>
> Coherent can mean 2 different things:
> 1. CPU cached with snooping from the IO device
> 2. CPU uncached
>
> The Vulkan and GL "coherent" uses are really coherent without explicit
> domain transitions, so on non coherent arches that require the
> transitions the only way to implement this is by making the memory CPU
> uncached. Which from a performance PoV will probably not be what app
> developers expect, but will still expose the correct behavior.

Quite a bummer for performance, but yes, that should work.

>>> Remember that in a fully (not only IO) coherent system the CPU isn't
>>> the only agent that may cache the content you are trying to access
>>> here. The dirty cacheline could reasonably still be sitting in a GPU or
>>> VPU cache, so you need some way to clean those cachelines, which isn't
>>> a magic "importer knows how to call CPU cache clean instructions".
>> IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I
>> need to double check as well, that's way too long ago) this was kicked
>> out because of the requirements above.
>>
> The DMA_BUF_IOCTL_SYNC is available in upstream, with the explicit
> documentation that "userspace can not rely on coherent access".

Yeah, double checked that as well. This is for the coherency case on the 
exporter side.
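
As a rough userspace sketch of the ioctl discussed above: CPU access to an
mmap()ed DMA-buf is bracketed by DMA_BUF_IOCTL_SYNC calls carrying
DMA_BUF_SYNC_START and DMA_BUF_SYNC_END. The ioctl, struct dma_buf_sync and
the flag names come from <linux/dma-buf.h>. The helper names
dmabuf_cpu_access_begin() and dmabuf_cpu_access_end() are made up for
illustration, and dmabuf_fd is assumed to be a DMA-buf file descriptor
obtained from some exporter.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Issue one DMA_BUF_IOCTL_SYNC call, restarting it if it was interrupted. */
static int dmabuf_sync(int dmabuf_fd, uint64_t flags)
{
	struct dma_buf_sync sync = { .flags = flags };
	int ret;

	do {
		ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
	} while (ret == -1 && (errno == EINTR || errno == EAGAIN));

	return ret;
}

/*
 * Hypothetical helper: announce the start of a CPU access.
 * rw is DMA_BUF_SYNC_READ, DMA_BUF_SYNC_WRITE or DMA_BUF_SYNC_RW.
 * Returns 0 on success, -1 with errno set on failure.
 */
int dmabuf_cpu_access_begin(int dmabuf_fd, uint64_t rw)
{
	assert((rw & ~(uint64_t)DMA_BUF_SYNC_RW) == 0);
	return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_START | rw);
}

/* Hypothetical helper: announce the end of the same CPU access. */
int dmabuf_cpu_access_end(int dmabuf_fd, uint64_t rw)
{
	assert((rw & ~(uint64_t)DMA_BUF_SYNC_RW) == 0);
	return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_END | rw);
}
```

A caller would invoke dmabuf_cpu_access_begin(fd, DMA_BUF_SYNC_WRITE), write
through the mapping, and then invoke dmabuf_cpu_access_end(fd,
DMA_BUF_SYNC_WRITE). Which cache maintenance, if any, happens behind those
calls is up to the exporter.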

>>>> You can of course use DMA-buf in an incoherent environment, but then you
>>>> can't expect that this works all the time.
>>>>
>>>> This is documented behavior and so far we have bluntly rejected any of
>>>> the complaints that it doesn't work on most ARM SoCs and I don't really
>>>> see a way to do this differently.
>>> Can you point me to that part of the documentation? A quick grep for
>>> "coherent" didn't immediately turn something up within the DMA-buf
>>> dirs.
>> Search for "cache coherency management". It's quite a while ago, but I
>> do remember helping to review that stuff.
>>
> That only turns up the lines in the DMA_BUF_IOCTL_SYNC doc, which say
> the exact opposite of "DMA-buf is always coherent".

Sounds like I'm not making clear what I want to say here: for the
exporter, using cache-coherent memory is optional; for the importer, it isn't.

For the exporter it is perfectly valid to use kmalloc, get_free_page,
etc. for its buffers, as long as it uses the DMA API to give the
importer access to them.

The importer, on the other hand, needs to be able to deal with that. If
it can't, the importer somehow needs to work around that.

Either by flushing the CPU caches or by refusing to use the imported
buffer for this specific use case (like AMD and Intel drivers should be
doing).

If the Intel or ARM display drivers need non-cached memory and don't
reject buffers for which they don't know this, then that's certainly a
bug in those drivers.

Otherwise we would need to change all DMA-buf exporters to use a special
function for allocating non-coherent memory, and that is certainly not
going to fly.
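
To make the importer-side rule above concrete, here is a small standalone
model in plain C. It is not kernel code: every type and function name in it
(cpu_mapping, importer_use, importer_decide and so on) is hypothetical, and
it only encodes the three outcomes described above, namely use the buffer as
is, flush the CPU caches first, or reject the buffer for this use case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: how the exporter's backing memory is mapped for the CPU. */
enum cpu_mapping { CPU_CACHED, CPU_UNCACHED };

/* Hypothetical: what the importer does with a buffer for one use case. */
enum import_action {
	IMPORT_OK,              /* use the buffer as is */
	IMPORT_FLUSH_CPU_CACHE, /* use it after flushing the CPU caches */
	IMPORT_REJECT,          /* refuse the buffer for this use case */
};

/* Hypothetical: properties of one importer use case, e.g. scanout. */
struct importer_use {
	bool snoops_cpu_cache;    /* device access snoops the CPU caches */
	bool can_flush_cpu_cache; /* importer can do the flush itself */
};

/*
 * The importer has to cope with whatever the exporter provides: either the
 * access is safe, or the importer works around it, or it says no.
 */
enum import_action importer_decide(enum cpu_mapping map,
				   const struct importer_use *use)
{
	assert(use != NULL);

	/* Uncached memory or a snooping device: no stale cache lines. */
	if (map == CPU_UNCACHED || use->snoops_cpu_cache)
		return IMPORT_OK;

	/* Cached memory, non-snooping access: flush if that is possible. */
	if (use->can_flush_cpu_cache)
		return IMPORT_FLUSH_CPU_CACHE;

	/* Otherwise reject rather than silently scan out stale data. */
	return IMPORT_REJECT;
}
```

In this model a non-snooping use that cannot flush is rejected for
CPU-cached memory but accepted for uncached memory, so the exporter stays
free to allocate its memory however it likes.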

> I also don't see why you think that both world views are so totally
> different. We could just require explicit domain transitions for non-
> snoop access, which would probably solve your scanout issue and would
> not be a problem for most ARM systems, where we could no-op this if the
> buffer is already in uncached memory and at the same time keep the "x86
> assumes cached + snooped access by default" semantics.

Well, the key point is that we intentionally rejected that design previously
because it created all kinds of trouble as well.

For this limited use case of doing a domain transition right before 
scanout it might make sense, but that's just one use case.

Regards,
Christian.

>
> Regards,
> Lucas
>
>
