linux-kernel - Re: [RFC PATCH 2/2] drm/ttm: downgrade cached to write

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <b9189c97f7efbaa895198113ee5b47012bd8b4dc.camel@icenowy.me>
Date: Tue, 02 Jul 2024 17:06:58 +0800
From: Icenowy Zheng <uwu@...nowy.me>
To: Christian König <christian.koenig@....com>, Jiaxun
 Yang <jiaxun.yang@...goat.com>, Huang Rui <ray.huang@....com>, Maarten
 Lankhorst <maarten.lankhorst@...ux.intel.com>, Maxime Ripard
 <mripard@...nel.org>,  Thomas Zimmermann <tzimmermann@...e.de>, David
 Airlie <airlied@...il.com>, Daniel Vetter <daniel@...ll.ch>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 2/2] drm/ttm: downgrade cached to write_combined
 when snooping not available

在 2024-07-02星期二的 10:08 +0200，Christian König写道：
> Am 02.07.24 um 03:46 schrieb Icenowy Zheng:
> > 在 2024-07-01星期一的 13:40 +0200，Christian König写道：
> > > Am 29.06.24 um 22:51 schrieb Icenowy Zheng:
> > > > 于 2024年6月30日 GMT+08:00 03:57:47，Jiaxun Yang
> > > > <jiaxun.yang@...goat.com>  写道：
> > > > > 在2024年6月29日六月 上午6:22，Icenowy Zheng写道：
> > > > > [...]
> > > > > > @@ -302,6 +302,10 @@ pgprot_t ttm_io_prot(struct
> > > > > > ttm_buffer_object *bo,
> > > > > > struct ttm_resource *res,
> > > > > >                  caching = res->bus.caching;
> > > > > >          }
> > > > > > 
> > > > > > +       /* Downgrade cached mapping for non-snooping
> > > > > > devices */
> > > > > > +       if (!bo->bdev->dma_coherent && caching ==
> > > > > > ttm_cached)
> > > > > > +               caching = ttm_write_combined;
> > > > > Hi Icenowy,
> > > > > 
> > > > > Thanks for your patch! You saved many non-coh PCIe host
> > > > > implementations a day!.
> > > Ah, wait a second.
> > > 
> > > Such a thing as non-coherent PCIe implementation doesn't exist.
> > > The
> > > PCIe
> > > specification makes it mandatory for memory access to be cache
> > > coherent.
> > Really? I tried to get PCIe spec 2.0, PCI spec 3.0 and PCI-X
> > addendum
> > 1.0, none of this explicitly requires the PCIe controller and the
> > CPU
> > being fully coherent. The PCI-X spec even says "Note that PCI-X,
> > like
> > conventional PCI, does not require systems to support coherent
> > caches
> > for addresses accessed by PCI-X requesters".
> 
> See the very first PCI specification, AGP 2.0 and the PCIe extension
> for 
> non-snooped access.
> 
> Originally it wasn't well defined what the PCI 1.0 spec meant with 
> coherency (e.g. snooping vs uncached).
> 

I think the word in PCI-X addendum could be understood as kind of
clarification, which makes at least PCI-X (and maybe retrospectively
PCI) coherency optional.

> AGP was the first specification which explicitly defined that all AGP
> memory accesses must be non-snooped and all PCI accesses must snoop
> the 
> CPU caches.

However I don't think the definition of the AGP spec could apply on all
PCI(e) implementations. The AGP spec itself don't apply on
implementations that do not implement AGP (which is the most PCI(e)
implementations today), and it's not in the reference list of the PCIe
spec, so it does no help on this context.

> 
> PCIe then had an extension which defined the "No Snooping Attribute"
> to 
> allow emulating the AGP behavior.
> 
> For the current PCIe 6.1 specification the non-snoop extension was 
> merged into the base specification.
> 
> Here see section "2.2.6.5 No Snoop Attribute", e.g. "Hardware
> enforced 
> cache coherency expected"
> 
> As well as the notes under section 7.5.3.4 Device Control Register:
> 
> Enable No Snoop - If this bit is Set, the Function is permitted to
> Set 
> the No Snoop bit in the Requester
> Attributes of transactions it initiates that do not require hardware 
> enforced cache coherency (see Section 2.2.6.5 ).
> 
> To summarize it: Not snooping caches is an extension, snooping caches
> is 
> mandatory.

I don't get any such new PCIe specifications, but in Section 2.2.6.3 of
PCIe specification 2.0, it declares these attribute bits as "hints"
"Level of support is dependent on target applications", and suggests to
refer to PCI-X 2.0 (which is the document that I referenced the
sentence originally from).

This makes it reasonable that No-Snoop bit is only meaningful in a
coherent implementation, and it becomes just no-op for non-coherent
implementations. If PCIe specification really requires coherent access
by default, this should be mentioned in other parts of the spec,
instead of only get referred by the No-Snoop bit definition.

> 
> > In addition, in the perspective of Linux, I think bypassing CPU
> > cache
> > of shared memory is considered as coherent access too, see
> > dma_alloc_coherent() function's naming.
> 
> Yes that's correct, but this is for platform devices. E.g. other I/O 
> from drivers who doesn't need to work with malloced system memory for
> example.
> 
> We have quite a bunch of V4L, sound and I also think network devices 
> which work like that. But those are non-PCI devices.

I think in the Linux kernel most drivers (of course including PCI ones)
use DMA buffer in this way, which makes them natually compatible with
non-coherent PCIe implementations. TTM is one of the few exceptions
here.

> 
> > > There are a bunch of non-compliant PCIe implementations which
> > > have
> > > broken cache coherency, but those explicitly violate the
> > > specification
> > > and because of that are not supported.
> > Regardless of it violating the spec or not, these devices work with
> > Linux subsystems that use dma_alloc_coherent to allocate DMA
> > buffers
> > (which is the most common case), and GPU drivers just give out
> > cryptic
> > error messages like "ring gfx test failed" without any mention of
> > coherency issues at all, which makes the fact that Linux DRM/TTM
> > subsystem currently requires PCIe snooping CPU cache more obscure.
> 
> No, they don't even remotely work. You just got very basic tests
> working.
> 
> Both the Vulkan as well as the OpenGL specification require that you
> can 
> import "normal" malloced() system memory into the GPU driver.

I am not familiar with Vulkan, however in the OpenGL context I don't
think just importing a memory buffer into the GPU is supported. Most
buffers of OpenGL is declared and allocated on the GPU part, and have
data copied to by either using specific GL functions (e.g. glTexImage*)
or have the GPU side buffer mapped to CPU (with functions like
glMapBuffer). In this case the memory is surely not "normal malloc()ed
memory", but memory originated from the GPU driver.

To map arbitary CPU-allocated buffers, not only the GPU should be able
to fully snoop the CPU cache, but the GPU MMU should have a equal or
smaller page size than the CPU MMU (although in real world cases the
GPU page size is usually 4K and the CPU page size is usually >=4K,
which makes this not a big problem, but still not well considered).

In fact, in my daily jobs, I met the situation that the library of some
peripheral (a HW decoder) just passed an arbitary user-space pointer to
the application without any hardware address info (its internal
implementation seems to use dma-buf, but I get no interface for
acquiring the dma-buf information), and I failed to do a zero-copy
optimization of this just because I am not able to import this user-
space pointer to the GPU driver because of no known API importing this
arbitary pointer without further device memory info.

> 
> This is not possible without a cache coherent platform architecture.
> So 
> you can't fully support those APIs.

In this case I think the coherency status could be exported to the
userspace, and the userspace side driver could disable certain APIs (I
don't think anything in the core GL 2.1 spec should be affected).

In addition, in the case of AMD GPUs, disabling KFD for non-coherent
devices sounds viable.

> 
> We exercised this quite extensively already and even have a
> confirmation 
> from ARM engineers that the approach of attaching just any PCIe root
> to 
> an ARM IP core is not supported from their side.

This kind of problem should be considered as a kind of platform
implementation, not getting support from just the CPU IP core vendor
isn't a surprise; this do not prevent ARM SoC vendors from implementing
PCIe, from low-end ones such as Allwinner to high-end ones such as you
AMD.

> And if I'm not completely mistaken the RISC-V specification was also 
> updated to disallow stuff like this.
> 
> So yes you can have boards which implement non-snooped PCIe, but you
> get 
> exactly zero support from hardware vendors to run software on it.
> 

It's a quite usual case for free softwares to get no support from
hardware vendors, and some of them are even developed by reverse
engineering. I don't think it as a blocker for the Linux kernel to
merge as many hardwares' support as possible.

> Regards,
> Christian.
> 
> > > Regards,
> > > Christian.
> > > 
> > > > > Unfortunately I don't think we can safely ttm_cached to
> > > > > ttm_write_comnined, we've
> > > > > had enough drama with write combine behaviour on all
> > > > > different
> > > > > platforms.
> > > > > 
> > > > > See drm_arch_can_wc_memory in drm_cache.h.
> > > > > 
> > > > Yes this really sounds like an issue.
> > > > 
> > > > Maybe the behavior of ttm_write_combined should furtherly be
> > > > decided
> > > > by drm_arch_can_wc_memory() in case of quirks?
> > > > 
> > > > > Thanks
> > > > > 
> > > > > > +
> > > > > >          return ttm_prot_from_caching(caching, tmp);
> > > > > >    }
> > > > > >    EXPORT_SYMBOL(ttm_io_prot);
> > > > > > diff --git a/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > b/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > index 7b00ddf0ce49f..3335df45fba5e 100644
> > > > > > --- a/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > +++ b/drivers/gpu/drm/ttm/ttm_tt.c
> > > > > > @@ -152,6 +152,10 @@ static void ttm_tt_init_fields(struct
> > > > > > ttm_tt *ttm,
> > > > > >                                 enum ttm_caching caching,
> > > > > >                                 unsigned long extra_pages)
> > > > > >    {
> > > > > > +       /* Downgrade cached mapping for non-snooping
> > > > > > devices */
> > > > > > +       if (!bo->bdev->dma_coherent && caching ==
> > > > > > ttm_cached)
> > > > > > +               caching = ttm_write_combined;
> > > > > > +
> > > > > >          ttm->num_pages = (PAGE_ALIGN(bo->base.size) >>
> > > > > > PAGE_SHIFT) + extra_pages;
> > > > > >          ttm->page_flags = page_flags;
> > > > > >          ttm->dma_address = NULL;
> > > > > > diff --git a/include/drm/ttm/ttm_caching.h
> > > > > > b/include/drm/ttm/ttm_caching.h
> > > > > > index a18f43e93abab..f92d7911f50e4 100644
> > > > > > --- a/include/drm/ttm/ttm_caching.h
> > > > > > +++ b/include/drm/ttm/ttm_caching.h
> > > > > > @@ -47,7 +47,8 @@ enum ttm_caching {
> > > > > > 
> > > > > >          /**
> > > > > >           * @ttm_cached: Fully cached like normal system
> > > > > > memory,
> > > > > > requires that
> > > > > > -        * devices snoop the CPU cache on accesses.
> > > > > > +        * devices snoop the CPU cache on accesses.
> > > > > > Downgraded
> > > > > > to
> > > > > > +        * ttm_write_combined when the snooping capaiblity
> > > > > > is
> > > > > > missing.
> > > > > >           */
> > > > > >          ttm_cached
> > > > > >    };
> > > > > > -- 
> > > > > > 2.45.2