linux-kernel - Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

Message-ID: <20250422233556.GB1648741@nvidia.com>
Date: Tue, 22 Apr 2025 20:35:56 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Oliver Upton <oliver.upton@...ux.dev>
Cc: Catalin Marinas <catalin.marinas@....com>,
	Ankit Agrawal <ankita@...dia.com>,
	Sean Christopherson <seanjc@...gle.com>,
	Marc Zyngier <maz@...nel.org>,
	"joey.gouly@....com" <joey.gouly@....com>,
	"suzuki.poulose@....com" <suzuki.poulose@....com>,
	"yuzenghui@...wei.com" <yuzenghui@...wei.com>,
	"will@...nel.org" <will@...nel.org>,
	"ryan.roberts@....com" <ryan.roberts@....com>,
	"shahuang@...hat.com" <shahuang@...hat.com>,
	"lpieralisi@...nel.org" <lpieralisi@...nel.org>,
	"david@...hat.com" <david@...hat.com>,
	Aniket Agashe <aniketa@...dia.com>, Neo Jia <cjia@...dia.com>,
	Kirti Wankhede <kwankhede@...dia.com>,
	"Tarun Gupta (SW-GPU)" <targupta@...dia.com>,
	Vikram Sethi <vsethi@...dia.com>, Andy Currid <acurrid@...dia.com>,
	Alistair Popple <apopple@...dia.com>,
	John Hubbard <jhubbard@...dia.com>, Dan Williams <danw@...dia.com>,
	Zhi Wang <zhiw@...dia.com>, Matt Ochs <mochs@...dia.com>,
	Uday Dhoke <udhoke@...dia.com>, Dheeraj Nigam <dnigam@...dia.com>,
	Krishnakant Jaju <kjaju@...dia.com>,
	"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
	"sebastianene@...gle.com" <sebastianene@...gle.com>,
	"coltonlewis@...gle.com" <coltonlewis@...gle.com>,
	"kevin.tian@...el.com" <kevin.tian@...el.com>,
	"yi.l.liu@...el.com" <yi.l.liu@...el.com>,
	"ardb@...nel.org" <ardb@...nel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"gshan@...hat.com" <gshan@...hat.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"ddutile@...hat.com" <ddutile@...hat.com>,
	"tabba@...gle.com" <tabba@...gle.com>,
	"qperret@...gle.com" <qperret@...gle.com>,
	"kvmarm@...ts.linux.dev" <kvmarm@...ts.linux.dev>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>
Subject: Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using
 VMA flags

On Tue, Apr 22, 2025 at 02:28:18PM -0700, Oliver Upton wrote:
> So, if a VMM wants to do migration of VMs on !S2FWB correctly, it'd
> probably want to know it can elide CMOs on something that actually bears
> the feature.

OK

> > >  - The memslot flag says userspace expects a particular GFN range to guarantee
> > >    Write-Back semantics. This can be applied to 'normal', kernel-managed memory
> > >    and PFNMAP thingies that have cacheable attributes at host stage-1.
> > 
> > Userspace doesn't actually know if it has a cachable mapping from VFIO
> > though :(
> 
> That seems like a shortcoming on the VFIO side, not a KVM issue. What if
> userspace wants to do atomics on some VFIO mapping, doesn't it need to
> know that it has something with WB?

VFIO is almost always uncachable. So far there is only one device
merged, and CXL is coming, that would even support/require cachable.

VFIO always had this sort of programming model where the userspace
needs to know details about the device it is using so it hasn't really
been an issue so far that the kernel doesn't tell the userspace what
cachability it got. We could add it but now we are adding new kernel
code, qemu code and kvm code that is actually pretty pointless.

> > I don't really see a point in this. If the KVM has the cap then
> > userspace should assume the S2FWB behavior for all cachable memslots.
> 
> Wait, so userspace simultaneously doesn't know the cacheability at host
> stage-1 but *does* for stage-2? 

No, it doesn't know either. The point is the VMM doesn't care about
any of this. It just wants to connect KVM to VFIO and have the kernel
internally negotiate the details of how that works.

There is zero value in the VMM being aware that KVM/VFIO is using
cachable or non-cachable mappings because it will never touch this
memory anyhow, and arguably it would be happier if it wasn't even in a
VMA in the first place.

> This is why I contend that userspace needs a mechanism to discover
> the memory attributes on a given memslot.  Without it there's no way
> of knowing what's a cacheable memslot.

If it cares about this then it should know by virtue of having put a
cachable VMA into the memslot.

> Along those lines, how is the VMM going to describe that cacheable
> PFNMAP region to the guest?

Heh. It creates a virtual PCI device in the guest and the space is
mapped to a virtual BAR. When the guest driver binds to this device it
will map the virtual BAR as cachable instead of as IO because it knows
to do that based on the vPCI device ID.
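
As a rough illustration of that last step, here is a minimal sketch of a
guest driver picking the BAR mapping type from the device ID. Everything
in it is an assumption made for illustration: the enum, the ID table,
the made-up vendor/device pair and the helper name pick_bar_mapping()
are hypothetical and are not taken from any real driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Mapping types a guest driver could choose for a BAR (illustrative names). */
enum bar_map_type { BAR_MAP_IO_UNCACHED, BAR_MAP_CACHEABLE };

struct vpci_id {
	uint16_t vendor;
	uint16_t device;
};

/* Hypothetical IDs of virtual devices whose BAR is backed by cacheable memory. */
static const struct vpci_id cacheable_bar_ids[] = {
	{ 0x1234, 0xabcd },	/* made-up ID, not a real device */
};

/* Default to an uncached IO mapping unless the device ID is on the list. */
static enum bar_map_type pick_bar_mapping(uint16_t vendor, uint16_t device)
{
	size_t n = sizeof(cacheable_bar_ids) / sizeof(cacheable_bar_ids[0]);

	for (size_t i = 0; i < n; i++) {
		if (cacheable_bar_ids[i].vendor == vendor &&
		    cacheable_bar_ids[i].device == device)
			return BAR_MAP_CACHEABLE;
	}
	return BAR_MAP_IO_UNCACHED;
}
```

The point of the sketch is only that the knowledge lives in the guest
driver, keyed off the device ID, and not in the VMM.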

If something goes weird and the guest tries to use UC instead of
cachable the S2FWB will block it and the guest will probably
malfunction.
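
A simplified model of why the guest's choice is overridden, assuming
only three memory types and a stage 2 entry that either forces
Write-Back (the S2FWB case) or falls back to taking the weaker of the
two stages. The type names and combine_attrs() are hypothetical; the
real architectural rules have more cases than this.

```c
/* Ordered from weakest to strongest cacheability (simplified). */
enum mem_type { MT_DEVICE = 0, MT_NORMAL_NC = 1, MT_NORMAL_WB = 2 };

/*
 * s1: attribute the guest requested at stage 1
 * s2: attribute the host set at stage 2
 * s2_forces_wb: non-zero if stage 2 uses the S2FWB force-Write-Back encoding
 */
static enum mem_type combine_attrs(enum mem_type s1, enum mem_type s2,
				   int s2_forces_wb)
{
	if (s2_forces_wb)
		return MT_NORMAL_WB;	/* guest's stage 1 choice is ignored */

	/* Without the override, the weaker attribute wins. */
	return s1 < s2 ? s1 : s2;
}
```

In this model a guest asking for MT_NORMAL_NC still ends up with
MT_NORMAL_WB when s2_forces_wb is set, which is the "block" described
above.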

CXL will probably have the VMM understand things a bit more and will
generate the various ACPI tables CXL uses.

> > What should happen if you have S2FWB but don't pass the flag? For
> > normal kernel memory it should still use S2FWB. Thus for cachable
> > PFNMAP it makes sense that it should also still use S2FWB without the
> > flag?
> 
> For kernel-managed memory, I agree. Accepting the flag for a memslot
> containing such memory would solely be for discoverability.
> 
> OTOH, cacheable PFNMAP is a new feature and I see no issue compelling
> the use of a new bit with it.

Feels weird to me. Now the VMM has to discover if the KVM supports the
new flag and only use it on new KVM versions just to accomplish...
nothing?

> The entire reason I'm dragging my feet about this is I'm concerned we've
> papered over the complexity of memory attributes (regardless of
> provenance) for way too long. KVM's done enough to make this dance 'work'
> in the context of kernel-managed memory, but adding more implicit KVM
> behavior for cacheable thingies makes the KVM UAPI even more
> unintelligible (as if it weren't already).

It is very complex. I'm not sure adding a flag in this case is making
it any simpler though.

If you had a flag from day 0 that said 'this is an MMIO mapping do MMIO
stuff' that would be a lot clearer about how it should be used.

But here we have a flag that doesn't seem well defined. When should
the VMM set this flag?

> So this flag isn't about giving userspace any degree of control over
> memory attributes. Just a way to know for things it _expects_ to be
> treated as cacheable can be guaranteed to use cacheable attributes in
> the VM.

Can you get some agreement with Sean? He seems very strongly opposed to
this direction. 

Jason