linux-kernel - Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

Message-ID: <aAgJ8g8Gbb06quSM@linux.dev>
Date: Tue, 22 Apr 2025 14:28:18 -0700
From: Oliver Upton <oliver.upton@...ux.dev>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Catalin Marinas <catalin.marinas@....com>,
	Ankit Agrawal <ankita@...dia.com>,
	Sean Christopherson <seanjc@...gle.com>,
	Marc Zyngier <maz@...nel.org>,
	"joey.gouly@....com" <joey.gouly@....com>,
	"suzuki.poulose@....com" <suzuki.poulose@....com>,
	"yuzenghui@...wei.com" <yuzenghui@...wei.com>,
	"will@...nel.org" <will@...nel.org>,
	"ryan.roberts@....com" <ryan.roberts@....com>,
	"shahuang@...hat.com" <shahuang@...hat.com>,
	"lpieralisi@...nel.org" <lpieralisi@...nel.org>,
	"david@...hat.com" <david@...hat.com>,
	Aniket Agashe <aniketa@...dia.com>, Neo Jia <cjia@...dia.com>,
	Kirti Wankhede <kwankhede@...dia.com>,
	"Tarun Gupta (SW-GPU)" <targupta@...dia.com>,
	Vikram Sethi <vsethi@...dia.com>, Andy Currid <acurrid@...dia.com>,
	Alistair Popple <apopple@...dia.com>,
	John Hubbard <jhubbard@...dia.com>, Dan Williams <danw@...dia.com>,
	Zhi Wang <zhiw@...dia.com>, Matt Ochs <mochs@...dia.com>,
	Uday Dhoke <udhoke@...dia.com>, Dheeraj Nigam <dnigam@...dia.com>,
	Krishnakant Jaju <kjaju@...dia.com>,
	"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
	"sebastianene@...gle.com" <sebastianene@...gle.com>,
	"coltonlewis@...gle.com" <coltonlewis@...gle.com>,
	"kevin.tian@...el.com" <kevin.tian@...el.com>,
	"yi.l.liu@...el.com" <yi.l.liu@...el.com>,
	"ardb@...nel.org" <ardb@...nel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"gshan@...hat.com" <gshan@...hat.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"ddutile@...hat.com" <ddutile@...hat.com>,
	"tabba@...gle.com" <tabba@...gle.com>,
	"qperret@...gle.com" <qperret@...gle.com>,
	"kvmarm@...ts.linux.dev" <kvmarm@...ts.linux.dev>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>
Subject: Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using
 VMA flags

On Tue, Apr 22, 2025 at 10:54:52AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 22, 2025 at 12:49:28AM -0700, Oliver Upton wrote:
> > The reality is that userspace is an equal participant in remaining coherent with
> > the guest. Whether or not FWB is employed for a particular region of IPA
> > space is useful information for userspace deciding what it needs to do to access guest
> > memory. Ignoring the Nvidia widget for a second, userspace also needs to know this for
> > 'normal', kernel-managed memory so it understands what CMOs may be necessary when (for
> > example) doing live migration of the VM.
> 
> Really? How does it work today then? Is this another existing problem?
> Userspace is doing CMOs during live migration that are not necessary?

Yes, this is a pre-existing problem. I'm not aware of a live migration
implementation that handles !S2FWB correctly; the ones I know of assume
all guest accesses are done through a cacheable alias.

So, if a VMM wants to do migration of VMs on !S2FWB correctly, it'd
probably want to know it can elide CMOs on something that actually bears
the feature.

> >  - The memslot flag says userspace expects a particular GFN range to guarantee
> >    Write-Back semantics. This can be applied to 'normal', kernel-managed memory
> >    and PFNMAP thingies that have cacheable attributes at host stage-1.
> 
> Userspace doesn't actually know if it has a cachable mapping from VFIO
> though :(

That seems like a shortcoming on the VFIO side, not a KVM issue. What if
userspace wants to do atomics on some VFIO mapping, doesn't it need to
know that it has something with WB?

> I don't really see a point in this. If the KVM has the cap then
> userspace should assume the S2FWB behavior for all cachable memslots.

Wait, so userspace simultaneously doesn't know the cacheability at host
stage-1 but *does* for stage-2? This is why I contend that userspace
needs a mechanism to discover the memory attributes on a given memslot.
Without it there's no way of knowing what's a cacheable memslot.

Along those lines, how is the VMM going to describe that cacheable
PFNMAP region to the guest?

> What should happen if you have S2FWB but don't pass the flag? For
> normal kernel memory it should still use S2FWB. Thus for cachable
> PFNMAP it makes sense that it should also still use S2FWB without the
> flag?

For kernel-managed memory, I agree. Accepting the flag for a memslot
containing such memory would solely be for discoverability.

OTOH, cacheable PFNMAP is a new feature and I see no issue with
requiring a new bit in order to use it.

On Tue, Apr 22, 2025 at 02:03:24PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 22, 2025 at 05:50:32PM +0100, Catalin Marinas wrote:
> 
> > So, for the above, the VMM needs to know that it somehow got into such
> > situation. If it knows the device (VFIO) capabilities and that the user
> > mapping is Cacheable, coupled with the new KVM CAP, it can infer that
> > Stage 2 will be S2FWB, no need for a memory slot flag.
> 
> So long as the memslot creation fails for cachable PFNMAP without
> S2FWB the VMM is fine. qemu will begin its first steps to startup the
> migration destination and immediately fail. The migration will be
> aborted before it even gets started on the source side.
> 
> As I said before, the present situation requires the site's
> orchestration to manage compatibility for live migration of VFIO
> devices. We only expect that the migration will abort early if the
> site has made a configuration error.
> 
> > have such information, maybe a new memory slot flag can be used to probe
> > what Stage 2 mapping is going to be: ask for KVM_MEM_PFNMAP_WB. If it
> > fails, Stage 2 is Device/NC and can attempt again with the WB flag.
> > It's a bit of a stretch for the KVM API but IIUC there's no option to
> > query the properties of a memory slot.
> 
> I don't know of any use case for something like this. If VFIO gives
> the VMM a cachable mapping there is no fallback to WB.
> 
> The operator could use a different VFIO device, one that doesn't need
> cachable, but the VMM can't flip the VFIO device between modes on the
> fly.

I agree with you that in the context of a VFIO device userspace doesn't
have any direct influence on the resulting memory attributes.

The entire reason I'm dragging my feet about this is I'm concerned we've
papered over the complexity of memory attributes (regardless of
provenance) for way too long. KVM's done enough to make this dance 'work'
in the context of kernel-managed memory, but adding more implicit KVM
behavior for cacheable thingies makes the KVM UAPI even more
unintelligible (as if it weren't already).

So this flag isn't about giving userspace any degree of control over
memory attributes. It's just a way for userspace to know that the things
it _expects_ to be treated as cacheable are guaranteed to use cacheable
attributes in the VM.
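
As a rough illustration only (this is not code from the patch or from
KVM), the flag semantics argued for above could be modelled like this.
Every name in the sketch is hypothetical: resolve_s2_attr(), struct
slot_req, its wb_flag field and the s2_attr values are all made up. The
sketch also assumes that a write-back guarantee requires both S2FWB and
a cacheable host stage-1 mapping, and that setting the flag on
kernel-managed memory changes nothing except that the request can fail.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model only: none of these names exist in the KVM UAPI. */
enum s2_attr {
	S2_DEVICE_OR_NC,	/* Device or Normal Non-Cacheable at stage-2 */
	S2_WRITE_BACK,		/* Normal Write-Back at stage-2 */
};

struct slot_req {
	bool pfnmap;		/* PFNMAP VMA rather than kernel-managed memory */
	bool host_cacheable;	/* host stage-1 mapping is cacheable */
	bool wb_flag;		/* userspace set the (hypothetical) WB memslot flag */
};

/*
 * Returns 0 and fills *attr on success, or -EINVAL when the write-back
 * expectation expressed by the flag cannot be guaranteed.
 */
static int resolve_s2_attr(const struct slot_req *req, bool has_s2fwb,
			   enum s2_attr *attr)
{
	if (req->wb_flag) {
		/* Assumption: the guarantee needs FWB and a cacheable host mapping. */
		if (!has_s2fwb || !req->host_cacheable)
			return -EINVAL;
		*attr = S2_WRITE_BACK;
		return 0;
	}

	/* No flag: kernel-managed memory keeps its cacheable mapping as before. */
	if (!req->pfnmap) {
		*attr = S2_WRITE_BACK;
		return 0;
	}

	/* No flag on PFNMAP: cacheable PFNMAP is opt-in, so stay Device/NC. */
	*attr = S2_DEVICE_OR_NC;
	return 0;
}
```

In this model the flag never selects an attribute that userspace could
not otherwise get. It only turns a silent mismatch into an error at
memslot creation time.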

Thanks,
Oliver