linux-kernel - Re: [PATCH v2 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags


Message-ID: <20250117140050.GC5556@nvidia.com>
Date: Fri, 17 Jan 2025 10:00:50 -0400
From: Jason Gunthorpe <jgg@...dia.com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: Ankit Agrawal <ankita@...dia.com>, David Hildenbrand <david@...hat.com>,
	"maz@...nel.org" <maz@...nel.org>,
	"oliver.upton@...ux.dev" <oliver.upton@...ux.dev>,
	"joey.gouly@....com" <joey.gouly@....com>,
	"suzuki.poulose@....com" <suzuki.poulose@....com>,
	"yuzenghui@...wei.com" <yuzenghui@...wei.com>,
	"will@...nel.org" <will@...nel.org>,
	"ryan.roberts@....com" <ryan.roberts@....com>,
	"shahuang@...hat.com" <shahuang@...hat.com>,
	"lpieralisi@...nel.org" <lpieralisi@...nel.org>,
	Aniket Agashe <aniketa@...dia.com>, Neo Jia <cjia@...dia.com>,
	Kirti Wankhede <kwankhede@...dia.com>,
	"Tarun Gupta (SW-GPU)" <targupta@...dia.com>,
	Vikram Sethi <vsethi@...dia.com>, Andy Currid <acurrid@...dia.com>,
	Alistair Popple <apopple@...dia.com>,
	John Hubbard <jhubbard@...dia.com>, Dan Williams <danw@...dia.com>,
	Zhi Wang <zhiw@...dia.com>, Matt Ochs <mochs@...dia.com>,
	Uday Dhoke <udhoke@...dia.com>, Dheeraj Nigam <dnigam@...dia.com>,
	"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
	"sebastianene@...gle.com" <sebastianene@...gle.com>,
	"coltonlewis@...gle.com" <coltonlewis@...gle.com>,
	"kevin.tian@...el.com" <kevin.tian@...el.com>,
	"yi.l.liu@...el.com" <yi.l.liu@...el.com>,
	"ardb@...nel.org" <ardb@...nel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"gshan@...hat.com" <gshan@...hat.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"kvmarm@...ts.linux.dev" <kvmarm@...ts.linux.dev>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>
Subject: Re: [PATCH v2 1/1] KVM: arm64: Allow cacheable stage 2 mapping using
 VMA flags

On Thu, Jan 16, 2025 at 10:28:48PM +0000, Catalin Marinas wrote:

> Basically I don't care whether MTE is supported on such vma, I doubt
> you'd want to enable MTE anyway. But the way MTE was designed in the Arm
> architecture, prior to FEAT_MTE_PERM, it allows a guest to enable MTE at
> Stage 1 when Stage 2 is Normal WB Cacheable. We have no idea what enable
> MTE at Stage 1 means if the memory range doesn't support it.

I'm reading Aneesh's cover letter (Add support for NoTagAccess memory
attribute) and it seems like we already have exactly the behavior we
want. If MTE is enabled in the KVM then memory types, like from VFIO,
are not permitted - this looks like it happens during memslot
creation, not in the fault handler.

So I think at this point Ankit's series should rely on that. We never
get a fault on a PFNMAP VMA in an MTE-enabled KVM in the first place
because you can't even create a memslot.
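
To make that ordering concrete, here is a userspace toy model of the
check as I read it from the cover letter. It is only a sketch: every
toy_* name and the TOY_VM_MTE_ALLOWED bit are hypothetical stand-ins,
not the kernel's actual definitions or code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a "this VMA supports MTE tags" flag. */
#define TOY_VM_MTE_ALLOWED 0x2u

struct toy_vm  { bool mte_enabled; };   /* toy model of a VM */
struct toy_vma { unsigned int flags; }; /* toy model of a VMA */

/*
 * Toy memslot creation check: an MTE-enabled VM refuses any VMA that
 * cannot carry tags (e.g. a VFIO PFNMAP VMA), so the fault handler
 * never sees such a VMA at all.
 */
static int toy_prepare_memslot(const struct toy_vm *vm,
                               const struct toy_vma *vma)
{
    if (vm->mte_enabled && !(vma->flags & TOY_VM_MTE_ALLOWED))
        return -EINVAL;
    return 0;
}
```

The point of the model is only that the rejection happens at memslot
creation time, before any stage 2 fault can occur.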

After Aneesh's series it would make the memory NoTagAccess (though I
don't understand from the cover letter how this works for MMIO) and
faults will be fully contained.

> with FEAT_MTE_PERM (patches from Aneesh on the list). Or, a bigger
> hammer, disable MTE in guests (well, not that big, not many platforms
> supporting MTE, especially in the enterprise space).

As above, it seems we already effectively disable MTE in guests to use
VFIO.

> A second problem, similar to relaxing to Normal NC we merged last year,
> we can't tell what allowing Stage 2 cacheable means (SError etc).

That was a very different argument. On that series KVM was upgrading a
VMA with pgprot noncached to Normal NC, and that upgrade was what
triggered the discussions about SError.

For this case the VMA is already pgprot cached. KVM is not changing
anything. The KVM S2 will have the same Normal Cacheable memory type
as the VMA has in the S1. Thus KVM has no additional responsibility
for safety here.

If using Normal Cacheable on this memory is unsafe then VFIO must not
create such a VMA in the first place.

Today the vfio-grace driver is the only place that creates cacheable
VMAs in VFIO and it is doing so with platform specific knowledge that
this memory is fully safe to map cacheable.

> information. Checking vm_page_prot instead of a VM_* flag may work if
> it's mapped in user space but this might not always be the case. 

For this series it is only about mapping VMAs. Some future FD based
mapping for CC is going to also need similar metadata. I have another
thread about that :)

> I don't see how VM_PFNMAP alone can tell us anything about the
> access properties supported by a device address range. Either way,
> it's the driver setting vm_page_prot or some VM_* flag. KVM has no
> clue, it's just a memory slot.

I think David's point about VM_PFNMAP was to avoid some of the
pfn_valid() logic. If we get VM_PFNMAP we just assume it is non-struct
page and follow the VMA's pgprot.
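
Roughly, as a userspace toy model (a sketch only: all toy_* names are
hypothetical, and the non-PFNMAP fallback is my assumption about the
existing pfn_valid() style behavior, not actual KVM code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the VM_PFNMAP flag. */
#define TOY_VM_PFNMAP 0x1u

enum toy_memtype { TOY_DEVICE, TOY_NORMAL_NC, TOY_NORMAL_WB };

struct toy_vma {
    unsigned int flags;
    enum toy_memtype pgprot_type; /* memory type implied by the VMA's pgprot */
};

/*
 * Toy stage 2 memory type selection. For a PFNMAP VMA we assume
 * non-struct page memory and simply mirror the VMA's pgprot. Otherwise
 * fall back to a pfn_valid() style guess: struct page memory is Normal
 * WB, anything else is Device.
 */
static enum toy_memtype toy_s2_memtype(const struct toy_vma *vma,
                                       bool pfn_has_struct_page)
{
    if (vma->flags & TOY_VM_PFNMAP)
        return vma->pgprot_type;
    return pfn_has_struct_page ? TOY_NORMAL_WB : TOY_DEVICE;
}
```

With this shape the driver that set up the VMA stays the single source
of truth for the memory type, and KVM never upgrades it.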

> A third aspect, more of a simplification when reasoning about this, was
> to use FWB at Stage 2 to force cacheability and not care about cache
> maintenance, especially when such range might be mapped both in user
> space and in the guest.

Yes, I thought we needed this anyhow as KVM can't do cache
invalidation on non-struct page memory.

Jason