linux-kernel - Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

Message-ID: <86wmcmn0dp.wl-maz@kernel.org>
Date: Tue, 18 Mar 2025 09:39:30 +0000
From: Marc Zyngier <maz@...nel.org>
To: Catalin Marinas <catalin.marinas@....com>
Cc: Ankit Agrawal <ankita@...dia.com>,
	Jason Gunthorpe <jgg@...dia.com>,
	"oliver.upton@...ux.dev" <oliver.upton@...ux.dev>,
	"joey.gouly@....com" <joey.gouly@....com>,
	"suzuki.poulose@....com" <suzuki.poulose@....com>,
	"yuzenghui@...wei.com" <yuzenghui@...wei.com>,
	"will@...nel.org" <will@...nel.org>,
	"ryan.roberts@....com" <ryan.roberts@....com>,
	"shahuang@...hat.com" <shahuang@...hat.com>,
	"lpieralisi@...nel.org" <lpieralisi@...nel.org>,
	"david@...hat.com" <david@...hat.com>,
	Aniket Agashe <aniketa@...dia.com>,
	Neo Jia <cjia@...dia.com>,
	Kirti Wankhede <kwankhede@...dia.com>,
	"Tarun Gupta (SW-GPU)" <targupta@...dia.com>,
	Vikram Sethi <vsethi@...dia.com>,
	Andy Currid <acurrid@...dia.com>,
	Alistair Popple <apopple@...dia.com>,
	John Hubbard <jhubbard@...dia.com>,
	Dan Williams <danw@...dia.com>,
	Zhi Wang <zhiw@...dia.com>,
	Matt Ochs <mochs@...dia.com>,
	Uday Dhoke <udhoke@...dia.com>,
	Dheeraj Nigam <dnigam@...dia.com>,
	Krishnakant Jaju <kjaju@...dia.com>,
	"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
	"sebastianene@...gle.com" <sebastianene@...gle.com>,
	"coltonlewis@...gle.com" <coltonlewis@...gle.com>,
	"kevin.tian@...el.com" <kevin.tian@...el.com>,
	"yi.l.liu@...el.com" <yi.l.liu@...el.com>,
	"ardb@...nel.org" <ardb@...nel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"gshan@...hat.com" <gshan@...hat.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"ddutile@...hat.com" <ddutile@...hat.com>,
	"tabba@...gle.com" <tabba@...gle.com>,
	"qperret@...gle.com" <qperret@...gle.com>,
	"seanjc@...gle.com" <seanjc@...gle.com>,
	"kvmarm@...ts.linux.dev" <kvmarm@...ts.linux.dev>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>
Subject: Re: [PATCH v3 1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

On Mon, 17 Mar 2025 19:54:25 +0000,
Catalin Marinas <catalin.marinas@....com> wrote:
> 
> On Mon, Mar 17, 2025 at 09:27:52AM +0000, Marc Zyngier wrote:
> > On Mon, 17 Mar 2025 05:55:55 +0000,
> > Ankit Agrawal <ankita@...dia.com> wrote:
> > > 
> > > >> For my education, what is an accepted way to communicate this? Please let
> > > >> me know if there are any relevant examples that you may be aware of.
> > > >
> > > > A KVM capability is what is usually needed.
> > > 
> > > > I see. IIUC, this would involve a corresponding QEMU (usermode) change
> > > > to fetch the new KVM cap. Then it could fail in case FWB is not
> > > > supported, with some additional conditions (so that the currently
> > > > supported configs with !FWB won't break in usermode).
> > > 
> > > > The proposed code change is to map in S2 as NORMAL when the vma flags
> > > > have VM_PFNMAP. However, QEMU cannot know whether the driver is mapping
> > > > with PFNMAP or not. So how may QEMU decide whether it is okay to
> > > > fail for !FWB or not?
> > 
> > This is not about FWB as far as userspace is concerned. This is about
> > PFNMAP as non-device memory. If the host doesn't have FWB, then the
> > "PFNMAP as non-device memory" capability doesn't exist, and userspace
> > fails early.
> > 
> > Userspace must also have some knowledge of what device it obtains the
> > mapping from, and whether that device requires some extra host
> > capability to be assigned to the guest.
> > 
> > You can then check whether the VMA associated with the memslot is
> > PFNMAP or not, if the memslot has been enabled for PFNMAP mappings
> > (either globally or on a per-memslot basis, I don't really care).
> 
> Trying to page this back in, I think there are three stages:
> 
> 1. A KVM cap that the VMM can use to check for non-device PFNMAP (or
>    rather cacheable PFNMAP since we already support Normal NC).
> 
> 2. Memslot registration - we need a way for the VMM to require such
>    cacheable PFNMAP and for KVM to check. Current patch relies on (a)
>    the stage 1 vma attributes which I'm not a fan of. An alternative I
>    suggested was (b) a VM_FORCE_CACHEABLE vma flag, on the assumption
>    that the vfio driver knows if it supports cacheable (it's a bit of a
>    stretch trying to make this generic). Yet another option is (c) a
>    KVM_MEM_CACHEABLE flag that the VMM passes at memslot registration.
> 
> 3. user_mem_abort() - follows the above logic (whatever we decide),
>    maybe with some extra check and WARN in case we got the logic wrong.
> 
> The problems in (2) are that we need to know that the device supports
> cacheable mappings and we don't introduce additional issues or end up
> with FWB on a PFNMAP that does not support cacheable. Without any vma
> flag like the current VM_ALLOW_ANY_UNCACHED, the next best thing is
> relying on the stage 1 attributes. But we don't know them at the memslot
> registration, only later in step (3) after a GUP on the VMM address
> space.
> 
> So in (2), when !FWB, we only want to reject VM_PFNMAP slots if we know
> they are going to be mapped as cacheable. So we need this information
> somehow, either from the vma->vm_flags or slot->flags.

Yup, that's mostly how I think of it.

Obtaining a mapping from the xPU driver must result in VM_PFNMAP being
set in the VMA. I don't think that's particularly controversial.

The memslot must also be created with a new flag ((2c) in the taxonomy
above) that carries the "Please map VM_PFNMAP VMAs as cacheable"
request. This flag is only allowed if the capability from (1) is
advertised.

This results in the following behaviours:

- If the VMM creates the memslot with the cacheable attribute without
  (1) being advertised, we fail.

- If the VMM creates the memslot without the cacheable attribute, we
  map as NC, as it is today.
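
As a rough illustration of the two behaviours above, here is a minimal
userspace model of the check at memslot registration. It is a sketch,
not kernel code: the flag name KVM_MEM_CACHEABLE comes from option (2c)
in the thread, but its bit value, the host_caps structure, the
s2_memtype enum and register_pfnmap_memslot() are all made up for the
example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: the name follows (2c) in the thread, the bit value is invented. */
#define KVM_MEM_CACHEABLE (1u << 3)

/* Stage 2 memory type chosen for a VM_PFNMAP-backed memslot (illustrative). */
enum s2_memtype { S2_NORMAL_NC, S2_NORMAL_CACHEABLE };

/* Stand-in for the host state behind the capability in (1). */
struct host_caps {
	bool cacheable_pfnmap;	/* assumed to be advertised only when the host can do it */
};

/*
 * Model of registering a memslot backed by a VM_PFNMAP VMA.
 * Returns 0 and fills *type on success, or -EINVAL when the VMM asks
 * for cacheable without the capability being advertised.
 */
static int register_pfnmap_memslot(const struct host_caps *caps,
				   uint32_t slot_flags,
				   enum s2_memtype *type)
{
	assert(caps && type);

	if (slot_flags & KVM_MEM_CACHEABLE) {
		if (!caps->cacheable_pfnmap)
			return -EINVAL;		/* flag without (1): fail */
		*type = S2_NORMAL_CACHEABLE;
		return 0;
	}

	*type = S2_NORMAL_NC;			/* no flag: NC, as today */
	return 0;
}
```

Note that the model never looks at the device or the VMA to pick the
attribute on its own: the only inputs are the capability and the flag
the VMM chose to pass.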

What this doesn't do is *automatically* decide for the VMM what
attributes to use. The VMM must know what it is doing, and only
provide the memslot flag when appropriate. Doing otherwise may eat
your data and/or take the machine down (cacheable mapping on a device
can be great fun). If you want to address this, then "someone" needs
to pass some additional VMA flag that KVM can check.

Of course, all of this only caters for well behaved userspace, and we
need to gracefully handle (3) when the VMM sneaks in a new VMA that
has conflicting attributes.

For that, we need a reasonable fault reporting interface that allows
userspace to handle it correctly. I don't think this is unique to this
case: it also covers things like MTE and other funky stuff that
relies on the backing memory having some particular "attributes".

An alternative could be to require the VMA to be sealed, which would
prevent any overlapping mapping. But I have only looked at that for 2
minutes...

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.