Message-ID:
<SA1PR12MB7199AF5268C8384F9DAE3B50B01C2@SA1PR12MB7199.namprd12.prod.outlook.com>
Date: Fri, 10 Jan 2025 21:04:06 +0000
From: Ankit Agrawal <ankita@...dia.com>
To: Catalin Marinas <catalin.marinas@....com>
CC: Jason Gunthorpe <jgg@...dia.com>, "maz@...nel.org" <maz@...nel.org>,
"oliver.upton@...ux.dev" <oliver.upton@...ux.dev>, "joey.gouly@....com"
<joey.gouly@....com>, "suzuki.poulose@....com" <suzuki.poulose@....com>,
"yuzenghui@...wei.com" <yuzenghui@...wei.com>, "will@...nel.org"
<will@...nel.org>, "ryan.roberts@....com" <ryan.roberts@....com>,
"shahuang@...hat.com" <shahuang@...hat.com>, "lpieralisi@...nel.org"
<lpieralisi@...nel.org>, Aniket Agashe <aniketa@...dia.com>, Neo Jia
<cjia@...dia.com>, Kirti Wankhede <kwankhede@...dia.com>, "Tarun Gupta
(SW-GPU)" <targupta@...dia.com>, Vikram Sethi <vsethi@...dia.com>, Andy
Currid <acurrid@...dia.com>, Alistair Popple <apopple@...dia.com>, John
Hubbard <jhubbard@...dia.com>, Dan Williams <danw@...dia.com>, Zhi Wang
<zhiw@...dia.com>, Matt Ochs <mochs@...dia.com>, Uday Dhoke
<udhoke@...dia.com>, Dheeraj Nigam <dnigam@...dia.com>,
"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
"sebastianene@...gle.com" <sebastianene@...gle.com>, "coltonlewis@...gle.com"
<coltonlewis@...gle.com>, "kevin.tian@...el.com" <kevin.tian@...el.com>,
"yi.l.liu@...el.com" <yi.l.liu@...el.com>, "ardb@...nel.org"
<ardb@...nel.org>, "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"gshan@...hat.com" <gshan@...hat.com>, "linux-mm@...ck.org"
<linux-mm@...ck.org>, "kvmarm@...ts.linux.dev" <kvmarm@...ts.linux.dev>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>
Subject: Re: [PATCH v2 1/1] KVM: arm64: Allow cacheable stage 2 mapping using
VMA flags
Thanks Catalin for the review. Comments inline.
> Please note that a pfn_valid() does not imply the memory is in the
> linear map, only that there's a struct page. Some cache maintenance you
> do later in the patch may fail. kvm_is_device_pfn() was changed by
> commit 873ba463914c ("arm64: decouple check whether pfn is in linear map
> from pfn_valid()"), see the log for some explanation.
Thanks for the clarification. So it appears that the original pfn_is_map_memory()
needs to be kept. I am getting the impression from the comments that we should
only extend cacheability for the very specific case we are trying to address,
i.e. when all of the following are true (a sketch follows the list):
- the type is VM_PFNMAP
- the user mapping is cacheable (MT_NORMAL or MT_NORMAL_TAGGED)
- the suggested VM_FORCE_CACHED flag is set
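A minimal sketch of that combined gate (assuming the proposed VM_FORCE_CACHED
flag, which does not exist yet, and the existing MT_NORMAL/MT_NORMAL_TAGGED
mapping types; the helper name is only illustrative):

	/* Sketch only: VM_FORCE_CACHED is a proposed flag, not yet in mm.h. */
	static bool stage2_force_cacheable(struct vm_area_struct *vma,
					   unsigned long mt)
	{
		return (vma->vm_flags & VM_PFNMAP) &&
		       (vma->vm_flags & VM_FORCE_CACHED) &&
		       (mt == MT_NORMAL || mt == MT_NORMAL_TAGGED);
	}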
>> Also take care of the following two cases that prevent the memory from
>> being safely mapped as cacheable:
>> 1. The VMA pgprot has VM_IO set along with MT_NORMAL or
>> MT_NORMAL_TAGGED. Although unexpected and wrong, the presence of such a
>> configuration cannot be ruled out.
>> 2. Configurations where VM_MTE_ALLOWED is not set and KVM_CAP_ARM_MTE
>> is enabled. Otherwise a malicious guest can enable MTE at stage 1
>> without the hypervisor being able to tell. This could cause external
>> aborts.
>
> A first point I'd make - we can simplify this a bit and only allow such
> configuration if FWB is present. Do you have a platform without FWB that
> needs such feature?
No, we don't have a platform without FWB. So I'll check for FWB presence.
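For instance (a minimal sketch, reusing the stage2_has_fwb() helper that
already appears in the quoted diff further down; the exact placement inside
user_mem_abort() is illustrative):

	/* Refuse the forced-cacheable PFNMAP path on !FWB hardware. */
	if (!stage2_has_fwb(pgt)) {
		ret = -EINVAL;
		goto out_unlock;
	}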
> Another reason for the above is my second point - I don't like relying
> on the user mapping memory type for this (at some point we may have
> device pass-through without a VMM mapping). Can we use something like a
> new VM_FORCE_CACHED flag instead? There's precedent for this with
> VM_ALLOW_ANY_UNCACHED.
Ack, this will help better control the affected configurations. I'll introduce
this flag in the next version.
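For context, VM_ALLOW_ANY_UNCACHED is defined as a high arch VMA flag bit on
64-bit only; a VM_FORCE_CACHED flag could plausibly follow the same pattern
(the bit number below is a placeholder, not a claim about the eventual patch):

	/* include/linux/mm.h -- sketch modelled on VM_ALLOW_ANY_UNCACHED */
	#ifdef CONFIG_64BIT
	#define VM_FORCE_CACHED_BIT	40	/* placeholder bit number */
	#define VM_FORCE_CACHED		BIT(VM_FORCE_CACHED_BIT)
	#else
	#define VM_FORCE_CACHED		VM_NONE
	#endif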
>> Note that when FWB is not enabled, the kernel expects to trivially do
>> cache management by flushing the memory, linearly converting a
>> kvm_pte to a phys_addr to a KVA; see kvm_flush_dcache_to_poc(). This is
>> only possible for struct page backed memory. Do not allow non-struct
>> page memory to be cacheable without FWB.
>
> I want to be sure we actually have a real case for this for the !FWB
> case. One issue is that it introduces a mismatch between the VMM and the
> guest mappings I'd rather not have to have to deal with. Another is that
> we can't guarantee it is mapped in the kernel linear map, pfn_valid()
> does not imply this (I'll say this a few times through this patch).
I am not aware of such a case. I'll restrict the changes to FWB then.
>> Device memory, such as on the Grace Hopper systems, is interchangeable
>> with DDR memory and retains its properties. Allow executable faults
>> on memory determined to be Normal cacheable.
>
> As Will said, please introduce the exec handling separately, it will be
> easier to follow the patches.
>
> The exec fault would require cache maintenance in certain conditions
> (depending on CTR_EL0.{DIC,IDC}). Since you introduce some conditions on
> pfn_valid() w.r.t. D-cache maintenance, I assume we have similar
> restrictions for I/D cache coherency.
I suppose that if we restrict the change to the aforementioned case, i.e. all
of the following being true, the exec fault check could safely stay as it is
in the patch (albeit moved to a separate patch; see the sketch after the list):
- the type is VM_PFNMAP
- the user mapping is cacheable (MT_NORMAL or MT_NORMAL_TAGGED)
- the suggested VM_FORCE_CACHED flag is set
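To illustrate the CTR_EL0 dependency (a sketch only; the real exec path in
user_mem_abort() uses its own helpers, and the variable names here are just
for illustration):

	/*
	 * CTR_EL0.DIC == 0: I-cache invalidation to PoU is required for
	 * instruction/data coherence. CTR_EL0.IDC == 0: D-cache clean to
	 * PoU is required as well. Either way, the PFN must be reachable
	 * by the kernel's cache maintenance code.
	 */
	bool need_icache_inval = !cpus_have_final_cap(ARM64_HAS_CACHE_DIC);
	bool need_dcache_clean = !cpus_have_final_cap(ARM64_HAS_CACHE_IDC);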
>> +static bool mapping_type_normal_cacheable(unsigned long mt)
>> +{
>> + return (mt == MT_NORMAL || mt == MT_NORMAL_TAGGED);
>> +}
>
> Personally I'd not use this at all, maybe at most as a safety check and
> warn but I'd rather have an opt-in from the host driver (which could
> also ensure that the user mapping is cached).
Understood, will make it part of the check as mentioned above.
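For example, a host driver that maps coherent, DDR-like device memory
cacheable to userspace could opt in from its mmap handler (sketch:
mydev_mmap, mydev_region_pfn and VM_FORCE_CACHED are all hypothetical):

	static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
	{
		/* pfn of the device region; driver-specific lookup. */
		unsigned long pfn = mydev_region_pfn(file);

		/*
		 * The driver guarantees this region is coherent, DDR-like
		 * memory and maps it cacheable for userspace, so it can
		 * opt in to a cacheable stage 2 mapping as well.
		 */
		vm_flags_set(vma, VM_FORCE_CACHED);	/* proposed flag */
		return remap_pfn_range(vma, vma->vm_start, pfn,
				       vma->vm_end - vma->vm_start,
				       vma->vm_page_prot);
	}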
>> + * pfn_valid() indicates to the code if there is a struct page, or
>> + * if the memory is in the kernel map. Any memory region otherwise
>> + * is unsafe to be cacheable.
>> + */
>> + if (!pfn_valid(pfn))
>> + noncacheable = true;
>
> The assumptions here are wrong. pfn_valid() does not imply the memory is
> in the kernel map.
Understood, thanks for the clarification.
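For reference, the distinction on arm64 (see commit 873ba463914c) is roughly:

	pfn_valid(pfn)		/* a struct page exists for this pfn */
	pfn_is_map_memory(pfn)	/* the pfn is in the kernel linear map
				 * (memblock_is_map_memory() underneath),
				 * so KVA-based cache maintenance is safe */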
>> + if (!mapping_type_normal_cacheable(mt)) {
>> if (vfio_allow_any_uc)
>> prot |= KVM_PGTABLE_PROT_NORMAL_NC;
>> else
>> @@ -1684,6 +1725,20 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> prot |= KVM_PGTABLE_PROT_X;
>> }
>
> I'd leave the device check in place, maybe rename it to something else
> to distinguish from linear map memory (e.g. !pfn_is_map_memory()) and
> only deal with the attributes for this device pfn which could be Device,
> Normal-NC or Normal-WB depending on the presence of some VM_* flags.
> Deny execution in the first patch, introduce it subsequently.
Ack.
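A sketch of that structure (vfio_allow_any_uc and KVM_PGTABLE_PROT_NORMAL_NC
are from the quoted diff; VM_FORCE_CACHED is the proposed flag; vm_flags is a
snapshot of vma->vm_flags; a cacheable mapping simply keeps the default
Normal-WB stage 2 attributes):

	if (!pfn_is_map_memory(pfn)) {
		/* Exec denied here in the first patch, added later. */
		if (vm_flags & VM_FORCE_CACHED) {
			/* Normal-WB: leave the default attributes. */
		} else if (vfio_allow_any_uc) {
			prot |= KVM_PGTABLE_PROT_NORMAL_NC;
		} else {
			prot |= KVM_PGTABLE_PROT_DEVICE;
		}
	}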
>> + /*
>> + * When FWB is unsupported KVM needs to do cache flushes
>> + * (via dcache_clean_inval_poc()) of the underlying memory. This is
>> + * only possible if the memory is already mapped into the kernel map
>> + * at the usual spot.
>> + *
>> + * Validate that there is a struct page for the PFN which maps
>> + * to the KVA that the flushing code expects.
>> + */
>> + if (!stage2_has_fwb(pgt) && !(pfn_valid(pfn))) {
>> + ret = -EINVAL;
>> + goto out_unlock;
>> + }
>
> Cache maintenance relies on memory being mapped. That's what
> memblock_is_map_memory() gives us. Hotplugged memory ends up in memblock
> as arm64 selects ARCH_KEEP_MEMBLOCK.
OK, so I will replace pfn_valid() with pfn_is_map_memory(). Did I get that right?
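I.e. the quoted hunk would become something like (sketch):

	if (!stage2_has_fwb(pgt) && !pfn_is_map_memory(pfn)) {
		ret = -EINVAL;
		goto out_unlock;
	}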
Apologies for being slow in getting back.
- Ankit Agrawal