[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <eebe69ad-1a2f-91ee-75b1-f295d7a3389c@redhat.com>
Date: Wed, 31 Mar 2021 16:14:12 +0200
From: David Hildenbrand <david@...hat.com>
To: Steven Price <steven.price@....com>,
Catalin Marinas <catalin.marinas@....com>
Cc: Mark Rutland <mark.rutland@....com>,
Peter Maydell <peter.maydell@...aro.org>,
"Dr. David Alan Gilbert" <dgilbert@...hat.com>,
Andrew Jones <drjones@...hat.com>, Haibo Xu <Haibo.Xu@....com>,
Suzuki K Poulose <suzuki.poulose@....com>,
qemu-devel@...gnu.org, Marc Zyngier <maz@...nel.org>,
Juan Quintela <quintela@...hat.com>,
Richard Henderson <richard.henderson@...aro.org>,
linux-kernel@...r.kernel.org, Dave Martin <Dave.Martin@....com>,
James Morse <james.morse@....com>,
linux-arm-kernel@...ts.infradead.org,
Thomas Gleixner <tglx@...utronix.de>,
Will Deacon <will@...nel.org>, kvmarm@...ts.cs.columbia.edu,
Julien Thierry <julien.thierry.kdev@...il.com>
Subject: Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature
On 31.03.21 12:41, Steven Price wrote:
> On 31/03/2021 10:32, David Hildenbrand wrote:
>> On 31.03.21 11:21, Catalin Marinas wrote:
>>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
>>>> On 30.03.21 12:30, Catalin Marinas wrote:
>>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
>>>>>> On 28/03/2021 13:21, Catalin Marinas wrote:
>>>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
>>>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
>>>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644
>>>>>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu
>>>>>>>>> *vcpu, phys_addr_t fault_ipa,
>>>>>>>>> if (vma_pagesize == PAGE_SIZE && !force_pte)
>>>>>>>>> vma_pagesize = transparent_hugepage_adjust(memslot,
>>>>>>>>> hva,
>>>>>>>>> &pfn, &fault_ipa);
>>>>>>>>> +
>>>>>>>>> + if (fault_status != FSC_PERM && kvm_has_mte(kvm) &&
>>>>>>>>> pfn_valid(pfn)) {
>>>>>>>>> + /*
>>>>>>>>> + * VM will be able to see the page's tags, so we must
>>>>>>>>> ensure
>>>>>>>>> + * they have been initialised. if PG_mte_tagged is set,
>>>>>>>>> tags
>>>>>>>>> + * have already been initialised.
>>>>>>>>> + */
>>>>>>>>> + struct page *page = pfn_to_page(pfn);
>>>>>>>>> + unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
>>>>>>>>> +
>>>>>>>>> + for (i = 0; i < nr_pages; i++, page++) {
>>>>>>>>> + if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>>>>>>>> + mte_clear_page_tags(page_address(page));
>>>>>>>>> + }
>>>>>>>>> + }
>>>>>>>>
>>>>>>>> This pfn_valid() check may be problematic. Following commit
>>>>>>>> eeb0753ba27b
>>>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it
>>>>>>>> returns
>>>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to
>>>>>>>> support
>>>>>>>> MTE.
>>>>>>>
>>>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be
>>>>>>> mapped as untagged memory in the kernel linear map. It could be
>>>>>>> slightly
>>>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
>>>>>>> untagged memory. Another overhead is pfn_valid() which will likely
>>>>>>> end
>>>>>>> up calling memblock_is_map_memory().
>>>>>>>
>>>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for
>>>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is
>>>>>>> there a
>>>>>>> way to detect what gets mapped in the guest as Normal Cacheable
>>>>>>> memory
>>>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE
>>>>>>> (or
>>>>>>> something else like on-chip memory)? If we can't guarantee that all
>>>>>>> Cacheable memory given to a guest supports tags, we should disable
>>>>>>> the
>>>>>>> feature altogether.
>>>>>>
>>>>>> In stage 2 I believe we only have two types of mapping - 'normal' or
>>>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the
>>>>>> latter is a
>>>>>> case of checking the 'device' variable, and makes sense to avoid the
>>>>>> overhead you describe.
>>>>>>
>>>>>> This should also guarantee that all stage-2 cacheable memory
>>>>>> supports tags,
>>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid()
>>>>>> should only
>>>>>> be true for memory that Linux considers "normal".
>>>>
>>>> If you think "normal" == "normal System RAM", that's wrong; see below.
>>>
>>> By "normal" I think both Steven and I meant the Normal Cacheable memory
>>> attribute (another being the Device memory attribute).
>
> Sadly there's no good standardised terminology here. Aarch64 provides
> the "normal (cacheable)" definition. Memory which is mapped as "Normal
> Cacheable" is implicitly MTE capable when shared with a guest (because
> the stage 2 mappings don't allow restricting MTE other than mapping it
> as Device memory).
>
> So MTE also forces us to have a definition of memory which is "bog
> standard memory"[1] separate from the mapping attributes. This is the
> main memory which fully supports MTE.
>
> Separate from the "bog standard" we have the "special"[1] memory, e.g.
> ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but
> that memory may not support MTE tags. This memory can only be safely
> shared with a guest in the following situations:
>
> 1. MTE is completely disabled for the guest
>
> 2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)
>
> 3. We have some guarantee that guest MTE access are in some way safe.
>
> (1) is the situation today (without this patch series). But it prevents
> the guest from using MTE in any form.
>
> (2) is pretty terrible for general memory, but is the get-out clause for
> mapping devices into the guest.
>
> (3) isn't something we have any architectural way of discovering. We'd
> need to know what the device did with the MTE accesses (and any caches
> between the CPU and the device) to ensure there aren't any side-channels
> or h/w lockup issues. We'd also need some way of describing this memory
> to the guest.
>
> So at least for the time being the approach is to avoid letting a guest
> with MTE enabled have access to this sort of memory.
>
> [1] Neither "bog standard" nor "special" are real terms - like I said
> there's a lack of standardised terminology.
>
>>>>> That's the problem. With Anshuman's commit I mentioned above,
>>>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
>>>>> memory, not talking about some I/O mapping that requires Device_nGnRE).
>>>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as
>>>>> Normal but it is not guaranteed to support tagging.
>>>>
>>>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
>>>> touch the page, you won't fault. So Anshuman's commit is correct.
>>>
>>> I agree.
>>>
>>>> pfn_to_online_page() means, "there is a struct page and it's system RAM
>>>> that's in use; the memmap has a sane content"
>>>
>>> Does pfn_to_online_page() returns a valid struct page pointer for
>>> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
>>> some definition of system RAM (I assume NVDIMM != system RAM). For
>>> example, pmem_attach_disk() calls devm_memremap_pages() and this would
>>> use the Normal Cacheable memory attribute without necessarily being
>>> system RAM.
>>
>> No, not for ZONE_DEVICE.
>>
>> However, if you expose PMEM via dax/kmem as System RAM to the system (->
>> add_memory_driver_managed()), then PMEM (managed via ZONE_NOMRAL or
>> ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system
>> thinks it's "ordinary system RAM" and the memory is managed by the buddy.
>
> So if I'm understanding this correctly for KVM we need to use
> pfn_to_online_pages() and reject if NULL is returned? In the case of
> dax/kmem there already needs to be validation that the memory supports
> MTE (otherwise we break user space) before it's allowed into the
> "ordinary system RAM" bucket.
That should work.
1. One alternative is
if (!pfn_valid(pfn))
return false;
#ifdef CONFIG_ZONE_DEVICE
page = pfn_to_page(pfn);
if (page_zonenum(page) == ZONE_DEVICE)
return false;
#endif
return true;
Note that when you are dealing with random PFNs, this approach is in
general not safe; the memmap could be uninitialized and contain garbage.
You can have false positives for ZONE_DEVICE.
2. Yet another (slower?) variant to detect (some?) ZONE_DEVICE is
pgmap = get_dev_pagemap(pfn, NULL);
put_dev_pagemap(pgmap);
if (pgmap)
return false;
return true;
I know that /dev/mem mappings can be problematic ... because the memmap
could be in any state and actually we shouldn't even touch/rely on any
"struct pages" at all, as we have a pure PFN mapping ...
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists