Message-ID: <16d8b137-7e61-b981-e1cd-9f06f73d4300@redhat.com>
Date: Wed, 14 Jun 2023 10:01:14 +1000
From: Gavin Shan <gshan@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: kvmarm@...ts.linux.dev, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, pbonzini@...hat.com,
oliver.upton@...ux.dev, maz@...nel.org, hshuai@...hat.com,
zhenyzha@...hat.com, shan.gavin@...il.com
Subject: Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot
Hi Sean,
On 6/14/23 12:58 AM, Sean Christopherson wrote:
> On Tue, Jun 13, 2023, Gavin Shan wrote:
>> On 6/13/23 12:41 AM, Sean Christopherson wrote:
>>>> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
>>>
>>> This Fixes isn't correct. That change only affected x86, which doesn't have this
>>> bug. And looking at commit cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU
>>> notifier callbacks"), arm64 did NOT skip invalid slots
>
> ...
>
>>> Unless I'm missing something, this goes all the way back to initial arm64 support
>>> added by commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup").
>>>
>>
>> The fixes tag was sorted out based on 'git-bisect', not static code analysis. I
>> agree it should be d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") from
>> the code. From the 'git-bisect', we found the issue disappears when the head is
>> commit 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code").
>> And yes, the fixes tag should be cd4c71835228 ("KVM: arm64: Convert to the gfn-based
>> MMU notifier callbacks").
>
> No, just because bisect points at a commit doesn't mean that's the commit that
> introduced a bug. It's not uncommon for a completely valid change to expose a
> pre-existing bug, which is what I suspect happened here, e.g. after switching to
> the generic framework, clean_dcache_guest_page() is called *after* retrieving the
> memslot, versus clean_dcache_guest_page() being called before walking memslots.
>
> Blaming the correct commit matters, both so that it's clear to future readers what
> went wrong, and also so that people maintaining older kernels at least are aware
> that their kernel may be affected. E.g. a fix in generic KVM obviously won't
> backport to 5.10, but someone who cares deeply about a 5.10-based kernel on arm64
> could manually port the fix to KVM arm64 code.
>
Yes, moving the call site of clean_dcache_guest_page() theoretically enlarges the
race window between memory slot removal and change_pte(). With that change in
place, we were able to reproduce the issue in a practical test. In the next
revision, let's put a note about this in the changelog, like below.
We tried a git-bisect and the first problematic commit is cd4c71835228
("KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this
commit, clean_dcache_guest_page() is called after the memory slots are iterated
in kvm_mmu_notifier_change_pte(); without it, clean_dcache_guest_page() is
called before the iteration over the memory slots. This change effectively
enlarges the race window between kvm_mmu_notifier_change_pte() and memory
slot removal, so that we're able to reproduce the issue in a practical test
case. However, the issue has existed since d5d8184d35c9 ("KVM: ARM: Memory
virtualization setup").
Cc: stable@...r.kernel.org # v3.9+
Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>> I'm inclined to fix the issue only for ARM64,
>
> I never suggested that an ARM-only fix be made, in fact I explicitly suggested
> the exact opposite.
>
Well, the original patch skipped invalid memory slots for all MMU notifiers.
You suggested limiting the skip to change_pte() only. Since that narrowed the
scope, I followed the same direction and narrowed the issue/fix further to
change_pte() on ARM64 only. Sorry that I misunderstood your suggestion.
>> more details are given below. If we're going to limit the issue to ARM64 and
>> fix it for ARM64 only, the fixes tag should be the one you pointed out. Let's
>> correct it in the next revision with:
>
> For a generic fix, just blame multiple commits, e.g. the guilty arm64, RISC-V,
> and MIPS commits, which at a glance are the affected architectures. At one point
> in time, x86 was also likely affected, but that's probably not worth more than a
> brief mention in the changelog as I don't see any value in tracking down a very
> theoretical window of time where x86 was affected.
>
Right, please refer to the reply above. I plan to add a section about this to
the changelog in the next revision.
>> Cc: stable@...r.kernel.org # v3.9+
>> Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>>
>>>> Reported-by: Shuai Hu <hshuai@...hat.com>
>>>> Reported-by: Zhenyu Zhang <zhenyzha@...hat.com>
>>>> Signed-off-by: Gavin Shan <gshan@...hat.com>
>>>> ---
>>>> virt/kvm/kvm_main.c | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 479802a892d4..7f81a3a209b6 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>>> unsigned long hva_start, hva_end;
>>>> slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
>>>> + if (slot->flags & KVM_MEMSLOT_INVALID)
>>>> + continue;
>>>
>>> Skipping the memslot will lead to use-after-free. E.g. if an invalidate_range_start()
>>> comes along between installing the invalid slot and zapping SPTEs, KVM will
>>> return from kvm_mmu_notifier_invalidate_range_start() without having dropped all
>>> references to the range.
>>>
>>> I.e.
>>>
>>> kvm_copy_memslot(invalid_slot, old);
>>> invalid_slot->flags |= KVM_MEMSLOT_INVALID;
>>> kvm_replace_memslot(kvm, old, invalid_slot);
>>>
>>> /*
>>> * Activate the slot that is now marked INVALID, but don't propagate
>>> * the slot to the now inactive slots. The slot is either going to be
>>> * deleted or recreated as a new slot.
>>> */
>>> kvm_swap_active_memslots(kvm, old->as_id);
>>>
>>>
>>> ==> invalidate_range_start()
>>>
>>> /*
>>> * From this point no new shadow pages pointing to a deleted, or moved,
>>> * memslot will be created. Validation of sp->gfn happens in:
>>> * - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
>>> * - kvm_is_visible_gfn (mmu_check_root)
>>> */
>>> kvm_arch_flush_shadow_memslot(kvm, old);
>>>
>>> And even for change_pte(), skipping the memslot is wrong, as KVM would then fail
>>> unmap the prior SPTE. Of course, that can't happen in the current code base
>>> because change_pte() is wrapped with invalidate_range_{start,end}(), but that's
>>> more of a bug than a design choice (see c13fda237f08 "KVM: Assert that notifier
>>> count is elevated in .change_pte()" for details). That's also why this doesn't
>>> show up on x86; x86 installs a SPTE for the change_pte() notifier iff an existing
>>> SPTE is present, which is never true due to the invalidation.
>>>
>>
>> Right, those architectural dependencies are really something I worried about.
>
> The zap case isn't architecture specific; that's true for all KVM architectures.
>
Yes.
>> It's safe to skip the invalid memory slots for ARM64,
>
> No, it's not. The problematic window where an invalidation can be incorrectly
> skipped is very tiny, and would have zero chance of being seen without something
> generating invalidations, e.g. page migration. But that doesn't mean there's no
> bug.
>
Yeah, agreed.
>> but it may be unsafe to do so for other architectures. You've listed the
>> potential risks of doing so for x86. It might be risky with PowerPC's reverse
>> mapping stuff too. I didn't look into the code for the remaining architectures.
>> In order to escape from the architectural conflicts, I would move the check
>> and skip the invalid memory slot in arch/arm64/kvm/mmu.c::kvm_set_spte_gfn(),
>> something like below. In this way, the guest hang issue in ARM64 guest is
>> fixed. We may have similar issue on other architectures, but we can figure
>> out the fix when we have to. Sean, please let me know if you're happy with
>> this?
>>
>> arch/arm64/kvm/mmu.c::kvm_set_spte_gfn()
>> ----------------------------------------
>>
>> bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>> {
>> kvm_pfn_t pfn = pte_pfn(range->pte);
>>
>> if (!kvm->arch.mmu.pgt)
>> return false;
>>
>> /*
>> * The memory slot can become invalid temporarily or permanently
>> * when it's being moved or deleted. Avoid the stage2 mapping so
>> * that all the read and write requests to the region of the memory
>> * slot can be forwarded to VMM and emulated there.
>> */
>> if (range->slot->flags & KVM_MEMSLOT_INVALID)
>> return false;
>
> Please no. (a) it papers over common KVM's reliance on the SPTE being zapped by
> invalidate_range_start(), and (b) it leaves the same bug in RISC-V (which copy+pasted
> much of arm64's MMU code) and in MIPS (which also installs SPTEs in its callback).
>
Ok, now I understand your suggestion to fix the issue for all architectures. I
will post v3 with the changes you suggested.
>> WARN_ON(range->end - range->start != 1);
>>
>> :
>> }
>>
>>> I'd honestly love to just delete the change_pte() callback, but my opinion is more
>>> than a bit biased since we don't use KSM. Assuming we keep change_pte(), the best
>>> option is probably to provide a wrapper around kvm_set_spte_gfn() to skip the
>>> memslot, but with a sanity check and comment explaining the dependency on there
>>> being no SPTEs due to the invalidation. E.g.
>>>
>>
>> It's a good idea, but I'm afraid other architectures like PowerPC will still be
>> affected.
>
> I don't follow. PPC unmaps in response to a PTE change, but that's effectively
> dead code due to change_pte() being wrapped with invalidate_range_{start,end}().
>
You're right. I forgot that change_pte() is surrounded by invalidate_range_{start,end}().
>> So I would like to limit this issue to ARM64 and fix it in its
>> kvm_set_spte_gfn() variant, as above. One question about "we don't use KSM":
>> could you please share more information about this? I'm blindly guessing
>> you're saying KSM isn't used in google cloud?
>
> Yeah, we == Google/GCE. Sorry, should have clarified that.
>
Ok, thanks for your clarification.
Thanks,
Gavin