linux-kernel - Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <695fe5a2-a295-4105-b31b-4cc92b089659@redhat.com>
Date:   Fri, 9 Jun 2023 09:17:34 +1000
From:   Gavin Shan <gshan@...hat.com>
To:     Marc Zyngier <maz@...nel.org>
Cc:     kvmarm@...ts.linux.dev, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, pbonzini@...hat.com,
        seanjc@...gle.com, oliver.upton@...ux.dev, hshuai@...hat.com,
        zhenyzha@...hat.com, shan.gavin@...il.com,
        Andrea Arcangeli <aarcange@...hat.com>,
        Peter Xu <peterx@...hat.com>,
        David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot

Hi Marc,

[Cc Andrea/David/Peter Xu]

On 6/9/23 12:31 AM, Marc Zyngier wrote:
> On Thu, 08 Jun 2023 10:03:48 +0100,
> Gavin Shan <gshan@...hat.com> wrote:
>>
>> We run into guest hang in edk2 firmware when KSM is kept as running
>> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
>> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
>> erasing or buffered write. The status is returned by reading the
>> memory region of the pflash device and the read request should
>> have been forwarded to QEMU and emulated by it. Unfortunately, the
>> read request is covered by an illegal stage2 mapping when the guest
>> hang issue occurs. The read request is completed with QEMU bypassed and
>> wrong status is fetched.
>>
>> The illegal stage2 mapping is populated due to same page mering by
>> KSM at (C) even the associated memory slot has been marked as invalid
>> at (B).
>>
>>    CPU-A                    CPU-B
>>    -----                    -----
>>                             ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
>>                             kvm_vm_ioctl_set_memory_region
>>                             kvm_set_memory_region
>>                             __kvm_set_memory_region
>>                             kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
>>                               kvm_invalidate_memslot
>>                                 kvm_copy_memslot
>>                                 kvm_replace_memslot
>>                                 kvm_swap_active_memslots        (A)
>>                                 kvm_arch_flush_shadow_memslot   (B)
>>    same page merging by KSM
>>    kvm_mmu_notifier_change_pte
>>    kvm_handle_hva_range
>>    __kvm_handle_hva_range       (C)
>>
>> Fix the issue by skipping the invalid memory slot at (C) to avoid the
>> illegal stage2 mapping. Without the illegal stage2 mapping, the read
>> request for the pflash's status is forwarded to QEMU and emulated by
>> it. The correct pflash's status can be returned from QEMU to break
>> the infinite wait in edk2 firmware.
> 
> Huh, nice one :-(.
> 

Yeah, it's a sneaky one :)

>>
>> Cc: stable@...r.kernel.org # v5.13+
>> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
>> Reported-by: Shuai Hu <hshuai@...hat.com>
>> Reported-by: Zhenyu Zhang <zhenyzha@...hat.com>
>> Signed-off-by: Gavin Shan <gshan@...hat.com>
>> ---
>>   virt/kvm/kvm_main.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 479802a892d4..7f81a3a209b6 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>   			unsigned long hva_start, hva_end;
>>   
>>   			slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
>> +			if (slot->flags & KVM_MEMSLOT_INVALID)
>> +				continue;
>> +
>>   			hva_start = max(range->start, slot->userspace_addr);
>>   			hva_end = min(range->end, slot->userspace_addr +
>>   						  (slot->npages << PAGE_SHIFT));
> 
> I don't immediately see what makes it safer. If we're not holding one
> of slots_{,arch_}lock in the notifier, we can still race against the
> update, can't we?  I don't think holding the srcu lock helps us here.
> 

The dead-lock will be triggered if we're holding @slots_lock in
__kvm_handle_hva_range() when we have call site like below. There is
similar reason why we can't hold @slots_arch_lock in the function.

   CPU-A                                CPU-B
   -----                                ------
   invalidate_range_start()
     kvm->mn_active_invalidate_count++
     __kvm_handle_hva_range
                                        ioctl(kvm, KVM_SET_USER_MEMORY_REGION)   // Delete
                                        kvm_vm_ioctl_set_memory_region
                                        mutex_lock(&kvm->slots_lock);
                                        __kvm_set_memory_region
                                          kvm_set_memslot(..., KVM_MR_DELETE)
                                            kvm_invalidate_memslot
                                              kvm_replace_memslot
                                              kvm_swap_active_memslots          // infinite wait until
                                        mutex_unlock(&kvm->slots_lock);         // kvm->mn_active_invalidate_count is 0
   invalidate_range_end
     __kvm_handle_hva_range
       mutex_lock(&kvm->slots_lock);   // lock taken by CPU-B
       mutex_unlock(&kvm->slots_lock);
     --kvm->mn_active_invalidate_count

change_pte() is always surrounded by invalidate_range_start and
invalidate_range_end(). It means kvm->mn_active_invalidate_count is always
larger than zero when change_pte() is called. With this condition
(kvm->mn_active_invalidate_count > 0), The swapping between the inactive
and active memory slots by kvm_swap_active_memslots() can't be done.
So there are two cases for one memory slot when change_pte() is called:
(a) it has been marked as KVM_MEMSLOT_INVALID in the active memory slots
by kvm_invalidate_memslot(), called before invalidate_range_start();
(b) the memory slot has been deleted from the active memory slots. We're
only concerned by (a) when the active memory slots are iterated in
__kvm_handle_hva_range().

static void kvm_mmu_notifier_change_pte(...)
{
         :
         /*
          * .change_pte() must be surrounded by .invalidate_range_{start,end}().
          * If mmu_invalidate_in_progress is zero, then no in-progress
          * invalidations, including this one, found a relevant memslot at
          * start(); rechecking memslots here is unnecessary.  Note, a false
          * positive (count elevated by a different invalidation) is sub-optimal
          * but functionally ok.
          */
         WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
         if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
                 return;
         :
}


The srcu lock in __kvm_handle_hva_range() prevents the swapping of
the active and inactive memory slots by kvm_swap_active_memslots(). For
this particular case, it's not relevant because the swapping between
the inactive and active memory slots has been done for once, before
invalidate_range_start() is called.

Thanks,
Gavin