[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <695fe5a2-a295-4105-b31b-4cc92b089659@redhat.com>
Date: Fri, 9 Jun 2023 09:17:34 +1000
From: Gavin Shan <gshan@...hat.com>
To: Marc Zyngier <maz@...nel.org>
Cc: kvmarm@...ts.linux.dev, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, pbonzini@...hat.com,
seanjc@...gle.com, oliver.upton@...ux.dev, hshuai@...hat.com,
zhenyzha@...hat.com, shan.gavin@...il.com,
Andrea Arcangeli <aarcange@...hat.com>,
Peter Xu <peterx@...hat.com>,
David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot
Hi Marc,
[Cc Andrea/David/Peter Xu]
On 6/9/23 12:31 AM, Marc Zyngier wrote:
> On Thu, 08 Jun 2023 10:03:48 +0100,
> Gavin Shan <gshan@...hat.com> wrote:
>>
>> We run into guest hang in edk2 firmware when KSM is kept as running
>> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's
>> pflash device (TYPE_PFLASH_CFI01) during the operation for sector
>> erasing or buffered write. The status is returned by reading the
>> memory region of the pflash device and the read request should
>> have been forwarded to QEMU and emulated by it. Unfortunately, the
>> read request is covered by an illegal stage2 mapping when the guest
>> hang issue occurs. The read request is completed with QEMU bypassed and
>> wrong status is fetched.
>>
>> The illegal stage2 mapping is populated due to same page mering by
>> KSM at (C) even the associated memory slot has been marked as invalid
>> at (B).
>>
>> CPU-A CPU-B
>> ----- -----
>> ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
>> kvm_vm_ioctl_set_memory_region
>> kvm_set_memory_region
>> __kvm_set_memory_region
>> kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
>> kvm_invalidate_memslot
>> kvm_copy_memslot
>> kvm_replace_memslot
>> kvm_swap_active_memslots (A)
>> kvm_arch_flush_shadow_memslot (B)
>> same page merging by KSM
>> kvm_mmu_notifier_change_pte
>> kvm_handle_hva_range
>> __kvm_handle_hva_range (C)
>>
>> Fix the issue by skipping the invalid memory slot at (C) to avoid the
>> illegal stage2 mapping. Without the illegal stage2 mapping, the read
>> request for the pflash's status is forwarded to QEMU and emulated by
>> it. The correct pflash's status can be returned from QEMU to break
>> the infinite wait in edk2 firmware.
>
> Huh, nice one :-(.
>
Yeah, it's a sneaky one :)
>>
>> Cc: stable@...r.kernel.org # v5.13+
>> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
>> Reported-by: Shuai Hu <hshuai@...hat.com>
>> Reported-by: Zhenyu Zhang <zhenyzha@...hat.com>
>> Signed-off-by: Gavin Shan <gshan@...hat.com>
>> ---
>> virt/kvm/kvm_main.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 479802a892d4..7f81a3a209b6 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>> unsigned long hva_start, hva_end;
>>
>> slot = container_of(node, struct kvm_memory_slot, hva_node[slots->node_idx]);
>> + if (slot->flags & KVM_MEMSLOT_INVALID)
>> + continue;
>> +
>> hva_start = max(range->start, slot->userspace_addr);
>> hva_end = min(range->end, slot->userspace_addr +
>> (slot->npages << PAGE_SHIFT));
>
> I don't immediately see what makes it safer. If we're not holding one
> of slots_{,arch_}lock in the notifier, we can still race against the
> update, can't we? I don't think holding the srcu lock helps us here.
>
The dead-lock will be triggered if we're holding @slots_lock in
__kvm_handle_hva_range() when we have call site like below. There is
similar reason why we can't hold @slots_arch_lock in the function.
CPU-A CPU-B
----- ------
invalidate_range_start()
kvm->mn_active_invalidate_count++
__kvm_handle_hva_range
ioctl(kvm, KVM_SET_USER_MEMORY_REGION) // Delete
kvm_vm_ioctl_set_memory_region
mutex_lock(&kvm->slots_lock);
__kvm_set_memory_region
kvm_set_memslot(..., KVM_MR_DELETE)
kvm_invalidate_memslot
kvm_replace_memslot
kvm_swap_active_memslots // infinite wait until
mutex_unlock(&kvm->slots_lock); // kvm->mn_active_invalidate_count is 0
invalidate_range_end
__kvm_handle_hva_range
mutex_lock(&kvm->slots_lock); // lock taken by CPU-B
mutex_unlock(&kvm->slots_lock);
--kvm->mn_active_invalidate_count
change_pte() is always surrounded by invalidate_range_start and
invalidate_range_end(). It means kvm->mn_active_invalidate_count is always
larger than zero when change_pte() is called. With this condition
(kvm->mn_active_invalidate_count > 0), The swapping between the inactive
and active memory slots by kvm_swap_active_memslots() can't be done.
So there are two cases for one memory slot when change_pte() is called:
(a) it has been marked as KVM_MEMSLOT_INVALID in the active memory slots
by kvm_invalidate_memslot(), called before invalidate_range_start();
(b) the memory slot has been deleted from the active memory slots. We're
only concerned by (a) when the active memory slots are iterated in
__kvm_handle_hva_range().
static void kvm_mmu_notifier_change_pte(...)
{
:
/*
* .change_pte() must be surrounded by .invalidate_range_{start,end}().
* If mmu_invalidate_in_progress is zero, then no in-progress
* invalidations, including this one, found a relevant memslot at
* start(); rechecking memslots here is unnecessary. Note, a false
* positive (count elevated by a different invalidation) is sub-optimal
* but functionally ok.
*/
WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
return;
:
}
The srcu lock in __kvm_handle_hva_range() prevents the swapping of
the active and inactive memory slots by kvm_swap_active_memslots(). For
this particular case, it's not relevant because the swapping between
the inactive and active memory slots has been done for once, before
invalidate_range_start() is called.
Thanks,
Gavin
Powered by blists - more mailing lists