Message-ID: <295a2c95-d75a-6857-0b24-e33e387fec95@redhat.com>
Date: Fri, 9 Jun 2023 19:02:10 +1000
From: Gavin Shan <gshan@...hat.com>
To: Marc Zyngier <maz@...nel.org>
Cc: kvmarm@...ts.linux.dev, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, pbonzini@...hat.com,
seanjc@...gle.com, oliver.upton@...ux.dev, hshuai@...hat.com,
zhenyzha@...hat.com, shan.gavin@...il.com,
Andrea Arcangeli <aarcange@...hat.com>,
Peter Xu <peterx@...hat.com>,
David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot
On 6/9/23 6:06 PM, Marc Zyngier wrote:
> On Fri, 09 Jun 2023 00:17:34 +0100,
> Gavin Shan <gshan@...hat.com> wrote:
>> On 6/9/23 12:31 AM, Marc Zyngier wrote:
>>> On Thu, 08 Jun 2023 10:03:48 +0100,
>>> Gavin Shan <gshan@...hat.com> wrote:
>>>>
>>>> We run into a guest hang in the edk2 firmware when KSM is kept
>>>> running on the host. The edk2 firmware is waiting for status 0x80
>>>> from QEMU's pflash device (TYPE_PFLASH_CFI01) during a sector-erase
>>>> or buffered-write operation. The status is fetched by reading the
>>>> pflash device's memory region, and the read request should be
>>>> forwarded to QEMU and emulated by it. Unfortunately, when the guest
>>>> hang occurs, the read request is covered by an illegal stage2
>>>> mapping: the read completes with QEMU bypassed and a wrong status
>>>> is fetched.
>>>>
>>>> The illegal stage2 mapping is populated by same-page merging (KSM)
>>>> at (C), even though the associated memory slot has already been
>>>> marked invalid at (B).
>>>>
>>>> CPU-A                                              CPU-B
>>>> -----                                              -----
>>>> ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
>>>>   kvm_vm_ioctl_set_memory_region
>>>>     kvm_set_memory_region
>>>>       __kvm_set_memory_region
>>>>         kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
>>>>           kvm_invalidate_memslot
>>>>             kvm_copy_memslot
>>>>             kvm_replace_memslot
>>>>             kvm_swap_active_memslots          (A)
>>>>             kvm_arch_flush_shadow_memslot     (B)
>>>>                                                    same page merging by KSM
>>>>                                                    kvm_mmu_notifier_change_pte
>>>>                                                      kvm_handle_hva_range
>>>>                                                        __kvm_handle_hva_range (C)
>>>>
>>>> Fix the issue by skipping invalid memory slots at (C) to avoid the
>>>> illegal stage2 mapping. Without the illegal stage2 mapping, the read
>>>> request for the pflash status is forwarded to QEMU and emulated by
>>>> it, so the correct status is returned and the infinite wait in the
>>>> edk2 firmware is broken.
>>>
>>> Huh, nice one :-(.
>>>
>>
>> Yeah, it's a sneaky one :)
>>
>>>>
>>>> Cc: stable@...r.kernel.org # v5.13+
>>>> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
>>>> Reported-by: Shuai Hu <hshuai@...hat.com>
>>>> Reported-by: Zhenyu Zhang <zhenyzha@...hat.com>
>>>> Signed-off-by: Gavin Shan <gshan@...hat.com>
>>>> ---
>>>> virt/kvm/kvm_main.c | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 479802a892d4..7f81a3a209b6 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>>>  			unsigned long hva_start, hva_end;
>>>>  
>>>>  			slot = container_of(node, struct kvm_memory_slot,
>>>>  					    hva_node[slots->node_idx]);
>>>> +			if (slot->flags & KVM_MEMSLOT_INVALID)
>>>> +				continue;
>>>> +
>>>>  			hva_start = max(range->start, slot->userspace_addr);
>>>>  			hva_end = min(range->end, slot->userspace_addr +
>>>>  				      (slot->npages << PAGE_SHIFT));
>>>
>>> I don't immediately see what makes it safer. If we're not holding one
>>> of slots_{,arch_}lock in the notifier, we can still race against the
>>> update, can't we? I don't think holding the srcu lock helps us here.
>>>
>
> [...]
>
>> change_pte() is always surrounded by invalidate_range_start() and
>> invalidate_range_end(), which means kvm->mn_active_invalidate_count is
>> always larger than zero when change_pte() is called. Under this
>> condition (kvm->mn_active_invalidate_count > 0), the swap between the
>> inactive and active memory slots by kvm_swap_active_memslots() can't
>> happen. So there are two possible cases for a memory slot when
>> change_pte() is called:
>> (a) it has been marked KVM_MEMSLOT_INVALID in the active memory slots
>> by kvm_invalidate_memslot(), called before invalidate_range_start();
>> (b) the memory slot has been deleted from the active memory slots.
>> Only (a) matters when the active memory slots are iterated in
>> __kvm_handle_hva_range().
>
> OK, so to sum it up:
>
> - the memslot cannot be swapped while we're walking the active
> memslots because kvm->mn_active_invalidate_count is elevated, and
> kvm_swap_active_memslots() will busy loop until this has reached 0
> again
>
> - holding the srcu read_lock prevents an overlapping memslot update
> from being published at the wrong time (synchronize_srcu_expedited()
> in kvm_swap_active_memslots())
>
> If the above holds, then I agree the fix looks correct. I'd definitely
> want to see some of this rationale captured in the commit message
> though.
>
Yes, your summary is exactly what I understood. I will improve the
changelog to include the rationale in v2.
Thanks,
Gavin
>
>>
>> static void kvm_mmu_notifier_change_pte(...)
>> {
>> :
>> /*
>> * .change_pte() must be surrounded by .invalidate_range_{start,end}().
>> * If mmu_invalidate_in_progress is zero, then no in-progress
>> * invalidations, including this one, found a relevant memslot at
>> * start(); rechecking memslots here is unnecessary. Note, a false
>> * positive (count elevated by a different invalidation) is sub-optimal
>> * but functionally ok.
>> */
>> WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
>> if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
>> return;
>> :
>> }
>>
>>
>> The srcu lock in __kvm_handle_hva_range() prevents the swap of the
>> active and inactive memory slots by kvm_swap_active_memslots(). For
>> this particular case it's not relevant, because the swap between the
>> inactive and active memory slots has already happened once, before
>> invalidate_range_start() is called.
>
>