Message-ID: <295a2c95-d75a-6857-0b24-e33e387fec95@redhat.com>
Date: Fri, 9 Jun 2023 19:02:10 +1000
From: Gavin Shan <gshan@...hat.com>
To: Marc Zyngier <maz@...nel.org>
Cc: kvmarm@...ts.linux.dev, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, pbonzini@...hat.com,
seanjc@...gle.com, oliver.upton@...ux.dev, hshuai@...hat.com,
zhenyzha@...hat.com, shan.gavin@...il.com,
Andrea Arcangeli <aarcange@...hat.com>,
Peter Xu <peterx@...hat.com>,
David Hildenbrand <david@...hat.com>
Subject: Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot
On 6/9/23 6:06 PM, Marc Zyngier wrote:
> On Fri, 09 Jun 2023 00:17:34 +0100,
> Gavin Shan <gshan@...hat.com> wrote:
>> On 6/9/23 12:31 AM, Marc Zyngier wrote:
>>> On Thu, 08 Jun 2023 10:03:48 +0100,
>>> Gavin Shan <gshan@...hat.com> wrote:
>>>>
>>>> We run into a guest hang in the edk2 firmware when KSM is kept
>>>> running on the host. The edk2 firmware is waiting for status 0x80
>>>> from QEMU's pflash device (TYPE_PFLASH_CFI01) during a sector-erase
>>>> or buffered-write operation. The status is fetched by reading the
>>>> pflash device's memory region, and the read request should be
>>>> forwarded to QEMU and emulated by it. Unfortunately, when the guest
>>>> hang occurs, the read request is covered by an illegal stage2
>>>> mapping: the read completes with QEMU bypassed and a wrong status
>>>> is fetched.
>>>>
>>>> The illegal stage2 mapping is populated by same-page merging (KSM)
>>>> at (C), even though the associated memory slot has already been
>>>> marked invalid at (B).
>>>>
>>>> CPU-A                                              CPU-B
>>>> -----                                              -----
>>>> ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
>>>>   kvm_vm_ioctl_set_memory_region
>>>>     kvm_set_memory_region
>>>>       __kvm_set_memory_region
>>>>         kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
>>>>           kvm_invalidate_memslot
>>>>             kvm_copy_memslot
>>>>             kvm_replace_memslot
>>>>             kvm_swap_active_memslots          (A)
>>>>             kvm_arch_flush_shadow_memslot     (B)
>>>>                                                    same page merging by KSM
>>>>                                                    kvm_mmu_notifier_change_pte
>>>>                                                      kvm_handle_hva_range
>>>>                                                        __kvm_handle_hva_range (C)
>>>>
>>>> Fix the issue by skipping invalid memory slots at (C) to avoid the
>>>> illegal stage2 mapping. Without the illegal stage2 mapping, the read
>>>> request for the pflash status is forwarded to QEMU and emulated by
>>>> it, so the correct status is returned and the infinite wait in the
>>>> edk2 firmware is broken.
>>>
>>> Huh, nice one :-(.
>>>
>>
>> Yeah, it's a sneaky one :)
>>
>>>>
>>>> Cc: stable@...r.kernel.org # v5.13+
>>>> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
>>>> Reported-by: Shuai Hu <hshuai@...hat.com>
>>>> Reported-by: Zhenyu Zhang <zhenyzha@...hat.com>
>>>> Signed-off-by: Gavin Shan <gshan@...hat.com>
>>>> ---
>>>> virt/kvm/kvm_main.c | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>> index 479802a892d4..7f81a3a209b6 100644
>>>> --- a/virt/kvm/kvm_main.c
>>>> +++ b/virt/kvm/kvm_main.c
>>>> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>>>>  			unsigned long hva_start, hva_end;
>>>>  
>>>>  			slot = container_of(node, struct kvm_memory_slot,
>>>>  					    hva_node[slots->node_idx]);
>>>> +			if (slot->flags & KVM_MEMSLOT_INVALID)
>>>> +				continue;
>>>> +
>>>>  			hva_start = max(range->start, slot->userspace_addr);
>>>>  			hva_end = min(range->end, slot->userspace_addr +
>>>>  				      (slot->npages << PAGE_SHIFT));
>>>
>>> I don't immediately see what makes it safer. If we're not holding one
>>> of slots_{,arch_}lock in the notifier, we can still race against the
>>> update, can't we? I don't think holding the srcu lock helps us here.
>>>
>
> [...]
>
>> change_pte() is always surrounded by invalidate_range_start() and
>> invalidate_range_end(), which means kvm->mn_active_invalidate_count is
>> always larger than zero when change_pte() is called. Under this
>> condition (kvm->mn_active_invalidate_count > 0), the swap between the
>> inactive and active memory slots by kvm_swap_active_memslots() can't
>> happen. So there are two possible cases for a memory slot when
>> change_pte() is called:
>> (a) it has been marked KVM_MEMSLOT_INVALID in the active memory slots
>> by kvm_invalidate_memslot(), called before invalidate_range_start();
>> (b) the memory slot has been deleted from the active memory slots.
>> Only (a) matters when the active memory slots are iterated in
>> __kvm_handle_hva_range().
>
> OK, so to sum it up:
>
> - the memslot cannot be swapped while we're walking the active
> memslots because kvm->mn_active_invalidate_count is elevated, and
> kvm_swap_active_memslots() will busy loop until this has reached 0
> again
>
> - holding the srcu read_lock prevents an overlapping memslot update
> from being published at the wrong time (synchronize_srcu_expedited()
> in kvm_swap_active_memslots())
>
> If the above holds, then I agree the fix looks correct. I'd definitely
> want to see some of this rationale captured in the commit message
> though.
>
Yes, your summary is exactly what I understood. I will improve the
changelog to include the rationale in v2.
Thanks,
Gavin
>
>>
>> static void kvm_mmu_notifier_change_pte(...)
>> {
>> :
>> /*
>> * .change_pte() must be surrounded by .invalidate_range_{start,end}().
>> * If mmu_invalidate_in_progress is zero, then no in-progress
>> * invalidations, including this one, found a relevant memslot at
>> * start(); rechecking memslots here is unnecessary. Note, a false
>> * positive (count elevated by a different invalidation) is sub-optimal
>> * but functionally ok.
>> */
>> WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
>> if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
>> return;
>> :
>> }
>>
>>
>> The srcu lock in __kvm_handle_hva_range() prevents the swap of the
>> active and inactive memory slots by kvm_swap_active_memslots(). For
>> this particular case it's not relevant, because the swap between the
>> inactive and active memory slots has already happened once, before
>> invalidate_range_start() is called.
>
>