Message-ID: <e77266ae-6695-a1a2-f96a-b92c97a900c7@arm.com>
Date: Mon, 17 Jul 2017 14:03:11 +0100
From: Suzuki K Poulose <Suzuki.Poulose@....com>
To: Christoffer Dall <cdall@...aro.org>
Cc: Alexander Graf <agraf@...e.de>, kvmarm@...ts.cs.columbia.edu,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
Andrea Arcangeli <aarcange@...hat.com>, stable@...r.kernel.org
Subject: Re: [PATCH v2] KVM: arm/arm64: Handle hva aging while destroying the
vm
On 16/07/17 20:56, Christoffer Dall wrote:
> On Fri, Jul 14, 2017 at 05:40:48PM +0100, Suzuki K Poulose wrote:
>> On 06/07/17 10:42, Christoffer Dall wrote:
>>> On Thu, Jul 06, 2017 at 10:34:58AM +0100, Suzuki K Poulose wrote:
>>>> On 06/07/17 08:45, Christoffer Dall wrote:
>>>>> On Thu, Jul 06, 2017 at 09:07:49AM +0200, Alexander Graf wrote:
>>>>>>
>>>>>>
>>>>>> On 05.07.17 10:57, Suzuki K Poulose wrote:
>>>>>>> Hi Alex,
>>>>>>>
>>>>>>> On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
>>>>>>>> The kvm_age_hva callback may be called concurrently while
>>>>>>>> kvm_mmu_notifier_release() is running.
>>>>>>>>
>>>>>>>> The release function sets kvm->arch.pgd = NULL which the aging function
>>>>>>>> however implicitly relies on in stage2_get_pud(). That means they can
>>>>>>>> race and the aging function may dereference a NULL pgd pointer.
>>>>>>>>
>>>>>>>> This patch adds a check for that case, so that we leave the aging
>>>>>>>> function silently.
>>>>>>>>
>>>>>>>> Cc: stable@...r.kernel.org
>>>>>>>> Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
>>>>>>>> Signed-off-by: Alexander Graf <agraf@...e.de>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> v1 -> v2:
>>>>>>>>
>>>>>>>> - Fix commit message
>>>>>>>> - Add Fixes and stable tags
>>>>>>>> ---
>>>>>>>> virt/kvm/arm/mmu.c | 4 ++++
>>>>>>>> 1 file changed, 4 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>>>>>>> index f2d5b6c..227931f 100644
>>>>>>>> --- a/virt/kvm/arm/mmu.c
>>>>>>>> +++ b/virt/kvm/arm/mmu.c
>>>>>>>> @@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>>>>>>>> pgd_t *pgd;
>>>>>>>> pud_t *pud;
>>>>>>>> + /* Do we clash with kvm_free_stage2_pgd()? */
>>>>>>>> + if (!kvm->arch.pgd)
>>>>>>>> + return NULL;
>>>>>>>> +
>>>>>>>
>>>>>>> I think this check should be moved up in the chain. We call kvm_age_hva(), with
>>>>>>> the kvm->mmu_lock held and we don't release it till we reach here. So, ideally,
>>>>>>> if we find the PGD is null when we reach kvm_age_hva(), we could simply return
>>>>>>> there, like we do for the other callbacks from the KVM mmu_notifier.
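
(For illustration only, roughly what I have in mind; sketched from memory
against virt/kvm/arm/mmu.c, so the exact context lines may differ from the
tree:)

int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
{
	/*
	 * kvm_mmu_notifier_clear_flush_young() already holds kvm->mmu_lock,
	 * so if kvm_free_stage2_pgd() has run, arch.pgd stays NULL here and
	 * there is nothing left to age.
	 */
	if (!kvm->arch.pgd)
		return 0;

	trace_kvm_age_hva(start, end);
	return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
}

(kvm_test_age_hva() would get the same early return.)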
>>>>>>
>>>>>> That probably works too - I'm not sure which version is more
>>>>>> consistent as well as more maintainable in the long run. I'll leave
>>>>>> the call here to Christoffer.
>>>>>>
>>>>>
>>>>> Let's look at the callers to stage2_get_pmd, which is the only caller of
>>>>> stage2_get_pud, where the problem was observed:
>>>>>
>>>>> user_mem_abort
>>>>> -> stage2_set_pmd_huge
>>>>> -> stage2_get_pmd
>>>>>
>>>>> user_mem_abort
>>>>> -> stage2_set_pte
>>>>> -> stage2_get_pmd
>>>>>
>>>>> handle_access_fault
>>>>> -> stage2_get_pmd
>>>>>
>>>>> For the above three functions, pgd cannot ever be NULL, because this is
>>>>> running in the context of a VCPU thread, which means the reference on
>>>>> the VM fd must not reach zero, so no need to call that here.
>>>>
>>>> I think there is some problem here. See below for more information.
>>>>
>>>>>
>>>>> kvm_set_spte_handler
>>>>> -> stage2_set_pte
>>>>> -> stage2_get_pmd
>>>>>
>>>>> This is called from kvm_set_spte_hva, which is one of the MMU notifiers,
>>>>> so it can race similarly to kvm_age_hva and kvm_test_age_hva, but it
>>>>> already checks for !kvm->arch.pgd.
>>>>>
>>>>> kvm_phys_addr_ioremap
>>>>> -> stage2_set_pte
>>>>> -> stage2_get_pmd
>>>>>
>>>>> This is called from two places: (1) The VGIC code (as part of
>>>>> vgic_v2_map_resources) and can only be called in the context of running
>>>>> a VCPU, so the pgd cannot be null by virtue of the same argument as for
>>>>> user_mem_abort. (2) kvm_arch_prepare_memory_region calls
>>>>> kvm_phys_addr_ioremap, which is a VM ioctl so similarly, I cannot see
>>>>> how the VM can be in the middle of being freed while handling ioctls on
>>>>> the fd. Therefore, following the same argument, this should be safe as
>>>>> well.
>>>>>
>>>>> kvm_age_hva_handler and kvm_test_age_hva_handler
>>>>> -> stage2_get_pmd
>>>>>
>>>>> Handled by the patch proposed by Suzuki.
>>>>>
>>>>> What does all that tell us? First, it should give us some comfort that we
>>>>> don't have more races of this kind. Second, it tells us that there are
>>>>> a number of different and not-obvious call paths to stage2_get_pud,
>>>>> which could be an argument to simply check the pgd whenever it's called,
>>>>> despite the minor runtime overhead. On the other hand, the check itself
>>>>> is only valid knowing that we synchronize against kvm_free_stage2_pgd
>>>>> using the kvm->mmu_lock and understanding that this only happens when
>>>>> mmu notifiers call into the KVM MMU code outside the context of the VM.
>>>>>
>>>>> The last consideration is the winning argument for me to put the check
>>>>> in kvm_age_hva_handler and kvm_test_age_hva_handler, but I think it's
>>>>> important that we document why it's only these three high-level callers
>>>>> (incl. kvm_set_spte_handler) that need the check, either in the code or
>>>>> in the commit message.
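
(Something along these lines is what I would put near the checks; the
wording is mine, not taken from any tree:)

/*
 * The stage-2 pgd is only freed by kvm_free_stage2_pgd(), which runs with
 * kvm->mmu_lock held from the mmu_notifier release path.  Every other
 * caller of stage2_get_pud() runs in the context of a VCPU thread or a VM
 * ioctl and therefore holds a reference on the VM, so kvm->arch.pgd cannot
 * be NULL there.  Only the mmu_notifier callbacks (kvm_set_spte_handler,
 * kvm_age_hva_handler, kvm_test_age_hva_handler) can observe a NULL pgd
 * and must check for it under mmu_lock before walking the tables.
 */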
>>>>
>>>> The only way we end up freeing the stage-2 PGD is via the mmu_notifier_release(),
>>>> which could be triggered via two different paths.
>>>>
>>>> 1) kvm_destroy_vm(), where all the VM resources have been released and the
>>>> refcount on the KVM instance is dropped, via kvm_put_kvm().
>>>>
>>>> kvm_put_kvm()
>>>> kvm_destroy_vm()
>>>> mmu_notifier_unregister
>>>> mmu_notifier_ops->release()
>>>> kvm_arch_flush_shadow_all
>>>> kvm_free_stage2_pgd -> free the page table with the mmu_lock held,
>>>> occasionally releasing it to avoid contention.
>>>> or
>>>>
>>>> 2) do_signal -> get_signal -> do_group_exit ->
>>>> do_exit
>>>> exit_mm
>>>> mmput => __mmput
>>>> exit_mmap
>>>> mmu_notifier_release
>>>> mmu_notifier_ops->release
>>>> kvm_arch_flush_shadow_all
>>>> kvm_free_stage2_pgd
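
(Both paths end up in kvm_free_stage2_pgd(). For reference, a simplified
sketch of the teardown side, trimmed and written from memory rather than
quoted verbatim, which is why a !kvm->arch.pgd check taken under mmu_lock
on the notifier side is sound:)

void kvm_free_stage2_pgd(struct kvm *kvm)
{
	void *pgd = NULL;

	spin_lock(&kvm->mmu_lock);
	if (kvm->arch.pgd) {
		unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
		pgd = READ_ONCE(kvm->arch.pgd);
		kvm->arch.pgd = NULL;	/* notifier callbacks see NULL from here on */
	}
	spin_unlock(&kvm->mmu_lock);

	/* Free the HW pgd; struct kvm itself is still alive at this point */
	if (pgd)
		free_pages_exact(pgd, S2_PGD_SIZE);
}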
>>>>
>>>> In the first case, all references to the VM have been dropped and hence none of the
>>>> VCPUs could still be executing.
>>>>
>>>> However, in the second case it may not be. So we have a potential problem with
>>>> the VCPU trying to run even when the pages were unmapped. I think the root cause
>>>> of all these issues boils down to the assumption that KVM holds a reference to
>>>> the MM (which is not necessarily a reference to the user space mappings, i.e. mmgrab vs mmget).
>>>> I am not sure if the VCPU should hold a reference to the mmaps to make sure
>>>> it is safe to run. That way, the mmap stays until the VCPU eventually exits
>>>> and we are safe all the way around.
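
(Illustration only, not a proposed patch, of the distinction I mean
between the two kinds of reference; the helpers are the generic mm ones
from current mainline:)

	/*
	 * Holding only the mm_struct: the structure stays allocated, but
	 * exit_mmap() can still tear down the user mappings underneath us.
	 */
	mmgrab(mm);
	/* ... */
	mmdrop(mm);

	/*
	 * Holding the address space as well: the mmaps cannot go away while
	 * we hold this reference, so it would be safe for the VCPU to run.
	 */
	if (mmget_not_zero(mm)) {
		/* ... run the VCPU / walk user mappings ... */
		mmput(mm);
	}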
>>>
>>> Hmmm, my assumption is that if a VCPU is running, it means there is a
>>> VCPU thread that shares the struct mm which is running, so I don't
>>> understand how mmput would be able to call exit_mmap in the scenario
>>> above?
>>>
>>> So the distinction here is that I don't assume that the VCPU fd holds a
>>> reference to the mm, but I assume that the (running) VCPU thread does.
>>> Is this incorrect?
>>
>> Sorry, I lost this thread in between.
>
> No worries.
>
>>
>> Hmm. You're right. The VCPU should have a refcount on the mmap and it shouldn't
>> do anything with the MMU once it has dropped it. I was confused by an
>> old bug report [see the description of commit 293f293637b55d ("kvm-arm: Unmap shadow pagetables
>> properly")], which has since been fixed.
>>
>
This keeps haunting me :(. I happened to look at another report from Alex back in April,
where a VCPU ends up in an mmu_notifier callback after the PGD was freed! So it does
look like you could end up freeing the stage-2 page table while yet another VCPU is
still running.

See: https://lkml.org/lkml/2017/4/21/820

I am trying to see if I can reproduce that scenario.
Suzuki
> OK, that makes sense to me. I hope Andrea also agrees with this, so as
> long as a VCPU thread is running, exit_mm() will not be called.
>
> Thanks,
> -Christoffer
>