Message-ID: <f6639daa-cfba-c65a-7320-c9dcc1ef8377@huawei.com>
Date: Sun, 17 Mar 2019 21:34:11 +0800
From: Zenghui Yu <yuzenghui@...wei.com>
To: Suzuki K Poulose <suzuki.poulose@....com>, <zhengxiang9@...wei.com>
CC: <marc.zyngier@....com>, <christoffer.dall@....com>,
<catalin.marinas@....com>, <will.deacon@....com>,
<james.morse@....com>, <linux-arm-kernel@...ts.infradead.org>,
<kvmarm@...ts.cs.columbia.edu>, <linux-kernel@...r.kernel.org>,
<wanghaibin.wang@...wei.com>, <lious.lilei@...ilicon.com>,
<lishuo1@...ilicon.com>
Subject: Re: [RFC] Question about TLB flush while set Stage-2 huge pages
Hi Suzuki,
On 2019/3/15 22:56, Suzuki K Poulose wrote:
> Hi Zhengui,
s/Zhengui/Zheng/
(I think you meant to say "Hi" to Zheng :-) )
I have looked into your patch and the kernel log, and I believe that
your patch already addresses this issue. But I think we can do a little
better: two more points need to be handled with care.
Take a PMD hugepage (PMD_SIZE == 2M) as an example:
>
> On 15/03/2019 08:21, Zheng Xiang wrote:
>> Hi Suzuki,
>>
>> I have tested this patch, VM doesn't hang and we get expected WARNING
>> log:
>
> Thanks for the quick testing !
>
>> However, we also get the following unexpected log:
>>
>> [ 908.329900] BUG: Bad page state in process qemu-kvm pfn:a2fb41cf
>> [ 908.339415] page:ffff7e28bed073c0 count:-4 mapcount:0
>> mapping:0000000000000000 index:0x0
>> [ 908.339416] flags: 0x4ffffe0000000000()
>> [ 908.339418] raw: 4ffffe0000000000 dead000000000100 dead000000000200
>> 0000000000000000
>> [ 908.339419] raw: 0000000000000000 0000000000000000 fffffffcffffffff
>> 0000000000000000
>> [ 908.339420] page dumped because: nonzero _refcount
>> [ 908.339437] CPU: 32 PID: 72599 Comm: qemu-kvm Kdump: loaded
>> Tainted: G B W 5.0.0+ #1
>> [ 908.339438] Call trace:
>> [ 908.339439] dump_backtrace+0x0/0x188
>> [ 908.339441] show_stack+0x24/0x30
>> [ 908.339442] dump_stack+0xa8/0xcc
>> [ 908.339443] bad_page+0xf0/0x150
>> [ 908.339445] free_pages_check_bad+0x84/0xa0
>> [ 908.339446] free_pcppages_bulk+0x4b8/0x750
>> [ 908.339448] free_unref_page_commit+0x13c/0x198
>> [ 908.339449] free_unref_page+0x84/0xa0
>> [ 908.339451] __free_pages+0x58/0x68
>> [ 908.339452] zap_huge_pmd+0x290/0x2d8
>> [ 908.339454] unmap_page_range+0x2b4/0x470
>> [ 908.339455] unmap_single_vma+0x94/0xe8
>> [ 908.339457] unmap_vmas+0x8c/0x108
>> [ 908.339458] exit_mmap+0xd4/0x178
>> [ 908.339459] mmput+0x74/0x180
>> [ 908.339460] do_exit+0x2b4/0x5b0
>> [ 908.339462] do_group_exit+0x3c/0xe0
>> [ 908.339463] __arm64_sys_exit_group+0x24/0x28
>> [ 908.339465] el0_svc_common+0xa0/0x180
>> [ 908.339466] el0_svc_handler+0x38/0x78
>> [ 908.339467] el0_svc+0x8/0xc
>
> That's bad, we seem to be making up to 4 unbalanced put_page() calls.
>
>>>> ---
>>>> virt/kvm/arm/mmu.c | 51
>>>> +++++++++++++++++++++++++++++++++++----------------
>>>> 1 file changed, 35 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>>> index 66e0fbb5..04b0f9b 100644
>>>> --- a/virt/kvm/arm/mmu.c
>>>> +++ b/virt/kvm/arm/mmu.c
>>>> @@ -1076,24 +1076,38 @@ static int stage2_set_pmd_huge(struct kvm
>>>> *kvm, struct kvm_mmu_memory_cache
>>>> * Skip updating the page table if the entry is
>>>> * unchanged.
>>>> */
>>>> - if (pmd_val(old_pmd) == pmd_val(*new_pmd))
>>>> + if (pmd_val(old_pmd) == pmd_val(*new_pmd)) {
>>>> return 0;
>>>> -
>>>> + } else if (WARN_ON_ONCE(!pmd_thp_or_huge(old_pmd))) {
>>>> /*
>>>> - * Mapping in huge pages should only happen through a
>>>> - * fault. If a page is merged into a transparent huge
>>>> - * page, the individual subpages of that huge page
>>>> - * should be unmapped through MMU notifiers before we
>>>> - * get here.
>>>> - *
>>>> - * Merging of CompoundPages is not supported; they
>>>> - * should become splitting first, unmapped, merged,
>>>> - * and mapped back in on-demand.
>>>> + * If we have PTE level mapping for this block,
>>>> + * we must unmap it to avoid inconsistent TLB
>>>> + * state. We could end up in this situation if
>>>> + * the memory slot was marked for dirty logging
>>>> + * and was reverted, leaving PTE level mappings
>>>> + * for the pages accessed during the period.
>>>> + * Normal THP split/merge follows mmu_notifier
>>>> + * callbacks and do get handled accordingly.
>>>> */
>>>> - VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>>>> + unmap_stage2_range(kvm, (addr & S2_PMD_MASK),
>>>> S2_PMD_SIZE);
First, using unmap_stage2_range() here is not quite appropriate. Suppose
we've only accessed one 2M page in the HPA [x, x+1] GiB range, with the
other pages unaccessed. What will happen if we call
unmap_stage2_range(this_2M_page)? Once that single entry is cleared, the
PMD table looks empty, so we'll unexpectedly reach
clear_stage2_pud_entry(), and things are going to get really bad. So we'd
better use unmap_stage2_ptes() here since we only want to unmap a 2M
range.
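
To make the first point concrete, here is a toy user-space model (not
kernel code). It assumes the convention that a table page holds one base
reference plus one reference per populated entry, so a count of 1 means
"empty". Every toy_* name is made up for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the page backing a PMD-level table (hypothetical). */
struct toy_table {
	int refcount;	/* 1 base reference + 1 per populated entry */
	bool freed;	/* set once the table has been torn down */
};

/* Assumed convention: only the base reference left means "empty". */
static bool toy_table_empty(const struct toy_table *t)
{
	return t->refcount == 1;
}

/* Drop the reference held for one populated entry. */
static void toy_clear_entry(struct toy_table *t)
{
	assert(!t->freed && t->refcount > 1);
	t->refcount--;
}

/*
 * Models a range unmap that walks down from the top of the tree: after
 * clearing the entry it also tears down the table if it became empty
 * (the step that corresponds to clearing the PUD entry).
 */
static void toy_unmap_range(struct toy_table *pmd_table)
{
	toy_clear_entry(pmd_table);
	if (toy_table_empty(pmd_table))
		pmd_table->freed = true;
}

/* Models an unmap confined to one PMD entry: the table itself stays. */
static void toy_unmap_ptes(struct toy_table *pmd_table)
{
	toy_clear_entry(pmd_table);
}
```

Starting from a table with a single populated entry (refcount == 2),
toy_unmap_range() tears the table down, while toy_unmap_ptes() leaves it
in place so that the block entry can still be written into it.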
Second, consider the following call chain:
  unmap_stage2_ptes()
    clear_stage2_pmd_entry()
      put_page(virt_to_page(pmd))
It looks as if we have one "redundant" put_page() here (hence the bad
kernel log ...), but actually we do not. After stage2_set_pmd_huge(),
the PMD table entry will point to a 2M block (it originally pointed
to a PTE table), so the _refcount of this PMD-level table page should
_not_ change across unmap_stage2_ptes(). What we really should do is add
a get_page() after unmapping to keep the _refcount balanced!
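
The accounting can be sketched with another toy user-space model (again
not kernel code, and all toy_* names are hypothetical). It assumes the
table page holds one reference per populated entry, and that replacing a
PTE-table entry with a block entry leaves the number of populated entries
unchanged.

```c
#include <assert.h>

/* Toy stand-in for the page backing a PMD-level table (hypothetical). */
struct toy_page {
	int refcount;	/* 1 base reference + 1 per populated entry */
};

static void toy_get_page(struct toy_page *p)
{
	p->refcount++;
}

static void toy_put_page(struct toy_page *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/*
 * Replace a PTE-table entry with a block entry in the same slot.
 * 'rebalance' selects whether the extra get is taken after the unmap.
 * Returns the resulting refcount of the table page.
 */
static int toy_replace_table_with_block(struct toy_page *pmd_table,
					int rebalance)
{
	/* Unmapping clears the entry and drops its reference. */
	toy_put_page(pmd_table);

	/* The slot is about to be populated again by the block entry. */
	if (rebalance)
		toy_get_page(pmd_table);

	/* The block entry would be written here. */
	return pmd_table->refcount;
}
```

With the extra get, a table page that started at 2 is still at 2 after
the replacement. Without it the count drops to 1 even though the slot is
populated, which is the kind of imbalance described above.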
Thoughts? A simple patch (based on yours) is below for details.
thanks,
zenghui
>>
>> It seems that kvm decreases the _refcount of the page twice in
>> transparent_hugepage_adjust()
>> and unmap_stage2_range().
>
> But I thought we should be doing that on the head_page already, as this
> is THP.
> I will take a look and get back to you on this. Btw, is it possible for you
> to turn on CONFIG_DEBUG_VM and re-run with the above patch ?
>
> Kind regards
> Suzuki
>
---8<---
test: kvm: arm: Maybe two more fixes
Applied based on Suzuki's patch.
Signed-off-by: Zenghui Yu <yuzenghui@...wei.com>
---
virt/kvm/arm/mmu.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 05765df..ccd5d5d 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1089,7 +1089,9 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
* Normal THP split/merge follows mmu_notifier
* callbacks and do get handled accordingly.
*/
- unmap_stage2_range(kvm, (addr & S2_PMD_MASK), S2_PMD_SIZE);
+ addr &= S2_PMD_MASK;
+ unmap_stage2_ptes(kvm, pmd, addr, addr + S2_PMD_SIZE);
+ get_page(virt_to_page(pmd));
} else {
/*
@@ -1138,7 +1140,9 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
if (stage2_pud_present(kvm, old_pud)) {
/* If we have PTE level mapping, unmap the entire range */
if (WARN_ON_ONCE(!stage2_pud_huge(kvm, old_pud))) {
- unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
+ addr &= S2_PUD_MASK;
+ unmap_stage2_pmds(kvm, pudp, addr, addr + S2_PUD_SIZE);
+ get_page(virt_to_page(pudp));
} else {
stage2_pud_clear(kvm, pudp);
kvm_tlb_flush_vmid_ipa(kvm, addr);
--
1.8.3.1