Message-ID: <alpine.LSU.2.00.1202160130500.16147@eggly.anvils>
Date: Thu, 16 Feb 2012 01:53:04 -0800 (PST)
From: Hugh Dickins <hughd@...gle.com>
To: Andrea Arcangeli <aarcange@...hat.com>
cc: Dave Jones <davej@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Rientjes <rientjes@...gle.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
kernel-team@...oraproject.org
Subject: Re: exit_mmap() BUG_ON triggering since 3.1
On Thu, 16 Feb 2012, Andrea Arcangeli wrote:
> On Wed, Feb 15, 2012 at 06:14:12PM -0800, Hugh Dickins wrote:
> > Now most of those paths in THP also hold mmap_sem for read or write (with
> > appropriate checks on mm_users), but two do not: when split_huge_page()
> > is called by hwpoison_user_mappings(), and when called by add_to_swap().
>
> So the race is split_huge_page_map() called by add_to_swap() running
> concurrently with free_pgtables. Great catch!!
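To spell out the window as I reconstruct it (simplified):

	CPU0: add_to_swap()                  CPU1: munmap() or exit_mmap()
	  split_huge_page()
	    __split_huge_page_map()            free_pgtables()
	      spin_lock(page_table_lock)         free_pte_range()
	      mm->nr_ptes++   <--- races --->      mm->nr_ptes--
	      spin_unlock(page_table_lock)         (no page_table_lock here:
	                                            relies on mmap_sem held
	                                            for write, or mm_users 0)

Since CPU1's update takes no page_table_lock, the two read-modify-writes
can interleave and one update be lost; the nr_ptes BUG_ON in exit_mmap()
then trips over the stale count.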
>
> > Or should we try harder to avoid the extra locking: test mm_users?
> > #ifdef on THP? Or consider the accuracy of this count not worth
> > extra locking, and just scrap the BUG_ON now?
>
> It's probably also happening with a large munmap, while add_to_swap
> runs on another vma. The process hasn't exited yet, but the actual
> BUG_ON check runs at exit. So I doubt aborting split_huge_page on
> zero mm_users could solve it.
Indeed; what I meant was to make the spin_lock and spin_unlock in my
patch conditional on mm_users, not to make split_huge_page conditional
on it.
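Something like this in the pagetable freeing path is what I had in
mind (a hypothetical sketch of that alternative, not my actual patch):

	/*
	 * Sketch: take page_table_lock around the decrement only while
	 * there may still be users of the mm who could run
	 * split_huge_page() against it concurrently.
	 */
	if (atomic_read(&mm->mm_users)) {
		spin_lock(&mm->page_table_lock);
		mm->nr_ptes--;
		spin_unlock(&mm->page_table_lock);
	} else {
		mm->nr_ptes--;	/* exit path: nobody left to race with */
	}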
>
> The good part is that, this being a false positive, these oopses are
> just a nuisance: they can't corrupt any memory or disk etc...
Yes (and I think less troublesome than most BUGs, coming at exit
while not holding locks; though we could well make it a WARN_ON,
I don't think that existed back in the day).
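Concretely, that would be roughly this at the end of exit_mmap():

	/* sketch: log a bad count rather than BUG */
	WARN_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);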
>
> The simplest is probably to change nr_ptes to count THPs too. I tried
> that and no oopses so far but it's not very well tested yet.
Oh, I like that, that's a much nicer fix than mine. If you're happy
to change the THP end (which I could hardly blame for getting those odd
rules slightly wrong), and it passes your testing, then certainly add my
Acked-by: Hugh Dickins <hughd@...gle.com>
In looking into the bug, it had actually bothered me a little that you
were setting aside those pages, yet not counting them into nr_ptes;
though the only thing that cares is oom_kill.c, and the count of pages
in each hugepage can only dwarf the count in nr_ptes (whereas, without
hugepages, it's possible to populate very sparsely and nr_ptes becomes
significant).
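For reference, the consumer in oom_badness() looks roughly like this
(from memory of the current code, not a verbatim quote):

	/*
	 * rss already counts HPAGE_PMD_NR (512) pages for each mapped
	 * hugepage, so the single page of nr_ptes per hugepage is
	 * noise beside it.
	 */
	points = get_mm_rss(p->mm) + p->mm->nr_ptes;
	points += get_mm_counter(p->mm, MM_SWAPENTS);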
>
> ====
> From: Andrea Arcangeli <aarcange@...hat.com>
> Subject: [PATCH] mm: thp: fix BUG on mm->nr_ptes
>
> Quoting Hugh's discovery and explanation of the SMP race condition:
>
> ===
> mm->nr_ptes had unusual locking: down_read mmap_sem plus
> page_table_lock when incrementing, down_write mmap_sem (or mm_users 0)
> when decrementing; whereas THP is careful to increment and decrement
> it under page_table_lock.
>
> Now most of those paths in THP also hold mmap_sem for read or write
> (with appropriate checks on mm_users), but two do not: when
> split_huge_page() is called by hwpoison_user_mappings(), and when
> called by add_to_swap().
>
> It's conceivable that the latter case is responsible for the
> exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora.
> ===
>
> The simplest way to fix it without having to alter the locking is to
> make split_huge_page() a noop in nr_ptes terms, by counting the
> preallocated pagetables that exist for every mapped hugepage. It was
> an arbitrary choice not to count them, and neither way is right or
> wrong, since they are not used but are still allocated.
>
> Reported-by: Dave Jones <davej@...hat.com>
> Reported-by: Hugh Dickins <hughd@...gle.com>
> Signed-off-by: Andrea Arcangeli <aarcange@...hat.com>
> ---
> mm/huge_memory.c | 6 +++---
> 1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 91d3efb..8f7fc39 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -671,6 +671,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> set_pmd_at(mm, haddr, pmd, entry);
> prepare_pmd_huge_pte(pgtable, mm);
> add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> + mm->nr_ptes++;
> spin_unlock(&mm->page_table_lock);
> }
>
> @@ -789,6 +790,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> pmd = pmd_mkold(pmd_wrprotect(pmd));
> set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> prepare_pmd_huge_pte(pgtable, dst_mm);
> + dst_mm->nr_ptes++;
>
> ret = 0;
> out_unlock:
> @@ -887,7 +889,6 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> }
> kfree(pages);
>
> - mm->nr_ptes++;
> smp_wmb(); /* make pte visible before pmd */
> pmd_populate(mm, pmd, pgtable);
> page_remove_rmap(page);
> @@ -1047,6 +1048,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> VM_BUG_ON(page_mapcount(page) < 0);
> add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> VM_BUG_ON(!PageHead(page));
> + tlb->mm->nr_ptes--;
> spin_unlock(&tlb->mm->page_table_lock);
> tlb_remove_page(tlb, page);
> pte_free(tlb->mm, pgtable);
> @@ -1375,7 +1377,6 @@ static int __split_huge_page_map(struct page *page,
> pte_unmap(pte);
> }
>
> - mm->nr_ptes++;
> smp_wmb(); /* make pte visible before pmd */
> /*
> * Up to this point the pmd is present and huge and
> @@ -1988,7 +1989,6 @@ static void collapse_huge_page(struct mm_struct *mm,
> set_pmd_at(mm, address, pmd, _pmd);
> update_mmu_cache(vma, address, _pmd);
> prepare_pmd_huge_pte(pgtable, mm);
> - mm->nr_ptes--;
> spin_unlock(&mm->page_table_lock);
>
> #ifndef CONFIG_NUMA