[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.LNX.2.00.1310151029040.12481@eggly.anvils>
Date: Tue, 15 Oct 2013 10:53:10 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
cc: Andrea Arcangeli <aarcange@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Rientjes <rientjes@...gle.com>,
Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: mm: fix BUG in __split_huge_page_pmd
On Tue, 15 Oct 2013, Kirill A. Shutemov wrote:
> Andrea Arcangeli wrote:
> > Hi Hugh,
> >
> > On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote:
> > > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
> > > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
> > >
> > > It's invalid: we don't always have down_write of mmap_sem there:
> > > a racing do_huge_pmd_wp_page() might have copied-on-write to another
> > > huge page before our split_huge_page() got the anon_vma lock.
> > >
> >
> > I don't get exactly the scenario with do_huge_pmd_wp_page(), could you
> > elaborate?
>
> I think the scenario is follow:
>
> CPU0: CPU1
>
> __split_huge_page_pmd()
> page = pmd_page(*pmd);
> do_huge_pmd_wp_page() copy the
> page and changes pmd (the same as on CPU0)
> to point to newly copied page.
> split_huge_page(page)
> where page is original page,
> not allocated on COW.
> pmd still points on huge page.
>
>
> Hugh, have I got it correctly?
Yes, that's correct, that's what I've been assuming the race is.
With CPU0 split_huge_page_pmd() being called from zap_pmd_range()
in the service of madvise(,,MADV_DONTNEED).
I don't have the stacktrace to hand: could perfectly well dig it out
in an hour or two, but honestly, it adds nothing more to the picture.
I have no trace of the CPU1 side of things, and have merely surmised
that it was doing a COW.
As to whether the MADV_DONTNEED down_read optimization is important:
I don't recall, might be able to discover justification in old mail,
0a27a14a6292 doesn't actually say; but in general we're much better
off using down_read than down_write where it's safe to do so.
But more importantly, MADV_DONTNEED down_read across zap_page_range
is building on the fact that file invalidation/truncation can already
call zap_page_range without touching mmap_sem at all: not a problem
for traditional anon-only THP, but something you'll have had to worry
about for THPageCache.
I'm afraid Andrea's mail about concurrent madvises gives me far more
to think about than I have time for: seems to get into problems he
knows a lot about but I'm unfamiliar with. If this patch looks good
for now on its own, let's put it in; but no problem if you guys prefer
to wait for a fuller solution of more problems, we can ride with this
one internally for the moment.
And I should admit that the crash has occurred too rarely for us yet
to be able to judge whether this patch actually fixes it in practice.
Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists