[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170213105906.GA16419@node.shutemov.name>
Date: Mon, 13 Feb 2017 13:59:06 +0300
From: "Kirill A. Shutemov" <kirill@...temov.name>
To: Zi Yan <zi.yan@...rutgers.edu>,
Andrea Arcangeli <aarcange@...hat.com>
Cc: Minchan Kim <minchan@...nel.org>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, kirill.shutemov@...ux.intel.com,
akpm@...ux-foundation.org, vbabka@...e.cz,
mgorman@...hsingularity.net, n-horiguchi@...jp.nec.com,
khandual@...ux.vnet.ibm.com, Zi Yan <ziy@...dia.com>
Subject: Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in
zap_pmd_range()
On Sun, Feb 12, 2017 at 06:25:09PM -0600, Zi Yan wrote:
> Hi Kirill,
>
> >>>> The crash scenario I guess is like:
> >>>> 1. A huge page pmd entry is in the middle of being changed into either a
> >>>> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
> >>>>
> >>>> 2. At the same time, the application frees the vma this page belongs to.
> >>>
> >>> Em... no.
> >>>
> >>> This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
> >>> And we only be able to remove vma under down_write(mmap_sem), so the
> >>> scenario should be excluded.
> >>>
> >>> What do I miss?
> >>
> >> You are right. This problem will not happen in the upstream kernel.
> >>
> >> The problem comes from my customized kernel, where I migrate pages away
> >> instead of reclaiming them when memory is under pressure. I did not take
> >> any mmap_sem when I migrate pages. So I got this error.
> >>
> >> It is a false alarm. Sorry about that. Thanks for clarifying the problem.
> >
> > I think there's still a race between MADV_DONTNEED and
> > change_huge_pmd(.prot_numa=1) resulting in skipping THP by
> > zap_pmd_range(). It need to be addressed.
> >
> > And MADV_FREE requires a fix.
> >
> > So, minus one non-bug, plus two bugs.
> >
>
> You said a huge page pmd entry needs to be changed under down_read(mmap_sem).
> It is only true for huge pages, right?
mmap_sem is a way to make sure that the VMA will not go away under you.
Besides mmap_sem, anon_vma_lock/i_mmap_lock can be used for this.
> Since in mm/compaction.c, the kernel does not down_read(mmap_sem) during memory
> compaction. Namely, base page migrations do not hold down_read(mmap_sem),
> so in zap_pte_range(), the kernel needs to hold PTE page table locks.
> Am I right about this?
>
> If yes. IMHO, ultimately, when we need to compact 2MB pages to form 1GB pages,
> in zap_pmd_range(), pmd locks have to be taken to make that kind of compactions
> possible.
>
> Do you agree?
I *think* we can get away with speculative (without ptl) check in
zap_pmd_range() if we make page fault the only place that can turn
pmd_none() into something else.
It means all other sides that change pmd must not clear it intermittently
during pmd change, unless run under down_write(mmap_sem).
I found two such problematic places in kernel:
- change_huge_pmd(.prot_numa=1);
- madvise_free_huge_pmd();
Both clear pmd before setting up a modified version. Both under
down_read(mmap_sem).
The migration path also would need to establish migration pmd atomically
to make this work.
Once all these cases will be fixed, zap_pmd_range() would only be able to
race with page fault if it called from MADV_DONTNEED.
This case is not a problem.
Andrea, does it sound reasonable to you?
--
Kirill A. Shutemov
Powered by blists - more mailing lists