linux-kernel - Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170213105906.GA16419@node.shutemov.name>
Date:   Mon, 13 Feb 2017 13:59:06 +0300
From:   "Kirill A. Shutemov" <kirill@...temov.name>
To:     Zi Yan <zi.yan@...rutgers.edu>,
        Andrea Arcangeli <aarcange@...hat.com>
Cc:     Minchan Kim <minchan@...nel.org>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, kirill.shutemov@...ux.intel.com,
        akpm@...ux-foundation.org, vbabka@...e.cz,
        mgorman@...hsingularity.net, n-horiguchi@...jp.nec.com,
        khandual@...ux.vnet.ibm.com, Zi Yan <ziy@...dia.com>
Subject: Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in
 zap_pmd_range()

On Sun, Feb 12, 2017 at 06:25:09PM -0600, Zi Yan wrote:
> Hi Kirill,
> 
> >>>> The crash scenario I guess is like:
> >>>> 1. A huge page pmd entry is in the middle of being changed into either a
> >>>> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
> >>>>
> >>>> 2. At the same time, the application frees the vma this page belongs to.
> >>>
> >>> Em... no.
> >>>
> >>> This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
> >>> And we only be able to remove vma under down_write(mmap_sem), so the
> >>> scenario should be excluded.
> >>>
> >>> What do I miss?
> >>
> >> You are right. This problem will not happen in the upstream kernel.
> >>
> >> The problem comes from my customized kernel, where I migrate pages away
> >> instead of reclaiming them when memory is under pressure. I did not take
> >> any mmap_sem when I migrate pages. So I got this error.
> >>
> >> It is a false alarm. Sorry about that. Thanks for clarifying the problem.
> >
> > I think there's still a race between MADV_DONTNEED and
> > change_huge_pmd(.prot_numa=1) resulting in skipping THP by
> > zap_pmd_range(). It need to be addressed.
> >
> > And MADV_FREE requires a fix.
> >
> > So, minus one non-bug, plus two bugs.
> >
> 
> You said a huge page pmd entry needs to be changed under down_read(mmap_sem).
> It is only true for huge pages, right?

mmap_sem is a way to make sure that the VMA will not go away under you.
Besides mmap_sem, anon_vma_lock/i_mmap_lock can be used for this.

> Since in mm/compaction.c, the kernel does not down_read(mmap_sem) during memory
> compaction. Namely, base page migrations do not hold down_read(mmap_sem),
> so in zap_pte_range(), the kernel needs to hold PTE page table locks.
> Am I right about this?
> 
> If yes. IMHO, ultimately, when we need to compact 2MB pages to form 1GB pages,
> in zap_pmd_range(), pmd locks have to be taken to make that kind of compactions
> possible.
> 
> Do you agree?

I *think* we can get away with speculative (without ptl) check in
zap_pmd_range() if we make page fault the only place that can turn
pmd_none() into something else.

It means all other sides that change pmd must not clear it intermittently
during pmd change, unless run under down_write(mmap_sem).

I found two such problematic places in kernel:

 - change_huge_pmd(.prot_numa=1);

 - madvise_free_huge_pmd();

Both clear pmd before setting up a modified version. Both under
down_read(mmap_sem).

The migration path also would need to establish migration pmd atomically
to make this work.

Once all these cases will be fixed, zap_pmd_range() would only be able to
race with page fault if it called from MADV_DONTNEED.
This case is not a problem.

Andrea, does it sound reasonable to you?

-- 
 Kirill A. Shutemov