Message-ID: <ed2c4952-f3e-688d-40c1-53afebe5c7cb@google.com>
Date:   Wed, 9 Feb 2022 14:28:08 -0800 (PST)
From:   Hugh Dickins <hughd@...gle.com>
To:     Vlastimil Babka <vbabka@...e.cz>
cc:     Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...e.com>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Matthew Wilcox <willy@...radead.org>,
        David Hildenbrand <david@...hat.com>,
        Alistair Popple <apopple@...dia.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Rik van Riel <riel@...riel.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Yu Zhao <yuzhao@...gle.com>, Greg Thelen <gthelen@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH 01/13] mm/munlock: delete page_mlock() and all its
 works

On Wed, 9 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:30, Hugh Dickins wrote:
> 
> While I understand the reasons to clear the ground first, I wonder what
> the implications are for bisectability - is there a risk of surprising
> failures?  Maybe we should at least explicitly spell out the implications
> here?  IIUC, pages that once become mlocked will stay mlocked, inflating
> the Mlocked meminfo counter and preventing their reclaim.  But if e.g. a
> process that did mlockall() exits, its exclusive pages will be freed
> anyway, so it's not a catastrophic kind of leak, right?

Thanks for taking a look, Vlastimil.  You make a good point here.

I had satisfied myself that no stage of the series was going to introduce
boot failures or BUGs; and if someone is bisecting some mlock/munlock
misbehaviour, I would not worry about which commit of the series they
alight on, but would root-cause it with all the patches in mind.

But we certainly wouldn't want the series split up into separately
submitted parts (that is, split anywhere between 01/13 and 07/13:
splitting the rest apart wouldn't matter much); and it would be
unfortunate if someone were bisecting some reclaim failure OOM problem
elsewhere, and their test app happened to be using mlock, and their
bisection landed partway between 01 and 07 here - with the
not-yet-reimplemented munlock confusing the OOM bisection.

The worst of it would be, I think, landing between 05 and 07: where
your mlockall could mlock a variety of shared libraries etc., moving
all their pages to the unevictable list, and those pagecache pages not getting
moved back to evictable when unmapped.  I forget the current shrinker
situation, whether inode pressure could evict that pagecache or not.

Two mitigations come to mind; let me think on it some more (and hope
nobody's bisecting OOMs meanwhile): one might be to shift 05 (the one
which replaces clear_page_mlock() on last unmap by clearance when
freeing) later in the series - its position was always somewhat
arbitrary, but that position is where it's been tested; another might
be to put nothing at all on the unevictable list in between 01 and 07.

Though taking this apart and putting it back together brings its own
dangers.  That second suggestion probably won't fly very well, with
06/13 using mlock_count only while the page is on the unevictable list.
I'm not sure
how much rethinking the bisection possibility deserves.

> Yet it differs from the existing "failure modes" where pages would be left
> as "stranded" due to failure of being isolated, because they would at least
> go through TestClearPageMlocked and counters update.
> 
> >  
> >  /*
> > @@ -413,75 +136,11 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
> >   *
> >   * Returns with VM_LOCKED cleared.  Callers must be prepared to
> >   * deal with this.
> > - *
> > - * We don't save and restore VM_LOCKED here because pages are
> > - * still on lru.  In unmap path, pages might be scanned by reclaim
> > - * and re-mlocked by page_mlock/try_to_unmap before we unmap and
> > - * free them.  This will result in freeing mlocked pages.
> >   */
> >  void munlock_vma_pages_range(struct vm_area_struct *vma,
> >  			     unsigned long start, unsigned long end)
> >  {
> > -	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> 
> Should we at least keep doing the flags clearing? I haven't checked if
> there are some VM_BUG_ONs that would trip on the flags not being cleared,
> but I wouldn't be entirely surprised.

There are two flags in question here, VM_LOCKED and VM_LOCKONFAULT:
I'm not sure which of them you're particularly concerned about.

As to VM_LOCKED: yes, you're right, at this stage of the series the
munlock really ought to be clearing VM_LOCKED (even while it doesn't
go on to do anything about the pages), as it claims in the comment above.
I removed this line at a later stage (07/13), when changing it to
mlock_vma_pages_range() serving both mlock and munlock according to
whether VM_LOCKED is provided - and mistakenly folded back that deletion
to this patch.  End result the same, but better to restore that maskout
in this patch, as you suggest.
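
Concretely, restoring it would just mean keeping that one line in
01/13's munlock_vma_pages_range() - a minimal sketch only (whatever
else is left of the body at this stage of the series is elided here):

void munlock_vma_pages_range(struct vm_area_struct *vma,
			     unsigned long start, unsigned long end)
{
	/*
	 * Keep clearing VM_LOCKED (and VM_LOCKONFAULT), as the comment
	 * above the function promises, even though at this stage of
	 * the series nothing is yet done about the pages themselves.
	 */
	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
}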

As to VM_LOCKONFAULT: I had checked the rest of mm/mlock.c, and the
rest of the tree, and it only ever reached here along with VM_LOCKED;
so when in 07/13 I switched over to "vma->vm_flags = newflags" (or
WRITE_ONCE equivalent), I just didn't see the need to mask it out in
the munlocking case; but could add a VM_BUG_ON that newflags never
has it without VM_LOCKED, if you like. 
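
Something like this sketch, at the point in 07/13 where newflags gets
applied (exact placement illustrative, not taken from the patch):

	/*
	 * munlock passes newflags with VM_LOCKED cleared; VM_LOCKONFAULT
	 * is never expected to arrive here on its own.
	 */
	VM_BUG_ON((newflags & VM_LOCKONFAULT) && !(newflags & VM_LOCKED));
	WRITE_ONCE(vma->vm_flags, newflags);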

(You'll say VM_WARN_ON_ONCE, I'll say VM_BUG_ON because it never happens,
then as soon as I put it in and run LTP or kselftests, I'll be ashamed
to discover I've got it wrong, perhaps.)

Hugh
