linux-kernel - Re: mlockall(MCL_CURRENT) blocking infinitely


Message-ID: <20191106120315.GF16085@quack2.suse.cz>
Date:   Wed, 6 Nov 2019 13:03:15 +0100
From:   Jan Kara <jack@...e.cz>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Vlastimil Babka <vbabka@...e.cz>, snazy@...zy.de,
        Michal Hocko <mhocko@...nel.org>,
        Josef Bacik <josef@...icpanda.com>, Jan Kara <jack@...e.cz>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Randy Dunlap <rdunlap@...radead.org>,
        linux-kernel@...r.kernel.org, Linux MM <linux-mm@...ck.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Potyra, Stefan" <Stefan.Potyra@...ktrobit.com>
Subject: Re: mlockall(MCL_CURRENT) blocking infinitely

On Tue 05-11-19 13:22:11, Johannes Weiner wrote:
> On Tue, Nov 05, 2019 at 04:28:21PM +0100, Vlastimil Babka wrote:
> > On 11/5/19 2:23 PM, Robert Stupp wrote:
> > > "git bisect" led to a result.
> > > 
> > > The offending merge commit is f91f2ee54a21404fbc633550e99d69d14c2478f2
> > > "Merge branch 'akpm' (rest of patches from Andrew)".
> > > 
> > > The first bad commit in the merged series of commits is
> > > https://github.com/torvalds/linux/commit/6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11
> > > . a75d4c33377277b6034dd1e2663bce444f952c14, the commit before 6b4c9f44,
> > > is good.
> > 
> > Ah, great you could bisect this. CCing people from the commit
> > 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
> 
> Judging from Robert's stack captures, the task is not hung but
> busy-looping in __mm_populate(). AFAICS, the only way this can occur
> is if populate_vma_page_range() returns 0 and we don't advance the
> iteration position (if it returned an error, we wouldn't reset nend
> and move on to the next vma as ignore_errors is 1 for mlockall.)
> 
> populate_vma_page_range() returns 0 when the first page is not found
> and faultin_page() returns -EBUSY (if it were processing pages, or if
> the error from faultin_page() would be a different one, we would
> return the number of pages processed or -error).
> 
> faultin_page() returns -EBUSY when VM_FAULT_RETRY is set, i.e. we
> dropped the mmap_sem in order to initiate IO and require a retry. That
> is consistent with the bisect result (new VM_FAULT_RETRY conditions).
> 
> At this point, regular page fault would retry with FAULT_FLAG_TRIED to
> indicate that the mmap_sem cannot be dropped a second time. But this
> mlock path doesn't set that flag and we can loop repeatedly. That is
> something we probably need to fix with a FOLL_TRIED somewhere.

It seems we could use __get_user_pages_locked() for that in
populate_vma_page_range() if we were guaranteed that the mm stays alive.
This is guaranteed for the current->mm cases, but there seem to be some
callers of populate_vma_page_range() where the mm could indeed go away once
we drop mmap_sem. Luckily these pass NULL for the 'nonblocking' parameter,
so all call sites seem to be fine, but it would be fragile...
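
To make the loop Johannes describes concrete, here is a toy user-space
model of it. This is not kernel code: every toy_* name and TOY_FLAG_TRIED
are made up for illustration, and the iteration cap exists only so that the
model terminates. It shows an outer populate loop that never advances when
the first page keeps returning -EBUSY, and how retrying once with a "tried"
flag lets it make progress.

```c
/* Toy user-space model of the populate loop. NOT kernel code;
 * all toy_* names and TOY_FLAG_TRIED are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_FLAG_TRIED 0x1u

/* Models a fault that drops the lock for IO and asks for a retry,
 * unless the caller says this already is the retry. */
static int toy_faultin_page(unsigned int flags)
{
	if (!(flags & TOY_FLAG_TRIED))
		return -EBUSY;	/* lock dropped, caller must retry */
	return 0;		/* retry completes with the lock held */
}

/* Returns the number of pages processed; 0 if the first page hit -EBUSY. */
static long toy_populate_range(long nr_pages, unsigned int flags)
{
	long done = 0;

	while (done < nr_pages) {
		if (toy_faultin_page(flags) == -EBUSY)
			break;
		done++;
	}
	return done;
}

/* Outer loop: returns the number of iterations needed, or -1 if
 * max_iters passes were used up without finishing. */
static long toy_populate(long nr_pages, bool retry_with_tried, long max_iters)
{
	long pos = 0, iters = 0;
	unsigned int flags = 0;

	assert(max_iters > 0);
	while (pos < nr_pages) {
		long ret;

		if (iters++ >= max_iters)
			return -1;	/* the uncapped loop would spin here */
		ret = toy_populate_range(nr_pages - pos, flags);
		if (ret == 0) {
			/* Position is not advanced. Without the flag the
			 * next pass is identical to this one. */
			if (retry_with_tried)
				flags |= TOY_FLAG_TRIED;
			continue;
		}
		pos += ret;
		flags = 0;
	}
	return iters;
}
```

In this model toy_populate(4, false, 1000) exhausts its iteration budget,
while toy_populate(4, true, 1000) finishes on the second pass.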

> What I don't quite understand yet is why the fault path doesn't make
> progress eventually. We must drop the mmap_sem without changing the
> state in any way. How can we keep looping on the same page?

That may be a slight suboptimality in Josef's patches. If the page is
marked as PageReadahead, do_async_mmap_readahead() always drops mmap_sem if
it can and starts readahead, without checking whether that makes sense.
OTOH page_cache_async_readahead() then clears PageReadahead, so the only
way I can see for us to loop like this is when file->ra->ra_pages is 0.
Not sure if that's what's happening though. We'd need to find which of the
paths in filemap_fault() calls maybe_unlock_mmap_for_io() to tell more.
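
The suspected readahead case can be sketched the same way. Again this is a
toy user-space model with made-up toy_* names, and it simply encodes the
hypothesis above as an assumption: that with ra_pages == 0 the readahead
flag on the page is never cleared, so every fault attempt drops the lock
and asks for a retry.

```c
/* Toy user-space model of the readahead hypothesis. NOT kernel code;
 * all toy_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

struct toy_page { bool readahead; };
struct toy_ra { unsigned int ra_pages; };

/* Assumption under test: with readahead disabled (ra_pages == 0) we
 * return early and the page keeps its readahead flag. */
static void toy_async_readahead(struct toy_ra *ra, struct toy_page *page)
{
	if (ra->ra_pages == 0)
		return;
	page->readahead = false;
}

/* Returns true if the fault dropped the lock and must be retried. */
static bool toy_fault(struct toy_ra *ra, struct toy_page *page)
{
	if (page->readahead) {
		toy_async_readahead(ra, page);
		return true;
	}
	return false;
}

/* Returns the attempt on which the fault completes, or -1 if it still
 * needs a retry after max_attempts tries. */
static int toy_fault_attempts(unsigned int ra_pages, int max_attempts)
{
	struct toy_ra ra = { .ra_pages = ra_pages };
	struct toy_page page = { .readahead = true };

	assert(max_attempts > 0);
	for (int i = 1; i <= max_attempts; i++) {
		if (!toy_fault(&ra, &page))
			return i;
	}
	return -1;
}
```

With a non-zero ra_pages the flag is cleared on the first attempt and the
second one completes; with ra_pages == 0 the model never stops asking for a
retry, which would match the busy loop if that is indeed the path taken.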

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR