linux-kernel - Re: [PATCH] mm: avoid blocking lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200128060304.GA6615@bombadil.infradead.org>
Date:   Mon, 27 Jan 2020 22:03:04 -0800
From:   Matthew Wilcox <willy@...radead.org>
To:     Yang Shi <shy828301@...il.com>
Cc:     Michal Hocko <mhocko@...nel.org>,
        Cong Wang <xiyou.wangcong@...il.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        linux-mm <linux-mm@...ck.org>, Mel Gorman <mgorman@...e.de>,
        Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH] mm: avoid blocking lock_page() in kcompactd

On Mon, Jan 27, 2020 at 05:25:13PM -0800, Yang Shi wrote:
> On Mon, Jan 27, 2020 at 11:06 AM Matthew Wilcox <willy@...radead.org> wrote:
> >
> > On Mon, Jan 27, 2020 at 04:00:24PM +0100, Michal Hocko wrote:
> > > On Sun 26-01-20 15:39:35, Matthew Wilcox wrote:
> > > > On Sun, Jan 26, 2020 at 11:53:55AM -0800, Cong Wang wrote:
> > > > > I suspect the process gets stuck in the retry loop in try_charge(), as
> > > > > the _shortest_ stacktrace of the perf samples indicated:
> > > > >
> > > > > cycles:ppp:
> > > > >         ffffffffa72963db mem_cgroup_iter
> > > > >         ffffffffa72980ca mem_cgroup_oom_unlock
> > > > >         ffffffffa7298c15 try_charge
> > > > >         ffffffffa729a886 mem_cgroup_try_charge
> > > > >         ffffffffa720ec03 __add_to_page_cache_locked
> > > > >         ffffffffa720ee3a add_to_page_cache_lru
> > > > >         ffffffffa7312ddb iomap_readpages_actor
> > > > >         ffffffffa73133f7 iomap_apply
> > > > >         ffffffffa73135da iomap_readpages
> > > > >         ffffffffa722062e read_pages
> > > > >         ffffffffa7220b3f __do_page_cache_readahead
> > > > >         ffffffffa7210554 filemap_fault
> > > > >         ffffffffc039e41f __xfs_filemap_fault
> > > > >         ffffffffa724f5e7 __do_fault
> > > > >         ffffffffa724c5f2 __handle_mm_fault
> > > > >         ffffffffa724cbc6 handle_mm_fault
> > > > >         ffffffffa70a313e __do_page_fault
> > > > >         ffffffffa7a00dfe page_fault
> > > > >
> > > > > But I don't see how it could be, the only possible case is when
> > > > > mem_cgroup_oom() returns OOM_SUCCESS. However I can't
> > > > > find any clue in dmesg pointing to OOM. These processes in the
> > > > > same memcg are either running or sleeping (that is not exiting or
> > > > > coredump'ing), I don't see how and why they could be selected as
> > > > > a victim of OOM killer. I don't see any signal pending either from
> > > > > their /proc/X/status.
> > > >
> > > > I think this is a situation where we might end up with a genuine deadlock
> > > > if we're not trylocking the pages.  readahead allocates a batch of
> > > > locked pages and adds them to the pagecache.  If it has allocated,
> > > > say, 5 pages, successfully inserted the first three into i_pages, then
> > > > needs to allocate memory to insert the fourth one into i_pages, and
> > > > the process then attempts to migrate the pages which are still locked,
> > > > they will never come unlocked because they haven't yet been submitted
> > > > to the filesystem for reading.
> > >
> > > Just to make sure I understand. Do you mean this?
> > > lock_page(A)
> > > alloc_pages
> > >   try_to_compact_pages
> > >     compact_zone_order
> > >       compact_zone(MIGRATE_SYNC_LIGHT)
> > >         migrate_pages
> > >         unmap_and_move
> > >           __unmap_and_move
> > >             lock_page(A)
> >
> > Yes.  There's a little more to it than that, eg slab is involved, but
> > you have it in a nutshell.
> 
> But, how compact could get blocked for readahead page if it is not on LRU?
> 
> The page is charged before adding to LRU, so if kernel just retry
> charge or reclaim forever, the page should be not on LRU, so it should
> not block compaction.

The five pages are allocated ABCDE, then they are added one at a time to
both the LRU list and i_pages.  Once ABCDE have been added to the page
cache, they are submitted to the filesystem in a batch.  The sceanrio
here is that once ABC have been added, we need to allocate memory to
add D.  The page migration code then attempts to migrate A.  Deadlock;
it will never come unlocked.

This kind of problem can occur in any filesystem in either readpages or
readpage.  Once the page is locked and on the LRU list, if the filesystem
attempts to allocate memory (and many do), this kind of deadlock can
occur.  It's avoided if the filesystem uses a mempool to avoid memory
allocation in this path, but they certainly don't all do that.

This specific deadlock can be avoided if we skip !PageUptodate pages.
But I don't know what other situations there are where we allocate memory
while holding a page locked that is on the LRU list.