linux-kernel - Re: [PATCH] mm, thp: relax migration wait when failed to get tail page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.LSU.2.11.2106020831590.6388@eggly.anvils>
Date:   Wed, 2 Jun 2021 08:57:25 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Yu Xu <xuyu@...ux.alibaba.com>
cc:     Hugh Dickins <hughd@...gle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
        gavin.dg@...ux.alibaba.com, Greg Thelen <gthelen@...gle.com>,
        Wei Xu <weixugc@...gle.com>,
        Matthew Wilcox <willy@...radead.org>,
        Nicholas Piggin <npiggin@...il.com>,
        Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH] mm, thp: relax migration wait when failed to get tail
 page

On Wed, 2 Jun 2021, Yu Xu wrote:
> On 6/2/21 12:55 AM, Hugh Dickins wrote:
> > On Wed, 2 Jun 2021, Xu Yu wrote:
> > 
> > > We notice that hung task happens in a conner but practical scenario when
> > > CONFIG_PREEMPT_NONE is enabled, as follows.
> > > 
> > > Process 0                       Process 1                     Process
> > > 2..Inf
> > > split_huge_page_to_list
> > >      unmap_page
> > >          split_huge_pmd_address
> > >                                  __migration_entry_wait(head)
> > >                                                                __migration_entry_wait(tail)
> > >      remap_page (roll back)
> > >          remove_migration_ptes
> > >              rmap_walk_anon
> > >                  cond_resched
> > > 
> > > Where __migration_entry_wait(tail) is occurred in kernel space, e.g.,
> > > copy_to_user, which will immediately fault again without rescheduling,
> > > and thus occupy the cpu fully.
> > > 
> > > When there are too many processes performing __migration_entry_wait on
> > > tail page, remap_page will never be done after cond_resched.
> > > 
> > > This relaxes __migration_entry_wait on tail page, thus gives remap_page
> > > a chance to complete.
> > > 
> > > Signed-off-by: Gang Deng <gavin.dg@...ux.alibaba.com>
> > > Signed-off-by: Xu Yu <xuyu@...ux.alibaba.com>
> > 
> > Well caught: you're absolutely right that there's a bug there.
> > But isn't cond_resched() just papering over the real bug, and
> > what it should do is a "page = compound_head(page);" before the
> > get_page_unless_zero()? How does that work out in your testing?
> 
> compound_head works. The patched kernel is alive for hours under
> our reproducer, which usually makes the vanilla kernel hung after
> tens of minutes at most.

Oh, that's good news, thanks.

(It's still likely that a well-placed cond_resched() somewhere in
mm/gup.c would also be a good idea, but none of us have yet got
around to identifying where.)

> 
> If we use compound_head, the behavior of __migration_entry_wait(tail)
> changes from "retry fault" to "prevent THP from being split". Is that
> right?  Then which is preferred? If it were me, I would prefer "retry
> fault".

As Matthew remarked, you are asking very good questions, and split
migration entries are difficult to think about.  But I believe you'll
find it works out okay.

The point of *put_and_* wait_on_page_locked() is that it does drop
the page reference you acquired with get_page_unless_zero, as soon
as the page is on the wait queue, before actually waiting.

So splitting the THP is only prevented for a brief interval.  Now,
it's true that if there are very many tasks faulting on portions
of the huge page, in that interval between inserting the migration
entries and freezing the huge page's refcount to 0, they can reduce
the chance of splitting considerably.  But that's not an excuse for 
for doing get_page_unless_zero() on the wrong thing, as it was doing.

Hugh