lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191018073310.GB5017@dhcp22.suse.cz>
Date:   Fri, 18 Oct 2019 09:33:10 +0200
From:   Michal Hocko <mhocko@...nel.org>
To:     Naoya Horiguchi <n-horiguchi@...jp.nec.com>
Cc:     Qian Cai <cai@....pw>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        David Hildenbrand <david@...hat.com>,
        Mike Kravetz <mike.kravetz@...cle.com>
Subject: Re: memory offline infinite loop after soft offline

On Fri 18-10-19 06:32:22, Naoya Horiguchi wrote:
> On Fri, Oct 18, 2019 at 08:06:35AM +0200, Michal Hocko wrote:
> > On Fri 18-10-19 02:19:06, Naoya Horiguchi wrote:
> > > On Thu, Oct 17, 2019 at 08:27:59PM +0200, Michal Hocko wrote:
> > > > On Thu 17-10-19 14:07:13, Qian Cai wrote:
> > > > > On Thu, 2019-10-17 at 12:01 +0200, Michal Hocko wrote:
> > > > > > On Thu 17-10-19 09:34:10, Naoya Horiguchi wrote:
> > > > > > > On Mon, Oct 14, 2019 at 10:39:14AM +0200, Michal Hocko wrote:
> > > > > > 
> > > > > > [...]
> > > > > > > > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > > > > > > > index 89c19c0feadb..5fb3fee16fde 100644
> > > > > > > > --- a/mm/page_isolation.c
> > > > > > > > +++ b/mm/page_isolation.c
> > > > > > > > @@ -274,7 +274,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
> > > > > > > >  			 * simple way to verify that as VM_BUG_ON(), though.
> > > > > > > >  			 */
> > > > > > > >  			pfn += 1 << page_order(page);
> > > > > > > > -		else if (skip_hwpoisoned_pages && PageHWPoison(page))
> > > > > > > > +		else if (skip_hwpoisoned_pages && PageHWPoison(compound_head(page)))
> > > > > > > >  			/* A HWPoisoned page cannot be also PageBuddy */
> > > > > > > >  			pfn++;
> > > > > > > >  		else
> > > > > > > 
> > > > > > > This fix looks good to me. The original code only addresses hwpoisoned 4kB-page,
> > > > > > > we seem to have this issue since the following commit,
> > > > > > 
> > > > > > Thanks a lot for double checking Naoya!
> > > > > >  
> > > > > > >   commit b023f46813cde6e3b8a8c24f432ff9c1fd8e9a64
> > > > > > >   Author: Wen Congyang <wency@...fujitsu.com>
> > > > > > >   Date:   Tue Dec 11 16:00:45 2012 -0800
> > > > > > >   
> > > > > > >       memory-hotplug: skip HWPoisoned page when offlining pages
> > > > > > > 
> > > > > > > and extension of LTP coverage finally discovered this.
> > > > > > 
> > > > > > Qian, could you give the patch some testing?
> > > > > 
> > > > > Unfortunately, this does not solve the problem. It looks to me that in
> > > > > soft_offline_huge_page(), set_hwpoison_free_buddy_page() will only set
> > > > > PG_hwpoison for buddy pages, so the even the compound_head() has no PG_hwpoison
> > > > > set.
> > > > > 
> > > > > 		if (PageBuddy(page_head) && page_order(page_head) >= order) {
> > > > > 			if (!TestSetPageHWPoison(page))
> > > > > 				hwpoisoned = true;
> > > > 
> > > > This is more than unexpected. How are we supposed to find out that the
> > > > page is poisoned? Any idea Naoya?
> > > 
> > > # sorry for my poor review...
> > > 
> > > We set PG_hwpoison bit only on the head page for hugetlb, that's because
> > > we handle multiple pages as a single one for hugetlb. So it's enough
> > > to check isolation only on the head page.  Simply skipping pfn cursor to
> > > the page after the hugepage should avoid the infinite loop:
> > 
> > But the page dump Qian provided shows that the head page doesn't have
> > HWPoison bit either. If it had then going pfn at a time should just work
> > because all tail pages would be skipped. Or do I miss something?
> 
> You're right, then I don't see how this happens.

OK, this is a bit relieving. I thought that there are legitimate cases
when none of the hugetlb gets the HWPoison bit (e.g. when the page has 0
reference count which is the case here). That would be utterly broken
because we would have no way to tell the page is hwpoisoned.

Anyway, do you think the patch as I've posted makes sense regardless
another potential problem? Or would you like to resend yours which skips
over tail pages at once?
-- 
Michal Hocko
SUSE Labs

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ