Message-ID: <20191021031641.GA8007@hori.linux.bs1.fc.nec.co.jp>
Date:   Mon, 21 Oct 2019 03:16:41 +0000
From:   Naoya Horiguchi <n-horiguchi@...jp.nec.com>
To:     Qian Cai <cai@....pw>
CC:     Michal Hocko <mhocko@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        David Hildenbrand <david@...hat.com>,
        Mike Kravetz <mike.kravetz@...cle.com>
Subject: Re: memory offline infinite loop after soft offline

On Fri, Oct 18, 2019 at 07:56:09AM -0400, Qian Cai wrote:
> 
> 
>     On Oct 18, 2019, at 2:35 AM, Naoya Horiguchi <n-horiguchi@...jp.nec.com>
>     wrote:
> 
> 
>     You're right, then I don't see how this happens. If the error hugepage was
>     isolated without having PG_hwpoison set, it's unexpected and problematic.
>     I'm testing myself with v5.4-rc2 (simply ran move_pages12 and did
>     hotremove/hotadd) but don't reproduce the issue yet.  Do we need specific
>     kernel version/config to trigger this?
> 
> 
> This is reproducible on linux-next with the config. Not sure if it is
> reproducible on x86.
> 
> https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
> 
> and kernel cmdline if that matters
> 
> page_poison=on page_owner=on numa_balancing=enable \
> systemd.unified_cgroup_hierarchy=1 debug_guardpage_minorder=1 \
> page_alloc.shuffle=1

Thanks for the info.

> 
> BTW, where does the code set PG_hwpoison for the head page?

Precisely speaking, soft offline only sets PG_hwpoison after the target
hugepage is successfully dissolved (then it's not a hugepage any more),
so PG_hwpoison is set on the raw page in set_hwpoison_free_buddy_page().

In the move_pages12 case, madvise(MADV_SOFT_OFFLINE) is called for a range
of 2 hugepages, so the expected result is that the pages at offsets 0 and
512 are marked with PG_hwpoison after injection.
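
To make the expected result concrete, here is a minimal sketch that lists
the page offsets expected to carry PG_hwpoison when the soft-offline range
covers N consecutive hugepages. The function name is hypothetical, and the
512 subpages per hugepage is simply taken from the offsets mentioned above;
other page size configurations would give a different stride.

```python
def expected_hwpoison_offsets(nr_hugepages, pages_per_hugepage=512):
    """Offsets (in base pages) of the first page of each hugepage in the range.

    Assumption: one raw error page per hugepage, located at its first subpage.
    """
    return [i * pages_per_hugepage for i in range(nr_hugepages)]


# Range of 2 hugepages, as in the move_pages12 case described above.
print(expected_hwpoison_offsets(2))
```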

Looking at your dump_page() output, the end_pfn is at page offset 1
("page:c00c000800458040" likely points to pfn 0x11601). That page
belongs to a high-order free buddy page, but it has neither PageBuddy
nor PageHWPoison set because it is neither the head page nor the raw
error page.
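
The "page offset 1" reading can be sketched with a little arithmetic on the
struct page address from the dump. This is only an illustration under
assumptions not stated in the thread: that sizeof(struct page) is 64 bytes,
and that the vmemmap base is aligned so that the low bits of the struct page
index match the low bits of the pfn. The function name is hypothetical.

```python
def offset_in_block(page_addr, struct_page_size=64, pages_per_block=512):
    """Offset of a page within its naturally aligned block of pages.

    Assumptions: struct pages are laid out contiguously, each is
    struct_page_size bytes, and the array base is aligned to
    struct_page_size * pages_per_block bytes.
    """
    index = page_addr // struct_page_size  # index into the struct page array
    return index % pages_per_block         # position inside the aligned block


# Address taken from the dump_page() line quoted above.
print(offset_in_block(0xc00c000800458040))
```

Under those assumptions the result is 1, i.e. the page right after the head
of the block, which matches the reading above.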

> Unfortunately, this does not solve the problem. It looks to me that in
> soft_offline_huge_page(), set_hwpoison_free_buddy_page() will only set
> PG_hwpoison for buddy pages, so even the compound_head() has no PG_hwpoison
> set.

Your analysis is totally correct, and this behavior will be fixed by
the change (https://lkml.org/lkml/2019/10/17/551) in Oscar's rework.
The raw error page will be taken off the buddy system and the other
subpages will be properly split into lower-order pages (we'll properly
manage PageBuddy flags). So all possible cases would be covered by the
branches in __test_page_isolated_in_pageblock().
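
The splitting idea can be illustrated with a toy model. This is not the
kernel code and not necessarily how the rework implements it; it is a
sketch of the general buddy-splitting technique, with a hypothetical
function name, assuming the free block is naturally aligned to its order.
Each step halves the block, keeps the half without the error page as a
free lower-order block, and continues into the other half.

```python
def split_around(start_pfn, order, bad_pfn):
    """Split a free block of 2**order pages so that bad_pfn is left out.

    Returns the list of (pfn, order) free blocks that remain, in the order
    they are carved off. Assumes start_pfn is aligned to 2**order and
    bad_pfn lies inside the block.
    """
    if not (start_pfn <= bad_pfn < start_pfn + (1 << order)):
        raise ValueError("bad_pfn is outside the block")

    free_blocks = []
    while order > 0:
        order -= 1
        half = 1 << order
        upper = start_pfn + half
        if bad_pfn >= upper:
            # Error page is in the upper half: the lower half stays free.
            free_blocks.append((start_pfn, order))
            start_pfn = upper
        else:
            # Error page is in the lower half: the upper half stays free.
            free_blocks.append((upper, order))
    return free_blocks


# Order-3 block (8 pages) starting at pfn 0, error page at pfn 5.
print(split_around(0, 3, 5))
```

In this model every remaining block is a well-formed lower-order free
block, and only the single error page is excluded, so 2**order - 1 pages
stay available.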

Thanks,
Naoya Horiguchi
