linux-kernel - Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <57DC1CE0.5070400@linux.intel.com>
Date:   Fri, 16 Sep 2016 09:25:04 -0700
From:   Dave Hansen <dave.hansen@...ux.intel.com>
To:     Rui Teng <rui.teng@...ux.vnet.ibm.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
        Michal Hocko <mhocko@...e.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        "Aneesh Kumar K . V" <aneesh.kumar@...ux.vnet.ibm.com>,
        Paul Gortmaker <paul.gortmaker@...driver.com>,
        Santhosh G <santhog4@...ibm.com>
Subject: Re: [PATCH] memory-hotplug: Fix bad area access on
 dissolve_free_huge_pages()

On 09/16/2016 06:58 AM, Rui Teng wrote:
> On 9/15/16 12:37 AM, Dave Hansen wrote:
>> On 09/14/2016 09:33 AM, Rui Teng wrote:
>> But, as far as describing the initial problem, can you explain how the
>> tail pages still ended up being PageHuge()?  Seems like dissolving the
>> huge page should have cleared that.
>>
> I use the scripts of tools/testing/selftests/memory-hotplug/mem-on-
> off-test.sh to test and reproduce this bug. And I printed the pfn range
> on dissolve_free_huge_pages(). The sizes of the pfn range are always
> 4096, and the ranges are separated.
> [   72.362427] start_pfn: 204800, end_pfn: 208896
> [   72.371677] start_pfn: 2162688, end_pfn: 2166784
> [   72.373945] start_pfn: 217088, end_pfn: 221184
> [   72.383218] start_pfn: 2170880, end_pfn: 2174976
> [   72.385918] start_pfn: 2306048, end_pfn: 2310144
> [   72.388254] start_pfn: 2326528, end_pfn: 2330624
> 
> Sometimes, it will report a failure:
> [   72.371690] memory offlining [mem 0x2100000000-0x210fffffff] failed
> 
> And sometimes, it will report following:
> [   72.373956] Offlined Pages 4096
> 
> Whether the start_pfn and end_pfn of dissolve_free_huge_pages could be
> *random*? If so, the range may not include any page head and start from
> tail page, right?

That's an interesting data point, but it still doesn't quite explain
what is going on.

It seems like there might be parts of gigantic pages that have
PageHuge() set on tail pages, while other parts don't.  If that's true,
we have another bug and your patch just papers over the issue.

I think you really need to find the root cause before we apply this patch.