linux-kernel - Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped


Message-ID: <00d429c9-6ade-42c9-a1f3-a7519375324f@nvidia.com>
Date: Wed, 18 Dec 2024 19:40:29 -0800
From: John Hubbard <jhubbard@...dia.com>
To: Dev Jain <dev.jain@....com>, Ryan Roberts <ryan.roberts@....com>,
	<akpm@...ux-foundation.org>, <david@...hat.com>, <willy@...radead.org>,
	<kirill.shutemov@...ux.intel.com>
CC: <anshuman.khandual@....com>, <catalin.marinas@....com>, <cl@...two.org>,
	<vbabka@...e.cz>, <mhocko@...e.com>, <apopple@...dia.com>,
	<dave.hansen@...ux.intel.com>, <will@...nel.org>, <baohua@...nel.org>,
	<jack@...e.cz>, <srivatsa@...il.mit.edu>, <haowenchao22@...il.com>,
	<hughd@...gle.com>, <aneesh.kumar@...nel.org>, <yang@...amperecomputing.com>,
	<peterx@...hat.com>, <ioworker0@...il.com>, <wangkefeng.wang@...wei.com>,
	<ziy@...dia.com>, <jglisse@...gle.com>, <surenb@...gle.com>,
	<vishal.moola@...il.com>, <zokeefe@...gle.com>, <zhengqi.arch@...edance.com>,
	<21cnbao@...il.com>, <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is
 already mapped

On 12/18/24 1:34 AM, Dev Jain wrote:
> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>> On 16/12/2024 16:51, Dev Jain wrote:
>>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>>> to go ahead with the collapse since some pages will be unmapped, leading to
>>> the entire folio getting unmapped. Therefore, skip the corresponding range.
...
>> It would be good if you can spell out the desired policy when khugepaged hits
>> partially unmapped large folios and unaligned large folios. I think the simple
>> approach is to always collapse them to fully mapped, aligned folios even if the
>> resulting order is smaller than the original. But I'm not sure that's definitely
>> going to always be the best thing.
>>
>> Regardless, I'm struggling to understand the logic in this patch. Taking the
>> order of a folio based on having hit one of its pages says nothing about
>> whether the whole of that folio is mapped, or about its alignment. And it's not
>> clear to me how we would get to a situation where we are scanning for a lower
>> order and find a (fully mapped, aligned) folio of higher order in the first place.
>>
>> Let's assume the desired policy is that khugepaged should always collapse to
>> naturally aligned large folios. If there happens to be an existing aligned
>> order-4 folio that is fully mapped, we will identify that for collapse as part
>> of the scan for order-4. At that point, we should just notice that it is already
>> an aligned order-4 folio and bypass collapse. Of course we may have already
>> chosen to collapse it into a higher order, but we should definitely not get to a
>> lower order before we notice it.
>>
>> Hmm... I guess if the sysfs thp settings have been changed then things could get
>> spicy... if order-8 was previously enabled and we have an order-8 folio, then it
>> gets disabled and khugepaged is scanning for order-4 (which is still enabled)
>> then hits the order-8; what's the expected policy? rework into 2 order-4 folios
>> or leave it as a single order-8?
> 
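The bypass rule Ryan describes above can be modelled as a small predicate. This is only a toy sketch in Python, not kernel code: the `Folio` class, its fields, and `should_bypass_collapse()` are hypothetical names, alignment is checked on a page index rather than a virtual address, and the `skip_larger` flag stands in for the policy question that the thread leaves open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Folio:
    """Hypothetical stand-in for a large folio seen during a scan."""
    order: int         # the folio spans 1 << order base pages
    first_index: int   # page index at which the folio's first page is mapped
    mapped_pages: int  # number of the folio's pages that are currently mapped


def is_naturally_aligned(folio: Folio) -> bool:
    # Naturally aligned: the start index is a multiple of the folio size.
    return folio.first_index % (1 << folio.order) == 0


def is_fully_mapped(folio: Folio) -> bool:
    return folio.mapped_pages == (1 << folio.order)


def should_bypass_collapse(folio: Folio, scan_order: int,
                           skip_larger: bool = True) -> bool:
    """Return True if a scan at scan_order should leave this folio alone.

    Assumed policy: a fully mapped, naturally aligned folio of exactly
    scan_order needs no work. What to do with a larger one is the open
    question in the thread, so it is a parameter here.
    """
    if not (is_naturally_aligned(folio) and is_fully_mapped(folio)):
        return False
    if folio.order == scan_order:
        return True
    if folio.order > scan_order:
        return skip_larger
    return False
```

With `skip_larger=True` the model behaves like the patch under discussion (an existing order-8 folio is left alone while scanning for order-4); with `skip_larger=False` it would be a candidate for rework into smaller folios.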
> Exactly, sorry, I should have made it clear in the patch description that I am
> handling the following scenario: there is a long running system on which we are
> using order-8 folios, and now we decide to downgrade to order-4. Will it be a
> good idea to take the pain of splitting order-8 to 16 order-4 folios? This should
> be a rare situation in the first place, so I have currently decided to ignore the
> folios set up by the previous sysfs setting and only focus on collapsing fresh memory.
> 
> Thinking again, a sys-admin deciding to downgrade the order of folios would do so in
> the hope of reducing internal fragmentation, increasing swap speed, etc., so it makes
> sense to shatter large folios... maybe we can have a sysfs tunable for this?
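
For reference, the arithmetic behind the "16 order-4 folios" figure: an order-n folio covers 2^n base pages, so splitting order-8 into order-4 yields 2^(8-4) pieces. A minimal Python sketch follows; the helper names are hypothetical, and the 4 KiB base page size is an assumption that does not hold on every configuration.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def pages_in_folio(order: int) -> int:
    """Number of base pages covered by a folio of the given order."""
    return 1 << order


def folio_bytes(order: int) -> int:
    """Size in bytes of a folio of the given order, under PAGE_SHIFT."""
    return pages_in_folio(order) << PAGE_SHIFT


def split_count(from_order: int, to_order: int) -> int:
    """How many to_order folios one from_order folio splits into."""
    if to_order > from_order:
        raise ValueError("cannot split into a larger order")
    return 1 << (from_order - to_order)


if __name__ == "__main__":
    print(split_count(8, 4))   # pieces produced by an order-8 to order-4 split
    print(pages_in_folio(8))   # base pages in one order-8 folio
    print(folio_bytes(4))      # bytes in one order-4 folio
```

Under the 4 KiB assumption, an order-8 folio is 1 MiB and each order-4 piece is 64 KiB.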

Maybe we should not support it (at runtime) at all. We are trying to build
systems that don't require incredibly detailed sysadmin involvement, and
this level of tweaking qualifies, thoroughly, as "incredibly detailed
sysadmin micromanagement", imho.

Apologies for not having gone through the series in detail yet, but this
point jumped out at me.

thanks,
-- 
John Hubbard