linux-kernel - Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is already mapped

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aa647830-cf55-48f0-98c2-8230796e35b3@arm.com>
Date: Fri, 20 Dec 2024 11:57:20 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: Dev Jain <dev.jain@....com>, John Hubbard <jhubbard@...dia.com>,
 akpm@...ux-foundation.org, david@...hat.com, willy@...radead.org,
 kirill.shutemov@...ux.intel.com
Cc: anshuman.khandual@....com, catalin.marinas@....com, cl@...two.org,
 vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
 dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
 jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
 hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
 peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
 ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
 vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
 21cnbao@...il.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 10/12] khugepaged: Skip PTE range if a larger mTHP is
 already mapped

On 19/12/2024 08:07, Dev Jain wrote:
> 
> On 19/12/24 1:29 pm, Dev Jain wrote:
>>
>> On 19/12/24 9:10 am, John Hubbard wrote:
>>> On 12/18/24 1:34 AM, Dev Jain wrote:
>>>> On 18/12/24 1:06 pm, Ryan Roberts wrote:
>>>>> On 16/12/2024 16:51, Dev Jain wrote:
>>>>>> We may hit a situation wherein we have a larger folio mapped. It is incorrect
>>>>>> to go ahead with the collapse since some pages will be unmapped, leading to
>>>>>> the entire folio getting unmapped. Therefore, skip the corresponding range.
>>> ...
>>>>> It would be good if you can spell out the desired policy when khugepaged hits
>>>>> partially unmapped large folios and unaligned large folios. I think the simple
>>>>> approach is to always collapse them to fully mapped, aligned folios even if
>>>>> the
>>>>> resulting order is smaller than the original. But I'm not sure that's
>>>>> definitely
>>>>> going to always be the best thing.
>>>>>
>>>>> Regardless, I'm struggling to understand the logic in this patch. Taking the
>>>>> order of a folio based on having hit one of it's pages says anything about
>>>>> whether the whole of that folio is mapped or not or it's alignment. And
>>>>> it's not
>>>>> clear to me how we would get to a situation where we are scanning for a lower
>>>>> order and find a (fully mapped, aligned) folio of higher order in the first
>>>>> place.
>>>>>
>>>>> Let's assume the desired policy is that khugepaged should always collapse to
>>>>> naturally aligned large folios. If there happens to be an existing aligned
>>>>> order-4 folio that is fully mapped, we will identify that for collapse as part
>>>>> of the scan for order-4. At that point, we should just notice that it is
>>>>> already
>>>>> an aligned order-4 folio and bypass collapse. Of course we may have already
>>>>> chosen to collapse it into a higher order, but we should definitely not get
>>>>> to a
>>>>> lower order before we notice it.
>>>>>
>>>>> Hmm... I guess if the sysfs thp settings have been changed then things
>>>>> could get
>>>>> spicy... if order-8 was previously enabled and we have an order-8 folio,
>>>>> then it
>>>>> get's disabled and khugepaged is scanning for order-4 (which is still enabled)
>>>>> then hits the order-8; what's the expected policy? rework into 2 order-4
>>>>> folios
>>>>> or leave it as as single order-8?
>>>>
>>>> Exactly, sorry, I should have made it clear in the patch description that I am
>>>> handling the following scenario: there is a long running system on which we are
>>>> using order-8 folios, and now we decide to downgrade to order-4. Will it be a
>>>> good idea to take the pain of splitting order-8 to 16 order-4 folios? This
>>>> should
>>>> be a rare situation in the first place, so I have currently decided to
>>>> ignore the
>>>> folios set up by the previous sysfs setting and only focus on collapsing
>>>> fresh memory.
>>>>
>>>> Thinking again, a sys-admin deciding to downgrade order of folios, should do
>>>> that in
>>>> the hopes of reducing internal fragmentation or increasing swap speed etc,
>>>> so it makes
>>>> sense to shatter large folios....maybe we can have a sysfs tunable for this?
>>>
>>> Maybe we should not support it (at runtime) at all. We are trying to build
>>> systems that don't require incredibly detailed sysadmin involvement, and
>>> this level of tweaking qualifies, thoroughly, as "incredibly detailed
>>> sysadmin micromanagement", imho.

Agreed that we definitely don't want any new controls for this.

>>
>> Ryan pointed out one thing: what about unaligned, or partially mapped large
>> folios? For the previous sysfs settings, it may happen that we have an unaligned
>> order-8 folio, let us say it got unaligned due to mremap(). Then it is a good
>> idea to start from the order-4 aligned page and start collapsing memory so
>> that we can take advantage of the contig bit. Otherwise if it is a fully-mapped
>> aligned order-8 folio, then we anyways are abusing the contig bit advantage
>> so collapsing is pointless.
> 
> 
> In fact, in the current code, we are collapsing an unaligned PMD-size folio to an
> aligned PMD-mapped folio; we will not see a block mapping in the PMD, and go
> ahead with the scan...so the logic should be, skip the scan if the VAs and PAs are
> aligned.

Here is my stab at what the policy should be. It sounds quite complex, but
actually I think the implementation is quite simple.

memory may initially be mapped partially or fully from large or small folios.
And the mapping may be naturally aligned for the folio order or not.

khugepaged's goal should always to end up with the largest, aligned orders that
are possible. It will only collapse to an enabled order, but it will never
collapse to a smaller or equal order than the source memory's contiguous mapping
_and_ aligment.

So if for example, the first (or last) 64K of a 128K folio is mapped on a 64K VA
boundary, if scanning for order-4 (64K) we would leave that partial mapping
alone. But if either less than 64K was mapped from the 128K folio, or it was
mapped in a way that it was not 64K aligned in VA, then it would be a candidate
for collapse to a new 64K folio.

I think this can all be implemented by remembering if all of the PFNs for the
VA-aligned order that is being scanned are present and congituous and if they
all belong to the same folio. If that is true, and if the first PFN is aligned
on a PA boundary for that order, then skip the collapse and move to the next
block (addr + (PAGE_SIZE << order)).

I think this will give us the properties we want; if we encounter a larger folio
that is fully mapped and aligned, we will leave it alone. If we encounter a mix
of small folios and partially mapped large folios we will collapse. If we
encounter a partial mapping of a larger folio but that partial mapping fills the
block we are scanning in an appropriately aligned manner, we will leave it alone.

Thanks,
Ryan

> 
>>>
>>> Apologies for not having gone through the series in detail yet, but this
>>> point jumped out at me.
>>>
>>> thanks,
>>