Message-ID: <b73961a2-87ec-45a5-b6fb-83d3505a0f39@redhat.com>
Date: Tue, 27 Aug 2024 13:46:26 +0200
From: David Hildenbrand <david@...hat.com>
To: Johannes Weiner <hannes@...xchg.org>, Usama Arif <usamaarif642@...il.com>
Cc: Nico Pache <npache@...hat.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>, Barry Song <baohua@...nel.org>,
Ryan Roberts <ryan.roberts@....com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, Lance Yang
<ioworker0@...il.com>, Peter Xu <peterx@...hat.com>,
Rafael Aquini <aquini@...hat.com>, Andrea Arcangeli <aarcange@...hat.com>,
Jonathan Corbet <corbet@....net>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Zi Yan <ziy@...dia.com>
Subject: Re: [RFC 0/2] mm: introduce THP deferred setting
On 27.08.24 13:09, Johannes Weiner wrote:
> On Tue, Aug 27, 2024 at 11:37:14AM +0100, Usama Arif wrote:
>>
>>
>> On 26/08/2024 17:14, Nico Pache wrote:
>>> On Mon, Aug 26, 2024 at 10:47 AM Usama Arif <usamaarif642@...il.com> wrote:
>>>>
>>>>
>>>>
>>>> On 26/08/2024 11:40, Nico Pache wrote:
>>>>> On Tue, Jul 30, 2024 at 4:37 PM Nico Pache <npache@...hat.com> wrote:
>>>>>>
>>>>>> Hi Zi Yan,
>>>>>> On Mon, Jul 29, 2024 at 7:26 PM Zi Yan <ziy@...dia.com> wrote:
>>>>>>>
>>>>>>> +Kirill
>>>>>>>
>>>>>>> On 29 Jul 2024, at 18:27, Nico Pache wrote:
>>>>>>>
>>>>>>>> We've seen cases where customers switching from RHEL7 to RHEL8 see a
>>>>>>>> significant increase in the memory footprint for the same workloads.
>>>>>>>>
>>>>>>>> Through our investigations we found that a large contributing factor to
>>>>>>>> the increase in RSS was an increase in THP usage.
>>>>>>>
>>>>>>> Was any knob changed from RHEL7 to RHEL8 that would cause more THP usage?
>>>>>> IIRC, most of the system tuning is the same. We attributed the
>>>>>> increase in THP usage to a combination of improvements in the kernel,
>>>>>> and improvements in the libraries (better alignments). That allowed
>>>>>> THP allocations to succeed at a higher rate. I can go back and confirm
>>>>>> this tomorrow though.
>>>>>>>
>>>>>>>>
>>>>>>>> For workloads like MySQL, or when using allocators like jemalloc, it is
>>>>>>>> often recommended to set /sys/kernel/mm/transparent_hugepage/enabled=never.
>>>>>>>> This is in part due to performance degradation and increased memory waste.
>>>>>>>>
>>>>>>>> This series introduces enabled=defer; this setting acts as a middle
>>>>>>>> ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
>>>>>>>> page fault handler will act normally, making a hugepage if possible. If
>>>>>>>> the allocation is not MADV_HUGEPAGE, then the page fault handler will
>>>>>>>> default to the base size allocation. The caveat is that khugepaged can
>>>>>>>> still operate on pages that are not MADV_HUGEPAGE.
>>>>>>>
>>>>>>> Why? If the user does not explicitly want huge pages, why bother providing
>>>>>>> them? Wouldn't it increase the memory footprint?
>>>>>>
>>>>>> So we have "always", which will always try to allocate a THP when it
>>>>>> can. This setting gives good performance in a lot of conditions, but
>>>>>> tends to waste memory. Additionally, applications DON'T need to be
>>>>>> modified to take advantage of THPs.
>>>>>>
>>>>>> We have "madvise" which will only satisfy allocations that are
>>>>>> MADV_HUGEPAGE, this gives you granular control, and a lot of times
>>>>>> these madvises come from libraries. Unlike "always" you DO need to
>>>>>> modify your application if you want to use THPs.
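>>>>>>
>>>>>> For illustration (not part of this series), a minimal userspace sketch
>>>>>> of that application-side opt-in; the mmap() arena and its size are just
>>>>>> example choices:
>>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <sys/mman.h>
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>>         size_t len = 64UL << 20;        /* 64 MiB anonymous arena */
>>>>>>         void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
>>>>>>                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>>>
>>>>>>         if (buf == MAP_FAILED)
>>>>>>                 return 1;
>>>>>>
>>>>>>         /* Opt this mapping in to THP: under "madvise" (and the proposed
>>>>>>          * "defer") only VMAs marked like this get THPs at fault time. */
>>>>>>         if (madvise(buf, len, MADV_HUGEPAGE))
>>>>>>                 perror("madvise");
>>>>>>
>>>>>>         ((char *)buf)[0] = 1;   /* first touch can now fault in a PMD THP */
>>>>>>         return 0;
>>>>>> }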
>>>>>>
>>>>>> Then we have "never", which of course, never allocates THPs.
>>>>>>
>>>>>> OK, back to your question: like "madvise", "defer" gives you the
>>>>>> benefits of THPs when you specifically know you want them
>>>>>> (MADV_HUGEPAGE), but also benefits applications that don't specifically
>>>>>> ask for them (or can't be modified to ask for them), like "always"
>>>>>> does. The applications that don't ask for THPs must wait for khugepaged
>>>>>> to get them (avoiding insertions at page-fault time) -- this curbs a
>>>>>> lot of memory waste and gives increased tunability over "always".
>>>>>> Another added benefit is that khugepaged will most likely not operate
>>>>>> on short-lived allocations, meaning that only long-standing memory will
>>>>>> be collapsed into THPs.
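>>>>>>
>>>>>> To make that concrete, here is a tiny userspace model (purely
>>>>>> illustrative -- not the actual patch, and the enum/function names are
>>>>>> made up) of which mode attempts a PMD-sized THP at fault time and which
>>>>>> allows khugepaged to collapse later:
>>>>>>
>>>>>> #include <stdbool.h>
>>>>>> #include <stdio.h>
>>>>>>
>>>>>> enum thp_mode { THP_ALWAYS, THP_MADVISE, THP_DEFER, THP_NEVER };
>>>>>>
>>>>>> /* Does the page fault handler try a PMD-sized THP right away? */
>>>>>> static bool fault_tries_pmd_thp(enum thp_mode mode, bool madv_hugepage)
>>>>>> {
>>>>>>         switch (mode) {
>>>>>>         case THP_ALWAYS:
>>>>>>                 return true;            /* THP at fault whenever possible */
>>>>>>         case THP_MADVISE:
>>>>>>         case THP_DEFER:
>>>>>>                 return madv_hugepage;   /* only MADV_HUGEPAGE mappings */
>>>>>>         default:
>>>>>>                 return false;           /* "never" */
>>>>>>         }
>>>>>> }
>>>>>>
>>>>>> /* May khugepaged later collapse this memory into a THP? */
>>>>>> static bool khugepaged_may_collapse(enum thp_mode mode, bool madv_hugepage)
>>>>>> {
>>>>>>         return mode == THP_ALWAYS || mode == THP_DEFER ||
>>>>>>                (mode == THP_MADVISE && madv_hugepage);
>>>>>> }
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>>         /* "defer" without MADV_HUGEPAGE: no fault-time THP (0),
>>>>>>          * but khugepaged may still collapse it later (1). */
>>>>>>         printf("fault=%d khugepaged=%d\n",
>>>>>>                fault_tries_pmd_thp(THP_DEFER, false),
>>>>>>                khugepaged_may_collapse(THP_DEFER, false));
>>>>>>         return 0;
>>>>>> }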
>>>>>>
>>>>>> The memory waste can be tuned with max_ptes_none... let's say you want
>>>>>> ~90% of your PMD range to be populated before collapsing into a huge
>>>>>> page: simply set max_ptes_none=64. For no waste, set max_ptes_none=0,
>>>>>> requiring all 512 pages to be present before being collapsed.
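>>>>>>
>>>>>> For reference, the knob lives under
>>>>>> /sys/kernel/mm/transparent_hugepage/khugepaged/. A minimal (root-only)
>>>>>> snippet applying the ~90% example -- 64 "none" PTEs allowed out of 512
>>>>>> means at least 448/512 (~87.5%) of the PMD range must be populated:
>>>>>>
>>>>>> #include <fcntl.h>
>>>>>> #include <stdio.h>
>>>>>> #include <unistd.h>
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>>         const char *knob =
>>>>>>                 "/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none";
>>>>>>         int fd = open(knob, O_WRONLY);
>>>>>>
>>>>>>         /* Allow up to 64 unpopulated PTEs per 512-entry PMD range. */
>>>>>>         if (fd < 0 || write(fd, "64", 2) != 2)
>>>>>>                 perror(knob);
>>>>>>         if (fd >= 0)
>>>>>>                 close(fd);
>>>>>>         return 0;
>>>>>> }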
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> This allows for two things... one, applications specifically designed to
>>>>>>>> use hugepages will get them, and two, applications that don't use
>>>>>>>> hugepages can still benefit from them without aggressively inserting
>>>>>>>> THPs at every possible chance. This curbs the memory waste and defers
>>>>>>>> the use of hugepages to khugepaged. Khugepaged can then scan memory
>>>>>>>> for eligible collapse candidates.
>>>>>>>
>>>>>>> khugepaged would replace application memory with huge pages without a
>>>>>>> specific goal. Why not use a userspace agent with process_madvise() to
>>>>>>> collapse huge pages? An admin might have more knobs to tweak than
>>>>>>> khugepaged offers.
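>>>>>>>
>>>>>>> For example, a rough sketch of such an agent (assuming a kernel that
>>>>>>> accepts MADV_COLLAPSE through process_madvise(), i.e. v6.1+, and recent
>>>>>>> kernel headers; the pid and address range are placeholders the agent
>>>>>>> would pick by its own policy, e.g. from smaps):
>>>>>>>
>>>>>>> #define _GNU_SOURCE
>>>>>>> #include <stdio.h>
>>>>>>> #include <sys/mman.h>
>>>>>>> #include <sys/syscall.h>
>>>>>>> #include <sys/uio.h>
>>>>>>> #include <unistd.h>
>>>>>>>
>>>>>>> #ifndef MADV_COLLAPSE
>>>>>>> #define MADV_COLLAPSE 25
>>>>>>> #endif
>>>>>>>
>>>>>>> int main(void)
>>>>>>> {
>>>>>>>         pid_t pid = 1234;               /* target process (placeholder) */
>>>>>>>         struct iovec iov = {
>>>>>>>                 .iov_base = (void *)0x7f0000000000UL, /* hot range (placeholder) */
>>>>>>>                 .iov_len  = 2UL << 20,                /* one PMD-sized region */
>>>>>>>         };
>>>>>>>         int pidfd = syscall(SYS_pidfd_open, pid, 0);
>>>>>>>
>>>>>>>         if (pidfd < 0) {
>>>>>>>                 perror("pidfd_open");
>>>>>>>                 return 1;
>>>>>>>         }
>>>>>>>
>>>>>>>         /* Collapse the range into THPs on the target's behalf. */
>>>>>>>         if (syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE, 0) < 0)
>>>>>>>                 perror("process_madvise");
>>>>>>>
>>>>>>>         close(pidfd);
>>>>>>>         return 0;
>>>>>>> }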
>>>>>>
>>>>>> The benefits of "always" are that no userspace agent is needed, and
>>>>>> applications dont have to be modified to use madvise(MADV_HUGEPAGE) to
>>>>>> benefit from THPs. This setting hopes to gain some of the same
>>>>>> benefits without the significant waste of memory and an increased
>>>>>> tunability.
>>>>>>
>>>>>> Future changes I have in the works aim to make khugepaged "smarter":
>>>>>> moving it away from the round-robin fashion it currently operates in,
>>>>>> so that it instead makes informed decisions about what memory to
>>>>>> collapse (and potentially split).
>>>>>>
>>>>>> Hopefully that helped explain the motivation for this new setting!
>>>>>
>>>>> Any last comments before I resend this?
>>>>>
>>>>> I've been made aware of
>>>>> https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u
>>>>> which introduces THP splitting. These are both trying to achieve the
>>>>> same thing through different means. Our approach leverages khugepaged
>>>>> to promote pages, while Usama's uses the reclaim path to demote
>>>>> hugepages and shrink the underlying memory.
>>>>>
>>>>> I will leave it up to reviewers to determine which is better; however,
>>>>> we can't have both, as we'd be introducing thrashing conditions.
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> Just inserting this here from my cover letter:
>>>>
>>>> Waiting for khugepaged to scan memory and
>>>> collapse pages into THPs can be slow and unpredictable in terms of performance
>>> Obviously not part of my patchset here, but I have been testing some
>>> changes to khugepaged to make it more aware of what processes are hot.
>>> Ideally it can then make better choices about what to operate on.
>>>> (i.e. you don't know when the collapse will happen), while production
>>>> environments require predictable performance. If there is enough memory
>>>> available, it's better for both performance and predictability to have
>>>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>>>> to collapse it, and deal with sparsely populated THPs when the system is
>>>> running out of memory.
>>>>
>>>> I just went through your patches, and am not sure why we can't have both?
>>> Fair point, we can. I've been playing around with splitting hugepages
>>> via khugepaged and was thinking of the thrashing conditions there --
>>> but your implementation takes a different approach.
>>> I've been working on performance testing my "defer" changes; once I
>>> find the appropriate workloads I'll try adding your changes to the
>>> mix. I have a feeling my approach is better for latency-sensitive
>>> workloads, while yours is better for throughput, but let me find a way
>>> to confirm that.
>>>
>>>
>> Hmm, I am not sure if it's latency vs. throughput.
>>
>> There are two things we probably want to consider: short-lived and long-lived mappings, and
>> in each of these cases, having enough memory versus running out of memory.
>>
>> For short-lived mappings, I believe reducing page faults is a bigger factor in
>> improving performance. In that case, khugepaged won't have enough time to work,
>> so THP=always will perform better than THP=defer. THP=defer in this case will perform
>> the same as THP=madvise?
>> If there is enough memory, then the changes I introduced in the shrinker won't cost anything
>> as the shrinker won't run, and the system performance will be the same as THP=always.
>> If there is low memory and the shrinker runs, it will only split THPs that have more
>> zero-filled pages than max_ptes_none, and map the zero-filled pages to the shared zero-page, saving memory.
>> There is of course a cost to splitting and running the shrinker, but hopefully it only splits
>> underused THPs.
>>
>> For long-lived mappings, reduced TLB misses would be the bigger factor in improving performance.
>> For the initial run of the application, THP=always will perform better wrt TLB misses as the
>> page fault handler will give THPs from the start.
>> Later on in the run, the memory usage might look similar between THP=always with the shrinker and
>> max_ptes_none < HPAGE_PMD_NR vs THP=defer and max_ptes_none < HPAGE_PMD_NR?
>> This is because khugepaged will have collapsed the pages that were initially faulted in as base pages.
>> And collapsing has a cost, which would not have been incurred if the THPs were present from fault time.
>> If there is low memory, then the shrinker would split memory (which has a cost as well) and the system
>> memory usage would look similar to or better than with THP=defer, as the shrinker would split THPs that
>> initially might not have been underused, but are underused at the time of memory pressure.
>>
>> With THP=always + the underused shrinker, the cost (splitting) is incurred only if and when it's needed,
>> while with THP=defer the cost (higher page faults, higher TLB misses + khugepaged collapse) is incurred
>> all the time, even if the system has plenty of memory available and there is no need to take a performance hit.
>
> I agree with this. The defer mode is an improvement over the upstream
> status quo, no doubt. However, both defer mode and the shrinker solve
> the issue of memory waste under pressure, while the shrinker permits
> more desirable behavior when memory is abundant.
>
> So my take is that the shrinker is the way to go, and I don't see a
> bona fide use case for defer mode that the shrinker couldn't cover.
Page fault latency? IOW, zeroing a complete THP, which might be up to
512 MiB on arm64 (with a 64 KiB base page size). This is one of the
things people bring up, where FreeBSD is different because it will zero
fragments on demand (but that also results in more page faults).
On the downside, in the past we could easily and repeatedly fail to
collapse THPs in busy environments. With per-VMA locks this might have
improved in the meantime.
--
Cheers,
David / dhildenb