Message-ID: <ad2c1355-be13-45d9-8474-a32a4b710aa5@kernel.org>
Date: Tue, 23 Dec 2025 10:59:29 +0100
From: "David Hildenbrand (Red Hat)" <david@...nel.org>
To: Vernon Yang <vernon2gm@...il.com>
Cc: Wei Yang <richard.weiyang@...il.com>, akpm@...ux-foundation.org,
lorenzo.stoakes@...cle.com, ziy@...dia.com, baohua@...nel.org,
lance.yang@...ux.dev, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Vernon Yang <yanglincheng@...inos.cn>
Subject: Re: [PATCH 3/4] mm: khugepaged: move mm to list tail when
MADV_COLD/MADV_FREE

On 12/21/25 13:34, Vernon Yang wrote:
> On Sun, Dec 21, 2025 at 10:24:11AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/21/25 05:25, Vernon Yang wrote:
>>> On Sun, Dec 21, 2025 at 02:10:44AM +0000, Wei Yang wrote:
>>>> On Fri, Dec 19, 2025 at 09:58:17AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>> On 12/19/25 06:29, Vernon Yang wrote:
>>>>>> On Thu, Dec 18, 2025 at 10:31:58AM +0100, David Hildenbrand (Red Hat) wrote:
>>>>>>> On 12/15/25 10:04, Vernon Yang wrote:
>>>>>>>> For example, create three tasks: hot1 -> cold -> hot2. After all three
>>>>>>>> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
>>>>>>>> continuously access their 128 MB of memory, while the cold task only
>>>>>>>> accesses its memory briefly and then calls madvise(MADV_COLD). However,
>>>>>>>> khugepaged still prioritizes scanning the cold task and only scans the
>>>>>>>> hot2 task after completing the scan of the cold task.
>>>>>>>>
>>>>>>>> So if the user has explicitly informed us via MADV_COLD/FREE that this
>>>>>>>> memory is cold or will be freed, it is appropriate for khugepaged to
>>>>>>>> scan it only at the latest possible moment, thereby avoiding unnecessary
>>>>>>>> scan and collapse operations and reducing wasted CPU time.
>>>>>>>>
>>>>>>>> Here are the performance test results:
>>>>>>>> (For throughput, higher is better; for the other metrics, lower is better)
>>>>>>>>
>>>>>>>> Testing on x86_64 machine:
>>>>>>>>
>>>>>>>> | task hot2           | without patch | with patch    | delta   |
>>>>>>>> |---------------------|---------------|---------------|---------|
>>>>>>>> | total accesses time | 3.14 sec      | 2.92 sec      | -7.01%  |
>>>>>>>> | cycles per access   | 4.91          | 2.07          | -57.84% |
>>>>>>>> | Throughput          | 104.38 M/sec  | 112.12 M/sec  | +7.42%  |
>>>>>>>> | dTLB-load-misses    | 288966432     | 1292908       | -99.55% |
>>>>>>>>
>>>>>>>> Testing on qemu-system-x86_64 -enable-kvm:
>>>>>>>>
>>>>>>>> | task hot2           | without patch | with patch    | delta   |
>>>>>>>> |---------------------|---------------|---------------|---------|
>>>>>>>> | total accesses time | 3.35 sec      | 2.96 sec      | -11.64% |
>>>>>>>> | cycles per access   | 7.23          | 2.12          | -70.68% |
>>>>>>>> | Throughput          | 97.88 M/sec   | 110.76 M/sec  | +13.16% |
>>>>>>>> | dTLB-load-misses    | 237406497     | 3189194       | -98.66% |
>>>>>>>
>>>>>>> Again, I also don't like that because you make assumptions about a full
>>>>>>> process based on some part of its address space.
>>>>>>>
>>>>>>> E.g., if a library issues a MADV_COLD on some part of the memory the library
>>>>>>> manages, why should the remaining part of the process suffer as well?
>>>>>>
>>>>>> Yes, you make a good point, thanks!
>>>>>>
>>>>>>> This seems to be a heuristic focused on some specific workloads, no?
>>>>>>
>>>>>> Right.
>>>>>>
>>>>>> Could we use the VM_NOHUGEPAGE flag to indicate that this region should
>>>>>> not be collapsed, so that khugepaged can simply skip this VMA during
>>>>>> scanning? This way, it won't affect the remaining part of the task's
>>>>>> memory regions.
>>>>>
>>>>> I thought we would skip these regions already properly in khugepaged, or
>>>>> maybe I misunderstood your question.
>>>>>
>>>>
>>>> I think we should, but it seems we don't do this for anonymous memory
>>>> in khugepaged.
>>>>
>>>> We check the vma with thp_vma_allowable_order() during scan.
>>>>
>>>> * For anonymous memory during khugepaged, if we always enable 2M collapse,
>>>> we will scan this vma, even if VM_NOHUGEPAGE is set.
>>>>
>>>> * For other cases, it looks good since __thp_vma_allowable_order() will skip
>>>> this vma with vma_thp_disabled().
>>>
>>> Hi David, Wei,
>>>
>>> khugepaged already checks the VM_NOHUGEPAGE flag for anonymous memory
>>> during the scan, as shown below:
>>>
>>> khugepaged_scan_mm_slot()
>>>     thp_vma_allowable_order()
>>>         thp_vma_allowable_orders()
>>>             __thp_vma_allowable_orders()
>>>                 vma_thp_disabled() {
>>>                     if (vm_flags & VM_NOHUGEPAGE)
>>>                         return true;
>>>                 }
>>>
>>> REAL ISSUE: madvise(MADV_COLD) does not set the VM_NOHUGEPAGE flag on the
>>> vma, so khugepaged will continue to scan this vma.
>>>
>>> I set the VM_NOHUGEPAGE flag on the vma on madvise(MADV_COLD), and the
>>> test was successful. I will send it in the next version.
>>
>> No, we must not do that. That's a user-space visible change. :/
>
> David, do you have any good ideas for achieving this goal? Please let me
> know, thanks!

Your idea would be to skip a VMA when we issue madvise(MADV_COLD).
That sounds like yet another heuristic that can easily be wrong? :/

In particular, imagine if the VMA is much larger than the madvise'd
region (other parts used for something else) or if the previously cold
memory area is used for something that is now hot.

With memory allocators that manage most of the memory in a single large
VMA, it's rather easy to see how such a heuristic would be bad, no?
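
To make that scenario concrete, here is a minimal userspace sketch. It is
not part of the patch series: advise_cold_subrange() is a hypothetical
helper, the 4096-byte touch stride is an assumption about the page size,
and MADV_COLD needs Linux 5.4 or later. One large anonymous mapping stands
in for an allocator arena; only a slice of it is advised cold, while the
rest stays in use and the slice itself is written again afterwards.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20	/* Linux UAPI value; older libc headers may lack it */
#endif

/*
 * Hypothetical helper: map one large anonymous region (like an allocator
 * arena), touch all of it, then advise only [off, off + len) as cold.
 * off must be page-aligned. Returns 0 on success or a negative errno.
 */
static int advise_cold_subrange(size_t total, size_t off, size_t len)
{
	char *p;
	int ret = 0;

	if (off > total || len > total - off)
		return -EINVAL;

	p = mmap(NULL, total, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -errno;

	/* Fault everything in: the whole arena is in use. */
	for (size_t i = 0; i < total; i += 4096)
		p[i] = 1;

	/* Only a slice is advised cold; the advice covers this page range. */
	if (madvise(p + off, len, MADV_COLD))
		ret = -errno;

	/* The rest is still hot, and the "cold" slice gets reused as well. */
	for (size_t i = 0; i < total; i += 4096)
		p[i]++;

	munmap(p, total);
	return ret;
}
```

For example, advise_cold_subrange(128UL << 20, 2UL << 20, 2UL << 20)
advises 2 MB out of a 128 MB mapping. A heuristic that skips the whole VMA
after that call would also skip the 126 MB that was never advised.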
--
Cheers
David