Message-ID: <084eee6b-6c9e-454b-a563-b2babb76b099@kernel.org>
Date: Tue, 30 Dec 2025 20:54:33 +0100
From: "David Hildenbrand (Red Hat)" <david@...nel.org>
To: Vernon Yang <vernon2gm@...il.com>, akpm@...ux-foundation.org,
 lorenzo.stoakes@...cle.com
Cc: ziy@...dia.com, dev.jain@....com, baohua@...nel.org,
 lance.yang@...ux.dev, richard.weiyang@...il.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Vernon Yang <yanglincheng@...inos.cn>
Subject: Re: [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when
 MADV_COLD/MADV_FREE

On 12/29/25 06:51, Vernon Yang wrote:
> For example, create three tasks: hot1 -> cold -> hot2. After all three
> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> continuously access their 128 MB of memory, while the cold task only
> accesses its memory briefly and then calls madvise(MADV_COLD). However,
> khugepaged still prioritizes scanning the cold task and only scans the
> hot2 task after completing the scan of the cold task.
> 
> So if the user has explicitly informed us via MADV_COLD/FREE that this
> memory is cold or will be freed, it is appropriate for khugepaged to
> simply skip it, thereby avoiding unnecessary scan and collapse
> operations and reducing wasted CPU time.
> 
> Here are the performance test results:
> (Throughput: higher is better; all other metrics: lower is better)
> 
> Testing on x86_64 machine:
> 
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> | cycles per access   |  4.96         |  2.21         | -55.44% |
> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
> 
> Testing on qemu-system-x86_64 -enable-kvm:
> 
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> | cycles per access   |  7.29         |  2.07         | -71.60% |
> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
> 
> Signed-off-by: Vernon Yang <yanglincheng@...inos.cn>
> ---

As raised in v1, this is not the way to go. Just because something was 
once indicated to be cold does not mean that it will stay like that 
forever.

Also,

(1) You are turning this into an operation that will perform VMA
     modifications and require the mmap lock in write mode, bad.

(2) You might now create many VMAs, possibly breaking user space, bad.

If user space knows that memory will stay cold, it can use madvise() to 
indicate that these regions are not a good fit for THPs.
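
For illustration, a minimal sketch of what that could look like from
user space, assuming a private anonymous mapping that the application
already knows will stay cold. The helper name alloc_cold_region() is
hypothetical; mmap(), madvise() and MADV_NOHUGEPAGE are the existing
Linux interfaces.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose MAP_ANONYMOUS and MADV_NOHUGEPAGE */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous region and tell the kernel it
 * is not a good fit for THPs. Returns NULL if the mapping fails.
 */
void *alloc_cold_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/*
	 * Best effort: the call fails with EINVAL on kernels built
	 * without THP support, in which case there is nothing to opt
	 * out of and the mapping is still usable.
	 */
	(void)madvise(p, len, MADV_NOHUGEPAGE);

	return p;
}
```

Applying the advice once to the whole mapping, right after mmap(),
keeps it in a single VMA. Applying it to sub-ranges later would split
VMAs and take the mmap lock in write mode, just like (1) and (2) above,
only this time as an explicit choice made by user space.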

But are they really not a good fit? What about smaller-order THPs?

Nobody knows, but changing the behavior like you suggest is definitely 
bad. :)

-- 
Cheers

David
