Message-ID: <CAGsJ_4yLocY3c0oi022b=2Eq2CoxttQMn-XeWuSrjbKPmt5QPA@mail.gmail.com>
Date: Mon, 29 Dec 2025 21:20:12 +1300
From: Barry Song <21cnbao@...il.com>
To: Vernon Yang <vernon2gm@...il.com>
Cc: akpm@...ux-foundation.org, david@...nel.org, lorenzo.stoakes@...cle.com, 
	ziy@...dia.com, dev.jain@....com, lance.yang@...ux.dev, 
	richard.weiyang@...il.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Vernon Yang <yanglincheng@...inos.cn>
Subject: Re: [PATCH v2 3/4] mm: khugepaged: set VM_NOHUGEPAGE flag when MADV_COLD/MADV_FREE

On Mon, Dec 29, 2025 at 6:52 PM Vernon Yang <vernon2gm@...il.com> wrote:
>
> For example, create three tasks: hot1 -> cold -> hot2. After all three
> tasks are created, each allocates 128 MB of memory. The hot1/hot2 tasks
> continuously access their 128 MB of memory, while the cold task only
> accesses its memory briefly and then calls madvise(MADV_COLD). However,
> khugepaged still prioritizes scanning the cold task and only scans the
> hot2 task after completing the scan of the cold task.
>
> So if the user has explicitly informed us via MADV_COLD/FREE that this
> memory is cold or will be freed, it is appropriate for khugepaged to
> simply skip it, thereby avoiding unnecessary scan and collapse
> operations and reducing wasted CPU time.
>
> Here are the performance test results:
> (For Throughput, higher is better; for the other metrics, lower is better)
>
> Testing on x86_64 machine:
>
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.14 sec     |  2.93 sec     | -6.69%  |
> | cycles per access   |  4.96         |  2.21         | -55.44% |
> | Throughput          |  104.38 M/sec |  111.89 M/sec | +7.19%  |
> | dTLB-load-misses    |  284814532    |  69597236     | -75.56% |
>
> Testing on qemu-system-x86_64 -enable-kvm:
>
> | task hot2           | without patch | with patch    |  delta  |
> |---------------------|---------------|---------------|---------|
> | total accesses time |  3.35 sec     |  2.96 sec     | -11.64% |
> | cycles per access   |  7.29         |  2.07         | -71.60% |
> | Throughput          |  97.67 M/sec  |  110.77 M/sec | +13.41% |
> | dTLB-load-misses    |  241600871    |  3216108      | -98.67% |
>
> Signed-off-by: Vernon Yang <yanglincheng@...inos.cn>
> ---
>  mm/madvise.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index b617b1be0f53..3a48d725a3fc 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1360,11 +1360,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>                 return madvise_remove(madv_behavior);
>         case MADV_WILLNEED:
>                 return madvise_willneed(madv_behavior);
> -       case MADV_COLD:
> -               return madvise_cold(madv_behavior);
>         case MADV_PAGEOUT:
>                 return madvise_pageout(madv_behavior);
> -       case MADV_FREE:
>         case MADV_DONTNEED:
>         case MADV_DONTNEED_LOCKED:
>                 return madvise_dontneed_free(madv_behavior);
> @@ -1378,6 +1375,18 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>
>         /* The below behaviours update VMAs via madvise_update_vma(). */
>
> +       case MADV_COLD:
> +               error = madvise_cold(madv_behavior);
> +               if (error)
> +                       goto out;
> +               new_flags = (new_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
> +               break;
> +       case MADV_FREE:
> +               error = madvise_dontneed_free(madv_behavior);
> +               if (error)
> +                       goto out;
> +               new_flags = (new_flags & ~VM_HUGEPAGE) | VM_NOHUGEPAGE;
> +               break;

I am not convinced this is the right patch for MADV_FREE. Userspace
heaps may call MADV_FREE on free(), which does not mean they no longer
want huge pages; it only indicates that the old contents are no longer
needed. New allocations may still occur in the same region.
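
To make the heap case concrete, here is a minimal userspace sketch, assuming
Linux with MADV_FREE support (4.5 or later). The helper name
reuse_after_madv_free() is hypothetical and is not taken from any real
allocator; it only imitates the "free(), then allocate again from the same
range" pattern described above.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* Linux value; older libc headers may lack it */
#endif

/*
 * Hypothetical helper: imitate a heap that hands a chunk back with
 * MADV_FREE on free() and then reuses the same range for the next
 * allocation.
 *
 * Returns 1 if the range was reusable after the hint, 0 if not, and
 * -1 if the mapping could not be created. The madvise() result is
 * stored in *hint_rc (if non-NULL) and does not affect the return
 * value, so the sketch also runs on kernels without MADV_FREE.
 */
static int reuse_after_madv_free(size_t len, int *hint_rc)
{
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	size_t i;
	int rc, ok = 1;

	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, len);            /* lifetime of the first allocation */
	rc = madvise(p, len, MADV_FREE); /* "free()": old contents not needed */
	if (hint_rc)
		*hint_rc = rc;

	memset(p, 0x55, len);            /* next allocation reuses the range */
	for (i = 0; i < len; i++) {
		if (p[i] != 0x55) {
			ok = 0;
			break;
		}
	}

	munmap(p, len);
	return ok;
}
```

Calling reuse_after_madv_free(2UL << 20, &rc) exercises one such cycle on a
2 MB range. The range is live again right after the hint, which is why a
sticky VM_NOHUGEPAGE on the VMA would outlast what the hint was meant to say.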

The same concern applies to MADV_COLD. MADV_COLD may only indicate
that the VMA is cold at the moment and for the near future, but it
can become hot again. For example, MADV_COLD may be issued when an
app moves to the background, but the memory can become hot again
once the app returns to the foreground.
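
The background/foreground case can be sketched the same way, assuming Linux
with MADV_COLD support (5.4 or later). The helper name cold_then_hot_again()
is hypothetical; it stands in for an app whose memory is hinted cold on
backgrounding and touched again on return.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20 /* Linux value; older libc headers may lack it */
#endif

/*
 * Hypothetical helper: fill a private anonymous range, hint it cold as
 * if the app moved to the background, then read every byte again as if
 * the app came back to the foreground.
 *
 * Returns 1 if all data is intact after the hint, 0 if not, and -1 if
 * the mapping could not be created. The madvise() result is stored in
 * *hint_rc (if non-NULL) and does not affect the return value.
 */
static int cold_then_hot_again(size_t len, int *hint_rc)
{
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	size_t i;
	int rc, ok = 1;

	if (p == MAP_FAILED)
		return -1;

	for (i = 0; i < len; i++)        /* app is active: memory is hot */
		p[i] = (unsigned char)(i & 0xff);

	rc = madvise(p, len, MADV_COLD); /* app goes to the background */
	if (hint_rc)
		*hint_rc = rc;

	for (i = 0; i < len; i++) {      /* app returns: memory is hot again */
		if (p[i] != (unsigned char)(i & 0xff)) {
			ok = 0;
			break;
		}
	}

	munmap(p, len);
	return ok;
}
```

MADV_COLD is non-destructive, so the data survives and the same range is in
active use again as soon as the app touches it; nothing in this sequence tells
the kernel that the range should stay excluded from collapse.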

In short, MADV_FREE and MADV_COLD only indicate that the memory is cold
or may be freed for a period of time; they are not permanent states.
Changing the VMA flags implies that the VMA is permanently free or
cold, which is not true in either case.

Your patch also prevents potential per-VMA lock optimizations.

However, if the intent is to treat folios hinted by MADV_FREE or
MADV_COLD as candidates not to be collapsed, I agree that this makes sense.

For MADV_FREE, could we simply skip the lazy-free folios instead?
For MADV_COLD, I am not sure how we can determine which folios
have actually been madvised as cold.

Thanks
Barry
