lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aca74036-f37f-4247-b3b8-112059f53659@lucifer.local>
Date: Thu, 31 Jul 2025 15:38:44 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, david@...hat.com,
        linux-mm@...ck.org, linux-fsdevel@...r.kernel.org, corbet@....net,
        rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
        hannes@...xchg.org, baohua@...nel.org, shakeel.butt@...ux.dev,
        riel@...riel.com, ziy@...dia.com, laoar.shao@...il.com,
        dev.jain@....com, baolin.wang@...ux.alibaba.com, npache@...hat.com,
        Liam.Howlett@...cle.com, ryan.roberts@....com, vbabka@...e.cz,
        jannh@...gle.com, Arnd Bergmann <arnd@...db.de>, sj@...nel.org,
        linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
        kernel-team@...a.com
Subject: Re: [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise
 with PR_THP_DISABLE_EXCEPT_ADVISED

Nits on subject:

- It's >75 chars
- advise is the verb, advice is the noun.

On Thu, Jul 31, 2025 at 01:27:20PM +0100, Usama Arif wrote:
> From: David Hildenbrand <david@...hat.com>
>
> Let's allow for making MADV_COLLAPSE succeed on areas that neither have
> VM_HUGEPAGE nor VM_NOHUGEPAGE when we have THP disabled
> unless explicitly advised (PR_THP_DISABLE_EXCEPT_ADVISED).

Hmm, I'm not sure about this.

So far this prctl() has been the only way to override MADV_COLLAPSE
behaviour, but now we're allowing for this one case to not.

I suppose the precedent is that MADV_COLLAPSE overrides 'madvise' sysfs
behaviour.

I suppose what saves us here is 'advised' can be read to mean either
MADV_HUGEPAGE or MADV_COLLAPSE.

And yes, MADV_COLLAPSE is clearly the user requesting this behaviour.

I think the vagueness here is one that already existed, because one could
perfectly one have expected MADV_COLLAPSE to obey sysfs and require
MADV_HUGEPAGE to have been applied, but of course this is not the case.

OK so fine.

BUT.

I think the MADV_COLLAPSE man page will need to be updated to mention this.

And I REALLY think we should update the THP doc too to mention all these
prctl() modes.

I'm not sure we cover that right now _at all_ and obviously we should
describe the new flags.

Usama - can you add a patch to this series to do that?

>
> MADV_COLLAPSE is a clear advise that we want to collapse.

advise -> advice.

>
> Note that we still respect the VM_NOHUGEPAGE flag, just like
> MADV_COLLAPSE always does. So consequently, MADV_COLLAPSE is now only
> refused on VM_NOHUGEPAGE with PR_THP_DISABLE_EXCEPT_ADVISED.

You also need to mention the shmem change you've made I think.

>
> Co-developed-by: Usama Arif <usamaarif642@...il.com>
> Signed-off-by: Usama Arif <usamaarif642@...il.com>
> Signed-off-by: David Hildenbrand <david@...hat.com>
> ---
>  include/linux/huge_mm.h    | 8 +++++++-
>  include/uapi/linux/prctl.h | 2 +-
>  mm/huge_memory.c           | 5 +++--
>  mm/memory.c                | 6 ++++--
>  mm/shmem.c                 | 2 +-
>  5 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index b0ff54eee81c..aeaf93f8ac2e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -329,7 +329,7 @@ struct thpsize {
>   * through madvise or prctl.
>   */
>  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
> -		vm_flags_t vm_flags)
> +		vm_flags_t vm_flags, bool forced_collapse)
>  {
>  	/* Are THPs disabled for this VMA? */
>  	if (vm_flags & VM_NOHUGEPAGE)
> @@ -343,6 +343,12 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
>  	 */
>  	if (vm_flags & VM_HUGEPAGE)
>  		return false;
> +	/*
> +	 * Forcing a collapse (e.g., madv_collapse), is a clear advise to

advise -> advice.

> +	 * use THPs.
> +	 */
> +	if (forced_collapse)
> +		return false;
>  	return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
>  }
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 9c1d6e49b8a9..ee4165738779 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -185,7 +185,7 @@ struct prctl_mm_map {
>  #define PR_SET_THP_DISABLE	41
>  /*
>   * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
> - * VM_HUGEPAGE).
> + * VM_HUGEPAGE / MADV_COLLAPSE).

This is confusing you're mixing VMA flags with MADV ones... maybe just
stick to madvise ones, or add extra context around VM_HUGEPAGE bit?

Would need to be fixed up in a prior commit obviously.

>   */
>  # define PR_THP_DISABLE_EXCEPT_ADVISED	(1 << 1)
>  #define PR_GET_THP_DISABLE	42
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 85252b468f80..ef5ccb0ec5d5 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -104,7 +104,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  {
>  	const bool smaps = type == TVA_SMAPS;
>  	const bool in_pf = type == TVA_PAGEFAULT;
> -	const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
> +	const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
> +	const bool enforce_sysfs = !forced_collapse;

Can we just get rid of this enforce_sysfs altogether in patch 2/5 and use
forced_collapse?

The first place we use it we negate it:

		return orders & shmem_allowable_huge_orders(file_inode(vma->vm_file),
						   vma, vma->vm_pgoff, 0,
						   !enforce_sysfs);

And the one other place we'd have to negate, but it actually helps document
behaviour I think.


>  	unsigned long supported_orders;
>
>  	/* Check the intersection of requested and supported orders. */
> @@ -122,7 +123,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  	if (!vma->vm_mm)		/* vdso */
>  		return 0;
>
> -	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags))
> +	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vm_flags, forced_collapse))
>  		return 0;
>
>  	/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
> diff --git a/mm/memory.c b/mm/memory.c
> index be761753f240..bd04212d6f79 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5186,9 +5186,11 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
>  	 * It is too late to allocate a small folio, we already have a large
>  	 * folio in the pagecache: especially s390 KVM cannot tolerate any
>  	 * PMD mappings, but PTE-mapped THP are fine. So let's simply refuse any
> -	 * PMD mappings if THPs are disabled.
> +	 * PMD mappings if THPs are disabled. As we already have a THP ...
> +	 * behave as if we are forcing a collapse.
>  	 */
> -	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags))
> +	if (thp_disabled_by_hw() || vma_thp_disabled(vma, vma->vm_flags,
> +						     /* forced_collapse=*/ true))
>  		return ret;
>
>  	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e6cdfda08aed..30609197a266 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1816,7 +1816,7 @@ unsigned long shmem_allowable_huge_orders(struct inode *inode,
>  	vm_flags_t vm_flags = vma ? vma->vm_flags : 0;
>  	unsigned int global_orders;
>
> -	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags)))
> +	if (thp_disabled_by_hw() || (vma && vma_thp_disabled(vma, vm_flags, shmem_huge_force)))

Hm this is an extra bit of logic, as noted above, we definitely need to
mention we're changing this in the commit message also.

Since shmem_allowable_huge_orders() can only be invoked by
__thp_vma_allowable_orders() with shmem_huge_force set true, and it'd only
be the case should TVA_FORCED_COLLAPSE is set, this makes sense.


>  		return 0;
>
>  	global_orders = shmem_huge_global_enabled(inode, index, write_end,
> --
> 2.47.3
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ