Message-ID: <aOPEGkWdbeY2htsH@localhost.localdomain>
Date: Mon, 6 Oct 2025 15:28:58 +0200
From: Oscar Salvador <osalvador@...e.de>
To: Deepanshu Kartikey <kartikey406@...il.com>
Cc: muchun.song@...ux.dev, david@...hat.com, akpm@...ux-foundation.org,
lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, vbabka@...e.cz,
rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
broonie@...nel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
syzbot+f26d7c75c26ec19790e7@...kaller.appspotmail.com
Subject: Re: [PATCH v3] hugetlbfs: skip PMD unsharing when shareable lock
unavailable
On Fri, Oct 03, 2025 at 11:15:53PM +0530, Deepanshu Kartikey wrote:
> When hugetlb_vmdelete_list() cannot acquire the shareable lock for a VMA,
> the previous fix (dd83609b8898) skipped the entire VMA to avoid lock
> assertions in huge_pmd_unshare(). However, this prevented pages from being
> unmapped and freed, causing a regression in fallocate(PUNCH_HOLE) operations
> where pages were not freed immediately, as reported by Mark Brown.
>
> The issue occurs because:
> 1. hugetlb_vmdelete_list() calls hugetlb_vma_trylock_write()
> 2. For shareable VMAs, this attempts to acquire the shareable lock
> 3. If successful, huge_pmd_unshare() expects the lock to be held
> 4. huge_pmd_unshare() asserts the lock via hugetlb_vma_assert_locked()
>
> The v2 fix avoided calling code that requires locks, but this prevented
> page unmapping entirely, breaking the expected behavior where pages are
> freed during punch hole operations.
>
> This v3 fix takes a different approach: instead of skipping the entire VMA,
> we skip only the PMD unsharing operation when we don't have the required
> lock, while still proceeding with page unmapping. This is safe because:
>
> - PMD unsharing is an optimization to reduce shared page table overhead
> - Page unmapping can proceed safely with just the VMA write lock
> - Pages get freed immediately as expected by PUNCH_HOLE operations
> - The PMD metadata will be cleaned up when the VMA is destroyed
>
> We introduce a new ZAP_FLAG_NO_UNSHARE flag that communicates to
> __unmap_hugepage_range() that it should skip huge_pmd_unshare() while
> still clearing page table entries and freeing pages.
>
> Reported-by: syzbot+f26d7c75c26ec19790e7@...kaller.appspotmail.com
> Reported-by: Mark Brown <broonie@...nel.org>
> Fixes: dd83609b8898 ("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list")
> Tested-by: syzbot+f26d7c75c26ec19790e7@...kaller.appspotmail.com
> Signed-off-by: Deepanshu Kartikey <kartikey406@...il.com>
>
> ---
...
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 9c94ed8c3ab0..519497bc1045 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -474,29 +474,31 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> vma_interval_tree_foreach(vma, root, start, end ? end - 1 : ULONG_MAX) {
> unsigned long v_start;
> unsigned long v_end;
> + bool have_shareable_lock;
> + zap_flags_t local_flags = zap_flags;
>
> if (!hugetlb_vma_trylock_write(vma))
> continue;
> -
> +
> + have_shareable_lock = __vma_shareable_lock(vma);
> +
> /*
> - * Skip VMAs without shareable locks. Per the design in commit
> - * 40549ba8f8e0, these will be handled by remove_inode_hugepages()
> - * called after this function with proper locking.
> + * If we can't get the shareable lock, set ZAP_FLAG_NO_UNSHARE
> + * to skip PMD unsharing. We still proceed with unmapping to
> + * ensure pages are properly freed, which is critical for punch
> + * hole operations that expect immediate page freeing.
> */
> - if (!__vma_shareable_lock(vma))
> - goto skip;
> -
> + if (!have_shareable_lock)
> + local_flags |= ZAP_FLAG_NO_UNSHARE;
This is quite a head-spinning thing.
First of all, as David pointed out, that comment is misleading: it reads as
if __vma_shareable_lock() actually takes the lock, which is not true, so it
should be reworded.
Now, the thing is:
- Prior to commit dd83609b8898 ("hugetlbfs: skip VMAs without shareable
  locks in hugetlb_vmdelete_list"), we were unconditionally calling
  huge_pmd_unshare(), which asserted the vma lock even though we didn't
  hold it.
  My question would be: Mike's vma-lock addition happened back in 2022,
  so how is it that we didn't see this sooner? It should be rather easy
  to trigger. I'm a bit puzzled.
- Ok, since there's nothing to unshare, we skip the vma here and
  remove_inode_hugepages(), called afterwards, should take care of it.
  But that turns out to be troublesome, because on punch-hole operations
  pages don't get freed immediately.
- So instead, we just skip the unsharing operation and carry on with the
  unmapping/freeing in __unmap_hugepage_range() (see the sketch below).
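
The mm/hugetlb.c side isn't quoted above, but I assume the new flag ends up
gating the unshare attempt in the pte loop of __unmap_hugepage_range(),
roughly like this (just my reading of the description, untested):

	ptl = huge_pte_lock(h, mm, ptep);
	/*
	 * ZAP_FLAG_NO_UNSHARE (the flag this patch introduces) means the
	 * caller does not hold the shareable vma lock, so skip the unshare
	 * attempt but keep zapping and freeing the pages.
	 */
	if (!(zap_flags & ZAP_FLAG_NO_UNSHARE) &&
	    huge_pmd_unshare(mm, vma, address, ptep)) {
		spin_unlock(ptl);
		tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
		force_flush = true;
		address |= last_addr_mask;
		continue;
	}
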
I don't know, but to me it seems we're going to great lengths just to fix
an assertion.
So, the thing is: can't we check __vma_shareable_lock() in
__unmap_hugepage_range() and only call huge_pmd_unshare() if we need to?
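Something along these lines, i.e. the same spot in the pte loop, keyed off
the vma itself instead of a new zap flag (again just a sketch, untested):

	ptl = huge_pte_lock(h, mm, ptep);
	/*
	 * Without a shareable vma lock there can be no shared PMDs, so
	 * there is nothing to unshare and no lock to assert.
	 */
	if (__vma_shareable_lock(vma) &&
	    huge_pmd_unshare(mm, vma, address, ptep)) {
		spin_unlock(ptl);
		tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
		force_flush = true;
		address |= last_addr_mask;
		continue;
	}

That way hugetlb_vmdelete_list() wouldn't need a new zap flag at all.
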
--
Oscar Salvador
SUSE Labs