Message-ID: <f7d72dd8-00c4-4731-99d1-9faaca4275ba@linux.alibaba.com>
Date: Thu, 17 Jul 2025 16:46:47 +0800
From: Baolin Wang <baolin.wang@...ux.alibaba.com>
To: Hugh Dickins <hughd@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>
Cc: Baoquan He <bhe@...hat.com>, Barry Song <21cnbao@...il.com>,
 Chris Li <chrisl@...nel.org>, David Rientjes <rientjes@...gle.com>,
 Kairui Song <ryncsn@...il.com>, Kemeng Shi <shikemeng@...weicloud.com>,
 Shakeel Butt <shakeel.butt@...ux.dev>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org
Subject: Re: [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not
 mutex) much less



On 2025/7/16 16:05, Hugh Dickins wrote:
> A flamegraph (from an MGLRU load) showed shmem_writeout()'s use of the
> global shmem_swaplist_mutex worryingly hot: improvement is long overdue.
> 
> 3.1 commit 6922c0c7abd3 ("tmpfs: convert shmem_writepage and enable swap")
> apologized for extending shmem_swaplist_mutex across add_to_swap_cache(),
> and hoped to find another way: yes, there may be lots of work to allocate
> radix tree nodes in there.  Then 6.15 commit b487a2da3575 ("mm, swap:
> simplify folio swap allocation") will have made it worse, by moving
> shmem_writeout()'s swap allocation under that mutex too (but the worrying
> flamegraph was observed even before that change).
> 
> There's a useful comment about pagelock no longer protecting from eviction
> once moved to swap cache: but that protection actually holds until
> shmem_delete_from_page_cache() replaces the page pointer by the swap entry,
> so move the swaplist add between those two points.
> 
> We would much prefer to take the global lock once per inode rather than
> once per page: given the possible races with shmem_unuse() pruning when
> !swapped (and other tasks racing to swap other pages out or in), try the
> swaplist add whenever swapped was incremented from 0 (but the inode may
> already be on the list - only unuse and evict bother to remove it).
> 
> This technique is more subtle than it looks (we're avoiding the very lock
> which would make it easy), but it works: an unlocked list_empty() check,
> by contrast, would run the risk of the inode being unqueued and left off
> the swaplist forever, with swapoff then only completing when the page is
> faulted in or removed.
> 
> The need for a sleepable mutex went away in 5.1 commit b56a2d8af914
> ("mm: rid swapoff of quadratic complexity"): a spinlock works better now.
> 
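To make that 0 -> 1 detection concrete, below is a minimal user-space model
of the pattern (plain pthreads; all names are invented, not the kernel code):
the per-object count is updated under that object's own lock, and the global
list lock is taken only on the increment that raised the count from 0,
tolerating the case where the object is already queued.

/*
 * Minimal user-space model of the "queue only when swapped goes 0 -> 1"
 * pattern.  Illustrative only; names are invented, this is not kernel code.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct obj {
	pthread_mutex_t lock;	/* models the per-inode info->lock */
	long swapped;		/* models info->swapped */
	struct obj *next;	/* models the info->swaplist linkage */
	bool queued;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER; /* global lock */
static struct obj *list_head;

/* Bump the per-object count; report whether this was the 0 -> 1 transition. */
static bool obj_add_swapped(struct obj *o, long nr)
{
	bool first;

	pthread_mutex_lock(&o->lock);
	first = (o->swapped == 0);
	o->swapped += nr;
	pthread_mutex_unlock(&o->lock);
	return first;
}

/*
 * Take the global lock only for the first transition; the object may
 * already be queued (only unuse/evict style paths ever remove it).
 */
static void obj_swapout(struct obj *o, long nr)
{
	if (obj_add_swapped(o, nr)) {
		pthread_mutex_lock(&list_lock);
		if (!o->queued) {
			o->next = list_head;
			list_head = o;
			o->queued = true;
		}
		pthread_mutex_unlock(&list_lock);
	}
}

int main(void)
{
	struct obj o = { .lock = PTHREAD_MUTEX_INITIALIZER };

	obj_swapout(&o, 1);	/* 0 -> 1: takes list_lock and queues */
	obj_swapout(&o, 1);	/* already swapped: global lock untouched */
	printf("swapped=%ld queued=%d\n", o.swapped, o.queued);
	return 0;
}

As in the patch's intent, the first transition is detected under the
finer-grained per-object lock, so the global lock is taken roughly once
per inode rather than once per page.
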
> This commit is certain to take shmem_swaplist_mutex out of contention,
> and has been seen to make a practical improvement (but there is likely
> to have been an underlying issue which made its contention so visible).
> 
> Signed-off-by: Hugh Dickins <hughd@...gle.com>

Makes sense to me. Thanks.
Reviewed-by: Baolin Wang <baolin.wang@...ux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@...ux.alibaba.com>

> ---
>   mm/shmem.c | 59 ++++++++++++++++++++++++++++++------------------------
>   1 file changed, 33 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 60247dc48505..33675361031b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -292,7 +292,7 @@ bool vma_is_shmem(struct vm_area_struct *vma)
>   }
>   
>   static LIST_HEAD(shmem_swaplist);
> -static DEFINE_MUTEX(shmem_swaplist_mutex);
> +static DEFINE_SPINLOCK(shmem_swaplist_lock);
>   
>   #ifdef CONFIG_TMPFS_QUOTA
>   
> @@ -432,10 +432,13 @@ static void shmem_free_inode(struct super_block *sb, size_t freed_ispace)
>    *
>    * But normally   info->alloced == inode->i_mapping->nrpages + info->swapped
>    * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped)
> + *
> + * Return: true if swapped was incremented from 0, for shmem_writeout().
>    */
> -static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
> +static bool shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
>   {
>   	struct shmem_inode_info *info = SHMEM_I(inode);
> +	bool first_swapped = false;
>   	long freed;
>   
>   	spin_lock(&info->lock);
> @@ -450,8 +453,11 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
>   	 * to stop a racing shmem_recalc_inode() from thinking that a page has
>   	 * been freed.  Compensate here, to avoid the need for a followup call.
>   	 */
> -	if (swapped > 0)
> +	if (swapped > 0) {
> +		if (info->swapped == swapped)
> +			first_swapped = true;
>   		freed += swapped;
> +	}
>   	if (freed > 0)
>   		info->alloced -= freed;
>   	spin_unlock(&info->lock);
> @@ -459,6 +465,7 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
>   	/* The quota case may block */
>   	if (freed > 0)
>   		shmem_inode_unacct_blocks(inode, freed);
> +	return first_swapped;
>   }
>   
>   bool shmem_charge(struct inode *inode, long pages)
> @@ -1399,11 +1406,11 @@ static void shmem_evict_inode(struct inode *inode)
>   			/* Wait while shmem_unuse() is scanning this inode... */
>   			wait_var_event(&info->stop_eviction,
>   				       !atomic_read(&info->stop_eviction));
> -			mutex_lock(&shmem_swaplist_mutex);
> +			spin_lock(&shmem_swaplist_lock);
>   			/* ...but beware of the race if we peeked too early */
>   			if (!atomic_read(&info->stop_eviction))
>   				list_del_init(&info->swaplist);
> -			mutex_unlock(&shmem_swaplist_mutex);
> +			spin_unlock(&shmem_swaplist_lock);
>   		}
>   	}
>   
> @@ -1526,7 +1533,7 @@ int shmem_unuse(unsigned int type)
>   	if (list_empty(&shmem_swaplist))
>   		return 0;
>   
> -	mutex_lock(&shmem_swaplist_mutex);
> +	spin_lock(&shmem_swaplist_lock);
>   start_over:
>   	list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
>   		if (!info->swapped) {
> @@ -1540,12 +1547,12 @@ int shmem_unuse(unsigned int type)
>   		 * (igrab() would protect from unlink, but not from unmount).
>   		 */
>   		atomic_inc(&info->stop_eviction);
> -		mutex_unlock(&shmem_swaplist_mutex);
> +		spin_unlock(&shmem_swaplist_lock);
>   
>   		error = shmem_unuse_inode(&info->vfs_inode, type);
>   		cond_resched();
>   
> -		mutex_lock(&shmem_swaplist_mutex);
> +		spin_lock(&shmem_swaplist_lock);
>   		if (atomic_dec_and_test(&info->stop_eviction))
>   			wake_up_var(&info->stop_eviction);
>   		if (error)
> @@ -1556,7 +1563,7 @@ int shmem_unuse(unsigned int type)
>   		if (!info->swapped)
>   			list_del_init(&info->swaplist);
>   	}
> -	mutex_unlock(&shmem_swaplist_mutex);
> +	spin_unlock(&shmem_swaplist_lock);
>   
>   	return error;
>   }
> @@ -1646,30 +1653,30 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>   		folio_mark_uptodate(folio);
>   	}
>   
> -	/*
> -	 * Add inode to shmem_unuse()'s list of swapped-out inodes,
> -	 * if it's not already there.  Do it now before the folio is
> -	 * moved to swap cache, when its pagelock no longer protects
> -	 * the inode from eviction.  But don't unlock the mutex until
> -	 * we've incremented swapped, because shmem_unuse_inode() will
> -	 * prune a !swapped inode from the swaplist under this mutex.
> -	 */
> -	mutex_lock(&shmem_swaplist_mutex);
> -	if (list_empty(&info->swaplist))
> -		list_add(&info->swaplist, &shmem_swaplist);
> -
>   	if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
> -		shmem_recalc_inode(inode, 0, nr_pages);
> +		bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
> +
> +		/*
> +		 * Add inode to shmem_unuse()'s list of swapped-out inodes,
> +		 * if it's not already there.  Do it now before the folio is
> +		 * removed from page cache, when its pagelock no longer
> +		 * protects the inode from eviction.  And do it now, after
> +		 * we've incremented swapped, because shmem_unuse() will
> +		 * prune a !swapped inode from the swaplist.
> +		 */
> +		if (first_swapped) {
> +			spin_lock(&shmem_swaplist_lock);
> +			if (list_empty(&info->swaplist))
> +				list_add(&info->swaplist, &shmem_swaplist);
> +			spin_unlock(&shmem_swaplist_lock);
> +		}
> +
>   		swap_shmem_alloc(folio->swap, nr_pages);
>   		shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
>   
> -		mutex_unlock(&shmem_swaplist_mutex);
>   		BUG_ON(folio_mapped(folio));
>   		return swap_writeout(folio, plug);
>   	}
> -	if (!info->swapped)
> -		list_del_init(&info->swaplist);
> -	mutex_unlock(&shmem_swaplist_mutex);
>   	if (nr_pages > 1)
>   		goto try_split;
>   redirty:

