Message-ID: <c492b99e-b6fb-4da8-8055-1e93c6141a12@redhat.com>
Date: Wed, 3 Sep 2025 13:41:52 +0200
From: David Hildenbrand <david@...hat.com>
To: Kairui Song <kasong@...cent.com>, linux-mm@...ck.org
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Matthew Wilcox <willy@...radead.org>, Hugh Dickins <hughd@...gle.com>,
 Chris Li <chrisl@...nel.org>, Barry Song <baohua@...nel.org>,
 Baoquan He <bhe@...hat.com>, Nhat Pham <nphamcs@...il.com>,
 Kemeng Shi <shikemeng@...weicloud.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>,
 Ying Huang <ying.huang@...ux.alibaba.com>,
 Johannes Weiner <hannes@...xchg.org>, Yosry Ahmed <yosryahmed@...gle.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and
 switch API

On 22.08.25 21:20, Kairui Song wrote:
> From: Kairui Song <kasong@...cent.com>
> 
> Introduce the basic swap table infrastructure, which for now is just a
> fixed-size flat array inside each swap cluster, with access wrappers.
> 
> Each cluster contains a swap table of 512 entries. Each table entry is
> an opaque atomic long. It can be one of 3 types: a shadow type (XA_VALUE),
> a folio type (pointer), or NULL.
> 
> In this first step, it only supports storing a folio or shadow, and it
> is a drop-in replacement for the current swap cache. Convert all swap
> cache users to the new set of APIs. Chris Li has been suggesting a new
> infrastructure for the swap cache for better performance, and that idea
> combines well with the swap table as the new backing structure. The
> lock contention range is now reduced to 2M clusters, which is much
> smaller than the 64M address_space, and we can also drop the multiple
> address_space design.
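
For the record, the numbers above (assuming 4K pages):

	per cluster:       512 entries * 4K                        =  2M
	per address_space: SWAP_ADDRESS_SPACE_PAGES (1 << 14) * 4K = 64M
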
> 
> All the internal work is done with swap_cache_get_* helpers. Swap cache
> lookup is still lock-less as before, and the helpers' calling contexts
> are the same as the original swap cache helpers': they still require a
> pin on the swap device to prevent the backing data from being freed.
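
So a lookup would roughly keep the usual pattern (sketch on my side; I
didn't re-check swap_cache_get_folio()'s exact signature and refcount
semantics in this series):

	si = get_swap_device(entry);		/* pin the swap device */
	if (!si)
		return NULL;
	folio = swap_cache_get_folio(entry);	/* lock-less lookup */
	put_swap_device(si);
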
> 
> Swap cache updates are now protected by the swap cluster lock
> instead of the XArray lock. This is mostly handled internally, but the
> new __swap_cache_* helpers require the caller to lock the cluster, so a
> few new cluster access and locking helpers are also introduced.
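
IIUC, the update side then looks something like (sketch; the exact
names/signatures of the unlock and __swap_cache_del_folio() helpers are
my guess from the hunks below):

	ci = swap_cluster_lock_by_folio(folio);	/* caller takes the cluster lock */
	__swap_cache_del_folio(ci, folio, folio->swap, shadow);
	swap_cluster_unlock(ci);
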
> 
> A fully cluster-based unified swap table can be implemented on top
> of this to take care of all count tracking and synchronization work,
> with dynamic allocation. It should reduce memory usage while improving
> performance further.
> 
> Co-developed-by: Chris Li <chrisl@...nel.org>
> Signed-off-by: Chris Li <chrisl@...nel.org>
> Signed-off-by: Kairui Song <kasong@...cent.com>
> ---

[...]

> @@ -4504,7 +4504,7 @@ static void filemap_cachestat(struct address_space *mapping,
>   				 * invalidation, so there might not be
>   				 * a shadow in the swapcache (yet).
>   				 */
> -				shadow = get_shadow_from_swap_cache(swp);
> +				shadow = swap_cache_get_shadow(swp);
>   				if (!shadow)
>   					goto resched;
>   			}

This looks like a cleanup that can be performed separately upfront to 
make this patch smaller.

The same applies to the delete_from_swap_cache -> swap_cache_del_folio 
conversion.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2a47cd3bb649..209580d395a1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3721,7 +3721,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>   	/* Prevent deferred_split_scan() touching ->_refcount */
>   	spin_lock(&ds_queue->split_queue_lock);
>   	if (folio_ref_freeze(folio, 1 + extra_pins)) {
> -		struct address_space *swap_cache = NULL;
> +		struct swap_cluster_info *swp_ci = NULL;

I'm wondering if we could also perform this change upfront, so we can ...


>   		struct lruvec *lruvec;
>   		int expected_refs;
>   
> @@ -3765,8 +3765,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>   				goto fail;
>   			}
>   
> -			swap_cache = swap_address_space(folio->swap);
> -			xa_lock(&swap_cache->i_pages);
> +			swp_ci = swap_cluster_lock_by_folio(folio);

... perform these cleanups outside of the main patch. Just a thought.


Because this patch is rather big and touches quite a lot of code, it is 
hard to review.

-- 
Cheers

David / dhildenb

