Message-ID: <PH7PR11MB812152287E850FA7CBAE2FD5C9E4A@PH7PR11MB8121.namprd11.prod.outlook.com>
Date: Fri, 3 Oct 2025 19:10:49 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "nphamcs@...il.com" <nphamcs@...il.com>,
"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"senozhatsky@...omium.org" <senozhatsky@...omium.org>, "sj@...nel.org"
<sj@...nel.org>, "kasong@...cent.com" <kasong@...cent.com>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
"Gomes, Vinicius" <vinicius.gomes@...el.com>, "Feghali, Wajdi K"
<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v12 22/23] mm: zswap: zswap_store() will process a large
folio in batches.
> > -----Original Message-----
> > From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> > Sent: Wednesday, October 1, 2025 9:19 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > hannes@...xchg.org; nphamcs@...il.com; chengming.zhou@...ux.dev;
> > usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> > ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> > senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> > crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> > davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> > ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> > <kristen.c.accardi@...el.com>; Gomes, Vinicius
> <vinicius.gomes@...el.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> > <vinodh.gopal@...el.com>
> > Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> > large folio in batches.
[...]
> >
> > > + */
> > > +static __always_inline int zswap_entries_cache_alloc_batch(void
> > **entries,
> > > + unsigned int
> > nr_entries,
> > > + gfp_t gfp)
> > > +{
> > > + return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries,
> > entries);
> >
> > We currently use kmem_cache_alloc_node() in zswap_entry_cache_alloc()
> to
> > allocate the entry on the same node as the compressed page. We use
> > entry_to_nid() to get the node for LRU operations.
> >
> > This breaks that assumption.
>
> You bring up a good point. I was looking at the code in slub.c and my
> understanding thus far is that both bulk allocations and
> kmem_cache_alloc_node() allocations are made from a per-CPU "cpu_slab"
> that is allocated by SLUB.
>
> IIUC, the concern you are raising is that in the mainline, the entry is
> allocated on the same node as the compressed page and gets added to the
> LRU list of that node. IOW, the node to which the compressed page belongs
> is the one to whose LRU the entry will be added.
>
> With this patch, with kmem_cache_alloc_bulk(), the entry will be created on
> the per-CPU slab of the CPU on which zswap_store() is called and will be
> added to the LRU of that per-CPU slab's NUMA node. Hence, the end result
> could potentially be that the zswap_entry for a page ends up on a
> different NUMA node/memcg than the page's NUMA node.
>
> This is my thinking as to how this will impact the zswap shrinker:
>
> 1) memcg shrinker: if the memcg the entry ends up in is on the zswap_list_lru,
> the entry will be written back.
> 2) Global shrinker: will cycle through all memcg's that have pages in the
> zswap_list_lru, and the entry will be written back.
>
> Based on this, it is not clear to me if there is a problem, and would like to
> request you, Nhat and others to provide insights as well.
>
> Interestingly, most of the code in slub.c has
> unlikely(!node_match(slab, node)). Does this imply some higher-level mm
> requirement on which node slab objects are allocated from?
>
> I am Ok with just calling zswap_entry_cache_alloc() for "nr_pages" if we
> think this would be more correct.
I wanted to share some updates and summarize my understanding after
looking some more at slub.c. The "zswap_entry_cache" kmem_cache has
per-node slab memory. The main change with v12 is:
before:
  The page being compressed and its zswap_entry are in the same memcg, and
  thus implicitly on the same slab node.
v12:
  The page being compressed and its zswap_entry could be in different
  memcgs if the process that owns the page gets migrated to a different
  node.
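
For reference, this is my reading of how the LRU node/memcg is derived in
mm/zswap.c; a simplified paraphrase for discussion, not the verbatim code
(RCU locking, refcounting and the protection accounting are elided):

  static int entry_to_nid(struct zswap_entry *entry)
  {
          /* Node of the slab page backing the entry struct itself. */
          return page_to_nid(virt_to_page(entry));
  }

  static void zswap_lru_add(struct list_lru *list_lru,
                            struct zswap_entry *entry)
  {
          int nid = entry_to_nid(entry);
          struct mem_cgroup *memcg = mem_cgroup_from_entry(entry);

          /* The entry lands on the LRU of the node its struct came from. */
          list_lru_add(list_lru, &entry->lru, nid, memcg);
  }

So the LRU nid follows wherever the entry struct was allocated, which with
kmem_cache_alloc_bulk() is the node of the CPU running zswap_store() rather
than the page's node.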
Impact on the zswap LRU list vis-à-vis the memcg shrinker:
before:
  The memcg [hierarchy] on the page's original NUMA node would need to face
  memory pressure for the page's zswap_entry to be written back. Hence it is
  possible that the node on which the process is currently running may not
  see the benefit of the memory being freed up.
v12:
  The memcg on the node where the process was running when the page was
  compressed would have to face memory pressure for the page's zswap_entry
  to be written back. That node will see the benefit of the memory freed up
  by writeback. If the process has since migrated to a different node than
  the one it was running on when the page was compressed, we have the same
  issue as "before".
Is my understanding correct? Please let me know if I am missing something.
The bulk allocation does improve batching performance, especially with 64K folios:
kernel_compile, 64K folios, deflate-iaa:
========================================
         mm-unstable-9-18-2025        v12   v12 without bulk alloc/free
-----------------------------------------------------------------------
real_sec                874.74     821.59                        870.82
sys_sec               3,834.35   3,791.12                      3,794.06
-----------------------------------------------------------------------

kernel_compile, 64K folios, zstd:
=================================
         mm-unstable-9-18-2025        v12   v12 without bulk alloc/free
-----------------------------------------------------------------------
real_sec                925.08     853.14                        899.40
sys_sec               5,318.65   5,172.23                      5,415.20
-----------------------------------------------------------------------

kernel_compile, PMD folios, deflate-iaa:
========================================
         mm-unstable-9-18-2025        v12   v12 without bulk alloc/free
-----------------------------------------------------------------------
real_sec                808.10     794.85                        809.33
sys_sec               4,351.01   4,266.95                      4,169.07
-----------------------------------------------------------------------

kernel_compile, PMD folios, zstd:
=================================
         mm-unstable-9-18-2025        v12   v12 without bulk alloc/free
-----------------------------------------------------------------------
real_sec                848.06     845.42                        836.06
sys_sec               5,898.58   5,741.31                      5,795.75
-----------------------------------------------------------------------
Based on this, and if my understanding of the v12 change's impact on the
zswap shrinker is correct, I am not sure if the change in the assumption
is necessarily a concern.
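
For completeness, the node-preserving alternative I mentioned earlier would
simply drop the kmem_cache_alloc_bulk() call and always take the per-entry
path that this patch already uses as the fallback, roughly (untested sketch
against the helpers and labels in this patch):

  /* Allocate every entry on the folio's node, as mainline does today. */
  for (i = 0; i < nr_pages; ++i) {
          entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);

          if (unlikely(!entries[i])) {
                  zswap_reject_kmemcache_fail++;
                  /* Only entries[0 .. i-1] need to be freed. */
                  nr_pages = i;
                  goto store_pages_failed;
          }
  }

That corresponds to the "without bulk alloc/free" column in the data above,
i.e. we would give up some of the batching gains in order to keep the entry
on the page's node.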
>
> >
> > > +}
> > > +
> > > +static __always_inline void zswap_entries_cache_free_batch(void
> > **entries,
> > > + unsigned int
> > nr_entries)
> > > +{
> > > + kmem_cache_free_bulk(zswap_entry_cache, nr_entries, entries);
> > > +}
[...]
> > > +static bool zswap_store_pages(struct folio *folio,
> > > + long start,
> > > + long end,
> > > + struct obj_cgroup *objcg,
> > > + struct zswap_pool *pool,
> > > + int node_id,
> > > + bool folio_wb)
> > > {
> > > - swp_entry_t page_swpentry = page_swap_entry(page);
> > > - struct zswap_entry *entry, *old;
> > > -
> > > - /* allocate entry */
> > > - entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> > > - if (!entry) {
> > > - zswap_reject_kmemcache_fail++;
> > > - return false;
> > > + struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> > > + u8 i, store_fail_idx = 0, nr_pages = end - start;
> > > +
> > > + VM_WARN_ON_ONCE(nr_pages > ZSWAP_MAX_BATCH_SIZE);
> > > +
> > > + if (unlikely(!zswap_entries_cache_alloc_batch((void **)&entries[0],
> > > + nr_pages, GFP_KERNEL)))
> > {
> > > + for (i = 0; i < nr_pages; ++i) {
> > > + entries[i] = zswap_entry_cache_alloc(GFP_KERNEL,
> > node_id);
> > > +
> > > + if (unlikely(!entries[i])) {
> > > + zswap_reject_kmemcache_fail++;
> > > + /*
> > > + * While handling this error, we only need to
> > > + * call zswap_entries_cache_free_batch() for
> > > + * entries[0 .. i-1].
> > > + */
> > > + nr_pages = i;
> > > + goto store_pages_failed;
> > > + }
> >
> > Is it okay to use kmem_cache_free_bulk() to free slab objects that were
> > not allocated with kmem_cache_alloc_bulk()?
>
> I recall verifying that it should be Ok, but can check again.
>
I verified the code again, and yes, it is OK for slab objects allocated by
either kmem_cache_alloc_bulk() or kmem_cache_alloc_node() to be freed by
calling kmem_cache_free_bulk(). kmem_cache_free_bulk() opportunistically
builds freelists from the objects passed in and then "bulk transfers" them
to their slabs' freelists, with less synchronization overhead than calling
kmem_cache_free() per object.
I verified this with usemem30 with the following code change in
zswap_store_pages(), and saw no issues with deflate-iaa/zstd:
  for each of nr_pages: kmem_cache_alloc_node()
  kmem_cache_free_bulk()
  kmem_cache_alloc_bulk()
  kmem_cache_free_bulk()
  kmem_cache_alloc_bulk()
  [proceed]
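
Expressed against the helpers in this patch, that throwaway test sequence in
zswap_store_pages() looked roughly like this (error handling omitted for
brevity; not intended for the actual patch):

  /* 1) Per-node allocations, freed through the bulk API. */
  for (i = 0; i < nr_pages; ++i)
          entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
  zswap_entries_cache_free_batch((void **)&entries[0], nr_pages);

  /* 2) Bulk allocation, freed through the bulk API. */
  zswap_entries_cache_alloc_batch((void **)&entries[0], nr_pages, GFP_KERNEL);
  zswap_entries_cache_free_batch((void **)&entries[0], nr_pages);

  /* 3) Bulk allocation of the entries actually used for the store. */
  zswap_entries_cache_alloc_batch((void **)&entries[0], nr_pages, GFP_KERNEL);
  /* [proceed with the normal store path] */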
Yosry, please let me know how I should proceed, and whether there are
other experiments I can run to verify the bulk alloc/free change and
address any concerns.
Thanks,
Kanchana