Message-ID: <PH7PR11MB81218BA4B8D8CC5F57570455C9EAA@PH7PR11MB8121.namprd11.prod.outlook.com>
Date: Mon, 13 Oct 2025 17:47:13 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "nphamcs@...il.com" <nphamcs@...il.com>,
"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"senozhatsky@...omium.org" <senozhatsky@...omium.org>, "sj@...nel.org"
<sj@...nel.org>, "kasong@...cent.com" <kasong@...cent.com>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
"Gomes, Vinicius" <vinicius.gomes@...el.com>, "Feghali, Wajdi K"
<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v12 22/23] mm: zswap: zswap_store() will process a large
folio in batches.

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Sent: Friday, October 3, 2025 12:11 PM
> To: Yosry Ahmed <yosry.ahmed@...ux.dev>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; nphamcs@...il.com; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> <kristen.c.accardi@...el.com>; Gomes, Vinicius <vinicius.gomes@...el.com>;
> Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> <vinodh.gopal@...el.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@...el.com>
> Subject: RE: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> large folio in batches.
>
>
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> > > Sent: Wednesday, October 1, 2025 9:19 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > > Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > > hannes@...xchg.org; nphamcs@...il.com;
> chengming.zhou@...ux.dev;
> > > usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> > > ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> > > senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> > > crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> > > davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> > > ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> > > <kristen.c.accardi@...el.com>; Gomes, Vinicius
> > <vinicius.gomes@...el.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> > > <vinodh.gopal@...el.com>
> > > Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> > > large folio in batches.
> [...]
> > >
> > > > + */
> > > > +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> > > > +                                                           unsigned int nr_entries,
> > > > +                                                           gfp_t gfp)
> > > > +{
> > > > +        return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
> > >
> > > We currently use kmem_cache_alloc_node() in
> zswap_entry_cache_alloc()
> > to
> > > allocate the entry on the same node as the compressed page. We use
> > > entry_to_nid() to get the node for LRU operations.
> > >
> > > This breaks that assumption.
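[For context as I reason about this below: the mainline pattern being
referred to is roughly the following, paraphrased from mm/zswap.c (the
exact code may differ slightly across trees):

static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
{
        return kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
}

static int entry_to_nid(struct zswap_entry *entry)
{
        /* The LRU code derives the node from the page backing the entry. */
        return page_to_nid(virt_to_page(entry));
}

i.e., entry_to_nid() reports the node the entry object actually landed on.
With kmem_cache_alloc_node() that tracks the compressed page's node,
whereas kmem_cache_alloc_bulk() takes no node hint.]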
> >
> > You bring up a good point. I was looking at the code in slub.c, and my
> > understanding thus far is that both bulk allocations and
> > kmem_cache_alloc_node() allocations are made from a per-CPU "cpu_slab"
> > that SLUB allocates.
> >
> > IIUC, the concern you are raising is that in the mainline, the entry is
> > allocated on the same node as the compressed page and gets added to that
> > node's LRU list. IOW, the entry is added to the LRU of the node to which
> > the compressed page belongs.
> >
> > With this patch, kmem_cache_alloc_bulk() will create the entry on the
> > per-CPU slab of the CPU on which zswap_store() is called, and the entry
> > will be added to the LRU of that per-CPU slab's NUMA node. Hence, the end
> > result could be that a page's zswap_entry ends up on a different NUMA
> > node/memcg than the page's own NUMA node.
> >
> > This is my thinking as to how this will impact the zswap shrinker:
> >
> > 1) memcg shrinker: if the memcg the entry ends up in is on the
> >    zswap_list_lru, the entry will be written back.
> > 2) Global shrinker: will cycle through all memcgs that have pages in the
> >    zswap_list_lru, and the entry will be written back.
> >
> > Based on this, it is not clear to me if there is a problem, and I would
> > like to request you, Nhat and others to provide insights as well.
> >
> > Interestingly, most of the code in slub.c has
> > unlikely(!node_match(slab, node)). Does this imply some higher-level mm
> > slab allocation requirement?
> >
> > I am Ok with just calling zswap_entry_cache_alloc() for "nr_pages" if we
> > think this would be more correct.
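If we decide that per-page, node-aware allocation is the more correct
option, a minimal sketch of what I have in mind is below
(zswap_alloc_entries_on_node() is a hypothetical helper name used for
illustration, not part of the patch):

/*
 * Allocate each entry individually on @node_id so that entry_to_nid()
 * continues to match the compressed page's node. On failure, release
 * the partially allocated objects with the bulk free API.
 */
static int zswap_alloc_entries_on_node(struct zswap_entry **entries,
                                       unsigned int nr_entries, int node_id)
{
        unsigned int i;

        for (i = 0; i < nr_entries; i++) {
                entries[i] = kmem_cache_alloc_node(zswap_entry_cache,
                                                   GFP_KERNEL, node_id);
                if (unlikely(!entries[i])) {
                        zswap_reject_kmemcache_fail++;
                        kmem_cache_free_bulk(zswap_entry_cache, i,
                                             (void **)entries);
                        return -ENOMEM;
                }
        }

        return 0;
}

This would keep the node assumption intact, at the cost of giving up the
bulk-allocation fast path.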
>
> I wanted to share some updates and summarize my understanding from
> looking some more at slub.c. The "zswap_entry_cache" kmem_cache has slab
> memory created for each node. The main change with v12 is:
>
> before:
> The page being compressed and its zswap_entry are in the same memcg, and
> thus implicitly on the same slab node.
>
> v12:
> The page being compressed and its zswap_entry could be in different
> memcgs if the process that owns the page gets migrated to a different
> node.
>
> Impact to the zswap LRU list vis-à-vis the memcg shrinker:
>
> before:
> The original NUMA node's memcg [hierarchy] would need to face memory
> pressure for the page's zswap_entry to be written back. Hence it is
> possible that the node on which the process is currently running may not
> see the benefit of memory being freed up.
>
> v12:
> The memcg on the node on which the process was running when the page was
> compressed would have to face memory pressure for the page's zswap_entry
> to be written back. That node will see the benefit of the memory freed up
> by writeback. If the process has since migrated to a different node than
> the one it was running on when the page was compressed, we have the same
> issue as "before".
>
> Is my understanding correct? Please let me know if I am missing something.
>
> The bulk allocation does improve batching performance, especially with 64K
> folios:
>
> kernel_compile, 64K folios, deflate-iaa:
> ========================================
>             mm-unstable-9-18-2025        v12   without bulk alloc/free
> ----------------------------------------------------------------------
> real_sec                   874.74     821.59                    870.82
> sys_sec                  3,834.35   3,791.12                  3,794.06
> ----------------------------------------------------------------------
>
> kernel_compile, 64K folios, zstd:
> =================================
>             mm-unstable-9-18-2025        v12   without bulk alloc/free
> ----------------------------------------------------------------------
> real_sec                   925.08     853.14                    899.40
> sys_sec                  5,318.65   5,172.23                  5,415.20
> ----------------------------------------------------------------------
>
> kernel_compile, PMD folios, deflate-iaa:
> ========================================
>             mm-unstable-9-18-2025        v12   without bulk alloc/free
> ----------------------------------------------------------------------
> real_sec                   808.10     794.85                    809.33
> sys_sec                  4,351.01   4,266.95                  4,169.07
> ----------------------------------------------------------------------
>
> kernel_compile, PMD folios, zstd:
> =================================
>             mm-unstable-9-18-2025        v12   without bulk alloc/free
> ----------------------------------------------------------------------
> real_sec                   848.06     845.42                    836.06
> sys_sec                  5,898.58   5,741.31                  5,795.75
> ----------------------------------------------------------------------
>
> Based on this, and assuming my understanding of the v12 change's impact
> on the zswap shrinker is correct, I am not sure the change in this
> assumption is necessarily a concern.
>
> >
> > >
> > > > +}
> > > > +
> > > > +static __always_inline void zswap_entries_cache_free_batch(void **entries,
> > > > +                                                            unsigned int nr_entries)
> > > > +{
> > > > +        kmem_cache_free_bulk(zswap_entry_cache, nr_entries, entries);
> > > > +}
> [...]
> > > > +static bool zswap_store_pages(struct folio *folio,
> > > > +                              long start,
> > > > +                              long end,
> > > > +                              struct obj_cgroup *objcg,
> > > > +                              struct zswap_pool *pool,
> > > > +                              int node_id,
> > > > +                              bool folio_wb)
> > > >  {
> > > > -        swp_entry_t page_swpentry = page_swap_entry(page);
> > > > -        struct zswap_entry *entry, *old;
> > > > -
> > > > -        /* allocate entry */
> > > > -        entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
> > > > -        if (!entry) {
> > > > -                zswap_reject_kmemcache_fail++;
> > > > -                return false;
> > > > +        struct zswap_entry *entries[ZSWAP_MAX_BATCH_SIZE];
> > > > +        u8 i, store_fail_idx = 0, nr_pages = end - start;
> > > > +
> > > > +        VM_WARN_ON_ONCE(nr_pages > ZSWAP_MAX_BATCH_SIZE);
> > > > +
> > > > +        if (unlikely(!zswap_entries_cache_alloc_batch((void **)&entries[0],
> > > > +                                                      nr_pages, GFP_KERNEL))) {
> > > > +                for (i = 0; i < nr_pages; ++i) {
> > > > +                        entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);
> > > > +
> > > > +                        if (unlikely(!entries[i])) {
> > > > +                                zswap_reject_kmemcache_fail++;
> > > > +                                /*
> > > > +                                 * While handling this error, we only need to
> > > > +                                 * call zswap_entries_cache_free_batch() for
> > > > +                                 * entries[0 .. i-1].
> > > > +                                 */
> > > > +                                nr_pages = i;
> > > > +                                goto store_pages_failed;
> > > > +                        }
> > >
> > > Is it okay to use kmem_cache_free_bulk() to free slab objects that were
> > > not allocated with kmem_cache_alloc_bulk()?
> >
> > I recall verifying that it should be Ok, but can check again.
> >
>
> I verified the code again, and yes, it is Ok for slab objects allocated by
> either kmem_cache_alloc_bulk() or kmem_cache_alloc_node() to be freed by
> calling kmem_cache_free_bulk(). kmem_cache_free_bulk() opportunistically
> builds freelists from the objects passed in and "bulk transfers" them to
> the slabs' freelists, with less synchronization overhead than calling
> kmem_cache_free() per object.
>
> I verified this by running usemem30 with the following code change in
> zswap_store_pages(), and saw no issues with deflate-iaa/zstd:
>
> for each of nr_pages: kmem_cache_alloc_node()
>
> kmem_cache_free_bulk()
>
> kmem_cache_alloc_bulk()
>
> kmem_cache_free_bulk()
>
> kmem_cache_alloc_bulk()
>
> [proceed]
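To make the above sequence concrete, the throwaway verification hack
inside zswap_store_pages() was along these lines, using the patch's own
wrappers (error handling for the per-page allocations omitted for brevity;
this is not part of v12):

        /* 1) Allocate each entry individually on node_id ... */
        for (i = 0; i < nr_pages; ++i)
                entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, node_id);

        /* 2) ... and free the node-allocated objects with the bulk free API. */
        zswap_entries_cache_free_batch((void **)entries, nr_pages);

        /* 3) Allocate and free the same count via the bulk APIs. */
        if (zswap_entries_cache_alloc_batch((void **)entries, nr_pages,
                                            GFP_KERNEL))
                zswap_entries_cache_free_batch((void **)entries, nr_pages);

        /* 4) Do the final bulk allocation that the store path actually
         *    uses (falling back to the per-page path on failure, as in
         *    the patch).
         */
        if (unlikely(!zswap_entries_cache_alloc_batch((void **)entries,
                                                      nr_pages, GFP_KERNEL))) {
                /* ... existing per-page fallback ... */
        }

        /* [proceed with the rest of zswap_store_pages()] */

In each combination, objects allocated with kmem_cache_alloc_node() were
freed by kmem_cache_free_bulk() without issue.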
>
> Yosry, please let me know how I should proceed, and whether there are
> other experiments I can run to verify this change to bulk alloc/free and
> address any remaining concerns.
Hi Yosry,

Just wanted to check in to get your suggestions on next steps as I start
working on v13.

Thanks for taking the time to provide code review comments!

Best regards,
Kanchana
>
> Thanks,
> Kanchana