Message-ID: <xnwqi5jnawvxdciqapfhhkneynsdr3dx36hmqe7jn4shm3uc3y@iyi5qqfubqey>
Date: Tue, 14 Oct 2025 15:29:49 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org" <hannes@...xchg.org>, 
	"nphamcs@...il.com" <nphamcs@...il.com>, "chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>, 
	"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com" <ryan.roberts@....com>, 
	"21cnbao@...il.com" <21cnbao@...il.com>, "ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>, 
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "senozhatsky@...omium.org" <senozhatsky@...omium.org>, 
	"sj@...nel.org" <sj@...nel.org>, "kasong@...cent.com" <kasong@...cent.com>, 
	"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>, "herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>, 
	"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com" <clabbe@...libre.com>, 
	"ardb@...nel.org" <ardb@...nel.org>, "ebiggers@...gle.com" <ebiggers@...gle.com>, 
	"surenb@...gle.com" <surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>, 
	"Gomes, Vinicius" <vinicius.gomes@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>, 
	"Gopal, Vinodh" <vinodh.gopal@...el.com>
Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a large
 folio in batches.

[..]
> > > @@ -158,6 +161,8 @@ struct zswap_pool {
> > >  	struct work_struct release_work;
> > >  	struct hlist_node node;
> > >  	char tfm_name[CRYPTO_MAX_ALG_NAME];
> > > +	u8 compr_batch_size;
> > > +	u8 store_batch_size;
> > 
> > I don't think we need to store store_batch_size, seems trivial to
> > calculate at store time (perhaps in a helper).
> > 
> > Taking a step back, is there any benefit to limiting store_batch_size to
> > compr_batch_size? Is there a disadvantage to using ZSWAP_MAX_BATCH_SIZE
> > even if it's higher than the HW compression batch size?
> 
> Thanks, Yosry, for the code review comments. I had a discussion with
> Barry earlier on these very same topics as a follow-up to his review
> comments for v11, starting with [1]. Can you please go through the
> rationale for these design choices and let me know if you have any
> questions:
> 
> [1]: https://patchwork.kernel.org/comment/26530319/

I am surprised that calculating the value in zswap_store() causes a
regression, but I am fine with keeping the precalculation in this case.
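
For reference, something along these lines is what I mean by calculating it
in a helper at store time. The capping policy below (use the HW compressor
batch size when batch compression is available, otherwise fall back to
ZSWAP_MAX_BATCH_SIZE) is only my reading of the linked discussion, and the
helper name is made up for illustration:

static inline unsigned int zswap_store_batch_size(struct zswap_pool *pool)
{
	/*
	 * Assumed policy, not necessarily what the patch does: batch stores
	 * at the HW compressor batch size when batch compression is
	 * available, otherwise use the generic ZSWAP_MAX_BATCH_SIZE for
	 * sequential (software) compression.
	 */
	if (pool->compr_batch_size > 1)
		return pool->compr_batch_size;

	return ZSWAP_MAX_BATCH_SIZE;
}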

I think there's a bigger problem here though; more below.

> > > + */
> > > +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> > > +							   unsigned int nr_entries,
> > > +							   gfp_t gfp)
> > > +{
> > > +	return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
> > 
> > We currently use kmem_cache_alloc_node() in zswap_entry_cache_alloc() to
> > allocate the entry on the same node as the compressed page. We use
> > entry_to_nid() to get the node for LRU operations.
> > 
> > This breaks that assumption.
> 
> You bring up a good point. I was looking at the code in slub.c and my
> understanding thus far is that both bulk allocations and kmem_cache_alloc_node()
> allocations are made from a per-CPU "cpu_slab" that is allocated by SLUB.
> 
> IIUC, the concern you are raising is that in the mainline, the entry is
> allocated on the same node as the compressed page, and gets added to the LRU
> list of that node. IOW, the node to which the compressed page belongs is the
> one to whose LRU the entry will be added.
> 
> With this patch, with kmem_cache_alloc_bulk(), the entry will be created on
> the per-CPU slab of the CPU on which zswap_store() is called, and will be
> added to the LRU of that per-CPU slab's NUMA node. Hence, the end result
> could be that the zswap_entry for a page ends up on a different NUMA
> node/memcg than the page's NUMA node.
> 
> This is my thinking as to how this will impact the zswap shrinker:
> 
> 1) memcg shrinker: if the memcg the entry ends up in is on the zswap_list_lru,
>     the entry will be written back.
> 2) Global shrinker: will cycle through all memcg's that have pages in the
>     zswap_list_lru, and the entry will be written back.
> 
> Based on this, it is not clear to me if there is a problem, and I would like
> to request you, Nhat, and others to provide insights as well.
> 
> Interestingly, most of the code in slub.c has unlikely(!node_match(slab, node)).
> Does this imply some higher-level mm slab allocation requirements?
> 
> I am Ok with just calling zswap_entry_cache_alloc() for "nr_pages" if we
> think this would be more correct.

I saw your other response as well, but I think one thing is not clear
here. The zswap entry will get written back "eventually", sure, but
that's not the problem.

If the zswap entry is on the wrong node lru, two things happen:
(a) When the "right" node is under memory pressure, we cannot free this
    entry by writing it back since it's not available in the lru.
(b) When the "wrong" node is under memory pressure, it will potentially
    write back entries from other nodes AND report them as being freed
    from this node.

Both (a) and (b) cause less effective reclaim from the zswap shrinker.
Additionally (b) causes the shrinker to report the wrong amount of freed
memory from the node. While this may not be significant today, it's very
possible that more heuristics start relying on this number in the
future.

I don't believe we should put zswap entries on the wrong LRU, but I will
defer to Nhat for the final verdict if he has a different opinion.
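
For illustration, one way to keep the node-locality assumption without the
bulk API would be a per-entry node-aware allocation loop, roughly like the
below (a sketch only; the error handling and the use of GFP_KERNEL here are
mine, not the patch's):

	for (i = 0; i < nr_pages; i++) {
		struct page *page = folio_page(folio, start + i);

		/* Allocate each entry on the node of the page being stored. */
		entries[i] = kmem_cache_alloc_node(zswap_entry_cache,
						   GFP_KERNEL,
						   page_to_nid(page));
		if (!entries[i]) {
			/* Roll back the entries allocated so far. */
			while (i--)
				kmem_cache_free(zswap_entry_cache, entries[i]);
			return false;
		}
	}

Whether the extra per-page allocations show up in the batching numbers is
something that would need to be measured, of course.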

> > 
> > Does it actually matter if we do the initializations here vs. right
> > before inserting to the LRU (current behavior)?
> 
> Yes, it impacts batching performance with software compressors quite a bit.

Taking a step back, based on discussions in this thread and others, it
seems like the performance of zswap_store() is very sensitive to minor
changes like this one, calculating the store size, etc.

I don't recall this being the case before this patch series. It seems
like an artifact of batching affecting performance negatively, compensated
for by a collection of micro-optimizations and "correct" placement of
code/data. This seems very fragile.

It is very possible that an incoming change will move the
initializations or make other changes to the code that will seemingly
cause regressions when they really shouldn't.

Additionally, what may seem optimal on the CPU you are testing on may
be suboptimal for other CPUs. I think the big problem here is not where
to initialize the entry or whether to precompute the store batch size,
it's why the code has become extremely sensitive to these changes when
it shouldn't be. zswap_store() is not a fast path by any means, and
people will not expect such changes to cause meaningful regressions.

> 
> > 
> > > +	for (i = 0; i < nr_pages; ++i) {
> > > +		entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> > > +		entries[i]->pool = pool;
> > > +		entries[i]->swpentry = page_swap_entry(folio_page(folio, start + i));
> > > +		entries[i]->objcg = objcg;
> > > +		entries[i]->referenced = true;
> > > +		INIT_LIST_HEAD(&entries[i]->lru);
> > > +	}
> > >
> > > -	old = xa_store(swap_zswap_tree(page_swpentry),
> > > -		       swp_offset(page_swpentry),
> > > -		       entry, GFP_KERNEL);
> > > -	if (xa_is_err(old)) {
> > > -		int err = xa_err(old);
> > > +	for (i = 0; i < nr_pages; ++i) {
> > > +		struct page *page = folio_page(folio, start + i);
> > >
> > > -		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > > -		zswap_reject_alloc_fail++;
> > > -		goto store_failed;
> > > +		if (!zswap_compress(page, entries[i], pool, folio_wb))
> > > +			goto store_pages_failed;
> > >  	}
> > >
> > > -	/*
> > > -	 * We may have had an existing entry that became stale when
> > > -	 * the folio was redirtied and now the new version is being
> > > -	 * swapped out. Get rid of the old.
> > > -	 */
> > > -	if (old)
> > > -		zswap_entry_free(old);
> > > +	for (i = 0; i < nr_pages; ++i) {
> > > +		struct zswap_entry *old, *entry = entries[i];
> > >
> > > -	/*
> > > -	 * The entry is successfully compressed and stored in the tree, there is
> > > -	 * no further possibility of failure. Grab refs to the pool and objcg,
> > > -	 * charge zswap memory, and increment zswap_stored_pages.
> > > -	 * The opposite actions will be performed by zswap_entry_free()
> > > -	 * when the entry is removed from the tree.
> > > -	 */
> > > -	zswap_pool_get(pool);
> > > -	if (objcg) {
> > > -		obj_cgroup_get(objcg);
> > > -		obj_cgroup_charge_zswap(objcg, entry->length);
> > > -	}
> > > -	atomic_long_inc(&zswap_stored_pages);
> > > -	if (entry->length == PAGE_SIZE)
> > > -		atomic_long_inc(&zswap_stored_incompressible_pages);
> > > +		old = xa_store(swap_zswap_tree(entry->swpentry),
> > > +			       swp_offset(entry->swpentry),
> > > +			       entry, GFP_KERNEL);
> > > +		if (unlikely(xa_is_err(old))) {
> > > +			int err = xa_err(old);
> > >
> > > -	/*
> > > -	 * We finish initializing the entry while it's already in xarray.
> > > -	 * This is safe because:
> > > -	 *
> > > -	 * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > -	 *
> > > -	 * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > -	 *    The publishing order matters to prevent writeback from seeing
> > > -	 *    an incoherent entry.
> > > -	 */
> > > -	entry->pool = pool;
> > > -	entry->swpentry = page_swpentry;
> > > -	entry->objcg = objcg;
> > > -	entry->referenced = true;
> > > -	if (entry->length) {
> > > -		INIT_LIST_HEAD(&entry->lru);
> > > -		zswap_lru_add(&zswap_list_lru, entry);
> > > +			WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > > +			zswap_reject_alloc_fail++;
> > > +			/*
> > > +			 * Entries up to this point have been stored in the
> > > +			 * xarray. zswap_store() will erase them from the xarray
> > > +			 * and call zswap_entry_free(). Local cleanup in
> > > +			 * 'store_pages_failed' only needs to happen for
> > > +			 * entries from [@i to @nr_pages).
> > > +			 */
> > > +			store_fail_idx = i;
> > > +			goto store_pages_failed;
> > > +		}
> > > +
> > > +		/*
> > > +		 * We may have had an existing entry that became stale when
> > > +		 * the folio was redirtied and now the new version is being
> > > +		 * swapped out. Get rid of the old.
> > > +		 */
> > > +		if (unlikely(old))
> > > +			zswap_entry_free(old);
> > > +
> > > +		/*
> > > +		 * The entry is successfully compressed and stored in the tree, there is
> > > +		 * no further possibility of failure. Grab refs to the pool and objcg,
> > > +		 * charge zswap memory, and increment zswap_stored_pages.
> > > +		 * The opposite actions will be performed by zswap_entry_free()
> > > +		 * when the entry is removed from the tree.
> > > +		 */
> > 
> > But there *is* further possibility of failure if a subsequent entry
> > xa_store() fails, right?
> 
> This comment is referring to this specific entry.

I think it's misleading in its current form. Perhaps:

The entry is successfully compressed and stored in the tree, and further
failures will be cleaned up in zswap_store().

> 
> > 
> > Seems like if xa_store() fails we do not cleanup previously charged
> > objects, pool references, zswap_stored_pages, etc. Instead of rolling
> > all this back on failure, can we do all the xarray stores first and only
> > do the rest when we're at a point where no failure can happen? Would
> > that cause a performance regression?
> 
> It would make the code more complicated and thus cause a performance
> regression. I have tried to balance code simplicity (which impacts performance)
> with correctness here. The "store_fail_idx" ensures that all partial work done
> in zswap_store_pages() for this batch is cleaned up.
> 
> The rest is handled in zswap_store() when it sees an error returned by
> zswap_store_pages().

Right, I missed the part about zswap_store() handling this, the comment
above confused me.
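
For completeness, the alternative ordering I was picturing would look roughly
like the below: publish everything in the xarray first, then take references
and bump counters once no further failure is possible. This is only a sketch
assembled from the hunks quoted above; it glosses over the stale-entry
details and the error unwinding that zswap_store() already performs, and I'm
not claiming it would be any faster:

	/* Phase 1: publish all entries; the only step that can still fail. */
	for (i = 0; i < nr_pages; i++) {
		struct zswap_entry *old;

		old = xa_store(swap_zswap_tree(entries[i]->swpentry),
			       swp_offset(entries[i]->swpentry),
			       entries[i], GFP_KERNEL);
		if (xa_is_err(old)) {
			/*
			 * NB: entries published so far have no refs or charges
			 * yet, so the cleanup path would need to account for
			 * that compared to the current code.
			 */
			store_fail_idx = i;
			goto store_pages_failed;
		}
		if (old)
			zswap_entry_free(old);
	}

	/* Phase 2: no failure possible; grab refs and update the counters. */
	for (i = 0; i < nr_pages; i++) {
		zswap_pool_get(pool);
		if (objcg) {
			obj_cgroup_get(objcg);
			obj_cgroup_charge_zswap(objcg, entries[i]->length);
		}
		atomic_long_inc(&zswap_stored_pages);
		if (entries[i]->length == PAGE_SIZE)
			atomic_long_inc(&zswap_stored_incompressible_pages);
	}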
