Message-ID: <PH7PR11MB812102541D5106A6D0E21E28C9E8A@PH7PR11MB8121.namprd11.prod.outlook.com>
Date: Wed, 15 Oct 2025 03:42:21 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Nhat Pham <nphamcs@...il.com>, Yosry Ahmed <yosry.ahmed@...ux.dev>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
	<hannes@...xchg.org>, "chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
	"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
	<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
	"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"senozhatsky@...omium.org" <senozhatsky@...omium.org>, "sj@...nel.org"
	<sj@...nel.org>, "kasong@...cent.com" <kasong@...cent.com>,
	"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
	"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
	"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
	<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
	"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
	<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
	"Gomes, Vinicius" <vinicius.gomes@...el.com>, "Feghali, Wajdi K"
	<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
	"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v12 22/23] mm: zswap: zswap_store() will process a large
 folio in batches.


> -----Original Message-----
> From: Nhat Pham <nphamcs@...il.com>
> Sent: Tuesday, October 14, 2025 9:35 AM
> To: Yosry Ahmed <yosry.ahmed@...ux.dev>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>; linux-
> kernel@...r.kernel.org; linux-mm@...ck.org; hannes@...xchg.org;
> chengming.zhou@...ux.dev; usamaarif642@...il.com;
> ryan.roberts@....com; 21cnbao@...il.com;
> ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> <kristen.c.accardi@...el.com>; Gomes, Vinicius <vinicius.gomes@...el.com>;
> Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> large folio in batches.
> 
> On Tue, Oct 14, 2025 at 8:29 AM Yosry Ahmed <yosry.ahmed@...ux.dev>
> wrote:
> >
> > [..]
> > > > > @@ -158,6 +161,8 @@ struct zswap_pool {
> > > > >   struct work_struct release_work;
> > > > >   struct hlist_node node;
> > > > >   char tfm_name[CRYPTO_MAX_ALG_NAME];
> > > > > + u8 compr_batch_size;
> > > > > + u8 store_batch_size;
> > > >
> > > > I don't think we need to store store_batch_size, seems trivial to
> > > > calculate at store time (perhaps in a helper).
> > > >
> > > > Taking a step back, is there any benefit to limiting store_batch_size to
> > > > compr_batch_size? Is there a disadvantage to using
> > > > ZSWAP_MAX_BATCH_SIZE
> > > > even if it's higher than the HW compression batch size?
> > >
> > > Thanks, Yosry, for the code review comments. I had a discussion with
> > > Barry earlier on these very same topics as a follow-up to his review
> > > comments for v11, starting with [1]. Can you please go through the
> > > rationale for these design choices, and let me know if you have any
> > > questions:
> > >
> > > [1]: https://patchwork.kernel.org/comment/26530319/
> >
> > I am surprised that calculating the value in zswap_store() causes a
> > regression, but I am fine with keeping the precalculation in this case.
> >
> > I think there's a bigger problem here tho, more below.
> >
> > > > > + */
> > > > > +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> > > > > +                                                    unsigned int nr_entries,
> > > > > +                                                    gfp_t gfp)
> > > > > +{
> > > > > + return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
> > > >
> > > > We currently use kmem_cache_alloc_node() in zswap_entry_cache_alloc() to
> > > > allocate the entry on the same node as the compressed page. We use
> > > > entry_to_nid() to get the node for LRU operations.
> > > >
> > > > This breaks that assumption.
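
(For reference, a minimal sketch -- not code from the patch -- of the two
allocation paths being contrasted here; zswap_entry_cache and the slab APIs
are real, the surrounding variables are illustrative:)

	/* Mainline: one entry per page, allocated on the compressed page's
	 * node, so a later entry_to_nid() matches the page's node. */
	entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);

	/* v12: nr_entries allocated in one call from the current CPU's slab,
	 * so entry_to_nid() may differ from the page's node. */
	kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);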
> > >
> > > You bring up a good point. I was looking at the code in slub.c and my
> > > understanding thus far is that both bulk allocations and
> > > kmem_cache_alloc_node() allocations are made from a per-CPU "cpu_slab"
> > > that is allocated by SLUB.
> > >
> > > IIUC, the concern you are raising is that in the mainline, the entry is
> > > allocated on the same node as the compressed page, and gets added to the
> > > LRU list of that node. IOW, the node to which the compressed page belongs
> > > is the one to whose LRU the entry will be added.
> > >
> > > With this patch, with kmem_cache_alloc_bulk(), the entry will be created
> > > on the per-CPU slab of the CPU on which zswap_store() is called and will
> > > be added to the LRU of that per-CPU slab's NUMA node. Hence, the end
> > > result could be that the zswap_entry for a page ends up on a different
> > > NUMA node/memcg than the page's NUMA node.
> 
> I think only the NUMA node is the problem, not the memcg.
> 
> > >
> > > This is my thinking as to how this will impact the zswap shrinker:
> > >
> > > 1) memcg shrinker: if the memcg the entry ends up in is on the
> > >     zswap_list_lru, the entry will be written back.
> > > 2) Global shrinker: will cycle through all memcg's that have pages in the
> > >     zswap_list_lru, and the entry will be written back.
> > >
> > > Based on this, it is not clear to me if there is a problem, and I would
> > > like to request that you, Nhat, and others provide insights as well.
> > >
> > > Interestingly, most of the code in slub.c has
> > > unlikely(!node_match(slab, node)). Does this imply some higher-level
> > > mm slab allocation requirements?
> > >
> > > I am Ok with just calling zswap_entry_cache_alloc() for "nr_pages" if we
> > > think this would be more correct.
> >
> > I saw your other response as well, but I think one thing is not clear
> > here. The zswap entry will get written back "eventually", sure, but
> > that's not the problem.
> >
> > If the zswap entry is on the wrong node lru, two things happen:
> > (a) When the "right" node is under memory pressure, we cannot free this
> >     entry by writing it back since it's not available in the lru.
> > (b) When the "wrong" node is under memory pressure, it will potentially
> >     writeback entries from other nodes AND report them as being freed
> >     from this node.
> >
> > Both (a) and (b) cause less effective reclaim from the zswap shrinker.
> > Additionally (b) causes the shrinker to report the wrong amount of freed
> > memory from the node. While this may not be significant today, it's very
> > possible that more heuristics start relying on this number in the
> > future.
> >
> > I don't believe we should put zswap entries on the wrong LRU, but I will
> > defer to Nhat for the final verdict if he has a different opinion.
> 
> Oh shoot. Yeah I missed that part.
> 
> In the past, we sort of did not care - zswap was very poorly designed
> for NUMA architecture in general, and most of our test setups have
> been single-node, so these kinds of discrepancies did not show up in
> performance numbers.
> 
> But we are getting more multi-node systems:
> 
> 1. Bigger hosts (memory-wise) tend to also have more than one node.
> It scales better that way (especially because a lot of structures and
> locks protecting them are node-partitioned).
> 
> 2. We have also seen different memory media that are often expressed
> to the kernel as nodes: CXL, GPU memory, etc.
> 
> This will necessitate tightening memory placement. We recently had to
> fix one such issue:
> 
> https://github.com/torvalds/linux/commit/56e5a103a721d0ef139bba7ff3d3ada6c8217d5b
> 
> So I'm a bit nervous about this change, which will make us use the wrong
> LRU...
> 
> Some work around:
> 
> 1. Can we squeeze an extra int field anywhere in struct zswap_entry?
> 
> 2. Can we pump nid all the way to zswap_lru_add()?
> 
> This is still not 100% ideal - the metadata (struct zswap_entry) will
> still be allocated on the wrong node. But at least the data are
> properly managed, i.e on the right LRU.

Thanks, Nhat and Yosry, for the discussion. Thank you, Nhat, for the
zsmalloc changelog reference and for the workarounds!

Following your suggestion in (2), can we pass in the folio's nid from
zswap_store_pages() to zswap_lru_add(), as follows:

diff --git a/mm/zswap.c b/mm/zswap.c
index 263bc6d7f5c6..44665deece80 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -694,9 +694,9 @@ static inline int entry_to_nid(struct zswap_entry *entry)
        return page_to_nid(virt_to_page(entry));
 }
 
-static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
+static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry,
+                         int nid)
 {
-       int nid = entry_to_nid(entry);
        struct mem_cgroup *memcg;
 
        /*
@@ -1758,7 +1758,7 @@ static bool zswap_store_pages(struct folio *folio,
                 *    an incoherent entry.
                 */
                if (likely(entry->length))
-                       zswap_lru_add(&zswap_list_lru, entry);
+                       zswap_lru_add(&zswap_list_lru, entry, nid);
        }
 
        return true;
-- 

I believe this will add the entry to the LRU of the node of the folio being
compressed. If so, we may be able to avoid adding an int field to
struct zswap_entry?
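
For completeness, the caller side in zswap_store_pages() would be roughly as
follows (a sketch, assuming the folio is already in hand; folio_nid() is the
existing helper):

	/* Use the node of the folio being stored, so the entry goes on the
	 * folio's LRU even if the entry struct itself was bulk-allocated on
	 * a different node. */
	int nid = folio_nid(folio);
	...
	if (likely(entry->length))
		zswap_lru_add(&zswap_list_lru, entry, nid);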

> 
> >
> > > >
> > > > Does it actually matter if we do the initializations here vs. right
> > > > before inserting to the LRU (current behavior)?
> > >
> > > Yes, it impacts batching performance with software quite a bit.
> >
> > Taking a step back, based on discussions in this thread and others, it
> > seems like the performance of zswap_store() is very sensitive to minor
> > changes like this one, calculating the store size, etc.
> >
> > I don't recall this being the case before this patch series. It seems
> > like an artifact of batching affecting performance negatively and
> > a collection of micro-optimizations and "correct" placement of code/data
> > to fix it. This seems to be very fragile.
> 
> I agree. This seems troubling.
> 
> What's the variance of your benchmark results, Kanchana? I have found
> that kernel build tests can have quite a bit of variance, but is it as
> big as the performance difference of these minor changes (i.e., can it
> explain the "regression")?

I measured the performance difference of not storing "u8 store_batch_size"
in struct zswap_pool and instead computing it dynamically in zswap_store().
I used usemem with 30 processes, which is less prone to variance.
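
To be concrete, the alternative I tested is along the lines of the sketch
below; the helper name is illustrative, and it assumes the policy of using
the compressor's batch size when the compressor batches internally, and
ZSWAP_MAX_BATCH_SIZE otherwise:

	/*
	 * Sketch only: derive the store batch size per call instead of
	 * caching it in struct zswap_pool.
	 */
	static __always_inline u8 zswap_store_batch_size(struct zswap_pool *pool)
	{
		return (pool->compr_batch_size > 1) ? pool->compr_batch_size :
						      ZSWAP_MAX_BATCH_SIZE;
	}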

It appears there is an overall improvement -- I must have been mistaken
earlier, or the regression I saw was with an earlier mm-unstable, not the
one used for v12.

When I tested this change today, throughput is better for both 64K and PMD
folios, and latency improves for PMD folios. There is a slight latency
"regression" of 2% in sys time and 1% in elapsed time for 64K folios:

usemem30, 64K folios:

     -----------------------------------------------------------------------
                                  pool->store_batch_size    store_batch_size
                                  computed once and stored  computed per call
                                  in zswap_pool             to zswap_store()
     -----------------------------------------------------------------------
     zswap compressor                        zstd                zstd
     -----------------------------------------------------------------------
     Total throughput (KB/s)            6,106,539           6,147,419
     Average throughput (KB/s)            203,551             204,913
     elapsed time (sec)                     100.12              101.30
     sys time (sec)                       2,458.79            2,511.54
     -----------------------------------------------------------------------



usemem30, PMD folios:

     -----------------------------------------------------------------------
                                  pool->store_batch_size    store_batch_size
                                  computed once and stored  computed per call
                                  in zswap_pool             to zswap_store()
     -----------------------------------------------------------------------
     zswap compressor                        zstd                zstd
     -----------------------------------------------------------------------
     Total throughput (KB/s)            6,715,579           6,762,097
     Average throughput (KB/s)            223,852             225,403
     elapsed time (sec)                      94.68               88.55
     sys time (sec)                       2,313.08            2,100.27
     -----------------------------------------------------------------------

I can incorporate this change in v13 if it allays your concerns about
code fragility. Again, my apologies.

I believe the new batching implementation of zswap_compress() does not
rely on any other 'micro-optimizations and "correct" placement of
code/data'. It aims to directly represent the commonality of a batching
architecture for sequential (software) and parallel (batching) compressors
(hardware today, and software batching in the future, as our cover letter
states). It borrows the general code flow from the mainline per-page
implementation while incorporating Herbert's suggestion of an SG-lists
based batching crypto interface. Are there any other specific concerns
that I might not be comprehending? If so, please let me know and I will be
happy to factor them in as I work on v13.
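
As a rough illustration of that shape (not the actual patch code; the
destination buffers, lengths and acomp request plumbing are elided), a batch
of sub-pages of the folio can be described to the crypto layer with a
scatterlist:

	struct scatterlist inputs[ZSWAP_MAX_BATCH_SIZE];
	unsigned int i;

	/* One scatterlist entry per sub-page in the batch; the request is
	 * then set up and dispatched via crypto_acomp_compress(). */
	sg_init_table(inputs, nr_pages);
	for (i = 0; i < nr_pages; ++i)
		sg_set_page(&inputs[i], folio_page(folio, start + i),
			    PAGE_SIZE, 0);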

The performance improvement data I posted in v12 for zstd and
deflate-iaa with this batched zswap_compress() is also something for us
to consider, if we agree that eliminating zswap_pool's "store_batch_size"
member addresses the concerns about code fragility.

Thanks again for your helpful critique and suggestions!

Best regards,
Kanchana

> 
> >
> > It is very possible that an incoming change will move the
> > initializations or make other changes to the code, that will seemingly
> > cause regressions when they really shouldn't.
> >
> > Additionally, what may seem optimal on the CPU you are testing on may
> > be suboptimal for other CPUs. I think the big problem here is not where
> > to initialize the entry or whether to precompute the store batch size,
> > it's why the code has become extremely sensitive to these changes when
> > it shouldn't be. zswap_store() is not a fast path by any means, and
> > people will not expect such changes to cause meaningful regressions.
> 
> Agree. These micro-optimizations seem very fragile.
> 
> >
> > >
> > > >
> > > > > + for (i = 0; i < nr_pages; ++i) {
> > > > > +         entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> > > > > +         entries[i]->pool = pool;
> > > > > +         entries[i]->swpentry = page_swap_entry(folio_page(folio, start + i));
> > > > > +         entries[i]->objcg = objcg;
> > > > > +         entries[i]->referenced = true;
> > > > > +         INIT_LIST_HEAD(&entries[i]->lru);
> > > > > + }
> > > > >
> > > > > - old = xa_store(swap_zswap_tree(page_swpentry),
> > > > > -                swp_offset(page_swpentry),
> > > > > -                entry, GFP_KERNEL);
> > > > > - if (xa_is_err(old)) {
> > > > > -         int err = xa_err(old);
> > > > > + for (i = 0; i < nr_pages; ++i) {
> > > > > +         struct page *page = folio_page(folio, start + i);
> > > > >
> > > > > -         WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > > > > -         zswap_reject_alloc_fail++;
> > > > > -         goto store_failed;
> > > > > +         if (!zswap_compress(page, entries[i], pool, folio_wb))
> > > > > +                 goto store_pages_failed;
> > > > >   }
> > > > >
> > > > > - /*
> > > > > -  * We may have had an existing entry that became stale when
> > > > > -  * the folio was redirtied and now the new version is being
> > > > > -  * swapped out. Get rid of the old.
> > > > > -  */
> > > > > - if (old)
> > > > > -         zswap_entry_free(old);
> > > > > + for (i = 0; i < nr_pages; ++i) {
> > > > > +         struct zswap_entry *old, *entry = entries[i];
> > > > >
> > > > > - /*
> > > > > -  * The entry is successfully compressed and stored in the tree, there is
> > > > > -  * no further possibility of failure. Grab refs to the pool and objcg,
> > > > > -  * charge zswap memory, and increment zswap_stored_pages.
> > > > > -  * The opposite actions will be performed by zswap_entry_free()
> > > > > -  * when the entry is removed from the tree.
> > > > > -  */
> > > > > - zswap_pool_get(pool);
> > > > > - if (objcg) {
> > > > > -         obj_cgroup_get(objcg);
> > > > > -         obj_cgroup_charge_zswap(objcg, entry->length);
> > > > > - }
> > > > > - atomic_long_inc(&zswap_stored_pages);
> > > > > - if (entry->length == PAGE_SIZE)
> > > > > -         atomic_long_inc(&zswap_stored_incompressible_pages);
> > > > > +         old = xa_store(swap_zswap_tree(entry->swpentry),
> > > > > +                        swp_offset(entry->swpentry),
> > > > > +                        entry, GFP_KERNEL);
> > > > > +         if (unlikely(xa_is_err(old))) {
> > > > > +                 int err = xa_err(old);
> > > > >
> > > > > - /*
> > > > > -  * We finish initializing the entry while it's already in xarray.
> > > > > -  * This is safe because:
> > > > > -  *
> > > > > -  * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > -  *
> > > > > -  * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > -  *    The publishing order matters to prevent writeback from seeing
> > > > > -  *    an incoherent entry.
> > > > > -  */
> > > > > - entry->pool = pool;
> > > > > - entry->swpentry = page_swpentry;
> > > > > - entry->objcg = objcg;
> > > > > - entry->referenced = true;
> > > > > - if (entry->length) {
> > > > > -         INIT_LIST_HEAD(&entry->lru);
> > > > > -         zswap_lru_add(&zswap_list_lru, entry);
> > > > > +                 WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > > > > +                 zswap_reject_alloc_fail++;
> > > > > +                 /*
> > > > > +                  * Entries up to this point have been stored in the
> > > > > +                  * xarray. zswap_store() will erase them from the xarray
> > > > > +                  * and call zswap_entry_free(). Local cleanup in
> > > > > +                  * 'store_pages_failed' only needs to happen for
> > > > > +                  * entries from [@i to @nr_pages).
> > > > > +                  */
> > > > > +                 store_fail_idx = i;
> > > > > +                 goto store_pages_failed;
> > > > > +         }
> > > > > +
> > > > > +         /*
> > > > > +          * We may have had an existing entry that became stale when
> > > > > +          * the folio was redirtied and now the new version is being
> > > > > +          * swapped out. Get rid of the old.
> > > > > +          */
> > > > > +         if (unlikely(old))
> > > > > +                 zswap_entry_free(old);
> > > > > +
> > > > > +         /*
> > > > > +          * The entry is successfully compressed and stored in the tree, there is
> > > > > +          * no further possibility of failure. Grab refs to the pool and objcg,
> > > > > +          * charge zswap memory, and increment zswap_stored_pages.
> > > > > +          * The opposite actions will be performed by zswap_entry_free()
> > > > > +          * when the entry is removed from the tree.
> > > > > +          */
> > > >
> > > > But there *is* further possibility of failure if a subsequent entry
> > > > xa_store() fails, right?
> > >
> > > This comment is referring to this specific entry.
> >
> > I think it's misleading in its current form. Perhaps:
> >
> > The entry is successfully compressed and stored in the tree, and further
> > failures will be cleaned up in zswap_store().
> >
> > >
> > > >
> > > > Seems like if xa_store() fails we do not cleanup previously charged
> > > > objects, pool references, zswap_stored_pages, etc. Instead of rolling
> > > > all this back on failure, can we do all the xarray stores first and only
> > > > do the rest when we're at a point where no failure can happen? Would
> > > > that cause a performance regression?
> > >
> > > It would make the code more complicated and thus cause a performance
> > > regression. I have tried to balance code simplicity (which impacts
> > > performance) with correctness here. The "store_fail_idx" ensures that all
> > > partial work done in zswap_store_pages() for this batch is cleaned up.
> > >
> > > The rest is handled in zswap_store() when it sees an error returned by
> > > zswap_store_pages().
> >
> > Right, I missed the part about zswap_store() handling this, the comment
> > above confused me.
