Message-ID: <CAKEwX=NyUR6UE0PhXCaHOdta4=gVvRj7QgZtpPaLADpfXYyvZw@mail.gmail.com>
Date: Tue, 14 Oct 2025 09:35:24 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
"hannes@...xchg.org" <hannes@...xchg.org>, "chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com" <ryan.roberts@....com>,
"21cnbao@...il.com" <21cnbao@...il.com>,
"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"senozhatsky@...omium.org" <senozhatsky@...omium.org>, "sj@...nel.org" <sj@...nel.org>,
"kasong@...cent.com" <kasong@...cent.com>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>, "davem@...emloft.net" <davem@...emloft.net>,
"clabbe@...libre.com" <clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com" <surenb@...gle.com>,
"Accardi, Kristen C" <kristen.c.accardi@...el.com>, "Gomes, Vinicius" <vinicius.gomes@...el.com>,
"Feghali, Wajdi K" <wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>
Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a large
folio in batches.
On Tue, Oct 14, 2025 at 8:29 AM Yosry Ahmed <yosry.ahmed@...ux.dev> wrote:
>
> [..]
> > > > @@ -158,6 +161,8 @@ struct zswap_pool {
> > > > struct work_struct release_work;
> > > > struct hlist_node node;
> > > > char tfm_name[CRYPTO_MAX_ALG_NAME];
> > > > + u8 compr_batch_size;
> > > > + u8 store_batch_size;
> > >
> > > I don't think we need to store store_batch_size; it seems trivial to
> > > calculate at store time (perhaps in a helper).
> > >
> > > Taking a step back, is there any benefit to limiting store_batch_size to
> > > compr_batch_size? Is there a disadvantage to using ZSWAP_MAX_BATCH_SIZE
> > > even if it's higher than the HW compression batch size?
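If we do end up computing this at store time, I imagine the helper is
something small like the below. To be clear, this is a hypothetical
sketch on my part (the helper name and the exact policy are mine, not
from the patch); the real policy should be whatever the precomputation
in this patch does:

static inline unsigned int zswap_store_batch_size(struct zswap_pool *pool)
{
	/*
	 * Hypothetical policy: batching compressors store in units of their
	 * HW batch size, while software compressors (compr_batch_size == 1)
	 * store in units of ZSWAP_MAX_BATCH_SIZE to amortize loop overhead.
	 */
	if (pool->compr_batch_size > 1)
		return pool->compr_batch_size;

	return ZSWAP_MAX_BATCH_SIZE;
}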
> >
> > Thanks, Yosry, for the code review comments. I had a discussion with
> > Barry earlier on these same topics as a follow-up to his review comments
> > for v11, starting with [1]. Can you please go through the rationale for
> > these design choices and let me know if you have any questions:
> >
> > [1]: https://patchwork.kernel.org/comment/26530319/
>
> I am surprised that calculating the value in zswap_store() causes a
> regression, but I am fine with keeping the precalculation in this case.
>
> I think there's a bigger problem here tho, more below.
>
> > > > + */
> > > > +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> > > > +                                                            unsigned int nr_entries,
> > > > +                                                            gfp_t gfp)
> > > > +{
> > > > +       return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
> > >
> > > We currently use kmem_cache_alloc_node() in zswap_entry_cache_alloc() to
> > > allocate the entry on the same node as the compressed page. We use
> > > entry_to_nid() to get the node for LRU operations.
> > >
> > > This breaks that assumption.
> >
> > You bring up a good point. I was looking at the code in slub.c and my
> > understanding thus far is that both bulk allocations and kmem_cache_alloc_node()
> > allocations are made from a per-CPU "cpu_slab" that is allocated by SLUB.
> >
> > IIUC, the concern you are raising is that in the mainline, the entry is allocated
> > on the same node as the compressed page and gets added to the LRU list of that
> > node. IOW, the node to which the compressed page belongs is the one to whose
> > LRU the entry will be added.
> >
> > With this patch, with kmem_cache_alloc_bulk(), the entry will be created on
> > the per-CPU slab of the CPU on which zswap_store() is called and will be
> > added to the LRU of that per-CPU slab's NUMA node. Hence, the end result
> > could be that the zswap_entry for a page ends up on a different NUMA
> > node/memcg than the page's NUMA node.
I think only the NUMA node is the problem, not the memcg.
> >
> > This is my thinking as to how this will impact the zswap shrinker:
> >
> > 1) memcg shrinker: if the memcg the entry ends up in is on the zswap_list_lru,
> > the entry will be written back.
> > 2) Global shrinker: will cycle through all memcgs that have pages in the
> > zswap_list_lru, and the entry will be written back.
> >
> > Based on this, it is not clear to me if there is a problem, and I would like
> > to request that you, Nhat, and others provide insights as well.
> >
> > Interestingly, most of the code in slub.c has unlikely(!node_match(slab, node)).
> > Does this imply some higher-level mm slab allocation requirements?
> >
> > I am OK with just calling zswap_entry_cache_alloc() for "nr_pages" if we
> > think this would be more correct.
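FWIW, if we do go the per-page allocation route, I imagine it looks
roughly like the loop below. This is just a sketch on my end, assuming
the existing zswap_entry_cache_alloc(gfp, nid) helper, and with a
hypothetical error path:

	for (i = 0; i < nr_pages; ++i) {
		/* Allocate each entry on the folio's node, as mainline does per page. */
		entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
		if (unlikely(!entries[i])) {
			zswap_reject_kmemcache_fail++;
			/* 'nr_allocated' and 'free_entries' are hypothetical:
			 * free entries[0..i) and bail out.
			 */
			nr_allocated = i;
			goto free_entries;
		}
	}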
>
> I saw your other response as well, but I think one thing is not clear
> here. The zswap entry will get written back "eventually", sure, but
> that's not the problem.
>
> If the zswap entry is on the wrong node lru, two things happen:
> (a) When the "right" node is under memory pressure, we cannot free this
> entry by writing it back since it's not available in the lru.
> (b) When the "wrong" node is under memory pressure, it will potentially
> write back entries from other nodes AND report them as being freed
> from this node.
>
> Both (a) and (b) cause less effective reclaim from the zswap shrinker.
> Additionally (b) causes the shrinker to report the wrong amount of freed
> memory from the node. While this may not be significant today, it's very
> possible that more heuristics start relying on this number in the
> future.
>
> I don't believe we should put zswap entries on the wrong LRU, but I will
> defer to Nhat for the final verdict if he has a different opinion.
Oh shoot. Yeah I missed that part.
In the past, we sort of did not care - zswap was very poorly designed
for NUMA architecture in general, and most of our test setups have
been single-node, so these kinds of discrepancies did not show up in
performance numbers.
But we are getting more multi-node systems:
1. Bigger hosts (memory-wise) tend to also have more than one node.
It scales better that way (especially because a lot of structures, and
the locks protecting them, are node-partitioned).
2. We have also seen different memory media that are often expressed
to the kernel as nodes: CXL, GPU memory, etc.
This will necessitate tightening memory placement. We recently had to
fix one such issue:
https://github.com/torvalds/linux/commit/56e5a103a721d0ef139bba7ff3d3ada6c8217d5b
So I'm a bit nervous about this change, which will make us use the wrong LRU...
Some workarounds:
1. Can we squeeze an extra int field anywhere in struct zswap_entry?
2. Can we pump the nid all the way down to zswap_lru_add()? (rough sketch below)
This is still not 100% ideal - the metadata (struct zswap_entry) will
still be allocated on the wrong node. But at least the data is properly
managed, i.e., on the right LRU.
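Roughly what I have in mind for (2) - untested, and I'm going off
memory for the list_lru_add() signature and the mem_cgroup_from_entry()
helper, so treat it as a sketch of the shape rather than a real patch:

static void zswap_lru_add(struct list_lru *list_lru,
			  struct zswap_entry *entry, int nid)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
	memcg = mem_cgroup_from_entry(entry);
	/* Place the entry on the per-nid, per-memcg LRU of the *page's* node. */
	list_lru_add(list_lru, &entry->lru, nid, memcg);
	rcu_read_unlock();
}

with zswap_store_pages() passing the folio's node, e.g.:

	zswap_lru_add(&zswap_list_lru, entry, folio_nid(folio));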
>
> > >
> > > Does it actually matter if we do the initializations here vs. right
> > > before inserting to the LRU (current behavior)?
> >
> > Yes, it impacts batching performance with software compressors quite a bit.
>
> Taking a step back, based on discussions in this thread and others, it
> seems like the performance of zswap_store() is very sensitive to minor
> changes like this one, calculating the store size, etc.
>
> I don't recall this being the case before this patch series. It seems
> like an artifact of batching affecting performance negatively and
> a collection of micro-optimizations and "correct" placement of code/data
> to fix it. This seems to be very fragile.
I agree. This seems troubling.
What's the variance of your benchmark results, Kanchana? I have found
that kernel build tests can have quite a bit of variance, but is it as
big as the performance difference of these minor changes (i.e., can it
explain the "regression")?
>
> It is very possible that an incoming change will move the
> initializations or make other changes to the code that will seemingly
> cause regressions when they really shouldn't.
>
> Additionally, what may seem optimal on the CPU you are testing on may
> be suboptimal for other CPUs. I think the big problem here is not where
> to initialize the entry or whether to precompute the store batch size,
> it's why the code has become extremely sensitive to these changes when
> it shouldn't be. zswap_store() is not a fast path by any means, and
> people will not expect such changes to cause meaningful regressions.
Agree. These micro-optimizations seem very fragile.
>
> >
> > >
> > > > + for (i = 0; i < nr_pages; ++i) {
> > > > + entries[i]->handle = (unsigned long)ERR_PTR(-EINVAL);
> > > > + entries[i]->pool = pool;
> > > > + entries[i]->swpentry = page_swap_entry(folio_page(folio, start + i));
> > > > + entries[i]->objcg = objcg;
> > > > + entries[i]->referenced = true;
> > > > + INIT_LIST_HEAD(&entries[i]->lru);
> > > > + }
> > > >
> > > > - old = xa_store(swap_zswap_tree(page_swpentry),
> > > > - swp_offset(page_swpentry),
> > > > - entry, GFP_KERNEL);
> > > > - if (xa_is_err(old)) {
> > > > - int err = xa_err(old);
> > > > + for (i = 0; i < nr_pages; ++i) {
> > > > + struct page *page = folio_page(folio, start + i);
> > > >
> > > > - WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > > > - zswap_reject_alloc_fail++;
> > > > - goto store_failed;
> > > > + if (!zswap_compress(page, entries[i], pool, folio_wb))
> > > > + goto store_pages_failed;
> > > > }
> > > >
> > > > - /*
> > > > - * We may have had an existing entry that became stale when
> > > > - * the folio was redirtied and now the new version is being
> > > > - * swapped out. Get rid of the old.
> > > > - */
> > > > - if (old)
> > > > - zswap_entry_free(old);
> > > > + for (i = 0; i < nr_pages; ++i) {
> > > > + struct zswap_entry *old, *entry = entries[i];
> > > >
> > > > - /*
> > > > - * The entry is successfully compressed and stored in the tree, there is
> > > > - * no further possibility of failure. Grab refs to the pool and objcg,
> > > > - * charge zswap memory, and increment zswap_stored_pages.
> > > > - * The opposite actions will be performed by zswap_entry_free()
> > > > - * when the entry is removed from the tree.
> > > > - */
> > > > - zswap_pool_get(pool);
> > > > - if (objcg) {
> > > > - obj_cgroup_get(objcg);
> > > > - obj_cgroup_charge_zswap(objcg, entry->length);
> > > > - }
> > > > - atomic_long_inc(&zswap_stored_pages);
> > > > - if (entry->length == PAGE_SIZE)
> > > > - atomic_long_inc(&zswap_stored_incompressible_pages);
> > > > + old = xa_store(swap_zswap_tree(entry->swpentry),
> > > > + swp_offset(entry->swpentry),
> > > > + entry, GFP_KERNEL);
> > > > + if (unlikely(xa_is_err(old))) {
> > > > + int err = xa_err(old);
> > > >
> > > > - /*
> > > > - * We finish initializing the entry while it's already in xarray.
> > > > - * This is safe because:
> > > > - *
> > > > - * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > - *
> > > > - * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > - * The publishing order matters to prevent writeback from seeing
> > > > - * an incoherent entry.
> > > > - */
> > > > - entry->pool = pool;
> > > > - entry->swpentry = page_swpentry;
> > > > - entry->objcg = objcg;
> > > > - entry->referenced = true;
> > > > - if (entry->length) {
> > > > - INIT_LIST_HEAD(&entry->lru);
> > > > - zswap_lru_add(&zswap_list_lru, entry);
> > > > + WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > > > + zswap_reject_alloc_fail++;
> > > > + /*
> > > > + * Entries up to this point have been stored in the
> > > > + * xarray. zswap_store() will erase them from the xarray
> > > > + * and call zswap_entry_free(). Local cleanup in
> > > > + * 'store_pages_failed' only needs to happen for
> > > > + * entries from [@i to @nr_pages).
> > > > + */
> > > > + store_fail_idx = i;
> > > > + goto store_pages_failed;
> > > > + }
> > > > +
> > > > + /*
> > > > + * We may have had an existing entry that became stale when
> > > > + * the folio was redirtied and now the new version is being
> > > > + * swapped out. Get rid of the old.
> > > > + */
> > > > + if (unlikely(old))
> > > > + zswap_entry_free(old);
> > > > +
> > > > + /*
> > > > + * The entry is successfully compressed and stored in the tree, there is
> > > > + * no further possibility of failure. Grab refs to the pool and objcg,
> > > > + * charge zswap memory, and increment zswap_stored_pages.
> > > > + * The opposite actions will be performed by zswap_entry_free()
> > > > + * when the entry is removed from the tree.
> > > > + */
> > >
> > > But there *is* further possibility of failure if a subsequent entry
> > > xa_store() fails, right?
> >
> > This comment is referring to this specific entry.
>
> I think it's misleading in its current form. Perhaps:
>
> The entry is successfully compressed and stored in the tree, and further
> failures will be cleaned up in zswap_store().
>
> >
> > >
> > > Seems like if xa_store() fails we do not cleanup previously charged
> > > objects, pool references, zswap_stored_pages, etc. Instead of rolling
> > > all this back on failure, can we do all the xarray stores first and only
> > > do the rest when we're at a point where no failure can happen? Would
> > > that cause a performance regression?
> >
> > It would make the code more complicated and thus cause a performance
> > regression. I have tried to balance code simplicity (which impacts performance)
> > with correctness here. The "store_fail_idx" ensures that all partial work done
> > in zswap_store_pages() for this batch is cleaned up.
> >
> > The rest is handled in zswap_store() when it sees an error returned by
> > zswap_store_pages().
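For anyone following along, my mental model of the local cleanup that
store_fail_idx guards is roughly the loop below. This is a paraphrase,
not the patch's code - zswap_free_compressed() in particular is a
stand-in name for whatever actually frees the backing object of
entries[i]->handle:

store_pages_failed:
	/*
	 * Entries before store_fail_idx are already in the xarray and get
	 * torn down by zswap_store() via zswap_entry_free(). Entries from
	 * store_fail_idx onward are only known locally, so free their
	 * compressed objects (if any) and the entries themselves here.
	 */
	for (i = store_fail_idx; i < nr_pages; ++i) {
		if (!IS_ERR_VALUE(entries[i]->handle))
			zswap_free_compressed(entries[i]); /* hypothetical helper */
		zswap_entry_cache_free(entries[i]);
	}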
>
> Right, I missed the part about zswap_store() handling this, the comment
> above confused me.