Message-ID: <wn6pnt2nm5pguceiquctilul6lrk3ckycypwnw2zfxqrlgi2cf@5qn32ewswfl7>
Date: Wed, 15 Oct 2025 22:24:17 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Cc: Nhat Pham <nphamcs@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
"hannes@...xchg.org" <hannes@...xchg.org>, "chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com" <ryan.roberts@....com>,
"21cnbao@...il.com" <21cnbao@...il.com>, "ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "senozhatsky@...omium.org" <senozhatsky@...omium.org>,
"sj@...nel.org" <sj@...nel.org>, "kasong@...cent.com" <kasong@...cent.com>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>, "herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com" <clabbe@...libre.com>,
"ardb@...nel.org" <ardb@...nel.org>, "ebiggers@...gle.com" <ebiggers@...gle.com>,
"surenb@...gle.com" <surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
"Gomes, Vinicius" <vinicius.gomes@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
"Gopal, Vinodh" <vinodh.gopal@...el.com>
Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a large
folio in batches.
On Wed, Oct 15, 2025 at 10:15:12PM +0000, Sridhar, Kanchana P wrote:
>
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@...il.com>
> > Sent: Wednesday, October 15, 2025 10:04 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > Cc: Yosry Ahmed <yosry.ahmed@...ux.dev>; linux-kernel@...r.kernel.org;
> > linux-mm@...ck.org; hannes@...xchg.org; chengming.zhou@...ux.dev;
> > usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> > ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> > senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> > crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> > davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> > ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> > <kristen.c.accardi@...el.com>; Gomes, Vinicius <vinicius.gomes@...el.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> > <vinodh.gopal@...el.com>
> > Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> > large folio in batches.
> >
> > On Tue, Oct 14, 2025 at 8:42 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@...el.com> wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Nhat Pham <nphamcs@...il.com>
> > > > Sent: Tuesday, October 14, 2025 9:35 AM
> > > > To: Yosry Ahmed <yosry.ahmed@...ux.dev>
> > > > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>; linux-
> > > > kernel@...r.kernel.org; linux-mm@...ck.org; hannes@...xchg.org;
> > > > chengming.zhou@...ux.dev; usamaarif642@...il.com;
> > > > ryan.roberts@....com; 21cnbao@...il.com;
> > > > ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> > > > senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> > > > crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> > > > davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> > > > ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> > > > <kristen.c.accardi@...el.com>; Gomes, Vinicius
> > > > <vinicius.gomes@...el.com>;
> > > > Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> > > > <vinodh.gopal@...el.com>
> > > > Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> > > > large folio in batches.
> > > >
> > > > On Tue, Oct 14, 2025 at 8:29 AM Yosry Ahmed <yosry.ahmed@...ux.dev>
> > > > wrote:
> > > > >
> > > > > [..]
> > > > > > > > @@ -158,6 +161,8 @@ struct zswap_pool {
> > > > > > > >  	struct work_struct release_work;
> > > > > > > >  	struct hlist_node node;
> > > > > > > >  	char tfm_name[CRYPTO_MAX_ALG_NAME];
> > > > > > > > +	u8 compr_batch_size;
> > > > > > > > +	u8 store_batch_size;
> > > > > > >
> > > > > > > I don't think we need to store store_batch_size, seems trivial to
> > > > > > > calculate at store time (perhaps in a helper).
> > > > > > >
> > > > > > > Taking a step back, is there any benefit to limiting
> > > > > > > store_batch_size to compr_batch_size? Is there a disadvantage to
> > > > > > > using ZSWAP_MAX_BATCH_SIZE even if it's higher than the HW
> > > > > > > compression batch size?
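> > > > > > >
> > > > > > > Purely illustrative (a hypothetical helper; the exact policy is
> > > > > > > what's being discussed here):
> > > > > > >
> > > > > > >	static inline u8 zswap_store_batch_size(struct zswap_pool *pool)
> > > > > > >	{
> > > > > > >		/*
> > > > > > >		 * Batching compressors use their HW batch size;
> > > > > > >		 * otherwise allow up to ZSWAP_MAX_BATCH_SIZE per batch.
> > > > > > >		 */
> > > > > > >		return pool->compr_batch_size > 1 ?
> > > > > > >		       pool->compr_batch_size : ZSWAP_MAX_BATCH_SIZE;
> > > > > > >	}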
> > > > > >
> > > > > > Thanks, Yosry, for the code review comments. I had a discussion
> > > > > > with Barry earlier on these very same topics as a follow-up to his
> > > > > > review comments for v11, starting with [1]. Can you please go
> > > > > > through the rationale for these design choices, and let me know if
> > > > > > you have any questions:
> > > > > >
> > > > > > [1]: https://patchwork.kernel.org/comment/26530319/
> > > > >
> > > > > I am surprised that calculating the value in zswap_store() causes a
> > > > > regression, but I am fine with keeping the precalculation in this case.
> > > > >
> > > > > I think there's a bigger problem here tho, more below.
> > > > >
> > > > > > > > + */
> > > > > > > > +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> > > > > > > > +							    unsigned int nr_entries,
> > > > > > > > +							    gfp_t gfp)
> > > > > > > > +{
> > > > > > > > +	return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
> > > > > > >
> > > > > > > We currently use kmem_cache_alloc_node() in
> > > > > > > zswap_entry_cache_alloc() to allocate the entry on the same node
> > > > > > > as the compressed page. We use entry_to_nid() to get the node for
> > > > > > > LRU operations.
> > > > > > >
> > > > > > > This breaks that assumption.
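> > > > > > >
> > > > > > > For reference, the current helpers are roughly:
> > > > > > >
> > > > > > >	static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
> > > > > > >	{
> > > > > > >		return kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
> > > > > > >	}
> > > > > > >
> > > > > > >	static inline int entry_to_nid(struct zswap_entry *entry)
> > > > > > >	{
> > > > > > >		/* The entry's node doubles as the LRU node. */
> > > > > > >		return page_to_nid(virt_to_page(entry));
> > > > > > >	}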
> > > > > >
> > > > > > You bring up a good point. I was looking at the code in slub.c and
> > > > > > my understanding thus far is that both bulk allocations and
> > > > > > kmem_cache_alloc_node() allocations are made from a per-CPU
> > > > > > "cpu_slab" that is allocated by SLUB.
> > > > > >
> > > > > > IIUC, the concern you are raising is that in the mainline, the
> > > > > > entry is allocated on the same node as the compressed page, and
> > > > > > gets added to the LRU list of that node. IOW, the node to which the
> > > > > > compressed page belongs is the one to whose LRU the entry will be
> > > > > > added.
> > > > > >
> > > > > > With this patch, with kmem_cache_alloc_bulk(), the entry will be
> > > > > > created on the per-CPU slab of the CPU on which zswap_store() is
> > > > > > called, and will be added to the LRU of that per-CPU slab's NUMA
> > > > > > node. Hence, the end result could potentially be that the
> > > > > > zswap_entry for a page is on a different NUMA node/memcg than the
> > > > > > page's NUMA node.
> > > >
> > > > I think only the NUMA node is the problem, not the memcg.
> > > >
> > > > > >
> > > > > > This is my thinking as to how this will impact the zswap shrinker:
> > > > > >
> > > > > > 1) memcg shrinker: if the memcg the entry ends up in is on the
> > > > > > zswap_list_lru, the entry will be written back.
> > > > > > 2) Global shrinker: will cycle through all memcgs that have pages
> > > > > > in the zswap_list_lru, and the entry will be written back.
> > > > > >
> > > > > > Based on this, it is not clear to me if there is a problem, and I
> > > > > > would like to request that you, Nhat, and others provide insights
> > > > > > as well.
> > > > > >
> > > > > > Interestingly, most of the code in slub.c has
> > > > > > unlikely(!node_match(slab, node)). Does this imply some
> > > > > > higher-level mm slab allocation requirements?
> > > > > >
> > > > > > I am OK with just calling zswap_entry_cache_alloc() for "nr_pages"
> > > > > > if we think this would be more correct.
> > > > >
> > > > > I saw your other response as well, but I think one thing is not clear
> > > > > here. The zswap entry will get written back "eventually", sure, but
> > > > > that's not the problem.
> > > > >
> > > > > If the zswap entry is on the wrong node lru, two things happen:
> > > > > (a) When the "right" node is under memory pressure, we cannot free
> > > > > this entry by writing it back since it's not available in the lru.
> > > > > (b) When the "wrong" node is under memory pressure, it will
> > > > > potentially write back entries from other nodes AND report them as
> > > > > being freed from this node.
> > > > >
> > > > > Both (a) and (b) cause less effective reclaim from the zswap shrinker.
> > > > > Additionally (b) causes the shrinker to report the wrong amount of freed
> > > > > memory from the node. While this may not be significant today, it's very
> > > > > possible that more heuristics start relying on this number in the
> > > > > future.
> > > > >
> > > > > I don't believe we should put zswap entries on the wrong LRU, but I will
> > > > > defer to Nhat for the final verdict if he has a different opinion.
> > > >
> > > > Oh shoot. Yeah I missed that part.
> > > >
> > > > In the past, we sort of did not care - zswap was very poorly designed
> > > > for NUMA architecture in general, and most of our test setups have
> > > > been single-node, so these kinds of discrepancies did not show up in
> > > > performance numbers.
> > > >
> > > > But we are getting more multi-node systems:
> > > >
> > > > 1. Bigger hosts (memory-wise) tend to also have more than one node.
> > > > It scales better that way (especially because a lot of structures and
> > > > locks protecting them are node-partitioned).
> > > >
> > > > 2. We have also seen different memory media that are often expressed
> > > > to the kernel as nodes: CXL, GPU memory, etc.
> > > >
> > > > This will necessitate tightening memory placement. We recently had to
> > > > fix one such issue:
> > > >
> > > > https://github.com/torvalds/linux/commit/56e5a103a721d0ef139bba7ff3d3ada6c8217d5b
> > > >
> > > > So I'm a bit nervous about this change, which will make us use the wrong
> > > > LRU...
> > > >
> > > > Some work around:
> > > >
> > > > 1. Can we squeeze an extra int field anywhere in struct zswap_entry?
> > > >
> > > > 2. Can we pump nid all the way to zswap_lru_add()?
> > > >
> > > > This is still not 100% ideal - the metadata (struct zswap_entry) will
> > > > still be allocated on the wrong node. But at least the data are
> > > > properly managed, i.e. on the right LRU.
> > >
> > > Thanks, Nhat and Yosry for the discussion. Thank you Nhat, for the
> > > zsmalloc change log reference and for the work arounds!
> > >
> > > Following your suggestion in (2), can we pass in the folio's nid from
> > > zswap_store_pages() to zswap_lru_add(), as follows:
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 263bc6d7f5c6..44665deece80 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -694,9 +694,9 @@ static inline int entry_to_nid(struct zswap_entry *entry)
> > >  	return page_to_nid(virt_to_page(entry));
> > >  }
> > >
> > > -static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> > > +static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry,
> > > +			  int nid)
> > >  {
> > > -	int nid = entry_to_nid(entry);
> > >  	struct mem_cgroup *memcg;
> > >
> > >  	/*
> > > @@ -1758,7 +1758,7 @@ static bool zswap_store_pages(struct folio *folio,
> > >  		 * an incoherent entry.
> > >  		 */
> > >  		if (likely(entry->length))
> > > -			zswap_lru_add(&zswap_list_lru, entry);
> > > +			zswap_lru_add(&zswap_list_lru, entry, nid);
> > >  	}
> > >
> > >  	return true;
> > > --
> > >
> > > I believe this will add the entry to the LRU of the node of the folio
> > > being compressed. If so, we may be able to avoid adding an int field to
> > > struct zswap_entry?
> >
> > Hmm that might not work for zswap_lru_del() :(
> >
> > zswap_entry_free() might be called in a context where we do not have
> > access to the node information, e.g. zswap_load().
>
> I was thinking that zswap_lru_del() would follow a similar approach,
> i.e., zswap_load() would pass the folio's nid to zswap_lru_del(), but
> you're right, this would not work if the process faulting in the page
> is running on a different node than the one that stored the page.
>
> >
> > Another alternative: can we instead determine the node from the
> > compressed object's storage? I.e., store the zswap_entry in the LRU
> > corresponding to the node that holds the compressed data?
> >
> > You'll probably need the new zsmalloc API to get the node information.
> > And can zsmalloc migrate a backing page to a different node? This
> > seems complicated...
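> >
> > Purely as a sketch (zs_obj_get_nid() is a hypothetical API that
> > zsmalloc would need to grow; field names are from memory):
> >
> >	static inline int entry_to_nid(struct zswap_entry *entry)
> >	{
> >		/* Node of the zspage holding the compressed data. */
> >		return zs_obj_get_nid(entry->pool->zs_pool, entry->handle);
> >	}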
>
> That's a great idea! It might be worth exploring if our goal is to maintain
> parity with the current status for nodes/LRU/shrinker.
>
> Good point about zsmalloc migrating a backing page to a different node:
> although wouldn't this be a problem with the current status quo also?
>
> To summarize my understanding: the current approach ensures that the
> NUMA node in which the page was allocated is the one that will hold the
> compressed data (the zsmalloc commit log you shared), and is the node
> which, under memory pressure, will cause the entry to be written back.
>
> The entry being allocated on the same NUMA node as the page being
> stored in zswap is, imho, a "mechanism" to achieve the above. When the
> page is faulted in, it is possible that the process has migrated to a different
> node, and the folio is now assigned a different nid. IOW, there is no more
> significance to the entry's nid than to facilitate the current approach, IIUC.
>
> I think your suggestion (1) wherein we store the NUMA node as an int field
> in the entry can accomplish the same thing. The entry doesn't have to be
> allocated on the same node as the page being stored in zswap: we could
> let the slab allocator decide this (potentially more optimal system-wide?).
>
> The entry int field could also be more fail-safe than looking up the
> zsmalloc node info (zsmalloc could have migrated the compressed zspage
> [still need to verify]).
>
> I think the entry int field might also be cleaner with changes encapsulated
> to zswap_lru_add()/del(). If we rely on zsmalloc node derivation, it might
> require changes in zswap_decompress(). The downside is we add an int
> member to the zswap_entry.
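>
> Roughly what I have in mind (untested; assumes an int/short nid field
> is added to struct zswap_entry and set in zswap_store_pages()):
>
>	static inline int entry_to_nid(struct zswap_entry *entry)
>	{
>		/* Node of the original folio, recorded at store time. */
>		return entry->nid;
>	}
>
> so zswap_lru_add()/zswap_lru_del() would not even need signature
> changes.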
>
> If it's OK with you, can I evaluate the feasibility of (1) and update shortly
> after gathering data with usemem30 and kernel_compilation?
> I am trying to avoid the latency penalty of not using the bulk allocation
> API, and at the same time, ensure we don't change the NUMA node/LRU
> lists/shrinker functionality. Based on the data I had gathered recently
> in [1], reverting to use kmem_cache_alloc_node() for the batch in
> zswap_store_pages() impacts latency considerably.
>
> [1] https://patchwork.kernel.org/comment/26590517/
>
> >
> > Taking a step back though, do we have to use the bulk allocation API
> > here? Calling the single-allocation version 512 times for a PMD-sized
> > page is no worse than the status quo, correct? We can leave this part
> > unoptimized for now - in the future, if the use case justifies it, we
> > can talk to slab allocator maintainers and ask for guidance on a
> > lock-optimized cross-node bulk allocation API.
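> >
> > Roughly (untested, reusing the existing single-entry helpers):
> >
> >	for (i = 0; i < nr_pages; i++) {
> >		entries[i] = zswap_entry_cache_alloc(GFP_KERNEL, nid);
> >		if (unlikely(!entries[i])) {
> >			/* Unwind the partial batch on failure. */
> >			while (i--)
> >				zswap_entry_cache_free(entries[i]);
> >			return false;
> >		}
> >	}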
>
> Definitely. This is a sound fallback strategy if (1) doesn't work out, or
> if, even though it does, we feel that adding an int field to the metadata
> is not acceptable/needed. I will make sure to share before/after data with
> usemem30 and kernel_compilation with the different options (the
> int field in zswap_entry, bulk vs. single allocation).
I am against increasing the size of struct zswap_entry. On x86_64, there
is a 3-byte hole after 'referenced'. We can technically use that,
although the node id is usually an int, which is 4 bytes on x86_64.
In practice, I think 2 bytes (i.e. short) should be enough, but it will
be ugly to cast the node id to a short. We should at least WARN on
overflow.
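
Something like this, roughly (untested):

	/* In struct zswap_entry, filling the hole after 'referenced': */
	u16 nid;

	/* At store time: */
	WARN_ON_ONCE(nid > U16_MAX);
	entry->nid = (u16)nid;
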
Or we can take the simple route and drop the bulk allocation.