Message-ID: <CAKEwX=NkaaCkoyULr4J+5-F+L5bRhM0F8JsV2DS0Mk=xYhncRw@mail.gmail.com>
Date: Wed, 15 Oct 2025 10:04:24 -0700
From: Nhat Pham <nphamcs@...il.com>
To: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Cc: Yosry Ahmed <yosry.ahmed@...ux.dev>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>, 
	"hannes@...xchg.org" <hannes@...xchg.org>, "chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>, 
	"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com" <ryan.roberts@....com>, 
	"21cnbao@...il.com" <21cnbao@...il.com>, 
	"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>, 
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, 
	"senozhatsky@...omium.org" <senozhatsky@...omium.org>, "sj@...nel.org" <sj@...nel.org>, 
	"kasong@...cent.com" <kasong@...cent.com>, 
	"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>, 
	"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>, "davem@...emloft.net" <davem@...emloft.net>, 
	"clabbe@...libre.com" <clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>, 
	"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com" <surenb@...gle.com>, 
	"Accardi, Kristen C" <kristen.c.accardi@...el.com>, "Gomes, Vinicius" <vinicius.gomes@...el.com>, 
	"Feghali, Wajdi K" <wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>
Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a large
 folio in batches.

On Tue, Oct 14, 2025 at 8:42 PM Sridhar, Kanchana P
<kanchana.p.sridhar@...el.com> wrote:
>
>
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@...il.com>
> > Sent: Tuesday, October 14, 2025 9:35 AM
> > To: Yosry Ahmed <yosry.ahmed@...ux.dev>
> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>; linux-
> > kernel@...r.kernel.org; linux-mm@...ck.org; hannes@...xchg.org;
> > chengming.zhou@...ux.dev; usamaarif642@...il.com;
> > ryan.roberts@....com; 21cnbao@...il.com;
> > ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org;
> > senozhatsky@...omium.org; sj@...nel.org; kasong@...cent.com; linux-
> > crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> > davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> > ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> > <kristen.c.accardi@...el.com>; Gomes, Vinicius <vinicius.gomes@...el.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@...el.com>; Gopal, Vinodh
> > <vinodh.gopal@...el.com>
> > Subject: Re: [PATCH v12 22/23] mm: zswap: zswap_store() will process a
> > large folio in batches.
> >
> > On Tue, Oct 14, 2025 at 8:29 AM Yosry Ahmed <yosry.ahmed@...ux.dev>
> > wrote:
> > >
> > > [..]
> > > > > > @@ -158,6 +161,8 @@ struct zswap_pool {
> > > > > >   struct work_struct release_work;
> > > > > >   struct hlist_node node;
> > > > > >   char tfm_name[CRYPTO_MAX_ALG_NAME];
> > > > > > + u8 compr_batch_size;
> > > > > > + u8 store_batch_size;
> > > > >
> > > > > I don't think we need to store store_batch_size, seems trivial to
> > > > > calculate at store time (perhaps in a helper).
> > > > >
> > > > > Taking a step back, is there any benefit to limiting store_batch_size to
> > > > > compr_batch_size? Is there a disadvantage to using
> > > > > ZSWAP_MAX_BATCH_SIZE
> > > > > even if it's higher than the HW compression batch size?
> > > >
> > > > Thanks Yosry, for the code review comments. I had a discussion with
> > > > Barry earlier on these very same topics as a follow-up to his review
> > > > comments for v11, starting with [1]. Can you please go through the
> > > > rationale for these design choices, and let me know if you have any
> > > > questions:
> > > >
> > > > [1]: https://patchwork.kernel.org/comment/26530319/
> > >
> > > I am surprised that calculating the value in zswap_store() causes a
> > > regression, but I am fine with keeping the precalculation in this case.
> > >
> > > I think there's a bigger problem here tho, more below.
> > >
> > > > > > + */
> > > > > > +static __always_inline int zswap_entries_cache_alloc_batch(void **entries,
> > > > > > +                                                    unsigned int nr_entries,
> > > > > > +                                                    gfp_t gfp)
> > > > > > +{
> > > > > > + return kmem_cache_alloc_bulk(zswap_entry_cache, gfp, nr_entries, entries);
> > > > >
> > > > > We currently use kmem_cache_alloc_node() in zswap_entry_cache_alloc() to
> > > > > allocate the entry on the same node as the compressed page. We use
> > > > > entry_to_nid() to get the node for LRU operations.
> > > > >
> > > > > This breaks that assumption.
> > > >
> > > > You bring up a good point. I was looking at the code in slub.c and my
> > > > understanding thus far is that both bulk allocations and
> > > > kmem_cache_alloc_node() allocations are made from a per-CPU "cpu_slab"
> > > > that is allocated by SLUB.
> > > >
> > > > IIUC, the concern you are raising is that in the mainline, the entry is
> > > > allocated on the same node as the compressed page and gets added to the
> > > > LRU list of that node. IOW, the node to which the compressed page
> > > > belongs is the one to whose LRU the entry will be added.
> > > >
> > > > With this patch, with kmem_cache_alloc_bulk(), the entry will be created
> > > > on the per-CPU slab of the CPU on which zswap_store() is called and will
> > > > be added to the LRU of that per-CPU slab's NUMA node. Hence, the end
> > > > result could be that the zswap_entry for a page ends up on a different
> > > > NUMA node/memcg than the page's NUMA node.
> >
> > I think only the NUMA node is the problem, not the memcg.
> >
> > > >
> > > > This is my thinking as to how this will impact the zswap shrinker:
> > > >
> > > > 1) memcg shrinker: if the memcg the entry ends up in is on the
> > > >     zswap_list_lru, the entry will be written back.
> > > > 2) Global shrinker: will cycle through all memcgs that have pages in the
> > > >     zswap_list_lru, and the entry will be written back.
> > > >
> > > > Based on this, it is not clear to me if there is a problem, and I would
> > > > like to request that you, Nhat and others provide insights as well.
> > > >
> > > > Interestingly, most of the code in slub.c has
> > > > unlikely(!node_match(slab, node)). Does this imply some higher-level mm
> > > > slab allocation requirements?
> > > >
> > > > I am Ok with just calling zswap_entry_cache_alloc() for "nr_pages" if we
> > > > think this would be more correct.
> > >
> > > I saw your other response as well, but I think one thing is not clear
> > > here. The zswap entry will get written back "eventually", sure, but
> > > that's not the problem.
> > >
> > > If the zswap entry is on the wrong node lru, two things happen:
> > > (a) When the "right" node is under memory pressure, we cannot free this
> > >     entry by writing it back since it's not available in the lru.
> > > (b) When the "wrong" node is under memory pressure, it will potentially
> > >     writeback entries from other nodes AND report them as being freed
> > >     from this node.
> > >
> > > Both (a) and (b) cause less effective reclaim from the zswap shrinker.
> > > Additionally (b) causes the shrinker to report the wrong amount of freed
> > > memory from the node. While this may not be significant today, it's very
> > > possible that more heuristics start relying on this number in the
> > > future.
> > >
> > > I don't believe we should put zswap entries on the wrong LRU, but I will
> > > defer to Nhat for the final verdict if he has a different opinion.
> >
> > Oh shoot. Yeah I missed that part.
> >
> > In the past, we sort of did not care - zswap was very poorly designed
> > for NUMA architecture in general, and most of our test setups have
> > been single-node, so these kinds of discrepancies did not show up in
> > performance numbers.
> >
> > But we are getting more multi-node systems:
> >
> > 1. Bigger hosts (memory-wise) tend to also have more than one node.
> > It scales better that way (especially because a lot of structures and
> > locks protecting them are node-partitioned).
> >
> > 2. We have also seen different memory media that are often expressed
> > to the kernel as nodes: CXL, GPU memory, etc.
> >
> > This will necessitate tightening memory placement. We recently had to
> > fix one such issue:
> >
> > https://github.com/torvalds/linux/commit/56e5a103a721d0ef139bba7ff3d3ada6c8217d5b
> >
> > So I'm a bit nervous about this change, which will make us use the wrong
> > LRU...
> >
> > Some workarounds:
> >
> > 1. Can we squeeze an extra int field anywhere in struct zswap_entry?
> >
> > 2. Can we pump nid all the way to zswap_lru_add()?
> >
> > This is still not 100% ideal - the metadata (struct zswap_entry) will
> > still be allocated on the wrong node. But at least the data are
> > properly managed, i.e., on the right LRU.
>
> Thanks, Nhat and Yosry for the discussion. Thank you Nhat, for the
> zsmalloc change log reference and for the workarounds!
>
> Following your suggestion in (2), can we pass in the folio's nid from
> zswap_store_pages() to zswap_lru_add(), as follows:
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 263bc6d7f5c6..44665deece80 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -694,9 +694,9 @@ static inline int entry_to_nid(struct zswap_entry *entry)
>         return page_to_nid(virt_to_page(entry));
>  }
>
> -static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> +static void zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry,
> +                         int nid)
>  {
> -       int nid = entry_to_nid(entry);
>         struct mem_cgroup *memcg;
>
>         /*
> @@ -1758,7 +1758,7 @@ static bool zswap_store_pages(struct folio *folio,
>                  *    an incoherent entry.
>                  */
>                 if (likely(entry->length))
> -                       zswap_lru_add(&zswap_list_lru, entry);
> +                       zswap_lru_add(&zswap_list_lru, entry, nid);
>         }
>
>         return true;
> --
>
> I believe this will add the entry to the LRU of the node of the folio
> being compressed. If so, we may be able to avoid adding an int field to
> struct zswap_entry?

Hmm, that might not work for zswap_lru_del() :(

zswap_entry_free() might be called in contexts where we do not have
access to the node information, e.g. from zswap_load().
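
For reference, a rough sketch of option (1) from my earlier mail is
below. The field name, its placement, and the struct layout here are
only illustrative (not from the actual series), and whether we can
actually spare the space in struct zswap_entry is exactly the open
question:

/*
 * Illustrative only: record the origin folio's node in the entry at
 * store time, so both zswap_lru_add() and zswap_lru_del() can find
 * the right LRU regardless of where the entry itself was allocated.
 */
struct zswap_entry {
	swp_entry_t swpentry;
	unsigned int length;
	int nid;		/* node of the origin folio */
	struct zswap_pool *pool;
	unsigned long handle;
	struct obj_cgroup *objcg;
	struct list_head lru;
};

static inline int entry_to_nid(struct zswap_entry *entry)
{
	/* was: page_to_nid(virt_to_page(entry)) */
	return entry->nid;
}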

Another alternative: can we instead determine the node from the
compressed object's storage? i.e store zswap_entry in the LRU
corresponding to the node that holds the compressed data?

You'll probably need the new zsmalloc API to get the node information.
And can zsmalloc migrate a backing page to a different node? This
seems complicated...
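
To make that concrete, it would have to look roughly like the sketch
below. zs_get_object_nid() is entirely hypothetical (no such zsmalloc
helper exists as far as I know), so treat this purely as an
illustration of the idea:

/*
 * Hypothetical zsmalloc helper: return the node of the page currently
 * backing @handle. This does not exist today.
 */
int zs_get_object_nid(struct zs_pool *pool, unsigned long handle);

zswap_lru_add()/zswap_lru_del() would then look the node up from
entry->handle instead of calling entry_to_nid(entry). But the moment
zsmalloc migrates or compacts the backing page to a different node,
the entry would be sitting on a stale LRU, which is part of why this
seems complicated.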

Taking a step back though, do we have to use the bulk allocation API
here? Calling the single-allocation version 512 times for a PMD-sized
page is no worse than the status quo, correct? We can leave this part
unoptimized for now - in the future, if the use case justifies it, we
can talk to slab allocator maintainers and ask for guidance on a
lock-optimized cross-node bulk allocation API.
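
In other words, something like the sketch below in zswap_store_pages()
(just a sketch: error handling is abbreviated, and entries[], nr_pages,
nid and gfp are assumed to be whatever the series already has in
scope):

	for (i = 0; i < nr_pages; i++) {
		/* allocate each entry on the folio's node, like the single-page path today */
		entries[i] = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
		if (unlikely(!entries[i])) {
			while (i--)
				kmem_cache_free(zswap_entry_cache, entries[i]);
			return false;
		}
	}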
