Message-ID: <Z9nECMZW67F8XYoV@google.com>
Date: Tue, 18 Mar 2025 19:05:44 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"hannes@...xchg.org" <hannes@...xchg.org>,
	"nphamcs@...il.com" <nphamcs@...il.com>,
	"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
	"usamaarif642@...il.com" <usamaarif642@...il.com>,
	"ryan.roberts@....com" <ryan.roberts@....com>,
	"21cnbao@...il.com" <21cnbao@...il.com>,
	"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
	"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"clabbe@...libre.com" <clabbe@...libre.com>,
	"ardb@...nel.org" <ardb@...nel.org>,
	"ebiggers@...gle.com" <ebiggers@...gle.com>,
	"surenb@...gle.com" <surenb@...gle.com>,
	"Accardi, Kristen C" <kristen.c.accardi@...el.com>,
	"Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
	"Gopal, Vinodh" <vinodh.gopal@...el.com>
Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
 allocation/deletion and mutex lock usage.

On Tue, Mar 18, 2025 at 05:38:49PM +0000, Sridhar, Kanchana P wrote:
[..]
> > > Thanks Yosry, for this observation! You are right, when considered purely
> > > from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
> > > zswap_cpu_comp_dead() in fact run on a control CPU, because the state is
> > > registered in the PREPARE section of "enum cpuhp_state" in cpuhotplug.h.
> > >
> > > The problem, however, is that in the current architecture, CPU onlining/
> > > zswap_pool creation, and CPU offlining/zswap_pool deletion have the
> > > same semantics as far as these resources are concerned. Hence, although
> > > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> > > the "hotplug" code is called is in fact online. It is possible for the memory
> > > allocation calls in zswap_cpu_comp_prepare() to recurse into
> > > zswap_compress(), which now needs to be handled by the current pool,
> > > since the new pool has not yet been added to the zswap_pools, as you
> > > pointed out.
> > >
> > > The ref on the current pool has not yet been dropped. Could there be
> > > a potential for a deadlock at pool transition time: the new pool is blocked
> > > from allocating acomp_ctx resources, triggering reclaim, which the old
> > > pool needs to handle?
> > 
> > I am not sure how this could lead to a deadlock. The compression will be
> > happening in a different pool with a different acomp_ctx.
> 
> I was thinking about this from the perspective of comparing the trade-offs
> between these two approaches:
> a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs.
> b) Allocating acomp_ctx resources once upfront.
> 
> With (a), when the user switches zswap to use a new compressor, it is possible
> that the system is already in a low memory situation and the CPU could be doing
> a lot of swapouts. It occurred to me that, in theory, the call to switch the
> compressor through the sysfs interface might never return if the acomp_ctx
> allocations trigger direct reclaim in this scenario. This was in the context of
> exploring if a better design is possible, while acknowledging that this could still
> happen today.

If the system is already in a low memory situation, a lot of things will
hang. Switching the compressor is not a common operation at all, and we
shouldn't really worry about it. Even if we remove the acomp_ctx
allocation, we still need to make some allocations in that path anyway.
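
For reference, this is roughly what the switch path already does today;
a simplified sketch from memory of zswap_pool_create() in mm/zswap.c
(zpool creation, refcount/shrinker setup and most error handling are
omitted, and details vary by kernel version):

static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
{
	struct zswap_pool *pool;
	int ret;

	/* GFP_KERNEL allocations below can already enter direct reclaim. */
	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
	if (!pool)
		return NULL;

	strscpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));

	/* Per-CPU acomp_ctx area for this pool. */
	pool->acomp_ctx = alloc_percpu(*pool->acomp_ctx);
	if (!pool->acomp_ctx)
		goto error;

	/*
	 * The per-CPU reqs/buffers themselves are allocated by
	 * zswap_cpu_comp_prepare() for each online CPU via this call.
	 */
	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
	if (ret)
		goto error;

	return pool;

error:
	free_percpu(pool->acomp_ctx);
	kfree(pool);
	return NULL;
}

So the path can trigger reclaim with or without the acomp_ctx piece.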

> 
> With (b), this situation is avoided by design, and we can switch to a new pool
> without triggering additional reclaim. Sorry, I should have articulated this better.

But we have to either allocate more memory unnecessarily or add config
options and make batching a build-time decision. This is unwarranted
imo.

FWIW, the mutexes and buffers used to be per-CPU not per-acomp_ctx, but
they were changed in commit 8ba2f844f050 ("mm/zswap: change per-cpu
mutex and buffer to per-acomp_ctx"). What you're suggesting is not quite
the same as what we had before that commit; it moves the acomp_ctx
itself, including the mutex and buffer, to be per-CPU but not per-pool.
Still, I thought the context may be useful.
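
To make the comparison concrete, a rough sketch of the two layouts
(fields abridged from mm/zswap.c; the global variable name below is
made up for illustration):

/* What exists today: one of these per pool, per CPU. */
struct crypto_acomp_ctx {
	struct crypto_acomp *acomp;
	struct acomp_req *req;
	u8 *buffer;
	struct mutex mutex;
};

struct zswap_pool {
	/* ... */
	struct crypto_acomp_ctx __percpu *acomp_ctx;	/* per pool, per CPU */
};

/*
 * What the suggestion amounts to: a single per-CPU acomp_ctx (mutex
 * and buffer included) shared by all pools, allocated once, e.g. in
 * zswap_setup(), rather than at pool creation.
 */
static struct crypto_acomp_ctx __percpu *zswap_acomp_ctx;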

[..]
> > > 7) To address the issue of how many reqs/buffers to allocate, there could
> > >      potentially be a memory cost for non-batching compressors, if we want
> > >      to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers.
> > >      This would allow the acomp_ctx to seamlessly handle batching
> > >      compressors, non-batching compressors, and transitions between the
> > >      two compressor types in a fairly general manner that relies only on
> > >      ZSWAP_MAX_BATCH_SIZE, which we define anyway.
> > >
> > >      I believe we can maximize the chances of success for the allocation of
> > >      the acomp_ctx resources if this is done in zswap_setup(), but please
> > >      correct me if I am wrong.
> > >
> > >      The added memory cost for platforms without IAA would be
> > >      ~57 KB per CPU on x86. Would this be acceptable?
> > 
> > I think that's a lot of memory to waste per-CPU, and I don't see a good
> > reason for it.
> 
> Yes, it appears so. In trying to see whether a better design is possible
> by decoupling the acomp_ctx from the zswap_pool:
> Would this cost be acceptable if it is incurred based on a build config
> option, saying CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default OFF)?
> If this is set, we go ahead and allocate ZSWAP_MAX_BATCH_SIZE
> acomp_ctx resources once, during zswap_setup(). If not, we allocate
> only one req/buffer in the global percpu acomp_ctx? 

We should avoid making batching a build-time decision if we can help it.
A lot of kernels are shipped to different hardware that may or may not
support batching, so users will have to either decide to turn off
batching completely or eat the overhead even for hardware that does not
support batching (or for users that use SW compression).
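
In other words, keeping it a runtime decision per pool seems
preferable. Very roughly (pool_supports_batching() is a made-up
placeholder for however batching support ends up being detected, and
reqs[]/buffers[]/nr_reqs only approximate the fields added by this
series):

static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
{
	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
	struct crypto_acomp_ctx *ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
	/* Size the allocation from the compressor actually in use. */
	unsigned int nr = pool_supports_batching(pool) ? ZSWAP_MAX_BATCH_SIZE : 1;
	unsigned int i;

	ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
	if (IS_ERR(ctx->acomp))
		return PTR_ERR(ctx->acomp);

	ctx->reqs = kcalloc_node(nr, sizeof(*ctx->reqs), GFP_KERNEL, cpu_to_node(cpu));
	ctx->buffers = kcalloc_node(nr, sizeof(*ctx->buffers), GFP_KERNEL, cpu_to_node(cpu));
	if (!ctx->reqs || !ctx->buffers)
		return -ENOMEM;		/* real code would unwind properly */

	for (i = 0; i < nr; i++) {
		ctx->reqs[i] = acomp_request_alloc(ctx->acomp);
		ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
					       cpu_to_node(cpu));
		if (!ctx->reqs[i] || !ctx->buffers[i])
			return -ENOMEM;	/* ditto */
	}
	ctx->nr_reqs = nr;
	return 0;
}

Non-batching compressors (and plain SW compression) then keep paying
for a single req/buffer per CPU, as they do today.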

[..] 
> > >
> > >     The only other fallback solution in lieu of the proposed simplification that
> > >     I can think of is to keep the lifespan of these resources from pool creation
> > >     to deletion, using the CPU hotplug code. Although it is not totally clear
> > >     to me if there is potential for deadlock during pool transitions, as noted
> > >     above.
> > 
> > I am not sure what's the deadlock scenario you're worried about, please
> > clarify.
> 
> My apologies: I was referring to triggering direct reclaim during pool creation,
> which could, in theory, run into a scenario where the pool switching would
> have to wait for reclaim to free up enough memory for the acomp_ctx
> resource allocation. This could cause the system to hang, but it is not a deadlock.
> This can happen even today, hence I am trying to see if a better design is possible.

I think the concern here is unfounded. We shouldn't care about the
performance of zswap compressor switching, especially under memory
pressure. A lot of things will slow down under memory pressure, and
compressor switching should be the least of our concerns.
