Message-ID: <SA3PR11MB81206531E9B3C7F13F5740A2C9DE2@SA3PR11MB8120.namprd11.prod.outlook.com>
Date: Tue, 18 Mar 2025 17:38:49 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "hannes@...xchg.org"
<hannes@...xchg.org>, "nphamcs@...il.com" <nphamcs@...il.com>,
"chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "ryan.roberts@....com"
<ryan.roberts@....com>, "21cnbao@...il.com" <21cnbao@...il.com>,
"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"linux-crypto@...r.kernel.org" <linux-crypto@...r.kernel.org>,
"herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"davem@...emloft.net" <davem@...emloft.net>, "clabbe@...libre.com"
<clabbe@...libre.com>, "ardb@...nel.org" <ardb@...nel.org>,
"ebiggers@...gle.com" <ebiggers@...gle.com>, "surenb@...gle.com"
<surenb@...gle.com>, "Accardi, Kristen C" <kristen.c.accardi@...el.com>,
"Feghali, Wajdi K" <wajdi.k.feghali@...el.com>, "Gopal, Vinodh"
<vinodh.gopal@...el.com>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
allocation/deletion and mutex lock usage.
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> Sent: Tuesday, March 18, 2025 7:24 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> hannes@...xchg.org; nphamcs@...il.com; chengming.zhou@...ux.dev;
> usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org; linux-
> crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> <kristen.c.accardi@...el.com>; Feghali, Wajdi K <wajdi.k.feghali@...el.com>;
> Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> allocation/deletion and mutex lock usage.
>
> On Mon, Mar 17, 2025 at 09:15:09PM +0000, Sridhar, Kanchana P wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosry.ahmed@...ux.dev>
> > > Sent: Monday, March 10, 2025 10:31 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>
> > > Cc: linux-kernel@...r.kernel.org; linux-mm@...ck.org;
> > > hannes@...xchg.org; nphamcs@...il.com; chengming.zhou@...ux.dev;
> > > usamaarif642@...il.com; ryan.roberts@....com; 21cnbao@...il.com;
> > > ying.huang@...ux.alibaba.com; akpm@...ux-foundation.org; linux-
> > > crypto@...r.kernel.org; herbert@...dor.apana.org.au;
> > > davem@...emloft.net; clabbe@...libre.com; ardb@...nel.org;
> > > ebiggers@...gle.com; surenb@...gle.com; Accardi, Kristen C
> > > <kristen.c.accardi@...el.com>; Feghali, Wajdi K <wajdi.k.feghali@...el.com>;
> > > Gopal, Vinodh <vinodh.gopal@...el.com>
> > > Subject: Re: [PATCH v8 12/14] mm: zswap: Simplify acomp_ctx resource
> > > allocation/deletion and mutex lock usage.
> > >
> > > On Sat, Mar 08, 2025 at 02:47:15AM +0000, Sridhar, Kanchana P wrote:
> > > >
> > > [..]
> > > > > > > > u8 *buffer;
> > > > > > > > + u8 nr_reqs;
> > > > > > > > + struct crypto_wait wait;
> > > > > > > > struct mutex mutex;
> > > > > > > > bool is_sleepable;
> > > > > > > > + bool __online;
> > > > > > >
> > > > > > > I don't believe we need this.
> > > > > > >
> > > > > > > If we are not freeing resources during CPU offlining, then we do
> > > > > > > not need a CPU offline callback and acomp_ctx->__online serves no
> > > > > > > purpose.
> > > > > > >
> > > > > > > The whole point of synchronizing between offlining and
> > > > > > > compress/decompress operations is to avoid UAF. If offlining does
> > > > > > > not free resources, then we can hold the mutex directly in the
> > > > > > > compress/decompress path and drop the hotunplug callback
> > > > > > > completely.
> > > > > > >
> > > > > > > I also believe nr_reqs can be dropped from this patch, as it
> > > > > > > seems like it's only used to know when to set __online.
> > > > > >
> > > > > > All great points! In fact, that was the original solution I had
> > > > > > implemented (not having an offline callback). But then, I spent
> > > > > > some time understanding the v6.13 hotfix for synchronizing freeing
> > > > > > of resources, and this comment in zswap_cpu_comp_prepare():
> > > > > >
> > > > > > /*
> > > > > >  * Only hold the mutex after completing allocations, otherwise we
> > > > > >  * may recurse into zswap through reclaim and attempt to hold the
> > > > > >  * mutex again resulting in a deadlock.
> > > > > >  */
> > > > > >
> > > > > > Hence, I figured the constraint of "recurse into zswap through
> > > > > > reclaim" was something to comprehend in the simplification (even
> > > > > > though I had a tough time imagining how this could happen).
> > > > >
> > > > > The constraint here is about zswap_cpu_comp_prepare() holding the
> > > > > mutex, making an allocation which internally triggers reclaim, then
> > > > > recursing into zswap and trying to hold the same mutex again, causing
> > > > > a deadlock.
> > > > >
> > > > > If zswap_cpu_comp_prepare() does not need to hold the mutex to begin
> > > > > with, the constraint naturally goes away.
> > > >
> > > > Actually, if it is possible for the allocations in
> > > > zswap_cpu_comp_prepare() to trigger reclaim, then I believe we need
> > > > some way for reclaim to know if the acomp_ctx resources are available.
> > > > Hence, this seems like a potential deadlock regardless of the mutex.
> > >
> > > I took a closer look and I believe my hotfix was actually unnecessary. I
> > > sent it out in response to a syzbot report, but upon closer look it
> > > seems like it was not an actual problem. Sorry if my patch confused you.
> > >
> > > Looking at enum cpuhp_state in include/linux/cpuhotplug.h, it seems like
> > > CPUHP_MM_ZSWP_POOL_PREPARE is in the PREPARE section. The comment above
> > > says:
> > >
> > > * PREPARE: The callbacks are invoked on a control CPU before the
> > > * hotplugged CPU is started up or after the hotplugged CPU has died.
> > >
> > > So even if we go into reclaim during zswap_cpu_comp_prepare(), it will
> > > never be on the CPU that we are allocating resources for.
> > >
> > > The other case where zswap_cpu_comp_prepare() could race with
> > > compression/decompression is when a pool is being created. In this case,
> > > reclaim from zswap_cpu_comp_prepare() can recurse into zswap on the same
> > > CPU AFAICT. However, because the pool is still under creation, it will
> > > not be used (i.e. zswap_pool_current_get() won't find it).
> > >
> > > So I think we don't need to worry about zswap_cpu_comp_prepare() racing
> > > with compression or decompression for the same pool and CPU.
> >
> > Thanks Yosry, for this observation! You are right, when considered purely
> > from a CPU hotplug perspective, zswap_cpu_comp_prepare() and
> > zswap_cpu_comp_dead() in fact run on a control CPU, because the state is
> > registered in the PREPARE section of "enum cpuhp_state" in cpuhotplug.h.
> >
> > The problem, however, is that in the current architecture, CPU onlining/
> > zswap_pool creation, and CPU offlining/zswap_pool deletion have the
> > same semantics as far as these resources are concerned. Hence, although
> > zswap_cpu_comp_prepare() is run on a control CPU, the CPU for which
> > the "hotplug" code is called is in fact online. It is possible for the memory
> > allocation calls in zswap_cpu_comp_prepare() to recurse into
> > zswap_compress(), which now needs to be handled by the current pool,
> > since the new pool has not yet been added to the zswap_pools, as you
> > pointed out.
> >
> > The ref on the current pool has not yet been dropped. Could there be
> > a potential for a deadlock at pool transition time: the new pool is blocked
> > from allocating acomp_ctx resources, triggering reclaim, which the old
> > pool needs to handle?
>
> I am not sure how this could lead to a deadlock. The compression will be
> happening in a different pool with a different acomp_ctx.
I was thinking about this from the perspective of comparing the trade-offs
between these two approaches:
a) Allocating acomp_ctx resources for a pool when a CPU is functional, vs.
b) Allocating acomp_ctx resources once upfront.
With (a), when the user switches zswap to a new compressor, it is possible
that the system is already in a low-memory situation and the CPU could be
doing a lot of swapouts. It occurred to me that, in theory, the call to
switch the compressor through the sysfs interface might never return if the
acomp_ctx allocations trigger direct reclaim in this scenario. This was in
the context of exploring whether a better design is possible, while
acknowledging that this could still happen today.
With (b), this situation is avoided by design, and we can switch to a new
pool without triggering additional reclaim. Sorry, I should have
articulated this better.
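As a concrete illustration, here is a minimal sketch of the (a) path,
simplified from today's zswap_cpu_comp_prepare() (field names are from the
current zswap.c; error handling is omitted, so this is not the actual
code). The GFP_KERNEL allocations below are where direct reclaim can kick
in:

	/* Simplified sketch of (a); error handling omitted. */
	static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
	{
		struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
		struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);

		/* Each GFP_KERNEL allocation here may enter direct reclaim. */
		acomp_ctx->acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0,
							   cpu_to_node(cpu));
		acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
		acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
						 cpu_to_node(cpu));
		return 0;
	}

If that reclaim has to be serviced by the old pool while we are in the
middle of switching to the new one, the sysfs write is stuck behind it.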
>
> >
> > I see other places in the kernel that use CPU hotplug for resource allocation,
> > outside of the context of CPU onlining. IIUC, it is difficult to guarantee that
> > the startup/teardown callbacks are modifying acomp_ctx resources for a
> > dysfunctional CPU.
>
> IIUC, outside the context of CPU onlining, CPU hotplug callbacks get
> called when they are added. In this case, only the newly added callbacks
> will be executed. IOW, zswap's hotplug callback should not be randomly
> getting called when irrelevant code adds hotplug callbacks. It should
> only happen during zswap pool initialization or CPU onlining.
>
> Please correct me if I am wrong.
Yes, this is my understanding as well.
>
> >
> > Now that I think about it, the only real constraint is that the acomp_ctx
> > resources are guaranteed to exist for a functional CPU which can run zswap
> > compress/decompress.
>
> I believe this is already the case as I previously described, because
> the hotplug callback can only be called in two scenarios:
> - Zswap pool initialization, in which case compress/decompress
> operations cannot run on the pool we are initializing.
> - CPU onlining, in which case compress/decompress operations cannot run
> on the CPU we are onlining.
>
> Please correct me if I am wrong.
Agreed, this is my understanding too.
>
> >
> > I think we can simplify this as follows, and would welcome suggestions
> > to improve the proposed solution:
> >
> > 1) We dis-associate the acomp_ctx from the pool, and instead, have this
> > be a global percpu zswap resource that gets allocated once in
> > zswap_setup(), just like the zswap_entry_cache.
> > 2) The acomp_ctx resources will get allocated during zswap_setup(), using
> > cpuhp_setup_state_multi() to register zswap_cpu_comp_prepare(), with
> > no teardown callback.
> > 3) We call cpuhp_state_add_instance() for_each_possible_cpu(cpu) in
> > zswap_setup().
> > 4) The acomp_ctx resources persist through subsequent "real CPU
> > offline/online state transitions".
> > 5) zswap_[de]compress() can go ahead and lock the mutex, and use the
> > reqs/buffers without worrying about whether these resources are
> > initialized or still exist/are being deleted.
> > 6) "struct zswap_pool" is now de-coupled from this global percpu zswap
> > acomp_ctx.
> > 7) To address the issue of how many reqs/buffers to allocate, there could
> > potentially be a memory cost for non-batching compressors if we want
> > to always allocate ZSWAP_MAX_BATCH_SIZE acomp_reqs and buffers.
> > This would allow the acomp_ctx to seamlessly handle batching
> > compressors, non-batching compressors, and transitions between the
> > two compressor types in a fairly general manner that relies only on
> > ZSWAP_MAX_BATCH_SIZE, which we define anyway.
> >
> > I believe we can maximize the chances of success for the allocation of
> > the acomp_ctx resources if this is done in zswap_setup(), but please
> > correct me if I am wrong.
> >
> > The added memory cost for platforms without IAA would be
> > ~57 KB per cpu, on x86. Would this be acceptable?
>
> I think that's a lot of memory to waste per-CPU, and I don't see a good
> reason for it.
Yes, it appears so. To see whether a better design is possible by
de-coupling the acomp_ctx from the zswap_pool:
Would this cost be acceptable if it is incurred only under a build config
option, say CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES (default OFF)?
If this is set, we go ahead and allocate ZSWAP_MAX_BATCH_SIZE
acomp_ctx resources once, during zswap_setup(). If not, we allocate
only one req/buffer in the global percpu acomp_ctx.
Since other compressors may want to do batching in the future, I
thought a more general config option name would be more appropriate.
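To make this concrete, here is a rough sketch of what I am thinking. The
CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES symbol, the buffers[] array and
zswap_acomp_ctx_init() are hypothetical; error unwinding and the
compressor-specific tfm/req allocation are omitted:

	#ifdef CONFIG_ALLOC_ZSWAP_BATCHING_RESOURCES
	#define ZSWAP_NR_ACOMP_REQS	ZSWAP_MAX_BATCH_SIZE
	#else
	#define ZSWAP_NR_ACOMP_REQS	1
	#endif

	/* Global per-CPU acomp_ctx, no longer owned by any zswap_pool. */
	static DEFINE_PER_CPU(struct crypto_acomp_ctx, zswap_acomp_ctx);

	/*
	 * Called once from zswap_setup(); the buffers persist across CPU
	 * offline/online transitions, so zswap_[de]compress() can take
	 * ctx->mutex without checking whether the resources still exist.
	 */
	static int __init zswap_acomp_ctx_init(void)
	{
		int cpu, i;

		for_each_possible_cpu(cpu) {
			struct crypto_acomp_ctx *ctx =
				per_cpu_ptr(&zswap_acomp_ctx, cpu);

			mutex_init(&ctx->mutex);
			ctx->nr_reqs = ZSWAP_NR_ACOMP_REQS;
			for (i = 0; i < ctx->nr_reqs; i++) {
				/* One 2 * PAGE_SIZE buffer per request slot. */
				ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2,
							       GFP_KERNEL,
							       cpu_to_node(cpu));
				if (!ctx->buffers[i])
					return -ENOMEM;
			}
		}
		return 0;
	}

The per-CPU memory cost would then be paid only by users who opt in to
batching at build time.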
>
> > If not, I don't believe this simplification would be worth it, because
> > allocating one req/buffer, then dynamically adding more resources
> > if a newly selected compressor requires more, would run
> > into the same race conditions and added checks as in
> > acomp_ctx_get_cpu_lock(), which, I believe, are necessary because
> > CPU onlining/zswap_pool creation and CPU offlining/zswap_pool
> > deletion have the same semantics for these resources.
>
> Agree that using a single acomp_ctx per-CPU but making the resources
> resizable is not a win.
Yes, this makes sense: resizing is not the way to go.
>
> >
> > The only other fallback solution in lieu of the proposed simplification
> > that I can think of is to keep the lifespan of these resources from pool
> > creation to deletion, using the CPU hotplug code, although it is not
> > totally clear to me if there is potential for deadlock during pool
> > transitions, as noted above.
>
> I am not sure what's the deadlock scenario you're worried about, please
> clarify.
My apologies: I was referring to triggering direct reclaim during pool
creation, which could, in theory, run into a scenario where the pool switch
has to wait for reclaim to free up enough memory for the acomp_ctx
resource allocations. This could cause the system to hang, but it is not a
deadlock. It can happen even today, hence my trying to see if a better
design is possible.
Thanks,
Kanchana