linux-kernel - RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <SJ0PR11MB56784B0D0EC2348A0ED7542CC9692@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Wed, 25 Sep 2024 19:10:04 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosryahmed@...gle.com>, Johannes Weiner <hannes@...xchg.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "nphamcs@...il.com"
	<nphamcs@...il.com>, "chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
	"usamaarif642@...il.com" <usamaarif642@...il.com>, "shakeel.butt@...ux.dev"
	<shakeel.butt@...ux.dev>, "ryan.roberts@....com" <ryan.roberts@....com>,
	"Huang, Ying" <ying.huang@...el.com>, "21cnbao@...il.com"
	<21cnbao@...il.com>, "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"Zou, Nanhai" <nanhai.zou@...el.com>, "Feghali, Wajdi K"
	<wajdi.k.feghali@...el.com>, "Gopal, Vinodh" <vinodh.gopal@...el.com>,
	"Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@...gle.com>
> Sent: Wednesday, September 25, 2024 11:31 AM
> To: Johannes Weiner <hannes@...xchg.org>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>; linux-
> kernel@...r.kernel.org; linux-mm@...ck.org; nphamcs@...il.com;
> chengming.zhou@...ux.dev; usamaarif642@...il.com;
> shakeel.butt@...ux.dev; ryan.roberts@....com; Huang, Ying
> <ying.huang@...el.com>; 21cnbao@...il.com; akpm@...ux-foundation.org;
> Zou, Nanhai <nanhai.zou@...el.com>; Feghali, Wajdi K
> <wajdi.k.feghali@...el.com>; Gopal, Vinodh <vinodh.gopal@...el.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> [..]
> > > > +       /*
> > > > +        * Check cgroup limits:
> > > > +        *
> > > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > > +        * mTHP store, and not within zswap_store_page() for each page
> > > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > > +        * However, the per-store-page zswap pool limits check should
> > > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > > +        * the cgroup exceeding the zswap limits could potentially be
> > > > +        * resolved before the next zswap_store, and if it is not, the next
> > > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > > +        */
> > >
> > > I do not really like this. Allowing going one page above the limit is
> > > one thing, but one THP above the limit seems too much. I also don't
> > > like relying on the repeated limit checking in zswap_store_page(), if
> > > anything I think that should be batched too.
> > >
> > > Is it too unreasonable to maintain the average compression ratio and
> > > use that to estimate limit checking for both memcg and global limits?
> > > Johannes, Nhat, any thoughts on this?
> >
> > I honestly don't think it's much of an issue. The global limit is
> > huge, and the cgroup limit is to the best of my knowledge only used as
> > a binary switch. Setting a non-binary limit - global or cgroup - seems
> > like a bit of an obscure usecase to me, because in the vast majority
> > of cases it's preferable to keep compresing over declaring OOM.
> >
> > And even if you do have some granular limit, the workload size scales
> > with it. It's not like you have a thousand THPs in a 10M cgroup.
> 
> The memcg limit and zswap limit can be disproportionate, although that
> shouldn't be common.
> 
> >
> > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > scheme: check the limit up front for fast-path failure if we're
> > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > atomically charge against zswap.max and unwind the store if we raced.
> >
> > For now, I would just keep the simple version we currently have: check
> > once in zswap_store() and then just go ahead for the whole folio.
> 
> I am not totally against this but I feel like this is too optimistic.
> I think we can keep it simple-ish by maintaining an ewma for the
> compression ratio, we already have primitives for this (see
> DECLARE_EWMA).
> 
> Then in zswap_store(), we can use the ewma to estimate the compressed
> size and use it to do the memcg and global limit checks once, like we
> do today. Instead of just checking if we are below the limits, we
> check if we have enough headroom for the estimated compressed size.
> Then we call zswap_store_page() to do the per-page stuff, then do
> batched charging and stats updates.
> 
> If you think that's an overkill we can keep doing the limit checks as
> we do today,
> but I would still like to see batching of all the limit checks,
> charging, and stats updates. It makes little sense otherwise.

Thanks Johannes and Yosry for these suggestions and pointers.
I think there is general agreement about the batch charging and
zswap_stored_pages/stats updates. Yosry,  does "batching of limit
checks" imply the same as a simple check for being over the cgroup
limit at the start of zswap_store and not doing this check in
zswap_store_page? Does this also imply a zswap_pool_get_many()?
Would appreciate it if you can help clarify.

The main question in my mind about using the EWMA checks is,
will it add overhead to the normal zswap reclaim path; and if so,
would a simple limit check at the start of zswap_store as suggested
by Johannes suffice. I can run a few experiments to quantify this
overhead, and maybe we can revisit this?

Thanks,
Kanchana