Message-ID: <CAJD7tkapj+Wb7BMGZ=2AGi2NTXJbZBCnoJkZuBAkkF8NDZDKjA@mail.gmail.com>
Date: Wed, 25 Sep 2024 12:49:13 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
Cc: Johannes Weiner <hannes@...xchg.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
"nphamcs@...il.com" <nphamcs@...il.com>, "chengming.zhou@...ux.dev" <chengming.zhou@...ux.dev>,
"usamaarif642@...il.com" <usamaarif642@...il.com>, "shakeel.butt@...ux.dev" <shakeel.butt@...ux.dev>,
"ryan.roberts@....com" <ryan.roberts@....com>, "Huang, Ying" <ying.huang@...el.com>,
"21cnbao@...il.com" <21cnbao@...il.com>, "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"Zou, Nanhai" <nanhai.zou@...el.com>, "Feghali, Wajdi K" <wajdi.k.feghali@...el.com>,
"Gopal, Vinodh" <vinodh.gopal@...el.com>
Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
[..]
> > > > > + /*
> > > > > + * Check cgroup limits:
> > > > > + *
> > > > > + * The cgroup zswap limit check is done once at the beginning of an
> > > > > + * mTHP store, and not within zswap_store_page() for each page
> > > > > + * in the mTHP. We do however check the zswap pool limits at the
> > > > > + * start of zswap_store_page(). What this means is, the cgroup
> > > > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > > > + * However, the per-store-page zswap pool limits check should
> > > > > + * hopefully trigger the cgroup aware and zswap LRU aware global
> > > > > + * reclaim implemented in the shrinker. If this assumption holds,
> > > > > + * the cgroup exceeding the zswap limits could potentially be
> > > > > + * resolved before the next zswap_store, and if it is not, the next
> > > > > + * zswap_store would fail the cgroup zswap limit check at the start.
> > > > > + */
> > > >
> > > > I do not really like this. Allowing going one page above the limit is
> > > > one thing, but one THP above the limit seems too much. I also don't
> > > > like relying on the repeated limit checking in zswap_store_page(), if
> > > > anything I think that should be batched too.
> > > >
> > > > Is it too unreasonable to maintain the average compression ratio and
> > > > use that to estimate limit checking for both memcg and global limits?
> > > > Johannes, Nhat, any thoughts on this?
> > >
> > > I honestly don't think it's much of an issue. The global limit is
> > > huge, and the cgroup limit is to the best of my knowledge only used as
> > > a binary switch. Setting a non-binary limit - global or cgroup - seems
> > > like a bit of an obscure use case to me, because in the vast majority
> > > of cases it's preferable to keep compressing over declaring OOM.
> > >
> > > And even if you do have some granular limit, the workload size scales
> > > with it. It's not like you have a thousand THPs in a 10M cgroup.
> >
> > The memcg limit and zswap limit can be disproportionate, although that
> > shouldn't be common.
> >
> > >
> > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > scheme: check the limit up front for fast-path failure if we're
> > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > atomically charge against zswap.max and unwind the store if we raced.
> > >
> > > For now, I would just keep the simple version we currently have: check
> > > once in zswap_store() and then just go ahead for the whole folio.
> >
> > I am not totally against this but I feel like this is too optimistic.
> > I think we can keep it simple-ish by maintaining an ewma for the
> > compression ratio, we already have primitives for this (see
> > DECLARE_EWMA).
> >
> > Then in zswap_store(), we can use the ewma to estimate the compressed
> > size and use it to do the memcg and global limit checks once, like we
> > do today. Instead of just checking if we are below the limits, we
> > check if we have enough headroom for the estimated compressed size.
> > Then we call zswap_store_page() to do the per-page stuff, then do
> > batched charging and stats updates.
> >
> > If you think that's overkill, we can keep doing the limit checks as
> > we do today, but I would still like to see batching of all the limit
> > checks, charging, and stats updates. It makes little sense otherwise.
>
> Thanks Johannes and Yosry for these suggestions and pointers.
> I think there is general agreement about the batch charging and
> zswap_stored_pages/stats updates. Yosry, does "batching of limit
> checks" imply the same as a simple check for being over the cgroup
> limit at the start of zswap_store and not doing this check in
> zswap_store_page? Does this also imply a zswap_pool_get_many()?
> Would appreciate it if you can help clarify.
Yes, I think we should batch as much as possible in zswap_store(), and
only do the things that are truly per-page in zswap_store_page().
Batching the limit checks, stats updates, zswap_pool refs, charging,
etc. should all be clear wins.
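
Something along these lines is the rough shape I have in mind (a sketch
only, not a real patch: zswap_store_page() and zswap_check_limits() are
placeholders whose real signatures may differ, and the zswap_enabled
checks and unwinding of already-stored pages are elided):

bool zswap_store(struct folio *folio)
{
	long nr_pages = folio_nr_pages(folio);
	struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
	struct zswap_pool *pool;
	size_t compressed_bytes = 0;
	long i;

	/* Folio-wide cgroup/global limit checks, done once */
	if (objcg && !obj_cgroup_may_zswap(objcg))
		goto put_objcg;
	if (zswap_check_limits())		/* placeholder */
		goto put_objcg;

	/* One pool reference for the whole folio */
	pool = zswap_pool_current_get();
	if (!pool)
		goto put_objcg;

	/* Only the truly per-page work happens per page */
	for (i = 0; i < nr_pages; i++)
		if (!zswap_store_page(folio, i, pool, &compressed_bytes))
			goto put_pool;	/* unwind of stored pages elided */

	/* Batched charging and stats updates */
	if (objcg)
		obj_cgroup_charge_zswap(objcg, compressed_bytes);
	atomic_add(nr_pages, &zswap_stored_pages);	/* simplified */

	zswap_pool_put(pool);
	if (objcg)
		obj_cgroup_put(objcg);
	return true;

put_pool:
	zswap_pool_put(pool);
put_objcg:
	if (objcg)
		obj_cgroup_put(objcg);
	return false;
}
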
>
> The main question in my mind about using the EWMA checks is,
> will it add overhead to the normal zswap reclaim path; and if so,
> would a simple limit check at the start of zswap_store as suggested
> by Johannes suffice. I can run a few experiments to quantify this
> overhead, and maybe we can revisit this?
If you look at ewma_##name##_add() in include/linux/average.h, it's
really just a bunch of bit shifts, so I am not concerned about runtime
overhead. My discussion with Johannes is more about whether the
complexity is justified; I'd wait for that discussion to settle.
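
To make that concrete, here is a minimal sketch of what I mean (the
names and EWMA parameters below are only illustrative, and
ewma_zswap_ratio_init() would be called once at zswap init):

#include <linux/average.h>
#include <linux/mm.h>

/* 4 bits of fractional precision, new samples weighted 1/16 */
DECLARE_EWMA(zswap_ratio, 4, 16)

static struct ewma_zswap_ratio zswap_compr_ratio;

/* Feed one sample (uncompressed/compressed, scaled by 100) per page */
static void zswap_note_compression(unsigned int src_len, unsigned int dst_len)
{
	if (dst_len)
		ewma_zswap_ratio_add(&zswap_compr_ratio, src_len * 100UL / dst_len);
}

/* Estimate the compressed footprint of @nr_pages for the limit checks */
static unsigned long zswap_estimate_compressed(unsigned long nr_pages)
{
	unsigned long ratio = ewma_zswap_ratio_read(&zswap_compr_ratio);

	/* No samples yet: conservatively assume no compression */
	if (!ratio)
		return nr_pages * PAGE_SIZE;

	return nr_pages * PAGE_SIZE * 100 / ratio;
}

zswap_store() would then check whether there is enough headroom for
zswap_estimate_compressed(nr_pages) under the cgroup and global limits,
instead of just checking that we are currently below them.
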
Either way, we should check the limits once in zswap_store().
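
If racing past the limits ever becomes a problem in practice, the
fastpath/slowpath charging Johannes described could look roughly like
the toy sketch below; it uses a hypothetical atomic byte counter and
limit, since the real memcg accounting behind obj_cgroup_charge_zswap()
and memcg->zswap_max is more involved than this:

#include <linux/atomic.h>

static atomic_long_t zswap_charged_bytes;		/* hypothetical */
static unsigned long zswap_bytes_max = ULONG_MAX;	/* hypothetical limit */

/* Fast path: cheap read-only check before doing any work */
static bool zswap_over_limit(void)
{
	return atomic_long_read(&zswap_charged_bytes) > zswap_bytes_max;
}

/* Slow path: charge atomically, unwind if we raced past the limit */
static bool zswap_try_charge(size_t size)
{
	if (atomic_long_add_return(size, &zswap_charged_bytes) > zswap_bytes_max) {
		atomic_long_sub(size, &zswap_charged_bytes);
		return false;
	}
	return true;
}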