Message-ID: <CAJD7tkY-ayU3Ld+dKTLEEG3U72fGnCbiQgZursK+eGMXif_uzA@mail.gmail.com>
Date: Wed, 25 Sep 2024 12:39:02 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@...el.com>, linux-kernel@...r.kernel.org, 
	linux-mm@...ck.org, nphamcs@...il.com, chengming.zhou@...ux.dev, 
	usamaarif642@...il.com, shakeel.butt@...ux.dev, ryan.roberts@....com, 
	ying.huang@...el.com, 21cnbao@...il.com, akpm@...ux-foundation.org, 
	nanhai.zou@...el.com, wajdi.k.feghali@...el.com, vinodh.gopal@...el.com
Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().

On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > Johannes wrote:
> > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > scheme: check the limit up front for fast-path failure if we're
> > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > atomically charge against zswap.max and unwind the store if we raced.
> > >
> > > For now, I would just keep the simple version we currently have: check
> > > once in zswap_store() and then just go ahead for the whole folio.
> >
> > I am not totally against this but I feel like this is too optimistic.
> > I think we can keep it simple-ish by maintaining an EWMA for the
> > compression ratio; we already have primitives for this (see
> > DECLARE_EWMA).
> >
> > Then in zswap_store(), we can use the ewma to estimate the compressed
> > size and use it to do the memcg and global limit checks once, like we
> > do today. Instead of just checking if we are below the limits, we
> > check if we have enough headroom for the estimated compressed size.
> > Then we call zswap_store_page() to do the per-page stuff, then do
> > batched charging and stats updates.
>
> I'm not sure what you gain from making a non-atomic check precise. You
> can get a hundred threads each determining precisely that *their*
> store will fit exactly into the last 800kB before the limit.

We just get to avoid overshooting in cases where we know we probably
can't fit the folio anyway, for example if we have 4KB left and we are
trying to compress a 2MB THP. It just makes the upfront check that
avoids pointless compression a little bit more meaningful.
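
Something like the check below is all I am picturing here. This is a
toy sketch only; zswap_total_bytes() and zswap_max_bytes() are made-up
stand-ins for however we actually read the current pool size and the
global limit:

/*
 * Toy illustration only: reject stores that obviously cannot fit,
 * instead of only asking "are we currently below the limit?".
 */
static bool zswap_has_headroom(size_t folio_bytes, unsigned int ratio_pct)
{
	size_t estimate = folio_bytes * ratio_pct / 100;

	/* e.g. 4KB of headroom left vs. a ~700KB estimate for a 2MB THP */
	return zswap_total_bytes() + estimate <= zswap_max_bytes();
}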

>
> > If you think that's an overkill we can keep doing the limit checks as
> > we do today,
>
> I just don't see how it would make a practical difference.
>
> What would make a difference is atomic transactional charging of the
> compressed size, and unwinding on failure - with the upfront check to
> avoid pointlessly compressing (outside of race conditions).
>
> And I'm not against doing that in general, I am just against doing it
> per default.
>
> It's a lot of complexity, and like I said, the practical use case for
> limiting zswap memory to begin with is quite unclear to me. Zswap is
> not a limited resource. It's just memory. And you already had the
> memory for the uncompressed copy. So it's a bit strange to me to say
> "you have compressed your memory enough, so now you get sent to disk
> (or we declare OOM)". What would be a reason to limit it?

Technically speaking, if we have a global zswap limit, it becomes a
limited resource, and distributing it across workloads can make sense.
That being said, I am not aware of any existing use cases for that.

The other use case is controlling when writeback kicks in for
different workloads. It may not make sense for limit-based reclaim,
because as you mentioned the memory is limited anyway and workloads
should be free to compress their own memory within their limit as they
please. But it may make sense for proactive reclaim, controlling how
much memory we compress vs how much memory we completely evict to
disk.

Again, I am not aware of any existing use cases for this either.

>
> It sort of makes sense as a binary switch, but I don't get the use case
> for a granular limit. (And I blame my own cowardice for making the
> cgroup knob a limit, to keep options open, instead of a switch.)
>
> All that to say, this would be better in a follow-up patch. We allow
> overshooting now; it's not clear how overshooting by a larger amount
> makes a categorical difference.

I am not against making this a follow-up, if/when the need arises. My
whole point was that using an EWMA (or something similar) we can make
the upfront check a little bit more meaningful than "We have 1 byte of
headroom, let's go compress a 2MB THP!". I think it's not a lot of
complexity to check for headroom based on an estimated compressed size,
but I didn't try to code it, so maybe I am wrong :)
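
To be a bit more concrete, the tracking side could be as simple as the
sketch below (untested, helper names made up, and the init/locking
details hand-waved):

#include <linux/average.h>

/* ~1/8 weight per sample, 4 bits of fractional precision */
DECLARE_EWMA(zswap_ratio, 4, 8)

static struct ewma_zswap_ratio zswap_comp_ratio;	/* percent, 0..100 */

/* Feed the observed ratio back in after each successful compression */
static void zswap_update_ratio(size_t src_bytes, size_t comp_bytes)
{
	ewma_zswap_ratio_add(&zswap_comp_ratio, comp_bytes * 100 / src_bytes);
}

/* Estimated compressed size of a folio, for the upfront headroom check */
static size_t zswap_estimate_size(struct folio *folio)
{
	unsigned long ratio_pct = ewma_zswap_ratio_read(&zswap_comp_ratio);

	/* No samples yet: assume the folio does not compress at all */
	if (!ratio_pct)
		ratio_pct = 100;

	return folio_size(folio) * ratio_pct / 100;
}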

>
> > but I would still like to see batching of all the limit checks,
> > charging, and stats updates. It makes little sense otherwise.
>
> Definitely. One check, one charge, one stat update per folio.
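
Roughly this shape for the folio path, then. Hand-wavy sketch:
zswap_check_limits() is a made-up stand-in for the upfront check, I am
guessing at zswap_store_page()'s return convention, the other helpers
are from memory, and error unwinding is mostly omitted:

static bool zswap_store(struct folio *folio)
{
	long nr_pages = folio_nr_pages(folio);
	size_t compressed_bytes = 0;
	struct obj_cgroup *objcg;
	long i;

	/* One upfront limit/headroom check for the whole folio */
	if (!zswap_check_limits(folio))
		return false;

	/* Per-page compression and tree insertion */
	for (i = 0; i < nr_pages; i++) {
		ssize_t ret = zswap_store_page(folio, i);

		if (ret < 0)
			goto unwind;
		compressed_bytes += ret;
	}

	/* One charge and one stat update for the whole folio */
	objcg = get_obj_cgroup_from_folio(folio);
	if (objcg) {
		obj_cgroup_charge_zswap(objcg, compressed_bytes);
		obj_cgroup_put(objcg);
	}
	count_vm_events(ZSWPOUT, nr_pages);
	return true;

unwind:
	/* free the entries stored so far; nothing was charged yet */
	return false;
}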
