Message-ID: <CAKEwX=PJ-LQO3xffXx5-MpRBXWaCC+QQdM1kCY0=NFX4sNEhgA@mail.gmail.com>
Date: Wed, 2 Apr 2025 13:25:39 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org, yosry.ahmed@...ux.dev, 
	chengming.zhou@...ux.dev, sj@...nel.org, kernel-team@...a.com, 
	linux-kernel@...r.kernel.org, gourry@...rry.net, ying.huang@...ux.alibaba.com, 
	jonathan.cameron@...wei.com, dan.j.williams@...el.com, 
	linux-cxl@...r.kernel.org, minchan@...nel.org, senozhatsky@...omium.org
Subject: Re: [PATCH] zswap/zsmalloc: prefer the original page's node for
 compressed data

On Wed, Apr 2, 2025 at 1:24 PM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Wed, Apr 02, 2025 at 01:09:29PM -0700, Nhat Pham wrote:
> > On Wed, Apr 2, 2025 at 12:57 PM Johannes Weiner <hannes@...xchg.org> wrote:
> > >
> > > On Wed, Apr 02, 2025 at 12:11:45PM -0700, Nhat Pham wrote:
> > > > Currently, zsmalloc, zswap's backend memory allocator, does not enforce
> > > > any policy for the allocation of memory for the compressed data,
> > > > instead just adopting the memory policy of the task entering reclaim,
> > > > or the default policy (prefer local node) if no such policy is
> > > > specified. This can lead to several pathological behaviors in
> > > > multi-node NUMA systems:
> > > >
> > > > 1. Systems with CXL-based memory tiering can encounter the following
> > > >    inversion with zswap: the coldest pages demoted to the CXL tier
> > > >    can return to the high tier when they are zswapped out, creating
> > > >    memory pressure on the high tier.
> > > >
> > > > 2. Consider a direct reclaimer scanning nodes in order of allocation
> > > >    preference. If it ventures into remote nodes, the memory it
> > > >    compresses there should stay there. Trying to shift those contents
> > > >    over to the reclaiming thread's preferred node further *increases*
> > > >    its local pressure, provoking more spills. The remote node is
> > > >    also the most likely to refault this data again. This undesirable
> > > >    behavior was pointed out by Johannes Weiner in [1].
> > > >
> > > > 3. For zswap writeback, the zswap entries are organized in
> > > >    node-specific LRUs, based on the node placement of the original
> > > >    pages, allowing for targeted zswap writeback for specific nodes.
> > > >
> > > >    However, the compressed data of a zswap entry can be placed on a
> > > >    different node from the LRU it is placed on. This means that reclaim
> > > >    targeted at one node might not free up memory used for zswap entries
> > > >    in that node, instead freeing memory on a different node.
> > > >
> > > > All of these issues will be resolved if the compressed data go to the
> > > > same node as the original page. This patch encourages this behavior by
> > > > having zswap pass the node of the original page to zsmalloc, and have
> > > > zsmalloc prefer the specified node if we need to allocate new (zs)pages
> > > > for the compressed data.
> > > >
> > > > Note that we are not strictly binding the allocation to the preferred
> > > > node. We still allow the allocation to fall back to other nodes when
> > > > the preferred node is full, or if we have zspages with slots available
> > > > on a different node. This is OK, and still a strict improvement over
> > > > the status quo:
> > > >
> > > > 1. On a system with demotion enabled, we will generally prefer
> > > >    demotions over zswapping, and only zswap when pages have
> > > >    already gone to the lowest tier. This patch should achieve the
> > > >    desired effect for the most part.
> > > >
> > > > 2. If the preferred node is out of memory, letting the compressed data
> > > >    go to other nodes can be better than the alternative (OOMs,
> > > >    keeping cold memory unreclaimed, disk swapping, etc.).
> > > >
> > > > 3. If the allocation goes to a separate node because we have a zspage
> > > >    with slots available, at least we're not creating extra immediate
> > > >    memory pressure (since the space is already allocated).
> > > >
> > > > 4. While there can be mixing, we generally reclaim pages in
> > > >    same-node batches, which encourages zspage grouping that is more
> > > >    likely to go to the right node.
> > > >
> > > > 5. A strict binding would require partitioning zsmalloc by node, which
> > > >    is more complicated, and more prone to regression, since it reduces
> > > >    the storage density of zsmalloc. We need to evaluate the tradeoff
> > > >    and benchmark carefully before adopting such an involved solution.
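[The fallback policy described in the points above can be modeled as a toy sketch; pick_alloc_node() and the node_full array are hypothetical illustration only, not the actual zsmalloc code:]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the allocation preference described above: try the
 * original page's node first, and fall back to any other node with
 * room rather than failing the allocation outright. */
static int pick_alloc_node(int preferred, const bool *node_full, int nr_nodes)
{
	if (!node_full[preferred])
		return preferred;
	for (int nid = 0; nid < nr_nodes; nid++) {
		if (!node_full[nid])
			return nid;
	}
	return -1; /* every node is full; caller must handle failure */
}
```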
> > > >
> > > > This patch does not fix zram, leaving its memory allocation behavior
> > > > unchanged. We leave this effort to future work.
> > >
> > > zram's zs_malloc() calls all have page context. It seems a lot easier
> > > to just fix the bug for them as well than to have two allocation APIs
> > > and verbose commentary?
> >
> > I think the recompress path doesn't quite have the context at the callsite:
> >
> > static int recompress_slot(struct zram *zram, u32 index, struct page *page,
> >    u64 *num_recomp_pages, u32 threshold, u32 prio,
> >    u32 prio_max)
> >
> > Note that the "page" argument here is allocated by zram internally,
> > and not the original page. We can get the original page's node by
> > asking zsmalloc to return it when it returns the compressed data, but
> > that's quite involved, and potentially requires further zsmalloc API
> > change.
>
> Yeah, that path currently allocates the target page on the node of
> whoever is writing to the "recompress" file.
>
> I think it's fine to use page_to_nid() on that one. It's no worse than
> the current behavior. Add an /* XXX */ to recompress_store() and
> should somebody care to make that path generally NUMA-aware they can
> do so without having to garbage-collect dependencies in zsmalloc code.

SGTM. I'll fix that.
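For concreteness, the direction can be modeled as a toy sketch (hypothetical types and names; the real change would simply call page_to_nid() on the page handed to recompress_slot() when allocating the target zspage):

```c
#include <assert.h>

/* Toy model: key the recompress allocation off the node of the page
 * being recompressed, not the node of the task writing to the
 * "recompress" file. Types and names are illustration only. */
struct toy_page { int nid; };

static int toy_page_to_nid(const struct toy_page *page)
{
	return page->nid;
}

static int recompress_target_nid(const struct toy_page *page, int writer_nid)
{
	(void)writer_nid; /* old behavior: allocate on the writer's node */
	return toy_page_to_nid(page); /* new behavior: follow the data */
}
```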
