Message-ID: <20250329110230.2459730-1-nphamcs@gmail.com>
Date: Sat, 29 Mar 2025 04:02:28 -0700
From: Nhat Pham <nphamcs@...il.com>
To: linux-mm@...ck.org
Cc: akpm@...ux-foundation.org,
	hannes@...xchg.org,
	yosry.ahmed@...ux.dev,
	chengming.zhou@...ux.dev,
	sj@...nel.org,
	kernel-team@...a.com,
	linux-kernel@...r.kernel.org,
	gourry@...rry.net,
	willy@...radead.org,
	ying.huang@...ux.alibaba.com,
	jonathan.cameron@...wei.com,
	dan.j.williams@...el.com,
	linux-cxl@...r.kernel.org,
	minchan@...nel.org,
	senozhatsky@...omium.org
Subject: [RFC PATCH 0/2] zswap: fix placement inversion in memory tiering systems

Currently, systems with CXL-based memory tiering can encounter the
following inversion with zswap: the coldest pages demoted to the CXL
tier can return to the high tier when they are zswapped out,
creating memory pressure on the high tier.

This happens because zsmalloc, zswap's backend memory allocator, does
not enforce any memory policy. If, for example, the reclaiming task
follows a local-first allocation policy, the memory requested for zswap
can be served by the upper tier, leading to the aforementioned
inversion.
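
To illustrate, a minimal sketch of the status quo (simplified, not the
verbatim zsmalloc code; alloc_backing_page is a made-up helper name):

	/*
	 * The backing page for a compressed object is allocated with no
	 * node hint, so it follows the allocating task's NUMA policy --
	 * under a local-first policy that is the reclaiming CPU's node,
	 * which on a tiered host is typically the top tier.
	 */
	static struct page *alloc_backing_page(gfp_t gfp)
	{
		/* No nid supplied: the default policy decides placement. */
		return alloc_page(gfp);
	}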

This RFC fixes the inversion by adding a new memory allocation mode
for zswap (exposed through a zswap sysfs knob), intended for hosts
with CXL, where the memory for the compressed object is requested
preferentially from the node that the original page resides on.
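
Roughly, the idea is to thread the source page's node id from zswap
down into the allocator. A minimal sketch of what that could look like
(the zpool_malloc_node()/alloc_backing_page() helpers and the extra
nid argument are illustrative, not necessarily the interface the
patches use):

	#include <linux/gfp.h>
	#include <linux/mm.h>
	#include <linux/numa.h>

	/* zswap side: prefer the node the original (cold) page lives on. */
	static int zswap_alloc_compressed(struct zpool *pool, struct page *page,
					  size_t clen, gfp_t gfp,
					  unsigned long *handle)
	{
		int nid = page_to_nid(page);

		return zpool_malloc_node(pool, clen, gfp, handle, nid);
	}

	/* zsmalloc side: the preference reaches the page allocator. */
	static struct page *alloc_backing_page(int nid, gfp_t gfp)
	{
		if (nid != NUMA_NO_NODE)
			return alloc_pages_node(nid, gfp, 0);
		return alloc_page(gfp); /* fall back to the default policy */
	}

Note that alloc_pages_node() treats nid as a preference rather than a
hard requirement unless __GFP_THISNODE is also set; a soft preference
seems like the right semantics here, since a failed store would
otherwise fall back to a real swap write. If the knob follows zswap's
existing module-parameter convention, it would presumably live under
/sys/module/zswap/parameters/.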

With the new zswap allocation mode enabled, we should observe the
following dynamics:

1. When demotion is turned on, under reasonable conditions, zswap will
   prefer CXL memory by default, since top-tier memory being reclaimed
   will typically be demoted instead of swapped.

2. This should prevent reclaim on the lower tier from causing high-tier
   memory pressure due to new allocations.

3. This should avoid quietly promoting cold memory (memory being
   zswapped is cold by definition, but is effectively promoted when
   put into the zswap pool if the memory allocated for the compressed
   copy comes from the high tier).
   
4. However, this may add pressure on the CXL tier itself, which may in
   turn result in further demotion (to swap, etc.). This needs to be
   tested.

I'm still testing and collecting more data, but figured I should send
this out as an RFC to spark the discussion:

1. Is this the right policy? Do we need a more complicated policy?
   Should we instead go for the "lowest" node (which would require a
   new memory tiering API)? Or maybe try each node from the current
   node down to the lowest node in the hierarchy (see the sketch after
   this list)?

   Also, I hacked this fix together with CXL in mind, but if there are
   other cases that I should also address, we can explore a more
   general memory allocation strategy or interface.

2. Similarly, is this the right zsmalloc API? For instance, we could
   build a full-fledged mempolicy-based API for zsmalloc, but I
   haven't found a use case for it yet.

3. Assuming this is the right policy, what should be the semantics?
   I'm not very good at naming things, so same_node_mode might not be
   the right name :)
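
For the "walk down the hierarchy" variant in question 1, a rough
sketch using the existing demotion API (next_demotion_node() exists in
the memory tiering code; the retry loop and helper name are purely
illustrative):

	#include <linux/gfp.h>
	#include <linux/memory-tiers.h>
	#include <linux/numa.h>

	/*
	 * Try to place the compressed copy on the source page's node,
	 * then walk the demotion hierarchy toward the lowest tier until
	 * an allocation succeeds. Whether this extra complexity buys
	 * anything is exactly what this RFC is asking.
	 */
	static struct page *alloc_down_hierarchy(int nid, gfp_t gfp)
	{
		struct page *page;

		while (nid != NUMA_NO_NODE) {
			/* __GFP_THISNODE makes each step a hard requirement. */
			page = alloc_pages_node(nid, gfp | __GFP_THISNODE, 0);
			if (page)
				return page;
			nid = next_demotion_node(nid);
		}
		/* Preferred path exhausted: fall back to the default policy. */
		return alloc_page(gfp);
	}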

Nhat Pham (2):
  zsmalloc: let callers select NUMA node to store the compressed objects
  zswap: add sysfs knob for same node mode

 Documentation/admin-guide/mm/zswap.rst |  9 +++++++++
 include/linux/zpool.h                  |  4 ++--
 mm/zpool.c                             |  8 +++++---
 mm/zsmalloc.c                          | 28 +++++++++++++++++++-------
 mm/zswap.c                             | 10 +++++++--
 5 files changed, 45 insertions(+), 14 deletions(-)


base-commit: 4135040c342ba080328891f1b7e523c8f2f04c58
-- 
2.47.1
