Message-ID: <a881c1f8-aaa4-4c34-9883-a1d07e117968@linux.dev>
Date: Thu, 3 Apr 2025 11:18:35 +0800
From: Chengming Zhou <chengming.zhou@...ux.dev>
To: Nhat Pham <nphamcs@...il.com>, akpm@...ux-foundation.org
Cc: hannes@...xchg.org, yosry.ahmed@...ux.dev, sj@...nel.org,
linux-mm@...ck.org, kernel-team@...a.com, linux-kernel@...r.kernel.org,
gourry@...rry.net, ying.huang@...ux.alibaba.com,
jonathan.cameron@...wei.com, dan.j.williams@...el.com,
linux-cxl@...r.kernel.org, minchan@...nel.org, senozhatsky@...omium.org
Subject: Re: [PATCH v2] zsmalloc: prefer the original page's node for
compressed data
On 2025/4/3 04:44, Nhat Pham wrote:
> Currently, zsmalloc, zswap's and zram's backend memory allocator, does
> not enforce any policy for the allocation of memory for the compressed
> data, instead just adopting the memory policy of the task entering
> reclaim, or the default policy (prefer local node) if no such policy is
> specified. This can lead to several pathological behaviors in
> multi-node NUMA systems:
>
> 1. Systems with CXL-based memory tiering can encounter the following
> inversion with zswap/zram: the coldest pages demoted to the CXL tier
> can return to the high tier when they are reclaimed to compressed
> swap, creating memory pressure on the high tier.
>
> 2. Consider a direct reclaimer scanning nodes in order of allocation
> preference. If it ventures into remote nodes, the memory it
> compresses there should stay there. Trying to shift those contents
> over to the reclaiming thread's preferred node further *increases*
> its local pressure, provoking more spills. The remote node is
> also the most likely to refault this data again. This undesirable
> behavior was pointed out by Johannes Weiner in [1].
>
> 3. For zswap writeback, the zswap entries are organized in
> node-specific LRUs, based on the node placement of the original
> pages, allowing for targeted zswap writeback for specific nodes.
>
> However, the compressed data of a zswap entry can be placed on a
> different node from the LRU it is placed on. This means that reclaim
> targeted at one node might not free up memory used for zswap entries
> in that node, but instead reclaim memory in a different node.
>
> All of these issues will be resolved if the compressed data go to the
> same node as the original page. This patch encourages this behavior by
> having zswap and zram pass the node of the original page to zsmalloc,
> and having zsmalloc prefer the specified node if we need to allocate new
> (zs)pages for the compressed data.
>
> Note that we are not strictly binding the allocation to the preferred
> node. We still allow the allocation to fall back to other nodes when
> the preferred node is full, or if we have zspages with slots available
> on a different node. This is OK, and still a strict improvement over
> the status quo:
>
> 1. On a system with demotion enabled, we will generally prefer
> demotions over compressed swapping, and only swap when pages have
> already gone to the lowest tier. This patch should achieve the
> desired effect for the most part.
>
> 2. If the preferred node is out of memory, letting the compressed data
> go to other nodes can be better than the alternative (OOMs,
> keeping cold memory unreclaimed, disk swapping, etc.).
>
> 3. If the allocation goes to a separate node because we have a zspage
> with slots available, at least we're not creating extra immediate
> memory pressure (since the space is already allocated).
>
> 4. While there can be some mixing, we generally reclaim pages in
> same-node batches, which encourages zspage grouping that is more
> likely to go to the right node.
>
> 5. A strict binding would require partitioning zsmalloc by node, which
> is more complicated, and more prone to regression, since it reduces
> the storage density of zsmalloc. We need to evaluate the tradeoff
> and benchmark carefully before adopting such an involved solution.
>
> [1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/
>
> Suggested-by: Gregory Price <gourry@...rry.net>
> Signed-off-by: Nhat Pham <nphamcs@...il.com>
Looks good to me:
Reviewed-by: Chengming Zhou <chengming.zhou@...ux.dev>
Thanks!
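
For anyone skimming the interface change: after this patch a caller is
expected to pass the originating page's node alongside the gfp flags,
e.g. (minimal sketch, mirroring the zswap hunk below; variable names are
just illustrative):

	/* place the compressed copy of @page near its home node */
	gfp_t gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
	int err = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
	if (err)
		goto unlock;

The node is only a preference: alloc_zpdesc() ends up in
alloc_pages_node() without __GFP_THISNODE, so the allocation can still
fall back to other nodes, matching the "not strictly binding" note in
the changelog.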
> ---
> drivers/block/zram/zram_drv.c | 11 ++++++++---
> include/linux/zpool.h | 4 ++--
> include/linux/zsmalloc.h | 3 ++-
> mm/zpool.c | 8 +++++---
> mm/zsmalloc.c | 18 ++++++++++--------
> mm/zswap.c | 2 +-
> 6 files changed, 28 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index fda7d8624889..0ba18277ed7b 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1694,7 +1694,7 @@ static int write_incompressible_page(struct zram *zram, struct page *page,
> */
> handle = zs_malloc(zram->mem_pool, PAGE_SIZE,
> GFP_NOIO | __GFP_NOWARN |
> - __GFP_HIGHMEM | __GFP_MOVABLE);
> + __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
> if (IS_ERR_VALUE(handle))
> return PTR_ERR((void *)handle);
>
> @@ -1761,7 +1761,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>
> handle = zs_malloc(zram->mem_pool, comp_len,
> GFP_NOIO | __GFP_NOWARN |
> - __GFP_HIGHMEM | __GFP_MOVABLE);
> + __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
> if (IS_ERR_VALUE(handle)) {
> zcomp_stream_put(zstrm);
> return PTR_ERR((void *)handle);
> @@ -1981,10 +1981,15 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
> * We are holding per-CPU stream mutex and entry lock so better
> * avoid direct reclaim. Allocation error is not fatal since
> * we still have the old object in the mem_pool.
> + *
> + * XXX: technically, the node we really want here is the node that holds
> + * the original compressed data. But that would require us to modify
> + * zsmalloc API to return this information. For now, we will make do with
> + * the node of the page allocated for recompression.
> */
> handle_new = zs_malloc(zram->mem_pool, comp_len_new,
> GFP_NOIO | __GFP_NOWARN |
> - __GFP_HIGHMEM | __GFP_MOVABLE);
> + __GFP_HIGHMEM | __GFP_MOVABLE, page_to_nid(page));
> if (IS_ERR_VALUE(handle_new)) {
> zcomp_stream_put(zstrm);
> return PTR_ERR((void *)handle_new);
> diff --git a/include/linux/zpool.h b/include/linux/zpool.h
> index 52f30e526607..697525cb00bd 100644
> --- a/include/linux/zpool.h
> +++ b/include/linux/zpool.h
> @@ -22,7 +22,7 @@ const char *zpool_get_type(struct zpool *pool);
> void zpool_destroy_pool(struct zpool *pool);
>
> int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
> - unsigned long *handle);
> + unsigned long *handle, const int nid);
>
> void zpool_free(struct zpool *pool, unsigned long handle);
>
> @@ -64,7 +64,7 @@ struct zpool_driver {
> void (*destroy)(void *pool);
>
> int (*malloc)(void *pool, size_t size, gfp_t gfp,
> - unsigned long *handle);
> + unsigned long *handle, const int nid);
> void (*free)(void *pool, unsigned long handle);
>
> void *(*obj_read_begin)(void *pool, unsigned long handle,
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index c26baf9fb331..13e9cc5490f7 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -26,7 +26,8 @@ struct zs_pool;
> struct zs_pool *zs_create_pool(const char *name);
> void zs_destroy_pool(struct zs_pool *pool);
>
> -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags,
> + const int nid);
> void zs_free(struct zs_pool *pool, unsigned long obj);
>
> size_t zs_huge_class_size(struct zs_pool *pool);
> diff --git a/mm/zpool.c b/mm/zpool.c
> index 6d6d88930932..b99a7c03e735 100644
> --- a/mm/zpool.c
> +++ b/mm/zpool.c
> @@ -226,20 +226,22 @@ const char *zpool_get_type(struct zpool *zpool)
> * @size: The amount of memory to allocate.
> * @gfp: The GFP flags to use when allocating memory.
> * @handle: Pointer to the handle to set
> + * @nid: The preferred node id.
> *
> * This allocates the requested amount of memory from the pool.
> * The gfp flags will be used when allocating memory, if the
> * implementation supports it. The provided @handle will be
> - * set to the allocated object handle.
> + * set to the allocated object handle. The allocation will
> + * prefer the NUMA node specified by @nid.
> *
> * Implementations must guarantee this to be thread-safe.
> *
> * Returns: 0 on success, negative value on error.
> */
> int zpool_malloc(struct zpool *zpool, size_t size, gfp_t gfp,
> - unsigned long *handle)
> + unsigned long *handle, const int nid)
> {
> - return zpool->driver->malloc(zpool->pool, size, gfp, handle);
> + return zpool->driver->malloc(zpool->pool, size, gfp, handle, nid);
> }
>
> /**
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 961b270f023c..8ba6cdf222ac 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -243,9 +243,9 @@ static inline void zpdesc_dec_zone_page_state(struct zpdesc *zpdesc)
> dec_zone_page_state(zpdesc_page(zpdesc), NR_ZSPAGES);
> }
>
> -static inline struct zpdesc *alloc_zpdesc(gfp_t gfp)
> +static inline struct zpdesc *alloc_zpdesc(gfp_t gfp, const int nid)
> {
> - struct page *page = alloc_page(gfp);
> + struct page *page = alloc_pages_node(nid, gfp, 0);
>
> return page_zpdesc(page);
> }
> @@ -462,9 +462,9 @@ static void zs_zpool_destroy(void *pool)
> }
>
> static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
> - unsigned long *handle)
> + unsigned long *handle, const int nid)
> {
> - *handle = zs_malloc(pool, size, gfp);
> + *handle = zs_malloc(pool, size, gfp, nid);
>
> if (IS_ERR_VALUE(*handle))
> return PTR_ERR((void *)*handle);
> @@ -1044,7 +1044,7 @@ static void create_page_chain(struct size_class *class, struct zspage *zspage,
> */
> static struct zspage *alloc_zspage(struct zs_pool *pool,
> struct size_class *class,
> - gfp_t gfp)
> + gfp_t gfp, const int nid)
> {
> int i;
> struct zpdesc *zpdescs[ZS_MAX_PAGES_PER_ZSPAGE];
> @@ -1061,7 +1061,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> for (i = 0; i < class->pages_per_zspage; i++) {
> struct zpdesc *zpdesc;
>
> - zpdesc = alloc_zpdesc(gfp);
> + zpdesc = alloc_zpdesc(gfp, nid);
> if (!zpdesc) {
> while (--i >= 0) {
> zpdesc_dec_zone_page_state(zpdescs[i]);
> @@ -1336,12 +1336,14 @@ static unsigned long obj_malloc(struct zs_pool *pool,
> * @pool: pool to allocate from
> * @size: size of block to allocate
> * @gfp: gfp flags when allocating object
> + * @nid: The preferred node id to allocate new zspage (if needed)
> *
> * On success, handle to the allocated object is returned,
> * otherwise an ERR_PTR().
> * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
> */
> -unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp,
> + const int nid)
> {
> unsigned long handle;
> struct size_class *class;
> @@ -1376,7 +1378,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>
> spin_unlock(&class->lock);
>
> - zspage = alloc_zspage(pool, class, gfp);
> + zspage = alloc_zspage(pool, class, gfp, nid);
> if (!zspage) {
> cache_free_handle(pool, handle);
> return (unsigned long)ERR_PTR(-ENOMEM);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 204fb59da33c..455e9425c5f5 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -981,7 +981,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>
> zpool = pool->zpool;
> gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
> - alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle);
> + alloc_ret = zpool_malloc(zpool, dlen, gfp, &handle, page_to_nid(page));
> if (alloc_ret)
> goto unlock;
>
>
> base-commit: 8c65b3b82efb3b2f0d1b6e3b3e73c6f0fd367fb5