Message-ID: <SJ0PR11MB56788DAADC493DB60B36ECA9C9132@SJ0PR11MB5678.namprd11.prod.outlook.com>
Date: Thu, 9 Jan 2025 00:12:30 +0000
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@...el.com>
To: Yosry Ahmed <yosryahmed@...gle.com>, Andrew Morton
<akpm@...ux-foundation.org>
CC: Johannes Weiner <hannes@...xchg.org>, Nhat Pham <nphamcs@...il.com>,
Chengming Zhou <chengming.zhou@...ux.dev>, Vitaly Wool
<vitalywool@...il.com>, Barry Song <baohua@...nel.org>, Sam Sun
<samsun1006219@...il.com>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"stable@...r.kernel.org" <stable@...r.kernel.org>, "Sridhar, Kanchana P"
<kanchana.p.sridhar@...el.com>
Subject: RE: [PATCH v2] mm: zswap: properly synchronize freeing resources
during CPU hotunplug
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@...gle.com>
> Sent: Wednesday, January 8, 2025 2:25 PM
> To: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Johannes Weiner <hannes@...xchg.org>; Nhat Pham <nphamcs@...il.com>;
> Chengming Zhou <chengming.zhou@...ux.dev>; Vitaly Wool <vitalywool@...il.com>;
> Barry Song <baohua@...nel.org>; Sam Sun <samsun1006219@...il.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@...el.com>; linux-mm@...ck.org;
> linux-kernel@...r.kernel.org; Yosry Ahmed <yosryahmed@...gle.com>;
> stable@...r.kernel.org
> Subject: [PATCH v2] mm: zswap: properly synchronize freeing resources
> during CPU hotunplug
>
> In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the
> current CPU at the beginning of the operation is retrieved and used
> throughout. However, since neither preemption nor migration is disabled,
> it is possible that the operation continues on a different CPU.
>
> If the original CPU is hotunplugged while the acomp_ctx is still in use,
> we run into a UAF bug as some of the resources attached to the acomp_ctx
> are freed during hotunplug in zswap_cpu_comp_dead() (i.e.
> acomp_ctx.buffer, acomp_ctx.req, or acomp_ctx.acomp).
>
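Just to restate the race for my own understanding, a rough interleaving
sketch (mine, not from the patch):

    Task on CPU #1                         CPU hotplug path
    --------------                         ----------------
    acomp_ctx = raw_cpu_ptr(...);          /* ctx of CPU #1 */
    <task migrates to CPU #2>
                                           offline CPU #1
                                           zswap_cpu_comp_dead()
                                             acomp_request_free(acomp_ctx->req);
                                             kfree(acomp_ctx->buffer);
    crypto_acomp_compress(acomp_ctx->req); /* UAF */
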
> The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to
> use crypto_acomp API for hardware acceleration") when the switch to the
> crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was
> retrieved using get_cpu_ptr() which disables preemption and makes sure
> the CPU cannot go away from under us. Preemption cannot be disabled
> with the crypto_acomp API as a sleepable context is needed.
>
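For context, the pre-1ec3b5fe6eec pattern was roughly the following (a
sketch from memory of the old store path, so the details may be off):

    /*
     * get_cpu_ptr() disables preemption, so the CPU cannot be
     * offlined from under us until put_cpu_ptr().
     */
    tfm = *get_cpu_ptr(entry->pool->tfm);
    ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
    put_cpu_ptr(entry->pool->tfm);
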
> Use the acomp_ctx.mutex to synchronize CPU hotplug callbacks allocating
> and freeing resources with compression/decompression paths. Make sure
> that acomp_ctx.req is NULL when the resources are freed. In the
> compression/decompression paths, check if acomp_ctx.req is NULL after
> acquiring the mutex (meaning the CPU was offlined) and retry on the new
> CPU.
>
> The initialization of acomp_ctx.mutex is moved from the CPU hotplug
> callback to the pool initialization where it belongs (where the mutex is
> allocated). In addition to adding clarity, this makes sure that CPU
> hotplug cannot reinitialize a mutex that is already locked by
> compression/decompression.
>
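Makes sense. A hypothetical interleaving of the hazard if mutex_init()
had stayed in zswap_cpu_comp_prepare():

    zswap_compress()                       CPU offlined, then onlined
    mutex_lock(&acomp_ctx->mutex);
                                           zswap_cpu_comp_prepare()
                                             mutex_init(&acomp_ctx->mutex);
                                             /* reinitializes a held mutex */
    mutex_unlock(&acomp_ctx->mutex);       /* corrupted lock state */
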
> Previously a fix was attempted by holding cpus_read_lock() [1]. This
> could deadlock, as code already holding the lock may fall into reclaim
> and enter zswap, taking the lock recursively. A fix was also attempted
> using SRCU for synchronization, but Johannes pointed out that
> synchronize_srcu() cannot be used in CPU hotplug notifiers [2].
>
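IIUC, the deadlock in [1] is a recursive read-lock on the hotplug lock,
something like this hypothetical call chain:

    cpus_read_lock();                /* some unrelated holder */
      -> allocation enters direct reclaim
        -> swap out -> zswap_store()
          -> cpus_read_lock();       /* read-lock taken again */

With a hotplug writer queued between the two acquisitions, the second
read-lock blocks forever.
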
> Alternative fixes that were considered/attempted and could have worked:
> - Refcounting the per-CPU acomp_ctx. This involves complexity in
> handling the race between the refcount dropping to zero in
> zswap_[de]compress() and the refcount being re-initialized when the
> CPU is onlined.
> - Disabling migration before getting the per-CPU acomp_ctx [3], but
> that's discouraged and is a much bigger hammer than needed, and could
> result in subtle performance issues.
>
> [1] https://lkml.kernel.org/20241219212437.2714151-1-yosryahmed@...gle.com/
> [2] https://lkml.kernel.org/20250107074724.1756696-2-yosryahmed@...gle.com/
> [3] https://lkml.kernel.org/20250107222236.2715883-2-yosryahmed@...gle.com/
>
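Regarding the migrate_disable() alternative in [3], my understanding is
that it was roughly the following (a sketch, assuming I recall the
posting correctly), pinning the task to its CPU while remaining
sleepable:

    migrate_disable();      /* task stays on this CPU, still preemptible */
    acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
    mutex_lock(&acomp_ctx->mutex);
    /* ... compress/decompress ... */
    mutex_unlock(&acomp_ctx->mutex);
    migrate_enable();
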
> Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration")
> Cc: <stable@...r.kernel.org>
> Signed-off-by: Yosry Ahmed <yosryahmed@...gle.com>
> Reported-by: Johannes Weiner <hannes@...xchg.org>
> Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/
> Reported-by: Sam Sun <samsun1006219@...il.com>
> Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@...l.gmail.com/
> ---
>
> This applies on top of the latest mm-hotfixes-unstable, on top of
> 'Revert "mm: zswap: fix race between [de]compression and CPU
> hotunplug"', and after 'mm: zswap: disable migration while using
> per-CPU acomp_ctx' was dropped.
>
> v1 -> v2:
> - Move the initialization of the mutex to pool initialization.
> - Use the mutex to also synchronize with the CPU hotplug callback (i.e.
> zswap_cpu_comp_prep()).
> - Naming cleanups.
>
> ---
> mm/zswap.c | 60 +++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 46 insertions(+), 14 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f6316b66fb236..4d7e564732267 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -251,7 +251,7 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> struct zswap_pool *pool;
> char name[38]; /* 'zswap' + 32 char (max) num + \0 */
> gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> - int ret;
> + int ret, cpu;
>
> if (!zswap_has_pool) {
> /* if either are unset, pool initialization failed, and we
> @@ -285,6 +285,9 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> goto error;
> }
>
> + for_each_possible_cpu(cpu)
> + mutex_init(&per_cpu_ptr(pool->acomp_ctx, cpu)->mutex);
> +
> ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> if (ret)
> @@ -821,11 +824,12 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> struct acomp_req *req;
> int ret;
>
> - mutex_init(&acomp_ctx->mutex);
> -
> + mutex_lock(&acomp_ctx->mutex);
> acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
> - if (!acomp_ctx->buffer)
> - return -ENOMEM;
> + if (!acomp_ctx->buffer) {
> + ret = -ENOMEM;
> + goto buffer_fail;
> + }
>
> acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
> if (IS_ERR(acomp)) {
> @@ -844,6 +848,8 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> ret = -ENOMEM;
> goto req_fail;
> }
> +
> + /* acomp_ctx->req must be NULL if the acomp_ctx is not fully initialized */
> acomp_ctx->req = req;
For this invariant to hold, shouldn't we assign directly:

    acomp_ctx->req = acomp_request_alloc(acomp_ctx->acomp);
    if (!acomp_ctx->req) { ... }

I was also wondering how error conditions encountered in
zswap_cpu_comp_prepare() will impact zswap_[de]compress(). This is
probably unrelated to this patch itself, but is my understanding correct
that an error in this procedure will cause zswap_enabled to be set to
false, so that any subsequent zswap_store() calls fail early?
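For reference, the init-time failure path as I read it in current
zswap_setup() (paraphrased sketch, not the exact code):

    pool = __zswap_pool_create_fallback();
    if (!pool) {
            pr_err("pool creation failed\n");
            zswap_enabled = false;  /* zswap_store() bails out early */
    }
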
Thanks,
Kanchana
>
> crypto_init_wait(&acomp_ctx->wait);
> @@ -855,12 +861,15 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
> acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
>                            crypto_req_done, &acomp_ctx->wait);
>
> + mutex_unlock(&acomp_ctx->mutex);
> return 0;
>
> req_fail:
> crypto_free_acomp(acomp_ctx->acomp);
> acomp_fail:
> kfree(acomp_ctx->buffer);
> +buffer_fail:
> + mutex_unlock(&acomp_ctx->mutex);
> return ret;
> }
>
> @@ -869,17 +878,45 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
> struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
>
> + mutex_lock(&acomp_ctx->mutex);
> if (!IS_ERR_OR_NULL(acomp_ctx)) {
> if (!IS_ERR_OR_NULL(acomp_ctx->req))
> acomp_request_free(acomp_ctx->req);
> + acomp_ctx->req = NULL;
> if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> crypto_free_acomp(acomp_ctx->acomp);
> kfree(acomp_ctx->buffer);
> }
> + mutex_unlock(&acomp_ctx->mutex);
>
> return 0;
> }
>
> +static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
> +{
> + struct crypto_acomp_ctx *acomp_ctx;
> +
> + for (;;) {
> + acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> + mutex_lock(&acomp_ctx->mutex);
> + if (likely(acomp_ctx->req))
> + return acomp_ctx;
> + /*
> + * It is possible that we were migrated to a different CPU after
> + * getting the per-CPU ctx but before the mutex was acquired. If
> + * the old CPU got offlined, zswap_cpu_comp_dead() could have
> + * already freed ctx->req (among other things) and set it to
> + * NULL. Just try again on the new CPU that we ended up on.
> + */
> + mutex_unlock(&acomp_ctx->mutex);
> + }
> +}
> +
> +static void acomp_ctx_put_unlock(struct crypto_acomp_ctx *acomp_ctx)
> +{
> + mutex_unlock(&acomp_ctx->mutex);
> +}
> +
> static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> struct zswap_pool *pool)
> {
> @@ -893,10 +930,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> gfp_t gfp;
> u8 *dst;
>
> - acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> -
> - mutex_lock(&acomp_ctx->mutex);
> -
> + acomp_ctx = acomp_ctx_get_cpu_lock(pool);
> dst = acomp_ctx->buffer;
> sg_init_table(&input, 1);
> sg_set_page(&input, page, PAGE_SIZE, 0);
> @@ -949,7 +983,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> else if (alloc_ret)
> zswap_reject_alloc_fail++;
>
> - mutex_unlock(&acomp_ctx->mutex);
> + acomp_ctx_put_unlock(acomp_ctx);
> return comp_ret == 0 && alloc_ret == 0;
> }
>
> @@ -960,9 +994,7 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> struct crypto_acomp_ctx *acomp_ctx;
> u8 *src;
>
> - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> - mutex_lock(&acomp_ctx->mutex);
> -
> + acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
> src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
> /*
> * If zpool_map_handle is atomic, we cannot reliably utilize its mapped buffer
> @@ -986,10 +1018,10 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> - mutex_unlock(&acomp_ctx->mutex);
>
> if (src != acomp_ctx->buffer)
> zpool_unmap_handle(zpool, entry->handle);
> + acomp_ctx_put_unlock(acomp_ctx);
> }
>
> /*********************************
> --
> 2.47.1.613.gc27f4b7a9f-goog