linux-kernel - Re: [PATCH] mm, swap: fix allocation and scanning race with swapoff

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <87cyizzgt5.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 13 Nov 2024 16:19:50 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Kairui Song <ryncsn@...il.com>
Cc: linux-mm@...ck.org,  Kairui Song <kasong@...cent.com>,  Andrew Morton
 <akpm@...ux-foundation.org>,  Chris Li <chrisl@...nel.org>,  Barry Song
 <v-songbaohua@...o.com>,  Ryan Roberts <ryan.roberts@....com>,  Hugh
 Dickins <hughd@...gle.com>,  Kalesh Singh <kaleshsingh@...gle.com>,
  linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm, swap: fix allocation and scanning race with swapoff

Kairui Song <ryncsn@...il.com> writes:

> From: Kairui Song <kasong@...cent.com>
>
> There are two flags used to synchronize allocation and scanning with
> swapoff: SWP_WRITEOK and SWP_SCANNING.
>
> SWP_WRITEOK: Swapoff will first unset this flag, at this point any
> further swap allocation or scanning on this device should just abort
> so no more new entries will be referencing this device. Swapoff
> will then unuse all existing swap entries.
>
> SWP_SCANNING: This flag is set when device is being scanned. Swapoff
> will wait for all scanner to stop before the final release of the swap
> device structures to avoid UAF. Note this flag is the highest used bit
> of si->flags so it could be added up arithmetically, if there are
> multiple scanner.
>
> commit 5f843a9a3a1e ("mm: swap: separate SSD allocation from
> scan_swap_map_slots()") ignored SWP_SCANNING and SWP_WRITEOK flags while
> separating cluster allocation path from the old allocation path. Add
> the flags back to fix swapoff race. The race is hard to trigger as
> si->lock prevents most parallel operations, but si->lock could be
> dropped for reclaim or discard. This issue is found during code review.
>
> This commit fixes this problem. For SWP_SCANNING, Just like before,
> set the flag before scan and remove it afterwards.
>
> For SWP_WRITEOK, there are several places where si->lock could
> be dropped, it will be error-prone and make the code hard to follow
> if we try to cover these places one by one. So just do one check before
> the real allocation, which is also very similar like before.
> With new cluster allocator it may waste a bit of time iterating
> the clusters but won't take long, and swapoff is not performance
> sensitive.
>
> Reported-by: "Huang, Ying" <ying.huang@...el.com>
> Closes: https://lore.kernel.org/linux-mm/87a5es3f1f.fsf@yhuang6-desk2.ccr.corp.intel.com/
> Fixes: 5f843a9a3a1e ("mm: swap: separate SSD allocation from scan_swap_map_slots()")
> Signed-off-by: Kairui Song <kasong@...cent.com>
> ---
>  mm/swapfile.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9c85bd46ab7f..b0a9071cfe1d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -664,12 +664,15 @@ static bool cluster_scan_range(struct swap_info_struct *si,
>  	return true;
>  }
>  
> -static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
> +static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
>  				unsigned int start, unsigned char usage,
>  				unsigned int order)
>  {
>  	unsigned int nr_pages = 1 << order;
>  
> +	if (!(si->flags & SWP_WRITEOK))
> +		return false;
> +
>  	if (cluster_is_free(ci)) {
>  		if (nr_pages < SWAPFILE_CLUSTER) {
>  			list_move_tail(&ci->list, &si->nonfull_clusters[order]);
> @@ -690,6 +693,8 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
>  		list_move_tail(&ci->list, &si->full_clusters);
>  		ci->flags = CLUSTER_FLAG_FULL;
>  	}
> +
> +	return true;
>  }
>  
>  static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
> @@ -713,7 +718,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
>  
>  	while (offset <= end) {
>  		if (cluster_scan_range(si, ci, offset, nr_pages)) {
> -			cluster_alloc_range(si, ci, offset, usage, order);
> +			if (!cluster_alloc_range(si, ci, offset, usage, order)) {
> +				offset = SWAP_NEXT_INVALID;

We set offset to SWAP_NEXT_INVALID for 3 times in this function.  Can we
use a local variable to remove the duplication?  For example,

        unsigned long ret = SWAP_NEXT_INVALID;

And set ret if we allocate swap entry successfully.  We can do this in a
separate patch.

Otherwise, LGTM, Thanks!  Feel free to add,

Reviewed-by: "Huang, Ying" <ying.huang@...el.com>

> +				goto done;
> +			}
>  			*foundp = offset;
>  			if (ci->count == SWAPFILE_CLUSTER) {
>  				offset = SWAP_NEXT_INVALID;
> @@ -805,7 +813,11 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  	if (!list_empty(&si->free_clusters)) {
>  		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>  		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> -		VM_BUG_ON(!found);
> +		/*
> +		 * Either we didn't touch the cluster due to swapoff,
> +		 * or the allocation must success.
> +		 */
> +		VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
>  		goto done;
>  	}
>  
> @@ -1041,6 +1053,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  
>  	VM_BUG_ON(!si->cluster_info);
>  
> +	si->flags += SWP_SCANNING;
> +
>  	while (n_ret < nr) {
>  		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
>  
> @@ -1049,6 +1063,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  		slots[n_ret++] = swp_entry(si->type, offset);
>  	}
>  
> +	si->flags -= SWP_SCANNING;
> +
>  	return n_ret;
>  }

--
Best Regards,
Huang, Ying