Message-Id: <20241129192228.6f08e74a555bedcad71d32f4@linux-foundation.org>
Date: Fri, 29 Nov 2024 19:22:28 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: liuye <liuye@...inos.cn>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, Mel Gorman
<mgorman@...hsingularity.net>, Hugh Dickins <hughd@...gle.com>, Yang Shi
<yang@...amperecomputing.com>
Subject: Re: [PATCH v2 RESEND] mm/vmscan: Fix hard LOCKUP in function
isolate_lru_folios
On Tue, 19 Nov 2024 14:08:42 +0800 liuye <liuye@...inos.cn> wrote:
> This fixes the following hard lockup in isolate_lru_folios() during
> memory reclaim. If the LRU mostly contains ineligible folios, the scan
> may run long enough to trigger the watchdog.
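> 
> The scan loop (simplified sketch, not the exact kernel code) makes the
> failure mode clear: ineligible folios are moved to a local skip list
> without counting toward the scan budget, so with lruvec->lru_lock held
> the loop only stops once the LRU runs out of ineligible folios:
> 
>   /* Simplified sketch of the pre-patch isolate_lru_folios() loop. */
>   while (scan < nr_to_scan && !list_empty(src)) {
>           struct folio *folio = lru_to_folio(src);
>           unsigned long nr_pages = folio_nr_pages(folio);
> 
>           if (folio_zonenum(folio) > sc->reclaim_idx) {
>                   /* Skipped folios do not advance 'scan'. */
>                   list_move(&folio->lru, &folios_skipped);
>                   continue;
>           }
>           scan += nr_pages;
>           /* ... try to isolate the folio ... */
>   }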
>
> watchdog: Watchdog detected hard LOCKUP on cpu 173
> RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0
> Call Trace:
> _raw_spin_lock_irqsave+0x31/0x40
> folio_lruvec_lock_irqsave+0x5f/0x90
> folio_batch_move_lru+0x91/0x150
> lru_add_drain_per_cpu+0x1c/0x40
> process_one_work+0x17d/0x350
> worker_thread+0x27b/0x3a0
> kthread+0xe8/0x120
> ret_from_fork+0x34/0x50
> ret_from_fork_asm+0x1b/0x30
>
> lruvec->lru_lock owner:
>
> PID: 2865 TASK: ffff888139214d40 CPU: 40 COMMAND: "kswapd0"
> #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555
> #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171
> #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920
> #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4
> #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde
> [exception RIP: isolate_lru_folios+403]
> RIP: ffffffffa597df53 RSP: ffffc90006fb7c28 RFLAGS: 00000002
> RAX: 0000000000000001 RBX: ffffc90006fb7c60 RCX: ffffea04a2196f88
> RDX: ffffc90006fb7c60 RSI: ffffc90006fb7c60 RDI: ffffea04a2197048
> RBP: ffff88812cbd3010 R8: ffffea04a2197008 R9: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000001 R12: ffffea04a2197008
> R13: ffffea04a2197048 R14: ffffc90006fb7de8 R15: 0000000003e3e937
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> <NMI exception stack>
> #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
> #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788
> #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0
> #8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354
> #9 [ffffc90006fb7ef8] kthread at ffffffffa5748238
> crash>
>
> Scenario:
> User processes request a large amount of memory and keep the pages
> active. A module then continuously requests memory from the
> ZONE_DMA32 area. Memory reclaim is triggered because the ZONE_DMA32
> watermark is reached. However, the pages on the LRU(active_anon) list
> are mostly from the ZONE_NORMAL area.
>
> Reproduce:
> Terminal 1: continuously increase the number of active(anon) pages.
> mkdir /tmp/memory
> mount -t tmpfs -o size=1024000M tmpfs /tmp/memory
> dd if=/dev/zero of=/tmp/memory/block bs=4M
> tail /tmp/memory/block
>
> Terminal 2:
> vmstat -a 1
> The active column keeps increasing.
> procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ...
> r b swpd free inact active si so bi bo
> 1 0 0 1445623076 45898836 83646008 0 0 0
> 1 0 0 1445623076 43450228 86094616 0 0 0
> 1 0 0 1445623076 41003480 88541364 0 0 0
> 1 0 0 1445623076 38557088 90987756 0 0 0
> 1 0 0 1445623076 36109688 93435156 0 0 0
> 1 0 0 1445619552 33663256 95881632 0 0 0
> 1 0 0 1445619804 31217140 98327792 0 0 0
> 1 0 0 1445619804 28769988 100774944 0 0 0
> 1 0 0 1445619804 26322348 103222584 0 0 0
> 1 0 0 1445619804 23875592 105669340 0 0 0
>
> cat /proc/meminfo | head
> Active(anon) keeps increasing.
> MemTotal: 1579941036 kB
> MemFree: 1445618500 kB
> MemAvailable: 1453013224 kB
> Buffers: 6516 kB
> Cached: 128653956 kB
> SwapCached: 0 kB
> Active: 118110812 kB
> Inactive: 11436620 kB
> Active(anon): 115345744 kB
> Inactive(anon): 945292 kB
>
> When Active(anon) reaches 115345744 kB, loading the module (insmod)
> pushes ZONE_DMA32 below its watermark.
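> 
> (For reference, a module that does roughly the following is enough to
> create that ZONE_DMA32 pressure; this is a hypothetical sketch, not
> the exact module used here.)
> 
>   #include <linux/module.h>
>   #include <linux/init.h>
>   #include <linux/gfp.h>
>   #include <linux/list.h>
> 
>   static LIST_HEAD(dma32_pages);
> 
>   static int __init dma32_pressure_init(void)
>   {
>           struct page *page;
>           unsigned long i;
> 
>           /* Pin order-0 pages from ZONE_DMA32 until allocation fails,
>            * pushing the zone below its watermarks so reclaim runs. */
>           for (i = 0; i < (1UL << 20); i++) {
>                   page = alloc_pages(GFP_KERNEL | GFP_DMA32 |
>                                      __GFP_NORETRY | __GFP_NOWARN, 0);
>                   if (!page)
>                           break;
>                   list_add(&page->lru, &dma32_pages);
>           }
>           return 0;
>   }
> 
>   static void __exit dma32_pressure_exit(void)
>   {
>           struct page *page, *next;
> 
>           list_for_each_entry_safe(page, next, &dma32_pages, lru)
>                   __free_pages(page, 0);
>   }
> 
>   module_init(dma32_pressure_init);
>   module_exit(dma32_pressure_exit);
>   MODULE_LICENSE("GPL");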
>
> perf record -e vmscan:mm_vmscan_lru_isolate -aR
> perf script
> isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2
> nr_skipped=2 nr_taken=0 lru=active_anon
> isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0
> nr_skipped=0 nr_taken=0 lru=active_anon
> isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844
> nr_skipped=28835844 nr_taken=0 lru=active_anon
> isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844
> nr_skipped=28835844 nr_taken=0 lru=active_anon
> isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29
> nr_skipped=29 nr_taken=0 lru=active_anon
> isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0
> nr_skipped=0 nr_taken=0 lru=active_anon
>
> Note nr_scanned=28835844.
> 28835844 * 4k = 115343376 kB, approximately equal to the 115345744 kB
> of Active(anon) above.
>
> If Active(anon) is increased to 1000G before the module is loaded to
> trigger the ZONE_DMA32 watermark, a hard lockup will occur.
>
> On my device, nr_scanned was 0x0000000003e3e937 at the time of the
> hard lockup, which corresponds to 0x0000000003e3e937 * 4 KB =
> 261072092 KB.
>
> [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
> ffffc90006fb7c30: 0000000000000020 0000000000000000
> ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000
> ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8
> ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48
> ffffc90006fb7c70: 0000000000000000 0000000000000000
> ffffc90006fb7c80: 0000000000000000 0000000000000000
> ffffc90006fb7c90: 0000000000000000 0000000000000000
> ffffc90006fb7ca0: 0000000000000000 0000000003e3e937
> ffffc90006fb7cb0: 0000000000000000 0000000000000000
> ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000
>
> About the Fixes tag:
> Why did it take eight years for this to be discovered?
>
> The problem requires the following conditions to occur:
> 1. The device memory should be large enough.
> 2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area.
> 3. The memory in ZONE_DMA32 needs to reach the watermark.
>
> If the machine does not have enough memory, or if ZONE_DMA32 memory
> is used in a reasonable way, this problem is difficult to detect.
>
> Notes:
> The problem is most likely to occur with ZONE_DMA32 and ZONE_NORMAL,
> but other zone combinations may also trigger it.
>
> Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
>
Thanks.
This is old code. I agree on b2e18757f2c9 and thanks for digging that
out.
I'll add a cc:stable and shall queue it for testing, pending review
from others (please). It may be that the -stable tree maintainers ask
for a backport of this change into pre-folio-conversion kernels. But
given the obscurity of the workload, I'm not sure this would be worth
doing. Opinions are sought?
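If I'm reading the new #define correctly, SWAP_CLUSTER_MAX_SKIPPED
works out to 32UL << 10 = 32768, so each isolate_lru_folios() call now
parks at most 32768 ineligible folios on folios_skipped before they
start counting against the scan budget, which is what bounds the time
spent under lru_lock here.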
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -223,6 +223,7 @@ enum {
> };
>
> #define SWAP_CLUSTER_MAX 32UL
> +#define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10)
> #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>
> /* Bit flag in swap_map */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 28ba2b06fc7d..0bdfae413b4c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1657,6 +1657,7 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
> unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
> unsigned long skipped = 0;
> unsigned long scan, total_scan, nr_pages;
> + unsigned long max_nr_skipped = 0;
> LIST_HEAD(folios_skipped);
>
> total_scan = 0;
> @@ -1671,9 +1672,12 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
> nr_pages = folio_nr_pages(folio);
> total_scan += nr_pages;
>
> - if (folio_zonenum(folio) > sc->reclaim_idx) {
> + /* Using max_nr_skipped to prevent hard LOCKUP */
> + if (max_nr_skipped < SWAP_CLUSTER_MAX_SKIPPED &&
> + (folio_zonenum(folio) > sc->reclaim_idx)) {
> nr_skipped[folio_zonenum(folio)] += nr_pages;
> move_to = &folios_skipped;
> + max_nr_skipped++;
> goto move;
> }
>
> --
> 2.25.1