linux-kernel - Re: [PATCH v2] mm/vmscan: Fix hard LOCKUP in function isolate_lru

Message-ID: <e878653e-d380-81c2-90a8-fd2d1d4e7287@kylinos.cn>
Date: Mon, 23 Sep 2024 14:03:46 +0800
From: liuye <liuye@...inos.cn>
To: Bharata B Rao <bharata@....com>, akpm@...ux-foundation.org
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 Johannes Weiner <hannes@...xchg.org>,
 "Dadhania, Nikunj" <nikunj.dadhania@....com>,
 Usama Arif <usamaarif642@...il.com>, Yu Zhao <yuzhao@...gle.com>,
 Zhaoyang Huang <huangzhaoyang@...il.com>, Breno Leitao <leitao@...ian.org>
Subject: Re: [PATCH v2] mm/vmscan: Fix hard LOCKUP in function
 isolate_lru_folios



On 2024/9/20 2:31 PM, Bharata B Rao wrote:
> On 19-Sep-24 7:44 AM, liuye wrote:
>> This fixes the following hard lockup in isolate_lru_folios during
>> memory reclaim. If the LRU mostly contains ineligible folios, the scan
>> may trigger the watchdog.
>>
>> watchdog: Watchdog detected hard LOCKUP on cpu 173
>> RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0
>> Call Trace:
>>     _raw_spin_lock_irqsave+0x31/0x40
>>     folio_lruvec_lock_irqsave+0x5f/0x90
>>     folio_batch_move_lru+0x91/0x150
>>     lru_add_drain_per_cpu+0x1c/0x40
>>     process_one_work+0x17d/0x350
>>     worker_thread+0x27b/0x3a0
>>     kthread+0xe8/0x120
>>     ret_from_fork+0x34/0x50
>>     ret_from_fork_asm+0x1b/0x30
>>
>> lruvec->lru_lock owner:
>>
>> PID: 2865     TASK: ffff888139214d40  CPU: 40   COMMAND: "kswapd0"
>>   #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555
>>   #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171
>>   #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920
>>   #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4
>>   #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde
>>      [exception RIP: isolate_lru_folios+403]
>>      RIP: ffffffffa597df53  RSP: ffffc90006fb7c28  RFLAGS: 00000002
>>      RAX: 0000000000000001  RBX: ffffc90006fb7c60  RCX: ffffea04a2196f88
>>      RDX: ffffc90006fb7c60  RSI: ffffc90006fb7c60  RDI: ffffea04a2197048
>>      RBP: ffff88812cbd3010   R8: ffffea04a2197008   R9: 0000000000000001
>>      R10: 0000000000000000  R11: 0000000000000001  R12: ffffea04a2197008
>>      R13: ffffea04a2197048  R14: ffffc90006fb7de8  R15: 0000000003e3e937
>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>      <NMI exception stack>
>>   #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
>>   #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788
>>   #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0
>>   #8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354
>>   #9 [ffffc90006fb7ef8] kthread at ffffffffa5748238
>> crash>
>>
>> Scenario:
>> User processes request a large amount of memory and keep the pages active.
>> Then a module continuously requests memory from the ZONE_DMA32 area.
>> Memory reclaim is triggered once the ZONE_DMA32 watermark is reached.
>> However, pages in the LRU (active_anon) list are mostly from
>> the ZONE_NORMAL area.
>>
>> Reproduce:
>> Terminal 1: Set up a workload that continuously increases active (anon) pages.
>> mkdir /tmp/memory
>> mount -t tmpfs -o size=1024000M tmpfs /tmp/memory
>> dd if=/dev/zero of=/tmp/memory/block bs=4M
>> tail /tmp/memory/block
>>
>> Terminal 2:
>> vmstat -a 1
>> The active column will increase.
>> procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ...
>>   r  b   swpd   free  inact active   si   so    bi    bo
>>   1  0   0 1445623076 45898836 83646008    0    0     0
>>   1  0   0 1445623076 43450228 86094616    0    0     0
>>   1  0   0 1445623076 41003480 88541364    0    0     0
>>   1  0   0 1445623076 38557088 90987756    0    0     0
>>   1  0   0 1445623076 36109688 93435156    0    0     0
>>   1  0   0 1445619552 33663256 95881632    0    0     0
>>   1  0   0 1445619804 31217140 98327792    0    0     0
>>   1  0   0 1445619804 28769988 100774944    0    0     0
>>   1  0   0 1445619804 26322348 103222584    0    0     0
>>   1  0   0 1445619804 23875592 105669340    0    0     0
>>
>> cat /proc/meminfo | head
>> Active(anon) increases.
>> MemTotal:       1579941036 kB
>> MemFree:        1445618500 kB
>> MemAvailable:   1453013224 kB
>> Buffers:            6516 kB
>> Cached:         128653956 kB
>> SwapCached:            0 kB
>> Active:         118110812 kB
>> Inactive:       11436620 kB
>> Active(anon):   115345744 kB
>> Inactive(anon):   945292 kB
>>
>> When Active(anon) reaches 115345744 kB, loading the module with insmod
>> triggers the ZONE_DMA32 watermark.
>>
>> perf record -e vmscan:mm_vmscan_lru_isolate -aR
>> perf script
>> isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2
>> nr_skipped=2 nr_taken=0 lru=active_anon
>> isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0
>> nr_skipped=0 nr_taken=0 lru=active_anon
>> isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844
>> nr_skipped=28835844 nr_taken=0 lru=active_anon
>> isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844
>> nr_skipped=28835844 nr_taken=0 lru=active_anon
>> isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29
>> nr_skipped=29 nr_taken=0 lru=active_anon
>> isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0
>> nr_skipped=0 nr_taken=0 lru=active_anon
>>
>> See nr_scanned=28835844.
>> 28835844 * 4 KB = 115343376 kB, approximately equal to 115345744 kB.
>>
>> If Active(anon) is increased to 1000G and the module is then loaded so that
>> the ZONE_DMA32 watermark is triggered, a hard lockup will occur.
>>
>> On my device, nr_scanned = 0x0000000003e3e937 at the time of the hard lockup.
>> Converted to memory size: 0x0000000003e3e937 * 4 KB = 261072092 kB.
>>
>>     [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
>>      ffffc90006fb7c30: 0000000000000020 0000000000000000
>>      ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000
>>      ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8
>>      ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48
>>      ffffc90006fb7c70: 0000000000000000 0000000000000000
>>      ffffc90006fb7c80: 0000000000000000 0000000000000000
>>      ffffc90006fb7c90: 0000000000000000 0000000000000000
>>      ffffc90006fb7ca0: 0000000000000000 0000000003e3e937
>>      ffffc90006fb7cb0: 0000000000000000 0000000000000000
>>      ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000
>>
>> About the Fixes:
>> Why did it take eight years to be discovered?
>>
>> The problem requires the following conditions to occur:
>> 1. The device memory should be large enough.
>> 2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area.
>> 3. The memory in ZONE_DMA32 needs to reach the watermark.
>>
>> If the memory is not large enough, or if ZONE_DMA32 memory is used
>> sensibly, this problem is difficult to detect.
>>
>> Notes:
>> The problem is most likely to occur with ZONE_DMA32 and ZONE_NORMAL,
>> but other suitable scenarios may also trigger it.
> 
> This problem appears very similar to the one we reported some time back at
> 
> https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
> 
> where ~150 million folios were being skipped to isolate a few ZONE_DMA folios.
> 

Yes, this is a similar scenario.

>>
>> Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
>> Signed-off-by: liuye <liuye@...inos.cn>
>>
>> ---
>> V1->V2 : Adjust code format and add scenario description, reproduction method.
>> ---
>> ---
>>   include/linux/swap.h | 1 +
>>   mm/vmscan.c          | 6 +++++-
>>   2 files changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index ba7ea95d1c57..afb3274c90ef 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -223,6 +223,7 @@ enum {
>>   };
>>
>>   #define SWAP_CLUSTER_MAX 32UL
>> +#define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10)
>>   #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>>
>>   /* Bit flag in swap_map */
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index bd489c1af228..d2e436a4f47d 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1636,6 +1636,7 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>>       unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
>>       unsigned long skipped = 0;
>>       unsigned long scan, total_scan, nr_pages;
>> +    unsigned long max_nr_skipped = 0;
>>       LIST_HEAD(folios_skipped);
>>
>>       total_scan = 0;
>> @@ -1650,9 +1651,12 @@ static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
>>           nr_pages = folio_nr_pages(folio);
>>           total_scan += nr_pages;
>>
>> -        if (folio_zonenum(folio) > sc->reclaim_idx) {
>> +        /* Using max_nr_skipped to prevent hard LOCKUP*/
>> +        if (max_nr_skipped < SWAP_CLUSTER_MAX_SKIPPED &&
>> +            (folio_zonenum(folio) > sc->reclaim_idx)) {
>>               nr_skipped[folio_zonenum(folio)] += nr_pages;
>>               move_to = &folios_skipped;
>> +            max_nr_skipped++;
>>               goto move;
>>           }
> 
> I am not sure the above would help in all scenarios, since limiting the skipped folios list to 1 million entries could not fix the soft/hard lockup issue.
> 

This value should not be too large; before b2e18757f2c9, the original value was 32.

 #define SWAP_CLUSTER_MAX 32UL
+#define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10)
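
For reference, the arithmetic behind these figures can be checked with a small user-space program. This is only a sketch: it assumes 4 KB base pages (as in the calculations in the patch description), and pages_to_kb() and print_figures() are hypothetical helpers, not kernel functions.

```c
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL
#define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10)

/* Assumption: 4 KB base pages, as used in the report's calculations. */
#define PAGE_SIZE_KB 4UL

/* Hypothetical helper: convert a page count to kB. */
unsigned long pages_to_kb(unsigned long nr_pages)
{
    return nr_pages * PAGE_SIZE_KB;
}

/* Print the cap and the two nr_scanned values quoted in the report. */
void print_figures(void)
{
    printf("skip cap:   %lu folios\n", SWAP_CLUSTER_MAX_SKIPPED);
    printf("perf trace: %lu kB\n", pages_to_kb(28835844UL));
    printf("lockup:     %lu kB\n", pages_to_kb(0x3e3e937UL));
}
```

With these definitions the cap evaluates to 32768 folios per call, and the two nr_scanned values correspond to 115343376 kB and 261072092 kB.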

To prevent both lock contention and lockups, this value should be neither too small nor too large.
How long it takes to trigger the lockup varies with the CPU frequency.
I am not sure whether this value of SWAP_CLUSTER_MAX_SKIPPED is the most appropriate one, but it does work.
My patch works for all scenarios and does not change the earlier code logic.
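
To illustrate the effect of the cap, here is a simplified user-space model of the scan loop. It is a sketch under stated assumptions, not the kernel code: every entry stands for one base-page folio, the lock itself is not modeled ("visited" stands in for the work done while lru_lock is held), and struct scan_result and model_isolate() are hypothetical names. As in the patch, skipped entries do not consume the scan budget, and once the cap is reached ineligible entries are no longer skipped but fall through to the normal path.

```c
#include <limits.h>
#include <stddef.h>

#define SWAP_CLUSTER_MAX 32UL
#define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10)

/* Hypothetical result record for the model below. */
struct scan_result {
    unsigned long visited;  /* entries walked in one call */
    unsigned long skipped;  /* entries set aside as ineligible */
    unsigned long taken;    /* entries that went down the isolate path */
};

/*
 * Hypothetical model of the skip logic: zone[i] is the zone index of
 * entry i, reclaim_idx is the highest eligible zone. Passing ULONG_MAX
 * as skip_cap models the behaviour without the patch.
 */
struct scan_result model_isolate(const int *zone, size_t n,
                                 int reclaim_idx,
                                 unsigned long nr_to_scan,
                                 unsigned long skip_cap)
{
    struct scan_result r = { 0, 0, 0 };
    unsigned long scan = 0;
    size_t i;

    for (i = 0; i < n && scan < nr_to_scan; i++) {
        r.visited++;
        if (r.skipped < skip_cap && zone[i] > reclaim_idx) {
            r.skipped++;
            continue;   /* skipped entries do not count toward scan */
        }
        scan++;
        r.taken++;
    }
    return r;
}
```

For example, with 100000 entries that are all ineligible, reclaim_idx = 1 and nr_to_scan = 32, the model visits all 100000 entries when skip_cap is ULONG_MAX, but only 32768 + 32 = 32800 entries when skip_cap is SWAP_CLUSTER_MAX_SKIPPED.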

> In fact, what helped was the fix by Yu Zhao, which released the lruvec lock. This was posted for consideration at
> 
> https://lore.kernel.org/lkml/ZsTOwBffg5xSCUbP@gmail.com/T/
> 
> However, this posting eventually resulted in the revert of
> 5da226dbfce3a2. Johannes also raised concerns about hoarding a large number of folios on the skipped list and about the effect (on compaction) of releasing the lruvec spinlock without clearing the LRU flag.
> 

Regarding Yu Zhao's patch: dropping the lock and yielding to the scheduler may allow the LRU list to change, which makes data corruption more likely. There are also the other concerns you mentioned.
Of course, that approach would be great if it could solve all the problems in all scenarios.

Please also Cc me on any other emails related to this discussion.

Thanks,
Liuye