linux-kernel - Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd

Message-ID: <6bcb4883-03d0-88eb-4c42-84fff0a9a141@loongson.cn>
Date:   Thu, 15 Sep 2022 09:19:48 +0800
From:   Hongchen Zhang <zhanghongchen@...ngson.cn>
To:     Andrew Morton <akpm@...ux-foundation.org>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not
 kswapd

Hi Andrew,

On 2022/9/15 6:51 AM, Andrew Morton wrote:
> On Wed, 14 Sep 2022 10:33:18 +0800 Hongchen Zhang <zhanghongchen@...ngson.cn> wrote:
> 
>> When a process takes a page fault and there is not enough free
>> memory, it does direct reclaim. At the same time, it is holding
>> mmap_lock. So in the multi-threaded case, it should exit from the
>> page fault ASAP.
>> When reclaiming memory, we do scan adjustment between the anon and
>> file LRUs, which may take too long and trigger a hung task for other
>> threads. So a process that is not kswapd should do only a little
>> scan adjustment.
> 
> Well, that's a pretty nasty bug.  Before diving into a possible fix,
> can you please tell us more about how this happens?  What sort of
> machine, what sort of workload.  Can you suggest why others are not
> experiencing this?
We originally got a hung task panic while running ltpstress on a
LoongArch 3A5000+71000 machine. We then found the same problem on an
x86 machine, as follows:
[ 3748.453561] INFO: task float_bessel:77920 blocked for more than 120 seconds.
[ 3748.460839]       Not tainted 5.15.0-46-generic #49-Ubuntu
[ 3748.466490] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3748.474618] task:float_bessel    state:D stack:    0 pid:77920 ppid: 77327 flags:0x00004002
[ 3748.483358] Call Trace:
[ 3748.485964]  <TASK>
[ 3748.488150]  __schedule+0x23d/0x590
[ 3748.491804]  schedule+0x4e/0xc0
[ 3748.495038]  rwsem_down_read_slowpath+0x336/0x390
[ 3748.499886]  ? copy_user_enhanced_fast_string+0xe/0x40
[ 3748.505181]  down_read+0x43/0xa0
[ 3748.508518]  do_user_addr_fault+0x41c/0x670
[ 3748.512799]  exc_page_fault+0x77/0x170
[ 3748.516673]  asm_exc_page_fault+0x26/0x30
[ 3748.520824] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x40
[ 3748.526764] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 cc cc cc cc 0f 1f 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 cc cc cc cc 66 08
[ 3748.546120] RSP: 0018:ffffaa9248fffb90 EFLAGS: 00050206
[ 3748.551495] RAX: 00007f99faa1a010 RBX: ffffaa9248fffd88 RCX: 0000000000000010
[ 3748.558828] RDX: 0000000000001000 RSI: ffff9db397ab8ff0 RDI: 00007f99faa1a000
[ 3748.566160] RBP: ffffaa9248fffbf0 R08: ffffcc2fc2965d80 R09: 0000000000000014
[ 3748.573492] R10: 0000000000000000 R11: 0000000000000014 R12: 0000000000001000
[ 3748.580858] R13: 0000000000001000 R14: 0000000000000000 R15: ffffaa9248fffd98
[ 3748.588196]  ? copy_page_to_iter+0x10e/0x400
[ 3748.592614]  filemap_read+0x174/0x3e0
[ 3748.596354]  ? ima_file_check+0x6a/0xa0
[ 3748.600301]  generic_file_read_iter+0xe5/0x150
[ 3748.604884]  ext4_file_read_iter+0x5b/0x190
[ 3748.609164]  ? aa_file_perm+0x102/0x250
[ 3748.613125]  new_sync_read+0x10d/0x1a0
[ 3748.617009]  vfs_read+0x103/0x1a0
[ 3748.620423]  ksys_read+0x67/0xf0
[ 3748.623743]  __x64_sys_read+0x19/0x20
[ 3748.627511]  do_syscall_64+0x59/0xc0
[ 3748.631203]  ? syscall_exit_to_user_mode+0x27/0x50
[ 3748.636144]  ? do_syscall_64+0x69/0xc0
[ 3748.639992]  ? exit_to_user_mode_prepare+0x96/0xb0
[ 3748.644931]  ? irqentry_exit_to_user_mode+0x9/0x20
[ 3748.649872]  ? irqentry_exit+0x1d/0x30
[ 3748.653737]  ? exc_page_fault+0x89/0x170
[ 3748.657795]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 3748.663030] RIP: 0033:0x7f9a852989cc
[ 3748.666713] RSP: 002b:00007f9a8497dc90 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 3748.674487] RAX: ffffffffffffffda RBX: 00007f9a8497f5c0 RCX: 00007f9a852989cc
[ 3748.681817] RDX: 0000000000027100 RSI: 00007f99faa18010 RDI: 0000000000000061
[ 3748.689150] RBP: 00007f9a8497dd60 R08: 0000000000000000 R09: 00007f99faa18010
[ 3748.696493] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f99faa18010
[ 3748.703841] R13: 00005605e11c406f R14: 0000000000000001 R15: 0000000000027100
[ 3748.711199]  </TASK>
...
...
[ 3750.943278] Kernel panic - not syncing: hung_task: blocked tasks
[ 3750.949399] CPU: 1 PID: 39 Comm: khungtaskd Not tainted 5.15.0-46-generic #49-Ubuntu
[ 3750.957305] Hardware name: LENOVO 90DWCTO1WW/30D9, BIOS M05KT70A 03/09/2017
[ 3750.964410] Call Trace:
[ 3750.966897]  <TASK>
[ 3750.969031]  show_stack+0x52/0x5c
[ 3750.972409]  dump_stack_lvl+0x4a/0x63
[ 3750.976129]  dump_stack+0x10/0x16
[ 3750.979491]  panic+0x149/0x321
[ 3750.982612]  check_hung_uninterruptible_tasks.cold+0x34/0x48
[ 3750.988383]  watchdog+0xad/0xb0
[ 3750.991562]  ? check_hung_uninterruptible_tasks+0x300/0x300
[ 3750.997285]  kthread+0x127/0x150
[ 3751.000587]  ? set_kthread_struct+0x50/0x50
[ 3751.004878]  ret_from_fork+0x1f/0x30
[ 3751.008527]  </TASK>
[ 3751.010794] Kernel Offset: 0x34600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 3751.034481] ---[ end Kernel panic - not syncing: hung_task: blocked tasks ]---
The difference from a normal ltpstress run is that we use a very large
swap partition, so the swap pressure is higher than normal and this
problem becomes more likely to occur.

>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3042,11 +3042,17 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>   		nr[lru] = targets[lru] * (100 - percentage) / 100;
>>   		nr[lru] -= min(nr[lru], nr_scanned);
>>   
>> +		if (!current_is_kswapd())
>> +			nr[lru] = min(nr[lru], nr_to_reclaim);
>> +
>>   		lru += LRU_ACTIVE;
>>   		nr_scanned = targets[lru] - nr[lru];
>>   		nr[lru] = targets[lru] * (100 - percentage) / 100;
>>   		nr[lru] -= min(nr[lru], nr_scanned);
>>   
>> +		if (!current_is_kswapd())
>> +			nr[lru] = min(nr[lru], nr_to_reclaim);
>> +
>>   		scan_adjusted = true;
>>   	}
>>   	blk_finish_plug(&plug);
> 
> It would be better if these additions had code comments explaining why
> they're there.  But let's more fully understand the problem before
> altering your patch.
Thanks,
Hongchen Zhang