Message-ID: <1564597080.11067.40.camel@lca.pw>
Date:   Wed, 31 Jul 2019 14:18:00 -0400
From:   Qian Cai <cai@....pw>
To:     Minchan Kim <minchan@...nel.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...e.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: "mm: account nr_isolated_xxx in [isolate|putback]_lru_page"
 breaks OOM with swap

On Wed, 2019-07-31 at 12:09 -0400, Qian Cai wrote:
> On Wed, 2019-07-31 at 14:34 +0900, Minchan Kim wrote:
> > On Tue, Jul 30, 2019 at 12:25:28PM -0400, Qian Cai wrote:
> > > OOM workloads with swapping are unable to recover on linux-next since
> > > next-20190729 due to the commit "mm: account nr_isolated_xxx in
> > > [isolate|putback]_lru_page" [1].
> > > 
> > > [1] https://lore.kernel.org/linux-mm/20190726023435.214162-4-minchan@kernel.org/T/#mdcd03bcb4746f2f23e6f508c205943726aee8355
> > > 
> > > For example, the LTP oom01 test case is stuck for hours, while it finishes
> > > in a few minutes here after reverting the above commit. Sometimes it prints
> > > these messages while hanging.
> > > 
> > > [  509.983393][  T711] INFO: task oom01:5331 blocked for more than 122 seconds.
> > > [  509.983431][  T711]       Not tainted 5.3.0-rc2-next-20190730 #7
> > > [  509.983447][  T711] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [  509.983477][  T711] oom01           D24656  5331   5157 0x00040000
> > > [  509.983513][  T711] Call Trace:
> > > [  509.983538][  T711] [c00020037d00f880] [0000000000000008] 0x8 (unreliable)
> > > [  509.983583][  T711] [c00020037d00fa60] [c000000000023724] __switch_to+0x3a4/0x520
> > > [  509.983615][  T711] [c00020037d00fad0] [c0000000008d17bc] __schedule+0x2fc/0x950
> > > [  509.983647][  T711] [c00020037d00fba0] [c0000000008d1e68] schedule+0x58/0x150
> > > [  509.983684][  T711] [c00020037d00fbd0] [c0000000008d7614] rwsem_down_read_slowpath+0x4b4/0x630
> > > [  509.983727][  T711] [c00020037d00fc90] [c0000000008d7dfc] down_read+0x12c/0x240
> > > [  509.983758][  T711] [c00020037d00fd20] [c00000000005fb28] __do_page_fault+0x6f8/0xee0
> > > [  509.983801][  T711] [c00020037d00fe20] [c00000000000a364] handle_page_fault+0x18/0x38
> > 
> > Thanks for the testing! It is no surprise that the patch causes some bugs,
> > because it's rather tricky.
> > 
> > Could you test this patch?
> 
> It does help the situation a bit, but recovery is still much slower than
> with the commit "mm: account nr_isolated_xxx in [isolate|putback]_lru_page"
> simply reverted. For example, on this powerpc system, oom01 used to take 4
> minutes to finish, while it now still takes 13 minutes.
> 
> oom02 (testing NUMA mempolicy) takes even longer, and I gave up after 26
> minutes, with several hung tasks shown below.

Also, oom02 is stuck on an x86 machine.

[10327.974285][  T197] INFO: task oom02:29546 can't die for more than 122 seconds.
[10327.981654][  T197] oom02           D22576 29546  29536 0x00004006
[10327.987928][  T197] Call Trace:
[10327.991237][  T197]  __schedule+0x495/0xb50
[10327.995481][  T197]  ? __sched_text_start+0x8/0x8
[10328.000230][  T197]  ? __debug_check_no_obj_freed+0x250/0x250
[10328.006036][  T197]  schedule+0x5d/0x140
[10328.009994][  T197]  schedule_timeout+0x23f/0x380
[10328.014752][  T197]  ? mem_cgroup_uncharge+0x110/0x110
[10328.020103][  T197]  ? usleep_range+0x100/0x100
[10328.024691][  T197]  ? del_timer_sync+0xa0/0xa0
[10328.029257][  T197]  ? shrink_active_list+0x825/0x9d0
[10328.034362][  T197]  ? msleep+0x23/0x70
[10328.038228][  T197]  msleep+0x58/0x70
[10328.042090][  T197]  shrink_inactive_list+0x5cf/0x730
[10328.047197][  T197]  ? move_pages_to_lru+0xc70/0xc70
[10328.052205][  T197]  ? cpumask_next+0x35/0x40
[10328.056611][  T197]  ? lruvec_lru_size+0x12d/0x3a0
[10328.061445][  T197]  ? __kasan_check_read+0x11/0x20
[10328.066530][  T197]  ? inactive_list_is_low+0x2b9/0x410
[10328.071796][  T197]  shrink_node_memcg+0x4ff/0x1560
[10328.076740][  T197]  ? shrink_active_list+0x9d0/0x9d0
[10328.081834][  T197]  ? f_getown+0x70/0x70
[10328.085900][  T197]  ? mem_cgroup_iter+0x135/0x840
[10328.090874][  T197]  ? mem_cgroup_iter+0x18e/0x840
[10328.095726][  T197]  ? __kasan_check_read+0x11/0x20
[10328.100641][  T197]  ? mem_cgroup_protected+0x215/0x260
[10328.105929][  T197]  shrink_node+0x1d3/0xa30
[10328.110233][  T197]  ? shrink_node_memcg+0x1560/0x1560
[10328.115671][  T197]  ? __kasan_check_read+0x11/0x20
[10328.120586][  T197]  do_try_to_free_pages+0x22f/0x820
[10328.125693][  T197]  ? shrink_node+0xa30/0xa30
[10328.130173][  T197]  ? __kasan_check_read+0x11/0x20
[10328.135113][  T197]  ? check_chain_key+0x1df/0x2e0
[10328.139942][  T197]  try_to_free_pages+0x242/0x4d0
[10328.144938][  T197]  ? do_try_to_free_pages+0x820/0x820
[10328.150209][  T197]  __alloc_pages_nodemask+0x9ce/0x1bc0
[10328.155589][  T197]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
[10328.160853][  T197]  ? __kasan_check_read+0x11/0x20
[10328.166007][  T197]  ? check_chain_key+0x1df/0x2e0
[10328.170839][  T197]  ? do_anonymous_page+0x33c/0xde0
[10328.175869][  T197]  alloc_pages_vma+0x89/0x2c0
[10328.180439][  T197]  do_anonymous_page+0x3d8/0xde0
[10328.185288][  T197]  ? finish_fault+0x120/0x120
[10328.189857][  T197]  ? alloc_pages_vma+0x9a/0x2c0
[10328.194746][  T197]  handle_pte_fault+0x457/0x12c0
[10328.199577][  T197]  __handle_mm_fault+0x79a/0xa50
[10328.204431][  T197]  ? vmf_insert_mixed_mkwrite+0x20/0x20
[10328.209876][  T197]  ? __kasan_check_read+0x11/0x20
[10328.214816][  T197]  ? __count_memcg_events+0x56/0x1d0
[10328.220201][  T197]  handle_mm_fault+0x17f/0x370
[10328.224881][  T197]  __do_page_fault+0x25b/0x5d0
[10328.229538][  T197]  do_page_fault+0x50/0x2d3
[10328.233957][  T197]  page_fault+0x2c/0x40
[10328.238004][  T197] RIP: 0033:0x410c50
[10328.241951][  T197] Code: Bad RIP value.
[10328.245927][  T197] RSP: 002b:00007f27f0afcec0 EFLAGS: 00010206
[10328.251892][  T197] RAX: 0000000000001000 RBX: 00000000c0000000 RCX: 00007f2d34bfd497
[10328.259792][  T197] RDX: 00000000224ed000 RSI: 00000000c0000000 RDI: 0000000000000000
[10328.267845][  T197] RBP: 00007f266fafc000 R08: 00000000ffffffff R09: 0000000000000000
[10328.275752][  T197] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000001
[10328.283635][  T197] R13: 00007fff5d124f9f R14: 0000000000000000 R15: 00007f27f0afcfc0
[10328.291696][  T197] INFO: task oom02:29554 can't die for more than 123 seconds.
[10328.299088][  T197] oom02           D22576 29554  29536 0x00004006
[10328.305348][  T197] Call Trace:
[10328.308519][  T197]  __schedule+0x495/0xb50
[10328.312737][  T197]  ? __sched_text_start+0x8/0x8
[10328.317706][  T197]  ? __debug_check_no_obj_freed+0x250/0x250
[10328.323497][  T197]  schedule+0x5d/0x140
[10328.327475][  T197]  schedule_timeout+0x23f/0x380
[10328.332217][  T197]  ? mem_cgroup_uncharge+0x110/0x110
[10328.337421][  T197]  ? usleep_range+0x100/0x100
[10328.342184][  T197]  ? del_timer_sync+0xa0/0xa0
[10328.346778][  T197]  ? shrink_active_list+0x825/0x9d0
[10328.351874][  T197]  ? msleep+0x23/0x70
[10328.355766][  T197]  msleep+0x58/0x70
[10328.359460][  T197]  shrink_inactive_list+0x5cf/0x730
[10328.364576][  T197]  ? move_pages_to_lru+0xc70/0xc70
[10328.369748][  T197]  ? cpumask_next+0x35/0x40
[10328.374158][  T197]  ? lruvec_lru_size+0x12d/0x3a0
[10328.378986][  T197]  ? __kasan_check_read+0x11/0x20
[10328.383927][  T197]  ? inactive_list_is_low+0x2b9/0x410
[10328.389195][  T197]  shrink_node_memcg+0x4ff/0x1560
[10328.394309][  T197]  ? shrink_active_list+0x9d0/0x9d0
[10328.399400][  T197]  ? f_getown+0x70/0x70
[10328.403445][  T197]  ? mem_cgroup_iter+0x135/0x840
[10328.408298][  T197]  ? mem_cgroup_iter+0x18e/0x840
[10328.413127][  T197]  ? __kasan_check_read+0x11/0x20
[10328.418306][  T197]  ? mem_cgroup_protected+0x215/0x260
[10328.423572][  T197]  shrink_node+0x1d3/0xa30
[10328.427899][  T197]  ? shrink_node_memcg+0x1560/0x1560
[10328.433080][  T197]  ? __kasan_check_read+0x11/0x20
[10328.438019][  T197]  do_try_to_free_pages+0x22f/0x820
[10328.443233][  T197]  ? shrink_node+0xa30/0xa30
[10328.447739][  T197]  ? __kasan_check_read+0x11/0x20
[10328.452655][  T197]  ? check_chain_key+0x1df/0x2e0
[10328.457507][  T197]  try_to_free_pages+0x242/0x4d0
[10328.462334][  T197]  ? do_try_to_free_pages+0x820/0x820
[10328.467848][  T197]  __alloc_pages_nodemask+0x9ce/0x1bc0
[10328.473205][  T197]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
[10328.478494][  T197]  ? __kasan_check_read+0x11/0x20
[10328.483410][  T197]  ? check_chain_key+0x1df/0x2e0
[10328.488266][  T197]  ? do_anonymous_page+0x33c/0xde0
[10328.493409][  T197]  alloc_pages_vma+0x89/0x2c0
[10328.498004][  T197]  do_anonymous_page+0x3d8/0xde0
[10328.502834][  T197]  ? finish_fault+0x120/0x120
[10328.507424][  T197]  ? alloc_pages_vma+0x9a/0x2c0
[10328.512167][  T197]  handle_pte_fault+0x457/0x12c0
[10328.517261][  T197]  __handle_mm_fault+0x79a/0xa50
[10328.522093][  T197]  ? vmf_insert_mixed_mkwrite+0x20/0x20
[10328.527556][  T197]  ? __kasan_check_read+0x11/0x20
[10328.532473][  T197]  ? __count_memcg_events+0x56/0x1d0
[10328.537678][  T197]  handle_mm_fault+0x17f/0x370
[10328.542484][  T197]  __do_page_fault+0x25b/0x5d0
[10328.547164][  T197]  do_page_fault+0x50/0x2d3
[10328.551557][  T197]  page_fault+0x2c/0x40
[10328.555624][  T197] RIP: 0033:0x410c50
[10328.559405][  T197] Code: Bad RIP value.
[10328.563358][  T197] RSP: 002b:00007f21ecaf4ec0 EFLAGS: 00010206
[10328.569438][  T197] RAX: 0000000000001000 RBX: 00000000c0000000 RCX: 00007f2d34bfd497
[10328.577349][  T197] RDX: 000000001aeb4000 RSI: 00000000c0000000 RDI: 0000000000000000
[10328.585253][  T197] RBP: 00007f206baf4000 R08: 00000000ffffffff R09: 0000000000000000
[10328.593292][  T197] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000001
[10328.601201][  T197] R13: 00007fff5d124f9f R14: 0000000000000000 R15: 00007f21ecaf4fc0
[10328.609120][  T197] 
[10328.609120][  T197] Showing all locks held in the system:
[10328.617052][  T197] 1 lock held by khungtaskd/197:
[10328.621878][  T197]  #0: 000000002d9f974d (rcu_read_lock){....}, at: debug_show_all_locks+0x33/0x165
[10328.631211][  T197] 2 locks held by oom02/29546:
[10328.635888][  T197]  #0: 0000000031e5d1a8 (&mm->mmap_sem#2){....}, at: __do_page_fault+0x166/0x5d0
[10328.645093][  T197]  #1: 00000000e060a0f6 (fs_reclaim){....}, at: fs_reclaim_acquire.part.15+0x5/0x30
[10328.654418][  T197] 2 locks held by oom02/29554:
[10328.659070][  T197]  #0: 0000000031e5d1a8 (&mm->mmap_sem#2){....}, at: __do_page_fault+0x166/0x5d0
[10328.668286][  T197]  #1: 00000000e060a0f6 (fs_reclaim){....}, at: fs_reclaim_acquire.part.15+0x5/0x30
[10328.677608][  T197] 
[10328.679812][  T197] =============================================
[10328.679812][  T197] 
[10450.864064][  T197] INFO: task oom02:29546 can't die for more than 245 seconds.
[10450.871642][  T197] oom02           D22576 29546  29536 0x00004006
[10450.877912][  T197] Call Trace:
[10450.881087][  T197]  __schedule+0x495/0xb50
[10450.885330][  T197]  ? __sched_text_start+0x8/0x8
[10450.890072][  T197]  ? __debug_check_no_obj_freed+0x250/0x250
[10450.896031][  T197]  schedule+0x5d/0x140
[10450.899989][  T197]  schedule_timeout+0x23f/0x380
[10450.904753][  T197]  ? mem_cgroup_uncharge+0x110/0x110
[10450.909936][  T197]  ? usleep_range+0x100/0x100
[10450.914526][  T197]  ? del_timer_sync+0xa0/0xa0
[10450.919314][  T197]  ? shrink_active_list+0x825/0x9d0
[10450.924428][  T197]  ? msleep+0x23/0x70
[10450.928296][  T197]  msleep+0x58/0x70
[10450.931991][  T197]  shrink_inactive_list+0x5cf/0x730
[10450.937103][  T197]  ? move_pages_to_lru+0xc70/0xc70
[10450.942254][  T197]  ? cpumask_next+0x35/0x40
[10450.946678][  T197]  ? lruvec_lru_size+0x12d/0x3a0
[10450.951512][  T197]  ? __kasan_check_read+0x11/0x20
[10450.956444][  T197]  ? inactive_list_is_low+0x2b9/0x410
[10450.961711][  T197]  shrink_node_memcg+0x4ff/0x1560
[10450.966650][  T197]  ? shrink_active_list+0x9d0/0x9d0
[10450.971929][  T197]  ? f_getown+0x70/0x70
[10450.975988][  T197]  ? mem_cgroup_iter+0x135/0x840
[10450.980821][  T197]  ? mem_cgroup_iter+0x18e/0x840
[10450.985672][  T197]  ? __kasan_check_read+0x11/0x20
[10450.990591][  T197]  ? mem_cgroup_protected+0x215/0x260
[10450.996050][  T197]  shrink_node+0x1d3/0xa30
[10451.000361][  T197]  ? shrink_node_memcg+0x1560/0x1560
[10451.005561][  T197]  ? __kasan_check_read+0x11/0x20
[10451.010477][  T197]  do_try_to_free_pages+0x22f/0x820
[10451.015589][  T197]  ? shrink_node+0xa30/0xa30
[10451.020293][  T197]  ? __kasan_check_read+0x11/0x20
[10451.025232][  T197]  ? check_chain_key+0x1df/0x2e0
[10451.030059][  T197]  try_to_free_pages+0x242/0x4d0
[10451.034910][  T197]  ? do_try_to_free_pages+0x820/0x820
[10451.040180][  T197]  __alloc_pages_nodemask+0x9ce/0x1bc0
[10451.045732][  T197]  ? gfp_pfmemalloc_allowed+0xc0/0xc0
[10451.050999][  T197]  ? __kasan_check_read+0x11/0x20
[10451.055936][  T197]  ? check_chain_key+0x1df/0x2e0
[10451.060767][  T197]  ? do_anonymous_page+0x33c/0xde0
[10451.065796][  T197]  alloc_pages_vma+0x89/0x2c0
[10451.070521][  T197]  do_anonymous_page+0x3d8/0xde0
[10451.075372][  T197]  ? finish_fault+0x120/0x120
[10451.079941][  T197]  ? alloc_pages_vma+0x9a/0x2c0
[10451.084703][  T197]  handle_pte_fault+0x457/0x12c0
[10451.089536][  T197]  __handle_mm_fault+0x79a/0xa50
[10451.094557][  T197]  ? vmf_insert_mixed_mkwrite+0x20/0x20
[10451.100001][  T197]  ? __kasan_check_read+0x11/0x20
[10451.104938][  T197]  ? __count_memcg_events+0x56/0x1d0
[10451.110118][  T197]  handle_mm_fault+0x17f/0x370
[10451.114789][  T197]  __do_page_fault+0x25b/0x5d0
[10451.119661][  T197]  do_page_fault+0x50/0x2d3
[10451.124077][  T197]  page_fault+0x2c/0x40
[10451.128118][  T197] RIP: 0033:0x410c50
[10451.131901][  T197] Code: Bad RIP value.
[10451.135871][  T197] RSP: 002b:00007f27f0afcec0 EFLAGS: 00010206
[10451.141979][  T197] RAX: 0000000000001000 RBX: 00000000c0000000 RCX: 00007f2d34bfd497
[10451.149881][  T197] RDX: 00000000224ed000 RSI: 00000000c0000000 RDI: 0000000000000000
[10451.157786][  T197] RBP: 00007f266fafc000 R08: 00000000ffffffff R09: 0000000000000000
[10451.165694][  T197] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000001
[10451.173741][  T197] R13: 00007fff5d124f9f R14: 0000000000000000 R15: 00007f27f0afcfc0
[10451.181656][  T197] 
[10451.181656][  T197] Showing all locks held in the system:
[10451.189350][  T197] 1 lock held by khungtaskd/197:
[10451.194369][  T197]  #0: 000000002d9f974d (rcu_read_lock){....}, at: debug_show_all_locks+0x33/0x165
[10451.203670][  T197] 2 locks held by oom02/29546:
[10451.208344][  T197]  #0: 0000000031e5d1a8 (&mm->mmap_sem#2){....}, at: __do_page_fault+0x166/0x5d0
[10451.217583][  T197]  #1: 00000000e060a0f6 (fs_reclaim){....}, at: fs_reclaim_acquire.part.15+0x5/0x30
[10451.226908][  T197] 
[10451.229112][  T197] =============================================
[10451.229112][  T197] 
[10758.054022][T29393] kworker/dying (29393) used greatest stack depth: 16928 bytes left
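
For anyone comparing runs, the hung-task reports above can be summarized
with a small script. This is only a sketch and not part of the test setup:
the function name hung_tasks and the regular expression are my own
assumptions, based solely on the two message forms seen in these logs
("blocked for more than N seconds" and "can't die for more than N
seconds"). It tolerates mail-wrapped lines and "> " quote markers.

```python
import re

# Matches the two report headers seen in the logs in this thread, e.g.:
#   INFO: task oom02:29546 can't die for more than 122 seconds.
#   INFO: task oom01:5331 blocked for more than 122 seconds.
# The pattern is an assumption derived from those lines only.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"(?P<kind>blocked|can't die) for more than (?P<secs>\d+) seconds"
)

def hung_tasks(log_text):
    """Return {(comm, pid): longest reported stall in seconds}."""
    # Strip leading email quote markers ("> > > ") from every line.
    lines = [re.sub(r"^(?:>\s?)+", "", ln) for ln in log_text.splitlines()]
    # Collapse all whitespace so that a report wrapped across two
    # lines becomes a single matchable string.
    flat = " ".join(" ".join(lines).split())
    worst = {}
    for m in HUNG_RE.finditer(flat):
        key = (m.group("comm"), int(m.group("pid")))
        worst[key] = max(worst.get(key, 0), int(m.group("secs")))
    return worst
```

Feeding it the saved dmesg output gives one entry per stuck task with the
longest stall reported so far, which makes it easier to see whether a
given task ever recovers between reports.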
