Message-Id: <201706102057.GGG13003.OtFMJSQOVLFOHF@I-love.SAKURA.ne.jp>
Date:   Sat, 10 Jun 2017 20:57:46 +0900
From:   Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To:     mhocko@...nel.org, hannes@...xchg.org
Cc:     akpm@...ux-foundation.org, guro@...com, vdavydov.dev@...il.com,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 2/2] mm, oom: do not trigger out_of_memory from the #PF

Michal Hocko wrote:
> And just to clarify a bit. The OOM killer should be invoked whenever
> appropriate from the allocation context. If we decide to fail the
> allocation in the PF path then we can safely roll back and retry the
> whole PF. This has an advantage that any locks held while doing the
> allocation will be released and that alone can help to make a further
> progress. Moreover we can relax retry-for-ever _inside_ the allocator
> semantic for the PF path and fail allocations when we cannot make
> further progress even after we hit the OOM condition or we do stall for
> too long.

What!? Are you saying that we should leave the allocator loop rather than
invoke the OOM killer when the allocation comes from a page fault event
without __GFP_FS set? With the emulation patch further below applied
(i.e. ignoring __GFP_FS, purely for emulation purposes), I can trivially
observe a systemwide lockup where the OOM killer is never called.
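
If I understand the proposal correctly, the page fault side would end up
handling VM_FAULT_OOM roughly like this (an illustrative sketch only, not
an actual patch; "fault", "flags" and "current" are the names used in
__do_page_fault(), while the retry handling itself is hypothetical):
----------
	/*
	 * In __do_page_fault(), after handle_mm_fault() has returned and
	 * mmap_sem has been dropped (illustrative sketch, not a real patch):
	 */
	if (unlikely(fault & VM_FAULT_OOM) && (flags & FAULT_FLAG_USER)) {
		/*
		 * The allocation that failed has already released any
		 * locks it took, so it is safe to simply return to user
		 * space; re-executing the faulting instruction retries
		 * the whole page fault from scratch.
		 */
		if (!fatal_signal_pending(current))
			return;
		/* otherwise fall through to the normal fatal-error path */
	}
----------

Here is the emulation patch I used: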

----------
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8ad91a0..d01b671 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1495,6 +1495,7 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
 {
 	unsigned long address = read_cr2(); /* Get the faulting address */
 	enum ctx_state prev_state;
+	current->in_pagefault = 1;
 
 	/*
 	 * We must have this function tagged with __kprobes, notrace and call
@@ -1507,6 +1508,7 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
 	prev_state = exception_enter();
 	__do_page_fault(regs, error_code, address);
 	exception_exit(prev_state);
+	current->in_pagefault = 0;
 }
 NOKPROBE_SYMBOL(do_page_fault);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7704110..5a40e3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -627,6 +627,7 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned			no_cgroup_migration:1;
 #endif
+	unsigned			in_pagefault:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e2c925..d786a84 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1056,25 +1056,22 @@ bool out_of_memory(struct oom_control *oc)
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
+	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
+	if (fatal_signal_pending(current))
 		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
+
+	if (__ratelimit(&pfoom_rs))
+		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b896897..c79dfd5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3255,6 +3255,9 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 	*did_some_progress = 0;
 
+	if (current->in_pagefault)
+		return NULL;
+
 	/*
 	 * Acquire the oom lock.  If that fails, somebody else is
 	 * making progress for us.
----------

(From http://I-love.SAKURA.ne.jp/tmp/serial-20170610.txt :)
----------
[   72.043747] tuned: page allocation failure: order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[   72.046391] tuned cpuset=/ mems_allowed=0
[   72.047685] CPU: 2 PID: 2236 Comm: tuned Not tainted 4.12.0-rc4-next-20170609+ #614
[   72.049568] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   72.052408] Call Trace:
[   72.053394]  dump_stack+0x67/0x9e
[   72.054522]  warn_alloc+0x10f/0x1b0
[   72.055957]  ? find_next_bit+0xb/0x10
[   72.057233]  __alloc_pages_nodemask+0xc7c/0xeb0
[   72.058527]  alloc_pages_current+0x65/0xb0
[   72.059649]  __page_cache_alloc+0x10b/0x140
[   72.060775]  filemap_fault+0x3d6/0x690
[   72.061834]  ? filemap_fault+0x2a6/0x690
[   72.062978]  xfs_filemap_fault+0x34/0x50
[   72.064065]  __do_fault+0x1b/0x120
[   72.065108]  __handle_mm_fault+0xa86/0x1260
[   72.066309]  ? native_sched_clock+0x5e/0xa0
[   72.067478]  handle_mm_fault+0x182/0x360
[   72.068784]  ? handle_mm_fault+0x44/0x360
[   72.070345]  __do_page_fault+0x1a2/0x580
[   72.071573]  do_page_fault+0x34/0x90
[   72.072726]  page_fault+0x22/0x30
[   72.073797] RIP: 0033:0x7f9fadbfcbd3
[   72.074931] RSP: 002b:00007f9f9e7cfda0 EFLAGS: 00010293
[   72.076467] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f9fadbfcbd3
[   72.078318] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   72.080156] RBP: 00000000009429a0 R08: 00007f9f9e7cfdd0 R09: 00007f9f9e7cfb80
[   72.082017] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f9fa6356750
[   72.083898] R13: 0000000000000001 R14: 00007f9f9400d640 R15: 00007f9faec5add0
[   72.087057] Mem-Info:
[   72.088256] active_anon:356433 inactive_anon:2094 isolated_anon:0
[   72.088256]  active_file:92 inactive_file:82 isolated_file:15
[   72.088256]  unevictable:0 dirty:50 writeback:38 unstable:0
[   72.088256]  slab_reclaimable:0 slab_unreclaimable:88
[   72.088256]  mapped:423 shmem:2160 pagetables:8398 bounce:0
[   72.088256]  free:12922 free_pcp:175 free_cma:0
[   72.097601] Node 0 active_anon:1425732kB inactive_anon:8376kB active_file:368kB inactive_file:204kB unevictable:0kB isolated(anon):0kB isolated(file):60kB mapped:1692kB dirty:200kB writeback:152kB shmem:8640kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1163264kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[   72.105838] Node 0 DMA free:7120kB min:412kB low:512kB high:612kB active_anon:8712kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:32kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   72.112517] lowmem_reserve[]: 0 1677 1677 1677
[   72.114210] Node 0 DMA32 free:44568kB min:44640kB low:55800kB high:66960kB active_anon:1416780kB inactive_anon:8376kB active_file:448kB inactive_file:696kB unevictable:0kB writepending:352kB present:2080640kB managed:1717500kB mlocked:0kB kernel_stack:20048kB pagetables:33560kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB
[   72.125094] lowmem_reserve[]: 0 0 0 0
[   72.126920] Node 0 DMA: 0*4kB 2*8kB (UM) 2*16kB (UM) 1*32kB (M) 2*64kB (UM) 2*128kB (UM) 2*256kB (UM) 0*512kB 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 7120kB
[   72.130681] Node 0 DMA32: 571*4kB (UM) 293*8kB (UMH) 161*16kB (UMEH) 164*32kB (UMEH) 52*64kB (UME) 22*128kB (MEH) 11*256kB (MEH) 7*512kB (EH) 4*1024kB (EH) 2*2048kB (E) 3*4096kB (M) = 45476kB
[   72.136101] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   72.138790] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   72.141271] 2346 total pagecache pages
[   72.142820] 0 pages in swap cache
[   72.144280] Swap cache stats: add 0, delete 0, find 0/0
[   72.146419] Free swap  = 0kB
[   72.148095] Total swap = 0kB
[   72.149577] 524157 pages RAM
[   72.150991] 0 pages HighMem/MovableOnly
[   72.152681] 90806 pages reserved
[   72.154258] 0 pages hwpoisoned
[   72.156018] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[   72.213133] a.out: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
[   72.216147] a.out cpuset=/ mems_allowed=0
[   72.217833] CPU: 0 PID: 2690 Comm: a.out Not tainted 4.12.0-rc4-next-20170609+ #614
[   72.220325] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   72.223090] Call Trace:
[   72.224552]  dump_stack+0x67/0x9e
[   72.226267]  warn_alloc+0x10f/0x1b0
[   72.227741]  ? find_next_bit+0xb/0x10
[   72.229278]  __alloc_pages_nodemask+0xc7c/0xeb0
[   72.231055]  alloc_pages_vma+0x76/0x1a0
[   72.232602]  __handle_mm_fault+0xe61/0x1260
[   72.236363]  ? native_sched_clock+0x5e/0xa0
[   72.238953]  handle_mm_fault+0x182/0x360
[   72.240626]  ? handle_mm_fault+0x44/0x360
[   72.242680]  __do_page_fault+0x1a2/0x580
[   72.244480]  do_page_fault+0x34/0x90
[   72.246333]  page_fault+0x22/0x30
[   72.247924] RIP: 0033:0x40084f
[   72.250262] RSP: 002b:00007ffc26c2d110 EFLAGS: 00010246
[   72.252464] RAX: 00000000521b2000 RBX: 0000000080000000 RCX: 00007f7412938650
[   72.255103] RDX: 0000000000000000 RSI: 00007ffc26c2cf30 RDI: 00007ffc26c2cf30
[   72.257500] RBP: 00007f7312a6f010 R08: 00007ffc26c2d040 R09: 00007ffc26c2ce80
[   72.259824] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000521b2000
[   72.262136] R13: 00007f7312a6f010 R14: 0000000000000000 R15: 0000000000000000
[   72.279511] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[   72.406573] a.out: page allocation failure: order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[   72.408233] a.out: page allocation failure: order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[   72.408240] a.out cpuset=/ mems_allowed=0
[   72.408247] CPU: 3 PID: 2782 Comm: a.out Not tainted 4.12.0-rc4-next-20170609+ #614
[   72.408248] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   72.408249] Call Trace:
[   72.408254]  dump_stack+0x67/0x9e
[   72.408259]  warn_alloc+0x10f/0x1b0
[   72.408266]  ? find_next_bit+0xb/0x10
[   72.408270]  __alloc_pages_nodemask+0xc7c/0xeb0
[   72.408284]  alloc_pages_current+0x65/0xb0
[   72.408287]  __page_cache_alloc+0x10b/0x140
[   72.408291]  filemap_fault+0x3d6/0x690
[   72.408293]  ? filemap_fault+0x2a6/0x690
[   72.408300]  xfs_filemap_fault+0x34/0x50
[   72.408302]  __do_fault+0x1b/0x120
[   72.408305]  __handle_mm_fault+0xa86/0x1260
[   72.408308]  ? native_sched_clock+0x5e/0xa0
[   72.408317]  handle_mm_fault+0x182/0x360
[   72.408318]  ? handle_mm_fault+0x44/0x360
[   72.408322]  __do_page_fault+0x1a2/0x580
[   72.408328]  do_page_fault+0x34/0x90
[   72.408332]  page_fault+0x22/0x30
[   72.408334] RIP: 0033:0x7f7412968d60
[   72.408336] RSP: 002b:00007ffc26c2d108 EFLAGS: 00010246
[   72.408337] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 00007f7412968d60
[   72.408338] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000003
[   72.408339] RBP: 0000000000000003 R08: 00007f74128c2938 R09: 000000000000000e
[   72.408340] R10: 00007ffc26c2ce90 R11: 0000000000000246 R12: 0000000000400935
[   72.408341] R13: 00007ffc26c2d210 R14: 0000000000000000 R15: 0000000000000000
[   72.408353] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
(...snipped...)
[  118.340309] a.out cpuset=/ mems_allowed=0
[  118.341908] CPU: 2 PID: 2794 Comm: a.out Not tainted 4.12.0-rc4-next-20170609+ #614
[  118.344090] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  118.347215] Call Trace:
[  118.348535]  dump_stack+0x67/0x9e
[  118.354648]  warn_alloc+0x10f/0x1b0
[  118.356160]  ? wake_all_kswapds+0x56/0x8e
[  118.357515]  __alloc_pages_nodemask+0xacf/0xeb0
[  118.358960]  alloc_pages_current+0x65/0xb0
[  118.360310]  xfs_buf_allocate_memory+0x15b/0x292
[  118.361745]  xfs_buf_get_map+0xf4/0x150
[  118.363038]  xfs_buf_read_map+0x29/0xd0
[  118.364342]  xfs_trans_read_buf_map+0x9a/0x1a0
[  118.366352]  xfs_imap_to_bp+0x69/0xe0
[  118.367754]  xfs_iread+0x79/0x400
[  118.368983]  xfs_iget+0x42f/0x8a0
[  118.370606]  ? xfs_iget+0x15b/0x8a0
[  118.372530]  ? kfree+0x12a/0x180
[  118.374061]  xfs_lookup+0x8f/0xb0
[  118.375377]  xfs_vn_lookup+0x6b/0xb0
[  118.376609]  lookup_open+0x5a8/0x800
[  118.377795]  ? _raw_spin_unlock_irq+0x32/0x50
[  118.379173]  path_openat+0x437/0xa70
[  118.380367]  do_filp_open+0x8c/0x100
[  118.381608]  ? _raw_spin_unlock+0x2c/0x50
[  118.383121]  ? __alloc_fd+0xf2/0x210
[  118.384603]  do_sys_open+0x13a/0x200
[  118.386121]  SyS_open+0x19/0x20
[  118.387356]  do_syscall_64+0x61/0x1a0
[  118.388578]  entry_SYSCALL64_slow_path+0x25/0x25
[  118.389987] RIP: 0033:0x7f7412962a40
[  118.391148] RSP: 002b:00007ffc26c2d108 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  118.393118] RAX: ffffffffffffffda RBX: 0000000000000067 RCX: 00007f7412962a40
[  118.395035] RDX: 0000000000000180 RSI: 0000000000000441 RDI: 00000000006010c0
[  118.396978] RBP: 0000000000000003 R08: 00007f74128c2938 R09: 000000000000000e
[  118.398952] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400935
[  118.401126] R13: 00007ffc26c2d210 R14: 0000000000000000 R15: 0000000000000000
[  118.425345] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.425371] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.425441] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.428177] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.429258] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.430359] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[  118.431457] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
----------
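
For completeness, the kind of swapless memory hog that drives the system
into this state can be as simple as the following (a hypothetical sketch;
not necessarily the exact a.out used for the log above):
----------
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	const size_t chunk = 1UL << 30;	/* 1 GB at a time */
	char *p;

	/*
	 * Keep faulting in anonymous memory on a swapless system until
	 * page faults can no longer be satisfied.
	 */
	while ((p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0)) != MAP_FAILED)
		memset(p, 1, chunk);

	pause();	/* hold on to everything we touched */
	return 0;
}
----------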
