Message-ID: <20170804074212.GA26029@dhcp22.suse.cz>
Date: Fri, 4 Aug 2017 09:42:12 +0200
From: Michal Hocko <mhocko@...nel.org>
To: Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Wenwei Tao <wenwei.tww@...baba-inc.com>,
Oleg Nesterov <oleg@...hat.com>,
David Rientjes <rientjes@...gle.com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mm, oom: fix potential data corruption when oom_reaper
races with writer
On Fri 04-08-17 15:46:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > So there is a race window when some threads
> > won't have fatal_signal_pending while the oom_reaper could start
> > unmapping the address space. generic_perform_write could then write
> > zero page to the page cache and corrupt data.
>
> Oh, simple generic_perform_write() ?
>
> >
> > The race window is rather small and close to impossible to happen but it
> > would be better to have it covered.
>
> OK, I confirmed that this problem is easily reproducible using below reproducer.
Yeah, I can imagine this could be triggered artificially. I am somewhat
more skeptical that real-life OOM scenarios would trigger this, though.
Anyway, thanks for your test case!
> Applying your patch seems to avoid this problem, but as far as I tested
> your patch seems to trivially trigger something lock related problem.
> Is your patch really safe?
> ----------
> [ 58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
> [ 58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
> [ 58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
> [ 58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
> [ 58.557480] ------------[ cut here ]------------
> [ 58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
> [ 58.569076] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon vmw_vmci shpchp sg i2c_piix4 parport_pc parport ip_tables xfs libcrc32c sr_mod sd_mod cdrom ata_generic pata_acpi serio_raw mptspi scsi_transport_spi mptscsih ahci e1000 libahci ata_piix mptbase libata
> [ 58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
> [ 58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [ 58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
> [ 58.613944] RIP: 0010:lock_release+0x172/0x1e0
> [ 58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
> [ 58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
> [ 58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
> [ 58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
> [ 58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
> [ 58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
> [ 58.644200] FS: 00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
> [ 58.648989] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
> [ 58.657280] Call Trace:
> [ 58.659989] up_read+0x1a/0x40
> [ 58.662825] __do_page_fault+0x28e/0x4c0
> [ 58.665946] do_page_fault+0x30/0x80
> [ 58.668911] page_fault+0x28/0x30
OK, I know what is going on here. The page fault must have returned with
VM_FAULT_RETRY, which means mmap_sem has already been dropped. My patch
overwrites this error code, so the page fault path doesn't know that the
lock is no longer held and releases it unconditionally. This is a
preexisting problem introduced by 3f70dc38cec2 ("mm: make sure that
kthreads will not refault oom reaped memory"). I should have considered
this option.
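To make the imbalance concrete, here is a small userspace model of the
lock accounting. It is only a sketch: the model_* names and the
MODEL_FAULT_* values are made up for illustration and are not the
kernel's definitions, and a plain counter stands in for mmap_sem. The
retake_lock argument models retaking the lock before the return code is
overridden.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; the kernel's real VM_FAULT_* constants differ. */
enum {
	MODEL_FAULT_RETRY  = 1 << 0,
	MODEL_FAULT_SIGBUS = 1 << 1,
};

/* Read-lock depth of a hypothetical stand-in for mmap_sem. */
static int model_lock_depth;

static void model_down_read(void) { model_lock_depth++; }
static void model_up_read(void)   { model_lock_depth--; }

/* Models a fault that has to wait: it drops the lock and asks for a retry. */
static int model_inner_fault(void)
{
	model_up_read();
	return MODEL_FAULT_RETRY;
}

/* Models the MMF_UNSTABLE override at the end of the fault handler. */
static int model_handle_fault(bool retake_lock)
{
	int ret = model_inner_fault();

	if (retake_lock && (ret & MODEL_FAULT_RETRY))
		model_down_read();
	return MODEL_FAULT_SIGBUS;	/* the RETRY bit is lost here */
}

/*
 * Models an arch page fault handler: only a RETRY result tells it that
 * the lock is already gone; for any other result it unlocks itself.
 * Returns the final lock depth: 0 is balanced, -1 is a double unlock.
 */
static int model_page_fault(bool retake_lock)
{
	int ret;

	model_lock_depth = 0;
	model_down_read();
	assert(model_lock_depth == 1);

	ret = model_handle_fault(retake_lock);
	if (!(ret & MODEL_FAULT_RETRY))
		model_up_read();
	return model_lock_depth;
}
```

In this model, model_page_fault(false) ends with a depth of -1, which
corresponds to the DEBUG_LOCKS_WARN_ON(depth <= 0) splat above, while
model_page_fault(true) ends balanced at 0.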
I believe the easiest way around this is the following patch
---
From dd31779f763bbe2aa86100f804656ac680c49d35 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@...e.com>
Date: Fri, 4 Aug 2017 09:36:34 +0200
Subject: [PATCH] mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced
SIGBUS
Tetsuo Handa has noticed that the MMF_UNSTABLE SIGBUS path in
handle_mm_fault causes a lockdep splat
[ 58.539455] Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
[ 58.543943] Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
[ 58.544245] a.out (1169) used greatest stack depth: 11664 bytes left
[ 58.557471] DEBUG_LOCKS_WARN_ON(depth <= 0)
[ 58.557480] ------------[ cut here ]------------
[ 58.564407] WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
[ 58.599401] CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
[ 58.604126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 58.609790] task: ffff9d90df888040 task.stack: ffffa07084854000
[ 58.613944] RIP: 0010:lock_release+0x172/0x1e0
[ 58.617622] RSP: 0000:ffffa07084857e58 EFLAGS: 00010082
[ 58.621533] RAX: 000000000000001f RBX: ffff9d90df888040 RCX: 0000000000000000
[ 58.626074] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa30d4ba4
[ 58.630572] RBP: ffffa07084857e98 R08: 0000000000000000 R09: 0000000000000001
[ 58.635016] R10: 0000000000000000 R11: 000000000000001f R12: ffffa07084857f58
[ 58.639694] R13: ffff9d90f60d6cd0 R14: 0000000000000000 R15: ffffffffa305cb6e
[ 58.644200] FS: 00007fb932730740(0000) GS:ffff9d90f9f80000(0000) knlGS:0000000000000000
[ 58.648989] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 58.652903] CR2: 000000000040092f CR3: 0000000135229000 CR4: 00000000000606e0
[ 58.657280] Call Trace:
[ 58.659989] up_read+0x1a/0x40
[ 58.662825] __do_page_fault+0x28e/0x4c0
[ 58.665946] do_page_fault+0x30/0x80
[ 58.668911] page_fault+0x28/0x30
The reason is that the page fault path might have dropped the mmap_sem
and returned with VM_FAULT_RETRY. The MMF_UNSTABLE check, however,
rewrites the error code to VM_FAULT_SIGBUS, and we always expect
mmap_sem to be taken in that path. Fix this by retaking mmap_sem when
VM_FAULT_RETRY is set in the MMF_UNSTABLE path. We cannot simply add
VM_FAULT_SIGBUS to the existing error code because all arch specific
page fault handlers and g-u-p would have to learn a new error code
combination.
Reported-by: Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Fixes: 3f70dc38cec2 ("mm: make sure that kthreads will not refault oom reaped memory")
Cc: stable # 4.9+
Signed-off-by: Michal Hocko <mhocko@...e.com>
---
mm/memory.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be91a89..4fe5b6254688 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3881,8 +3881,18 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * further.
 	 */
 	if (unlikely((current->flags & PF_KTHREAD) && !(ret & VM_FAULT_ERROR)
-				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
+				&& test_bit(MMF_UNSTABLE, &vma->vm_mm->flags))) {
+
+		/*
+		 * We are going to enforce SIGBUS but the PF path might have
+		 * dropped the mmap_sem already so take it again so that
+		 * we do not break expectations of all arch specific PF paths
+		 * and g-u-p
+		 */
+		if (ret & VM_FAULT_RETRY)
+			down_read(&vma->vm_mm->mmap_sem);
 		ret = VM_FAULT_SIGBUS;
+	}
 
 	return ret;
 }
--
2.13.2
--
Michal Hocko
SUSE Labs