Message-ID: <2116497188.3400.1489242314445.JavaMail.zimbra@efficios.com>
Date: Sat, 11 Mar 2017 14:25:14 +0000 (UTC)
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc: Andrey Konovalov <andreyknvl@...gle.com>,
Lai Jiangshan <jiangshanlai@...il.com>,
Josh Triplett <josh@...htriplett.org>,
rostedt <rostedt@...dmis.org>,
linux-kernel <linux-kernel@...r.kernel.org>,
syzkaller <syzkaller@...glegroups.com>,
Dmitry Vyukov <dvyukov@...gle.com>,
Kostya Serebryany <kcc@...gle.com>
Subject: Re: srcu: BUG in __synchronize_srcu
----- On Mar 10, 2017, at 5:26 PM, Paul E. McKenney paulmck@...ux.vnet.ibm.com wrote:
> On Fri, Mar 10, 2017 at 08:29:55PM +0100, Andrey Konovalov wrote:
>> On Fri, Mar 10, 2017 at 8:28 PM, Andrey Konovalov <andreyknvl@...gle.com> wrote:
>> > Hi,
>> >
>> > I've got the following error report while fuzzing the kernel with
>> > syzkaller on an arm64 board.
>>
>> This also happened on x86 a few times during fuzzing, however it
>> wasn't reproducible.
>>
>> >
>> > On linux-next commit 56b8bad5e066c23e8fa273ef5fba50bd3da2ace8 (Mar 8).
>> >
>> > A reproducer and .config are attached.
>> >
>> > The bug happens while executing the following syzkaller program in a loop.
>> > While it looks kvm-related, it might be that kvm just stresses the
>> > srcu subsystem.
>> >
>> > mmap(&(0x7f0000000000/0xfff000)=nil, (0xfff000), 0x3, 0x32,
>> > 0xffffffffffffffff, 0x0)
>> > r0 = openat$kvm(0xffffffffffffff9c,
>> > &(0x7f0000a05000)="2f6465762f6b766d00", 0x0, 0x0)
>> > ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
>> >
>> > ------------[ cut here ]------------
>> > kernel BUG at kernel/rcu/srcu.c:436!
>
> This is on v4.10, correct?

No, as stated above, this is on linux-next as of March 8, 2017.

I'm currently looking at the diff of kernel/rcu/srcu.c between v4.10 and
next-20170310. There are a few changes there.

One of them is the introduction of the new srcu algorithm with separate
lock vs unlock counters, which simplifies read lock/unlock by removing the
sequence counter.
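
For reference, the new read-side looks roughly like this (my paraphrase of
the rewritten kernel/rcu/srcu.c, not the exact code): a pair of per-CPU
counters per grace-period index, with the grace period comparing the two
sums instead of retrying on a sequence counter:

struct srcu_array {
        unsigned long lock_count[2];    /* incremented by srcu_read_lock() */
        unsigned long unlock_count[2];  /* incremented by srcu_read_unlock() */
};

int __srcu_read_lock(struct srcu_struct *sp)
{
        int idx = READ_ONCE(sp->completed) & 0x1;

        __this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
        smp_mb();       /* Keep the critical section after the increment. */
        return idx;
}

void __srcu_read_unlock(struct srcu_struct *sp, int idx)
{
        smp_mb();       /* Keep the critical section before the increment. */
        this_cpu_inc(sp->per_cpu_ref->unlock_count[idx]);
}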

Another change that directly impacts __synchronize_srcu() (near the
BUG()) is bc138e7a "srcu: Allow mid-boot use of synchronize_srcu()".
Odds are that this change introduced an unforeseen scenario that skips
the wait for completion later in boot than expected. It might be
worthwhile to review it closely once more.
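
The failure mode I am picturing is something like this (paraphrased; the
"mid_boot_path" condition below is a placeholder of mine, not the actual
test in the commit):

        bool done = false;

        /* ... enqueue our wakeme_after_rcu callback under ->queue_lock ... */

        if (mid_boot_path) {            /* placeholder for the boot-state test */
                srcu_advance_batches(sp, trycount);
                done = true;            /* assume the grace period completed */
        }

        if (!done)
                wait_for_completion(&rcu.completion);   /* skipped when done */

If that inline path can still be taken later in boot than intended, we
would return before our callback has actually been invoked.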

Thoughts?

Thanks,

Mathieu
>
>> > Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>> > Modules linked in: meson_drm drm_kms_helper drm dwmac_generic realtek
>> > dwmac_meson8b stmmac_platform stmmac meson_rng rng_core meson_gxbb_wdt
>> > ipv6
>> > CPU: 3 PID: 4250 Comm: a.out Not tainted 4.11.0-rc1-next-20170308-xc2-dirty #3
>> > Hardware name: Hardkernel ODROID-C2 (DT)
>> > task: ffff800063699700 task.stack: ffff800063cfc000
>> > PC is at [< none >] __synchronize_srcu+0x3d0/0x470
>> > kernel/rcu/srcu.c:412
>> > LR is at [< none >] __synchronize_srcu+0x130/0x470
>> > kernel/rcu/srcu.c:434
>> > pc : [<ffff20000821a3b8>] lr : [<ffff20000821a118>] pstate: 80000145
>> > sp : ffff800063cffb00
>> > x29: ffff800063cffb00 x28: ffff80005b1d6e00
>> > x27: 1fffe4000156b242 x26: ffff800063cffb70
>> > x25: 1fffe4000156b23b x24: ffff20000ab591d8
>> > x23: ffff200009dbf000 x22: ffff20000ab591a0
>> > x21: ffff20000ab59210 x20: ffff800063cffb70
>> > x19: ffff20000ab59190 x18: 0000000000000a03
>> > x17: 0000ffff944f3950 x16: ffff20000811f818
>> > x15: 0000000000000000 x14: 0000000000000007
>> > x13: 0000000000000002 x12: 0000000000000000
>> > x11: 0000000000000040 x10: 1fffe400014b568c
>> > x9 : ffff20000ab29000 x8 : 0000000000000007
>> > x7 : 0000000000000001 x6 : 0000000000000000
>> > x5 : 0000000000000040 x4 : 0000000000000003
>> > x3 : ffff20000ab59208 x2 : 1fffe4000156b243
>> > x1 : 0000000000000000 x0 : ffff80005e71fb70
>> >
>> > Process a.out (pid: 4250, stack limit = 0xffff800063cfc000)
>> > Stack: (0xffff800063cffb00 to 0xffff800063d00000)
>> > fb00: ffff800063cffbd0 ffff20000821a480 ffff20000ab59190 1ffff0000b63adc0
>> > fb20: dfff200000000000 ffff20000ab59190 ffff80004b9a8a00 1ffff000097351bc
>> > fb40: ffff80004b9a8de0 0000000000000000 ffff800060cad328 ffff80005b1d6e00
>> > fb60: ffff80004b9a8a00 1ffff000097351bc ffff80004f5e7b70 ffff200008217968
>> > fb80: ffff800000000001 dead4ead00010001 dfff2000ffffffff ffffffffffffffff
>> > fba0: ffff20000ab4c4b0 0000000000000000 0000000000000000 ffff200009b0b358
>> > fbc0: ffff800063cffbc0 ffff800063cffbc0 ffff800063cffbf0 ffff2000083ffd20
>> > fbe0: ffff80005b1d6e00 0000000000000140 ffff800063cffc50 ffff2000083aedfc
>> > fc00: ffff80004b9a8a00 ffff80004b9a8a00 ffff80004b9a8d78 0000000000000001
>> > fc20: 00000000000002a6 ffff80004f406780 ffff80004b9a8aa0 ffff800063699ac8
>> > fc40: 1ffff0000c6d3359 ffff800063699700 ffff800063cffd20 ffff20000810caec
>> > fc60: ffff80004b9a8a00 ffff80004b9a8b20 ffff80004b9a8d78 0000000000000001
>> > fc80: ffff80005ebae2d8 ffff800063699ac8 ffff800063cffca0 ffff20000840fc08
>> > fca0: ffff800063cffce0 ffff20000843327c ffff80005ebae2d8 ffff80004b9a8a00
>> > fcc0: ffff80005ebae0f0 ffff200009de8000 ffff800063cffce0 ffff2000084332c4
>> > fce0: ffff800063cffd20 ffff20000810cc64 ffff80004b9a8a00 ffff80004b9a8a48
>> > fd00: ffff80004b9a8d78 0000000000000001 ffff800063cffd20 ffff20000810cae0
>> > fd20: ffff800063cffd60 ffff20000811db88 ffff800063699700 ffff800063699700
>> > fd40: ffff80004b9a8a00 0000000000000001 00000000000002a6 ffff80004f406780
>> > fd60: ffff800063cffe40 ffff20000811f694 ffff80004f406780 0000000000000000
>> > fd80: ffff80004f40681c 0000000000000004 1ffff00009e80d03 1ffff00009e80d02
>> > fda0: ffff80004f406810 ffff800063699bd0 ffff800063699700 ffff800063699700
>> > fdc0: ffff800063cffe80 ffff200008490868 0000000000000000 ffff80005fd45000
>> > fde0: ffff80006369972c ffff800063699d48 1ffff0000c6d32e5 0000000000000004
>> > fe00: 0000000000000123 000000000000001d 1ffff0000c6d33a9 ffff800063699700
>> > fe20: ffff800063cffe30 ffff200008813c5c ffff800063cffe40 ffff20000811f688
>> > fe40: ffff800063cffea0 ffff20000811f838 0000000000000000 000060006d24d000
>> > fe60: ffffffffffffffff 0000ffff944f3974 0000000000000000 0000000000000015
>> > fe80: 0000000000000123 000000000000005e ffff200009852000 ffff20000811f82c
>> > fea0: 0000000000000000 ffff200008083f70 0000000000000000 0000000000000015
>> > fec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > fee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > ff00: 000000000000005e ffffff80ffffffd0 0101010101010101 0000000000000020
>> > ff20: 0000000000000018 0000000056bcb768 0000000000000000 0000ffff945be000
>> > ff40: 0000000000413110 0000ffff944f3950 0000000000000a03 00000000004020f8
>> > ff60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > ff80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> > ffa0: 0000000000000000 0000ffffc0cc99b0 0000000000401248 0000ffffc0cc99b0
>> > ffc0: 0000ffff944f3974 0000000000000000 0000000000000000 000000000000005e
>> > ffe0: 0000000000000000 0000000000000000 0403030303030100 0807060606060605
>> > Call trace:
>> > Exception stack(0xffff800063cff910 to 0xffff800063cffa40)
>> > f900: ffff20000ab59190 0001000000000000
>> > f920: ffff800063cffb00 ffff20000821a3b8 0000000080000145 000000000000003d
>> > f940: 1fffe4000156b23b ffff800063cffb70 ffff800063cff980 0001000000000000
>> > f960: ffff800063cff9d0 ffff2000081da8e8 ffff800063699ec0 ffff200009df9000
>> > f980: ffff800063cff990 ffff20000891e1e0 ffff800063cff9d0 ffff20000891e23c
>> > f9a0: ffff200009dbfe18 0000000000000040 0000000000000004 0000000000000001
>> > f9c0: 00000000000008ac 00000000000008ac ffff80005e71fb70 0000000000000000
>> > f9e0: 1fffe4000156b243 ffff20000ab59208 0000000000000003 0000000000000040
>> > fa00: 0000000000000000 0000000000000001 0000000000000007 ffff20000ab29000
>> > fa20: 1fffe400014b568c 0000000000000040 0000000000000000 0000000000000002
>> > [<ffff20000821a3b8>] __synchronize_srcu+0x3d0/0x470 kernel/rcu/srcu.c:412
>> > [<ffff20000821a480>] synchronize_srcu+0x28/0x60 kernel/rcu/srcu.c:516
>> > [<ffff2000083ffd20>] __mmu_notifier_release+0x268/0x3e0 mm/mmu_notifier.c:102
>> > [< inline >] mmu_notifier_release ./include/linux/mmu_notifier.h:235
>> > [<ffff2000083aedfc>] exit_mmap+0x21c/0x288 mm/mmap.c:2941
>> > [< inline >] __mmput kernel/fork.c:881
>> > [<ffff20000810caec>] mmput+0xdc/0x2e0 kernel/fork.c:903
>> > [< inline >] exit_mm kernel/exit.c:557
>> > [<ffff20000811db88>] do_exit+0x648/0x2020 kernel/exit.c:865
>> > [<ffff20000811f694>] do_group_exit+0xdc/0x260 kernel/exit.c:982
>> > [< inline >] SYSC_exit_group kernel/exit.c:993
>> > [<ffff20000811f838>] __wake_up_parent+0x0/0x60 kernel/exit.c:991
>> > [<ffff200008083f70>] el0_svc_naked+0x24/0x28 arch/arm64/kernel/entry.S:813
>> > Code: 97feee10 35fff680 17ffff1c d503201f (d4210000)
>> > ---[ end trace b727e9858bfac1ff ]---
>> > Kernel panic - not syncing: Fatal exception
>
> So the theory is that if !sp->running, all of SRCU's queues must be empty.
> So if you are holding ->queue_lock (with irqs disabled) and you see
> !sp->running, and then you enqueue a callback on ->batch_check0, then
> that callback must be the first in the list. And the code preceding
> the WARN_ON() you triggered does in fact check and enqueue while holding
> ->queue_lock with irqs disabled.
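
Indeed. From memory, the enqueue path in __synchronize_srcu() is roughly
(paraphrased):

        spin_lock_irq(&sp->queue_lock);
        if (!sp->running) {
                /* Steal processing ownership; all queues must be empty here. */
                sp->running = true;
                rcu_batch_queue(&sp->batch_check0, head);
                spin_unlock_irq(&sp->queue_lock);
                /* The caller then drives the grace period itself. */
        } else {
                rcu_batch_queue(&sp->batch_queue, head);
                spin_unlock_irq(&sp->queue_lock);
        }

so the check and the enqueue do sit in a single ->queue_lock critical
section.
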
>
> And rcu_batch_queue() does operate FIFO as required. (Otherwise,
> srcu_barrier() would not work.)
>
> There are only three calls to rcu_batch_queue(), and the one involved with
> the WARN_ON() enqueues to ->batch_check0. The other two enqueue to
> ->batch_queue. Callbacks move from ->batch_queue to ->batch_check0 to
> ->batch_check1 to ->batch_done, so nothing should slip in front.
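
Agreed, the batches are plain tail-pointer FIFOs, roughly (paraphrasing
kernel/rcu/srcu.c):

struct rcu_batch {
        struct rcu_head *head, **tail;
};

static inline void rcu_batch_queue(struct rcu_batch *b, struct rcu_head *head)
{
        *b->tail = head;        /* append at the tail: strictly FIFO */
        b->tail = &head->next;
}

static inline void rcu_batch_move(struct rcu_batch *to, struct rcu_batch *from)
{
        if (!rcu_batch_empty(from)) {
                *to->tail = from->head; /* splice "from" onto the end of "to" */
                to->tail = from->tail;
                rcu_batch_init(from);   /* leave "from" empty */
        }
}
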
>
> Of course, if ->running were ever set to false with any of ->batch_check0,
> ->batch_check1, or ->batch_done non-empty, this WARN_ON() would trigger.
> But srcu_reschedule() sets it to false only if all four batches are
> empty (and checks and sets under ->queue_lock()), and all other cases
> where it is set to false happen at initialization time, and also clear
> out the queues. Of course, if someone raced an init_srcu_struct() with
> either a call_srcu() or synchronize_srcu(), all bets are off. Now,
> mmu_notifier.c does invoke init_srcu_struct() manually, but it does
> so at subsys_initcall() time. Which -might- be after other things are
> happening, so one "hail Mary" attempted fix is to remove mmu_notifier_init()
> and replace the "static struct srcu_struct srcu" with:
>
> DEFINE_STATIC_SRCU(srcu);
>
> But this might require changing the name -- I vaguely recall some
> strangeness where the names of statically defined per-CPU variables need
> to be globally unique even when static. Easy enough to do, though.
> Might need a similar change to the "srcu" instances defined in vmd.c
> and kvm_host.h -- assuming that this change helps.
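
For concreteness, the mm/mmu_notifier.c side of that would be something
like (sketch only; assuming nothing else depends on mmu_notifier_init()):

-/* global SRCU for all MMs */
-static struct srcu_struct srcu;
+/* global SRCU for all MMs, initialized at compile time */
+DEFINE_STATIC_SRCU(srcu);
[...]
-static int __init mmu_notifier_init(void)
-{
-       return init_srcu_struct(&srcu);
-}
-subsys_initcall(mmu_notifier_init);
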
>
> Another possibility is that something in SRCU is messing with either the
> queues or the ->running field without holding ->queue_lock. And that does
> seem to be happening -- srcu_advance_batches() invokes rcu_batch_move()
> without holding anything. Which seems like it could cause trouble
> if someone else was invoking synchronize_srcu() concurrently. Those
> particular invocations might be safe due to access only by a single
> kthread/workqueue, given that all updates to ->batch_queue are protected
> by ->queue_lock (aside from initialization).
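
The moves in question look roughly like this (condensed paraphrase; the
real function also rechecks emptiness along the way and performs a second
try_check_zero()):

static void srcu_advance_batches(struct srcu_struct *sp, int trycount)
{
        int idx = 1 ^ (sp->completed & 1);

        /* Note: no ->queue_lock is held around any of this. */
        if (!try_check_zero(sp, idx, trycount))
                return;
        rcu_batch_move(&sp->batch_done, &sp->batch_check1);
        srcu_flip(sp);
        rcu_batch_move(&sp->batch_check1, &sp->batch_check0);
}
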
>
> But ->batch_check0 is updated by __synchronize_srcu(), though protected
> by ->queue_lock, and only if ->running is false, and with both the
> check and the update protected by the same ->queue_lock critical section.
> If ->running is false, then the workqueue cannot be running, so it remains
> to see if all other updates to ->batch_check0 are either with ->queue_lock
> held and ->running false on the one hand or from the workqueue handler
> on the other:
>
> srcu_collect_new() updates with ->queue_lock held, but does not check
> ->running. It is invoked only from process_srcu(), which in
> turn is invoked only as a workqueue handler. The work is queued
> from:
>
> call_srcu(), which does so with ->queue_lock held having just
> set ->running to true.
>
> srcu_reschedule(), which invokes it if there are non-empty
> queues. This is invoked from __synchronize_srcu()
> in the case where it has set ->running to true
> after finding the queues empty, which should imply
> no other instances.
>
> It is also invoked from process_srcu(), which is
> invoked only as a workqueue handler. (Yay
> recursive inquiry!)
>
> srcu_advance_batches() updates without locks held. It is invoked as
> follows:
>
> __synchronize_srcu() in the case where ->running was set, which
> as noted before excludes workqueue handlers.
>
> process_srcu() which as noted before is only invoked from
> a workqueue handler.
>
> So an SRCU workqueue is invoked only from a workqueue handler, or from
> some other task that transitioned ->running from false to true while
> holding ->queue_lock. There should therefore only be one SRCU workqueue
> per srcu_struct, so this should be safe. Though I hope that it can
> be simplified a bit. :-/
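
And for completeness: srcu_reschedule() clears ->running only after
rechecking all four batches under the lock, roughly (paraphrased):

static void srcu_reschedule(struct srcu_struct *sp)
{
        bool pending = true;

        if (rcu_batch_empty(&sp->batch_done) &&
            rcu_batch_empty(&sp->batch_check1) &&
            rcu_batch_empty(&sp->batch_check0) &&
            rcu_batch_empty(&sp->batch_queue)) {
                spin_lock_irq(&sp->queue_lock);
                if (rcu_batch_empty(&sp->batch_done) &&
                    rcu_batch_empty(&sp->batch_check1) &&
                    rcu_batch_empty(&sp->batch_check0) &&
                    rcu_batch_empty(&sp->batch_queue)) {
                        sp->running = false;    /* nothing queued anywhere */
                        pending = false;
                }
                spin_unlock_irq(&sp->queue_lock);
        }

        if (pending)
                queue_delayed_work(system_power_efficient_wq,
                                   &sp->work, SRCU_INTERVAL);
}
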
>
> So the only suggestion I have at the moment is static definition of
> the "srcu" variable. Lai, Josh, Steve, Mathieu, anything I missed?
>
> Thanx, Paul
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com