linux-kernel - Re: srcu: BUG in __synchronize

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACT4Y+YcQNEvR0pFD97JXPJ2hyZiOR6-QHdbw=9-XerTRapsfQ@mail.gmail.com>
Date:   Mon, 27 Mar 2017 14:36:35 +0200
From:   Dmitry Vyukov <dvyukov@...gle.com>
To:     Paul McKenney <paulmck@...ux.vnet.ibm.com>
Cc:     Lance Roy <ldr709@...il.com>,
        Andrey Konovalov <andreyknvl@...gle.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Josh Triplett <josh@...htriplett.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        LKML <linux-kernel@...r.kernel.org>,
        syzkaller <syzkaller@...glegroups.com>,
        Kostya Serebryany <kcc@...gle.com>
Subject: Re: srcu: BUG in __synchronize_srcu

On Tue, Mar 14, 2017 at 5:21 PM, Paul E. McKenney
<paulmck@...ux.vnet.ibm.com> wrote:
> On Tue, Mar 14, 2017 at 12:47:02AM -0700, Lance Roy wrote:
>> I am not sure how the rcu_scheduler_active changes in __synchronize_srcu work,
>> but there seem to be a few problems in them. First,
>> "if (done && likely(!driving))" on line 453 doesn't appear to ever happen,
>> as driving doesn't get set to false when srcu_reschedule is called. This seems
>> like it could cause a race condition if another thread notices that ->running is
>> false, adds itself to the queue, set ->running to true, and starts on its own
>> grace period before the first thread acquires the lock again on line 469. Then
>> the first thread will then acquire the lock, set running to false, and release
>> the lock, resulting in an invalid state where ->running is false but the second
>> thread is still trying to finish its grace period.
>>
>> Second, the while loop on line 455 seems to violate to rule that ->running
>> shouldn't be false when there are entries in the queue. If a second thread adds
>> itself to the queue while the first thread is driving SRCU inside that loop, and
>> then the first thread finishes its own grace period and quits the loop, it will
>> set ->running to false even though there is still a thread on the queue.
>>
>> The second issue requires rcu_scheduler_active to be RCU_SCHEDULER_INIT to
>> occur, and as I don't what the assumptions during RCU_SCHEDULER_INIT are I don't
>> know if it is actually a problem, but the first issue looks like it could occur
>> at any time.
>
> Thank you for looking into this!
>
> I determined that my patch-order strategy was flawed, as it required
> me to rewrite the mid-boot functionality several times.  I therefore
> removed the mid-boot commits.  I will add them in later, but they will
> use a rather different approach based on a grace-period sequence number
> similar to that used by the expedited grace periods.
>
> Which should also teach me to be less aggressive about pushing new code
> to -next.  For a few weeks, anyway.  ;-)
>
>                                                         Thanx, Paul
>
>> Thanks,
>> Lance
>>
>> On Fri, 10 Mar 2017 14:26:09 -0800
>> "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com> wrote:
>> > On Fri, Mar 10, 2017 at 08:29:55PM +0100, Andrey Konovalov wrote:
>> > > On Fri, Mar 10, 2017 at 8:28 PM, Andrey Konovalov <andreyknvl@...gle.com>
>> > > wrote:
>> > > > Kernel panic - not syncing: Fatal exception
>> >
>> > So the theory is that if !sp->running, all of SRCU's queues must be empty.
>> > So if you are holding ->queue_lock (with irqs disabled) and you see
>> > !sp->running, and then you enqueue a callback on ->batch_check0, then
>> > that callback must be the first in the list.  And the code preceding
>> > the WARN_ON() you triggered does in fact check and enqueue shile holding
>> > ->queue_lock with irqs disabled.
>> >
>> > And rcu_batch_queue() does operate FIFO as required.  (Otherwise,
>> > srcu_barrier() would not work.)
>> >
>> > There are only three calls to rcu_batch_queue(), and the one involved with
>> > the WARN_ON() enqueues to ->batch_check0.  The other two enqueue to
>> > ->batch_queue.  Callbacks move from ->batch_queue to ->batch_check0 to
>> > ->batch_check1 to ->batch_done, so nothing should slip in front.
>> >
>> > Of course, if ->running were ever set to false with any of ->batch_check0,
>> > ->batch_check1, or ->batch_done non-empty, this WARN_ON() would trigger.
>> > But srcu_reschedule() sets it to false only if all four batches are
>> > empty (and checks and sets under ->queue_lock()), and all other cases
>> > where it is set to false happen at initialization time, and also clear
>> > out the queues.  Of course, if someone raced an init_srcu_struct() with
>> > either a call_srcu() or synchronize_srcu(), all bets are off.  Now,
>> > mmu_notifier.c does invoke init_srcu_struct() manually, but it does
>> > so at subsys_initcall() time.  Which -might- be after other things are
>> > happening, so one "hail Mary" attempted fix is to remove mmu_notifier_init()
>> > and replace the "static struct srcu_struct srcu" with:
>>
>> >
>> >     DEFINE_STATIC_SRCU(srcu);
>> >
>> > But this might require changing the name -- I vaguely recall some
>> > strangeness where the names of statically defined per-CPU variables need
>> > to be globally unique even when static.  Easy enough to do, though.
>> > Might need a similar change to the "srcu" instances defined in vmd.c
>> > and kvm_host.h -- assuming that this change helps.
>> >
>> > Another possibility is that something in SRCU is messing with either the
>> > queues or the ->running field without holding ->queue_lock.  And that does
>> > seem to be happening -- srcu_advance_batches() invokes rcu_batch_move()
>> > without holding anything.  Which seems like it could cause trouble
>> > if someone else was invoking synchronize_srcu() concurrently.  Those
>> > particular invocations might be safe due to access only by a single
>> > kthread/workqueue, given that all updates to ->batch_queue are protected
>> > by ->queue_lock (aside from initialization).
>> >
>> > But ->batch_check0 is updated by __synchronize_srcu(), though protected
>> > by ->queue_lock, and only if ->running is false, and with both the
>> > check and the update protected by the same ->queue_lock critical section.
>> > If ->running is false, then the workqueue cannot be running, so it remains
>> > to see if all other updates to ->batch_check0 are either with ->queue_lock
>> > held and ->running false on the one hand or from the workqueue handler
>> > on the other:
>> >
>> > srcu_collect_new() updates with ->queue_lock held, but does not check
>> >     ->running.  It is invoked only from process_srcu(), which in
>> >     turn is invoked only as a workqueue handler.  The work is queued
>> >     from:
>> >
>> >     call_srcu(), which does so with ->queue_lock held having just
>> >             set ->running to true.
>> >
>> >     srcu_reschedule(), which invokes it if there are non-empty
>> >             queues.  This is invoked from __synchronize_srcu()
>> >             in the case where it has set ->running to true
>> >             after finding the queues empty, which should imply
>> >             no other instances.
>> >
>> >             It is also invoked from process_srcu(), which is
>> >             invoked only as a workqueue handler.  (Yay
>> >             recursive inquiry!)
>> >
>> > srcu_advance_batches() updates without locks held.  It is invoked as
>> >     follows:
>> >
>> >     __synchronize_srcu() in the case where ->running was set, which
>> >             as noted before excludes workqueue handlers.
>> >
>> >     process_srcu() which as noted before is only invoked from
>> >             a workqueue handler.
>> >
>> > So an SRCU workqueue is invoked only from a workqueue handler, or from
>> > some other task that transitioned ->running from false to true while
>> > holding ->queuelock.  There should therefore only be one SRCU workqueue
>> > per srcu_struct, so this should be safe.  Though I hope that it can
>> > be simplified a bit.  :-/
>> >
>> > So the only suggestion I have at the moment is static definition of
>> > the "srcu" variable.  Lai, Josh, Steve, Mathieu, anything I missed?
>> >
>> >                                             Thanx, Paul



This happened on linux-next/65b2dc38291f9f27e5ec3b804d6eb3b5f79a3ce4
and may be related.
The report says that srcu subsystem still uses the srcu object after
it has been freed. It can be a kvm fault as well.


==================================================================
BUG: KASAN: use-after-free in debug_spin_unlock
kernel/locking/spinlock_debug.c:97 [inline]
BUG: KASAN: use-after-free in do_raw_spin_unlock+0x2ea/0x320
kernel/locking/spinlock_debug.c:134
Read of size 4 at addr ffff88014158a564 by task kworker/1:1/5712

CPU: 1 PID: 5712 Comm: kworker/1:1 Not tainted 4.11.0-rc3-next-20170324+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Workqueue: events_power_efficient process_srcu
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x2fb/0x40f lib/dump_stack.c:52
 print_address_description+0x7f/0x260 mm/kasan/report.c:250
 kasan_report_error mm/kasan/report.c:349 [inline]
 kasan_report.part.3+0x21f/0x310 mm/kasan/report.c:372
 kasan_report mm/kasan/report.c:392 [inline]
 __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:392
 debug_spin_unlock kernel/locking/spinlock_debug.c:97 [inline]
 do_raw_spin_unlock+0x2ea/0x320 kernel/locking/spinlock_debug.c:134
 __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:167 [inline]
 _raw_spin_unlock_irq+0x22/0x70 kernel/locking/spinlock.c:199
 spin_unlock_irq include/linux/spinlock.h:349 [inline]
 srcu_reschedule+0x1a1/0x260 kernel/rcu/srcu.c:582
 process_srcu+0x63c/0x11c0 kernel/rcu/srcu.c:600
 process_one_work+0xac0/0x1b00 kernel/workqueue.c:2097
 worker_thread+0x1b4/0x1300 kernel/workqueue.c:2231
 kthread+0x36c/0x440 kernel/kthread.c:231
 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430

Allocated by task 20961:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:515
 set_track mm/kasan/kasan.c:527 [inline]
 kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:619
 kmem_cache_alloc_trace+0x10b/0x670 mm/slab.c:3635
 kmalloc include/linux/slab.h:492 [inline]
 kzalloc include/linux/slab.h:665 [inline]
 kvm_arch_alloc_vm include/linux/kvm_host.h:773 [inline]
 kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:610 [inline]
 kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 [inline]
 kvm_dev_ioctl+0x1bf/0x1460 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3205
 vfs_ioctl fs/ioctl.c:45 [inline]
 do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
 SYSC_ioctl fs/ioctl.c:700 [inline]
 SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Freed by task 20960:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:515
 set_track mm/kasan/kasan.c:527 [inline]
 kasan_slab_free+0x6e/0xc0 mm/kasan/kasan.c:592
 __cache_free mm/slab.c:3511 [inline]
 kfree+0xd3/0x250 mm/slab.c:3828
 kvm_arch_free_vm include/linux/kvm_host.h:778 [inline]
 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:732 [inline]
 kvm_put_kvm+0x709/0x9a0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:747
 kvm_vm_release+0x42/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:758
 __fput+0x332/0x800 fs/file_table.c:209
 ____fput+0x15/0x20 fs/file_table.c:245
 task_work_run+0x197/0x260 kernel/task_work.c:116
 exit_task_work include/linux/task_work.h:21 [inline]
 do_exit+0x1a53/0x27c0 kernel/exit.c:878
 do_group_exit+0x149/0x420 kernel/exit.c:982
 get_signal+0x7d8/0x1820 kernel/signal.c:2318
 do_signal+0xd2/0x2190 arch/x86/kernel/signal.c:808
 exit_to_usermode_loop+0x21c/0x2d0 arch/x86/entry/common.c:157
 prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
 syscall_return_slowpath+0x4d3/0x570 arch/x86/entry/common.c:263
 entry_SYSCALL_64_fastpath+0xbc/0xbe

The buggy address belongs to the object at ffff880141581640
 which belongs to the cache kmalloc-65536 of size 65536
The buggy address is located 36644 bytes inside of
 65536-byte region [ffff880141581640, ffff880141591640)
The buggy address belongs to the page:
page:ffffea000464b400 count:1 mapcount:0 mapping:ffff880141581640
index:0x0 compound_mapcount: 0
flags: 0x200000000008100(slab|head)
raw: 0200000000008100 ffff880141581640 0000000000000000 0000000100000001
raw: ffffea00064b1f20 ffffea000640fa20 ffff8801db800d00
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88014158a400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88014158a480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88014158a500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                       ^
 ffff88014158a580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88014158a600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================