Message-Id: <20130529185300.403107512@goodmis.org>
Date:	Wed, 29 May 2013 14:53:00 -0400
From:	Steven Rostedt <rostedt@...dmis.org>
To:	linux-kernel@...r.kernel.org
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Ingo Molnar <mingo@...nel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Subject: [PATCH 0/2] [GIT PULL] rcu/ftrace: Fix livelock from overhead of RCU debugging


Linus,

[ If you don't care to read the background of these patches,
  just jump down to the "Please pull" line. I included this
  description in the tag as well. ]

I was testing a config for one bug and triggered a livelock. An NMI
watchdog dump showed what was happening:

[   65.972000] NMI backtrace for cpu 2
[   65.972000] CPU: 2 PID: 48 Comm: kworker/2:1 Not tainted 3.10.0-rc2-test+ #151
[   65.972000] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[   65.972000] task: ffff880078dc2140 ti: ffff880078dc4000 task.ti: ffff880078dc4000
[   65.972000] RIP: 0010:[<ffffffff810c2259>]  [<ffffffff810c2259>] ftrace_ops_list_func+0x144/0x151
[   65.972000] RSP: 0000:ffff880078dc5ac8  EFLAGS: 00000006
[   65.972000] RAX: 0000000000008000 RBX: 0000000000000001 RCX: 000000000000000f
[   65.972000] RDX: 0000000000000002 RSI: ffffffff810c0e9f RDI: ffffffff810b73f7
[   65.972000] RBP: ffff880078dc5af8 R08: ffff880078dc5be0 R09: ffff880078dc5bb0
[   65.972000] R10: ffffffff81a3c730 R11: 0000000000000001 R12: 0000000000008000
[   65.972000] R13: ffffffff810b73f7 R14: 0000000000000000 R15: ffffffff810c0e9f
[   65.972000] FS:  0000000000000000(0000) GS:ffff88007aa00000(0000) knlGS:0000000000000000
[   65.972000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   65.972000] CR2: 0000000000000168 CR3: 0000000001a0b000 CR4: 00000000000007e0
[   65.972000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   65.972000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   65.972000] Stack:
[   65.972000]  ffff880078d97480 0000000000000001 ffffffff810b9148 ffff880078d97480
[   65.972000]  0000000000000000 ffffffff810bb6ba ffff880078dc5bd0 ffffffff814fa7c0
[   65.972000]  0000000000000000 ffffffff810bb6ba ffff880078dc5c00 ffffffff814fa7c0
[   65.972000] Call Trace:
[   65.972000]  [<ffffffff810b9148>] ? check_for_new_grace_period+0x115/0x115
[   65.972000]  [<ffffffff810bb6ba>] ? rcu_note_context_switch+0x20e/0x260
[   65.972000]  [<ffffffff814fa7c0>] ftrace_call+0x5/0x2f
[   65.972000]  [<ffffffff810bb6ba>] ? rcu_note_context_switch+0x20e/0x260
[   65.972000]  [<ffffffff814fa7c0>] ? ftrace_call+0x5/0x2f
[   65.972000]  [<ffffffff810b73fc>] ? rcu_is_cpu_idle+0x5/0x69
[   65.972000]  [<ffffffff810558d3>] ? debug_lockdep_rcu_enabled+0x5/0x35
[   65.972000]  [<ffffffff810b73fc>] ? rcu_is_cpu_idle+0x5/0x69
[   65.972000]  [<ffffffff810c0e9f>] ? rcu_read_lock_held+0x19/0x3c
[   65.972000]  [<ffffffff810c0f90>] ftrace_ops_test+0x47/0x93
[   65.972000]  [<ffffffff810b9148>] ? check_for_new_grace_period+0x115/0x115
[   65.972000]  [<ffffffff810c21e1>] ftrace_ops_list_func+0xcc/0x151
[   65.972000]  [<ffffffff814fa7c0>] ftrace_call+0x5/0x2f
[   65.972000]  [<ffffffff810c2234>] ? ftrace_ops_list_func+0x11f/0x151
[   65.972000]  [<ffffffff810bb6b2>] ? rcu_note_context_switch+0x206/0x260
[   65.972000]  [<ffffffff810b914d>] ? rcu_preempt_qs+0x5/0x53
[   65.972000]  [<ffffffff81086f41>] ? trace_hardirqs_off+0xd/0xf
[   65.972000]  [<ffffffff810b914d>] ? rcu_preempt_qs+0x5/0x53
[   65.972000]  [<ffffffff810bb6ba>] ? rcu_note_context_switch+0x20e/0x260
[   65.972000]  [<ffffffff814f2bca>] ? schedule+0x64/0x66
[   65.972000]  [<ffffffff814f246e>] __schedule+0x67/0x710
[   65.972000]  [<ffffffff81052e5e>] ? worker_thread+0x1e9/0x1f3
[   65.972000]  [<ffffffff81052e5e>] ? worker_thread+0x1e9/0x1f3
[   65.972000]  [<ffffffff814f2bca>] schedule+0x64/0x66
[   65.972000]  [<ffffffff81052e63>] worker_thread+0x1ee/0x1f3
[   65.972000]  [<ffffffff81052c75>] ? rescuer_thread+0x26c/0x26c
[   65.972000]  [<ffffffff8105a8bf>] kthread+0xb1/0xb9
[   65.972000]  [<ffffffff81064b3c>] ? finish_task_switch+0x7c/0xf8
[   65.972000]  [<ffffffff8105a80e>] ? __init_kthread_worker+0x59/0x59
[   65.972000]  [<ffffffff814faadc>] ret_from_fork+0x7c/0xb0
[   65.972000]  [<ffffffff8105a80e>] ? __init_kthread_worker+0x59/0x59

What happened was that ftrace's function tracer uses
hlist_for_each_entry_rcu(), which uses rcu_dereference_raw(), and it
also calls rcu_dereference_raw() directly. As this macro is defined via
rcu_dereference_check(), it calls rcu_read_lock_sched_held(), which
performs various RCU checks as well as lockdep checks and such. When
that happens for *every* function call in the kernel, it slows the
system down to the point where it can livelock (interrupts take too
long to finish before another interrupt is triggered).
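
To make that concrete, the expansion chain looks roughly like this
(paraphrased from memory of the 3.10-era rcupdate.h; a sketch, not a
verbatim quote):

#define rcu_dereference_raw(p)  rcu_dereference_check(p, 1)

#define rcu_dereference_check(p, c) \
        __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu)

/*
 * Note the held-check sits to the left of "(c)", so even the constant
 * 1 passed by rcu_dereference_raw() does not short-circuit it.
 * __rcu_dereference_check() then feeds the condition into
 * rcu_lockdep_assert(), which calls debug_lockdep_rcu_enabled(), and
 * the held-checks (rcu_read_lock_held() and friends such as
 * rcu_read_lock_sched_held()) end up in lockdep's lock_is_held().
 */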

To give an idea of the abuse that function tracing is taking, every
function that is traced can do the following:

in __ftrace_ops_list_func() we have:

        do_for_each_ftrace_op(op, ftrace_ops_list) {
                if (ftrace_ops_test(op, ip))
                        op->func(ip, parent_ip, op, regs);
        } while_for_each_ftrace_op(op);

Where:

#define do_for_each_ftrace_op(op, list)                 \
        op = rcu_dereference_raw(list);                 \
        do

and

#define while_for_each_ftrace_op(op)                            \
        while (likely(op = rcu_dereference_raw((op)->next)) &&  \
               unlikely((op) != &ftrace_list_end))

And ftrace_ops_test() has:

        filter_hash = rcu_dereference_raw(ops->filter_hash);
        notrace_hash = rcu_dereference_raw(ops->notrace_hash);

        if ((ftrace_hash_empty(filter_hash) ||
             ftrace_lookup_ip(filter_hash, ip)) &&
            (ftrace_hash_empty(notrace_hash) ||
             !ftrace_lookup_ip(notrace_hash, ip)))
                ret = 1;

Where ftrace_lookup_ip() has:

        hlist_for_each_entry_rcu(entry, hhd, hlist) {
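
And hlist_for_each_entry_rcu() itself hides two rcu_dereference_raw()
calls, one for the list head and one for each step to the next entry.
Roughly, from the 3.10-era rculist.h (a lightly simplified sketch):

#define hlist_for_each_entry_rcu(pos, head, member)                      \
        for (pos = hlist_entry_safe(rcu_dereference_raw(                 \
                        hlist_first_rcu(head)), typeof(*(pos)), member); \
             pos;                                                        \
             pos = hlist_entry_safe(rcu_dereference_raw(                 \
                        hlist_next_rcu(&(pos)->member)), typeof(*(pos)), member))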

Now that is 6 calls to rcu_dereference_raw() for every single function
that is traced! And this doesn't count the additional calls made by any
callbacks registered with the function tracer, or multiple loop
iterations of while_for_each_ftrace_op(). When RCU debugging is on,
that's 6 calls to rcu_read_lock_sched_held(), which not only makes
several calls that are themselves traced by the function tracer
(luckily the recursion protection catches those), but those calls also
call preempt_disable/enable (which gets traced by the function tracer).
Worse yet, lock_is_held() is called, which disables and then re-enables
interrupts and searches the list of held locks.
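
Tallying the call sites quoted above (my own count, taking the two
dereferences hidden inside hlist_for_each_entry_rcu() shown earlier):

        do_for_each_ftrace_op()       1  rcu_dereference_raw()
        while_for_each_ftrace_op()    1
        ftrace_ops_test()             2
        hlist_for_each_entry_rcu()    2  (list head + first step)
                                     --
                                      6  per traced function, minimum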

All this overhead is done for every function call that is traced
(which could be most of the functions in the kernel), and it happens at
least 6 times per call. This can slow things down so drastically that
the system livelocks.

Now, we could put a bunch of notrace annotations in the RCU code, but
as RCU is a more critical subsystem than tracing, and is used more
often, I find it advantageous to be able to trace RCU. RCU is also a
more complex subsystem than tracing, and the ability to trace RCU is an
asset in debugging it and gives us better confidence that it works.
Thus, I believe it is better to remove some of the debuggability from
the tracing infrastructure than to remove it from RCU. Not to mention,
even eliminating the recursive calls caused by tracing RCU would not
stop the RCU functions, or the lockdep function, from being called 6
times for every function traced.

After discussing this with Paul McKenney, I added two new RCU interfaces:

 rcu_dereference_raw_notrace()
 hlist_for_each_entry_rcu_notrace()

The function tracer now uses these, which prevents the livelock.
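
Roughly, the _notrace variants keep the same expansion but pass a
constant-true condition straight to __rcu_dereference_check(), so
rcu_read_lock_held() and the lockdep machinery are never invoked. A
sketch of the shape of the change (not a verbatim quote of the patch):

#define rcu_dereference_raw_notrace(p) \
        __rcu_dereference_check((p), 1, __rcu)

/*
 * hlist_for_each_entry_rcu_notrace() is the same loop as
 * hlist_for_each_entry_rcu(), with rcu_dereference_raw() replaced by
 * rcu_dereference_raw_notrace().
 */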


Please pull the latest trace-fixes-v3.10-rc3-v2 tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
trace-fixes-v3.10-rc3-v2

Tag SHA1: 066277682313adbca025f177cff8a211f2b4f46d
Head SHA1: 1bb539ca36e21c2f4fce0865e11df384bc7b7656


Steven Rostedt (2):
      rcu: Add _notrace variation of rcu_dereference_raw() and hlist_for_each_entry_rcu()
      ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends

----
 include/linux/rculist.h  |   20 ++++++++++++++++++++
 include/linux/rcupdate.h |    9 +++++++++
 kernel/trace/ftrace.c    |   18 +++++++++---------
 3 files changed, 38 insertions(+), 9 deletions(-)
