linux-kernel - Re: [PATCH RT 3.18] ring-buffer: Mark irq_work as HARD

Message-ID: <552FD55F.8000105@siemens.com>
Date:	Thu, 16 Apr 2015 17:29:35 +0200
From:	Jan Kiszka <jan.kiszka@...mens.com>
To:	Steven Rostedt <rostedt@...dmis.org>
CC:	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	RT <linux-rt-users@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RT 3.18] ring-buffer: Mark irq_work as HARD_IRQ to prevent
 deadlocks

On 2015-04-16 17:10, Steven Rostedt wrote:
> On Thu, 16 Apr 2015 16:28:58 +0200
> Jan Kiszka <jan.kiszka@...mens.com> wrote:
> 
>> On 2015-04-16 16:26, Sebastian Andrzej Siewior wrote:
>>> On 04/16/2015 04:06 PM, Jan Kiszka wrote:
>>>> ftrace may trigger rb_wakeups while holding pi_lock which will also be
>>>> requested via trace_...->...->ring_buffer_unlock_commit->...->
>>>> irq_work_queue->raise_softirq->try_to_wake_up. This quickly causes
>>>> deadlocks when trying to use ftrace under -rt.
>>>>
>>>> Resolve this by marking the ring buffer's irq_work as HARD_IRQ.
>>>>
>>>> Signed-off-by: Jan Kiszka <jan.kiszka@...mens.com>
>>>> ---
>>>>
>>>> I'm not yet sure if this doesn't push work into hard-irq context that
>>>> is better not done there on -rt.
>>>
>>> everything should be done in the soft-irq.
>>>
>>>>
>>>> I'm also not sure if there aren't more such cases, given that -rt turns
>>>> the default irq_work wakeup policy around. But maybe we are lucky.
>>>
>>> The only thing that is getting done in the hardirq is the FULL_NO_HZ
>>> thingy. I would be _very_ glad if we could keep it that way.
> 
> tracing is special, even more so than NO_HZ_FULL, as it also traces
> that as well (and even RCU). Tracing the kernel is like a debugger.
> Ideally, it would not be part of the kernel, but just an external
> observer. Without special hardware that is not the case, so we try to
> be outside the main system as much as possible.
> 
> 
>>
>> Then - to my current understanding - we need an NMI-safe trigger for
>> soft-irq work. Is there anything like this existing already? Or can we
>> still use the IPI-based kick without actually doing the work in hard-irq
>> context?
>>
> 
> The reason why it uses irq_work() is because a simple wakeup can
> deadlock the system if called by the tracing infrastructure (as we see
> raise_softirq() does too).
> 
> But yeah, there's no real need to have the ring buffer irq work
> handler run from hardirq context. The only requirement is that you
> cannot do the raise from the irq_work_queue call. If you want to have the
> hardirq work handler do the raise softirq, that's fine. Perhaps that's
> the solution? Have all irq_work_queue() always trigger the hard irq, but
> the hard irq may just raise a softirq or it will call the handler
> directly if IRQ_WORK_HARD_IRQ is set.

I'll play with that.
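
A rough userspace model of the dispatch suggested above, as a sketch only. Apart from the IRQ_WORK_HARD_IRQ flag named in the thread, every name here (struct work, model_irq_work_queue, model_hardirq, model_softirq, record) is hypothetical and is not kernel API. The queue step only links the item and requests the self-IPI; the hard-irq step either runs the handler directly or hands it over to softirq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag name taken from the thread; the value is an assumption. */
#define IRQ_WORK_HARD_IRQ 0x1u

enum ctx { CTX_NONE, CTX_HARDIRQ, CTX_SOFTIRQ };

/* Hypothetical stand-in for an irq_work item. */
struct work {
    unsigned int flags;
    void (*func)(struct work *w, enum ctx where);
    struct work *next;
    enum ctx ran_in;   /* context the handler ran in, for inspection */
    bool ran;
};

struct work *pending;    /* queued, waiting for the self-IPI */
struct work *deferred;   /* handed over to softirq by the hard irq */
bool ipi_pending;
bool softirq_pending;

/* Queue step: link the item and request the IPI. No wakeup and no
 * softirq raise happens here, so it is safe from tracing context. */
void model_irq_work_queue(struct work *w)
{
    assert(w->func != NULL);
    w->next = pending;
    pending = w;
    ipi_pending = true;
}

/* Hard-irq step: run flagged items directly, defer the rest. */
void model_hardirq(void)
{
    struct work *w = pending;

    pending = NULL;
    ipi_pending = false;
    while (w) {
        struct work *next = w->next;

        if (w->flags & IRQ_WORK_HARD_IRQ) {
            w->func(w, CTX_HARDIRQ);
        } else {
            w->next = deferred;
            deferred = w;
            softirq_pending = true;   /* models raising the softirq */
        }
        w = next;
    }
}

/* Softirq step: run everything the hard irq deferred. */
void model_softirq(void)
{
    struct work *w = deferred;

    deferred = NULL;
    softirq_pending = false;
    while (w) {
        struct work *next = w->next;

        w->func(w, CTX_SOFTIRQ);
        w = next;
    }
}

/* Example handler: just records where it was called from. */
void record(struct work *w, enum ctx where)
{
    w->ran = true;
    w->ran_in = where;
}
```

In this model an unflagged item (such as the ring buffer wakeup) never runs its handler in hard-irq context, while a flagged item still does. Whether the real code can be structured this simply is left open here.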

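My patch is definitely not OK. With the ring buffer's irq_work marked as
HARD_IRQ, rb_wake_up_waiters() runs from the irq_work interrupt, and
__wake_up() then takes a sleeping lock (rt_spin_lock) in hard-irq context,
which causes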

[  380.372579] BUG: scheduling while atomic: trace-cmd/2149/0x00010004
...
[  380.372604] Call Trace:
[  380.372610]  <IRQ>  [<ffffffff81607694>] dump_stack+0x50/0x9f
[  380.372613]  [<ffffffff8160413c>] __schedule_bug+0x59/0x69
[  380.372615]  [<ffffffff8160a1d5>] __schedule+0x675/0x800
[  380.372617]  [<ffffffff8160a394>] schedule+0x34/0xa0
[  380.372619]  [<ffffffff8160bf7d>] rt_spin_lock_slowlock+0xcd/0x290
[  380.372621]  [<ffffffff8160d8b5>] rt_spin_lock+0x25/0x30
[  380.372623]  [<ffffffff8108fe39>] __wake_up+0x29/0x60
[  380.372626]  [<ffffffff81106960>] rb_wake_up_waiters+0x40/0x50
[  380.372628]  [<ffffffff8112cdbf>] irq_work_run_list+0x3f/0x60
[  380.372630]  [<ffffffff8112cdf9>] irq_work_run+0x19/0x20
[  380.372632]  [<ffffffff81008409>] smp_trace_irq_work_interrupt+0x39/0x120
[  380.372633]  [<ffffffff8160f8ef>] trace_irq_work_interrupt+0x6f/0x80
[  380.372636]  <EOI>  [<ffffffff8103d66d>] ? native_apic_msr_write+0x2d/0x30
[  380.372637]  [<ffffffff8103d53d>] x2apic_send_IPI_self+0x1d/0x20
[  380.372638]  [<ffffffff8100851e>] arch_irq_work_raise+0x2e/0x40
[  380.372639]  [<ffffffff8112d025>] irq_work_queue+0xc5/0xf0
[  380.372641]  [<ffffffff81107d8a>] ring_buffer_unlock_commit+0x14a/0x2e0
[  380.372643]  [<ffffffff8110f894>] trace_buffer_unlock_commit+0x24/0x60
[  380.372644]  [<ffffffff8111f9da>] ftrace_event_buffer_commit+0x8a/0xc0
[  380.372647]  [<ffffffff811c58de>] ftrace_raw_event_writeback_dirty_inode_template+0x8e/0xc0
[  380.372648]  [<ffffffff811c8b21>] __mark_inode_dirty+0x1d1/0x310
[  380.372650]  [<ffffffff811d0ec8>] generic_write_end+0x78/0xb0
[  380.372658]  [<ffffffffa021c42b>] ext4_da_write_end+0x10b/0x2f0 [ext4]
[  380.372661]  [<ffffffff8116335e>] ? pagefault_enable+0x1e/0x20
[  380.372662]  [<ffffffff8113c337>] generic_perform_write+0x107/0x1b0
[  380.372664]  [<ffffffff8113e49f>] __generic_file_write_iter+0x15f/0x350
[  380.372668]  [<ffffffffa0210c91>] ext4_file_write_iter+0x101/0x3d0 [ext4]
[  380.372670]  [<ffffffff8118f59b>] ? __kmalloc+0x16b/0x250
[  380.372672]  [<ffffffff811ca96e>] ? iter_file_splice_write+0x8e/0x430
[  380.372673]  [<ffffffff811ca96e>] ? iter_file_splice_write+0x8e/0x430
[  380.372674]  [<ffffffff811cab35>] iter_file_splice_write+0x255/0x430
[  380.372676]  [<ffffffff811cc474>] SyS_splice+0x214/0x760
[  380.372677]  [<ffffffff81011fe7>] ? syscall_trace_enter_phase2+0xa7/0x1e0
[  380.372679]  [<ffffffff8160e266>] tracesys_phase2+0xd4/0xd9
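
The trailing 0x00010004 in the BUG line is the preempt count. A small decoder is sketched below as an illustration. It assumes the bit layout of mainline 3.18 (include/linux/preempt_mask.h: 8 bits of preempt depth, 8 bits of softirq count, 4 bits of hardirq count, 1 NMI bit); the -rt tree may differ in detail, and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Assumed layout, taken from mainline 3.18; not verified against -rt. */
#define PREEMPT_MASK 0x000000ffu
#define SOFTIRQ_MASK 0x0000ff00u
#define HARDIRQ_MASK 0x000f0000u
#define NMI_MASK     0x00100000u

/* Hypothetical helper type holding the decoded fields. */
struct pc_fields {
    unsigned int preempt;   /* preempt_disable() nesting depth */
    unsigned int softirq;   /* softirq count field */
    unsigned int hardirq;   /* hard-irq nesting */
    unsigned int nmi;       /* 1 if in NMI */
};

/* Split a raw preempt count into its fields. */
struct pc_fields decode_preempt_count(unsigned int pc)
{
    struct pc_fields f = {
        .preempt = pc & PREEMPT_MASK,
        .softirq = (pc & SOFTIRQ_MASK) >> 8,
        .hardirq = (pc & HARDIRQ_MASK) >> 16,
        .nmi     = (pc & NMI_MASK) >> 20,
    };
    return f;
}
```

Under that assumed layout, decode_preempt_count(0x00010004) gives a hardirq field of 1 and a preempt depth of 4, which matches the <IRQ> section of the trace: schedule() was reached from hard-irq context.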

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux