Date:	Sat, 18 Apr 2009 10:12:19 -0400 (EDT)
From:	Steven Rostedt <rostedt@...dmis.org>
To:	Ingo Molnar <mingo@...e.hu>
cc:	linux-kernel@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Frederic Weisbecker <fweisbec@...il.com>
Subject: Re: [PATCH 3/5] tracing: add same level recursion detection


On Sat, 18 Apr 2009, Ingo Molnar wrote:
> > +
> > +static void trace_recursive_unlock(void)
> > +{
> > +	int level;
> > +
> > +	level = trace_irq_level();
> > +
> > +	WARN_ON_ONCE(!(current->trace_recursion & (1 << level)));
> > +
> > +	current->trace_recursion &= ~(1 << level);
> > +}
> > +
> >  static DEFINE_PER_CPU(int, rb_need_resched);
> >  
> >  /**
> > @@ -1514,6 +1548,9 @@ ring_buffer_lock_reserve(struct ring_buffer *buffer, unsigned long length)
> >  	/* If we are tracing schedule, we don't want to recurse */
> >  	resched = ftrace_preempt_disable();
> >  
> > +	if (trace_recursive_lock())
> > +		goto out_nocheck;
> 
> This is obviously a good idea, but i'm nervous about two things 
> here: overhead and placement.

I'm not sure how much overhead this actually adds. It only modifies a
field in current, without the need for any atomic operations.
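
For reference, the lock side that pairs with the unlock above looks
roughly like this (a sketch from memory, not the exact patch code):

#include <linux/hardirq.h>	/* in_nmi(), in_irq(), in_softirq() */
#include <linux/sched.h>	/* current */

/* Pick a recursion bit based on the context we are running in.
 * The bit lives in a plain per-task field, so no atomics are needed. */
static int trace_irq_level(void)
{
	if (in_nmi())
		return 3;
	if (in_irq())
		return 2;
	if (in_softirq())
		return 1;
	return 0;		/* normal process context */
}

static int trace_recursive_lock(void)
{
	int level = trace_irq_level();

	/* Same context level is already inside the tracer: recursion. */
	if (unlikely(current->trace_recursion & (1 << level)))
		return -1;

	current->trace_recursion |= 1 << level;
	return 0;
}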

> 
> Firstly, the lock/unlock sequence adds to the critical path of all 
> the tracing infrastructure and makes it slower. Secondly, we do this 
> in ring_buffer_lock_reserve() only - instead of doing it at the 
> outmost layer (where recursion protection is the most effective).

The problem with doing it in the outermost layer is that I may need
to disable preemption. Once I do that, I need to be careful about
calling preempt_enable vs. preempt_enable_noresched. If you call
preempt_enable from within the scheduler, you may end up with
infinite recursion. If you use the noresched variant, you may miss a
preemption point.

The ftrace_preempt_disable helper handles this, but the matching enable 
needs to be done in another function, which means the state has to be 
passed to that function. It can be done (the ring buffer already does 
this), but instead of duplicating the code the ring buffer has, I simply 
moved the checks into the ring buffer.
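
For reference, the idiom is roughly this (a sketch, not a verbatim
copy of what is in kernel/trace/trace.h):

/* Remember whether a reschedule was already pending when preemption
 * was disabled, so the matching enable does not call into the
 * scheduler from within the scheduler. */
static inline int ftrace_preempt_disable(void)
{
	int resched = need_resched();

	preempt_disable_notrace();
	return resched;
}

static inline void ftrace_preempt_enable(int resched)
{
	if (resched)
		preempt_enable_no_resched_notrace();
	else
		preempt_enable_notrace();
}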

> 
> So while i've pulled it, it would be nice to rework the concepts 
> here comprehensively. We have a number of atomics, layers, checks, 
> and all that adds overhead and complexity.

Agreed. I added this because I thought the lockdep bug was due to
recursion. It was, but it turned out to be a spinlock in trace_printk.
I have yet to hit a bug that this code would have detected, so I'm not 
sure yet that it is needed. Especially once the ring buffer becomes 
lockless, the chance of recursion being a problem becomes even smaller.

> 
> The best would be to just face it and disable IRQs via 
> raw_local_irq_save() plus set the trace recursion flag - and do this 
> around all trace point facilities and as early as possible (they are 
> not allowed to sleep anyway).

The ring buffer already disables preemption, but disabling interrupts 
would be an even greater hit on performance. And it gives no protection 
against NMIs, which can also call into tracing.

> 
> That would remove many of the "what if an IRQ recurses" uglinesses. 
> CLI/POPF pairs are also very fast on most architectures (a lot of 
> performance-sensitive code in Linux depends on that). This would 
> also improve the end result: IRQs would not mix up the order of 
> trace records occasionally.

Yes, a lot of code in the kernel depends on it, but tracing is an even 
bigger hot path. The best tracers are the ones with the lowest overhead, 
and besides, disabling interrupts gives no protection against NMIs.

The order is preserved even if an IRQ happens: whichever path hits the 
real "reserve buffer" code first gets the earlier slot.

I put a lot of work into making sure the buffer can be re-entrant, and I 
would hate to just throw that away. It is also needed for tracing NMIs. 
Letting interrupts re-enter the code gives better test coverage, which 
also gives confidence that NMIs will work.

This recursion logic is something I have even thought about making a 
compile-time (DEBUG) option only, something to try out if we see a 
failure.
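
Something along these lines (a sketch; the config symbol is made up):

#ifdef CONFIG_RING_BUFFER_RECURSION_CHECK	/* made-up symbol */
/* the real checks from the patch go here */
#else
/* compile the checks away completely for production builds */
static inline int trace_recursive_lock(void)    { return 0; }
static inline void trace_recursive_unlock(void) { }
#endif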
Again, I wrote this to see if the lockdep hang was caused by the ring 
buffer or the event tracer. I could move the checks into the events and 
use a similar approach to the one the ring buffer uses for preempt 
disable. In the end it did not help with the debugging, but it did tell 
me that the ring buffer and event tracer were not the reason for the 
hang. It was the spinlock in trace_printk that caused the lockup. I could 
have sworn I wrote a patch for that code before; I think it was lost in 
the rewrites.

-- Steve

