linux-kernel - Re: [PATCH] netfilter: use per-cpu recursive lock (v11)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090422073524.GA31835@elte.hu>
Date:	Wed, 22 Apr 2009 09:35:24 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Eric Dumazet <dada1@...mosbay.com>
Cc:	Stephen Hemminger <shemminger@...tta.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Paul Mackerras <paulus@...ba.org>, paulmck@...ux.vnet.ibm.com,
	Evgeniy Polyakov <zbr@...emap.net>,
	David Miller <davem@...emloft.net>, kaber@...sh.net,
	jeff.chua.linux@...il.com, laijs@...fujitsu.com,
	jengelh@...ozas.de, r000n@...0n.net, linux-kernel@...r.kernel.org,
	netfilter-devel@...r.kernel.org, netdev@...r.kernel.org,
	benh@...nel.crashing.org, mathieu.desnoyers@...ymtl.ca
Subject: Re: [PATCH] netfilter: use per-cpu recursive lock (v11)


* Eric Dumazet <dada1@...mosbay.com> wrote:

> Ingo Molnar a écrit :
> > 
> > Why not use the obvious solution: a _single_ wrlock for global 
> > access and read_can_lock() plus per cpu locks in the fastpath?
> 
> Obvious is not the qualifier I would use :)
> 
> Brilliant yes :)

thanks :)

> > That way there's no global cacheline bouncing (just the 
> > _reading_ of a global cacheline - which will be nicely localized 
> > - on NUMA too) - and we will hold at most 1-2 locks at once!
> > 
> > Something like:
> > 
> > 	__cacheline_aligned DEFINE_RWLOCK(global_wrlock);
> > 
> > 	DEFINE_PER_CPU(rwlock_t local_lock);
> > 
> > 
> > 	void local_read_lock(void)
> > 	{
> > 	again:
> > 		read_lock(&per_cpu(local_lock, this_cpu));
> 
> Hmm... here we can see global_wrlock locked by on writer, while 
> this cpu already called local_read_lock(), and calls again this 
> function -> Deadlock, because we hold our local_lock locked.

Yeah, indeed.

I wasnt really concentrating on the nested case, i was concentrating 
on the scalability and lock nesting angle. I think the code 
submitted here looks rather incestous in that regard.

Allowing nested locking _on the same CPU_ is asking for trouble. Use 
short critical sections and if there's any exclusion needed, use an 
irq-safe lock or a softirq-safe lock. Problem solved.

> Very interesting and could be changed to use spinlock + depth per 
> cpu.
> 
> -> we can detect recursion and avoid the deadlock, and we only use 
> one atomic operation per lock/unlock pair in fastpath (this was 
> the reason we tried hard to use a percpu spinlock during this 
> thread)
> 
> 
> __cacheline_aligned DEFINE_RWLOCK(global_wrlock);
> 
> struct ingo_local_lock {
> 	spinlock_t lock;
> 	int depth;
> };
> DEFINE_PER_CPU(struct ingo_local_lock local_lock);
> 
> 
> void local_read_lock(void)
> {
> 	struct ingo_local_lock *lck;
> 
> 	local_bh_and_preempt_disable();
> 	lck = &get_cpu_var(local_lock);
> 	if (++lck->depth > 0) /* already locked */
> 		return;
> again:
> 	spin_lock(&lck->lock);
> 
> 	if (unlikely(!read_can_lock(&global_wrlock))) {
> 		spin_unlock(&lck->lock);
> 		/*
> 		 * Just wait for any global write activity:
> 		 */
> 		read_unlock_wait(&global_wrlock);
> 		goto again;
> 	}
> }
> 
> void global_write_lock(void)
> {
> 	write_lock(&global_wrlock);
> 
> 	for_each_possible_cpu(i)
> 		spin_unlock_wait(&per_cpu(local_lock, i));
> }
> 
> Hmm ?

Yeah, this looks IRQ-nesting safe. But i really have to ask: why 
does this code try so hard to allow same-CPU nesting?

Nesting on the same CPU is _bad_ for several reasons:

1) Performance: it rips apart critical sections cache-wise. Instead 
   of a nice:

       C1 ... C2 ... C3 ... C4

   serial sequence of critical sections, we get:
 
       C1A ... ( C2 ) ... C1B ... C3 ... C4

   Note that C1 got "ripped apart" into C1A and C1B with C2 injected 
   - reducing cache locality between C1A and C1B. We have to execute
   C1B no matter what, so we didnt actually win anything in terms of 
   total work to do, by processing C2 out of order.

   [ Preemption of work (which this kind of nesting is really about)
     is _the anti-thesis of performance_, and we try to delay it as
     much as possible and we try to batch up as much as possible. 
     For example the CPU scheduler will try _real_ hard to not 
     preempt a typical workload, as long as external latency 
     boundaries allow that. ]

2) Locking complexity and robustness. Nested locking is rank #1 in
   terms of introducing regressions into the kernel.

3) Instrumentation/checking complexity. Checking locking 
   dependencies is good and catches a boatload of bugs before they 
   hit upstream, and nested locks are supported but cause an 
   exponential explosion in terms of dependencies to check.

   Also, whenever instrumentation explodes is typically the sign of 
   some true, physical complexity that has been introduced into the 
   code. So it often is a canary for a misdesign at a fundamental 
   level, not a failure in the instrumentation framework.

In the past i saw lock nesting often used as a wrong solution when 
the critical sections were too long (causing too long latencies for 
critical work - e.g. delaying hardirq completion processing 
unreasonably), or just plain out of confusion about the items above.

I dont know whether that's the case here - it could be one of the 
rare exceptions calling for a new locking primitive (which should 
then be introduced at the core kernel level IMHO) - i dont know the 
code that well.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/