netdev - Re: [PATCH[] netfilter: use per-cpu reader-writer lock (v0.7)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20090416230457.GA16226@elte.hu>
Date:	Fri, 17 Apr 2009 01:04:57 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Stephen Hemminger <shemminger@...tta.com>,
	Eric Dumazet <dada1@...mosbay.com>, paulmck@...ux.vnet.ibm.com,
	Patrick McHardy <kaber@...sh.net>,
	David Miller <davem@...emloft.net>, jeff.chua.linux@...il.com,
	paulus@...ba.org, laijs@...fujitsu.com, jengelh@...ozas.de,
	r000n@...0n.net, linux-kernel@...r.kernel.org,
	netfilter-devel@...r.kernel.org, netdev@...r.kernel.org,
	benh@...nel.crashing.org
Subject: Re: [PATCH[] netfilter: use per-cpu reader-writer lock (v0.7)

* Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Thu, 16 Apr 2009, Stephen Hemminger wrote:
> >
> > This version of x_tables (ip/ip6/arp) locking uses a per-cpu
> > rwlock that can be nested. It is sort of like earlier brwlock 
> > (fast reader, slow writer). The locking is isolated so future improvements
> > can concentrate on measuring/optimizing xt_table_info_lock. I tried
> > other versions based on recursive spin locks and sequence counters and 
> > for me, the risk of inventing own locking primitives not worth it at this time.
> 
> This is stil scary.
> 
> Do we guarantee that read-locks nest in the presense of a waiting 
> writer on another CPU? Now, I know we used to (ie readers always 
> nested happily with readers even if there were pending writers), 
> and then we broke it. I don't know that we ever unbroke it.
> 
> IOW, at least at some point we deadlocked on this (due to trying 
> to be fair, and not lettign in readers while earlier writers were 
> waiting):
> 
> 	CPU#1			CPU#2
> 
> 	read_lock
> 
> 				write_lock
> 				.. spins with write bit set, waiting for
> 				   readers to go away ..
> 
> 	recursive read_lock
> 	.. spins due to the write bit
> 	   being. BOOM: deadlock  ..
> 
> Now, I _think_ we avoid this, but somebody should double-check.

This is a narrow special-case where the spin-rwlock is safe, and the 
rwsem is unsafe.

But it should work for rwlocks - it always worked and the networking 
code always relied on that AFAIK.

Here's the x86 assembly code of the write-side slowpath:

ENTRY(__write_lock_failed)
        CFI_STARTPROC
        LOCK_PREFIX
        addl $RW_LOCK_BIAS,(%rdi)
1:      rep
        nop
        cmpl $RW_LOCK_BIAS,(%rdi)
        jne 1b
        LOCK_PREFIX
        subl $RW_LOCK_BIAS,(%rdi)
        jnz  __write_lock_failed
        ret
        CFI_ENDPROC

the fastpath decreased the value with RW_LOCK_BIAS, and when we 
enter this function we undo that effect by adding RW_LOCK_BIAS. Then 
we spin (without signalling our write-intent) passively until the 
count reaches RW_LOCK_BIAS. Then we try to lock it again and bring 
it to zero (meaning no other readers or writers - we got the lock).

This is pretty much the most unfair strategy possible for writers - 
but this is how rwlocks always behaved - and they do so mostly for 
recursive use within networking.

This is why the tasklist_lock was always so suspect to insane 
starvation symptoms on really large SMP systems, and this is why 
write_lock_irq(&tasklist_lock) was always a dangerous operation to 
do. (it can spin for a _long_ time with irqs off.)

It's not the most optimal of situations. Some patches are in the 
works to fix the irqs-off artifact (on ia64 - no x86 patches yet 
AFAICS) - but that's just papering it over.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html