linux-kernel - Re: [PATCH RFC ticketlock] Auto-queued ticketlock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+55aFw2=3ua32vkpWc_9ckCNweFmmH3c0d-VqoaUD86rkR-=w@mail.gmail.com>
Date:	Wed, 12 Jun 2013 17:01:19 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Al Viro <viro@...iv.linux.org.uk>
Cc:	Davidlohr Bueso <davidlohr.bueso@...com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Paul McKenney <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...e.hu>,
	"?????????" <laijs@...fujitsu.com>,
	Dipankar Sarma <dipankar@...ibm.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	Josh Triplett <josh@...htriplett.org>, niv@...ibm.com,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <peterz@...radead.org>,
	Valdis Kletnieks <Valdis.Kletnieks@...edu>,
	David Howells <dhowells@...hat.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Darren Hart <darren@...art.com>,
	"Fr??d??ric Weisbecker" <fweisbec@...il.com>,
	Silas Boyd-Wickizer <sbw@....edu>,
	Waiman Long <Waiman.Long@...com>
Subject: Re: [PATCH RFC ticketlock] Auto-queued ticketlock

On Wed, Jun 12, 2013 at 4:32 PM, Al Viro <viro@...iv.linux.org.uk> wrote:
> On Wed, Jun 12, 2013 at 01:26:25PM -0700, Linus Torvalds wrote:
>>
>> However, one optimization missing from your patch is obvious in the
>> profile. "dget_parent()" also needs to be optimized - you still have
>> that as 99% of the spin-lock case. I think we could do something like
>>
>>    rcu_read_lock();
>>    parent = ACCESS_ONCE(dentry->d_parent);
>>    if (atomic_inc_nonzero(&parent->d_count))
>>       return parent;
>>    .. get d_lock and do it the slow way ...
>>    rcu_read_unlock();
>>
>> to locklessly get the parent pointer. We know "parent" isn't going
>> away (dentries are rcu-free'd and we hold the rcu read lock), and I
>> think that we can optimistically take *any* parent dentry that
>> happened to be valid at one point. As long as the refcount didn't go
>> down to zero. Al?
>
> What will you do with __d_rcu_to_refcount()?  Any such scheme has to
> hold d_lock from zero->non-zero d_count changes, or atomic_dec_and_lock
> in dput() won't help at all.

I'd actually suggest we do *not* remove any existing d_lock usage
outside of the particular special cases we want to optimize, which at
least from Davidlohr's profile is just dput() (which has shown up a
lot before) and dget_parent() (which I'm not sure why it happens so
much on his load, but it really seems trivially safe to optimistically
do under just the RCU lock).

>  As it is, both complete_walk() and unlazy_walk()
> are grabbing ->d_lock on the dentry we'd reached, so they can call that
> sucker.  And that'll give you ->d_lock contention when a bunch of threads
> are hitting the same file; I don't see how atomics would avoid that
> one...

I'd love to get rid of complete_walk() using the dcache lock too, but
if we really can't get rid of it, I won't cry.

That said, I do wonder if we could do something like
"atomic_inc_not_zero()" on the d_count, and only if it is zero (which
won't be horribly unusual, since for leaf dentries that nobody else is
using) we'd do the whole locking sequence.

But my first reaction is to not even bother until it shows up on some
profile. Of course, maybe it immediately does.

There's a real downside to making d_count an "atomic_t", and there
might be loads where it actually bites us. But even in the absense of
contention, atomics tend to be sufficiently faster than spinlocks that
even if we end up adding two or even three atomics for each d_lock
lock we get rid of, we should be ok even for single-thread. On the
contention case, we'll obviously win almost regardless of how many
atomics we add.

Of course, that assumes we get rid of any locking at all for the
normal case. with dput() having to take the lock when the refcount
goes to zero, and most single-threaded file opens using the final path
component with a dentry that isn't otherwise used, doing an atomic
d_count might hurt the single-thread case without getting any spinlock
advantage at all. dget_parent() would be the only case that helps even
there, and that should normally only happen for "..", I think.

So who knows.. Are we willing to take a hit on the single-thread case
(*if* that is the case) if it helps scalability a lot? If it was a
common scalability issue, sure. But just for AIM7 looking up
/etc/passwd? Maybe that's not a good idea.

Of course, for the mostly non-contended case, it's quite possible that
the extra atomic isn't even noticeable. As long as the dentry is dirty
in the cache (which it would be, normally), the atomic cost is just in
the 10-20 cycles.

End result: I think it would be interesting to try this all out, and
it could be a noticeable win under some cases, but it *definitely*
needs a lot of timing and testing to see which ways it goes..

            Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/