Date: Mon, 6 Jun 2016 22:15:23 +0100
From: Al Viro <viro@...IV.linux.org.uk>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Dave Hansen <dave.hansen@...el.com>,
"Chen, Tim C" <tim.c.chen@...el.com>,
Ingo Molnar <mingo@...hat.com>,
Davidlohr Bueso <dbueso@...e.de>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Jason Low <jason.low2@...com>,
Michel Lespinasse <walken@...gle.com>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Waiman Long <waiman.long@...com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: performance delta after VFS i_mutex=>i_rwsem conversion
On Mon, Jun 06, 2016 at 01:46:23PM -0700, Linus Torvalds wrote:
> So my gut feel is that we do want to have the same heuristics for
> rwsems and mutexes (well, modulo possible actual semantic differences
> due to the whole shared-vs-exclusive issues).
>
> And I also suspect that the mutexes have gotten a lot more performance
> tuning done on them, so it's likely the correct thing to try to make
> the rwsem match the mutex code rather than the other way around.
>
> I think we had Jason and Davidlohr do mutex work last year, let's see
> if they agree on that "yes, the mutex case is the likely more tuned
> case" feeling.
>
> The fact that your performance improves when you do that obviously
> then also validates the assumption that the mutex spinning is the
> better optimized one.
FWIW, there's another fun issue on ramfs - dcache_readdir() is doing an
obscene amount of grabbing/releasing ->d_lock and once you take the external
serialization out, parallel getdents load hits contention on *that*.
In spades. And unlike mutex (or rwsem exclusive), contention on ->d_lock
chews a lot of cycles. The root cause is the use of cursors - we not only
move them more than we ought to (we do that on each entry reported, rather
than once before return from dcache_readdir()), we can't traverse the real
list entries (which remain nice and stable; another low-hanging fruit is
pointless grabbing ->d_lock on those) without ->d_lock on parent.
I think I have a kinda-sorta solution, but it has a problem. What I want
to do is
	* list_move() only once per dcache_readdir()
	* ->d_lock taken for that and only for that.
	* list_move() itself surrounded with write_seqcount_{begin,end} on
	  some seqcount
	* traversal to the next real entry done under rcu_read_lock in a
	  seqretry loop.
The only problem is where to put that seqcount (unsigned int, really).
->i_dir_seq is an obvious candidate, but that'll need careful profiling
on getdents/lookup mixes...
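
For illustration, the scheme above could be sketched in userspace C roughly
as follows. This is not the actual patch and not kernel code: struct seqcount,
struct node, cursor_move() and next_real() are hypothetical names made up for
the sketch, the counter uses default (sequentially consistent) C11 atomics
instead of the kernel's write_seqcount_{begin,end}, and RCU is not modelled
at all (nodes are simply never freed). The writer is assumed to hold whatever
lock serializes list updates, which would be ->d_lock of the parent here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's seqcount_t. */
struct seqcount { atomic_uint seq; };

/* Circular list node; a cursor is a placeholder, not a real entry. */
struct node {
	struct node *_Atomic next;	/* followed by lockless readers */
	struct node *prev;		/* writer-only, under the lock */
	bool is_cursor;
	int id;
};

static void write_begin(struct seqcount *s) { atomic_fetch_add(&s->seq, 1); }
static void write_end(struct seqcount *s)   { atomic_fetch_add(&s->seq, 1); }

/* Wait out an in-progress update (odd count), then return the snapshot. */
static unsigned read_begin(struct seqcount *s)
{
	unsigned v;

	while ((v = atomic_load(&s->seq)) & 1)
		;
	return v;
}

static bool read_retry(struct seqcount *s, unsigned start)
{
	return atomic_load(&s->seq) != start;
}

static void list_init(struct node *head)
{
	atomic_init(&head->next, head);
	head->prev = head;
	head->is_cursor = false;
	head->id = 0;
}

/* Setup helper; not meant to run concurrently with readers. */
static void list_add_tail(struct node *head, struct node *n, int id, bool is_cursor)
{
	n->id = id;
	n->is_cursor = is_cursor;
	n->prev = head->prev;
	atomic_init(&n->next, head);
	atomic_store(&head->prev->next, n);
	head->prev = n;
}

/*
 * The one list_move() per readdir call. The caller is assumed to hold
 * the lock that serializes writers (->d_lock of the parent in the mail).
 */
static void cursor_move(struct seqcount *s, struct node *cursor, struct node *after)
{
	struct node *next;

	assert(cursor->is_cursor && after != cursor);
	write_begin(s);
	/* unlink the cursor */
	next = atomic_load(&cursor->next);
	atomic_store(&cursor->prev->next, next);
	next->prev = cursor->prev;
	/* relink it right after 'after' */
	next = atomic_load(&after->next);
	cursor->prev = after;
	atomic_store(&cursor->next, next);
	next->prev = cursor;
	atomic_store(&after->next, cursor);
	write_end(s);
}

/* Lockless step to the next real entry after 'from'; NULL at end of list. */
static struct node *next_real(struct seqcount *s, struct node *head, struct node *from)
{
	struct node *p;
	unsigned start;

	do {
		start = read_begin(s);
		p = atomic_load(&from->next);
		while (p != head && p->is_cursor)
			p = atomic_load(&p->next);
	} while (read_retry(s, start));
	return p == head ? NULL : p;
}
```

The point of the shape is that readers never take the lock: they skip over
cursors and simply redo the walk if the counter changed underneath them,
while the writer pays for the lock and two counter bumps once per call
rather than once per entry.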