linux-kernel - Re: performance delta after VFS i_mutex=>i

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160606211522.GF14480@ZenIV.linux.org.uk>
Date:	Mon, 6 Jun 2016 22:15:23 +0100
From:	Al Viro <viro@...IV.linux.org.uk>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Dave Hansen <dave.hansen@...el.com>,
	"Chen, Tim C" <tim.c.chen@...el.com>,
	Ingo Molnar <mingo@...hat.com>,
	Davidlohr Bueso <dbueso@...e.de>,
	"Peter Zijlstra (Intel)" <peterz@...radead.org>,
	Jason Low <jason.low2@...com>,
	Michel Lespinasse <walken@...gle.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Waiman Long <waiman.long@...com>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: performance delta after VFS i_mutex=>i_rwsem conversion

On Mon, Jun 06, 2016 at 01:46:23PM -0700, Linus Torvalds wrote:

> So my gut feel is that we do want to have the same heuristics for
> rwsems and mutexes (well, modulo possible actual semantic differences
> due to the whole shared-vs-exclusive issues).
> 
> And I also suspect that the mutexes have gotten a lot more performance
> tuning done on them, so it's likely the correct thing to try to make
> the rwsem match the mutex code rather than the other way around.
> 
> I think we had Jason and Davidlohr do mutex work last year, let's see
> if they agree on that "yes, the mutex case is the likely more tuned
> case" feeling.
> 
> The fact that your performance improves when you do that obviously
> then also validates the assumption that the mutex spinning is the
> better optimized one.

FWIW, there's another fun issue on ramfs - dcache_readdir() is doing an
obscene amount of grabbing/releasing ->d_lock and once you take the external
serialization out, parallel getdents load hits contention on *that*.
In spades.  And unlike mutex (or rswem exclusive), contention on ->d_lock
chews a lot of cycles.  The root cause is the use of cursors - we not only
move them more than we ought to (we do that on each entry reported, rather
than once before return from dcache_readdir()), we can't traverse the real
list entries (which remain nice and stable; another low-hanging fruit is
pointless grabbing ->d_lock on those) without ->d_lock on parent.

I think I have a kinda-sorta solution, but it has a problem.  What I want
to do is
	* list_move() only once per dcache_readdir()
	* ->d_lock taken for that and only for that.
	* list_move() itself surrounded with write_seqcount_{begin,end} on
some seqcount
	* traversal to the next real entry done under rcu_read_lock in a
seqretry loop.

The only problem is where to put that seqcount (unsigned int, really).
->i_dir_seq is an obvious candidate, but that'll need careful profiling
on getdents/lookup mixes...