linux-kernel - Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and dcache scaling tree update and status)

Message-ID: <20101213033110.GA7898@amd>
Date:	Mon, 13 Dec 2010 14:31:10 +1100
From:	Nick Piggin <npiggin@...nel.dk>
To:	Nick Piggin <npiggin@...nel.dk>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Al Viro <viro@...IV.linux.org.uk>,
	Stephen Rothwell <sfr@...b.auug.org.au>,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [patch] fs: scale vfsmount refcount (was Re: rcu-walk and
 dcache scaling tree update and status)

On Mon, Dec 13, 2010 at 01:42:17PM +1100, Nick Piggin wrote:
> On Mon, Dec 13, 2010 at 01:37:33PM +1100, Nick Piggin wrote:
> > Final note:
> > You won't be able to reproduce the parallel path walk scalability
> > numbers that I've posted, because the vfsmount refcounting scalability
> > patch is not included. I have a new idea for that now, so I'll be asking
> > for comments with that soon.
> 
> Here is the patch I've been using, which works but has the problem
> described in the changelog. But it works nicely for testing.
> 
> As I said, I have a promising approach to solving the problem.
> 
> fs: scale mntget/mntput

[...]

> [Note: this is not for merging. Un-attached operation (lazy umount) may not be
>  uncommon and will be slowed down and actually have worse scalability after
>  this patch. I need to think about how to do fast refcounting with unattached
>  mounts.]

So the problem this patch tries to fix is vfsmount refcount scalability.
We need to take a ref for every successful path lookup, and often
lookups are going to the same mountpoint.

(Yes, this little bouncing atomic hurts badly, even on my small,
tightly connected 2s12c system on the parallel git diff workload --
because there are other bouncing kernel cachelines in this workload.)

The fundamental difficulty is that a simple refcount can never be SMP
scalable, because dropping the ref requires we check whether we are
the last reference (which implies communicating with other CPUs that
might have taken references).

We can make them scalable by keeping a local count, and checking the
global sum less frequently. Some possibilities:

- avoid checking global sum while vfsmount is mounted, because the mount
  contributes to the refcount (that is what this patch does, but it
  kills performance inside a lazy umounted subtree).

- check global sum once every time interval (this would delay mount and
  sb garbage collection, so it's probably a showstopper).

- check global sum only if local sum goes to 0 (this is difficult with
  vfsmounts because the 'get' and the 'put' can happen on different
  CPUs, so we'd need to have a per-thread refcount, or carry around the
  CPU number with the refcount, both get horribly ugly, it turns out).
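
To make the last point concrete, here is a minimal single-threaded
userspace sketch of a per-CPU refcount. It is not code from the patch:
the names (struct pcpu_ref, pcpu_ref_get, pcpu_ref_put, pcpu_ref_sum)
and the fixed NR_CPUS are assumptions for illustration, and real kernel
code would use per-cpu data with proper synchronisation.

```c
#include <assert.h>

#define NR_CPUS 4 /* hypothetical fixed CPU count for this sketch */

/* Hypothetical distributed refcount: one signed slot per CPU. */
struct pcpu_ref {
	long count[NR_CPUS]; /* incs minus decs seen on each CPU */
};

/* Take a reference on the given CPU: touches only that CPU's slot. */
static void pcpu_ref_get(struct pcpu_ref *r, int cpu)
{
	r->count[cpu]++;
}

/* Drop a reference on the given CPU: touches only that CPU's slot. */
static void pcpu_ref_put(struct pcpu_ref *r, int cpu)
{
	r->count[cpu]--;
}

/* The true refcount is only known by summing every CPU's slot. */
static long pcpu_ref_sum(const struct pcpu_ref *r)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += r->count[cpu];
	return sum;
}
```

If a reference is taken on CPU 0 and dropped on CPU 1, CPU 1's slot goes
negative while CPU 0's stays positive. Neither local value says anything
about whether the object is still referenced; only the sum does, and
computing the sum is the expensive cross-CPU step.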

My proposal is a variant / generalisation of the first idea, which is to
have "long" refcounts. Normal refcounts will be a per-CPU difference of
incs and decs, but dropping a reference will not have to check the
global sum while "long" refcounts are elevated. If being mounted counts
as a long reference, that is essentially what the current patch does.

But then I would also have cwd take the long refcount, which allows
detached operation to remain fast while there are processes working
inside the detached namespace.
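
A rough single-threaded sketch of the "long" refcount idea, under my own
assumptions rather than taken from any patch: the names (struct mnt_ref,
ref_get, ref_put, ref_get_long, ref_put_long) are hypothetical, long
references are kept in a separate counter, and the races against the
fast put path are ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical fixed CPU count for this sketch */

/* Hypothetical refcount with per-CPU normal refs plus "long" refs. */
struct mnt_ref {
	long count[NR_CPUS]; /* per-CPU incs minus decs */
	long longrefs;       /* long references, e.g. mounted or a cwd */
};

static long ref_sum(const struct mnt_ref *r)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += r->count[cpu];
	return sum;
}

/* Normal get: touches only the local CPU's slot. */
static void ref_get(struct mnt_ref *r, int cpu)
{
	r->count[cpu]++;
}

/* Normal put: returns true if this dropped the last reference. */
static bool ref_put(struct mnt_ref *r, int cpu)
{
	r->count[cpu]--;
	if (r->longrefs > 0)
		return false;    /* fast path: a long ref pins the object */
	return ref_sum(r) == 0;  /* slow path: must check the global sum */
}

/* Long get: assumed rare (mount, chdir, fork), so it may be slower. */
static void ref_get_long(struct mnt_ref *r)
{
	r->longrefs++;
}

/* Long put: returns true if this dropped the last reference. */
static bool ref_put_long(struct mnt_ref *r)
{
	r->longrefs--;
	if (r->longrefs > 0)
		return false;
	return ref_sum(r) == 0;
}
```

With both the mount and a cwd holding long references, a lazy umount
drops one long reference but the cwd keeps the fast path alive, so
lookups inside the detached tree never sum the per-CPU counters.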

Details of locking aren't completely worked out -- it's a bit more
tricky because umount can be much heavier than fork() or chdir(), so
there are some difficulties in making long refcount operations faster
(the problem is remaining race-free versus the fast mntput check, but
I think a seqcount to go with the long refcount should do the trick).
