linux-kernel - Re: [patch 1/6] fs: icache RCU free inodes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Wed, 10 Nov 2010 09:05:06 +1100
From:	Nick Piggin <npiggin@...nel.dk>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Eric Dumazet <eric.dumazet@...il.com>,
	Nick Piggin <npiggin@...nel.dk>,
	Al Viro <viro@...iv.linux.org.uk>,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [patch 1/6] fs: icache RCU free inodes

On Tue, Nov 09, 2010 at 09:08:17AM -0800, Linus Torvalds wrote:
> On Tue, Nov 9, 2010 at 8:21 AM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> >
> > You can see problems using this fancy thing :
> >
> > - Need to use slab ctor() to not overwrite some sensitive fields of
> > reused inodes.
> >  (spinlock, next pointer)
> 
> Yes, the downside of using SLAB_DESTROY_BY_RCU is that you really
> cannot initialize some fields in the allocation path, because they may
> end up being still used while allocating a new (well, re-used) entry.
> 
> However, I think that in the long run we pretty much _have_ to do that
> anyway, because the "free each inode separately with RCU" is a real
> overhead (Nick reports 10-20% cost). So it just makes my skin crawl to
> go that way.

This is a creat/unlink loop on a tmpfs filesystem. Any real filesystem
is going to be *much* heavier in creat/unlink (so that 10-20% cost would
look more like a few %), and any real workload is going to have much
less intensive pattern.

Much of the performance hit comes from draining the allocator's queues
and going into slow paths there, the remainder is due to cache coldness
of the new objects being allocated, and rcu queueing.

But if you allocate and free in larger batches, RCU and non-RCU have
the same biggest problem of draining slab queues and getting cache
cold objects.

So that remaining few % I suspect goes down to a few points of a %.

Anyway, I actually have tried to measure a regression on slightly more
realistic inode heavy workloads like fs_mark, and couldn't measure any.
On the flip side, I could measure huge improvements with rcu-walk, so
I think it is a valid tradeoff, with an eye to having a few fallback
plans if we really need them.

> And I think SLAB_DESTROY_BY_RCU is the "normal" way to do
> these kinds of things anyway, so I actually think it's "simpler", if
> only because it's the common pattern.
> 
> (Put another way: it might not be less code, and it might have its own
> subtle issues, but they are _shared_ subtle issues with all the other
> SLAB_DESTROY_BY_RCU users, so we hopefully have a better understanding
> of them)
> 
> > - Fancy algo to detect an inode moved from one chain to another. Lookups
> > should be able to detect and restart their loop.
> 
> So this is where I think we should just use locks unless we have hard
> numbers to say that being clever is worth it.

Yeah, it's really not a bit deal.

> > - After a match, need to get a stable reference on inode (lock), then
> > recheck the keys to make sure the target inode is the right one.
> 
> Again, this is only an issue for non-dentry lookup. For the dentry
> case, we know that if the dentry still exists, then the inode still
> exists. So we don't need to check a stable inode pointer if we just
> verify the stability of the dentry - and we'll have to do that anyway
> obviously.

With rcu-walk, we don't know that a dentry still exists -- we can't
store anything to it to pin it. I come back and verify after the
fact that it hasn't changed or gone away, and that enables us to know
that the inode has not gone away too.

But in the intermediate stage of doing access verification on the inode,
it really sucks if it suddenly comes back as something else. Not saying
it is impossible, but the traditional easy model of "lock, re-check" for
DESTROY_BY_RCU does not work with rcu-walk. Believe me I've looked at
doing it.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/