[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150716173256.GA17753@redhat.com>
Date: Thu, 16 Jul 2015 19:32:56 +0200
From: Oleg Nesterov <oleg@...hat.com>
To: Jan Kara <jack@...e.cz>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>,
Al Viro <viro@...iv.linux.org.uk>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Paul McKenney <paulmck@...ux.vnet.ibm.com>,
Peter Zijlstra <peterz@...radead.org>,
Daniel Wagner <daniel.wagner@...-carit.de>,
Davidlohr Bueso <dave@...olabs.net>,
Ingo Molnar <mingo@...hat.com>, Tejun Heo <tj@...nel.org>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 0/4] change sb_writers to use percpu_rw_semaphore
On 07/16, Jan Kara wrote:
>
> On Wed 15-07-15 20:19:20, Oleg Nesterov wrote:
> >
> > Perhaps it makes to merge other 2 patches from Dave first? (those which
> > change __sb_start/end_write to rely on RCU). Afaics these changes are
> > straightforward and correct. Although I'd suggest to use preempt_disable()
> > and synchronize_sched() instead. I will be happy to (try to) make this
> > conversion on top of his changes.
> >
> > Because I do not want to delay the performance improvements and I do not
> > know when exactly I'll send the next version: I need to finish the previous
> > discussion about rcu_sync first. And the necessary changes in fs/super.c
> > depend on whether percpu_rw_semaphore will have rcu_sync or not (not too
> > much, only destroy_super() depends, but still).
> >
> > And of course, I am worried that I missed something and percpu_rw_semaphore
> > can't work for some reason. The code in fs/super.c looks simple, but it
> > seems that filesystems do the "strange" things with lockdep at least.
>
> So Dave's patches would go in only in the next merge window anyway so we
> still have like two-three weeks to decide which patchset to take.
OK, good.
> If you
> think it will take you longer,
Hopefully not.
> then merging Dave's patches makes some sense
> although I personally don't think the issue is so important that we have to
> fix it ASAP and eventual delay of one more release would be OK for me.
OK. I will try to do this in any case, I just wanted to say that I can
equally do this on top of Dave's patches.
To remind, I need to finish the discussion about percpu_rw_semaphore
and rcu_sync, then I'll try to make v2.
And. The biggest problem is lockdep. Everything else looks really simple
although of course I could miss something. And not only because the
filesystems abuse lockdep and thus we need some cleanups first. Unless
I am totally confused fs/super.c needs some cleanups (and fixes) too,
with or without this conversion. Say, acquire_freeze_lock() logic does
do not look right:
- wait_event(.frozen < level) without rwsem_acquire_read() is
just wrong from lockdep perspective. If we are going to deadlock
because the caller is buggy, lockdep can't warn us.
- __sb_start_write() can race with thaw_super() + freeze_super(),
and after "goto retry" the 2nd acquire_freeze_lock() is wrong.
- The "tell lockdep we are doing trylock" hack doesn't look nice.
I think this is correct, but this logic should be more explicit.
Yes, the recursive read_lock() is fine if we hold the lock on
higher level. But we do not need to fool lockdep. If we can not
deadlock in this case (and I agree we can't) we should simply use
wait == F consistently.
Something like this (not even compiled tested):
static int
do_sb_start_write(struct super_block *sb, int level, bool wait, unsigned long ip)
{
if (wait)
rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
retry:
if (unlikely(sb->s_writers.frozen >= level)) {
if (!wait)
return 0;
wait_event(sb->s_writers.wait_unfrozen,
sb->s_writers.frozen < level);
}
percpu_counter_inc(&sb->s_writers.counter[level-1]);
/*
* Make sure counter is updated before we check for frozen.
* freeze_super() first sets frozen and then checks the counter.
*/
smp_mb();
if (unlikely(sb->s_writers.frozen >= level)) {
__sb_end_write(sb, level);
goto retry;
}
if (!wait)
rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip);
return 1;
}
/*
* This is an internal function, please use sb_start_{write,pagefault,intwrite}
* instead.
*/
int __sb_start_write(struct super_block *sb, int level, bool wait)
{
bool cantbelocked = false;
int ret;
#ifdef CONFIG_LOCKDEP
/*
* We want lockdep to tell us about possible deadlocks with freezing but
* it's it bit tricky to properly instrument it. Getting a freeze protection
* works as getting a read lock but there are subtle problems. XFS for example
* gets freeze protection on internal level twice in some cases, which is OK
* only because we already hold a freeze protection also on higher level. Due
* to these cases we have to use wait == F (trylock mode) which must not fail.
*/
if (wait) {
int i;
for (i = 0; i < level - 1; i++)
if (lock_is_held(&sb->s_writers.lock_map[i])) {
cantbelocked = true;
break;
}
}
#endif
ret = do_sb_start_write(sb, level, wait && !cantbelocked, _RET_IP_);
WARN_ON(cantbelocked & !ret);
return ret;
}
This should not generate the additional code if CONFIG_LOCKDEP=n and
After this patch it will be trivial to convert __sb_start_write(), but
we need some more cleanups. And perhaps I'll send some changes (like
above) separately, because again, I think they make sense in any case.
In short: I'll try to make v2 asap, but this is all I can promise ;)
Oleg.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists