linux-kernel - Re: [PATCH RFC 0/4] change sb_writers to use percpu_rw

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150716173256.GA17753@redhat.com>
Date:	Thu, 16 Jul 2015 19:32:56 +0200
From:	Oleg Nesterov <oleg@...hat.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Dave Hansen <dave.hansen@...ux.intel.com>,
	Al Viro <viro@...iv.linux.org.uk>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Paul McKenney <paulmck@...ux.vnet.ibm.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Daniel Wagner <daniel.wagner@...-carit.de>,
	Davidlohr Bueso <dave@...olabs.net>,
	Ingo Molnar <mingo@...hat.com>, Tejun Heo <tj@...nel.org>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 0/4] change sb_writers to use percpu_rw_semaphore

On 07/16, Jan Kara wrote:
>
> On Wed 15-07-15 20:19:20, Oleg Nesterov wrote:
> >
> > Perhaps it makes to merge other 2 patches from Dave first? (those which
> > change __sb_start/end_write to rely on RCU). Afaics these changes are
> > straightforward and correct. Although I'd suggest to use preempt_disable()
> > and synchronize_sched() instead. I will be happy to (try to) make this
> > conversion on top of his changes.
> >
> > Because I do not want to delay the performance improvements and I do not
> > know when exactly I'll send the next version: I need to finish the previous
> > discussion about rcu_sync first. And the necessary changes in fs/super.c
> > depend on whether percpu_rw_semaphore will have rcu_sync or not (not too
> > much, only destroy_super() depends, but still).
> >
> > And of course, I am worried that I missed something and percpu_rw_semaphore
> > can't work for some reason. The code in fs/super.c looks simple, but it
> > seems that filesystems do the "strange" things with lockdep at least.
>
> So Dave's patches would go in only in the next merge window anyway so we
> still have like two-three weeks to decide which patchset to take.

OK, good.

> If you
> think it will take you longer,

Hopefully not.

> then merging Dave's patches makes some sense
> although I personally don't think the issue is so important that we have to
> fix it ASAP and eventual delay of one more release would be OK for me.

OK. I will try to do this in any case, I just wanted to say that I can
equally do this on top of Dave's patches.

To remind, I need to finish the discussion about percpu_rw_semaphore
and rcu_sync, then I'll try to make v2.

And. The biggest problem is lockdep. Everything else looks really simple
although of course I could miss something. And not only because the
filesystems abuse lockdep and thus we need some cleanups first. Unless
I am totally confused fs/super.c needs some cleanups (and fixes) too,
with or without this conversion. Say, acquire_freeze_lock() logic does
do not look right:

	- wait_event(.frozen < level) without rwsem_acquire_read() is
	  just wrong from lockdep perspective. If we are going to deadlock
	  because the caller is buggy, lockdep can't warn us.

	- __sb_start_write() can race with thaw_super() + freeze_super(),
	  and after "goto retry" the 2nd  acquire_freeze_lock() is wrong.

	- The "tell lockdep we are doing trylock" hack doesn't look nice.

	  I think this is correct, but this logic should be more explicit.
	  Yes, the recursive read_lock() is fine if we hold the lock on
	  higher level. But we do not need to fool lockdep. If we can not
	  deadlock in this case (and I agree we can't) we should simply use
	  wait == F consistently.

Something like this (not even compiled tested):

	static int
	do_sb_start_write(struct super_block *sb, int level, bool wait, unsigned long ip)
	{

		if (wait)
			rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
	retry:
		if (unlikely(sb->s_writers.frozen >= level)) {
			if (!wait)
				return 0;
			wait_event(sb->s_writers.wait_unfrozen,
				   sb->s_writers.frozen < level);
		}

		percpu_counter_inc(&sb->s_writers.counter[level-1]);
		/*
		 * Make sure counter is updated before we check for frozen.
		 * freeze_super() first sets frozen and then checks the counter.
		 */
		smp_mb();
		if (unlikely(sb->s_writers.frozen >= level)) {
			__sb_end_write(sb, level);
			goto retry;
		}

		if (!wait)
			rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip);
		return 1;
	}

	/*
	 * This is an internal function, please use sb_start_{write,pagefault,intwrite}
	 * instead.
	 */
	int __sb_start_write(struct super_block *sb, int level, bool wait)
	{
		bool cantbelocked = false;
		int ret;

	#ifdef CONFIG_LOCKDEP
		/*
		 * We want lockdep to tell us about possible deadlocks with freezing but
		 * it's it bit tricky to properly instrument it. Getting a freeze protection
		 * works as getting a read lock but there are subtle problems. XFS for example
		 * gets freeze protection on internal level twice in some cases, which is OK
		 * only because we already hold a freeze protection also on higher level. Due
		 * to these cases we have to use wait == F (trylock mode) which must not fail.
		 */
		if (wait) {
			int i;

			for (i = 0; i < level - 1; i++)
				if (lock_is_held(&sb->s_writers.lock_map[i])) {
					cantbelocked = true;
					break;
				}
		}
	#endif
		ret = do_sb_start_write(sb, level, wait && !cantbelocked, _RET_IP_);
		WARN_ON(cantbelocked & !ret);
		return ret;
	}

This should not generate the additional code if CONFIG_LOCKDEP=n and
After this patch it will be trivial to convert __sb_start_write(), but
we need some more cleanups. And perhaps I'll send some changes (like
above) separately, because again, I think they make sense in any case.

In short: I'll try to make v2 asap, but this is all I can promise ;)

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/