linux-kernel - Re: [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090502011914.GA6208@nowhere>
Date:	Sat, 2 May 2009 03:19:15 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Chris Mason <chris.mason@...cle.com>
Cc:	Andi Kleen <andi@...stfloor.org>, Ingo Molnar <mingo@...e.hu>,
	LKML <linux-kernel@...r.kernel.org>,
	Jeff Mahoney <jeffm@...e.com>,
	ReiserFS Development List <reiserfs-devel@...r.kernel.org>,
	Alexander Beregalov <a.beregalov@...il.com>,
	Alessio Igor Bogani <abogani@...ware.it>,
	Jonathan Corbet <corbet@....net>,
	Alexander Viro <viro@...iv.linux.org.uk>
Subject: Re: [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on
	fs_changed()

On Fri, May 01, 2009 at 10:14:01AM -0400, Chris Mason wrote:
> On Fri, 2009-05-01 at 16:01 +0200, Frederic Weisbecker wrote:
> > On Fri, May 01, 2009 at 09:44:16AM -0400, Chris Mason wrote:
> > > On Fri, 2009-05-01 at 15:28 +0200, Frederic Weisbecker wrote:
> > > > On Fri, May 01, 2009 at 08:31:12AM +0200, Andi Kleen wrote:
> > > > > Frederic Weisbecker <fweisbec@...il.com> writes:
> > > > > >
> > > > > > diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h
> > > > > > index 6587b4e..397d281 100644
> > > > > > --- a/include/linux/reiserfs_fs.h
> > > > > > +++ b/include/linux/reiserfs_fs.h
> > > > > > @@ -1302,7 +1302,13 @@ static inline loff_t max_reiserfs_offset(struct inode *inode)
> > > > > >  #define get_generation(s) atomic_read (&fs_generation(s))
> > > > > >  #define FILESYSTEM_CHANGED_TB(tb)  (get_generation((tb)->tb_sb) != (tb)->fs_gen)
> > > > > >  #define __fs_changed(gen,s) (gen != get_generation (s))
> > > > > > -#define fs_changed(gen,s) ({cond_resched(); __fs_changed(gen, s);})
> > > > > > +#define fs_changed(gen,s)		\
> > > > > > +({					\
> > > > > > +	reiserfs_write_unlock(s);	\
> > > > > > +	cond_resched();			\
> > > > > > +	reiserfs_write_lock(s);		\
> > > > > 
> > > > > Did you try writing that 
> > > > > 
> > > > >     if (need_resched()) {               \
> > > > > 	reiserfs_write_unlock(s);	\
> > > > > 	cond_resched();			\  (or schedule(), but cond_resched does a loop)
> > > > > 	reiserfs_write_lock(s);		\
> > > > >     }				
> > > > > 
> > > > > ? That might give better performance under load because users will be better
> > > > > batched and you don't release the lock unnecessarily in the unloaded case.
> > > > 
> > > > 
> > > > 
> > > > Good catch!
> > > > And I guess this pattern matches most of the cond_resched()
> > > > all over the code (the only condition is that we must already hold
> > > > the write lock).
> > > > 
> > > > I will merge your idea and Ingo's one, write a
> > > > reiserfs_cond_resched() to have a helper which
> > > > factorizes this pattern.
> > > 
> > > The pattern you'll find goes like this:
> > > 
> > > lock_kernel()
> > > do some work
> > > do something that might schedule
> > > run fs_changed(), fixup as required.
> > > 
> > > In your setup it is translating to:
> > > 
> > > reiserfs_write_lock(s)
> > > do some work
> > > reiserfs_write_unlock(s)
> > > 
> > > do something that might schedule
> > > 
> > > reiserfs_write_lock(s)
> > > if (need_resched()) {
> > >     reiserfs_write_unlock(s)
> > >     cond_resched()
> > >     reiserfs_write_lock(s)
> > > }
> > > 
> > > if (__fs_changed()) fixup as required
> > > 
> > > You'll also find that item_moved is similar to __fs_changed() but more
> > > fine grained.
> > > 
> > > One easy optimization is to make an fs_changed_relock()
> > > 
> > > static inline int fs_changed_relock(gen, s) {
> > > 	cond_resched();
> > > 	reiserfs_write_lock(s)
> > > 	return __fs_changed(gen, s)
> > > }
> > 
> > 
> > 
> > Nice idea!
> > Does it means I can also replace the item_moved() calls to __fs_changed()?
> > 
> 
> Not quite, it looks like a common convention is also
> 
> if (fs_changed && item_moved) { fixup }
> 
> item_moved is the expensive check based on the contents of a btree
> block, while fs_changed is a simple generation number.


Ok.

 
> > They seem to not work the same way, I guess I should provide two different
> > helpers, depending on the check.
> > 
> > 
> > > 
> > > Another cause of scheduling is going to be reiserfs_prepare_for_journal.
> > > This function gets called before we modify a metadata buffer and it
> > > waits for IO to finish.
> > > 
> > > Not sure if your patch series already found it, but if you change this:
> > > 
> > > int reiserfs_prepare_for_journal(struct super_block *sb,
> > >                                  struct buffer_head *bh, int wait)
> > > {
> > >         PROC_INFO_INC(sb, journal.prepare);
> > > 
> > >         if (!trylock_buffer(bh)) {
> > >                 if (!wait)
> > >                         return 0;
> > >                 lock_buffer(bh);
> > >         }
> > > 
> > > Into:
> > > 
> > > 	if (!trylock_buffer(bh)) {
> > > 		if (!wait)
> > > 			return 0;
> > > 		reiserfs_write_unlock(s);
> > > 		wait_on_buffer(bh);
> > > 		reiserfs_write_lock(s);
> > > 		lock_buffer(bh);
> > > 	}
> > > 
> > > You'll catch a big cause of waiting for the disk with the lock held.
> > 
> > 
> > Again, good catch. I will try that.
> > 
> > I should also check the different other lock_buffer() which
> > indeed might sleep.
> 
> ftrace can help some.
> 
> cd /sys/kernel/debug/tracing
> echo function > current_tracer
> echo func_stack_trace > trace_options
> echo schedule > set_ftrace_filter
> echo 1 > tracing_on
> 
> trace-histogram < trace_pipe > /tmp/trace_output
> 
> (in another window run the reiserfs workload)
> 
> When done hit ctrl-c on the trace-histogram window
> 
> You'll get the most common causes of schedule during the run, groupd by
> stack trace.  Just look for the ones that are done with the lock held.
> 
> trace-histogram is here:
> 
> http://oss.oracle.com/~mason/trace-histogram
> 
> -chris


Thanks, looks like a useful script! May be it could be merged
in scripts/tracing ? I will try it out.

Recently I used a specific thing to find the hotspots which schedule
with the lock:

I enabled PREEMPT_NONE on an UP box.
And once we find the lock contended, we print the
stacktrace of the lock owner, we are sure that he is
sleeping and that he does it voluntarily because of PREEMPT_NONE
and the UP.

The latest reiserfs patchset is a result of this trick.
I will try yours now that I'm testing on SMP preempt :)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/