[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110721050927.GV2313@linux.vnet.ibm.com>
Date:	Wed, 20 Jul 2011 22:09:27 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	linux-kernel@...r.kernel.org, mingo@...e.hu, laijs@...fujitsu.com,
	dipankar@...ibm.com, akpm@...ux-foundation.org,
	mathieu.desnoyers@...ymtl.ca, josh@...htriplett.org,
	niv@...ibm.com, tglx@...utronix.de, peterz@...radead.org,
	rostedt@...dmis.org, Valdis.Kletnieks@...edu, dhowells@...hat.com,
	eric.dumazet@...il.com, darren@...art.com, patches@...aro.org,
	greearb@...delatech.com, edt@....ca
Subject: Re: [PATCH tip/core/urgent 3/7] rcu: Streamline code produced by
 __rcu_read_unlock()
On Wed, Jul 20, 2011 at 03:44:55PM -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2011 at 11:26 AM, Paul E. McKenney
> <paulmck@...ux.vnet.ibm.com> wrote:
> > Given some common flag combinations, particularly -Os, gcc will inline
> > rcu_read_unlock_special() despite its being in an unlikely() clause.
> > Use noinline to prohibit this misoptimization.
> 
> Btw, I suspect that we should at least look at what it would mean if
> we make the rcu_read_lock_nesting and the preempt counters both be
> per-cpu variables instead of making them per-thread/process counters.
> 
> Then, when we switch threads, we'd just save/restore them from the
> process register save area.
> 
> There's a lot of critical code sequences (spin-lock/unlock, rcu
> read-lock/unlock) that currently fetches the thread/process pointer
> only to then offset it and increment the count. I get the strong
> feeling that code generation could be improved and we could avoid one
> level of indirection by just making it a per-thread counter.
> 
> For example, instead of __rcu_read_lock: looking like this (and being
> an external function, partly because of header file dependencies on
> the data structures involved):
> 
>   push   %rbp
>   mov    %rsp,%rbp
>   mov    %gs:0xb580,%rax
>   incl   0x100(%rax)
>   leaveq
>   retq
> 
> it should inline to just something like
> 
>   incl %gs:0x100
> 
> instead. Same for the preempt counter.
> 
> Of course, it would need to involve making sure that we pick a good
> cacheline etc that is already always dirty. But other than that, is
> there any real downside?
We would need a form of per-CPU variable access that generated
efficient code, but that didn't complain about being used when
preemption was enabled.  __this_cpu_add_4() might do the trick,
but I haven't dug fully through it yet.
						Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Powered by blists - more mailing lists
 
