Message-ID: <20110721050927.GV2313@linux.vnet.ibm.com>
Date: Wed, 20 Jul 2011 22:09:27 -0700
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: linux-kernel@...r.kernel.org, mingo@...e.hu, laijs@...fujitsu.com,
dipankar@...ibm.com, akpm@...ux-foundation.org,
mathieu.desnoyers@...ymtl.ca, josh@...htriplett.org,
niv@...ibm.com, tglx@...utronix.de, peterz@...radead.org,
rostedt@...dmis.org, Valdis.Kletnieks@...edu, dhowells@...hat.com,
eric.dumazet@...il.com, darren@...art.com, patches@...aro.org,
greearb@...delatech.com, edt@....ca
Subject: Re: [PATCH tip/core/urgent 3/7] rcu: Streamline code produced by
__rcu_read_unlock()
On Wed, Jul 20, 2011 at 03:44:55PM -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2011 at 11:26 AM, Paul E. McKenney
> <paulmck@...ux.vnet.ibm.com> wrote:
> > Given some common flag combinations, particularly -Os, gcc will inline
> > rcu_read_unlock_special() despite its being in an unlikely() clause.
> > Use noinline to prohibit this misoptimization.
>
> Btw, I suspect that we should at least look at what it would mean if
> we make the rcu_read_lock_nesting and the preempt counters both be
> per-cpu variables instead of making them per-thread/process counters.
>
> Then, when we switch threads, we'd just save/restore them from the
> process register save area.
>
> There's a lot of critical code sequences (spin-lock/unlock, rcu
> read-lock/unlock) that currently fetches the thread/process pointer
> only to then offset it and increment the count. I get the strong
> feeling that code generation could be improved and we could avoid one
> level of indirection by just making it a per-thread counter.
>
> For example, instead of __rcu_read_lock: looking like this (and being
> an external function, partly because of header file dependencies on
> the data structures involved):
>
> push %rbp
> mov %rsp,%rbp
> mov %gs:0xb580,%rax
> incl 0x100(%rax)
> leaveq
> retq
>
> it should inline to just something like
>
> incl %gs:0x100
>
> instead. Same for the preempt counter.
>
> Of course, it would need to involve making sure that we pick a good
> cacheline etc that is already always dirty. But other than that, is
> there any real downside?
We would need a form of per-CPU variable access that generated
efficient code, but that didn't complain about being used when
preemption was enabled. __this_cpu_add_4() might do the trick,
but I haven't dug fully through it yet.
Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/