linux-kernel - Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 8 May 2015 13:27:11 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Andy Lutomirski <luto@...capital.net>
Cc:	fweisbec@...hat.com, Paolo Bonzini <pbonzini@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Rik van Riel <riel@...hat.com>, williams@...hat.com
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable
 & enable from context tracking on syscall entry


* Andy Lutomirski <luto@...capital.net> wrote:

> On May 8, 2015 12:07 PM, "Ingo Molnar" <mingo@...nel.org> wrote:
> >
> >
> > * Andy Lutomirski <luto@...capital.net> wrote:
> >
> > > > So do you mean:
> > > >
> > > >    this_cpu_set(rcu_state) = IN_KERNEL;
> > > >    ...
> > > >    this_cpu_inc(rcu_qs_ctr);
> > > >    this_cpu_set(rcu_state) = IN_USER;
> > > >
> > > > ?
> > > >
> > > > So in your proposal we'd have an INC and two MOVs. I think we can make
> > > > it just two simple stores into a byte flag, one on entry and one on
> > > > exit:
> > > >
> > > >    this_cpu_set(rcu_state) = IN_KERNEL;
> > > >    ...
> > > >    this_cpu_set(rcu_state) = IN_USER;
> > > >
> > >
> > > I was thinking that either a counter or a state flag could make sense.
> > > Doing both would be pointless.  The counter could use the low bits to
> > > indicate the state.  The benefit of the counter would be that the
> > > RCU-waiting CPU could observe that the counter has incremented and
> > > that therefore a grace period has elapsed.  Getting it right would
> > > require lots of care.
> >
> > So if you mean:
> >
> >        <syscall entry>
> >        ...
> >        this_cpu_inc(rcu_qs_ctr);
> >        <syscall exit>
> >
> > I don't see how this would work reliably: how do you handle the case
> > of a SCHED_FIFO task never returning from user-space (common technique
> > in RT apps)? synchronize_rcu() would block indefinitely as it would
> > never see rcu_qs_ctr increase.
> >
> > We have to be able to observe user-mode anyway, for system-time
> > statistics purposes, and that flag could IMHO also drive the RCU GC
> > machinery.
> 
> The counter would have to be fancier than that to work.  We could 
> say that an even value means user mode, for example.  IOW the high 
> bits of the counter would count transitions to quiescent and the low 
> bits would indicate the state.
> 
> Actually, I'd count backwards.  We could use an andl instruction to 
> go to user mode and a decl to go to kernel mode.

at which point it's not really a monotonic quiescent state counter - 
which your naming suggested before.

Also it would have to be done both at entry and at exit.

But yes, if you replace a state flag with a recursive state flag that 
is INC/DEC maintained that would work as well. That's not a counter 
though.

> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets 
> > > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET 
> > > because stores can be delayed.  (It could even happen after 
> > > sysret, IIRC, but IRET is serializing.)
> >
> > All it has to do in the synchronize_rcu() slowpath is something 
> > like:
> 
> I don't think this works.  Suppose rt_cpu starts in kernel mode.  
> Then it checks ti flags.
>
> >
> >         if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> >                 smp_mb__before_atomic();
> >                 set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> >                 smp_rmb();
> >                 if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
> 
> Now it sets rcu_state and exits to user mode.  It never notices
> TIF_RCU_NOTIFY.

Indeed, you are right, that's racy.

> [...]  We could fix this by sending an IPI to kick it.

With an IPI we won't need any flags or counters on the RT CPU - it's 
the IPI we are trying to avoid.

So how about the following way to fix the race: simply do a poll loop 
with a fixed timeout sleep: like it would do anyway if 
synchronize_rcu() was waiting for the timer irq to end the grace 
period on the RT-CPU.

The TIF flag would cause an RCU grace period to lapse, no need to wake 
up the synchronize_rcu() side: it will notice the (regular) increased 
RCU counter.

> > We are talking about a dozen cycles, while a typical 
> > synchronize_rcu() will wait millions (sometimes billions) of 
> > cycles. There's absolutely zero performance concern here and it's 
> > all CPU local in any case.
> 
> The barrier would affect all syscalls, though. [...]

What barrier? I never suggested any barrier in the syscall fast path, 
this bit:

> >         if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> >                 smp_mb__before_atomic();
> >                 set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> >                 smp_rmb();
> >                 if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)

runs inside synchronize_rcu() or so.

But none of that is needed if we do:

	- simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context 
	  entry/exit time, straight in the assembly

	- TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does 
	  nothing but: this_cpu_inc(rcu_qs_ctr).

	- synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and 
	  then does a timeout-poll-loop with jiffies granular 
	  timeouts, simulating a timer IRQ in essence. It will either 
	  observe IN_USER or will see the regular RCU qs counter 
	  increase.

this should be race free.

Alternatively we can even get rid of the TIF flag by merging the 
percpu counter with the percpu state. Quiescent states are recorded 
via a counter shifted by two bits:

	rcu_qs_ctr += 4;

while user/kernel/idle mode is recorded in the lower 2 bits.

So on entering the kernel we'd do:

	rcu_qs_ctr += 4+1; /* Register QS and set bit 0 to IN_KERNEL */

on returning to user-space we'd do:

	rcu_qs_ctr += 4-1; /* We have bit 0 set already, clear it and register a QS */

on going idle we'd do:

	rcu_qs_ctr += 4+2; /* Register QS, set bit 1 */

on return from idle we'd do:

	rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */

etc. On all boundary transitions we can use a constant ADD on a 
suitable percpu variable.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/