Message-ID: <CALCETrXA9XMshR1iMVmeOseYC62Ch3db6jc+GrcqfA1MB6ywyw@mail.gmail.com>
Date:	Fri, 8 May 2015 05:56:52 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	fweisbec@...hat.com, Paolo Bonzini <pbonzini@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Rik van Riel <riel@...hat.com>, williams@...hat.com
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable &
 enable from context tracking on syscall entry

On Fri, May 8, 2015 at 4:27 AM, Ingo Molnar <mingo@...nel.org> wrote:
>
> * Andy Lutomirski <luto@...capital.net> wrote:
>
>> On May 8, 2015 12:07 PM, "Ingo Molnar" <mingo@...nel.org> wrote:
>> >
>> >
>> > * Andy Lutomirski <luto@...capital.net> wrote:
>> >
>> > > > So do you mean:
>> > > >
>> > > >    this_cpu_write(rcu_state, IN_KERNEL);
>> > > >    ...
>> > > >    this_cpu_inc(rcu_qs_ctr);
>> > > >    this_cpu_write(rcu_state, IN_USER);
>> > > >
>> > > > ?
>> > > >
>> > > > So in your proposal we'd have an INC and two MOVs. I think we can make
>> > > > it just two simple stores into a byte flag, one on entry and one on
>> > > > exit:
>> > > >
>> > > >    this_cpu_write(rcu_state, IN_KERNEL);
>> > > >    ...
>> > > >    this_cpu_write(rcu_state, IN_USER);
>> > > >
>> > >
>> > > I was thinking that either a counter or a state flag could make sense.
>> > > Doing both would be pointless.  The counter could use the low bits to
>> > > indicate the state.  The benefit of the counter would be that the
>> > > RCU-waiting CPU could observe that the counter has incremented and
>> > > that therefore a grace period has elapsed.  Getting it right would
>> > > require lots of care.
>> >
>> > So if you mean:
>> >
>> >        <syscall entry>
>> >        ...
>> >        this_cpu_inc(rcu_qs_ctr);
>> >        <syscall exit>
>> >
>> > I don't see how this would work reliably: how do you handle the case
>> > of a SCHED_FIFO task never returning from user-space (common technique
>> > in RT apps)? synchronize_rcu() would block indefinitely as it would
>> > never see rcu_qs_ctr increase.
>> >
>> > We have to be able to observe user-mode anyway, for system-time
>> > statistics purposes, and that flag could IMHO also drive the RCU
>> > grace-period machinery.
>>
>> The counter would have to be fancier than that to work.  We could
>> say that an even value means user mode, for example.  IOW the high
>> bits of the counter would count quiescent-state transitions and the
>> low bits would indicate the mode.
>>
>> Actually, I'd count backwards.  We could use an andl instruction to
>> go to user mode and a decl to go to kernel mode.
>
> at which point it's not really a monotonic quiescent state counter,
> which is what your naming suggested before.
>
> Also it would have to be done both at entry and at exit.
>
> But yes, if you replace a state flag with a recursive state flag that
> is INC/DEC maintained, that would work as well. That's not a counter
> though.

I won't quibble about the name :)
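
For concreteness, here is a minimal sketch of the decrementing counter
being discussed, using the existing percpu ops (the helper names are
illustrative, not from an actual patch): an even value means user
mode, entering the kernel is a single decrement, and returning to user
mode is a single "and", so each kernel round trip moves the counter
down by two.

        #include <linux/percpu.h>

        /* Illustrative only: even = user mode, odd = kernel mode. */
        DEFINE_PER_CPU(unsigned long, rcu_qs_ctr);

        static inline void rcu_qs_syscall_enter(void)
        {
                /* even -> odd: one decrement instruction on x86 */
                this_cpu_dec(rcu_qs_ctr);
        }

        static inline void rcu_qs_syscall_exit(void)
        {
                /* clear bit 0: one "and"; net -2 per round trip */
                this_cpu_and(rcu_qs_ctr, ~1UL);
        }

A waiting CPU can then treat either an even value or any change in the
counter as evidence that a quiescent state has passed.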

>
>> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
>> > > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
>> > > because stores can be delayed.  (It could even happen after
>> > > sysret, IIRC, but IRET is serializing.)
>> >
>> > All it has to do in the synchronize_rcu() slowpath is something
>> > like:
>>
>> I don't think this works.  Suppose rt_cpu starts in kernel mode.
>> Then it checks ti flags.
>>
>> >
>> >         if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> >                 smp_mb__before_atomic();
>> >                 set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> >                 smp_rmb();
>> >                 if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>>
>> Now it sets rcu_state and exits to user mode.  It never notices
>> TIF_RCU_NOTIFY.
>
> Indeed, you are right, that's racy.
>
>> [...]  We could fix this by sending an IPI to kick it.
>
> With an IPI we won't need any flags or counters on the RT CPU - it's
> the IPI we are trying to avoid.
>
> So how about the following way to fix the race: simply do a poll loop
> with a fixed timeout sleep, like synchronize_rcu() would do anyway if
> it were waiting for the timer irq to end the grace period on the
> RT-CPU.

Seems reasonable to me.
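
As a sketch of that idea, reusing the rcu_state/rcu_qs_ctr names from
this thread (the helper itself is hypothetical, not actual patch
code): snapshot the RT CPU's counter, then re-check its state with
jiffies-granular sleeps instead of IPIing it.

        #include <linux/percpu.h>
        #include <linux/sched.h>

        /* Illustrative poll loop for the synchronize_rcu() slowpath. */
        static void wait_for_rt_cpu_qs(int rt_cpu)
        {
                unsigned long snap = per_cpu(rcu_qs_ctr, rt_cpu);

                while (per_cpu(rcu_state, rt_cpu) == IN_KERNEL &&
                       per_cpu(rcu_qs_ctr, rt_cpu) == snap) {
                        /* jiffies-granular sleep, simulating the timer irq */
                        schedule_timeout_uninterruptible(1);
                }
        }

Either the CPU reaches user mode or it passes through a quiescent
state and the counter moves; both end the loop without an IPI and
without any barriers in the syscall fast path.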

>
> The TIF flag would cause an RCU grace period to elapse; no need to
> wake up the synchronize_rcu() side: it will notice the (regular) RCU
> counter increase.
>
>> > We are talking about a dozen cycles, while a typical
>> > synchronize_rcu() will wait millions (sometimes billions) of
>> > cycles. There's absolutely zero performance concern here and it's
>> > all CPU local in any case.
>>
>> The barrier would affect all syscalls, though. [...]
>
> What barrier? I never suggested any barrier in the syscall fast path,
> this bit:

Oh, I misunderstood.

>
>> >         if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> >                 smp_mb__before_atomic();
>> >                 set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> >                 smp_rmb();
>> >                 if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>
> runs inside synchronize_rcu() or so.
>
> But none of that is needed if we do:
>
>         - simple IN_KERNEL/IN_USER[/IN_IDLE] flag updated at context
>           entry/exit time, straight in the assembly
>
>         - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does
>           nothing but: this_cpu_inc(rcu_qs_ctr).
>
>         - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and
>           then does a timeout-poll-loop with jiffies granular
>           timeouts, simulating a timer IRQ in essence. It will either
>           observe IN_USER or will see the regular RCU qs counter
>           increase.
>
> this should be race free.
>
> Alternatively we can even get rid of the TIF flag by merging the
> percpu counter with the percpu state. Quiescent states are recorded
> via a counter shifted by two bits:
>
>         rcu_qs_ctr += 4;
>
> while user/kernel/idle mode is recorded in the lower 2 bits.
>
> So on entering the kernel we'd do:
>
>         rcu_qs_ctr += 4+1; /* Register QS and set bit 0 to IN_KERNEL */
>
> on returning to user-space we'd do:
>
>         rcu_qs_ctr += 4-1; /* We have bit 0 set already, clear it and register a QS */
>
> on going idle we'd do:
>
>         rcu_qs_ctr += 4+2-1; /* Register QS, clear bit 0, set bit 1 */
>
> on return from idle we'd do:
>
>         rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */
>
> etc. On all boundary transitions we can use a constant ADD on a
> suitable percpu variable.

Sounds good to me, except that we need to be careful to distinguish
between non-syscall entries from quiescent states and non-syscall
entries from non-quiescent states.  We could save the old state (as the
current exception_enter code does) or we could allocate enough low
bits for the state that the problem goes away.

I don't think the TIF_RCU_QS variant is worthwhile -- merging the
counter and state is probably both easier and faster.
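
For the record, a hedged sketch of the merged counter+state scheme
(the constants and helper names are mine, chosen so that the low two
bits stay consistent across transitions; a real patch would also need
the non-syscall entry handling mentioned above):

        #include <linux/percpu.h>

        #define RCU_QS_STEP     4  /* one QS, counted in the high bits */
        #define RCU_IN_USER     0  /* low two bits encode the mode */
        #define RCU_IN_KERNEL   1
        #define RCU_IN_IDLE     2

        DEFINE_PER_CPU(unsigned long, rcu_qs_ctr); /* 0 == RCU_IN_USER */

        /* Each transition folds to one constant add at compile time. */
        static inline void rcu_qs_user_to_kernel(void)
        {
                this_cpu_add(rcu_qs_ctr,
                             RCU_QS_STEP + RCU_IN_KERNEL - RCU_IN_USER);
        }

        static inline void rcu_qs_kernel_to_user(void)
        {
                this_cpu_add(rcu_qs_ctr,
                             RCU_QS_STEP + RCU_IN_USER - RCU_IN_KERNEL);
        }

        static inline void rcu_qs_kernel_to_idle(void)
        {
                this_cpu_add(rcu_qs_ctr,
                             RCU_QS_STEP + RCU_IN_IDLE - RCU_IN_KERNEL);
        }

        static inline void rcu_qs_idle_to_kernel(void)
        {
                this_cpu_add(rcu_qs_ctr,
                             RCU_QS_STEP + RCU_IN_KERNEL - RCU_IN_IDLE);
        }

synchronize_rcu() then only has to snapshot the percpu value and wait
until it either changes or its low bits read RCU_IN_USER or
RCU_IN_IDLE.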

--Andy