Message-ID: <1422032989.6345.26.camel@tkhai>
Date:	Fri, 23 Jan 2015 20:09:49 +0300
From:	Kirill Tkhai <ktkhai@...allels.com>
To:	Andy Lutomirski <luto@...capital.net>
CC:	Peter Zijlstra <peterz@...radead.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	"Ingo Molnar" <mingo@...hat.com>, "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [RFC] sched, x86: Prevent resched interrupts if task in kernel
 mode and !CONFIG_PREEMPT

On Fri, 23/01/2015 at 08:24 -0800, Andy Lutomirski wrote:
> On Fri, Jan 23, 2015 at 8:07 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> > On Fri, Jan 23, 2015 at 06:53:32PM +0300, Kirill Tkhai wrote:
> >> It's useless to send reschedule interrupts in such situations. The earliest
> >> point, where schedule() call is possible, is sysret_careful(). But in that
> >> function we directly test TIF_NEED_RESCHED.
> >>
> >> So it's possible to get rid of that type of interrupts.
> >>
> >> How about this idea? Is set_bit() cheap on x86 machines?
> >
> > So you set TIF_POLLING_NRFLAG on syscall entry and clear it again on
> > exit? Thereby we avoid the IPI, because the exit path already checks for
> > TIF_NEED_RESCHED.
> 
> The idle code says:
> 
>         /*
>          * If the arch has a polling bit, we maintain an invariant:
>          *
>          * Our polling bit is clear if we're not scheduled (i.e. if
>          * rq->curr != rq->idle).  This means that, if rq->idle has
>          * the polling bit set, then setting need_resched is
>          * guaranteed to cause the cpu to reschedule.
>          */
> 
> Setting polling on non-idle tasks like this will either involve
> weakening this a bit (it'll still be true for rq->idle) or changing
> the polling state on context switch.
> 
> >
> > Should work I suppose, but I'm not too familiar with all that entry.S
> > muck. Andy might know and appreciate this.
> >
> >> ---
> >>  arch/x86/kernel/entry_64.S |   10 ++++++++++
> >>  1 file changed, 10 insertions(+)
> >>
> >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> >> index c653dc4..a046ba8 100644
> >> --- a/arch/x86/kernel/entry_64.S
> >> +++ b/arch/x86/kernel/entry_64.S
> >> @@ -409,6 +409,13 @@ GLOBAL(system_call_after_swapgs)
> >>       movq_cfi rax,(ORIG_RAX-ARGOFFSET)
> >>       movq  %rcx,RIP-ARGOFFSET(%rsp)
> >>       CFI_REL_OFFSET rip,RIP-ARGOFFSET
> >> +#if !defined(CONFIG_PREEMPT) || !defined(SMP)
> >> +     /*
> >> +      * Tell resched_curr() do not send useless interrupts to us.
> >> +      * Kernel isn't preemptible till sysret_careful() anyway.
> >> +      */
> >> +     LOCK ; bts $TIF_POLLING_NRFLAG,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> >> +#endif
> 
> That's kind of expensive.  What's the !SMP part for?

smp_send_reschedule() is a NOP on UP, so there is no problem there.

> 
> >>       testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> >>       jnz tracesys
> >>  system_call_fastpath:
> >> @@ -427,6 +434,9 @@ GLOBAL(system_call_after_swapgs)
> >>   * Has incomplete stack frame and undefined top of stack.
> >>   */
> >>  ret_from_sys_call:
> >> +#if !defined(CONFIG_PREEMPT) || !defined(SMP)
> >> +     LOCK ; btr $TIF_POLLING_NRFLAG,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> >> +#endif
> 
> If only it were this simple.  There are lots of ways out of syscalls,
> and this is only one of them :(  If we did this, I'd rather do it
> through the do_notify_resume mechanism or something.

Yes, the syscall path is just the one case I did as an example.

> I don't see any way to do this without at least one atomic op or
> smp_mb per syscall, and that's kind of expensive.

Just for information: doesn't x86 set_bit() lock only a small area of
memory? I thought it's not very expensive on this arch (some bus
optimizations or something like that).
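
To make the idea concrete, here is a minimal user-space sketch of the
protocol in C11 atomics. It is only a model, not kernel code: the
MODEL_* bits, model_flags and the model_*() helpers are hypothetical
names that stand in for TIF_NEED_RESCHED, TIF_POLLING_NRFLAG,
thread_info->flags and the entry/exit/resched paths. It shows that the
entry and exit sides each need one atomic read-modify-write, which is
the per-syscall cost being discussed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits, modeled on TIF_NEED_RESCHED and TIF_POLLING_NRFLAG. */
#define MODEL_NEED_RESCHED  (1u << 0)
#define MODEL_POLLING       (1u << 1)

static atomic_uint model_flags;  /* stands in for thread_info->flags */
static int model_ipis_sent;      /* counts simulated reschedule IPIs */

/* Syscall entry: one atomic RMW sets the polling bit (like LOCK bts). */
static void model_syscall_enter(void)
{
	atomic_fetch_or(&model_flags, MODEL_POLLING);
}

/*
 * Syscall exit: one atomic RMW clears the polling bit (like LOCK btr).
 * The old value tells us whether a reschedule was requested while we
 * were "polling", so the exit path can call the scheduler itself.
 */
static bool model_syscall_exit(void)
{
	unsigned int old = atomic_fetch_and(&model_flags, ~MODEL_POLLING);

	return old & MODEL_NEED_RESCHED;
}

/*
 * Remote side: set need_resched atomically and count an IPI only if
 * the target was not polling at that moment.
 */
static void model_resched_curr(void)
{
	unsigned int old = atomic_fetch_or(&model_flags, MODEL_NEED_RESCHED);

	if (!(old & MODEL_POLLING))
		model_ipis_sent++;
}
```

In this model, a model_resched_curr() call that lands between
model_syscall_enter() and model_syscall_exit() sends no IPI, and
model_syscall_exit() returns true so the request is not lost. The same
call made outside that window counts one IPI.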

> Would it make sense to try to use context tracking instead?  On
> systems that use context tracking, syscalls are already expensive, and
> we're already keeping track of which CPUs are in user mode.

I'll look at context_tracking, but I'm not sure about the SMP
synchronization there.

Thanks,
Kirill
