Message-ID: <CALCETrVEsNj8dvqd-mNqb5tKNQOwQEgtMRUeTtJSS8-EmntAiA@mail.gmail.com>
Date:	Fri, 23 Jan 2015 08:24:40 -0800
From:	Andy Lutomirski <luto@...capital.net>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Kirill Tkhai <ktkhai@...allels.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [RFC] sched, x86: Prevent resched interrupts if task in kernel
 mode and !CONFIG_PREEMPT

On Fri, Jan 23, 2015 at 8:07 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Fri, Jan 23, 2015 at 06:53:32PM +0300, Kirill Tkhai wrote:
>> It's useless to send reschedule interrupts in such situations. The earliest
>> point, where schedule() call is possible, is sysret_careful(). But in that
>> function we directly test TIF_NEED_RESCHED.
>>
>> So it's possible to get rid of that type of interrupts.
>>
>> How about this idea? Is set_bit() cheap on x86 machines?
>
> So you set TIF_POLLING_NRFLAG on syscall entry and clear it again on
> exit? Thereby we avoid the IPI, because the exit path already checks for
> TIF_NEED_RESCHED.

The idle code says:

        /*
         * If the arch has a polling bit, we maintain an invariant:
         *
         * Our polling bit is clear if we're not scheduled (i.e. if
         * rq->curr != rq->idle).  This means that, if rq->idle has
         * the polling bit set, then setting need_resched is
         * guaranteed to cause the cpu to reschedule.
         */

Setting polling on non-idle tasks like this will involve either
weakening this invariant a bit (it'll still be true for rq->idle) or
changing the polling state on context switch.
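
For reference, the decision that relies on this invariant can be modeled
roughly as below. This is a user-space sketch with C11 atomics, not kernel
code: the model_ names and the bit values are made up for illustration, and
the only thing it tries to show is that need_resched is set atomically and
the IPI is skipped when the target was already polling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the real TIF_* numbers are arch-specific. */
#define MODEL_TIF_NEED_RESCHED   (1u << 0)
#define MODEL_TIF_POLLING_NRFLAG (1u << 1)

/* Stand-in for the flags word of a task's thread_info. */
struct model_thread_info {
	atomic_uint flags;
};

/*
 * Atomically set NEED_RESCHED on the target task and report whether
 * the caller still has to send a reschedule IPI.  If the polling bit
 * was already set, the target has promised to notice NEED_RESCHED on
 * its own, so no IPI is needed and we return false.
 */
static bool model_set_nr_and_not_polling(struct model_thread_info *ti)
{
	unsigned int old;

	assert(ti != NULL);
	old = atomic_fetch_or(&ti->flags, MODEL_TIF_NEED_RESCHED);
	return !(old & MODEL_TIF_POLLING_NRFLAG);
}
```

If a non-idle task can have the polling bit set, a false return here is
only safe if that task is guaranteed to look at need_resched before it
does anything that can take a long time.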

>
> Should work I suppose, but I'm not too familiar with all that entry.S
> muck. Andy might know and appreciate this.
>
>> ---
>>  arch/x86/kernel/entry_64.S |   10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index c653dc4..a046ba8 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -409,6 +409,13 @@ GLOBAL(system_call_after_swapgs)
>>       movq_cfi rax,(ORIG_RAX-ARGOFFSET)
>>       movq  %rcx,RIP-ARGOFFSET(%rsp)
>>       CFI_REL_OFFSET rip,RIP-ARGOFFSET
>> +#if !defined(CONFIG_PREEMPT) || !defined(SMP)
>> +     /*
>> +      * Tell resched_curr() do not send useless interrupts to us.
>> +      * Kernel isn't preemptible till sysret_careful() anyway.
>> +      */
>> +     LOCK ; bts $TIF_POLLING_NRFLAG,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>> +#endif

That's kind of expensive.  What's the !SMP part for?

>>       testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>>       jnz tracesys
>>  system_call_fastpath:
>> @@ -427,6 +434,9 @@ GLOBAL(system_call_after_swapgs)
>>   * Has incomplete stack frame and undefined top of stack.
>>   */
>>  ret_from_sys_call:
>> +#if !defined(CONFIG_PREEMPT) || !defined(SMP)
>> +     LOCK ; btr $TIF_POLLING_NRFLAG,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>> +#endif

If only it were this simple.  There are lots of ways out of syscalls,
and this is only one of them :(  If we did this, I'd rather do it
through the do_notify_resume mechanism or something.

I don't see any way to do this without at least one atomic op or
smp_mb per syscall, and that's kind of expensive.
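
To illustrate where the cost comes from, here is a user-space sketch of
the entry/exit pairing, again using C11 atomics and made-up model_ names
and bit values (nothing here is the real entry code). The flags word can
be modified by other CPUs at any time, so both the set and the clear have
to be atomic read-modify-write operations, and the exit side has to
check need_resched using the value observed by the same atomic op that
clears the polling bit.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit values; the real TIF_* numbers are arch-specific. */
#define MODEL_TIF_NEED_RESCHED   (1u << 0)
#define MODEL_TIF_POLLING_NRFLAG (1u << 1)

/* Syscall entry: advertise that remote CPUs may skip the IPI. */
static void model_syscall_entry(atomic_uint *flags)
{
	atomic_fetch_or(flags, MODEL_TIF_POLLING_NRFLAG);
}

/*
 * Syscall exit: stop advertising polling.  Returns true if
 * NEED_RESCHED was set at the moment the polling bit was cleared,
 * in which case the caller must reschedule before returning to
 * user mode, because the remote CPU did not send an IPI.
 */
static bool model_syscall_exit(atomic_uint *flags)
{
	unsigned int old;

	old = atomic_fetch_and(flags, ~MODEL_TIF_POLLING_NRFLAG);
	return (old & MODEL_TIF_NEED_RESCHED) != 0;
}
```

Every exit path from the syscall would need the equivalent of
model_syscall_exit(), which is the part that makes this hard to get
right in practice.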

Would it make sense to try to use context tracking instead?  On
systems that use context tracking, syscalls are already expensive, and
we're already keeping track of which CPUs are in user mode.

--Andy