linux-kernel - Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <21988538-1533-4C4B-A8B0-B298BC4517BB@amacapital.net>
Date:   Wed, 13 May 2020 18:22:05 -0700
From:   Andy Lutomirski <luto@...capital.net>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Steven Rostedt <rostedt@...dmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        x86 <x86@...nel.org>, paulmck <paulmck@...nel.org>,
        Andy Lutomirski <luto@...nel.org>,
        Alexandre Chartre <alexandre.chartre@...cle.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Petr Mladek <pmladek@...e.com>,
        "Joel Fernandes, Google" <joel@...lfernandes.org>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        Juergen Gross <JGross@...e.com>,
        Brian Gerst <brgerst@...il.com>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [patch V4 part 1 05/36] x86/entry: Flip _TIF_SIGPENDING and _TIF_NOTIFY_RESUME handling



> On May 13, 2020, at 5:12 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
> 
> Steven, Mathieu
> 
> (combo reply)
> 
> Steven Rostedt <rostedt@...dmis.org> writes:
>> On Wed, 13 May 2020 16:56:41 -0400 (EDT)
>>>> +        /* deal with pending signal delivery */
>>>> +        if (cached_flags & _TIF_SIGPENDING)
>>>> +            do_signal(regs);
>> 
>> Looking deeper into this, it appears that do_signal() can freeze or kill the
>> task.
>> 
>> That is, it wont go back to user space here, but simply schedule out (being
>> traced) or even exit (killed).
>> 
>> Before the resume hooks would never be called in such cases, and now they
>> are.
> 
> It theoretically matters because pending task work might kill the
> task. That's the concern Andy and Peter had. Assume the following:
> 
> usermode
> 
> -> exception
>    set not fatal signal
> 
>    -> exception
>        queue task work to kill task
>    <- return
> 
>  <- return
> 
> The same could happen when the non fatal signal is set from a remote CPU.
> 
> So in theory that would result in:
> 
>   handle non fatal signal first
> 
>   handle task work which kills task
> 
> which would be the wrong order.
> 
> But that's just illusion.
> 
>>> Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
> 
>>> Also, color me confused: is "do_signal()" actually running any user-space,
>>> or just setting up the user-space stack for eventual return to signal
>>> handler ?
> 
> I'm surprised that you can't answer that question yourself. How did you
> ever make rseq work and how did rseq_signal_deliver() end up in
> setup_rt_frame()?
> 
> Hint: Tracing might answer that question
> 
> And to cut it short:
> 
>    Exit to user space happnes only through ONE channel, i.e. leaving
>    prepare_exit_to usermode().
> 
>      exit_to_usermode_loop <-prepare_exit_to_usermode
>      do_signal <-exit_to_usermode_loop
>      get_signal <-do_signal
>      setup_sigcontext <-do_signal
>      do_syscall_64 <-entry_SYSCALL_64_after_hwframe
>      syscall_trace_enter <-do_syscall_64
> 
>      sys_rt_sigreturn()
>      restore_altstack <-__ia32_sys_rt_sigreturn
>      syscall_slow_exit_work <-do_syscall_64
>      exit_to_usermode_loop <-do_syscall_64
> 
>>> Also, it might be OK, but we're changing the order of two things which
>>> have effects on each other: restartable sequences abort fixup for preemption
>>> and do_signal(), which also have effects on rseq abort.
>>> 
>>> Because those two will cause the abort to trigger, I suspect changing
>>> the order might be OK, but we really need to think this through.
> 
> That's a purely academic problem. The order is completely
> irrelevant. You have to handle any order anyway:
> 
> usermode
> 
>  -> exception / syscall
>       sets signal
> 
>   <- return
> 
>  prepare_exit_to_usemode()
>      cached_flags = READ_ONCE(t->flags);
>      exit_to_user_mode_loop(regs, cached_flags) {
>        while (cached_flags) {
>           local_irq_enable();
> 
>           handle(cached_flags & RESCHED);
>           handle(cached_flags & UPROBE);
>           handle(cached_flags & PATCHING);
>           handle(cached_flags & SIGNAL);
>           handle(cached_flags & NOTIFY_RESUME);
>           handle(cached_flags & RETURN_NOTIFY);
> 
>           local_irq_disable();
> 
>           cached_flags = READ_ONCE(t->flags);
>         }
> 
> cached_flag is a momentary snapshot when attempting to return to user
> space.
> 
> But after reenabling interrupts any of the relevant flag bits can be set
> by an exception/interrupt or from remote. If preemption is enabled the
> task can be scheduled out, migrated at any point before disabling
> interrupts again. Even after disabling interrupts and before re-reading
> cached flags there might be a remote change of flags.
> 
> That said, even for the case Andy and Peter were looking at (MCE) the
> ordering is completely irrelevant.
> 
> I should have thought about this before, so thanks to both of you for
> making me look at it again for the very wrong reasons.
> 
> Consider the patch dropped.

I disagree.

There is only one relevant MCE case: #MC hits user code and tries to recover.

Right now, this works via the ist_begin_non_atomic hack. But the series changes it so that it uses task_work(). So now the kernel ends up in prepare_exit_to_usermode() and the machine check work is pending, but a signal might *also* be pending due to another CPU setting the bit.

If we process the signal first, we could block on userfaultfd or whatever, and now we could take forever to finish.

Heck, even with the order changed, we could get preempted, return to another user task, and get a new machine check. Ick.

So I agree that this patch is problematic, and it doesn’t fully fix the problem, but I do believe that the task_work thing is problematic.

Rumor has it that it gets improved a bit farther along in the series, but I’m still plodding through.



> 
> Thanks,
> 
>        tglx