linux-kernel - Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1872633041.20290.1629485463253.JavaMail.zimbra@efficios.com>
Date:   Fri, 20 Aug 2021 14:51:03 -0400 (EDT)
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     "Russell King, ARM Linux" <linux@...linux.org.uk>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>, Guo Ren <guoren@...nel.org>,
        Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
        Michael Ellerman <mpe@...erman.id.au>,
        Heiko Carstens <hca@...ux.ibm.com>, gor <gor@...ux.ibm.com>,
        Christian Borntraeger <borntraeger@...ibm.com>,
        Oleg Nesterov <oleg@...hat.com>, rostedt <rostedt@...dmis.org>,
        Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Andy Lutomirski <luto@...nel.org>,
        paulmck <paulmck@...nel.org>, Boqun Feng <boqun.feng@...il.com>,
        Paolo Bonzini <pbonzini@...hat.com>, shuah <shuah@...nel.org>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Paul Mackerras <paulus@...ba.org>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        linux-csky <linux-csky@...r.kernel.org>,
        linux-mips <linux-mips@...r.kernel.org>,
        linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
        linux-s390 <linux-s390@...r.kernel.org>,
        KVM list <kvm@...r.kernel.org>,
        linux-kselftest <linux-kselftest@...r.kernel.org>,
        Peter Foley <pefoley@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Ben Gardon <bgardon@...gle.com>
Subject: Re: [PATCH 1/5] KVM: rseq: Update rseq when processing
 NOTIFY_RESUME on xfer to KVM guest

----- On Aug 19, 2021, at 7:48 PM, Sean Christopherson seanjc@...gle.com wrote:

> On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
>> ----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@...gle.com wrote:
>> > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
>> > 	 * If not nested over a rseq critical section, restart is useless.
>> > 	 * Clear the rseq_cs pointer and return.
>> > 	 */
>> > -	if (!in_rseq_cs(ip, &rseq_cs))
>> > +	if (!regs || !in_rseq_cs(ip, &rseq_cs))
>> 
>> I think clearing the thread's rseq_cs unconditionally here when regs is NULL
>> is not the behavior we want when this is called from xfer_to_guest_mode_work.
>> 
>> If we have a scenario where userspace ends up calling this ioctl(KVM_RUN)
>> from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to
>> kill this application in the rseq_syscall handler when exiting back to usermode
>> when the ioctl eventually returns.
>> 
>> However, clearing the thread's rseq_cs will prevent this from happening.
>> 
>> So I would favor an approach where we simply do:
>> 
>> if (!regs)
>>      return 0;
>> 
>> Immediately at the beginning of rseq_ip_fixup, before getting the instruction
>> pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it
>> is not relevant to do any fixup here, because it is nested in a ioctl system
>> call.
>> 
>> Effectively, this would preserve the SIGSEGV behavior when this ioctl is
>> erroneously called by user-space from a rseq critical section.
> 
> Ha, that's effectively what I implemented first, but I changed it because of the
> comment in clear_rseq_cs() that says:
> 
>  The rseq_cs field is set to NULL on preemption or signal delivery ... as well
>  as well as on top of code outside of the rseq assembly block.
> 
> Which makes it sound like something might rely on clearing rseq_cs?

This comment is describing succinctly the lazy clear scheme for rseq_cs.

Without the lazy clear scheme, a rseq c.s. would look like:

 *                     init(rseq_cs)
 *                     cpu = TLS->rseq::cpu_id_start
 *   [1]               TLS->rseq::rseq_cs = rseq_cs
 *   [start_ip]        ----------------------------
 *   [2]               if (cpu != TLS->rseq::cpu_id)
 *                             goto abort_ip;
 *   [3]               <last_instruction_in_cs>
 *   [post_commit_ip]  ----------------------------
 *   [4]               TLS->rseq::rseq_cs = NULL

But as a fast-path optimization, [4] is not entirely needed because the rseq_cs
descriptor contains information about the instruction pointer range of the critical
section. Therefore, userspace can omit [4], but if the kernel never clears it, it
means that it will have to re-read the rseq_cs descriptor's content each time it
needs to check it to confirm that it is not nested over a rseq c.s..

So making the kernel lazily clear the rseq_cs pointer is just an optimization which
ensures that the kernel won't do useless work the next time it needs to check
rseq_cs, given that it has already validated that the userspace code is currently
not within the rseq c.s. currently advertised by the rseq_cs field.

> 
> Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an
> rseq critical section, and because syscalls in critical sections are illegal, by
> definition clearing rseq_cs is a nop unless userspace is misbehaving.

Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ
code executed when returning from ioctl to userspace will be able to validate that
it is not nested within a rseq critical section.

> 
> If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or is it
> not worth the extra code to detect an error that will likely be caught anyways?

The error will indeed already be caught on return from ioctl to userspace, so I
don't see any added value in duplicating this check.

Thanks,

Mathieu

> 
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 35f7bd0fced0..28b8342290b0 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig,
> struct pt_regs *regs)
> 
>        if (unlikely(t->flags & PF_EXITING))
>                return;
> +       if (!regs) {
> +#ifdef CONFIG_DEBUG_RSEQ
> +               if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs))
> +                       goto error;
> +#endif
> +               return;
> +       }
>        ret = rseq_ip_fixup(regs);
>        if (unlikely(ret < 0))
>                goto error;
> 
>> Thanks for looking into this !
>> 
>> Mathieu
>> 
>> > 		return clear_rseq_cs(t);
>> > 	ret = rseq_need_restart(t, rseq_cs.flags);
>> > 	if (ret <= 0)
>> > --
>> > 2.33.0.rc1.237.g0d66db33f3-goog
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com