linux-kernel - Re: rseq with syscall as the last instruction

Message-ID: <CACT4Y+aJJPhDKB51U3d=Jq5HLAupsP+giNKqi67jgpXoPg=4oQ@mail.gmail.com>
Date:   Fri, 1 Oct 2021 14:55:33 +0200
From:   Dmitry Vyukov <dvyukov@...gle.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Boqun Feng <boqun.feng@...il.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: rseq with syscall as the last instruction

On Thu, 30 Sept 2021 at 16:01, Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Tue, Sep 28, 2021 at 11:09:24AM +0200, Dmitry Vyukov wrote:
> > Hi rseq maintainers,
> >
> > I wonder if rseq can be used in the following scenario (or extended to be used).
> > I want to pass extra arguments to syscalls using a kind of
> > side-channel, for example, to say "do fault injection for the next
> > system call", or "trace the next system call". But what counts as the
> > "next" system call must be atomic with respect to signals.
> > Let's say there is shared per-task memory location known to the kernel
> > where these arguments can be stored:
> >
> > __thread struct trace_descriptor desk;
> > prctl(REGISTER_PER_TASK_TRACE_DESCRIPTOR, &desk);
> >
> > then before a system call I can set up the descriptor to enable tracing:
> >
> > desk = ...
> > SYSCALL;
> >
> > The problem is that if a signal arrives between setting up desk and
> > the SYSCALL instruction, we will actually trace some unrelated syscall
> > in the signal handler.
> > Potentially the kernel could switch/restore 'desk' around signal
> > delivery, but it becomes tricky/impossible for signal handlers that do
> > longjmp or mess with PC in other ways; and also would require
> > extending ucontext to include the descriptor information (not sure if
> > it's feasible).
> >
> > So instead the idea is to protect this sequence with rseq that will be
> > restarted on signal delivery:
> >
> > enter rseq critical section with end right after SYSCALL instruction;
> > desk = ...
> > SYSCALL;
> >
> > Then, the kernel can simply clear 'desk' on syscall entry.
> >
> > rseq docs seem to suggest that this can work:
> >
> > https://lwn.net/Articles/774098/
> > +Restartable sequences are atomic with respect to preemption (making it
> > +atomic with respect to other threads running on the same CPU), as well
> > +as signal delivery (user-space execution contexts nested over the same
> > +thread). They either complete atomically with respect to preemption on
> > +the current CPU and signal delivery, or they are aborted.
> >
> > But the doc also says that the sequence must not do syscalls:
> >
> > +Restartable sequences must not perform system calls. Doing so may result
> > +in termination of the process by a segmentation fault.
> >
> > The question is:
> > Can this restriction be weakened to allow syscalls as the last instruction?
> > For flags in this case we would pass
> > RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT and
> > RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE, but no
> > RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL.
> >
> > I don't see any fundamental reason why this couldn't work, because if
> > we restart only on signals, then once we reach the syscall, the rseq
> > critical section is committed, right?
> >
> > Do you have a feeling for how hard this would be to support, or
> > whether there might be implementation issues?
>
> IIRC the only enforcement of this constraint is rseq_syscall() (which is
> a NOP when !CONFIG_DEBUG_RSEQ, because performance).
>
> However, since we use regs->ip, which for SYSCALL points to right
> *after* the SYSCALL instruction (for obvious reasons), it will not in
> fact match in_rseq_cs().
>
> And as such, I think your scheme should just work as is. Did you try?

Well, no, I did not try (I wasn't sure how to interpret the results).
Thanks, we will consider this option as well then.
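
For reference, the range argument above can be sketched in plain C. This
is only an illustration, not kernel code: struct cs_range and both helper
names are hypothetical, the comparison is assumed to mirror the kernel's
in_rseq_cs() check, and the 2-byte instruction length assumes the x86-64
SYSCALL encoding (0f 05).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the critical-section descriptor; only the
 * two fields needed for the range check are modelled here. */
struct cs_range {
	uint64_t start_ip;           /* first instruction of the section */
	uint64_t post_commit_offset; /* section length in bytes */
};

/* Assumed to mirror in_rseq_cs(): true when ip lies in
 * [start_ip, start_ip + post_commit_offset). The unsigned subtraction
 * wraps for ip < start_ip, so that case is rejected as well. */
bool ip_in_cs(uint64_t ip, const struct cs_range *cs)
{
	return ip - cs->start_ip < cs->post_commit_offset;
}

/* On syscall entry the saved ip points just past the SYSCALL
 * instruction (assumption: x86-64, 2-byte encoding). */
uint64_t ip_after_syscall(uint64_t syscall_insn_ip)
{
	return syscall_insn_ip + 2;
}
```

With a section starting at 0x1000 and a post_commit_offset of 0x20, a
SYSCALL placed at 0x101e is the last instruction of the section. The
instruction itself is inside the range, but the ip saved on syscall entry
is 0x1020, which is outside it, so the section already counts as
committed.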