linux-kernel - Re: [PATCH] riscv/entry: get correct syscall number from syscall_get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87a5exy2rx.fsf@all.your.base.are.belong.to.us>
Date: Mon, 21 Oct 2024 07:00:18 -0700
From: Björn Töpel <bjorn@...nel.org>
To: Celeste Liu via B4 Relay <devnull+CoelacanthusHex.gmail.com@...nel.org>,
 Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt
 <palmer@...belt.com>, Albert Ou <aou@...s.berkeley.edu>, Björn Töpel
 <bjorn@...osinc.com>, Celeste Liu <coelacanthushex@...il.com>
Cc: Palmer Dabbelt <palmer@...osinc.com>, Alexandre Ghiti <alex@...ti.fr>,
 "Dmitry V. Levin" <ldv@...ace.io>, Andrea Bolognani <abologna@...hat.com>,
 Felix Yan <felixonmars@...hlinux.org>, Ruizhe Pan <c141028@...il.com>,
 Shiqi Zhang <shiqi@...c.iscas.ac.cn>, Guo Ren <guoren@...nel.org>, Yao Zi
 <ziyao@...root.org>, Han Gao <gaohan@...as.ac.cn>,
 linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org,
 stable@...r.kernel.org, Celeste Liu <CoelacanthusHex@...il.com>, Thomas
 Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH] riscv/entry: get correct syscall number from
 syscall_get_nr()

Celeste!

I'll pick up your thread [1] here, since there's code here! :-) Let's
try to clear up/define the possible flow.

The common/generic entry syscall_enter_from_user_mode{,_work}() says
that a return value of -1 means that the syscall should be skipped.

The RISC-V calling convention uses the same register for arg0/retval
(a0). So, when a ptracer (PTRACE_SYSCALL+PTRACE_GETREGS). That means
that the kernel cannot call into the tracer, *after* changing a0.

The interesting flow for a syscall is roughly:
1. The exception/trap function
2. syscall_enter_from_user_mode() which might return -1, meaning that
   the syscall should be skipped. A tracer might have altered the
   regs. More importantly, if it's -1 the kernel cannot change the
   return value, because seccomp filtering might already done that.
3. If it's a valid syscall, perform the syscall.

Now some scenarios:
* A user does a valid syscall.
* A user does an invalid syscall(-1)
* A user does an invalid syscall(!= -1)

Then there're the tracing variants of those, and that's where we go get
trouble.

Now the bugs we've seen in RISC-V:

1. strace "tracing": Requires that regs->a0 is not tampered with prior
   ptrace notification

   E.g.:
   | # ./strace /
   | execve("/", ["/"], 0x7ffffaac3890 /* 21 vars */) = -1 EACCES (Permission denied)
   | ./strace: exec: Permission denied
   | +++ exited with 1 +++
   | # ./disable_ptrace_get_syscall_info ./strace /
   | execve(0xffffffffffffffda, ["/"], 0x7fffd893ce10 /* 21 vars */) = -1 EACCES (Permission denied)
   | ./strace: exec: Permission denied
   | +++ exited with 1 +++

   In the second case, arg0 is prematurely set to -ENOSYS
   (0xffffffffffffffda).

2. strace "syscall tampering": Requires that ENOSYS is returned for
   syscall(-1), and not skipped w/o a proper return value.
   
   E.g.:
   | ./strace -a0 -ewrite -einject=write:error=enospc echo helloject=write:error=enospc echo hello   
   
   Here, strace expects that injecting -1, would result in a ENOSYS.

3. seccomp filtering: Requires that the a0 is not tampered to

First 3 was broken (tampering with a0 after seccomp), then 2 was
broken (not setting ENOSYS for -1), and finally 1 was broken (and
still is!).

Now for your patch:

Celeste Liu via B4 Relay <devnull+CoelacanthusHex.gmail.com@...nel.org>
writes:

> From: Celeste Liu <CoelacanthusHex@...il.com>
>
> The return value of syscall_enter_from_user_mode() is always -1 when the
> syscall was filtered. We can't know whether syscall_nr is -1 when we get -1
> from syscall_enter_from_user_mode(). And the old syscall variable is
> unusable because syscall_enter_from_user_mode() may change a7 register.
> So get correct syscall number from syscall_get_nr().
>
> So syscall number part of return value of syscall_enter_from_user_mode()
> is completely useless. We can remove it from API and require caller to
> get syscall number from syscall_get_nr(). But this change affect more
> architectures and will block more time. So we split it into another
> patchset to avoid block this fix. (Other architectures can works
> without this change but riscv need it, see Link: tag below)
>
> Fixes: 61119394631f ("riscv: entry: always initialize regs->a0 to -ENOSYS")
> Reported-by: Andrea Bolognani <abologna@...hat.com>
> Closes: https://github.com/strace/strace/issues/315
> Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
> Signed-off-by: Celeste Liu <CoelacanthusHex@...il.com>
> ---
>  arch/riscv/kernel/traps.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index 51ebfd23e0076447518081d137102a9a11ff2e45..3125fab8ee4af468ace9f692dd34e1797555cce3 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -316,18 +316,25 @@ void do_trap_ecall_u(struct pt_regs *regs)
>  {
>  	if (user_mode(regs)) {
>  		long syscall = regs->a7;
> +		long res;
>  
>  		regs->epc += 4;
>  		regs->orig_a0 = regs->a0;
> -		regs->a0 = -ENOSYS;
>  
>  		riscv_v_vstate_discard(regs);
>  
> -		syscall = syscall_enter_from_user_mode(regs, syscall);
> +		res = syscall_enter_from_user_mode(regs, syscall);
> +		/*
> +		 * Call syscall_get_nr() again because syscall_enter_from_user_mode()
> +		 * may change a7 register.
> +		 */
> +		syscall = syscall_get_nr(current, regs);
>  
>  		add_random_kstack_offset();
>  
> -		if (syscall >= 0 && syscall < NR_syscalls)
> +		if (syscall < 0 || syscall >= NR_syscalls)
> +			regs->a0 = -ENOSYS;
> +		else if (res != -1)
>  			syscall_handler(regs, syscall);

Here we can perform the syscall, even if res is -1. E.g., if this path
[2] is taken, we might have a valid syscall number in a7, but the
syscall should not be performed.

Also, one reason for the generic entry is so that it should be less
work. Here, you pull (IMO) details that belong to the common entry
implementation/API up the common entry user. Wdyt about pushing it down
to common entry? Something like:

--8<--
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 51ebfd23e007..66fded8e4b60 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -319,7 +319,6 @@ void do_trap_ecall_u(struct pt_regs *regs)
 
 		regs->epc += 4;
 		regs->orig_a0 = regs->a0;
-		regs->a0 = -ENOSYS;
 
 		riscv_v_vstate_discard(regs);
 
@@ -329,6 +328,8 @@ void do_trap_ecall_u(struct pt_regs *regs)
 
 		if (syscall >= 0 && syscall < NR_syscalls)
 			syscall_handler(regs, syscall);
+		else if (syscall != -1)
+			regs->a0 = -ENOSYS;
 
 		/*
 		 * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1e50cdb83ae5..9b69c2ad4f12 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -14,6 +14,7 @@
 #include <linux/kmsan.h>
 
 #include <asm/entry-common.h>
+#include <asm/syscall.h>
 
 /*
  * Define dummy _TIF work flags if not defined by the architecture or for
@@ -166,6 +167,8 @@ static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *re
 
 	if (work & SYSCALL_WORK_ENTER)
 		syscall = syscall_trace_enter(regs, syscall, work);
+	else if (syscall == -1L)
+		syscall_set_return_value(current, regs, -ENOSYS, 0);
 
 	return syscall;
 }
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 5b6934e23c21..99742aee5002 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -43,8 +43,10 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
 	/* Handle ptrace */
 	if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
 		ret = ptrace_report_syscall_entry(regs);
-		if (ret || (work & SYSCALL_WORK_SYSCALL_EMU))
+		if (ret || (work & SYSCALL_WORK_SYSCALL_EMU)) {
+			syscall_set_return_value(current, regs, -ENOSYS, 0);
 			return -1L;
+		}
 	}
 
 	/* Do seccomp after ptrace, to catch any tracer changes. */
@@ -66,6 +68,14 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
 		syscall = syscall_get_nr(current, regs);
 	}
 
+	/*
+	 * If we're not setting the return value here, the context is
+	 * gone; For higher callers, -1 means that the syscall should
+	 * be skipped.
+	 */
+	if (syscall == -1L)
+		syscall_set_return_value(current, regs, -ENOSYS, 0);
+
 	syscall_enter_audit(regs, syscall);
 
 	return ret ? : syscall;
--8<--

I did a quick test of the above, and it seems to cover all the previous
bugs -- but who knows! ;-)


Björn

[1] https://lore.kernel.org/linux-riscv/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/entry/common.c#n47