Message-ID: <55CCB510.3060807@redhat.com>
Date: Thu, 13 Aug 2015 17:17:36 +0200
From: Denys Vlasenko <dvlasenk@...hat.com>
To: David Drysdale <drysdale@...gle.com>,
Kees Cook <keescook@...omium.org>,
Andy Lutomirski <luto@...capital.net>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Will Drewry <wad@...omium.org>, Ingo Molnar <mingo@...nel.org>
CC: Alok Kataria <akataria@...are.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Borislav Petkov <bp@...en8.de>,
Alexei Starovoitov <ast@...mgrid.com>,
Frederic Weisbecker <fweisbec@...il.com>,
"H. Peter Anvin" <hpa@...or.com>, Oleg Nesterov <oleg@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, X86 ML <x86@...nel.org>
Subject: Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values
wrong in VM?
On 08/13/2015 10:30 AM, David Drysdale wrote:
> Hi folks,
>
> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
> else could reproduce it.
>
> The problem occurs with a seccomp-bpf filter program that's set up to return
> an errno value -- an errno of 1 is always returned instead of what's in the
> filter, plus other oddities (selftest output below).
>
> The problem seems to need a combination of circumstances to occur:
>
> - The seccomp-bpf userspace program needs to be 32-bit, running against a
> 64-bit kernel -- I'm testing with seccomp_bpf from
> tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
Does it work correctly when built as a 64-bit program?
>
> - The kernel needs to be running as a VM guest -- it occurs inside my
> VMware Fusion host, but not if I run on bare metal. Kees tells me he
> cannot repro with a kvm guest though.
>
> Bisecting indicates that the commit that induces the problem is
> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
> 64-bit logic", included in all the v4.2-rc* candidates.
>
> Apologies if I've just got something odd with my local setup, but the
> bisection was unequivocal enough that I thought it worth reporting...
>
> Thanks,
> David
>
>
> seccomp_bpf failure outputs:
>
> seccomp_bpf.c:533:global.ERRNO_valid:Expected 7 (7) ==
> (*__errno_location ()) (1)
Test source code:

TEST(ERRNO_valid)
{
	struct sock_filter filter[] = {
		BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
			offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_read, 0, 1),
		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | E2BIG),
		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = (unsigned short)ARRAY_SIZE(filter),
		.filter = filter,
	};
	long ret;
	pid_t parent = getppid();

	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	ASSERT_EQ(0, ret);

	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
	ASSERT_EQ(0, ret);

	EXPECT_EQ(parent, syscall(__NR_getppid));
	EXPECT_EQ(-1, read(0, NULL, 0));
	EXPECT_EQ(E2BIG, errno);
}

The last EXPECT expects 7 (E2BIG) but sees 1.
I'm trying to see how that happens.

SECCOMP_RET_ERRNO action is processed as follows:

static u32 __seccomp_phase1_filter(int this_syscall, struct seccomp_data *sd)
{
...
	case SECCOMP_RET_ERRNO:
		/* Set low-order bits as an errno, capped at MAX_ERRNO. */
		if (data > MAX_ERRNO)
			data = MAX_ERRNO;
		syscall_set_return_value(current, task_pt_regs(current),
					 -data, 0);
		goto skip;
...
skip:
	audit_seccomp(this_syscall, 0, action);
	return SECCOMP_PHASE1_SKIP; // "the syscall should not be invoked"
}
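
To make the capping step concrete, here is a minimal user-space sketch of how the errno would be derived from the filter's return value. It is not kernel code: the function name seccomp_errno_return() is hypothetical, and the constants are copied by hand under the assumption that they match the kernel's SECCOMP_RET_ERRNO, SECCOMP_RET_DATA and MAX_ERRNO definitions.

```c
#include <stdint.h>

/* Assumed to mirror the kernel's definitions; copied by hand for this sketch. */
#define SECCOMP_RET_ERRNO 0x00050000U /* action: fail the syscall with an errno */
#define SECCOMP_RET_DATA  0x0000ffffU /* low 16 bits carry the errno value */
#define MAX_ERRNO         4095

/*
 * Hypothetical helper: given the 32-bit value returned by a filter whose
 * action is SECCOMP_RET_ERRNO, compute what would be stored in pt_regs->ax.
 */
long seccomp_errno_return(uint32_t filter_ret)
{
	uint32_t data = filter_ret & SECCOMP_RET_DATA;

	/* Cap the low-order bits at MAX_ERRNO, as in the excerpt above. */
	if (data > MAX_ERRNO)
		data = MAX_ERRNO;

	return -(long)data;
}
```

For the filter in the test, seccomp_errno_return(SECCOMP_RET_ERRNO | 7) yields -7, i.e. -E2BIG, so the capping itself should not be able to turn 7 into 1.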

The above is called from:

unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
{
...
	if (work & _TIF_SECCOMP) {
		... ret = seccomp_phase1(&sd);
		if (ret == SECCOMP_PHASE1_SKIP) {
			regs->orig_ax = -1;
			ret = 0;
		}
...
	}

	/* Do our best to finish without phase 2. */
	if (work == 0)
		return ret; /* seccomp and/or nohz only (ret == 0 here) */

#ifdef CONFIG_AUDITSYSCALL
	if (work == _TIF_SYSCALL_AUDIT) {
		/*
		 * If there is no more work to be done except auditing,
		 * then audit in phase 1. Phase 2 always audits, so, if
		 * we audit here, then we can't go on to phase 2.
		 */
		do_audit_syscall_entry(regs, arch);
		return 0;
	}
#endif

	return 1; /* Something is enabled that we can't handle in phase 1 */
}
...
long syscall_trace_enter(struct pt_regs *regs)
{
	u32 arch = is_ia32_task() ? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
	unsigned long phase1_result = syscall_trace_enter_phase1(regs, arch);

	if (phase1_result == 0)
		return regs->orig_ax;
	else
		return syscall_trace_enter_phase2(regs, arch, phase1_result);
}

The end result should be:

	pt_regs->ax = -E2BIG (via syscall_set_return_value())
	pt_regs->orig_ax = -1 ("skip syscall")

and syscall_trace_enter_phase1() usually returns 0,
meaning "re-execute syscall at once, no phase2 needed".

This, in turn, is called from the .S files, and when it returns there,
execution loops back to syscall dispatch.
Because orig_ax is -1, syscall dispatch should skip calling the syscall,
so -E2BIG should survive and be returned...
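
The expected flow can be sketched as a small user-space model. This is only an illustration of the reasoning above, not the kernel's entry code: struct model_regs, MODEL_NR_SYSCALLS and every model_* function are hypothetical names, the table size is arbitrary, and the final step assumes the usual libc convention of turning a return value in the range -4095..-1 into -1 with errno set.

```c
#include <errno.h>

#define MODEL_NR_SYSCALLS 512UL /* hypothetical syscall table size */
#define MODEL_MAX_ERRNO   4095L

/* Hypothetical stand-in for the two pt_regs fields discussed above. */
struct model_regs {
	long ax;      /* return value slot */
	long orig_ax; /* syscall number */
};

/* Models phase 1 when the filter verdict is SECCOMP_RET_ERRNO | err. */
static long model_trace_enter(struct model_regs *regs, long err)
{
	regs->ax = -err;      /* what syscall_set_return_value() would store */
	regs->orig_ax = -1;   /* "skip syscall" */
	return regs->orig_ax; /* phase 1 returned 0, so orig_ax is handed back */
}

/* Models dispatch: an out-of-range number must leave regs->ax untouched. */
static void model_dispatch(struct model_regs *regs, long nr)
{
	if ((unsigned long)nr < MODEL_NR_SYSCALLS)
		regs->ax = 0; /* stand-in for a real handler's result */
}

/* Models the libc wrapper: a negative errno becomes -1 plus errno. */
static long model_libc_return(long ax)
{
	if (ax < 0 && ax >= -MODEL_MAX_ERRNO) {
		errno = (int)-ax;
		return -1;
	}
	return ax;
}

/* Full path for one filtered syscall, as the reasoning above expects it. */
long model_filtered_syscall(long nr, long err)
{
	struct model_regs regs = { .ax = 0, .orig_ax = nr };

	nr = model_trace_enter(&regs, err);
	model_dispatch(&regs, nr);
	return model_libc_return(regs.ax);
}
```

In this model, model_filtered_syscall(3, E2BIG) returns -1 with errno set to E2BIG, which is what the test expects. Seeing errno == 1 instead would mean that, somewhere between the set-return-value step and the return to user space, regs->ax is being overwritten on the real system.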