Message-ID: <20200716131755.l5tsyhupimpinlfi@yavin.dot.cyphar.com>
Date: Thu, 16 Jul 2020 23:17:55 +1000
From: Aleksa Sarai <cyphar@...har.com>
To: Kees Cook <keescook@...omium.org>
Cc: Pavel Begunkov <asml.silence@...il.com>,
Miklos Szeredi <miklos@...redi.hu>,
Matthew Wilcox <willy@...radead.org>,
Andy Lutomirski <luto@...capital.net>,
Jann Horn <jannh@...gle.com>,
Stefano Garzarella <sgarzare@...hat.com>,
Christian Brauner <christian.brauner@...ntu.com>,
strace-devel@...ts.strace.io, io-uring@...r.kernel.org,
Linux API <linux-api@...r.kernel.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Michael Kerrisk <mtk.manpages@...il.com>
Subject: Re: strace of io_uring events?
On 2020-07-15, Kees Cook <keescook@...omium.org> wrote:
> Earlier Andy Lutomirski wrote:
> > Let’s add some seccomp folks. We probably also want to be able to run
> > seccomp-like filters on io_uring requests. So maybe io_uring should call into
> > seccomp-and-tracing code for each action.
>
> Okay, I'm finally able to spend time looking at this. And thank you to
> the many people that CCed me into this and earlier discussions (at least
> Jann, Christian, and Andy).
>
> It *seems* like there is a really clean mapping of SQE OPs to syscalls.
> To that end, yes, it should be trivial to add ptrace and seccomp support
> (sort of). The trouble comes for doing _interception_, which is how both
> ptrace and seccomp are designed.
>
> In the basic case of seccomp, various syscalls are just being checked
> for accept/reject. It seems like that would be easy to wire up. For the
> more ptrace-y things (SECCOMP_RET_TRAP, SECCOMP_RET_USER_NOTIF, etc),
> I think any such results would need to be "upgraded" to "reject". Things
> are a bit complex in that seccomp's form of "reject" can be "return
> errno" (easy) or it can be "kill thread (or thread_group)" which ...
> becomes less clear. (More on this later.)
>
> In the basic case of "I want to run strace", this is really just a
> creative use of ptrace in that interception is being used only for
> reporting. Does ptrace need to grow a way to create/attach an io_uring
> eventfd? Or should there be an entirely different tool for
> administrative analysis of io_uring events (kind of how disk IO can be
> monitored)?
I would hope that we wouldn't introduce ptrace to io_uring, because
unless we plan to attach to io_uring events via GDB it's simply the
wrong tool for the job. strace does use ptrace, but that's mostly
because Linux's dynamic tracing was still in its infancy at the time
(and even today it requires more privileges than ptrace) -- but you can
emulate strace using bpftrace these days fairly easily.
So really what is being asked here is "can we make it possible to debug
io_uring programs as easily as traditional I/O programs". And this does
not require ptrace, nor should ptrace be part of this discussion IMHO. I
believe this issue (along with seccomp-style filtering) has been
mentioned informally in the past, but I am happy to finally see a thread
about this appear.
> For io_uring generally, I have a few comments/questions:
>
> - Why did a new syscall get added that couldn't be extended? All new
> syscalls should be using Extended Arguments. :(
io_uring was introduced in Linux 5.1, predating clone3() and openat2().
My larger concern is that io_uring operations aren't extensible-structs
-- but we can resolve that issue with some slight ugliness if we ever
run into the problem.
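
For readers unfamiliar with the term, the size-negotiation rule behind extensible structs (the approach clone3() and openat2() take) can be sketched in userspace roughly as below. This is only an illustration under stated assumptions: the function name `copy_struct_sketch` and its signature are hypothetical, and it is not the kernel's own helper. The caller passes the struct size it knows about; a newer caller is accepted only if the fields the receiver does not understand are all zero, and an older caller has the missing fields defaulted to zero.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace illustration of extensible-struct size negotiation.
 * dst/ksize model the receiver's view of the struct; src/usize model
 * what the caller passed. Returns 0 on success or a negative errno.
 * (Hypothetical helper, named for this sketch only.)
 */
static int copy_struct_sketch(void *dst, size_t ksize,
                              const void *src, size_t usize)
{
    size_t size = usize < ksize ? usize : ksize;
    const unsigned char *p = src;

    /* Caller is newer: fields the receiver does not know must be zero. */
    for (size_t i = ksize; i < usize; i++)
        if (p[i] != 0)
            return -E2BIG;

    /* Caller is older: fields it did not supply default to zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, size);
    return 0;
}
```

An interface built this way can grow new trailing fields without a new syscall, which is the property the fixed-layout io_uring operations lack.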
> - Why aren't the io_uring syscalls in the man-page git? (It seems like
> they're in liburing, but that's should document the _library_ not the
> syscalls, yes?)
I imagine because using the syscall requires specific memory barriers
which we probably don't want most C programs to be fiddling with
directly. Sort of similar to how iptables doesn't have a syscall-style
man page.
> Speaking to Stefano's proposal[1]:
>
> - There appear to be three classes of desired restrictions:
> - opcodes for io_uring_register() (which can be enforced entirely with
> seccomp right now).
> - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
> not currently written)
> - opcodes of the types of restrictions to restrict... for making sure
> things can't be changed after being set? seccomp already enforces
> that kind of "can only be made stricter"
Unless I misunderstood the patch cover-letter, Stefano's proposal is to
have a mechanism for adding restrictions to individual io_urings -- so
we still need a separate mechanism (or an extended version of Stefano's
proposal) to allow for the "reduce attack surface" use case of seccomp.
It seems to me like Stefano's proposal is more related to cases where
you might SCM_RIGHTS-send an io_uring to an unprivileged process.
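
As a rough mental model of a per-ring restriction of this kind, the sketch below keeps an opcode allowlist attached to a single ring and freezes it before the ring is handed to someone else. Everything here is hypothetical (the struct, the function names, the 256-opcode bound) and it does not reflect the actual patch series; it only illustrates why such a check is scoped to one ring rather than to the whole process the way a seccomp filter is.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_OPS 256 /* hypothetical bound on opcode values */

/* Hypothetical per-ring restriction state; not a kernel structure. */
struct ring_restrictions_sketch {
    uint64_t allowed_ops[SKETCH_MAX_OPS / 64]; /* one bit per SQE opcode */
    bool locked;                               /* true once restrictions apply */
};

/* Add one opcode to the allowlist; refused once the ring is locked. */
static bool sketch_allow_op(struct ring_restrictions_sketch *r, uint8_t op)
{
    if (r->locked)
        return false;
    r->allowed_ops[op / 64] |= UINT64_C(1) << (op % 64);
    return true;
}

/* Freeze the allowlist, e.g. before passing the ring to another process. */
static void sketch_lock(struct ring_restrictions_sketch *r)
{
    r->locked = true;
}

/* Check run for each submitted SQE; an unlocked ring accepts everything. */
static bool sketch_op_permitted(const struct ring_restrictions_sketch *r,
                                uint8_t op)
{
    if (!r->locked)
        return true;
    return (r->allowed_ops[op / 64] >> (op % 64)) & 1;
}
```

Because the state lives in the ring, a process holding two rings could have one restricted and one unrestricted, which is why a separate mechanism is still needed for process-wide attack-surface reduction.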
> Solving the mapping of seccomp interception types into CQEs (or anything
> more severe) will likely inform what it would mean to map ptrace events
> to CQEs. So, I think they're related, and we should get seccomp hooked
> up right away, and that might help us see how (if) ptrace should be
> attached.
We could just emulate the seccomp-bpf API with the pseudo-syscalls done
as a result of CQEs, though I'm not sure how happy folks will be with
this kind of glue code in "seccomp-uring" (though in theory it would
allow us to attach existing filters to io_uring...).
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>