lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Fri, 7 May 2021 00:15:46 -0700
From:   Sargun Dhillon <sargun@...gun.me>
To:     Andy Lutomirski <luto@...nel.org>
Cc:     Rodrigo Campos <rodrigo@...volk.io>,
        Kees Cook <keescook@...omium.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux Containers <containers@...ts.linux-foundation.org>,
        Mauricio Vásquez Bernal <mauricio@...volk.io>,
        Tycho Andersen <tycho@...ho.pizza>,
        Giuseppe Scrivano <gscrivan@...hat.com>,
        Christian Brauner <christian.brauner@...ntu.com>
Subject: Re: [PATCH v2 2/5] seccomp: Add wait_killable semantic to seccomp
 user notifier

On Thu, May 6, 2021 at 4:59 PM Andy Lutomirski <luto@...nel.org> wrote:
>
> On Thu, May 6, 2021 at 12:35 PM Rodrigo Campos <rodrigo@...volk.io> wrote:
>
> > We want to be woken-up in seccomp_do_user_notification() so we know we
> > received a signal and notify to the supervisor about it. IMHO, we want
> > an additional "if" here[2], like (haven't checked the retun codes of
> > functions) "if err != 0"  and do a "wake_up_poll(&match->wqh, EPOLLIN
> > | EPOLLRDNORM);" or POLLPRI or something, and then "goto wait". IOW,
> > we only return when the supervisor tells us to, if a signal arrives we
> > just let the supervisor know and continue waiting for a response. The
> > proper response might be "I wrote partial data, return only 3 bytes",
> > etc. This way we can properly emulate it.
> >
> > What is not clear to me, and I'd like others to comment is:
> >
> > 1. How should this new "the supervisor should handle signals" mode be enabled?
> >    a. If we use an extra ioctl, as Andy suggested, I think we have a
> > race too: from the moment the seccomp syscall is issued until the new
> > ioctl is called there is a race. If the container does a syscall in
> > that window, it _can_ incorrectly return EINTR as it does now. This
> > race can be very small and the ioctl can make _all the next syscalls_
> > wait in this new mode, so maybe the race is so small that we don't
> > care in practice.
>
> If the ioctl is sticky, it can presumably avoid the race entirely by
> having whoever calls seccomp() immediately do a dummy syscall or
> otherwise wait for the notifier to confirm that it has done the ioctl.
> Admittedly, this may be annoying.
>
> >    b. But if we want to really fully solve the issue, the other option
> > that comes to mind is adding a new flag for the seccomp syscall, to
> > signal that the supervisor will handle signals and should be
> > forwarded. This way, if the container does a new syscall it will be
> > put to wait in this new mode (we have the information that the new
> > mode should be used already). Something like
> > SECCOMP_FILTER_FLAG_NEW_LISTENER_SOMETHING (today we have
> > SECCOMP_FILTER_FLAG_NEW_LISTENER). Ideally shorter :D
>
> I like that better.
>
> >
> > 2. What should we notify to the supervisor? Only that a signal arrived
> > or also which signal it was? If we do the EPOLLPRI thingy, as Andy
> > mentioned in a previous thread, IIUC we will only notify that _a_
> > signal arrived, but not _which_. To notify which signal arrived might
> > be complex, though, (not sure, I haven't explored how that should be).
>
> Are there any examples in the kernel of syscalls that care *which*
> signal came in?  As far as I know, there are really only three states
> that matter: no signal is pending, a kill is pending, or a regular
> signal is pending.  (The latter means that there's a signal that is
> unmasked, etc and will should therefore interrupt a syscall if the
> syscall can be interrupted.

Andy,
I do not believe we want full preemption on all syscalls. I think we want
the ability to have "safe" (for some definition of safe) preemption on some
syscalls.

For example, a syscall like connect should allow for arbitrary
preemption, even if the supervisor isn't able to get to it in time (using
alarm here is a thing I've seen as a common pattern). If a connect
starts, and is then preempted, you get an EINTR. glibc will retry the
connect, and if it was already in progress, you should get an
EINPROGRESS, otherwise no error. This is "correct" behaviour,
but not all syscalls do this.

I think we can take a slightly lighterweight approach in in the filter
action, we can include a flag in the data of the action that the filter
return which signifies whether or not the call should be preemptible
before being seen by the supervisor, and another flag which indicates
that once the supervisor has RECV'd the notification, it should not
allow for interruption.

I agree though, that if the call is in the non-preemptible state we need to
have a mechanism to notify the supervisor to cancel work, and
return to the notifying process. I think that if we do the POLLPRI thing,
then there should be a RECV_SIGNAL ioctl to read POLLPRI notifications,
that indicate a particular notification has been preempted (signaled). At this
point, the supervisor can do whatever it wants.

There is also a slight other race condition:
1. Notification is generated wait non-preemptible flag in data
2. Program returns signal
3. At this point, should the kernel do POLLPRI or wait for
  the supervisor to first read the notification. Should we add
  some metadata to the notification indicating it was attempted
  to be preempted?

Would the following work:
1. Add a two flags to the data, indicating pre-notification preemption being
  allowed and return -EINTR, and once the notification is picked up by
  the supervisor transitioning into an -EINTR state
2. Make it so if an in-progress notification is interrupted via a
signal, it will
    send a POLLPRI, which is read via RECV_SIGNAL.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ