linux-kernel - Re: For review: seccomp_user

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <91b74ce1-de95-2b92-c62e-e2715d6071d3@gmail.com>
Date:   Thu, 29 Oct 2020 20:53:45 +0100
From:   "Michael Kerrisk (man-pages)" <mtk.manpages@...il.com>
To:     Christian Brauner <christian.brauner@...onical.com>
Cc:     mtk.manpages@...il.com, Tycho Andersen <tycho@...ho.pizza>,
        Sargun Dhillon <sargun@...gun.me>,
        Giuseppe Scrivano <gscrivan@...hat.com>,
        Song Liu <songliubraving@...com>,
        Will Drewry <wad@...omium.org>,
        Kees Cook <keescook@...omium.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        linux-man <linux-man@...r.kernel.org>,
        Robert Sesek <rsesek@...gle.com>,
        Containers <containers@...ts.linux-foundation.org>,
        Jann Horn <jannh@...gle.com>,
        lkml <linux-kernel@...r.kernel.org>,
        Alexei Starovoitov <ast@...nel.org>, bpf <bpf@...r.kernel.org>,
        Andy Lutomirski <luto@...capital.net>,
        Christian Brauner <christian@...uner.io>
Subject: Re: For review: seccomp_user_notif(2) manual page [v2]

Hello Christian

Thanks for taking a look at the page.

On 10/29/20 4:26 PM, Christian Brauner wrote:
> On Mon, Oct 26, 2020 at 10:55:04AM +0100, Michael Kerrisk (man-pages) wrote:
>> Hi all (and especially Tycho and Sargun),
>>
>> Following review comments on the first draft (thanks to Jann, Kees,
>> Christian and Tycho), I've made a lot of changes to this page.
>> I've also added a few FIXMEs relating to outstanding API issues.
>> I'd like a second pass review of the page before I release it.
>> But also, this mail serves as a way of noting the outstanding API
>> issues.
>>
>> Tycho: I still have an outstanding question for you at [2].
>>
>> Sargun: can you please prepare something on SECCOMP_ADDFD_FLAG_SETFD
>> and SECCOMP_IOCTL_NOTIF_ADDFD to be added to this page?
>>
>> I've shown the rendered version of the page below. The page source
>> currently sits in a branch at
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=seccomp_user_notif
>>
>> At this point, I'm mainly interested in feedback about the FIXMEs,
>> some of which relate to the text of the page itself, while the
>> others relate to the various outstanding API issues. The first 
>> FIXME provides a small opportunity for some bikeshedding :-);
> 
> I like this manpage. I think this is the most comprehensive explanation
> of any seccomp feature

Thanks (at least, I think so...)

> and somewhat understandable.
      ^^^^^^^^

(... but I'm not sure ;-).)

> Just tiny comments below, hopefully no bike-shedding though. :)

Most relevant point for bikeshedding is the page name. I plan 
to change it to seccomp_unotify(2) (shorter, reads better out loud).

>> Thanks,
>>
>> Michael
>>
>> [1] https://lore.kernel.org/linux-man/45f07f17-18b6-d187-0914-6f341fe90857@gmail.com/
>> [2] https://lore.kernel.org/linux-man/8f20d586-9609-ef83-c85a-272e37e684d8@gmail.com/
>>
>> =====
>>
>> SECCOMP_USER_NOTIF(2)   Linux Programmer's Manual  SECCOMP_USER_NOTIF(2)

[...]

>>        An overview of the steps performed by the target and the
>>        supervisor is as follows:
>>
>>        1. The target establishes a seccomp filter in the usual manner,
>>           but with two differences:
>>
>>           · The seccomp(2) flags argument includes the flag
>>             SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return
>>             value of the (successful) seccomp(2) call is a new
>>             "listening" file descriptor that can be used to receive
>>             notifications.  Only one "listening" seccomp filter can be
>>             installed for a thread.
>>
>>             ┌─────────────────────────────────────────────────────┐
>>             │FIXME                                                │
>>             ├─────────────────────────────────────────────────────┤
>>             │Is the last sentence above correct?                  │
>>             │                                                     │
>>             │Kees Cook (25 Oct 2020) notes:                       │
>>             │                                                     │
>>             │I like this limitation, but I expect that it'll need │
>>             │to change in the future. Even with LSMs, we see the  │
>>             │need for arbitrary stacking, and the idea of there   │
>>             │being only 1 supervisor will eventually break down.  │
>>             │Right now there is only 1 because only container     │
>>             │managers are using this feature. But if some daemon  │
>>             │starts using it to isolate some thread, suddenly it  │
>>             │might break if a container manager is trying to      │
>>             │listen to it too, etc. I expect it won't be needed   │
>>             │soon, but I do think it'll change.                   │
>>             │                                                     │
>>             └─────────────────────────────────────────────────────┘
>>
>>           · In cases where it is appropriate, the seccomp filter returns
>>             the action value SECCOMP_RET_USER_NOTIF.  This return value
>>             will trigger a notification event.
>>
>>        2. In order that the supervisor can obtain notifications using
>>           the listening file descriptor, (a duplicate of) that file
>>           descriptor must be passed from the target to the supervisor.
>>           One way in which this could be done is by passing the file
>>           descriptor over a UNIX domain socket connection between the
>>           target and the supervisor (using the SCM_RIGHTS ancillary
>>           message type described in unix(7)).
> 
> Fwiw, on newer kernels you could also use pidfd_getfd() for that.

Thanks. I added that to the text.

>>        3. The supervisor will receive notification events on the
>>           listening file descriptor.  These events are returned as
>>           structures of type seccomp_notif.  Because this structure and
>>           its size may evolve over kernel versions, the supervisor must
>>           first determine the size of this structure using the
>>           seccomp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
>>           structure of type seccomp_notif_sizes.  The supervisor
>>           allocates a buffer of size seccomp_notif_sizes.seccomp_notif
>>           bytes to receive notification events.  In addition,the
>>           supervisor allocates another buffer of size
>>           seccomp_notif_sizes.seccomp_notif_resp bytes for the response
>>           (a struct seccomp_notif_resp structure) that it will provide
>>           to the kernel (and thus the target).
>>
>>        4. The target then performs its workload, which includes system
>>           calls that will be controlled by the seccomp filter.  Whenever
>>           one of these system calls causes the filter to return the
>>           SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet)
>>           execute the system call; instead, execution of the target is
>>           temporarily blocked inside the kernel (in a sleep state that
>>           is interruptible by signals) and a notification event is
>>           generated on the listening file descriptor.
>>
>>        5. The supervisor can now repeatedly monitor the listening file
>>           descriptor for SECCOMP_RET_USER_NOTIF-triggered events.  To do
>>           this, the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV
>>           ioctl(2) operation to read information about a notification
>>           event; this operation blocks until an event is available.  The
> 
> Maybe mention that users can choose to either use the blocking ioctl()
> directly or use poll semantics and point to the section below.

Thanks. I added mention of the poll/select/epoll here.

> (Do we support O_NONBLOCK with SECCOMP_IOCTL_NOTIF_RECV and if not should
> we?)

A quick test suggests that O_NONBLOCK has no effect on the blocking
behavior of SECCOMP_IOCTL_NOTIF_RECV.

(I've added your question and this info as a FIXME in the page.)

>>           operation returns a seccomp_notif structure containing
>>           information about the system call that is being attempted by
>>           the target.
>>
>>        6. The seccomp_notif structure returned by the
>>           SECCOMP_IOCTL_NOTIF_RECV operation includes the same
>>           information (a seccomp_data structure) that was passed to the
>>           seccomp filter.  This information allows the supervisor to
>>           discover the system call number and the arguments for the
>>           target's system call.  In addition, the notification event
>>           contains the ID of the thread that triggered the notification
>>           and a unique cookie value that is used in subsequent
>>           SECCOMP_IOCTL_NOTIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND
>>           operations.
>>
>>           The information in the notification can be used to discover
>>           the values of pointer arguments for the target's system call.
>>           (This is something that can't be done from within a seccomp
>>           filter.)  One way in which the supervisor can do this is to
>>           open the corresponding /proc/[tid]/mem file (see proc(5)) and
>>           read bytes from the location that corresponds to one of the
>>           pointer arguments whose value is supplied in the notification
>>           event.  (The supervisor must be careful to avoid a race
>>           condition that can occur when doing this; see the description
>>           of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)
>>           In addition, the supervisor can access other system
>>           information that is visible in user space but which is not
>>           accessible from a seccomp filter.
>>
>>        7. Having obtained information as per the previous step, the
>>           supervisor may then choose to perform an action in response to
>>           the target's system call (which, as noted above, is not
>>           executed when the seccomp filter returns the
>>           SECCOMP_RET_USER_NOTIF action value).
>>
>>           One example use case here relates to containers.  The target
>>           may be located inside a container where it does not have
>>           sufficient capabilities to mount a filesystem in the
>>           container's mount namespace.  However, the supervisor may be a
>>           more privileged process that does have sufficient capabilities
>>           to perform the mount operation.
>>
>>        8. The supervisor then sends a response to the notification.  The
>>           information in this response is used by the kernel to
>>           construct a return value for the target's system call and
>>           provide a value that will be assigned to the errno variable of
>>           the target.
>>
>>           The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
>>           ioctl(2) operation, which is used to transmit a
>>           seccomp_notif_resp structure to the kernel.  This structure
>>           includes a cookie value that the supervisor obtained in the
>>           seccomp_notif structure returned by the
>>           SECCOMP_IOCTL_NOTIF_RECV operation.  This cookie value allows
>>           the kernel to associate the response with the target.  This
>>           structure must include the cookie value that the supervisor
>>           obtained in the seccomp_notif structure returned by the
>>           SECCOMP_IOCTL_NOTIF_RECV operation; the cookie allows the
>>           kernel to associate the response with the target.
>>
>>        9. Once the notification has been sent, the system call in the
>>           target thread unblocks, returning the information that was
>>           provided by the supervisor in the notification response.
>>
>>        As a variation on the last two steps, the supervisor can send a
>>        response that tells the kernel that it should execute the target
>>        thread's system call; see the discussion of
>>        SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.
>>
>>    ioctl(2) operations
>>        The following ioctl(2) operations are provided to support seccomp
>>        user-space notification.  For each of these operations, the first
> 
> Hm, since the ioctls() are associatd with the seccomp notify file
> descriptor maybe we should rephrase this a bit to make this more
> obvious:
> "[...] ioctl(2) operations are supported by the seccomp user-space file descriptor"
> That might line-uper better with the following sentence. Just a thought,
> feel free to ignore.

Yep, your idea is better. I changed the text.

>>        (file descriptor) argument of ioctl(2) is the listening file
>>        descriptor returned by a call to seccomp(2) with the
>>        SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
>>
>>        SECCOMP_IOCTL_NOTIF_RECV
>>               This operation is used to obtain a user-space notification
>>               event.  If no such event is currently pending, the
>>               operation blocks until an event occurs.  The third
>>               ioctl(2) argument is a pointer to a structure of the
>>               following form which contains information about the event.
>>               This structure must be zeroed out before the call.
>>
>>                   struct seccomp_notif {
>>                       __u64  id;              /* Cookie */
>>                       __u32  pid;             /* TID of target thread */
>>                       __u32  flags;           /* Currently unused (0) */
>>                       struct seccomp_data data;   /* See seccomp(2) */
>>                   };
>>
>>               The fields in this structure are as follows:
>>
>>               id     This is a cookie for the notification.  Each such
>>                      cookie is guaranteed to be unique for the
>>                      corresponding seccomp filter.
>>
>>                      · It can be used with the
>>                        SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation
>>                        to verify that the target is still alive.
>>
>>                      · When returning a notification response to the
>>                        kernel, the supervisor must include the cookie
>>                        value in the seccomp_notif_resp structure that is
>>                        specified as the argument of the
>>                        SECCOMP_IOCTL_NOTIF_SEND operation.
>>
>>               pid    This is the thread ID of the target thread that
>>                      triggered the notification event.
>>
>>               flags  This is a bit mask of flags providing further
>>                      information on the event.  In the current
>>                      implementation, this field is always zero.
> 
> I think we haven't settled whether this is input or output only. I guess
> we could technically use it for both.

So, change something here in the page?

> 
>>
>>               data   This is a seccomp_data structure containing
>>                      information about the system call that triggered
>>                      the notification.  This is the same structure that
>>                      is passed to the seccomp filter.  See seccomp(2)
>>                      for details of this structure.
>>
>>               On success, this operation returns 0; on failure, -1 is
>>               returned, and errno is set to indicate the cause of the
>>               error.  This operation can fail with the following errors:
>>
>>               EINVAL (since Linux 5.5)
>>                      The seccomp_notif structure that was passed to the
>>                      call contained nonzero fields.
>>
>>               ENOENT The target thread was killed by a signal as the
>>                      notification information was being generated, or
>>                      the target's (blocked) system call was interrupted
>>                      by a signal handler.
> 
> (Technically also EFAULT because the user provided a garbage address.)

Yeah. But that error is kind of presumed anywhere a pointer 
is provided.

>>        ┌─────────────────────────────────────────────────────┐
>>        │FIXME                                                │
>>        ├─────────────────────────────────────────────────────┤
>>        │From my experiments, it appears that if a            │
>>        │SECCOMP_IOCTL_NOTIF_RECV is done after the target    │
>>        │thread terminates, then the ioctl() simply blocks    │
>>        │(rather than returning an error to indicate that the │
>>        │target no longer exists).                            │
>>        │                                                     │
>>        │I found that surprising, and it required some        │
>>        │contortions in the example program.  It was not      │
>>        │possible to code my SIGCHLD handler (which reaps the │
>>        │zombie when the worker/target terminates) to simply  │
>>        │set a flag checked in the main handleNotifications() │
>>        │loop, since this created an unavoidable race where   │
>>        │the child might terminate just after I had checked   │
>>        │the flag, but before I blocked (forever!) in the     │
>>        │SECCOMP_IOCTL_NOTIF_RECV operation. Instead, I had   │
>>        │to code the signal handler to simply call _exit(2)   │
>>        │in order to terminate the parent process (the        │
>>        │supervisor).                                         │
>>        │                                                     │
>>        │Is this expected behavior? It seems to me rather     │
>>        │desirable that SECCOMP_IOCTL_NOTIF_RECV should give  │
>>        │an error if the target has terminated.               │
>>        │                                                     │
>>        │Jann posted a patch to rectify this, but there was   │
>>        │no response (Lore link: https://bit.ly/3jvUBxk) to   │
>>        │his question about fixing this issue. (I've tried    │
>>        │building with the patch, but encountered an issue    │
>>        │with the target process entering D state after a     │
>>        │signal.)                                             │
>>        │                                                     │
>>        │For now, this behavior is documented in BUGS.        │
>>        │                                                     │
>>        │Kees Cook commented: Let's change [this] ASAP!       │
>>        └─────────────────────────────────────────────────────┘
>>
>>        SECCOMP_IOCTL_NOTIF_ID_VALID
>>               This operation can be used to check that a notification ID
>>               returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation
>>               is still valid (i.e., that the target still exists and its
>>               system call is still blocked waiting for a response).
>>
>>               The third ioctl(2) argument is a pointer to the cookie
>>               (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
>>
>>               This operation is necessary to avoid race conditions that
>>               can occur when the pid returned by the
>>               SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that
>>               process ID is reused by another process.  An example of
>>               this kind of race is the following
>>
>>               1. A notification is generated on the listening file
>>                  descriptor.  The returned seccomp_notif contains the
>>                  TID of the target thread (in the pid field of the
>>                  structure).
>>
>>               2. The target terminates.
>>
>>               3. Another thread or process is created on the system that
>>                  by chance reuses the TID that was freed when the target
>>                  terminated.
>>
>>               4. The supervisor open(2)s the /proc/[tid]/mem file for
>>                  the TID obtained in step 1, with the intention of (say)
>>                  inspecting the memory location(s) that containing the
>>                  argument(s) of the system call that triggered the
>>                  notification in step 1.
>>
>>               In the above scenario, the risk is that the supervisor may
>>               try to access the memory of a process other than the
>>               target.  This race can be avoided by following the call to
>>               open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to
>>               verify that the process that generated the notification is
>>               still alive.  (Note that if the target terminates after
>>               the latter step, a subsequent read(2) from the file
>>               descriptor may return 0, indicating end of file.)
>>
>>               On success (i.e., the notification ID is still valid),
>>               this operation returns 0.  On failure (i.e., the
>>               notification ID is no longer valid), -1 is returned, and
>>               errno is set to ENOENT.
>>
>>        SECCOMP_IOCTL_NOTIF_SEND
>>               This operation is used to send a notification response
>>               back to the kernel.  The third ioctl(2) argument of this
>>               structure is a pointer to a structure of the following
>>               form:
>>
>>                   struct seccomp_notif_resp {
>>                       __u64 id;               /* Cookie value */
>>                       __s64 val;              /* Success return value */
>>                       __s32 error;            /* 0 (success) or negative
>>                                                  error number */
>>                       __u32 flags;            /* See below */
>>                   };
>>
>>               The fields of this structure are as follows:
>>
>>               id     This is the cookie value that was obtained using
>>                      the SECCOMP_IOCTL_NOTIF_RECV operation.  This
>>                      cookie value allows the kernel to correctly
>>                      associate this response with the system call that
>>                      triggered the user-space notification.
>>
>>               val    This is the value that will be used for a spoofed
>>                      success return for the target's system call; see
>>                      below.
>>
>>               error  This is the value that will be used as the error
>>                      number (errno) for a spoofed error return for the
>>                      target's system call; see below.
>>
>>               flags  This is a bit mask that includes zero or more of
>>                      the following flags:
>>
>>                      SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
>>                             Tell the kernel to execute the target's
>>                             system call.
>>
>>               Two kinds of response are possible:
>>
>>               · A response to the kernel telling it to execute the
>>                 target's system call.  In this case, the flags field
>>                 includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error
>>                 and val fields must be zero.
>>
>>                 This kind of response can be useful in cases where the
>>                 supervisor needs to do deeper analysis of the target's
>>                 system call than is possible from a seccomp filter
>>                 (e.g., examining the values of pointer arguments), and,
>>                 having decided that the system call does not require
>>                 emulation by the supervisor, the supervisor wants the
>>                 system call to be executed normally in the target.
>>
>>                 The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used
>>                 with caution; see NOTES.
>>
>>               · A spoofed return value for the target's system call.  In
>>                 this case, the kernel does not execute the target's
>>                 system call, instead causing the system call to return a
>>                 spoofed value as specified by fields of the
>>                 seccomp_notif_resp structure.  The supervisor should set
>>                 the fields of this structure as follows:
>>
>>                 +  flags does not contain
>>                    SECCOMP_USER_NOTIF_FLAG_CONTINUE.
>>
>>                 +  error is set either to 0 for a spoofed "success"
>>                    return or to a negative error number for a spoofed
>>                    "failure" return.  In the former case, the kernel
>>                    causes the target's system call to return the value
>>                    specified in the val field.  In the later case, the
> 
> Not a native English speaker but shouldn't this be "latter"?

Yup!

>>                    kernel causes the target's system call to return -1,
>>                    and errno is assigned the negated error value.
>>
>>                 +  val is set to a value that will be used as the return
>>                    value for a spoofed "success" return for the target's
>>                    system call.  The value in this field is ignored if
>>                    the error field contains a nonzero value.
>>
>>                    ┌─────────────────────────────────────────────────────┐
>>                    │FIXME                                                │
>>                    ├─────────────────────────────────────────────────────┤
>>                    │Kees Cook suggested:                                 │
>>                    │                                                     │
>>                    │Strictly speaking, this is architecture specific,    │
>>                    │but all architectures do it this way. Should seccomp │
>>                    │enforce val == 0 when err != 0 ?                     │
> 
> Feels like it should, at least for the SEND ioctl where we already
> verify that val and err are both 0 when CONTINUE is specified (as you
> pointed out correctly above).

Your comments ended here (with no sign-off). Was that intentional?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/