linux-kernel - Re: For review: seccomp_user

Message-ID: <964c2191-db78-ff4d-5664-1d80dc382df4@gmail.com>
Date:   Mon, 2 Nov 2020 20:45:14 +0100
From:   "Michael Kerrisk (man-pages)" <mtk.manpages@...il.com>
To:     Sargun Dhillon <sargun@...gun.me>
Cc:     mtk.manpages@...il.com, Tycho Andersen <tycho@...ho.pizza>,
        Christian Brauner <christian@...uner.io>,
        Kees Cook <keescook@...omium.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Giuseppe Scrivano <gscrivan@...hat.com>,
        Song Liu <songliubraving@...com>,
        Robert Sesek <rsesek@...gle.com>,
        Containers <containers@...ts.linux-foundation.org>,
        linux-man <linux-man@...r.kernel.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Aleksa Sarai <cyphar@...har.com>, Jann Horn <jannh@...gle.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Will Drewry <wad@...omium.org>, bpf <bpf@...r.kernel.org>,
        Andy Lutomirski <luto@...capital.net>
Subject: Re: For review: seccomp_user_notif(2) manual page [v2]

Hello Sargun,

Thanks for your reply!

On 11/2/20 9:07 AM, Sargun Dhillon wrote:
> On Sat, Oct 31, 2020 at 9:27 AM Michael Kerrisk (man-pages)
> <mtk.manpages@...il.com> wrote:
>>
>> Hello Sargun,
>>
>> Thanks for your reply.
>>
>> On 10/30/20 9:27 PM, Sargun Dhillon wrote:
>>> On Thu, Oct 29, 2020 at 09:37:21PM +0100, Michael Kerrisk (man-pages)
>>> wrote:
>>
>> [...]
>>
>>>>> I think I commented in another thread somewhere that the
>>>>> supervisor is not notified if the syscall is preempted. Therefore
>>>>> if it is performing a preemptible, long-running syscall, you need
>>>>> to poll SECCOMP_IOCTL_NOTIF_ID_VALID in the background, otherwise
>>>>> you can end up in a bad situation -- like leaking resources, or
>>>>> holding on to file descriptors after the program under
>>>>> supervision has intended to release them.
>>>>
>>>> It's been a long day, and I'm not sure I really understand this.
>>>> Could you outline the scenario in more detail?
>>>>
>>> S: Sets up filter + interception for accept
>>> T: socket(AF_INET, SOCK_STREAM, 0) = 7
>>> T: bind(7, {127.0.0.1, 4444}, ..)
>>> T: listen(7, 10)
>>> T: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>>
>> Presumably, the preceding line should have been:
>>
>> S: pidfd_getfd(T, 7) = 7 # For the sake of discussion.
>> (s/T:/S:/)
>>
>> right?
> 
> Right.
>>
>>
>>> T: accept(7, ...)
>>> S: Intercepts accept
>>> S: Does accept in background
>>> T: Receives signal, and accept(...) responds in EINTR
>>> T: close(7)
>>> S: Still running accept(7, ....), holding port 4444, so if now T
>>> retries to bind to port 4444, things fail.
>>
>> Okay -- I understand. Presumably the solution here is not to
>> block in accept(), but rather to use poll() to monitor both the
>> notification FD and the listening socket FD?
>>
> You need to have some kind of mechanism to periodically check
> if the notification is still alive, and preempt the accept. It doesn't
> matter how exactly you "background" the accept (threads, or
> O_NONBLOCK + epoll).
> 
> The thing is, when the process cancels a syscall, you need to
> release the resources you may have acquired on its behalf, or
> bad things can happen.
> 

Got it. I added the following text:

   Caveats regarding blocking system calls
       Suppose that the target performs a blocking system call (e.g.,
       accept(2)) that the supervisor should handle.  The supervisor
       might then in turn execute the same blocking system call.

       In this scenario, it is important to note that if the target's
       system call is now interrupted by a signal, the supervisor is not
       informed of this.  If the supervisor does not take suitable steps
       to actively discover that the target's system call has been
       canceled, various difficulties can occur.  Taking the example of
       accept(2), the supervisor might remain blocked in its accept(2)
       holding a port number that the target (which, after the
       interruption by the signal handler, perhaps closed its listening
       socket) might expect to be able to reuse in a bind(2) call.

       Therefore, when the supervisor wishes to emulate a blocking system
       call, it must do so in such a way that it gets informed if the
       target's system call is interrupted by a signal handler.  For
       example, if the supervisor itself executes the same blocking
       system call, then it could employ a separate thread that uses the
       SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
       still blocked in its system call.  Alternatively, in the accept(2)
       example, the supervisor might use poll(2) to monitor both the
       notification file descriptor (so as to discover when the
       target's accept(2) call has been interrupted) and the listening
       file descriptor (so as to know when a connection is available).

       If the target's system call is interrupted, the supervisor must
       take care to release resources (e.g., file descriptors) that it
       acquired on behalf of the target.

Does that seem okay?

>>>>> ENOENT The cookie number is not valid. This can happen if a
>>>>> response has already been sent, or if the syscall was
>>>>> interrupted
>>>>>
>>>>> EBADF If the file descriptor specified in srcfd is invalid, or if
>>>>> the fd is out of range of the destination program.
>>>>
>>>> The piece "or if the fd is out of range of the destination program"
>>>> is not clear to me. Can you say some more please.
>>>>
>>>
>>> IIRC the maximum fd range is specified in proc by some sysctl named
>>> nr_open. It's also evaluated against RLIMITs, and nr_max.
>>>
>>> If nr_open (maximum fds open per process, IIRC) is 1000, even if 10
>>> FDs are open, it won't work if newfd is 1001.
>>
>> Actually, the relevant limit seems to be just the RLIMIT_NOFILE
>> resource limit, at least in my reading of fs/file.c::replace_fd().
>> So I made the text
>>
>>               EBADF  Allocating the file descriptor in the target would
>>                      cause the target's RLIMIT_NOFILE limit to be
>>                      exceeded (see getrlimit(2)).
>>
>>
> 
> If you're above RLIMIT_NOFILE, you get EBADF.
> 
> When we do __receive_fd with a specific fd (newfd specified):
> https://elixir.bootlin.com/linux/latest/source/fs/file.c#L1086
> 
> it calls replace_fd, which calls expand_files. expand_files
> can fail with EMFILE.
> 
>>>>> EINVAL If flags or new_flags were unrecognized, or if newfd is
>>>>> non-zero, and SECCOMP_ADDFD_FLAG_SETFD has not been set.
>>>>>
>>>>> EMFILE Too many files are open by the destination process.
>>
>> I'm not sure that this error can really occur. That's the error
>> that in most other places occurs when RLIMIT_NOFILE is exceeded.
>> But I may have missed something. More precisely, when do you think
>> EMFILE can occur?
>>
> It can happen if the user specifies a newfd which is too large.

Got it. Thanks! I made the error text:

        EMFILE The file descriptor number specified in newfd exceeds the
               limit specified in /proc/sys/fs/nr_open.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/