linux-kernel - Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d0266a24-dfab-83d0-e178-aa67c9f5ebc0@i-love.sakura.ne.jp>
Date:   Fri, 3 Jul 2020 22:19:17 +0900
From:   Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>,
        Al Viro <viro@...iv.linux.org.uk>,
        Casey Schaufler <casey@...aufler-ca.com>
Cc:     Alexei Starovoitov <alexei.starovoitov@...il.com>,
        linux-kernel@...r.kernel.org, David Miller <davem@...emloft.net>,
        Greg Kroah-Hartman <greg@...ah.com>,
        Kees Cook <keescook@...omium.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Alexei Starovoitov <ast@...nel.org>, bpf <bpf@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Masahiro Yamada <yamada.masahiro@...ionext.com>,
        Gary Lin <GLin@...e.com>, Bruno Meneguele <bmeneg@...hat.com>,
        LSM List <linux-security-module@...r.kernel.org>,
        Luis Chamberlain <mcgrof@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

On 2020/07/02 22:08, Eric W. Biederman wrote:
>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>> called from both kernel thread and from process context (e.g. init_module()
>> syscall by /sbin/insmod )) ?
> 
> And __fput_sync needs to be even more careful.
> umd_load_blob is called in these changes without any locks held.

But where is the guarantee that a thread which called flush_delayed_fput() waits for
the completion of processing _all_ "struct file" linked into delayed_fput_list ?
If some other thread or delayed_fput_work (scheduled by fput_many()) called
flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
needs to be processed before execve(), can't it?

Also, I don't know how convoluted the dependency of all "struct file" linked into
delayed_fput_list might be, for there can be "struct file" which will not be a
simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.

On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
there is a guarantee that __fput_sync() waits for the completion of "struct file"
which needs to be flushed before execve(), isn't there?

> 
> We fundamentally AKA in any correct version of this code need to flush
> the file descriptor before we call exec or exec can not open it a
> read-only denying all writes from any other opens.
> 
> The use case of flush_delayed_fput is exactly the same as that used
> when loading the initramfs.

When loading the initramfs, the number of threads is quite few (which
means that the possibility of hitting the race window and convoluted
dependency is small).

But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
flush_delayed_fput() might be called after many number of threads already
started running.



On 2020/07/03 1:02, Eric W. Biederman wrote:
>>>> On 2020/06/30 21:29, Eric W. Biederman wrote:
>>>>> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
>>>>> release_task is called so there is a race.  As it is possible to wake
>>>>> up and then go back to sleep before pid_has_task becomes false.
>>>>
>>>> What is the reason we want to wait until pid_has_task() becomes false?
>>>>
>>>> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>>>> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));
>>>
>>> So that it is safe to call bpfilter_umh_cleanup.  The previous code
>>> performed the wait by having a callback in do_exit.
>>
>> But bpfilter_umh_cleanup() does only
>>
>> 	fput(info->pipe_to_umh);
>> 	fput(info->pipe_from_umh);
>> 	put_pid(info->tgid);
>> 	info->tgid = NULL;
>>
>> which is (I think) already safe regardless of the usermode process because
>> bpfilter_umh_cleanup() merely closes one side of two pipes used between
>> two processes and forgets about the usermode process.
> 
> It is not safe.
> 
> Baring bugs there is only one use of shtudown_umh that matters.  The one
> in fini_umh.  The use of the file by the mm must be finished before
> umd_unload_blob.  AKA unmount.  Which completely frees the filesystem.

Do we really need to mount upon umd_load_blob() and unmount upon umd_unload_blob() ?
LSM modules might prefer only one instance of filesystem for umd blobs.

For pathname based LSMs, since that filesystem is not visible from mount tree, only
info->driver_name can be used for distinction. Therefore, one instance of filesystem
with files created with file_open_root(O_CREAT | O_WRONLY | O_EXCL) might be preferable.

For inode based LSMs, reusing one instance of filesystem created upon early boot might
be convenient for labeling.

Also, we might want a dedicated filesystem (say, "umdfs") instead of regular tmpfs in
order to implement protections without labeling files. Then, we might also be able to
implement minimal protections without LSMs.