linux-kernel - Re: [PATCH] epoll: try to be a _bit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20240505194603.GH2118490@ZenIV>
Date: Sun, 5 May 2024 20:46:03 +0100
From: Al Viro <viro@...iv.linux.org.uk>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Christian Brauner <brauner@...nel.org>, keescook@...omium.org,
	axboe@...nel.dk, christian.koenig@....com,
	dri-devel@...ts.freedesktop.org, io-uring@...r.kernel.org,
	jack@...e.cz, laura@...bott.name, linaro-mm-sig@...ts.linaro.org,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-media@...r.kernel.org, minhquangbui99@...il.com,
	sumit.semwal@...aro.org,
	syzbot+045b454ab35fd82a35fb@...kaller.appspotmail.com,
	syzkaller-bugs@...glegroups.com
Subject: Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

On Sat, May 04, 2024 at 08:53:47AM -0700, Linus Torvalds wrote:

>   poll_wait
>     -> __pollwait
>      -> get_file (*boom*)
> 
> but the boom is very small because the poll_wait() will be undone by
> poll_freewait(), and normally poll/select has held the file count
> elevated.

Not quite.  It's not that poll_wait() calls __pollwait(); it calls
whatever callback that caller of ->poll() has set for it.

__pollwait users (select(2) and poll(2), currently) must (and do) make
sure that refcount is elevated; others (and epoll is not the only one)
need to do whatever's right for their callbacks.

I've no problem with having epoll grab a reference, but if we make that
a universal requirement ->poll() instances can rely upon, we'd better
verify that *all* vfs_poll() are OK.  And that ought to go into
Documentation/filesystems/porting.rst ("callers of vfs_poll() must
make sure that file is pinned; ->poll() instances may rely upon that,
but they'd better be very careful about grabbing extra references themselves -
it's acceptable for files on internal mounts, but *NOT* for anything on
mountable filesystems.  Any instance that does it needs an explicit
comment telling not to reuse that blindly." or something along those
lines).

Excluding epoll, select/poll and callers that have just done fdget() and will
do fdput() after vfs_poll(), we have this:

drivers/vhost/vhost.c:213:      mask = vfs_poll(file, &poll->table);
	vhost_poll_start().  Might get interesting...  Calls working
with vq->kick as file seem to rely upon vq->mutex, but I'll need to
refresh my memories of that code to check if that's all we need - and
then there's vhost_net_enable_vq(), which also needs an audit.

fs/aio.c:1738:          mask = vfs_poll(req->file, &pt) & req->events;
fs/aio.c:1932:  mask = vfs_poll(req->file, &apt.pt) & req->events;
	aio_poll() and aio_poll_wake() resp.  req->file here is actually ->ki_filp
	of iocb that contains work as iocb->req.work; it get dropped only in
	iocb_destroy(), which also frees iocb.  Any call that might've run into
	req->file not pinned is already in UAF land.

io_uring/poll.c:303:                    req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events;
io_uring/poll.c:622:    mask = vfs_poll(req->file, &ipt->pt) & poll->events;
	Should have req->file pinned, but I'll need to RTFS a bit for
details.  That, or ask Jens to confirm...

net/9p/trans_fd.c:236:  ret = vfs_poll(ts->rd, pt);
net/9p/trans_fd.c:238:          ret = (ret & ~EPOLLOUT) | (vfs_poll(ts->wr, pt) & ~EPOLLIN);
	p9_fd_poll(); ->rd and ->wr are pinned and won't get dropped until
p9_fd_close(), which frees ts immediately afterwards.  IOW, if we risk
being called with ->rd or ->wr not pinned, we are in UAF land already.
Incidentally, what the hell is this in p9_fd_open()?
         * It's technically possible for userspace or concurrent mounts to
         * modify this flag concurrently, which will likely result in a
         * broken filesystem. However, just having bad flags here should
         * not crash the kernel or cause any other sort of bug, so mark this
         * particular data race as intentional so that tooling (like KCSAN)
         * can allow it and detect further problems.
         */
Why not simply fix the race instead?  As in
	spin_lock(&ts->rd->f_lock);
        ts->rd->f_flags |= O_NONBLOCK;
	spin_unlock(&ts->rd->f_lock);
and similar for ts->wr?  Sigh...