Date:   Thu, 24 Oct 2019 16:04:25 -0600
From:   Jens Axboe <axboe@...nel.dk>
To:     Jann Horn <jannh@...gle.com>
Cc:     linux-block@...r.kernel.org,
        "David S. Miller" <davem@...emloft.net>,
        Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH 1/3] io_uring: add support for async work inheriting files table

On 10/24/19 2:31 PM, Jann Horn wrote:
> On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@...nel.dk> wrote:
>> On 10/18/19 12:50 PM, Jann Horn wrote:
>>> On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@...nel.dk> wrote:
>>>> On 10/18/19 12:06 PM, Jann Horn wrote:
>>>>> But actually, by the way: Is this whole files_struct thing creating a
>>>>> reference loop? The files_struct has a reference to the uring file,
>>>>> and the uring file has ACCEPT work that has a reference to the
>>>>> files_struct. If the task gets killed and the accept work blocks, the
>>>>> entire files_struct will stay alive, right?
>>>>
>>>> Yes, for the lifetime of the request, it does create a loop. So if the
>>>> application goes away, I think you're right, the files_struct will stay.
>>>> And so will the io_uring, for that matter, as we depend on the closing
>>>> of the files to do the final reap.
>>>>
>>>> Hmm, not sure how best to handle that, to be honest. We need some way to
>>>> break the loop, if the request never finishes.
>>>
>>> A wacky and dubious approach would be to, instead of taking a
>>> reference to the files_struct, abuse f_op->flush() to synchronously
>>> flush out pending requests with references to the files_struct... But
>>> it's probably a bad idea, given that in f_op->flush(), you can't
>>> easily tell which files_struct the close is coming from. I suppose you
>>> could keep a list of (fdtable, fd) pairs through which ACCEPT requests
>>> have come in and then let f_op->flush() probe whether the file
>>> pointers are gone from them...
>>
>> Got back to this after finishing the io-wq stuff, which we need for the
>> cancel.
>>
>> Here's an updated patch:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570
>>
>> that seems to work for me (lightly tested), we correctly find and cancel
>> work that is holding on to the file table.
>>
>> The full series sits on top of my for-5.5/io_uring-wq branch, and can be
>> viewed here:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test
>>
>> Let me know what you think!
> 
> Ah, I didn't realize that the second argument to f_op->flush is a
> pointer to the files_struct. That's neat.
> 
> 
> Security: There is no guarantee that ->flush() will run after the last
> io_uring_enter() finishes. You can race like this, with threads A and
> B in one process and C in another one:
> 
> A: sends uring fd to C via unix domain socket
> A: starts syscall io_uring_enter(fd, ...)
> A: calls fdget(fd), takes reference to file
> B: starts syscall close(fd)
> B: fd table entry is removed
> B: f_op->flush is invoked and finds no pending transactions
> B: syscall close() returns
> A: continues io_uring_enter(), grabbing current->files
> A: io_uring_enter() returns
> A and B: exit
> worker: use-after-free access to files_struct
> 
> I think the solution to this would be (unless you're fine with adding
> some broad global read-write mutex) something like this in
> __io_queue_sqe(), where "fd" and "f" are the variables from
> io_uring_enter(), plumbed through the stack somehow:
> 
> if (req->flags & REQ_F_NEED_FILES) {
>    rcu_read_lock();
>    spin_lock_irq(&ctx->inflight_lock);
>    if (fcheck(fd) == f) {
>      list_add(&req->inflight_list,
>        &ctx->inflight_list);
>      req->work.files = current->files;
>      ret = 0;
>    } else {
>      ret = -EBADF;
>    }
>    spin_unlock_irq(&ctx->inflight_lock);
>    rcu_read_unlock();
>    if (ret)
>      goto put_req;
> }

First of all, thanks for the thorough look at this! We already have f
available here; it's req->file. And we just made a copy of the sqe, so
we have sqe->fd available as well. I fixed this up.
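
For anyone reading along later: the piece that makes the ->flush()
approach viable at all is that close(2) hands the closing task's
files_struct down to ->flush(). Roughly (a simplified sketch of the
fs/open.c / fs/file.c path, details elided):

	/*
	 * close(2) -> __close_fd(current->files, fd) -> filp_close(file, files),
	 * so ->flush() sees the files_struct of whoever is doing the close.
	 */
	int filp_close(struct file *filp, fl_owner_t id)
	{
		int retval = 0;

		if (filp->f_op->flush)
			retval = filp->f_op->flush(filp, id);	/* id == closing task's files */

		/* ... locks/dnotify cleanup elided ... */
		fput(filp);
		return retval;
	}

That id is what io_uring_flush() receives as 'data' and compares
against req->work.files.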

> Minor note: If a process uses dup() to duplicate the uring fd, then
> closes the duplicated fd, that will cause work cancellations - but I
> guess that's fine?

I don't think that's a concern.

> Style nit: I find it a bit confusing to name both the list head and
> the list members' list_heads "inflight_list". Maybe name them
> "inflight_list" and "inflight_entry", or something like that?

Fixed

> Correctness: Why is the wait in io_uring_flush() TASK_INTERRUPTIBLE?
> Shouldn't it be TASK_UNINTERRUPTIBLE? If someone sends a signal to the
> task while it's at that schedule(), it's just going to loop back
> around and retry what it was doing already, right?

Fixed

> Security + Correctness: If there is more than one io_wqe, it seems to
> me that io_uring_flush() calls io_wq_cancel_work(), which calls
> io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first
> request it looks at is pending. In that case, io_wq_cancel_work() will
> immediately return, and io_uring_flush() will also immediately return.
> It looks like any other requests will continue running?

Ah good point, I missed that. We need to keep looping until we get
NOTFOUND returned. Fixed as well.

Also added cancellation if the task is going away. Here's the
incremental patch; I'll resend with the full version.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 43ae0e04fd09..ec9dadfa90d2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -326,7 +326,7 @@ struct io_kiocb {
 	u32			result;
 	u32			sequence;
 
-	struct list_head	inflight_list;
+	struct list_head	inflight_entry;
 
 	struct io_wq_work	work;
 };
@@ -688,7 +688,7 @@ static void __io_free_req(struct io_kiocb *req)
 		unsigned long flags;
 
 		spin_lock_irqsave(&ctx->inflight_lock, flags);
-		list_del(&req->inflight_list);
+		list_del(&req->inflight_entry);
 		if (wq_has_sleeper(&ctx->inflight_wait))
 			wake_up(&ctx->inflight_wait);
 		spin_unlock_irqrestore(&ctx->inflight_lock, flags);
@@ -2329,6 +2329,24 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	return 0;
 }
 
+static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req,
+			 struct io_uring_sqe *sqe)
+{
+	int ret = -EBADF;
+
+	rcu_read_lock();
+	spin_lock_irq(&ctx->inflight_lock);
+	if (fcheck(sqe->fd) == req->file) {
+		list_add(&req->inflight_entry, &ctx->inflight_list);
+		req->work.files = current->files;
+		ret = 0;
+	}
+	spin_unlock_irq(&ctx->inflight_lock);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			struct sqe_submit *s)
 {
@@ -2349,23 +2367,24 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			s->sqe = sqe_copy;
 			memcpy(&req->submit, s, sizeof(*s));
 			if (req->flags & REQ_F_NEED_FILES) {
-				spin_lock_irq(&ctx->inflight_lock);
-				list_add(&req->inflight_list,
-						&ctx->inflight_list);
-				req->work.files = current->files;
-				spin_unlock_irq(&ctx->inflight_lock);
+				ret = io_grab_files(ctx, req, sqe_copy);
+				if (ret) {
+					kfree(sqe_copy);
+					goto err;
+				}
 			}
-			io_queue_async_work(ctx, req);
 
 			/*
 			 * Queued up for async execution, worker will release
 			 * submit reference when the iocb is actually submitted.
 			 */
+			io_queue_async_work(ctx, req);
 			return 0;
 		}
 	}
 
 	/* drop submission reference */
+err:
 	io_put_req(req, NULL);
 
 	/* and drop final reference, if we failed */
@@ -3768,38 +3787,51 @@ static int io_uring_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-static int io_uring_flush(struct file *file, void *data)
+static void io_uring_cancel_files(struct io_ring_ctx *ctx,
+				  struct files_struct *files)
 {
-	struct io_ring_ctx *ctx = file->private_data;
-	enum io_wq_cancel ret;
 	struct io_kiocb *req;
 	DEFINE_WAIT(wait);
 
-restart:
-	ret = IO_WQ_CANCEL_NOTFOUND;
+	while (!list_empty_careful(&ctx->inflight_list)) {
+		enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND;
 
-	spin_lock_irq(&ctx->inflight_lock);
-	list_for_each_entry(req, &ctx->inflight_list, inflight_list) {
-		if (req->work.files == data) {
-			ret = io_wq_cancel_work(ctx->io_wq, &req->work);
-			break;
+		spin_lock_irq(&ctx->inflight_lock);
+		list_for_each_entry(req, &ctx->inflight_list, inflight_entry) {
+			if (req->work.files == files) {
+				ret = io_wq_cancel_work(ctx->io_wq, &req->work);
+				break;
+			}
 		}
-	}
-	if (ret == IO_WQ_CANCEL_RUNNING)
-		prepare_to_wait(&ctx->inflight_wait, &wait, TASK_INTERRUPTIBLE);
+		if (ret == IO_WQ_CANCEL_RUNNING)
+			prepare_to_wait(&ctx->inflight_wait, &wait,
+					TASK_UNINTERRUPTIBLE);
 
-	spin_unlock_irq(&ctx->inflight_lock);
+		spin_unlock_irq(&ctx->inflight_lock);
 
-	/*
-	 * If it was found running, wait for one inflight request to finish.
-	 * Retry loop after that, as it may not have been the one we were
-	 * looking for.
-	 */
-	if (ret == IO_WQ_CANCEL_RUNNING) {
+		/*
+		 * We need to keep going until we get NOTFOUND. We only cancel
+		 * one work at the time.
+		 *
+		 * If we get CANCEL_RUNNING, then wait for a work to complete
+		 * before continuing.
+		 */
+		if (ret == IO_WQ_CANCEL_OK)
+			continue;
+		else if (ret != IO_WQ_CANCEL_RUNNING)
+			break;
 		schedule();
-		goto restart;
-	}
+	}
+}
 
+static int io_uring_flush(struct file *file, void *data)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
+		io_wq_cancel_all(ctx->io_wq);
+	else
+		io_uring_cancel_files(ctx, data);
 	return 0;
 }
 

-- 
Jens Axboe
