linux-kernel - Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <6850f08d-0e89-4eb3-bbfb-bdcc5d4e1b78@gmail.com>
Date: Fri, 5 Apr 2024 14:01:07 +0100
From: Pavel Begunkov <asml.silence@...il.com>
To: Oliver Crumrine <ozlinuxc@...il.com>, axboe@...nel.dk,
 davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
 pabeni@...hat.com, shuah@...nel.org, leitao@...ian.org
Cc: io-uring@...r.kernel.org, netdev@...r.kernel.org,
 linux-kselftest@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring
 zerocopy

On 4/4/24 23:17, Oliver Crumrine wrote:
> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov
> specifically disabled REQ_F_CQE_SKIP, as (at least from my
> understanding) the userspace program wouldn't receive the
> IORING_CQE_F_MORE flag in the result value.

No. IORING_CQE_F_MORE means there will be another CQE from this
request, so a single CQE without IORING_CQE_F_MORE is trivially
fine.

The problem is the semantics, because by suppressing the first
CQE you're loosing the result value. You might rely on WAITALL
as other sends and "fail" (in terms of io_uring) the request
in case of a partial send posting 2 CQEs, but that's not a great
way and it's getting userspace complicated pretty easily.

In short, it was left out for later because there is a
better way to implement it, but it should be done carefully


> To fix this, instead of keeping track of how many CQEs have been
> received, and subtracting notifs from that, programs can keep track of

That's a benchmark way of doing it, more realistically
it'd be more like

event_loop() {
	cqe = wait_cqe();
	struct req *r = (struct req *)cqe->user_data;
	r->callback(r, cqe);
}

send_zc_callback(req, cqe) {
	if (cqe->flags & F_MORE) {
		// don't free the req
		// we should wait for another CQE
		...
	}
}

> how many SQEs they have issued, and if a CQE is returned with an error,
> they can simply subtract from how many notifs they expect to receive.

The design specifically untangles those two notions, i.e. there can
be a notification even when the main CQE fails (ret<0). It's safer
this way, even though AFAIK relying on errors would be fine with
current users (TCP/UDP).


> Signed-off-by: Oliver Crumrine <ozlinuxc@...il.com>
> ---
>   io_uring/net.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/io_uring/net.c b/io_uring/net.c
> index 1e7665ff6ef7..822f49809b68 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -1044,9 +1044,6 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>   
>   	if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>   		return -EINVAL;
> -	/* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */
> -	if (req->flags & REQ_F_CQE_SKIP)
> -		return -EINVAL;
>   
>   	notif = zc->notif = io_alloc_notif(ctx);
>   	if (!notif)
> @@ -1342,7 +1339,8 @@ void io_sendrecv_fail(struct io_kiocb *req)
>   		req->cqe.res = sr->done_io;
>   
>   	if ((req->flags & REQ_F_NEED_CLEANUP) &&
> -	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
> +	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC) &&
> +	    !(req->flags & REQ_F_CQE_SKIP))
>   		req->cqe.flags |= IORING_CQE_F_MORE;
>   }
>   

-- 
Pavel Begunkov