linux-kernel - Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAK1VsR1b-dbAa8pMqGvfcAAcVP3ZkTYZdyqcaK4wJdbuAZtJsA@mail.gmail.com>
Date: Sun, 7 Apr 2024 12:14:56 -0700
From: Oliver Crumrine <ozlinuxc@...il.com>
To: Pavel Begunkov <asml.silence@...il.com>, Oliver Crumrine <ozlinuxc@...il.com>, axboe@...nel.dk
Cc: io-uring@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy

Oliver Crumrine wrote:
> Pavel Begunkov wrote:
> > On 4/5/24 21:04, Oliver Crumrine wrote:
> > > Pavel Begunkov wrote:
> > >> On 4/4/24 23:17, Oliver Crumrine wrote:
> > >>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov
> > >>> specifically disabled REQ_F_CQE_SKIP, as (at least from my
> > >>> understanding) the userspace program wouldn't receive the
> > >>> IORING_CQE_F_MORE flag in the result value.
> > >>
> > >> No. IORING_CQE_F_MORE means there will be another CQE from this
> > >> request, so a single CQE without IORING_CQE_F_MORE is trivially
> > >> fine.
> > >>
> > >> The problem is the semantics, because by suppressing the first
> > >> CQE you're loosing the result value. You might rely on WAITALL
> > > That's already happening with io_send.
> >
> > Right, and it's still annoying and hard to use
> Another solution might be something where there is a counter that stores
> how many CQEs with REQ_F_CQE_SKIP have been processed. Before exiting,
> userspace could call a function like: io_wait_completions(int completions)
> which would wait until everything is done, and then userspace could peek
> the completion ring.
> >
> > >> as other sends and "fail" (in terms of io_uring) the request
> > >> in case of a partial send posting 2 CQEs, but that's not a great
> > >> way and it's getting userspace complicated pretty easily.
> > >>
> > >> In short, it was left out for later because there is a
> > >> better way to implement it, but it should be done carefully
> > > Maybe we could put the return values in the notifs? That would be a
> > > discrepancy between io_send and io_send_zc, though.
> >
> > Yes. And yes, having a custom flavour is not good. It'd only
> > be well usable if apart from returning the actual result
> > it also guarantees there will be one and only one CQE, then
> > the userspace doesn't have to do the dancing with counting
> > and checking F_MORE. In fact, I outlined before how a generic
> > solution may looks like:
> >
> > https://github.com/axboe/liburing/issues/824
> >
> > The only interesting part, IMHO, is to be able to merge the
> > main completion with its notification. Below is an old stash
> > rebased onto for-6.10. The only thing missing is relinking,
> > but maybe we don't even care about it. I need to cover it
> > well with tests.
> The patch looks pretty good. The only potential issue is that you store
> the res of the normal CQE into the notif CQE. This overwrites the
> IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif would
> indicate to userspace that there will be another CQE, of which there
> won't.
I was wrong here; Mixed up flags and result value.
> >
> >
> >
> >
> > commit ca5e4fb6d105b5dfdf3768d46ce01529b7bb88c5
> > Author: Pavel Begunkov <asml.silence@...il.com>
> > Date:   Sat Apr 6 15:46:38 2024 +0100
> >
> >      io_uring/net: introduce single CQE send zc mode
> >
> >      IORING_OP_SEND[MSG]_ZC requests are posting two completions, one to
> >      notify that the data was queued, and later a second, usually referred
> >      as "notification", to let the user know that the buffer used can be
> >      reused/freed. In some cases the user might not care about the main
> >      completion and would be content getting only the notification, which
> >      would allow to simplify the userspace.
> >
> >      One example is when after a send the user would be waiting for the other
> >      end to get the message and reply back not pushing any more data in the
> >      meantime. Another case is unreliable protocols like UDP, which do not
> >      require a confirmation from the other end before dropping buffers, and
> >      so the notifications are usually posted shortly after the send request
> >      is queued.
> >
> >      Add a flag merging completions into a single CQE. cqe->res will store
> >      the send's result as usual, and it will have IORING_CQE_F_NOTIF set if
> >      the buffer was potentially used. Timewise, it would be posted at the
> >      moment when the notification would have been originally completed.
> >
> >      Signed-off-by: Pavel Begunkov <asml.silence@...il.com>
> >
> > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> > index 7bd10201a02b..e2b528c341c9 100644
> > --- a/include/uapi/linux/io_uring.h
> > +++ b/include/uapi/linux/io_uring.h
> > @@ -356,6 +356,7 @@ enum io_uring_op {
> >   #define IORING_RECV_MULTISHOT		(1U << 1)
> >   #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
> >   #define IORING_SEND_ZC_REPORT_USAGE	(1U << 3)
> > +#define IORING_SEND_ZC_COMBINE_CQE	(1U << 4)
> >
> >   /*
> >    * cqe.res for IORING_CQE_F_NOTIF if
> > diff --git a/io_uring/net.c b/io_uring/net.c
> > index a74287692071..052f030ab8f8 100644
> > --- a/io_uring/net.c
> > +++ b/io_uring/net.c
> > @@ -992,7 +992,19 @@ void io_send_zc_cleanup(struct io_kiocb *req)
> >   	}
> >   }
> >
> > -#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF)
> > +static inline void io_sendzc_adjust_res(struct io_kiocb *req)
> > +{
> > +	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
> > +
> > +	if (sr->flags & IORING_SEND_ZC_COMBINE_CQE) {
> > +		sr->notif->cqe.res = req->cqe.res;
> > +		req->flags |= REQ_F_CQE_SKIP;
> > +	}
> > +}
> > +
> > +#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | \
> > +			    IORING_RECVSEND_FIXED_BUF | \
> > +			    IORING_SEND_ZC_COMBINE_CQE)
> >   #define IO_ZC_FLAGS_VALID  (IO_ZC_FLAGS_COMMON | IORING_SEND_ZC_REPORT_USAGE)
> >
> >   int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > @@ -1022,6 +1034,8 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> >   		if (zc->flags & ~IO_ZC_FLAGS_VALID)
> >   			return -EINVAL;
> >   		if (zc->flags & IORING_SEND_ZC_REPORT_USAGE) {
> > +			if (zc->flags & IORING_SEND_ZC_COMBINE_CQE)
> > +				return -EINVAL;
> >   			io_notif_set_extended(notif);
> >   			io_notif_to_data(notif)->zc_report = true;
> >   		}
> > @@ -1197,6 +1211,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
> >   	else if (zc->done_io)
> >   		ret = zc->done_io;
> >
> > +	io_req_set_res(req, ret, IORING_CQE_F_MORE);
> > +	io_sendzc_adjust_res(req);
> > +
> >   	/*
> >   	 * If we're in io-wq we can't rely on tw ordering guarantees, defer
> >   	 * flushing notif to io_send_zc_cleanup()
> > @@ -1205,7 +1222,6 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
> >   		io_notif_flush(zc->notif);
> >   		io_req_msg_cleanup(req, 0);
> >   	}
> > -	io_req_set_res(req, ret, IORING_CQE_F_MORE);
> >   	return IOU_OK;
> >   }
> >
>
> >   	else if (sr->done_io)
> >   		ret = sr->done_io;
> >
> > +	io_req_set_res(req, ret, IORING_CQE_F_MORE);
> > +	io_sendzc_adjust_res(req);
> > +
> >   	/*
> >   	 * If we're in io-wq we can't rely on tw ordering guarantees, defer
> >   	 * flushing notif to io_send_zc_cleanup()
> > @@ -1266,7 +1285,6 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
> >   		io_notif_flush(sr->notif);
> >   		io_req_msg_cleanup(req, 0);
> >   	}
> > -	io_req_set_res(req, ret, IORING_CQE_F_MORE);
> >   	return IOU_OK;
> >   }
> >
> > @@ -1278,8 +1296,10 @@ void io_sendrecv_fail(struct io_kiocb *req)
> >   		req->cqe.res = sr->done_io;
> >
> >   	if ((req->flags & REQ_F_NEED_CLEANUP) &&
> > -	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
> > +	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) {
> >   		req->cqe.flags |= IORING_CQE_F_MORE;
> > +		io_sendzc_adjust_res(req);
> > +	}
> >   }
> >
> >   int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> >
> >
> > --
> > Pavel Begunkov
>
>