linux-kernel - Re: [PATCH V5 1/2] ublk_drv: add io_uring based userspace block driver

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Ys/0jTxQCEHdI560@T590>
Date:   Thu, 14 Jul 2022 18:48:45 +0800
From:   Ming Lei <ming.lei@...hat.com>
To:     Ziyang Zhang <ZiyangZhang@...ux.alibaba.com>
Cc:     linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        io-uring@...r.kernel.org, Jens Axboe <axboe@...nel.dk>,
        Gabriel Krisman Bertazi <krisman@...labora.com>,
        Xiaoguang Wang <xiaoguang.wang@...ux.alibaba.com>,
        ming.lei@...hat.com
Subject: Re: [PATCH V5 1/2] ublk_drv: add io_uring based userspace block
 driver

On Thu, Jul 14, 2022 at 06:20:38PM +0800, Ziyang Zhang wrote:
> On 2022/7/13 22:07, Ming Lei wrote:
> > This is the driver part of userspace block driver(ublk driver), the other
> > part is userspace daemon part(ublksrv)[1].
> > 
> > The two parts communicate by io_uring's IORING_OP_URING_CMD with one
> > shared cmd buffer for storing io command, and the buffer is read only for
> > ublksrv, each io command is indexed by io request tag directly, and
> > is written by ublk driver.
> > 
> > For example, when one READ io request is submitted to ublk block driver, ublk
> > driver stores the io command into cmd buffer first, then completes one
> > IORING_OP_URING_CMD for notifying ublksrv, and the URING_CMD is issued to
> > ublk driver beforehand by ublksrv for getting notification of any new io request,
> > and each URING_CMD is associated with one io request by tag.
> > 
> > After ublksrv gets the io command, it translates and handles the ublk io
> > request, such as, for the ublk-loop target, ublksrv translates the request
> > into same request on another file or disk, like the kernel loop block
> > driver. In ublksrv's implementation, the io is still handled by io_uring,
> > and share same ring with IORING_OP_URING_CMD command. When the target io
> > request is done, the same IORING_OP_URING_CMD is issued to ublk driver for
> > both committing io request result and getting future notification of new
> > io request.
> > 
> > Another thing done by ublk driver is to copy data between kernel io
> > request and ublksrv's io buffer:
> > 
> > 1) before ubsrv handles WRITE request, copy the request's data into
> > ublksrv's userspace io buffer, so that ublksrv can handle the write
> > request
> > 
> > 2) after ubsrv handles READ request, copy ublksrv's userspace io buffer
> > into this READ request, then ublk driver can complete the READ request
> > 
> > Zero copy may be switched if mm is ready to support it.
> > 
> > ublk driver doesn't handle any logic of the specific user space driver,
> > so it is small/simple enough.
> > 
> > [1] ublksrv
> > 
> > https://github.com/ming1/ubdsrv
> > 
> > Signed-off-by: Ming Lei <ming.lei@...hat.com>
> > ---
> 
> 
> Hi, Ming
> 
> I find that a big change from v4 to v5 is the simplification of locks.
> 
> In v5 you remove ubq->abort_lock, and I want to ask why it is OK to remove it?

Actually V4 and previous version dealt with the issue too complicated.

> 
> If you have time, could you explain how ublk deals with potential race on:
> 1)queue_rq 2)ublk_abort_queue 3) ublk_ctrl_stop_dev 4) ublk_rq_task_work.
> (Lock in ublk really confuses me...)

One big change is the following code:

__ublk_rq_task_work():
	bool task_exiting = current != ubq->ubq_daemon ||
                (current->flags & PF_EXITING);
	...
	if (unlikely(task_exiting)) {
                blk_mq_end_request(req, BLK_STS_IOERR);
                mod_delayed_work(system_wq, &ub->monitor_work, 0);
                return;
    }

Abort is always started after PF_EXITING is set, but if PF_EXITING is
set, __ublk_rq_task_work fails the request immediately, then io->flags
won't be touched, then no race with abort. Also PF_EXITING is
per-task flag, can only be set before calling __ublk_rq_task_work(),
and setting it actually serialized with calling task work func.

In ublk_queue_rq(), we don't touch io->flags, so there isn't race
with abort.

Wrt. ublk_ctrl_stop_dev(), it isn't related with abort directly, and
if del_gendisk() waits for inflight IO, abort work will be started
for making forward progress. After del_gendisk() returns, there can't
be any inflight io, so it is safe to cancel other pending io command.

> 
> 
> [...]
> 
> > +
> > +/*
> > + * __ublk_fail_req() may be called from abort context or ->ubq_daemon
> > + * context during exiting, so lock is required.
> > + *
> > + * Also aborting may not be started yet, keep in mind that one failed
> > + * request may be issued by block layer again.
> > + */
> > +static void __ublk_fail_req(struct ublk_io *io, struct request *req)
> > +{
> > +	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
> > +
> > +	if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
> > +		io->flags |= UBLK_IO_FLAG_ABORTED;
> > +		blk_mq_end_request(req, BLK_STS_IOERR);
> > +	}
> > +}
> > +
> 
> [...]
> 
> > +
> > +/*
> > + * When ->ubq_daemon is exiting, either new request is ended immediately,
> > + * or any queued io command is drained, so it is safe to abort queue
> > + * lockless
> > + */
> > +static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
> > +{
> > +	int i;
> > +
> > +	if (!ublk_get_device(ub))
> > +		return;
> > +
> > +	for (i = 0; i < ubq->q_depth; i++) {
> > +		struct ublk_io *io = &ubq->ios[i];
> > +
> > +		if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
> > +			struct request *rq;
> > +
> > +			/*
> > +			 * Either we fail the request or ublk_rq_task_work_fn
> > +			 * will do it
> > +			 */
> > +			rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
> > +			if (rq)
> > +				__ublk_fail_req(io, rq);
> > +		}
> > +	}
> > +	ublk_put_device(ub);
> > +}
> > +
> 
> 
> Another problem: 
> 
> 1) comment of __ublk_fail_req():  "so lock is required"

Yeah, now __ublk_fail_req is only called in abort context, and no race
with task work any more, so lock isn't needed.

> 
> 2) comment of ublk_abort_queue(): "so it is safe to abort queue lockless"

This comment is updated in v5, and it is correct.

> 
> 3) ublk_abort_queue() calls _ublk_fail_req() on all ubqs.

No, ublk_abort_queue() only aborts the passed ubq, so if one ubq daemon
is aborted, other ubqs can still handle IO during deleting disk.


Thanks,
Ming