Message-ID: <5ac7c54a-e82b-9d28-9761-446a644e181c@kernel.dk>
Date: Mon, 9 May 2022 21:29:46 -0600
From: Jens Axboe <axboe@...nel.dk>
To: Ming Lei <ming.lei@...hat.com>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
io-uring@...r.kernel.org,
Gabriel Krisman Bertazi <krisman@...labora.com>,
ZiyangZhang <ZiyangZhang@...ux.alibaba.com>,
Xiaoguang Wang <xiaoguang.wang@...ux.alibaba.com>
Subject: Re: [RFC PATCH] ubd: add io_uring based userspace block driver
On 5/9/22 8:58 PM, Ming Lei wrote:
> On Mon, May 09, 2022 at 10:09:10AM -0600, Jens Axboe wrote:
>> On 5/9/22 3:23 AM, Ming Lei wrote:
>>> This is the driver part of the userspace block driver (ubd driver); the
>>> other part is the userspace daemon (ubdsrv) [1].
>>>
>>> The two parts communicate via io_uring's IORING_OP_URING_CMD, with one
>>> shared cmd buffer for storing io commands. The buffer is written by the
>>> ubd driver and is read-only for ubdsrv, and each io command is indexed
>>> directly by the io request tag.
>>>
>>> For example, when a READ request is submitted to the ubd block driver,
>>> the driver first stores the io command into the cmd buffer, then
>>> completes one IORING_OP_URING_CMD to notify ubdsrv. That URING_CMD was
>>> issued to the ubd driver beforehand by ubdsrv in order to get notified
>>> of any new io request, and each URING_CMD is associated with one io
>>> request by tag.
>>>
>>> After ubdsrv gets the io command, it translates and handles the ubd io
>>> request. For the ubd-loop target, for instance, ubdsrv translates the
>>> request into the same request on another file or disk, like the kernel
>>> loop block driver. In ubdsrv's implementation the io is still handled
>>> by io_uring, sharing the same ring with the IORING_OP_URING_CMD
>>> commands. When the target io request is done, the same
>>> IORING_OP_URING_CMD is issued to the ubd driver to both commit the io
>>> request result and get notification of the next io request.
>>>
>>> Another thing done by the ubd driver is to copy data between the kernel
>>> io request and ubdsrv's io buffer:
>>>
>>> 1) before ubdsrv handles a WRITE request, copy the request's data into
>>> ubdsrv's userspace io buffer, so that ubdsrv can handle the write
>>> request
>>>
>>> 2) after ubdsrv handles a READ request, copy ubdsrv's userspace io buffer
>>> into the READ request, then the ubd driver can complete the READ request
>>>
>>> Zero copy may be switched to later, if mm is ready to support it.
>>>
>>> The ubd driver doesn't handle any logic specific to the userspace
>>> driver, so it should be small/simple enough.
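Just to make sure I'm reading the design right: the per-tag slot in that
shared cmd buffer is presumably something along these lines (field and
buffer names are made up here for illustration, not the actual ubd
definitions)?

        struct ubd_io_desc {
                __u32   op_flags;       /* REQ_OP_* plus request flags */
                __u32   nr_sectors;     /* data length of the request */
                __u64   start_sector;   /* offset on the ubd device */
                __u64   addr;           /* ubdsrv buffer used for the data copy */
        };

        /* written by the ubd driver, read-only for ubdsrv, indexed by tag */
        struct ubd_io_desc *iod = &ubq->io_cmd_buf[rq->tag];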
>>
>> This is pretty interesting! Just one small thing I noticed, since you
>> want to make sure batching is Good Enough:
>>
>>> +static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
>>> +                const struct blk_mq_queue_data *bd)
>>> +{
>>> +        struct ubd_queue *ubq = hctx->driver_data;
>>> +        struct request *rq = bd->rq;
>>> +        struct ubd_io *io = &ubq->ios[rq->tag];
>>> +        struct ubd_rq_data *data = blk_mq_rq_to_pdu(rq);
>>> +        blk_status_t res;
>>> +
>>> +        if (ubq->aborted)
>>> +                return BLK_STS_IOERR;
>>> +
>>> +        /* this io cmd slot isn't active, so have to fail this io */
>>> +        if (WARN_ON_ONCE(!(io->flags & UBD_IO_FLAG_ACTIVE)))
>>> +                return BLK_STS_IOERR;
>>> +
>>> +        /* fill iod to slot in io cmd buffer */
>>> +        res = ubd_setup_iod(ubq, rq);
>>> +        if (res != BLK_STS_OK)
>>> +                return BLK_STS_IOERR;
>>> +
>>> +        blk_mq_start_request(bd->rq);
>>> +
>>> +        /* mark this cmd owned by ubdsrv */
>>> +        io->flags |= UBD_IO_FLAG_OWNED_BY_SRV;
>>> +
>>> +        /*
>>> +         * clear ACTIVE since we are done with this sqe/cmd slot
>>> +         *
>>> +         * We can only accept io cmd in case of being not active.
>>> +         */
>>> +        io->flags &= ~UBD_IO_FLAG_ACTIVE;
>>> +
>>> +        /*
>>> +         * run data copy in task work context for WRITE, and complete io_uring
>>> +         * cmd there too.
>>> +         *
>>> +         * This way should improve batching, meantime pinning pages in current
>>> +         * context is pretty fast.
>>> +         */
>>> +        task_work_add(ubq->ubq_daemon, &data->work, TWA_SIGNAL);
>>> +
>>> +        return BLK_STS_OK;
>>> +}
>>
>> It'd be better to use bd->last to indicate what kind of signaling you
>> need here. TWA_SIGNAL will force an immediate transition if the app is
>> running in userspace, which may not be what you want. Also see:
>>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.19/io_uring&id=e788be95a57a9bebe446878ce9bf2750f6fe4974
>>
>> But regardless of which kind of signaling is needed, you don't need it at
>> all unless bd->last is true. Would need a commit_rqs() as well, but that's
>> trivial.
>
> Good point. I think we may add non-last requests via task_work_add(TWA_NONE),
> and only notify via TWA_SIGNAL_NO_IPI when bd->last is set.
Yep, I think that'd be the way to go.
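Something like the below is roughly what I had in mind - just an untested
sketch, and it assumes the TWA_SIGNAL_NO_IPI / __set_notify_signal() bits
from the branch linked above:

        /* in ubd_queue_rq(): only force a daemon wakeup on the last request */
        task_work_add(ubq->ubq_daemon, &data->work,
                      bd->last ? TWA_SIGNAL_NO_IPI : TWA_NONE);

        /* blk-mq calls this when a batch ends without ->last having been set */
        static void ubd_commit_rqs(struct blk_mq_hw_ctx *hctx)
        {
                struct ubd_queue *ubq = hctx->driver_data;

                /* nudge the daemon so already-queued task_work gets run */
                __set_notify_signal(ubq->ubq_daemon);
        }

ubd_commit_rqs() would just be hooked up via .commit_rqs in the blk_mq_ops.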
>> More importantly, what prevents ubq->ubq_daemon from going away after
>> it's been assigned? I didn't look at the details, but is this relying on
>> io_uring being closed to cancel pending requests? That should work, but
>
> I think nothing can prevent ubq->ubq_daemon from being killed by 'kill -9',
> even though ubdsrv handles SIGTERM. That is why I suggest adding one
> service for removing all ubd devices before shutdown:
>
> https://github.com/ming1/ubdsrv/blob/devel/README
Right, you can't prevent a task from getting killed. But what you do
know is that file descriptors get closed when the task goes away, and if
you're integrated into io_uring in terms of how requests are handled,
then the closing of the io_uring ring descriptor should wait-for/cancel
pending requests. If done right, that could perhaps exclude the issue of
having the stored task become invalid.
I haven't looked too closely at it all yet, so the above may not be a
viable approach. Or maybe it will... It's how io_uring itself does it.
> All the UBD_IO_FETCH_REQ or UBD_IO_COMMIT_AND_FETCH_REQ commands have
> already been submitted to the driver; I understand io_uring can't cancel
> them, please correct me if that is wrong.
Right, any storage IO can't get canceled if it's already hit the block
layer or further down. So you end up waiting for them, which is fine
too.
> One solution I thought of is to use one watchdog to check if ubq->ubq_daemon
> is dead, then abort the whole device if so. Or do you have any suggestion?
You'd still need to ensure that it remains valid.
>> we need some way to ensure that ->ubq_daemon is always valid here.
>
> Good catch.
>
> get_task_struct() should be used for assigning ubq->ubq_daemon.
Yep, that could work too as long as it doesn't introduce a weird loopy
dependency. Since it's just the task_struct itself, I think it'd be fine
and the simplest solution. This is a setup thing, and not per-io?
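I.e. grab the reference once when the daemon is registered for the queue and
drop it again at teardown - something like this sketch, where the exact spots
and names may differ from the patch:

        /* when the daemon issues the first FETCH command for this queue */
        ubq->ubq_daemon = current;
        get_task_struct(ubq->ubq_daemon);

        /* at queue/device teardown, once no more task_work can be queued */
        if (ubq->ubq_daemon)
                put_task_struct(ubq->ubq_daemon);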
--
Jens Axboe