linux-kernel - Re: [RFC PATCH] ubd: add io_uring based userspace block driver

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YoMBJMk0GhXk+13E@T590>
Date:   Tue, 17 May 2022 09:57:56 +0800
From:   Ming Lei <ming.lei@...hat.com>
To:     Stefan Hajnoczi <stefanha@...hat.com>
Cc:     Jens Axboe <axboe@...nel.dk>, linux-block@...r.kernel.org,
        linux-kernel@...r.kernel.org, io-uring@...r.kernel.org,
        Gabriel Krisman Bertazi <krisman@...labora.com>,
        ZiyangZhang <ZiyangZhang@...ux.alibaba.com>,
        Xiaoguang Wang <xiaoguang.wang@...ux.alibaba.com>,
        kwolf@...hat.com, sgarzare@...hat.com
Subject: Re: [RFC PATCH] ubd: add io_uring based userspace block driver

Hello Stefan,

On Mon, May 16, 2022 at 08:29:25PM +0100, Stefan Hajnoczi wrote:
> Hi,
> This looks interesting! I have some questions:

Thanks for your comment!

> 
> 1. What is the ubdsrv permission model?
> 
> A big usability challenge for *-in-userspace interfaces is the balance
> between security and allowing unprivileged processes to use these
> features.
> 
> - Does /dev/ubd-control need to be privileged? I guess the answer is
>   yes since an evil ubdsrv can hang I/O and corrupt data in hopes of
>   triggering file system bugs.

Yes, I think so.

UBD should be in same position with NBD which does require
capable(CAP_SYS_ADMIN).

> - Can multiple processes that don't trust each other use UBD at the same
>   time? I guess not since ubd_index_idr is global.

Only single process can open /dev/ubdcN for communicating with ubd
driver, see ubd_ch_open().

> - What about containers and namespaces? They currently have (write)
>   access to the same global ubd_index_idr.

I understand contrainers/namespaces only need to see /dev/ubdbN, and
the usage model should be same with kernel loop: the global ubd_index_idr
is same with loop's loop_index_idr too.

Or can you explain a bit in detail if I misunderstood your point.

> - Maybe there should be a struct ubd_device "owner" (struct
>   task_struct *) so only devices created by the current process can be
>   modified?

I guess it isn't needed since /dev/ubdcN is opened by single process.

> 
> 2. io_uring_cmd design
> 
> The rationale for the io_uring_cmd design is not explained in the cover
> letter. I think it's worth explaining the design. Here are my guesses:
> 
> The same thing can be achieved with just file_operations and io_uring.
> ubdsrv could read I/O submissions with IORING_OP_READ and write I/O
> completions with IORING_OP_WRITE. That would require 2 sqes per
> roundtrip instead of 1, but the same number of io_uring_enter(2) calls
> since multiple sqes/cqes can be batched per syscall:
> 
> - IORING_OP_READ, addr=(struct ubdsrv_io_desc*) (for submission)
> - IORING_OP_WRITE, addr=(struct ubdsrv_io_cmd*) (for completion)
> 
> Both operations require a copy_to/from_user() to access the command
> metadata.

Yes, but it can't be efficient as io_uring command.

Two OPs require two long code path for read and write which are supposed
for handling fs io, so reusing complicated FS IO interface for sending
command via cha dev is really overkill, and nvme passthrough has shown
better IOPS than read/write interface with io_uring command, and extra
copy_to/from_user() may fault with extra meta copy, which can slow down
the ubd server.

Also for IORING_OP_READ, copy_to_user() has to be done in the ubq daemon
context, even though that isn't a big deal, but with extra cost(cpu utilization)
in the ubq deamon context or sleep for handling page fault, that is
really what should be avoided, we need to save more CPU for handling user
space IO logic in that context.

> 
> The io_uring_cmd approach works differently. The IORING_OP_URING_CMD sqe
> carries a 40-byte payload so it's possible to embed struct ubdsrv_io_cmd
> inside it. The struct ubdsrv_io_desc mmap gets around the fact that
> io_uring cqes contain no payload. The driver therefore needs a
> side-channel to transfer the request submission details to ubdsrv. I
> don't see much of a difference between IORING_OP_READ and the mmap
> approach though.

At least the performance difference, ->uring_cmd() requires much less
code path(single simple o_uring command) than read/write, without any copy
on command data, without fault in copy_to/from_user(), without two long/
complicated FS IO code path.

Single command of UBD_IO_COMMIT_AND_FETCH_REQ can handle both fetching
io request desc and commit command result in one trip.

> 
> It's not obvious to me how much more efficient the io_uring_cmd approach
> is, but taking fewer trips around the io_uring submission/completion
> code path is likely to be faster. Something similar can be done with
> file_operations ->ioctl(), but I guess the point of using io_uring is
> that is composes. If ubdsrv itself wants to use io_uring for other I/O
> activity (e.g. networking, disk I/O, etc) then it can do so and won't be
> stuck in a blocking ioctl() syscall.

ioctl can't be a choice, since we will lose the benefit of batching
handling.

> 
> It would be nice if you could write 2 or 3 paragraphs explaining why the
> io_uring_cmd design and the struct ubdsrv_io_desc mmap was chosen.

Fine, I guess most are the above inline comment?

> 
> 3. Miscellaneous stuff
> 
> - There isn't much in the way of memory ordering in the code. I worry a
>   little that changes to the struct ubdsrv_io_desc mmap may not be
>   visible at the expected time with respect to the io_uring cq ring.

I believe io_uring_cmd_done() with userspace cqe helper implies enough memory
barrier, once the cqe is observed in userspace, any memory OP done before
io_uring_cmd_done() should be observed by user side cqe handling code,
otherwise it can be thought as io_uring bug.

If it isn't this way, we still can avoid any barrier by moving
setting io desc into ubq daemon context(ubd_rq_task_work_fn), but I
really want to save cpu in that context, and don't think it is needed.


Thanks, 
Ming