Date:   Mon, 29 Aug 2022 16:11:13 +0800
From:   Ming Lei <ming.lei@...hat.com>
To:     Ziyang Zhang <ZiyangZhang@...ux.alibaba.com>
Cc:     axboe@...nel.dk, xiaoguang.wang@...ux.alibaba.com,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        joseph.qi@...ux.alibaba.com
Subject: Re: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and
 aborting mechanism

On Mon, Aug 29, 2022 at 02:13:12PM +0800, Ziyang Zhang wrote:
> On 2022/8/29 13:40, Ming Lei wrote:
> > On Wed, Aug 24, 2022 at 01:47:39PM +0800, ZiyangZhang wrote:
> >> If one rq is handled by io_uring_cmd_complete_in_task(), then after a
> >> crash this rq is actually handled by an io_uring fallback wq. We have
> >> to end (abort) this rq since the fallback wq runs in a task other than
> >> the crashed task. However, the current code does not call
> >> io_uring_cmd_done() at the same time but does it in ublk_cancel_queue().
> >> With the current design this works, because ublk_cancel_queue() is
> >> called AFTER del_gendisk(), which waits for the rq to be ended (aborted)
> >> in the fallback wq. This implies that the fallback wq work for this rq
> >> is scheduled BEFORE io_uring_cmd_done() is called on the corresponding
> >> ioucmd in ublk_cancel_queue().
> > 
> > Right.
> > 
> >>
> >> However, when considering the recovery feature, we cannot rely on
> >> del_gendisk() or blk_mq_freeze_queue() to wait for completion of all
> >> rqs, because we may not want any rq to be aborted. Besides, io_uring
> >> does not provide a "flush fallback" mechanism, so we cannot track this
> >> ioucmd.
> > 
> > Why not?
> > 
> > If user recovery is enabled, del_gendisk() can be replaced with
> > blk_mq_quiesce_queue(), then let the abort work function:
> > 
> > - cancel all in-flight requests by moving them onto the requeue list
> >   instead of finishing them as before; this is safe because the abort
> >   worker knows the ubq daemon is dying
> > - cancel pending commands as before, because the situation is the same
> >   as when the disk is deleted or the queue is frozen
> 
> The problem is: we cannot control when the fallback wq is scheduled,
> so it is unsafe to call io_uring_cmd_done() from another process.

What is the other process?

It can't be the fallback wq, since any ublk request is aborted at the
beginning of __ublk_rq_task_work().

It shouldn't be the process calling ublk_cancel_dev(), since it is
safe to call io_uring_cmd_done() if ubq->nr_io_ready > 0.

Or others?
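
To be clear, the early-abort check I am referring to at the start of
__ublk_rq_task_work() looks roughly like this (a simplified sketch, not
the exact upstream code):

	/*
	 * Simplified sketch: if the ubq daemon has died, the request is
	 * ended right away instead of being forwarded to userspace.
	 */
	if (unlikely(current != ubq->ubq_daemon ||
		     current->flags & PF_EXITING)) {
		blk_mq_end_request(req, BLK_STS_IOERR);
		return;
	}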

> Otherwise there is a UAF, just as in commit
> 5804987b7272f437299011c76b7363b8df6f8515 ("ublk_drv: do not add a
> re-issued request aborted previously to ioucmd's task_work").

As I mentioned, del_gendisk() can be replaced with
blk_mq_quiesce_queue() in case of user recovery, then no new request
can be queued after blk_mq_quiesce_queue() returns.
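
As a rough sketch of what I mean (the helper names here are made up and
the fields only loosely follow ublk_drv.c, so treat it as illustration
only):

	/*
	 * In the user-recovery case, stop new I/O instead of deleting
	 * the disk.
	 */
	static void ublk_quiesce_dev(struct ublk_device *ub)
	{
		blk_mq_quiesce_queue(ub->ub_disk->queue);
	}

	/*
	 * Called by the abort worker for each in-flight request: requeue
	 * it instead of failing it. This is safe because the abort worker
	 * already knows the ubq daemon is dying.
	 */
	static void ublk_requeue_rq_for_recovery(struct request *rq)
	{
		blk_mq_requeue_request(rq, false);
	}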

> 
> Yeah I know the answer is very simple: flush the fallback wq.
> But here are two more questions:

I don't see why we need to flush the fallback wq, care to provide some
details?

> 
> (1) Should ublk_drv rely on the fallback wq mechanism?
>     IMO, ublk_drv should not know the details of io_uring_cmd_complete_in_task()
>     because its implementation may change in the future.
>     BTW, I think the current ublk_rq_task_work_cb() is not correct because
>     it does not always call io_uring_cmd_done() before returning.
>     nvme_uring_cmd_end_io() always calls io_uring_cmd_done() for each ioucmd,
>     no matter whether the rq succeeds or fails.
> 
> (2) Suppose io_uring does export the symbol 'flush_fallback_work'; should we
>     call it before starting a new process (recovery)?
>     What if the fallback wq is not scheduled immediately because many
>     processes are running and the system load is heavy? In that case the
>     recovery process may wait too long. We really should not depend on the
>     fallback wq; please let the fallback wq complete the ioucmd itself.
> 
> > 
> > This way, the current abort logic won't be changed much.
> > 
> > And user recovery should only be started _after_ the ublk device is
> > found to be aborted.
> 
> START_RECOVERY will check whether all ubq_daemons (the processes) are
> PF_EXITING.

That is different. If START_RECOVERY is only run on an aborted device,
the recovery handler could be simplified.
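
For reference, the check being described, i.e. verifying that every ubq
daemon is PF_EXITING, would look roughly like the sketch below (names
loosely follow ublk_drv.c and the helper itself is hypothetical):

	/*
	 * Only allow START_RECOVERY once every ubq daemon is observed
	 * to be dying.
	 */
	static bool ublk_can_start_recovery(struct ublk_device *ub)
	{
		int i;

		for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
			struct ublk_queue *ubq = ublk_get_queue(ub, i);

			if (!ubq->ubq_daemon ||
			    !(ubq->ubq_daemon->flags & PF_EXITING))
				return false;
		}
		return true;
	}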

> 
> > 
> >>
> >> The recovery mechanism needs to complete all ioucmds of a dying ubq
> >> to avoid leaking the io_uring ctx. But as discussed above, it is unsafe
> >> to call io_uring_cmd_done() in the recovery task if the fallback wq
> >> happens to run simultaneously. This is a UAF case because the io_uring
> >> ctx may be freed. Actually a similar case happens in commit
> >> 5804987b7272f437299011c76b7363b8df6f8515 ("ublk_drv: do not add a
> >> re-issued request aborted previously to ioucmd's task_work").
> > 
> > If you take the above approach, I guess there isn't such a problem,
> > because abort can handle the case well as before.
> 
> Ming, we did consider this approach (quiesce, requeue rq / complete
> ioucmd) at the very beginning, but we decided to drop it because we do
> not want to rely on the 'flush fallback wq' mechanism, which makes
> ublk_drv depend on io_uring's internal implementation.

Then the focus is 'flush fallback wq'; please see my question above.


Thanks,
Ming
