linux-kernel - Re: ublk-qcow2: ublk-qcow2 is available

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Yz7vvNKSNRyBVObo@T590>
Date:   Thu, 6 Oct 2022 23:09:48 +0800
From:   Ming Lei <tom.leiming@...il.com>
To:     Stefan Hajnoczi <stefanha@...hat.com>
Cc:     "Denis V. Lunev" <den@...tuozzo.com>, io-uring@...r.kernel.org,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        Kirill Tkhai <kirill.tkhai@...nvz.org>,
        Manuel Bentele <development@...uel-bentele.de>,
        qemu-devel@...gnu.org, Kevin Wolf <kwolf@...hat.com>,
        rjones@...hat.com, Xie Yongji <xieyongji@...edance.com>,
        Stefano Garzarella <sgarzare@...hat.com>,
        Josef Bacik <josef@...icpanda.com>
Subject: Re: ublk-qcow2: ublk-qcow2 is available

On Thu, Oct 06, 2022 at 09:59:40AM -0400, Stefan Hajnoczi wrote:
> On Thu, Oct 06, 2022 at 06:26:15PM +0800, Ming Lei wrote:
> > On Wed, Oct 05, 2022 at 11:11:32AM -0400, Stefan Hajnoczi wrote:
> > > On Tue, Oct 04, 2022 at 01:57:50AM +0200, Denis V. Lunev wrote:
> > > > On 10/3/22 21:53, Stefan Hajnoczi wrote:
> > > > > On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
> > > > > > ublk-qcow2 is available now.
> > > > > Cool, thanks for sharing!
> > > > yep
> > > > 
> > > > > > So far it provides basic read/write function, and compression and snapshot
> > > > > > aren't supported yet. The target/backend implementation is completely
> > > > > > based on io_uring, and share the same io_uring with ublk IO command
> > > > > > handler, just like what ublk-loop does.
> > > > > > 
> > > > > > Follows the main motivations of ublk-qcow2:
> > > > > > 
> > > > > > - building one complicated target from scratch helps libublksrv APIs/functions
> > > > > >    become mature/stable more quickly, since qcow2 is complicated and needs more
> > > > > >    requirement from libublksrv compared with other simple ones(loop, null)
> > > > > > 
> > > > > > - there are several attempts of implementing qcow2 driver in kernel, such as
> > > > > >    ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so ublk-qcow2
> > > > > >    might useful be for covering requirement in this field
> > > > There is one important thing to keep in mind about all partly-userspace
> > > > implementations though:
> > > > * any single allocation happened in the context of the
> > > >    userspace daemon through try_to_free_pages() in
> > > >    kernel has a possibility to trigger the operation,
> > > >    which will require userspace daemon action, which
> > > >    is inside the kernel now.
> > > > * the probability of this is higher in the overcommitted
> > > >    environment
> > > > 
> > > > This was the main motivation of us in favor for the in-kernel
> > > > implementation.
> > > 
> > > CCed Josef Bacik because the Linux NBD driver has dealt with memory
> > > reclaim hangs in the past.
> > > 
> > > Josef: Any thoughts on userspace block drivers (whether NBD or ublk) and
> > > how to avoid hangs in memory reclaim?
> > 
> > If I remember correctly, there isn't new report after the last NBD(TCMU) deadlock
> > in memory reclaim was addressed by 8d19f1c8e193 ("prctl: PR_{G,S}ET_IO_FLUSHER
> > to support controlling memory reclaim").
> 
> Denis: I'm trying to understand the problem you described. Is this
> correct:
> 
> Due to memory pressure, the kernel reclaims pages and submits a write to
> a ublk block device. The userspace process attempts to allocate memory
> in order to service the write request, but it gets stuck because there
> is no memory available. As a result reclaim gets stuck, the system is
> unable to free more memory and therefore it hangs?

The process should be killed in this situation if PR_SET_IO_FLUSHER
is applied since the page allocation is done in VM fault handler.

Firstly in theory the userspace part should provide forward progress
guarantee in code path for handling IO, such as reserving/mlock pages
for such situation. However, this issue isn't unique for nbd or ublk,
all userspace block device should have such potential risk, and vduse
is no exception, IMO.

Secondly with proper/enough swap space, I think it is hard to trigger
such kind of issue.

Finally ublk driver has added user recovery commands for recovering from
crash, and ublksrv will support it soon.

Thanks,
Ming