linux-kernel - Re: [PATCH v3 2/2] loop: Better discard support for block devices

Message-ID: <CAE=gft7AXQ1t_oeJRpytD26A=hAPp3G-g5SY0MvxzZAGYJGaSA@mail.gmail.com>
Date:   Thu, 28 Mar 2019 12:53:06 -0700
From:   Evan Green <evgreen@...omium.org>
To:     Ming Lei <ming.lei@...hat.com>
Cc:     Jens Axboe <axboe@...nel.dk>,
        Martin K Petersen <martin.petersen@...cle.com>,
        Bart Van Assche <bvanassche@....org>,
        Gwendal Grignou <gwendal@...omium.org>,
        Alexis Savery <asavery@...omium.org>,
        linux-block <linux-block@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 2/2] loop: Better discard support for block devices

On Wed, Mar 27, 2019 at 7:37 PM Ming Lei <ming.lei@...hat.com> wrote:
>
> On Wed, Mar 27, 2019 at 03:28:41PM -0700, Evan Green wrote:
...
> > @@ -854,6 +854,25 @@ static void loop_config_discard(struct loop_device *lo)
> >       struct file *file = lo->lo_backing_file;
> >       struct inode *inode = file->f_mapping->host;
> >       struct request_queue *q = lo->lo_queue;
> > +     struct request_queue *backingq;
> > +
> > +     /*
> > +      * If the backing device is a block device, mirror its discard
> > +      * capabilities.
> > +      */
> > +     if (S_ISBLK(inode->i_mode)) {
> > +             backingq = bdev_get_queue(inode->i_bdev);
> > +             blk_queue_max_discard_sectors(q,
> > +                     backingq->limits.max_discard_sectors);
> > +
> > +             blk_queue_max_write_zeroes_sectors(q,
> > +                     backingq->limits.max_write_zeroes_sectors);
> > +
> > +             q->limits.discard_granularity =
> > +                     backingq->limits.discard_granularity;
> > +
> > +             q->limits.discard_alignment =
> > +                     backingq->limits.discard_alignment;
>
> Loop usually doesn't mirror the backing queue's limits, and I believe
> it isn't necessary in this case either. I'm just wondering why the
> following simple setting can't work?
>
>         if (S_ISBLK(inode->i_mode)) {
>                 backingq = bdev_get_queue(inode->i_bdev);
>
>                 q->limits.discard_alignment = 0;
>                 if (!blk_queue_discard(backingq)) {
>                         q->limits.discard_granularity = 0;
>                         blk_queue_max_discard_sectors(q, 0);
>                         blk_queue_max_write_zeroes_sectors(q, 0);
>                         blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
>                 } else {
>                         q->limits.discard_granularity = inode->i_sb->s_blocksize;
>                         blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
>                         blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
>                         blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
>                 }
>         } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
>                 ...
>         }
>
> I remember you mentioned that the above code doesn't work in some of
> your tests, but you never explained the reason. However, it is supposed
> to work given that bio splitting does handle/respect the discard limits.
> Or is there a bug in bio splitting on discard IO?

I've done some more digging, and I think I have an answer for you,
with some proposed changes to the patch.

My original answer was going to be that REQ_OP_DISCARD and
REQ_OP_WRITE_ZEROES are different. I have an NVMe device that supports
discard but not write_zeroes, so the loop device should mirror those
capabilities individually to most accurately reflect the underlying
block device. But then I noticed that this device still prints the
error log I was trying to get rid of when doing mkfs.ext4, so my fix
is incomplete.

The reason is that I have the following translation between REQ_OP_*
and FALLOC_FL_*:
REQ_OP_DISCARD ==> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
REQ_OP_WRITE_ZEROES ==> FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE

This makes sense for loop devices backed by regular files, and I think
it is the right mapping. But for loop devices backed by block devices,
blkdev_fallocate() translates both of these sets of flags into
blkdev_issue_zeroout(), rather than blkdev_issue_discard() for
REQ_OP_DISCARD (since I wasn't setting FALLOC_FL_NO_HIDE_STALE).
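
To make the flag flow above concrete, here is a small userspace sketch
of the two translation steps. It is only a model of what is described
in this message: the enum names (loop_op, blk_helper) and both function
names are hypothetical stand-ins, not kernel identifiers, and the
dispatch is a simplified picture of blkdev_fallocate(), not a copy of
it. Only the FALLOC_FL_* values come from a real header.

```c
#include <linux/falloc.h>   /* FALLOC_FL_* flag values */

/* Hypothetical stand-ins for REQ_OP_DISCARD / REQ_OP_WRITE_ZEROES. */
enum loop_op { OP_DISCARD, OP_WRITE_ZEROES };

/* Hypothetical labels for the helper the backing block device ends up in. */
enum blk_helper { HELPER_ZEROOUT, HELPER_DISCARD, HELPER_UNSUPPORTED };

/* Step 1: the loop driver's translation from request type to fallocate mode. */
static int loop_op_to_falloc_mode(enum loop_op op)
{
	switch (op) {
	case OP_DISCARD:
		return FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
	case OP_WRITE_ZEROES:
		return FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE;
	}
	return 0;
}

/*
 * Step 2: simplified model of the mode dispatch for a block-device
 * backing file. Without FALLOC_FL_NO_HIDE_STALE, punching a hole is
 * treated as a zeroout rather than a discard.
 */
static enum blk_helper blkdev_mode_to_helper(int mode)
{
	switch (mode) {
	case FALLOC_FL_ZERO_RANGE:
	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
		return HELPER_ZEROOUT;
	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
		return HELPER_DISCARD;
	default:
		return HELPER_UNSUPPORTED;
	}
}
```

Chaining the two steps for either request type lands on the zeroout
helper, which is why a backing device without write_zeroes support
still produces errors even though it supports discard.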

I think this set of flags still makes sense for block devices, since
it keeps a consistent behavior for loop devices backed by files and
block devices (namely, that the discarded space is always zeroed).
However, it means that for my NVMe, which supports discard (never
used) but not write_zeroes (always tried), a loop device backed
directly by this NVMe should not set the discard flag.

So I think what I should actually have is this:

if (S_ISBLK(inode->i_mode)) {
        backingq = bdev_get_queue(inode->i_bdev);
        /* Note the difference here: discard uses the write_zeroes limit. */
        blk_queue_max_discard_sectors(q,
                backingq->limits.max_write_zeroes_sectors);

        blk_queue_max_write_zeroes_sectors(q,
                backingq->limits.max_write_zeroes_sectors);
} else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) { ... }
...
if (q->limits.max_write_zeroes_sectors)
        blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
else
        blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);

I can confirm that this fixes the errors for my NVMe as well.
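
As an illustration of the block-device branch of the proposal above,
the following userspace sketch models the resulting limits. The struct
limits_model type and both function names are hypothetical; they stand
in for struct queue_limits and the QUEUE_FLAG_DISCARD check, and cover
only the case where the backing file is a block device.

```c
#include <stdbool.h>

/* Hypothetical, pared-down stand-in for struct queue_limits. */
struct limits_model {
	unsigned int max_discard_sectors;
	unsigned int max_write_zeroes_sectors;
};

/*
 * Both loop limits are taken from the backing device's write_zeroes
 * limit, since both request types end up as a zeroout on the backing
 * device. The backing device's own discard limit is deliberately unused.
 */
static struct limits_model loop_limits_from_backing(struct limits_model backing)
{
	struct limits_model loop = {
		.max_discard_sectors = backing.max_write_zeroes_sectors,
		.max_write_zeroes_sectors = backing.max_write_zeroes_sectors,
	};
	return loop;
}

/* Stand-in for whether QUEUE_FLAG_DISCARD would be set on the loop queue. */
static bool loop_discard_enabled(struct limits_model loop)
{
	return loop.max_write_zeroes_sectors != 0;
}
```

With a backing device that reports a nonzero discard limit but a
write_zeroes limit of 0 (the NVMe case described here), the loop device
in this model ends up with both limits at 0 and discard disabled.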

What do you think?
-Evan