linux-kernel - Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJVOszBYMyx5ODyCAF6ZczC+aU_zpYS9SSfMcOzp+G38HN7BLg@mail.gmail.com>
Date:   Mon, 22 Aug 2016 19:22:36 -0500
From:   Shaun Tancheff <shaun.tancheff@...gate.com>
To:     Damien Le Moal <damien.lemoal@...t.com>
Cc:     Shaun Tancheff <shaun@...cheff.com>, linux-block@...r.kernel.org,
        linux-scsi@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
        Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>,
        "James E . J . Bottomley" <jejb@...ux.vnet.ibm.com>,
        "Martin K . Petersen" <martin.petersen@...cle.com>,
        Hannes Reinecke <hare@...e.de>,
        Josh Bingaman <josh.bingaman@...gate.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Sagi Grimberg <sagig@...lanox.com>,
        Mike Christie <mchristi@...hat.com>,
        Toshi Kani <toshi.kani@....com>,
        Ming Lei <ming.lei@...onical.com>
Subject: Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same

On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal <damien.lemoal@...t.com> wrote:
>
> Shaun,
>
> On 8/22/16 13:31, Shaun Tancheff wrote:
> [...]
>> -int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
>> -                      sector_t sector, unsigned int num_sectors)
>> +int sd_zbc_setup_discard(struct scsi_cmnd *cmd)
>>  {
>> -     struct blk_zone *zone;
>> +     struct request *rq = cmd->request;
>> +     struct scsi_device *sdp = cmd->device;
>> +     struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
>> +     sector_t sector = blk_rq_pos(rq);
>> +     unsigned int nr_sectors = blk_rq_sectors(rq);
>>       int ret = BLKPREP_OK;
>> +     struct blk_zone *zone;
>>       unsigned long flags;
>> +     u32 wp_offset;
>> +     bool use_write_same = false;
>>
>>       zone = blk_lookup_zone(rq->q, sector);
>> -     if (!zone)
>> +     if (!zone) {
>> +             /* Test for a runt zone before giving up */
>> +             if (sdp->type != TYPE_ZBC) {
>> +                     struct request_queue *q = rq->q;
>> +                     struct rb_node *node;
>> +
>> +                     node = rb_last(&q->zones);
>> +                     if (node)
>> +                             zone = rb_entry(node, struct blk_zone, node);
>> +                     if (zone) {
>> +                             spin_lock_irqsave(&zone->lock, flags);
>> +                             if ((zone->start + zone->len) <= sector)
>> +                                     goto out;
>> +                             spin_unlock_irqrestore(&zone->lock, flags);
>> +                             zone = NULL;
>> +                     }
>> +             }
>>               return BLKPREP_KILL;
>> +     }
>
> I do not understand the point of this code here to test for the runt
> zone. As long as sector is within the device maximum capacity (in 512B
> unit), blk_lookup_zone will return the pointer to the zone structure
> containing that sector (the RB-tree does not have any constraint
> regarding zone size). The only case where NULL would be returned is if
> discard is issued super early right after the disk is probed and before
> the zone refresh work has completed. We can certainly protect against
> that by delaying the discard.

As you can see I am not including Host Managed in the
runt check.

Also you may note that in my patch to get Host Aware working
with the zone cache I do not include the runt zone in the cache.
So as it sits I need this fallback otherwise doing blkdiscard over
the whole device ends in a error, as well as mkfs.f2fs et. al.

>>       spin_lock_irqsave(&zone->lock, flags);
>> -
>>       if (zone->state == BLK_ZONE_UNKNOWN ||
>>           zone->state == BLK_ZONE_BUSY) {
>>               sd_zbc_debug_ratelimit(sdkp,
>> -                                    "Discarding zone %zu state %x, deferring\n",
>> +                                    "Discarding zone %zx state %x, deferring\n",
>
> Sector values are usually displayed in decimal. Why use Hex here ? At
> least "0x" would be needed to avoid confusion I think.

Yeah, my brain is lazy about converting very large
numbers to powers of 2. So it's much easier to spot
zone alignment here.



>>                                      zone->start, zone->state);
>>               ret = BLKPREP_DEFER;
>>               goto out;
>> @@ -406,46 +428,80 @@ int sd_zbc_setup_discard(struct scsi_disk *sdkp, struct request *rq,
>>       if (zone->state == BLK_ZONE_OFFLINE) {
>>               /* let the drive fail the command */
>>               sd_zbc_debug_ratelimit(sdkp,
>> -                                    "Discarding offline zone %zu\n",
>> +                                    "Discarding offline zone %zx\n",
>>                                      zone->start);
>>               goto out;
>>       }
>> -
>> -     if (!blk_zone_is_smr(zone)) {
>> +     if (blk_zone_is_cmr(zone)) {
>> +             use_write_same = true;
>>               sd_zbc_debug_ratelimit(sdkp,
>> -                                    "Discarding %s zone %zu\n",
>> -                                    blk_zone_is_cmr(zone) ? "CMR" : "unknown",
>> +                                    "Discarding CMR zone %zx\n",
>>                                      zone->start);
>> -             ret = BLKPREP_DONE;
>>               goto out;
>>       }
>
> Some 10TB host managed disks out there have 1% conventional zone space,
> that is 100GB of capacity. When issuing a "reset all", doing a write
> same in these zones will take forever... If the user really wants zeroes
> in those zones, let it issue a zeroout.
>
> I think that it would a better choice to simply not report
> discard_zeroes_data as true and do nothing for conventional zones reset.

I think that would be unfortunate for Host Managed but I think it's
the right choice for Host Aware at this time. So either we base
it on disk type or we have some other config flag added to sysfs.

>> -     if (blk_zone_is_empty(zone)) {
>> -             sd_zbc_debug_ratelimit(sdkp,
>> -                                    "Discarding empty zone %zu\n",
>> -                                    zone->start);
>> -             ret = BLKPREP_DONE;
>> +     if (zone->start != sector || zone->len < nr_sectors) {
>> +             sd_printk(KERN_ERR, sdkp,
>> +                       "Misaligned RESET WP %zx/%x on zone %zx/%zx\n",
>> +                       sector, nr_sectors, zone->start, zone->len);
>> +             ret = BLKPREP_KILL;
>>               goto out;
>>       }
>> -
>> -     if (zone->start != sector ||
>> -         zone->len < num_sectors) {
>> +     /* Protect against Reset WP when more data had been written to the
>> +      * zone than is being discarded.
>> +      */
>> +     wp_offset = zone->wp - zone->start;
>> +     if (wp_offset > nr_sectors) {
>>               sd_printk(KERN_ERR, sdkp,
>> -                       "Misaligned RESET WP, start %zu/%zu "
>> -                       "len %zu/%u\n",
>> -                       zone->start, sector, zone->len, num_sectors);
>> +                       "Will Corrupt RESET WP %zx/%x/%x on zone %zx/%zx/%zx\n",
>> +                       sector, wp_offset, nr_sectors,
>> +                       zone->start, zone->wp, zone->len);
>>               ret = BLKPREP_KILL;
>>               goto out;
>>       }

---
Shaun Tancheff