Date:   Thu, 20 Aug 2020 10:14:43 +0200
From:   Javier Gonzalez <javier@...igon.com>
To:     Kanchan Joshi <joshiiitr@...il.com>
Cc:     Keith Busch <kbusch@...nel.org>,
        David Fugate <david.fugate@...ux.intel.com>,
        "axboe@...nel.dk" <axboe@...nel.dk>,
        "Damien.LeMoal@....com" <Damien.LeMoal@....com>,
        SelvaKumar S <selvakuma.s1@...sung.com>,
        "sagi@...mberg.me" <sagi@...mberg.me>,
        Kanchan Joshi <joshi.k@...sung.com>,
        "johannes.thumshirn@....com" <johannes.thumshirn@....com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
        Nitesh Shetty <nj.shetty@...sung.com>,
        Christoph Hellwig <hch@....de>
Subject: Re: [PATCH 2/2] nvme: add emulation for zone-append

On 20.08.2020 13:07, Kanchan Joshi wrote:
>On Thu, Aug 20, 2020 at 3:22 AM Keith Busch <kbusch@...nel.org> wrote:
>>
>> On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
>> > Intel does not support making *optional* NVMe spec features *required*
>> > by the NVMe driver.
>>
>> This is inaccurate. As another example, the spec optionally allows a
>> zone size to be a power of 2, but Linux requires a power of 2 if you
>> want to use it with this driver.
>>
>> > Provided there are no glaring technical issues
>>
>> There are many. Some may be improved through a serious review process,
>> but the mess it makes out of the fast path is measurably harmful to
>> devices that don't subscribe to this. That issue is not so easily
>> remedied.
>>
>> Since this patch is a copy of the SCSI implementation, the reasonable
>> thing is to take this fight to the generic block layer for a common
>> solution. We're not taking this in the NVMe driver.
>
>I sincerely want to minimize any adverse impact on the fast path of
>non-zoned devices.
>My understanding of that aspect is as follows (good to confirm,
>irrespective of the future of this patch):
>
>1. Submission path:
>There is no extra code for non-zoned device IO. For append, there is
>only the "ns->append_emulate == 1" check.
>Snippet -
>        case REQ_OP_ZONE_APPEND:
>-               ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
>+               if (!nvme_is_append_emulated(ns))
>+                       ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
>+               else {
>+                       /* prepare append like write, and adjust lba afterwards */
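
For illustration, here is a minimal sketch of what the emulated branch
could do, modeled on the sd_zbc code this patch mirrors. The helper name
nvme_setup_append_emulated(), the ns->za_emul field, and the locking
details are assumptions, not taken from the posted patch:

        /* Sketch: redirect an emulated append to the zone's cached write
         * pointer and issue it as a plain write; the completion path then
         * reports back the LBA that was actually written.
         */
        static blk_status_t nvme_setup_append_emulated(struct nvme_ns *ns,
                        struct request *req, struct nvme_command *cmd)
        {
                struct nvme_za_emul *za = ns->za_emul;  /* assumed field */
                unsigned int zno = blk_rq_zone_no(req);
                u32 wp_offset;

                /* Serialize appends to the same zone, as sd_zbc does. */
                if (!blk_req_zone_write_trylock(req))
                        return BLK_STS_ZONE_RESOURCE;

                spin_lock_bh(&za->zones_wp_offset_lock);
                wp_offset = za->zones_wp_offset[zno];   /* in sectors (assumed) */
                spin_unlock_bh(&za->zones_wp_offset_lock);

                /* Rewrite the start sector to the cached write pointer. */
                req->__sector = (sector_t)zno * blk_queue_zone_sectors(req->q)
                                + wp_offset;

                return nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
        }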
>
>2. Completion path:
> Not as clean as the submission path, for sure.
>The extra check is "ns && ns->append_emulate == 1" on completions when
>CONFIG_BLK_DEV_ZONED is enabled.
>A slightly better completion-handling version (than the submitted patch) is -
>
>-       } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
>-                  req_op(req) == REQ_OP_ZONE_APPEND) {
>-               req->__sector = nvme_lba_to_sect(req->q->queuedata,
>-                       le64_to_cpu(nvme_req(req)->result.u64));
>+       } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
>+               struct nvme_ns *ns = req->q->queuedata;
>+               /* append-emulation requires wp update for some cmds */
>+               if (ns && nvme_is_append_emulated(ns)) {
>+                       if (nvme_need_zone_wp_update(req))
>+                               nvme_zone_wp_update(ns, req, status);
>+               } else if (req_op(req) == REQ_OP_ZONE_APPEND)
>+                       req->__sector = nvme_lba_to_sect(ns,
>+                                       le64_to_cpu(nvme_req(req)->result.u64));
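
Since the patch is a copy of the SCSI emulation, nvme_need_zone_wp_update()
presumably filters on the operations that move or reset the write pointer,
roughly as sd_zbc_need_zone_wp_update() does. A sketch, not the posted code:

        static bool nvme_need_zone_wp_update(struct request *req)
        {
                switch (req_op(req)) {
                case REQ_OP_ZONE_APPEND:
                case REQ_OP_ZONE_FINISH:
                case REQ_OP_ZONE_RESET:
                case REQ_OP_ZONE_RESET_ALL:
                        /* These always move or reset the write pointer. */
                        return true;
                case REQ_OP_WRITE:
                case REQ_OP_WRITE_ZEROES:
                case REQ_OP_WRITE_SAME:
                        /* Plain writes only matter in sequential zones. */
                        return blk_rq_zone_is_seq(req);
                default:
                        return false;
                }
        }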
>
>Am I missing any other fast-path pain-points?
>
>A quick 1-minute 4K randwrite run (QD 64, 4 jobs, libaio) shows:
>before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec)
>after:  IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec)
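
For reference, numbers like the above would typically come from an fio
invocation along these lines (the device path and exact flags are
assumptions, not taken from the mail):

        fio --name=randwrite --filename=/dev/nvme0n1 --direct=1 \
            --ioengine=libaio --rw=randwrite --bs=4k --iodepth=64 \
            --numjobs=4 --runtime=60 --time_based --group_reporting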

It is good to use the QEMU "simulation" path that we implemented to test
performance with different delays, etc., but for these numbers to make
sense we need to relate them to the simulated NAND speeds and delays.

>
>This may not be the best test to measure the cost, and I am happy to
>run more tests and improve them.
>
>As for the volume of code - it is the same as the SCSI emulation, and
>I can work on reducing it by moving the common code into the block
>layer (blk-zoned) and reusing it in both the SCSI and NVMe emulation.
>In the patch I tried to isolate the append-emulation by keeping
>everything in "nvme_za_emul". It contains nothing NVMe-specific except
>nvme_ns, which can be turned into "driver_data" (see the sketch after
>the struct below).
>
>+#ifdef CONFIG_BLK_DEV_ZONED
>+struct nvme_za_emul {
>+       unsigned int nr_zones;
>+       spinlock_t zones_wp_offset_lock;
>+       u32 *zones_wp_offset;
>+       u32 *rev_wp_offset;
>+       struct work_struct zone_wp_offset_work;
>+       char *zone_wp_update_buf;
>+       struct mutex rev_mutex;
>+       struct nvme_ns *ns;
>+};
>+#endif
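
As a sketch of that direction (the struct and field names here are
assumed, not part of the patch), the same state could move into the
block layer with the nvme_ns pointer generalized away:

        /* Hypothetical block-layer counterpart of nvme_za_emul. */
        struct blk_zone_append_emul {
                unsigned int nr_zones;
                spinlock_t zones_wp_offset_lock;
                u32 *zones_wp_offset;
                u32 *rev_wp_offset;
                struct work_struct zone_wp_offset_work;
                char *zone_wp_update_buf;
                struct mutex rev_mutex;
                void *driver_data;      /* was: struct nvme_ns *ns */
        };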
>
>Will that be an acceptable line of further work/discussion?

I know we spent time enabling this path, but I don't think that moving
the discussion to the block layer will bring much more benefit.

Let's keep the support for these non-append devices in xNVMe and focus
on the support for the append-enabled ones in Linux. We have a lot of
good stuff in the backlog that we can start pushing.
