Message-ID: <5a954c4e-aa84-834d-7d04-0ce3545d45c9@kernel.dk>
Date:   Thu, 12 Nov 2020 11:02:19 -0700
From:   Jens Axboe <axboe@...nel.dk>
To:     Rachit Agarwal <rach4x0r@...il.com>, Christoph Hellwig <hch@....de>
Cc:     linux-block@...r.kernel.org, linux-nvme@...ts.infradead.org,
        linux-kernel@...r.kernel.org, Keith Busch <kbusch@...nel.org>,
        Ming Lei <ming.lei@...hat.com>,
        Jaehyun Hwang <jaehyun.hwang@...nell.edu>,
        Qizhe Cai <qc228@...nell.edu>,
        Midhul Vuppalapati <mvv25@...nell.edu>,
        Rachit Agarwal <ragarwal@...cornell.edu>,
        Sagi Grimberg <sagi@...htbitslabs.com>,
        Rachit Agarwal <ragarwal@...nell.edu>
Subject: Re: [PATCH] iosched: Add i10 I/O Scheduler

On 11/12/20 7:07 AM, Rachit Agarwal wrote:
> From: Rachit Agarwal <rach4x0r@...il.com>
> 
> Hi All,
> 
> I/O batching is beneficial for optimizing IOPS and throughput for
> various applications. For instance, several kernel block drivers would
> benefit from batching, including mmc [1] and tcp-based storage drivers
> like nvme-tcp [2,3]. While we have support for batching dispatch [4],
> we need an I/O scheduler to efficiently enable batching. Such a
> scheduler is particularly interesting for disaggregated storage, where
> the access latency of remote storage may be higher than that of local
> storage; batching can thus significantly help amortize the remote
> access latency while increasing throughput.
> 
> This patch introduces the i10 I/O scheduler, which performs batching
> per hctx in terms of #requests, #bytes, and timeouts (at microsecond
> granularity). i10 starts dispatching only when #requests or #bytes
> exceeds a default threshold or when a timer expires. After that,
> batching dispatch [4] takes place, allowing batching at the device
> driver via "bd->last" and ".commit_rqs".
> 
> The i10 I/O scheduler builds upon the recent work in [6]. We have
> tested the i10 I/O scheduler with the nvme-tcp optimizations [2,3] and
> batching dispatch [4], varying the number of cores, the read/write
> ratio, and the request size, with both an NVMe SSD and a RAM block
> device. For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvement
> in IOPS per core over the "noop" I/O scheduler. These results are
> available at [5], and many additional results are presented in [6].
> 
> While other schedulers may also batch I/O (e.g., mq-deadline), the
> optimization target in the i10 I/O scheduler is throughput
> maximization. Hence there is no latency target nor a need for a global
> tracking context, so a new scheduler is preferable to building this
> functionality into an existing one.
> 
> We currently use fixed default values as the batching thresholds
> (e.g., 16 for #requests, 64KB for #bytes, and 50us for the timeout).
> These default values are based on sensitivity tests in [6]. As future
> work, we plan to support adaptive batching according to system load
> and to extend the scheduler to support isolation in multi-tenant
> deployments (to simultaneously achieve low tail latency for
> latency-sensitive applications and high throughput for
> throughput-bound applications).
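
For illustration, here is a minimal sketch of the per-hctx batching
state and trigger described above. The names (i10_hctx_queue,
i10_hctx_batch_ready, the I10_DEF_* constants) and the exact layout are
hypothetical and not taken from the posted patch; only the default
thresholds (16 requests, 64KB, 50us) come from the description above.

    /*
     * Hypothetical per-hctx batching state -- illustrative only,
     * not the actual i10 implementation.
     */
    #include <linux/blk-mq.h>
    #include <linux/hrtimer.h>
    #include <linux/spinlock.h>

    #define I10_DEF_BATCH_NR         16           /* #requests threshold */
    #define I10_DEF_BATCH_BYTES      (64 * 1024)  /* #bytes threshold */
    #define I10_DEF_BATCH_TIMEOUT_US 50           /* timeout, in microseconds */

    struct i10_hctx_queue {
            spinlock_t              lock;
            struct list_head        rq_list;        /* held-back requests */
            unsigned int            nr_rq;          /* requests accumulated */
            unsigned int            nr_bytes;       /* bytes accumulated */
            struct hrtimer          dispatch_timer;
            struct blk_mq_hw_ctx    *hctx;
    };

    /* Release the batch once either threshold is reached. */
    static bool i10_hctx_batch_ready(struct i10_hctx_queue *q)
    {
            return q->nr_rq >= I10_DEF_BATCH_NR ||
                   q->nr_bytes >= I10_DEF_BATCH_BYTES;
    }

    /* The timer bounds how long a partial batch may be held back. */
    static enum hrtimer_restart i10_hctx_timeout(struct hrtimer *t)
    {
            struct i10_hctx_queue *q =
                    container_of(t, struct i10_hctx_queue, dispatch_timer);

            blk_mq_run_hw_queue(q->hctx, true);
            return HRTIMER_NORESTART;
    }

When dispatch does run, the accumulated requests can be handed down
back-to-back, letting the driver use "bd->last" / ".commit_rqs" to ring
its doorbell once per batch rather than once per request.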

I haven't taken a close look at the code yet, but one quick note:
patches like this should be against the branches for 5.11. In fact,
this one doesn't even compile against current -git, as
blk_mq_bio_list_merge is now called blk_bio_list_merge.
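
For reference, the fixup on top of current -git is just the rename; a
hunk along these lines (the surrounding context and variable names here
are illustrative, not taken from the patch) would do:

    -	merged = blk_mq_bio_list_merge(q, &i10q->rq_list, bio, nr_segs);
    +	merged = blk_bio_list_merge(q, &i10q->rq_list, bio, nr_segs);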

In any case, I did run this through some quick peak testing as I was
curious, and I'm seeing about a 20% drop in peak IOPS over "none" when
running this. Perf diff:

    10.71%     -2.44%  [kernel.vmlinux]  [k] read_tsc
     2.33%     -1.99%  [kernel.vmlinux]  [k] _raw_spin_lock

Also:

> [5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf

Was curious and wanted to look it up, but it doesn't exist.

-- 
Jens Axboe
