linux-kernel - Re: [PATCH 1/2] blk-throtl: make latency= absolute

Message-ID: <20171113112710.GG983427@devbig577.frc2.facebook.com>
Date:   Mon, 13 Nov 2017 03:27:10 -0800
From:   Tejun Heo <tj@...nel.org>
To:     Shaohua Li <shli@...nel.org>
Cc:     Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org,
        kernel-team@...com
Subject: Re: [PATCH 1/2] blk-throtl: make latency= absolute

Hello, Shaohua.

On Sun, Nov 12, 2017 at 08:29:40PM -0800, Shaohua Li wrote:
> Didn't get this. What did you mean 'queueing time on the host side'? You mean
> the application think time delay?
> 
> My point is absolute latency doesn't protect as we expected. Let me have an
> example. Say 4k latency is 60us, BW is 100MB/s. When 4k BW is 50MB/s, the
> latency is 200us. 1M latency is 500us. If you set the absolute latency to
> 600us, you can't protect the 4k BW to above 50MB/s. To do the protection, you
> really want to set the absolute latency below 500us, which doesn't work for the
> 1M IO.

What I'm trying to say is that the latency is defined as "from bio
issue to completion", not "in-flight time on device".  Whether the
on-device latency is 50us or 500us, the host side queueing latency can
be orders of magnitude higher.

For things like starvation protection for managerial workloads which
work fine on rotating disks, the only thing we need to protect against
is excessive host side queue overflowing leading to starvation of such
workloads.  IOW, we're talking about latency target in tens or lower
hundreds of millisecs.  Whether the on-device time is 50 or 500us
doesn't matter that much.
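
As a rough illustration of the point above, here is a minimal sketch in
Python.  The 100ms target and the 150ms host side queueing delay are
made-up numbers, not measurements; only the 50us and 500us on-device
times come from the discussion.

```python
# Hypothetical numbers for illustration only; none are measured values.
TARGET_US = 100_000  # assumed absolute latency target: 100ms


def issue_to_completion_us(host_queue_us, on_device_us):
    """Latency as defined in this thread: from bio issue to completion."""
    return host_queue_us + on_device_us


host_queue_us = 150_000  # assumed host side queueing delay: 150ms
for on_device_us in (50, 500):
    total = issue_to_completion_us(host_queue_us, on_device_us)
    # The target is missed either way; the on-device time barely registers.
    print(on_device_us, total, total > TARGET_US)
```

With a target in this range, the 10x difference in on-device time
changes the total by well under one percent.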

> We don't overload the meaning of "N". Until your next patch, the "N" actually
> means "+N".
> 
> Ponder a little bit, I think 4ms base latency for HD actually is reasonable. We
> have LATENCY_FILTERED_HD to filter out small latency bios, which come from
> sequential IO. So remaining IO is random IO. 4k base latency for HD random IO
> should be ok. Probably something else is wrong. I think we need to understand
> what's wrong for HD throttling first before we make any change.

So, even purely from user-interface perspective, I think it can be
very confusing to use "N" to mean "base + N".  Explicitly saying
what's going on through "+N" or "N%" is a lot more straight-forward.
We could decide to change the config syntax without supporting abs
targets, but I think abs targets are useful and it's not like this
adds significant overhead / complexity.
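
To make the three spellings concrete, here is a sketch in Python of how
such a value could be resolved to an absolute target.  The function
name, the microsecond unit and the reading of "N%" as "base plus N
percent of base" are assumptions for illustration; this is not the
kernel's parser.

```python
import re


def parse_latency_target(spec, base_us):
    """Return an absolute latency target in microseconds.

    Assumed forms (hypothetical, for illustration):
      "N"   -> absolute target of N us
      "+N"  -> base latency plus N us
      "N%"  -> base latency plus N percent of base
    """
    m = re.fullmatch(r"(\+?)(\d+)(%?)", spec.strip())
    # Reject garbage and the ambiguous "+N%" combination.
    if m is None or (m.group(1) and m.group(3)):
        raise ValueError(f"bad latency spec: {spec!r}")
    plus, digits, percent = m.groups()
    n = int(digits)
    if plus:
        return base_us + n
    if percent:
        return base_us + base_us * n // 100
    return n


# With an assumed 4000us base latency:
for spec in ("600", "+600", "50%"):
    print(spec, parse_latency_target(spec, base_us=4000))
```

The point of the explicit prefix and suffix is that a bare "N" never
silently depends on the measured base latency.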

As for why it isn't working well for disks, I think some of it is
coming from buffering behaviors and not handling merges properly.

Write latencies aren't evenly spread across commands.  Most of them
are really fast and then one of them, or the flush, takes a really
long time.
Filtering based on LATENCY_FILTERED_HD simply ignores those fast
completions which means that the eventual latency spike is attributed
arbitrarily, which doesn't really work.
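
A toy example of the effect, in Python.  The cutoff and the sample
latencies are invented for illustration; FILTER_US merely stands in
for LATENCY_FILTERED_HD and is not claimed to be its real value.

```python
FILTER_US = 1000  # assumed cutoff standing in for LATENCY_FILTERED_HD

# Hypothetical write completions in microseconds: buffered writes finish
# fast, then a single command (or the flush) pays for all of them.
latencies_us = [80, 90, 85, 95, 40_000]

# Drop the fast completions, as the filter would.
kept = [lat for lat in latencies_us if lat >= FILTER_US]

avg_all = sum(latencies_us) / len(latencies_us)
avg_kept = sum(kept) / len(kept)
print(avg_all, avg_kept)
```

After filtering, the only sample left is the spike, so the cost of
the whole batch is charged to whichever command happened to absorb it.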

The other part is that blk-throtl was seeing wildly different IO
numbers than the underlying device does when there are a lot of
merges, which still happens.  This means that iops limits were just
badly broken.
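
The mismatch can be sketched in Python as below.  The merge function
is a simplified stand-in that only joins back-to-back sector ranges;
it is not the block layer's actual merging logic, and the bio list is
made up.

```python
def merge_contiguous(bios):
    """Merge bios that are back-to-back on disk into single requests.

    Each bio is (start_sector, nr_sectors).  Simplified stand-in for
    block-layer merging, for illustration only.
    """
    requests = []
    for start, length in sorted(bios):
        if requests and requests[-1][0] + requests[-1][1] == start:
            prev_start, prev_len = requests[-1]
            requests[-1] = (prev_start, prev_len + length)
        else:
            requests.append((start, length))
    return requests


# Eight sequential 4k bios (8 sectors each) plus one stray bio.
bios = [(i * 8, 8) for i in range(8)] + [(1000, 8)]
requests = merge_contiguous(bios)

# IO count seen before merging vs. the count the device would see.
print(len(bios), len(requests))
```

Here nine bios turn into two requests, so an iops limit counted in
bios is off by a large factor from what the device actually handles.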

Thanks.

-- 
tejun