Message-ID: <20161004182811.GA76949@anikkar-mbp.local.dhcp.thefacebook.com>
Date:   Tue, 4 Oct 2016 11:28:12 -0700
From:   Shaohua Li <shli@...com>
To:     Paolo Valente <paolo.valente@...more.it>
CC:     Tejun Heo <tj@...nel.org>, Vivek Goyal <vgoyal@...hat.com>,
        <linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        Jens Axboe <axboe@...com>, <Kernel-team@...com>,
        <jmoyer@...hat.com>, Mark Brown <broonie@...nel.org>,
        Linus Walleij <linus.walleij@...aro.org>,
        Ulf Hansson <ulf.hansson@...aro.org>
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit

On Tue, Oct 04, 2016 at 07:43:48PM +0200, Paolo Valente wrote:
> 
> > On 4 Oct 2016, at 19:28, Shaohua Li <shli@...com> wrote:
> > 
> > On Tue, Oct 04, 2016 at 07:01:39PM +0200, Paolo Valente wrote:
> >> 
> >>> On 4 Oct 2016, at 18:27, Tejun Heo <tj@...nel.org> wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> On Tue, Oct 04, 2016 at 06:22:28PM +0200, Paolo Valente wrote:
> >>>> Could you please elaborate more on this point?  BFQ uses sectors
> >>>> served to measure service, and, on all the fast devices on which
> >>>> we have tested it, it accurately distributes
> >>>> bandwidth as desired, redistributes excess bandwidth without any issue,
> >>>> and guarantees high responsiveness and low latency at application and
> >>>> system level (e.g., ~0 drop rate in video playback, with any background
> >>>> workload tested).
> >>> 
> >>> The same argument as before.  Bandwidth is a very bad measure of IO
> >>> resources spent.  For specific use cases (like desktop or whatever),
> >>> this can work but not generally.
> >>> 
> >> 
> >> Actually, we have already discussed this point, and IMHO the arguments
> >> that (apparently) convinced you that bandwidth is the most relevant
> >> service guarantee for I/O in desktops and the like, prove that
> >> bandwidth is the most important service guarantee in servers too.
> >> 
> >> Again, all the examples I can think of seem to confirm it:
> >> . file hosting: a good service must guarantee reasonable read/write,
> >> i.e., download/upload, speeds to users
> >> . file streaming: a good service must guarantee low drop rates, and
> >> this can be guaranteed only by guaranteeing bandwidth and latency
> >> . web hosting: high bandwidth and low latency needed here too
> >> . clouds: high bw and low latency needed to let, e.g., users of VMs
> >> enjoy high responsiveness and, for example, reasonable file-copy
> >> time
> >> ...
> >> 
> >> To put in yet another way, with packet I/O in, e.g., clouds, there are
> >> basically the same issues, and the main goal is again guaranteeing
> >> bandwidth and low latency among nodes.
> >> 
> >> Could you please provide a concrete server example (assuming we still
> >> agree about desktops), where I/O bandwidth does not matter while time
> >> does?
> > 
> > I'm not saying IO bandwidth doesn't matter. The problem is that bandwidth
> > can't measure IO cost. For example, you can't say an 8k IO costs 2x the IO
> > resources of a 4k IO.
> > 
> 
> For what goal do you need to be able to say this, once you succeeded
> in guaranteeing bandwidth and low latency to each
> process/client/group/node/user?

I think we are discussing whether bandwidth should be used to measure IO for
proportional IO scheduling. Since bandwidth can't measure the cost and you are
using it to do arbitration, you will either have low latency but unfair
bandwidth, or fair bandwidth but unexpectedly high latency for some workloads.
But it might be ok depending on the latency target (for example, you can set
the latency target high, so low latency is guaranteed*) and workload
characteristics. I think bandwidth-based proportional scheduling will only
work for workloads where the disk isn't fully utilized.
 
> >>>> Could you please suggest some test to show how sector-based
> >>>> guarantees fail?
> >>> 
> >>> Well, mix 4k random and sequential workloads and try to distribute the
> >>> actual IO resources.
> >>> 
> >> 
> >> 
> >> If I'm not mistaken, we have already gone through this example too,
> >> and I thought we agreed on what service scheme worked best, again
> >> focusing only on desktops.  To make a long story short(er), here is a
> >> snippet from one of our last exchanges.
> >> 
> >> ----------
> >> 
> >> On Sat, Apr 16, 2016 at 12:08:44AM +0200, Paolo Valente wrote:
> >>> Maybe the source of confusion is the fact that a simple sector-based,
> >>> proportional share scheduler always distributes total bandwidth
> >>> according to weights. The catch is the additional BFQ rule: random
> >>> workloads get only time isolation, and are charged for full budgets,
> >>> so as to not affect the schedule of quasi-sequential workloads. So,
> >>> the correct claim for BFQ is that it distributes total bandwidth
> >>> according to weights (only) when all competing workloads are
> >>> quasi-sequential. If some workloads are random, then these workloads
> >>> are just time scheduled. This does break proportional-share bandwidth
> >>> distribution with mixed workloads, but, much more importantly, saves
> >>> both total throughput and individual bandwidths of quasi-sequential
> >>> workloads.
> >>> 
> >>> We could then check whether I did succeed in tuning timeouts and
> >>> budgets so as to achieve the best tradeoffs. But this is probably a
> >>> second-order problem as of now.
> > 
> > I don't see why random/sequential matters for SSD. What really matters is
> > request size and IO depth. Time scheduling is questionable too, as workloads
> > can dispatch all IO within almost 0 time on high queue depth disks.
> > 
> 
> That's an orthogonal issue.  If what matters is, e.g., size, then it is
> enough to replace "sequential I/O" with "large-request I/O".  In case
> I have been too vague, here is an example: I mean that, e.g., in an I/O
> scheduler you replace the function that computes whether a queue is
> seeky based on request distance, with a function based on
> request size.  And this is exactly what has been already done, for
> example, in CFQ:
> 
> 	if (blk_queue_nonrot(cfqd->queue))
> 		cfqq->seek_history |= (n_sec < CFQQ_SECT_THR_NONROT);
> 	else
> 		cfqq->seek_history |= (sdist > CFQQ_SEEK_THR);

CFQ is known to be unfair on SSDs, especially high queue depth SSDs, so the
fact that CFQ does this doesn't show it is correct. And using request size for
idle detection (so as to let the cfqq backlog the disk) isn't very good. An
iodepth 1 4k workload could be idle, but an iodepth 128 4k workload likely
isn't idle (and the workload can dispatch 128 requests in almost 0 time on a
high queue depth disk).

Thanks,
Shaohua
