Message-ID: <20160122175653.GA2023129@devbig084.prn1.facebook.com>
Date:	Fri, 22 Jan 2016 09:57:10 -0800
From:	Shaohua Li <shli@...com>
To:	Tejun Heo <tj@...nel.org>
CC:	<linux-kernel@...r.kernel.org>, <axboe@...nel.dk>,
	<vgoyal@...hat.com>, <jmoyer@...hat.com>, <Kernel-team@...com>
Subject: Re: [RFC 0/3] block: proportional based blk-throttling

On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote:
> Hello, Shaohua.
> 
> On Thu, Jan 21, 2016 at 04:00:16PM -0800, Shaohua Li wrote:
> > > The thing is that most of the possible contentions can be removed by
> > > implementing per-cpu cache which shouldn't be too difficult.  10%
> > > extra cost on current gen hardware is already pretty high.
> > 
> > I did think about this. A per-cpu cache does sound straightforward, but
> > it could severely impact fairness. For example, we give each cpu a
> > budget, say 1MB. As long as a cgroup hasn't used up the 1MB budget, we
> > don't take the lock. But if we have 128 CPUs, the cgroup can use
> > 128 * 1MB of extra budget, which breaks fairness very much. I have no
> > idea how this can be fixed.
> 
> Let's say per-cgroup buffer budget B is calculated as, say, 100ms
> worth of IO cost (or bandwidth or iops) available to the cgroup.  In
> practice, this may have to be adjusted down depending on the number of
> cgroups performing active IOs.  For a given cgroup, B can be
> distributed among the CPUs that are actively issuing IOs in that
> cgroup.  It will degenerate to round-robin of small budgets if there
> are too many active CPUs for the budget available, but in most cases
> this will cut down most of the cross-CPU traffic.

The cgroup could be a single thread. It can use B - 1 of cpu0's per-cpu
budget, move to cpu1 and use another B - 1, and so on, so it still ends
up far over its overall budget.
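
For reference, a minimal userspace sketch of the per-cpu budget cache being
discussed (all names such as tg_try_charge() are hypothetical, not actual
blk-throttle code): each CPU takes a slice of the cgroup-wide budget under
the shared lock only when its local slice runs out, so the lock is hit once
per slice instead of once per IO. The single-thread-migration case above is
exactly what this sketch does not handle.

/* Hypothetical sketch, not actual blk-throttle code. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS        128
#define PER_CPU_SLICE  (1u << 20)   /* 1MB slice handed to each CPU */

struct tg_budget {
	pthread_spinlock_t lock;        /* shared, taken only on refill */
	uint64_t remaining;             /* cgroup-wide budget for this window */
	uint64_t pcpu_left[NR_CPUS];    /* per-CPU cached slice */
};

/* Charge @bytes on @cpu; fall back to the shared budget only when the
 * local slice runs dry.  Returns false if the cgroup is out of budget
 * and the IO must be throttled until the next refill. */
static bool tg_try_charge(struct tg_budget *tg, int cpu, uint64_t bytes)
{
	if (tg->pcpu_left[cpu] >= bytes) {
		tg->pcpu_left[cpu] -= bytes;    /* lockless fast path */
		return true;
	}

	pthread_spin_lock(&tg->lock);
	uint64_t want = bytes + PER_CPU_SLICE;
	uint64_t grant = want < tg->remaining ? want : tg->remaining;
	if (grant < bytes) {
		pthread_spin_unlock(&tg->lock);
		return false;                   /* over budget: throttle */
	}
	tg->remaining -= grant;
	pthread_spin_unlock(&tg->lock);

	tg->pcpu_left[cpu] = grant - bytes;     /* cache the rest locally */
	return true;
}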
 
> > > They're way more predictable than rotational devices when measured
> > > over a period.  I don't think we'll be able to measure anything
> > > meaningful at individual command level but aggregate numbers should be
> > > fairly stable.  A simple approximation of IO cost such as fixed cost
> > > per IO + cost proportional to IO size would do a far better job than
> > > just depending on bandwidth or iops and that requires approximating
> > > two variables over time.  I'm not sure how easy / feasible that
> > > actually would be tho.
> > 
> > It still sounds like IO time, otherwise I can't imagine how we can
> > measure the cost. If we use some sort of aggregate number, it's like a
> > variation of bandwidth, e.g. cost = bandwidth/ios.
> 
> I think the cost of an IO can be approximated by a fixed per-IO cost +
> cost proportional to the size, so
> 
>  cost = F + R * size

F could be the iops part, and the real cost then becomes R. How do you get
R? We can't simply use R(4k) = 1, R(8k) = 2, and so on. I tried this idea
several years ago:
https://lwn.net/Articles/474164/
The idea there is the same, but in reality we can't get R. I don't want
random math that works for one SSD but not for another.
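
For illustration only, with made-up numbers: if 4KB IOs average 60us and
64KB IOs average 180us over some window, a linear fit gives
R = (180 - 60) / (64 - 4) = 2 us per KB and F = 60 - 2 * 4 = 52 us per IO.
Getting numbers that stable out of a real SSD across workloads is exactly
the hard part.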

One possible solution is to benchmark the device at startup and derive the
cost as a function of IO size. That would only work for reads, and how to
choose the benchmark is another challenge.
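
A rough userspace sketch of what such a startup probe could look like (the
device path, IO sizes and the simple least-squares fit are illustrative
assumptions, not a worked-out proposal): issue O_DIRECT reads at a few
sizes, time them, and fit avg latency = F + R * size.

/* Illustrative only: probe read cost at a few IO sizes and fit
 * latency ~= F + R * size with a simple least-squares fit. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double read_avg_us(int fd, size_t size, int iters)
{
	void *buf;
	if (posix_memalign(&buf, 4096, size))
		return -1;

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++) {
		off_t off = (random() % 1024) * (off_t)size;  /* crude random offsets */
		if (pread(fd, buf, size, off) < 0)
			break;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	free(buf);

	double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e3;
	return us / iters;
}

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n1";  /* assumed test device */
	int fd = open(dev, O_RDONLY | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	size_t sizes[] = { 4096, 16384, 65536, 262144 };
	int n = sizeof(sizes) / sizeof(sizes[0]);
	double sx = 0, sy = 0, sxx = 0, sxy = 0;

	for (int i = 0; i < n; i++) {
		double x = sizes[i] / 1024.0;               /* KB */
		double y = read_avg_us(fd, sizes[i], 256);  /* avg latency in us */
		printf("%zu bytes: %.1f us\n", sizes[i], y);
		sx += x; sy += y; sxx += x * x; sxy += x * y;
	}

	double R = (n * sxy - sx * sy) / (n * sxx - sx * sx);  /* us per KB */
	double F = (sy - R * sx) / n;                          /* fixed us per IO */
	printf("F = %.1f us, R = %.3f us/KB\n", F, R);

	close(fd);
	return 0;
}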

Thanks,
Shaohua
