Message-ID: <Y7xKfl7gGt+wb/I2@slm.duckdns.org>
Date:   Mon, 9 Jan 2023 07:10:22 -1000
From:   Tejun Heo <tj@...nel.org>
To:     Jan Kara <jack@...e.cz>
Cc:     Michal Koutný <mkoutny@...e.com>,
        Jinke Han <hanjinke.666@...edance.com>, josef@...icpanda.com,
        axboe@...nel.dk, cgroups@...r.kernel.org,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        yinxin.x@...edance.com
Subject: Re: [PATCH v3] blk-throtl: Introduce sync and async queues for
 blk-throtl

Hello, Jan.

On Mon, Jan 09, 2023 at 11:59:16AM +0100, Jan Kara wrote:
> Yeah, I agree there's no way back :). But actually I think a lot of the
> functionality of IO schedulers is not needed (by you ;)) only because the
> HW got performant enough and so some issues became less visible. And that
> is all fine but if you end up in a configuration where your cgroup's IO
> limits and IO demands are similar to how the old rotational disks were
> underprovisioned for the amount of IO needed to be done by the system
> (i.e., you can easily generate an amount of IO that then takes minutes or tens
> of minutes for your IO subsystem to crunch through), you hit all the same
> problems IO schedulers were trying to solve again. And maybe these days we
> incline more towards the answer "buy more appropriate HW / buy higher
> limits from your infrastructure provider" but it is not like the original
> issues in such configurations disappeared.

Yeah, but I think there's a better way out, as there's still a difference
between the two situations. With hard disks, you're actually out of
bandwidth. With SSDs, we know there's capacity we can borrow to get out of
the tough spot. e.g. with iocost, you can constrain a cgroup to the point
where its throughput drops to a level similar to hard disks; however, that
still doesn't (or at least shouldn't) cause noticeable priority inversions
outside of that cgroup, because issue_as_root promotes IOs which other
cgroups may end up waiting on to root, charging their cost to the
originating cgroup as debt and slowing it down further afterwards.
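
To make the issue_as_root + debt idea concrete, here's a rough Python sketch
of the concept (the class, the method names and the numbers are made up for
illustration only; this is not the kernel code):

  # Illustrative sketch only, not kernel code. An IO that other cgroups may
  # end up waiting on is dispatched immediately as if it came from root,
  # and its cost is booked as debt against the issuing cgroup. The debt is
  # paid back out of future budget, which is what slows that cgroup down
  # afterwards.
  class Cgroup:
      def __init__(self, name, budget_per_period):
          self.name = name
          self.budget = budget_per_period   # cost budget for this period
          self.debt = 0.0                   # cost carried from promoted IOs

      def grant_budget(self, period_budget):
          # Debt gets paid back first; only the remainder is usable.
          payback = min(self.debt, period_budget)
          self.debt -= payback
          self.budget = period_budget - payback

      def issue(self, io_cost, may_block_others):
          if may_block_others:
              # "issue_as_root": dispatch now, remember the cost as debt.
              self.debt += io_cost
              return "dispatched as root"
          if io_cost <= self.budget:
              self.budget -= io_cost
              return "dispatched"
          return "throttled"

  cg = Cgroup("heavily-limited", budget_per_period=10)
  print(cg.issue(50, may_block_others=True))   # goes out immediately
  cg.grant_budget(10)
  print(cg.issue(5, may_block_others=False))   # throttled: still paying debt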

There's a lot to be improved - e.g. debt accounting and payback, and their
propagation to originator throttling, aren't very accurate, usually leading
to over-throttling and, in some cases, under-utilization. The coupling
between IO control and dirty throttling is there and kind of works, but it
seems pretty easy to make it misbehave under heavy control, and so on. But
even with all those shortcomings, iocost is at least feature complete and
already works (not perfectly, but still) in most cases - it can actually
distribute IO bandwidth across cgroups with arbitrary weights without
causing noticeable priority inversions between them.
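
For reference, the weight-based setup is just the cgroup2 io controller
files. A minimal sketch in Python, assuming cgroup2 is mounted at
/sys/fs/cgroup, the io controller is enabled in cgroup.subtree_control, and
the disk is 8:0 (the paths, device number and cgroup names are placeholders
for whatever the system actually has):

  # Minimal sketch; the paths, the 8:0 device number and the cgroup names
  # are assumptions for illustration.
  CGROOT = "/sys/fs/cgroup"

  def cg_write(path, val):
      with open(path, "w") as f:
          f.write(val)

  # Enable iocost for the device (written in the root cgroup).
  cg_write(f"{CGROOT}/io.cost.qos", "8:0 enable=1")

  # Hand out arbitrary relative weights; iocost keeps the split
  # work-conserving instead of imposing hard ceilings.
  cg_write(f"{CGROOT}/workload-a/io.weight", "default 800")
  cg_write(f"{CGROOT}/workload-b/io.weight", "default 100")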

blk-throttle unfortunately doesn't have issue_as_root and the issuer delay
mechanism hooked up, and we found that it's nearly impossible to configure
properly in any scalable manner. Raw bandwidth and iops limits just can't
capture variances in application behavior well enough; often, the valid
parameter space becomes empty when trying to cover varied behaviors. Given
that the problem is pretty fundamental to the control scheme, I largely gave
up on it, with the long-term goal of implementing io.max on top of iocost
down the line.
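
As a toy illustration of the parameter-space problem (the numbers are made
up, but in a realistic ballpark): a single bps+iops pair gives wildly
different effective ceilings depending on IO size, so a limit that means
something for one behavior is either meaningless or crippling for another.

  # Toy illustration: one (wbps, wiops) pair, two IO patterns.
  limits = {"wbps": 100 << 20, "wiops": 2000}   # 100 MiB/s, 2000 IOPS

  for name, io_size in [("4k random", 4096), ("1M sequential", 1 << 20)]:
      bps_ceiling = limits["wbps"]
      iops_ceiling = limits["wiops"] * io_size
      effective = min(bps_ceiling, iops_ceiling)
      print(f"{name}: ~{effective / (1 << 20):.0f} MiB/s effective")

  # -> ~8 MiB/s for 4k random (iops-bound) vs ~100 MiB/s for 1M sequential
  #    (bps-bound); no single pair is "right" for both behaviors.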

> > Another layering problem w/ controlling from elevators is that that's after
> > request allocation and the issuer has already moved on. We used to have
> > per-cgroup rq pools but ripped that out, so it's pretty easy to cause severe
> > priority inversions by depleting the shared request pool, and the fact that
> > throttling takes place after the issuing task returned from issue path makes
> > propagating the throttling operation upwards more challenging too.
> 
> Well, we do have .limit_depth IO scheduler callback these days so BFQ uses
> that to solve the problem of exhaustion of the shared request pool, but I agree
> it's a bit of a hack on the side.

Ah, I didn't know about that. Yeah, that'd help the situation to some degree.

> > My bet is that inversion issues are a lot more severe with blk-throttle
> > because it's not work-conserving and not doing things like issue-as-root or
> > other measures to alleviate issues which can arise from inversions.
> 
> Yes, I agree these features of blk-throttle make the problems much more
> likely to happen in practice.

As I wrote above, I largely gave up on blk-throttle, and things like
tweaking sync write priority don't address most of its problems (e.g. it's
still gonna be super easy to stall the whole system with a heavily throttled
cgroup). However, it can still be useful for some use cases, and if it can
be tweaked to become a bit better, I don't see a reason not to do that.

Thanks.

-- 
tejun
