Message-ID: <Y2Bf+CeQ8x2jKQ3S@slm.duckdns.org>
Date:   Mon, 31 Oct 2022 13:53:28 -1000
From:   Tejun Heo <tj@...nel.org>
To:     Josh Don <joshdon@...gle.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        linux-kernel@...r.kernel.org,
        Joel Fernandes <joel@...lfernandes.org>
Subject: Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

Hello,

On Mon, Oct 31, 2022 at 04:15:54PM -0700, Josh Don wrote:
> On Mon, Oct 31, 2022 at 2:50 PM Tejun Heo <tj@...nel.org> wrote:
> Yes, but schemes such as shares and idle can still end up creating
> some severe inversions. For example, a SCHED_IDLE thread on a cpu with
> many other threads. Eventually the SCHED_IDLE thread will get run, but
> the round robin times can easily get pushed out to several hundred ms
> (or even into the seconds range), due to min granularity. cpusets
> combined with the load balancer's struggle to find low weight tasks
> exacerbate such situations.

Yeah, especially with narrow cpuset (or task cpu affinity) configurations,
it can get pretty bad. Outside that tho, at least I haven't seen a lot of
problematic cases as long as the low priority one isn't tightly entangled
with high priority tasks, mostly because (1) if the resource the low pri
one is holding affects a large part of the system, the problem is
self-solving as the system quickly runs out of other things to do, and (2)
if the resource isn't affecting a large part of the system, the blast
radius is usually reasonably confined to things tightly coupled with it.
I'm sure there are exceptions and we definitely wanna improve the situation
where it makes sense.
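
For concreteness, the scenario Josh describes needs nothing exotic to set
up; here's a minimal userspace sketch (not from this thread) that puts the
calling thread into SCHED_IDLE via the standard sched_setscheduler()
interface:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Switch the calling thread to SCHED_IDLE. Such a thread is runnable as
 * usual but carries almost no weight, so on a busy cpu its round robin
 * latency can stretch into the hundreds of ms, as described above. */
int make_self_sched_idle(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	if (sched_setscheduler(0, SCHED_IDLE, &sp)) {
		perror("sched_setscheduler");
		return -1;
	}
	return 0;
}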

> > > chatted with the folks working on the proxy execution patch series,
> > > and it seems like that could be a better generic solution to these
> > > types of issues.
> >
> > Care to elaborate?
> 
> https://lwn.net/Articles/793502/ gives some historical context, see
> also https://lwn.net/Articles/910302/.

Ah, full blown priority inheritance. That's great to pursue, but I think we
wanna fix cpu bw control regardless. It's such an obvious and basic issue,
and given how much trouble we have actually understanding resource and
control dependencies with all the custom synchronization constructs in the
kernel, fixing it will be useful even in a future where we have a better
priority inheritance mechanism.

> > I don't follow. If you only throttle at predefined safe spots, the easiest
> > place being the kernel-user boundary, you cannot get system-wide stalls from
> > BW restrictions, which is something the kernel shouldn't allow userspace to
> > cause. In your example, a thread holding a kernel mutex waking back up into
> > a hierarchy that is currently throttled should keep running in the kernel
> > until it encounters such safe throttling point where it would have released
> > the kernel mutex and then throttle.
> 
> Agree except that for the task waking back up, it isn't on cpu, so
> there is no "wait to throttle it until it returns to user", since
> throttling happens in the context of the entire cfs_rq. We'd have to

Oh yeah, we'd have to be able to allow threads to keep running in the
kernel regardless of the cfs_rq's throttled state and then force-charge the
cpu cycles to be paid back later. It would definitely require quite a bit
of work.
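
Something like the following shape, to be concrete. This is a sketch only:
forced_debt_ns, cfs_task_throttle_pending() and throttle_cfs_task() are all
made up, none of this exists in the kernel today.

/* Sketch: a task whose hierarchy ran out of quota keeps running while in
 * the kernel, the overrun is recorded as debt, and the debt is repaid out
 * of the next period's refill. Actual throttling only happens at the
 * kernel-user boundary, where no kernel mutex can still be held. */
static void charge_forced_runtime(struct cfs_rq *cfs_rq, u64 delta_ns)
{
	cfs_rq->forced_debt_ns += delta_ns;	/* hypothetical field */
}

static void refill_runtime(struct cfs_rq *cfs_rq, u64 quota_ns)
{
	u64 repay = min(cfs_rq->forced_debt_ns, quota_ns);

	cfs_rq->forced_debt_ns -= repay;
	cfs_rq->runtime_remaining = quota_ns - repay;	/* shorter period */
}

/* called on return to user, the "safe throttling point" above */
static void maybe_throttle_on_user_return(struct task_struct *p)
{
	if (cfs_task_throttle_pending(p))	/* hypothetical */
		throttle_cfs_task(p);		/* hypothetical */
}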

> treat threads in a bandwidth hierarchy that are also in kernel mode
> specially. Mechanically, it is more straightforward to implement the
> mechanism to wait to throttle until the cfs_rq has no more threads in
> kernel mode, than it is to exclude a woken task from the currently
> throttled period of its cfs_rq, though this is incomplete.
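
A sketch of that cfs_rq-wide variant, for comparison. throttle_cfs_rq()
and runtime_remaining do exist in kernel/sched/fair.c; the nr_in_kernel
counter is hypothetical.

static bool cfs_rq_safe_to_throttle(struct cfs_rq *cfs_rq)
{
	/* hypothetical: count of this hierarchy's tasks in kernel mode */
	return cfs_rq->nr_in_kernel == 0;
}

static void maybe_throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
	if (cfs_rq->runtime_remaining <= 0 && cfs_rq_safe_to_throttle(cfs_rq))
		throttle_cfs_rq(cfs_rq);
}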

My hunch is that bunching them together is likely gonna create too many
escape scenarios and control artifacts, and it'd be better to always push
throttling decisions to the leaves (tasks) so that each task can be
controlled separately. That'd involve architectural changes, but the
eventual behavior would be a lot better.
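
To sketch what pushing it to the leaves could look like: each task checks
at its own safe point instead of the whole cfs_rq being dequeued at once.
task_cfs_rq() and cfs_rq_throttled() are real helpers in
kernel/sched/fair.c; dequeue_and_park_task() is made up.

static bool task_wants_throttle(struct task_struct *p)
{
	return cfs_rq_throttled(task_cfs_rq(p));
}

/* called per task at its own kernel-user boundary */
static void task_throttle_if_needed(struct task_struct *p)
{
	if (task_wants_throttle(p))
		dequeue_and_park_task(p);	/* hypothetical */
}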

> What you're suggesting would also require that we find a way to
> preempt the current thread to start running the thread that woke up in
> kernel (and this becomes more complex when the current thread is also
> in kernel, or if there are n other waiting threads that are also in
> kernel).

I don't think it needs that. What allows userspace to easily trigger
pathological scenarios is the ability to force the machine idle when
there's something that is ready to run in the kernel. If you take that
away, most of the problems disappear. It's not perfect, but it's reasonable
enough and no worse than a system without cpu bw control.

Thanks.

-- 
tejun
