Message-ID: <20211102160228.GA57072@blackbody.suse.cz>
Date: Tue, 2 Nov 2021 17:02:28 +0100
From: Michal Koutný <mkoutny@...e.com>
To: Vincent Guittot <vincent.guittot@...aro.org>,
Odin Ugedal <odin@...d.al>
Cc: linux-kernel <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>
Subject: task_group unthrottling and removal race (was Re: [PATCH]
 sched/fair: Use rq->lock when checking cfs_rq list presence)

Hello.
(Getting back to this after some more analysis.)
On Wed, Oct 13, 2021 at 04:26:43PM +0200, Michal Koutný <mkoutny@...e.com> wrote:
> On Wed, Oct 13, 2021 at 09:57:17AM +0200, Vincent Guittot <vincent.guittot@...aro.org> wrote:
> > This seems to closes the race window in your case but this could still
> > happen AFAICT.
>
> You seem to be right.
> Hopefully, I'll be able to collect more data evaluating this.
I've observed that the window between unregister_fair_sched_group() and
free_fair_sched_group() is commonly around 15 ms (based on kprobe
tracing).
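
(For reference, the probing boils down to something like the module
below. This is an illustrative sketch rather than the exact tooling I
used, and it assumes an arch with HAVE_REGS_AND_STACK_ACCESS_API so the
struct task_group pointer can be read as the first argument of both
functions:)

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ptrace.h>
#include <linux/ktime.h>

static int tg_stamp(struct kprobe *p, struct pt_regs *regs)
{
	/* Both probed functions take the struct task_group * first. */
	trace_printk("%s tg=0x%lx t=%llu\n", p->symbol_name,
		     regs_get_kernel_argument(regs, 0), ktime_get_ns());
	return 0;
}

static struct kprobe kps[] = {
	{ .symbol_name = "unregister_fair_sched_group", .pre_handler = tg_stamp },
	{ .symbol_name = "free_fair_sched_group",	.pre_handler = tg_stamp },
};

static int __init tg_window_init(void)
{
	int i, ret;

	for (i = 0; i < ARRAY_SIZE(kps); i++) {
		ret = register_kprobe(&kps[i]);
		if (ret) {
			while (--i >= 0)
				unregister_kprobe(&kps[i]);
			return ret;
		}
	}
	return 0;
}

static void __exit tg_window_exit(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(kps); i++)
		unregister_kprobe(&kps[i]);
}

module_init(tg_window_init);
module_exit(tg_window_exit);
MODULE_LICENSE("GPL");

Matching the two tg=... stamps in /sys/kernel/debug/tracing/trace then
gives the per-task_group gap.
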
I have a reproducer (attached) that can hit this window quite easily
after some tuning. I can still observe its consequences even on a recent
5.15 kernel. (And I also have reports of real-world workloads failing
due to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
unthrottle").)
My original patch was really an uninformed attempt, given the length of
the window.
[snip]
On Wed, Oct 13, 2021 at 07:45:59PM +0100, Odin Ugedal <odin@...d.al> wrote:
> Ref. your comment about reverting a7b359fc6a37
> ("sched/fair: Correctly insert cfs_rq's to list on unthrottle"), I
> think that is fine as long as we revert the commit it fixes as well,
> to avoid a regression of that (but yeah, that regression itself is
> less bad than your discovery).
I say no to reverting 31bc6aeaab1d ("sched/fair: Optimize
update_blocked_averages()"): it fixes reported performance issues, and
it's way too old to revert by now :-).
> set cfs_rq->on_list=2 inside that lock under your code change? If we
> then treat on_list=2
> as "not on list, and do not add"?
The possibilities I see for the current problem:
1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and its fixups.
(Not exclusive with the other suggestions, rather a stop-gap for the
time being.)
2) Don't add offlined task_groups into the undecayed list
- Your proposal with overloaded on_list=2 could serve as mark of that,
but it's a hack IMO.
- Proper way (tm) would be to use css_tryget_online() and css_put() when
  dealing with the list (my favorite at the moment; see the sketch below
  this list).
3) Narrowing the race-window dramatically
- that is by moving list removal from unregister_fair_sched_group() to
free_fair_sched_group(),
- <del>or use list_empty(tg->list) as indicator whether we're working
with onlined task_group.</del> (won't work for RCU list)
4) Rework how throttled load is handled (hand waving)
- There is remove_entity_load_avg() that moves the load to parent upon
final removal. Maybe it could be generalized for temporary removals by
throttling (so that unthrottling could again add only non-empty
cfs_rqs to the list and undecayed load won't skew fairness).
- or the way of [1].
5) <your ideas>
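
To illustrate 2), an untested sketch (the leaf_list_add_pinned() /
leaf_list_del_pinned() names are invented here; the real change would
rather go into the existing list_{add,del}_leaf_cfs_rq() callers):

static bool leaf_list_add_pinned(struct cfs_rq *cfs_rq)
{
	if (cfs_rq->on_list)
		return true;

	/*
	 * css_tryget_online() fails once the cgroup is being removed,
	 * so unthrottle can never re-add a dying group's cfs_rq.
	 */
	if (!css_tryget_online(&cfs_rq->tg->css))
		return false;

	list_add_leaf_cfs_rq(cfs_rq);
	return true;
}

static void leaf_list_del_pinned(struct cfs_rq *cfs_rq)
{
	if (!cfs_rq->on_list)
		return;

	list_del_leaf_cfs_rq(cfs_rq);
	/*
	 * Drop the reference taken at insertion; the css (and hence
	 * the task_group) stayed pinned while the cfs_rq was on the
	 * list, keeping free_fair_sched_group() at bay.
	 */
	css_put(&cfs_rq->tg->css);
}

The reference held while on the list should also make the window from 3)
irrelevant: free_fair_sched_group() cannot run before the matching
css_put().
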
Opinions?
Thanks,
Michal
[1] https://lore.kernel.org/lkml/CAFpoUr1AO_qStNOYrFWGnFfc=uSFrXSYD8A5cQ8h0t2pioQzDA@mail.gmail.com/