Message-ID: <20211102160228.GA57072@blackbody.suse.cz>
Date:   Tue, 2 Nov 2021 17:02:28 +0100
From:   Michal Koutný <mkoutny@...e.com>
To:     Vincent Guittot <vincent.guittot@...aro.org>,
        Odin Ugedal <odin@...d.al>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>
Subject: task_group unthrottling and removal race (was Re: [PATCH]
 sched/fair: Use rq->lock when checking cfs_rq list presence)

Hello.

(Getting back to this after some more analysis.)

On Wed, Oct 13, 2021 at 04:26:43PM +0200, Michal Koutný <mkoutny@...e.com> wrote:
> On Wed, Oct 13, 2021 at 09:57:17AM +0200, Vincent Guittot <vincent.guittot@...aro.org> wrote:
> > This seems to closes the race window in your case but this could still
> > happen AFAICT.
> 
> You seem to be right.
> Hopefully, I'll be able to collect more data evaluating this.

I've observed that the window between unregister_fair_sched_group() and
free_fair_sched_group() is commonly around 15 ms (based on kprobe
tracing).

I have a reproducer (attached) that can hit this window quite easily
after tuning.  I can observe consequences of it even with a recent 5.15
kernel. (And I also have reports from real world workloads failing due
to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
unthrottle").)

My original patch was really an uninformed attempt given the length of
the window.

[snip]

On Wed, Oct 13, 2021 at 07:45:59PM +0100, Odin Ugedal <odin@...d.al> wrote:
> Ref. your comment about reverting a7b359fc6a37
> ("sched/fair: Correctly insert cfs_rq's to list on unthrottle"), I
> think that is fine as long as we revert the commit it fixes as well,
> to avoid a regression of that (but yeah, that regression itself is
> less bad than your discovery).

I'm against reverting 31bc6aeaab1d ("sched/fair: Optimize
update_blocked_averages()"): it solves reported performance issues and
it's way too old to back out :-).

> set cfs_rq->on_list=2 inside that lock under your code change? If we
> then treat on_list=2
> as "not on list, and do not add"?

The possibilities for the current problem:

1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and its fixups.
(Not exclusive with the other suggestions, rather a stop-gap for the
time being.)

2) Don't add offlined task_groups into the undecayed list
- Your proposal with overloaded on_list=2 could serve as mark of that,
  but it's a hack IMO.
- Proper way (tm) would be to use css_tryget_online() and css_put() when
  dealing with the list (my favorite at the moment).

3) Narrowing the race-window dramatically
- that is by moving list removal from unregister_fair_sched_group() to
  free_fair_sched_group(),
- <del>or use list_empty(tg->list) as indicator whether we're working
  with onlined task_group.</del> (won't work for RCU list)

4) Rework how throttled load is handled (hand waving)
- There is remove_entity_load_avg() that moves the load to parent upon
  final removal. Maybe it could be generalized for temporary removals by
  throttling (so that unthrottling could again add only non-empty
  cfs_rqs to the list and undecayed load won't skew fairness).
- or the way of [1].

5) <your ideas>
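
To make option 2 concrete, here is a minimal single-threaded userspace
sketch of the "don't add offlined task_groups" idea. It is not kernel
code: struct tg_model and every function below are hypothetical
stand-ins, tg_tryget_online() only imitates the contract of
css_tryget_online() (take a reference only while the group is still
online), and no real locking or RCU is modelled. It only shows the
ordering that the guard is meant to enforce.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a task_group; not the kernel struct. */
struct tg_model {
	int  refcnt;   /* models the css reference count */
	bool online;   /* cleared once the group goes offline */
	bool on_list;  /* models the cfs_rq being on the leaf list */
};

/* Imitates css_tryget_online(): succeed only while the group is online. */
static bool tg_tryget_online(struct tg_model *tg)
{
	if (!tg->online || tg->refcnt == 0)
		return false;
	tg->refcnt++;
	return true;
}

/* Imitates css_put(): drop a reference taken by tg_tryget_online(). */
static void tg_put(struct tg_model *tg)
{
	assert(tg->refcnt > 0);
	tg->refcnt--;
}

/* Models the unthrottle path: re-add to the list only for online groups. */
static bool unthrottle_add_to_list(struct tg_model *tg)
{
	if (!tg_tryget_online(tg))
		return false;      /* group is going away, leave it off the list */
	tg->on_list = true;
	tg_put(tg);
	return true;
}

/* Models unregister_fair_sched_group(): go offline, drop from the list. */
static void tg_unregister(struct tg_model *tg)
{
	tg->online = false;
	tg->on_list = false;
}

/* Models the precondition of free_fair_sched_group(): nothing on the list. */
static bool tg_safe_to_free(const struct tg_model *tg)
{
	return !tg->on_list;
}
```

In this model an unthrottle that arrives after tg_unregister() is
refused, so the group is still off the list when it is freed. Whether
the real code paths can take and drop the reference at these points
without further synchronization is exactly what would need review.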

Opinions?

Thanks,
Michal

[1] https://lore.kernel.org/lkml/CAFpoUr1AO_qStNOYrFWGnFfc=uSFrXSYD8A5cQ8h0t2pioQzDA@mail.gmail.com/

Download attachment "run2.sh" of type "application/x-sh" (1600 bytes)