Date:   Mon, 28 Mar 2022 12:05:20 -0700
From:   Benjamin Segall <bsegall@...gle.com>
To:     Chengming Zhou <zhouchengming@...edance.com>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, mgorman@...e.de, bristot@...hat.com,
        linux-kernel@...r.kernel.org, duanxiongchun@...edance.com,
        songmuchun@...edance.com
Subject: Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full

Chengming Zhou <zhouchengming@...edance.com> writes:

> With nohz_full enabled on a cpu, scheduler_tick() will be stopped
> when only one CFS task is left on the rq.
>
> scheduler_tick()
>   task_tick_fair()
>     entity_tick()
>       update_curr()
>         account_cfs_rq_runtime(cfs_rq, delta_exec) --> stopped
>
> So the running task can't account its runtime periodically, but
> the cfs_bandwidth hrtimer still calls __refill_cfs_bandwidth_runtime()
> periodically. Later, in one period, the task would account a very
> big delta_exec, which causes the cfs_rq to be throttled for a
> long time.
>
> There are two solutions to the problem. The first is that we
> can check in sched_can_stop_tick() whether the current task's
> cfs_rq has runtime_enabled, in which case we don't stop the
> tick. But that would make nohz_full almost useless in cloud
> environments where every container has a cpu bandwidth control
> setting.
>
> The other is what this patch implements: the cfs_bandwidth
> hrtimer syncs unaccounted runtime from all running cfs_rqs
> whose tick is stopped, just before __refill_cfs_bandwidth_runtime().
> The same thing is done in tg_set_cfs_bandwidth().

>
> A testcase to reproduce:
> ```
> cd /sys/fs/cgroup
> echo "+cpu" > cgroup.subtree_control
>
> mkdir test
> echo "105000 100000" > test/cpu.max


The other half of this problem, I think, is that it also won't
throttle when it /is/ out of bandwidth. That is, 'echo "50000 100000" >
test/cpu.max' would not stop the task after 50ms of runtime is spent;
it would only stop (with this patch) after 100ms. It would then
correctly need to repay that debt, so the (very) long-run ratio should
be correct, but a misbehaving task could use practically arbitrarily
more than its supposed quota over the ~hundreds-of-milliseconds
timescales cfsb is generally supposed to work on.
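
To put rough numbers on that (assuming a single always-running task
and ignoring slack redistribution): in the first 100ms period the task
runs the full 100ms instead of 50ms, ending the period roughly 50ms in
debt; the next 50ms refill mostly just pays off that debt, so the task
is throttled for essentially the whole second period, and the two
periods together average out to ~50%. The ratio converges, but within
any one period the task can burn up to the full period length rather
than its quota.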

However, I don't really see a way for cfsb's current design to avoid
that on nohz_full (without vetoing sched_can_stop_tick). It might be
possible to try and guess when quota will run out by tracking the number
of running or runnable threads, but that would be a nontrivial
undertaking and involve a bunch of hrtimer reprogramming (particularly
in cases when some cgroups actually have tasks that wake/sleep frequently).
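
Purely to illustrate the shape of that estimate, a sketch is below;
cfs_b_nr_running() and cfs_b_set_expiry() are made-up helpers, and it
ignores locking, the per-cfs_rq local runtime pools, and all of the
reprogramming on every wakeup/sleep that makes it unappealing:

/*
 * Hypothetical sketch only: assume every currently-running thread in
 * the group keeps running flat out, and guess when the remaining
 * global quota would be exhausted.
 */
static void cfs_b_guess_expiry(struct cfs_bandwidth *cfs_b)
{
	u64 nr_running = cfs_b_nr_running(cfs_b);	/* made-up helper */

	if (cfs_b->quota == RUNTIME_INF || !nr_running)
		return;

	/* Quota drains nr_running times faster than wall-clock time. */
	cfs_b_set_expiry(cfs_b, ktime_add_ns(ktime_get(),
			 div64_u64(cfs_b->runtime, nr_running)));
}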


>
> echo $$ > test/cgroup.procs
> taskset -c 1 bash -c "while true; do let i++; done"
> ```
> Ctrl-C and cat test/cpu.stat to see if nr_throttled > 0.
>
> The above testcase uses a period of 100ms and a quota of 105ms;
> nr_throttled > 0 is only seen on a nohz_full system. The problem
> is gone in testing with this patch.
>
> Signed-off-by: Chengming Zhou <zhouchengming@...edance.com>
> ---
>  kernel/sched/core.c  |  4 ++++
>  kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++++
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 37 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d575b4914925..17b5e3d27401 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10443,6 +10443,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>  	 */
>  	if (runtime_enabled && !runtime_was_enabled)
>  		cfs_bandwidth_usage_inc();
> +
> +	if (runtime_was_enabled)
> +		sync_cfs_bandwidth_runtime(cfs_b);
> +
>  	raw_spin_lock_irq(&cfs_b->lock);
>  	cfs_b->period = ns_to_ktime(period);
>  	cfs_b->quota = quota;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ee0664c9d291..ebda70a0e3a8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5339,6 +5339,34 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>  	return HRTIMER_NORESTART;
>  }
>  
> +#ifdef CONFIG_NO_HZ_FULL
> +void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> +{
> +	unsigned int cpu;
> +	struct rq *rq;
> +	struct rq_flags rf;
> +	struct cfs_rq *cfs_rq;
> +	struct task_group *tg;
> +
> +	tg = container_of(cfs_b, struct task_group, cfs_bandwidth);
> +
> +	for_each_online_cpu(cpu) {
> +		if (!tick_nohz_tick_stopped_cpu(cpu))
> +			continue;
> +
> +		rq = cpu_rq(cpu);
> +		cfs_rq = tg->cfs_rq[cpu];
> +
> +		rq_lock_irqsave(rq, &rf);

A for-every-nohz-cpu rq lock is definitely not great when the cgroup
is likely to be running on far fewer cpus. I'm not sure what the
current opinion would be on doing some sort of optimistic locking like

if (!READ_ONCE(cfs_rq->curr)) continue;
rq_lock_irqsave(...);
if (cfs_rq->curr) { ... }

but solutions to the other problem I mentioned earlier might help
here anyway.
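
Folded into the patch's loop, that would look roughly like (same idea
as above, just spelled out):

	for_each_online_cpu(cpu) {
		if (!tick_nohz_tick_stopped_cpu(cpu))
			continue;

		rq = cpu_rq(cpu);
		cfs_rq = tg->cfs_rq[cpu];

		/* Skip idle cfs_rqs without taking the remote rq lock. */
		if (!READ_ONCE(cfs_rq->curr))
			continue;

		rq_lock_irqsave(rq, &rf);
		/* Re-check under the lock; curr may have changed. */
		if (cfs_rq->curr) {
			update_rq_clock(rq);
			update_curr(cfs_rq);
		}
		rq_unlock_irqrestore(rq, &rf);
	}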


> +		if (cfs_rq->curr) {
> +			update_rq_clock(rq);
> +			update_curr(cfs_rq);
> +		}
> +		rq_unlock_irqrestore(rq, &rf);
> +	}
> +}
> +#endif
> +
>  extern const u64 max_cfs_quota_period;
>  
>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> @@ -5350,6 +5378,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  	int idle = 0;
>  	int count = 0;
>  
> +	sync_cfs_bandwidth_runtime(cfs_b);
> +
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  	for (;;) {
>  		overrun = hrtimer_forward_now(timer, cfs_b->period);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 58263f90c559..57f9da9c50c1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2351,9 +2351,12 @@ static inline void sched_update_tick_dependency(struct rq *rq)
>  	else
>  		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>  }
> +
> +extern void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>  #else
>  static inline int sched_tick_offload_init(void) { return 0; }
>  static inline void sched_update_tick_dependency(struct rq *rq) { }
> +static inline void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) {}
>  #endif
>  
>  static inline void add_nr_running(struct rq *rq, unsigned count)
