Message-ID: <64a903d9-0398-2953-8a85-fd0cd238989e@bytedance.com>
Date:   Tue, 29 Mar 2022 11:36:34 +0800
From:   Chengming Zhou <zhouchengming@...edance.com>
To:     Benjamin Segall <bsegall@...gle.com>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, mgorman@...e.de, bristot@...hat.com,
        linux-kernel@...r.kernel.org, duanxiongchun@...edance.com,
        songmuchun@...edance.com
Subject: Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control
 with nohz_full

Hi,

On 2022/3/29 03:05, Benjamin Segall wrote:
> Chengming Zhou <zhouchengming@...edance.com> writes:
> 
>> With nohz_full enabled on a cpu, scheduler_tick() will be stopped
>> when only one CFS task is left on the rq.
>>
>> scheduler_tick()
>>   task_tick_fair()
>>     entity_tick()
>>       update_curr()
>>         account_cfs_rq_runtime(cfs_rq, delta_exec) --> stopped
>>
>> So the running task can't account its runtime periodically, but
>> the cfs_bandwidth hrtimer still calls __refill_cfs_bandwidth_runtime()
>> periodically. Later, in a single period, the task accounts a very
>> big delta_exec, which causes the cfs_rq to be throttled for a
>> long time.
>>
>> There are two solutions to the problem. The first is to check
>> in sched_can_stop_tick() whether the current task's cfs_rq has
>> runtime_enabled, in which case we don't stop the tick. But that
>> would make nohz_full almost useless in cloud environments where
>> every container has a cpu bandwidth control setting.
>>
>> The other is what this patch implements: the cfs_bandwidth hrtimer
>> syncs unaccounted runtime from all running cfs_rqs whose tick is
>> stopped, just before __refill_cfs_bandwidth_runtime(). The same
>> thing is done in tg_set_cfs_bandwidth().
> 
>>
>> A testcase to reproduce:
>> ```
>> cd /sys/fs/cgroup
>> echo "+cpu" > cgroup.subtree_control
>>
>> mkdir test
>> echo "105000 100000" > test/cpu.max
> 
> 
> The other half of this problem I think would be that it also won't
> throttle when it /is/ out of bandwidth. That is, 'echo "50000 100000" >
> test/cpu.max' would not stop after 50ms of runtime is spent, it would
> only stop (with this patch) after 100ms. It would then correctly need to
> repay that debt, so the (very) long-run ratio should be correct, but a
> misbehaving task could use practically arbitrarily more than its
> supposed quota over the ~100s of milliseconds timescales cfsb is
> generally supposed to work on.

Right, that's a limitation of this patch's solution. I can't think of a
better way to make NOHZ_FULL work with bandwidth control. But it should
be no problem for most use cases? If a task uses more than its quota, it
has to repay that debt before it can run again, so it only misbehaves
within that one period.

The main problem this patch solves is that a task can accumulate a very
large delta_exec that gets accounted all in one period, which then causes
the cfs_rq to be throttled for a very long time.
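
For a rough sense of scale (back-of-envelope, not a measurement): with
the 100ms period in the testcase, if the lone task runs for 5s with the
tick stopped before update_curr() finally fires, roughly 5s of
delta_exec is charged against a single period's runtime, and paying
that debt back at about one quota per period can keep the cfs_rq
throttled for a comparable number of seconds.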

> 
> However, I don't really see a way for cfsb's current design to avoid
> that on nohz_full (without vetoing sched_can_stop_tick). It might be
> possible to try and guess when quota will run out by tracking the number
> of running or runnable threads, but that would be a nontrivial
> undertaking and involve a bunch of hrtimer reprogramming (particularly
> in cases when some cgroups actually have tasks that wake/sleep frequently).
> 
> 
>>
>> echo $$ > test/cgroup.procs
>> taskset -c 1 bash -c "while true; do let i++; done"
>> ```
>> Ctrl-C and cat test/cpu.stat to see if nr_throttled > 0.
>>
>> The above testcase uses period 100ms and quota 105ms; nr_throttled > 0
>> is only seen on a nohz_full system. The problem is gone in testing
>> with this patch.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@...edance.com>
>> ---
>>  kernel/sched/core.c  |  4 ++++
>>  kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++++
>>  kernel/sched/sched.h |  3 +++
>>  3 files changed, 37 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index d575b4914925..17b5e3d27401 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -10443,6 +10443,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>>  	 */
>>  	if (runtime_enabled && !runtime_was_enabled)
>>  		cfs_bandwidth_usage_inc();
>> +
>> +	if (runtime_was_enabled)
>> +		sync_cfs_bandwidth_runtime(cfs_b);
>> +
>>  	raw_spin_lock_irq(&cfs_b->lock);
>>  	cfs_b->period = ns_to_ktime(period);
>>  	cfs_b->quota = quota;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ee0664c9d291..ebda70a0e3a8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5339,6 +5339,34 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>>  	return HRTIMER_NORESTART;
>>  }
>>  
>> +#ifdef CONFIG_NO_HZ_FULL
>> +void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>> +{
>> +	unsigned int cpu;
>> +	struct rq *rq;
>> +	struct rq_flags rf;
>> +	struct cfs_rq *cfs_rq;
>> +	struct task_group *tg;
>> +
>> +	tg = container_of(cfs_b, struct task_group, cfs_bandwidth);
>> +
>> +	for_each_online_cpu(cpu) {
>> +		if (!tick_nohz_tick_stopped_cpu(cpu))
>> +			continue;
>> +
>> +		rq = cpu_rq(cpu);
>> +		cfs_rq = tg->cfs_rq[cpu];
>> +
>> +		rq_lock_irqsave(rq, &rf);
> 
> A for-every-nohz-cpu rqlock is definitely not great when the cgroup is
> likely to be running on much fewer cpus. I'm not sure what the current
> opinion would be on doing some sort of optimistic locking like
> 
> if (!READ_ONCE(cfs_rq->curr)) continue;
> rq_lock_irqsave(...);
> if (cfs_rq->curr) { ... }

I think that's better; I will change to this if Peter is OK with this patch.
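
If it helps, below is an untested sketch of sync_cfs_bandwidth_runtime()
with your suggested check folded in (same helpers as in the patch, the
only addition is the READ_ONCE() fast path):

void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
	unsigned int cpu;
	struct rq *rq;
	struct rq_flags rf;
	struct cfs_rq *cfs_rq;
	struct task_group *tg;

	tg = container_of(cfs_b, struct task_group, cfs_bandwidth);

	for_each_online_cpu(cpu) {
		if (!tick_nohz_tick_stopped_cpu(cpu))
			continue;

		cfs_rq = tg->cfs_rq[cpu];

		/* Cheap unlocked check: skip cpus this tg is not running on. */
		if (!READ_ONCE(cfs_rq->curr))
			continue;

		rq = cpu_rq(cpu);
		rq_lock_irqsave(rq, &rf);
		/* Re-check under the rq lock before accounting. */
		if (cfs_rq->curr) {
			update_rq_clock(rq);
			update_curr(cfs_rq);
		}
		rq_unlock_irqrestore(rq, &rf);
	}
}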

Thanks.

> 
> but solutions to the other problem I mentioned earlier might provide
> help here anyways.
> 
> 
>> +		if (cfs_rq->curr) {
>> +			update_rq_clock(rq);
>> +			update_curr(cfs_rq);
>> +		}
>> +		rq_unlock_irqrestore(rq, &rf);
>> +	}
>> +}
>> +#endif
>> +
>>  extern const u64 max_cfs_quota_period;
>>  
>>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> @@ -5350,6 +5378,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>>  	int idle = 0;
>>  	int count = 0;
>>  
>> +	sync_cfs_bandwidth_runtime(cfs_b);
>> +
>>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
>>  	for (;;) {
>>  		overrun = hrtimer_forward_now(timer, cfs_b->period);
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 58263f90c559..57f9da9c50c1 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2351,9 +2351,12 @@ static inline void sched_update_tick_dependency(struct rq *rq)
>>  	else
>>  		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>>  }
>> +
>> +extern void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>>  #else
>>  static inline int sched_tick_offload_init(void) { return 0; }
>>  static inline void sched_update_tick_dependency(struct rq *rq) { }
>> +static inline void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) {}
>>  #endif
>>  
>>  static inline void add_nr_running(struct rq *rq, unsigned count)
