linux-kernel - Re: [PATCH v3] sched/deadline: Fix sched_dl_global

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20200924155909.GA7951@iZj6chx1xj0e0buvshuecpZ>
Date:   Thu, 24 Sep 2020 23:59:09 +0800
From:   Peng Liu <iwtbavbm@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, valentin.schneider@....com, raistlin@...ux.it,
        iwtbavbm@...il.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] sched/deadline: Fix sched_dl_global_validate()

Hi Peter,

Thanks for looking at this.

On Thu, Sep 24, 2020 at 12:57:14PM +0200, Peter Zijlstra wrote:
> On Sat, Sep 19, 2020 at 09:42:49AM +0800, Peng Liu wrote:
> > When user changes sched_rt_{runtime, period}_us, then
> > 
> >   sched_rt_handler()
> >     -->	sched_dl_bandwidth_validate()
> > 	{
> > 		new_bw = global_rt_runtime()/global_rt_period();
> > 
> > 		for_each_possible_cpu(cpu) {
> > 			dl_b = dl_bw_of(cpu);
> > 			if (new_bw < dl_b->total_bw)
> > 				ret = -EBUSY;
> > 		}
> > 	}
> > 
> > Under CONFIG_SMP, dl_bw is per root domain , but not per CPU,
> > dl_b->total_bw is the allocated bandwidth of the whole root domain.
> > we should compare dl_b->total_bw against cpus*new_bw, where 'cpus'
> > is the number of CPUs of the root domain.
> 
> Is there an actual problem there? Spell it out.

I created another root domain(contains 2 CPUs) besides the default
one, and the global default rt bandwidth is 95%. Then I launched a
DL process which need 25% bandwidth and moved it to the new root
domain, so far so good.

Then I tried to change global rt bandwidth to 20% with cmd:
	echo 200000 > /proc/sys/kernel/sched_rt_runtime_us
but ending with the "device busy" error. Only values greater than
250000 could work.

The new root domain contains two CPUs, thus should could provide
totally 2*20%(>25%) bandwidth. So the error is strange.

Finally I found it is the sched_dl_global_validate() mistakenly
do the validation. It doesn't multiply the root domain weight.

way to reproduce:
cd /sys/fs/cgroup/cpuset/
mkdir cluster
echo 0 > cpuset.sched_load_balance
cd cluster
echo 10-11 > cpuset.cpus
echo 0 > cpuset.mems
echo 1 > cpuset.cpu_exclusive
echo pid-of-dl25 > tasks

> > 
> > [1] https://lkml.org/lkml/2010/2/28/119
> 
> Don't use lkml.org links, use lkml.kernel.org/r/$MsgID instead.

OK, I will.

> 
> > [!CONFIG_SMP build error]
> > Reported-by: kernel test robot <lkp@...el.com>
> > Signed-off-by: Peng Liu <iwtbavbm@...il.com>
> 
> Quite frankly this patch is horrible #ifdef soup.

Frankly speaking, I also hate the ugly #ifdef guys, but I have no
idea how to eliminate them until seeing your method. Indeed, quite
clear. I will refine the patch according your suggestion. Thanks.

> 
> Can't you make something like the below work?
> 
> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 3862a28cd05d..3f309e0f69f5 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -97,6 +97,17 @@ static inline unsigned long dl_bw_capacity(int i)
>  		return __dl_bw_capacity(i);
>  	}
>  }
> +
> +static inline bool dl_bw_visited(int cpu, u64 gen)
> +{
> +	struct root_domain *rd = cpu_rq(i)->rd;
> +
> +	if (rd->visit_gen == gen)
> +		return true;
> +
> +	rd->visit_gen = gen;
> +	return false;
> +}
>  #else
>  static inline struct dl_bw *dl_bw_of(int i)
>  {
> @@ -112,6 +123,11 @@ static inline unsigned long dl_bw_capacity(int i)
>  {
>  	return SCHED_CAPACITY_SCALE;
>  }
> +
> +static inline bool dl_bw_visited(int cpu, u64 gen)
> +{
> +	return false;
> +}
>  #endif
>  
>  static inline
> @@ -2513,31 +2529,35 @@ const struct sched_class dl_sched_class
>  
>  int sched_dl_global_validate(void)
>  {
> +	static u64 generation = 0;
>  	u64 runtime = global_rt_runtime();
>  	u64 period = global_rt_period();
>  	u64 new_bw = to_ratio(period, runtime);
> -	struct dl_bw *dl_b;
> -	int cpu, ret = 0;
> +	int cpu, cpus, ret = 0;
>  	unsigned long flags;
> +	struct dl_bw *dl_b;
> +	u64 gen = ++generation;
>  
>  	/*
>  	 * Here we want to check the bandwidth not being set to some
>  	 * value smaller than the currently allocated bandwidth in
>  	 * any of the root_domains.
> -	 *
> -	 * FIXME: Cycling on all the CPUs is overdoing, but simpler than
> -	 * cycling on root_domains... Discussion on different/better
> -	 * solutions is welcome!
>  	 */
>  	for_each_possible_cpu(cpu) {
> +
>  		rcu_read_lock_sched();
> +		if (dl_bw_visited(cpu, gen))
> +			goto next;
> +
>  		dl_b = dl_bw_of(cpu);
> +		cpus = dl_bw_cpus(cpu);
>  
>  		raw_spin_lock_irqsave(&dl_b->lock, flags);
> -		if (new_bw < dl_b->total_bw)
> +		if (new_bw * cpus < dl_b->total_bw)
>  			ret = -EBUSY;
>  		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
>  
> +	next:
>  		rcu_read_unlock_sched();
>  
>  		if (ret)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 28709f6b0975..7f0947db6e2c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -801,6 +801,8 @@ struct root_domain {
>  	struct dl_bw		dl_bw;
>  	struct cpudl		cpudl;
>  
> +	u64			visit_gen;
> +
>  #ifdef HAVE_RT_PUSH_IPI
>  	/*
>  	 * For IPI pull requests, loop across the rto_mask.